AI Evaluation & Testing for Engineers
Stop shipping on gut feel. Build the eval system that catches regressions before users do.
The discipline that separates teams that ship AI features confidently from those that debug in production. Golden datasets, deterministic evals, LLM-as-judge with calibration, CI regression gates, RAG evaluation, and continuous production monitoring — all with runnable code.
Duration: 7h · Lessons: 8 · Learners: 0
Course map
Lessons unlock as you complete the previous one.
Lesson 1: Why evals are not optional (9m, 35 XP)
Lesson 2: Building your first eval dataset (11m, 40 XP)
Lesson 3: LLM-as-judge — when and how (12m, 40 XP)
Lesson 4: Deterministic evals — structured output and tool use (10m, 38 XP)
Lesson 5: Regression testing in CI (11m, 42 XP)
Lesson 6: RAG evaluation — retrieval and answer quality (12m, 42 XP)
Lesson 7: Production monitoring — catching drift before users do (10m, 40 XP)
Lesson 8: Capstone — production eval system end-to-end (14m, 55 XP)
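As a taste of the deterministic-eval style covered in Lesson 4, here is a minimal sketch: a pass/fail check that a model's output is valid JSON with the keys your code expects. The function name and keys are illustrative, not from the course materials.

```python
import json

def eval_structured_output(raw: str, required_keys: set) -> bool:
    """Deterministic eval: output must parse as a JSON object
    containing every required key. No judge model involved."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= set(data.keys())

# A well-formed response passes; free text fails.
assert eval_structured_output('{"city": "Paris", "confidence": 0.9}',
                              {"city", "confidence"})
assert not eval_structured_output("Sure! The city is Paris.", {"city"})
```

Checks like this are cheap and exact, which is why they run on every commit, while LLM-as-judge is reserved for qualities that cannot be asserted programmatically.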
Take next
Courses that pair well after — or alongside — AI Evaluation & Testing for Engineers.
Structured Outputs & Tool Use in Production
Stop parsing free text. Make the model return exactly what your code expects.
intermediate · 6h
Context Window Engineering
The context window is the computer. Learn to use it deliberately.
intermediate · 6h
RAG & Vector Databases
Make models answer from your data, not their guesses.
intermediate · 8h