Lesson 9 · 11 min
Evaluating agents
Agents don't have a single right answer. Eval is about success rates and trace quality.
Why eval is harder for agents
A prompt has one input and one output. You can grade with regex, JSON-validity, or LLM-as-judge.
An agent has one input, an unknown number of intermediate steps, and (usually) a long-form output. Two agents can succeed on the same task while taking entirely different paths.