Lesson 9 · 11 min

Evaluating agents

Agents don't have a single right answer. Eval is about success rates and trace quality.

Why eval is harder for agents

A prompt has one input and one output. You can grade with regex, JSON-validity, or LLM-as-judge.

An agent has one input, an unknown number of intermediate steps, and (usually) a long-form output. Two agents can succeed on the same task while taking entirely different paths.