
Lesson 3 · 11 min

Eval-design — the most discriminating round

"How would you know your prompt was good?" is the question that filters out most candidates. Here's the full answer, deliverable in three minutes.

The question, the answer

Interviewer: "How would you know your prompt was good?"

Weak answer: "I'd test it on a few examples and check the output looks right."

Strong answer (the full version):

"I'd build a 30-50-case eval set mixing real production traces with synthetic edge cases and known historical failures. The scoring depends on the task — for extraction, I'd use exact-match on JSON validity plus per-field accuracy; for open-ended generation, an LLM-as-judge scoring against a written rubric, validated against a 20-case human-graded subset to keep the judge honest. I'd run case-level diffs on every prompt change so I catch silent regressions where aggregate accuracy stays flat but three cases got worse and three got better. I'd add per-segment metrics if my data has axes that matter (language, expertise level). And a CI gate so the eval runs on every PR that touches the prompt."
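The extraction scoring the answer describes — JSON validity plus per-field accuracy — can be sketched in a few lines. This is a minimal illustration, not a fixed standard; the function name and the flat dict-of-fields shape are assumptions for the example:

```python
import json


def score_extraction(raw_output: str, expected: dict) -> dict:
    """Score one extraction case: JSON validity, then per-field exact match.

    `raw_output` is the model's text; `expected` maps field name -> gold value.
    (Hypothetical scorer shape; a real harness would also track which
    fields mismatched, not just the ratio.)
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Invalid JSON fails the case outright: every field counts as wrong.
        return {"json_valid": False, "field_accuracy": 0.0}

    matches = sum(1 for field, gold in expected.items() if parsed.get(field) == gold)
    return {"json_valid": True, "field_accuracy": matches / len(expected)}
```

Run over the full eval set, this yields two aggregate numbers (share of valid JSON, mean per-field accuracy) that move independently — a prompt change can fix parsing while quietly degrading field values, which is exactly why both are tracked.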

That's the full answer. Memorize the structure; you don't need to deliver it word-for-word.
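The case-level diff from the answer — catching regressions even when aggregate accuracy stays flat — reduces to comparing per-case pass/fail between two prompt versions. A minimal sketch, assuming each eval run is summarized as a dict of case id to pass/fail (a real harness would carry richer per-case scores):

```python
def case_level_diff(baseline: dict, candidate: dict) -> dict:
    """Diff per-case pass/fail results between two prompt versions.

    `baseline` and `candidate` map case id -> bool (True = passed).
    A CI gate fails on any regression, regardless of the aggregate score.
    (Hypothetical function and result shape, for illustration.)
    """
    regressions = [cid for cid, ok in baseline.items() if ok and not candidate[cid]]
    improvements = [cid for cid, ok in baseline.items() if not ok and candidate[cid]]
    return {
        "regressions": regressions,
        "improvements": improvements,
        # Gate on regressions, not on net accuracy: +3/-3 nets to zero
        # in the aggregate but still means three cases silently broke.
        "gate_passed": not regressions,
    }
```

For example, if cases a and b passed before and a and c pass now, aggregate accuracy is unchanged at 2/3, yet the diff flags b as a regression and the gate fails — the silent-regression scenario the answer warns about.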