Lesson 2 · 11 min

Building your first eval dataset

Golden datasets are the foundation of every serious eval suite. This lesson covers how to collect representative cases, write ground truth without bias, and avoid the three dataset traps that make evals useless.

Ground truth is the hard part

A model output is only evaluable if you know what a good output looks like. That's your ground truth — and creating it is the bottleneck most teams skip.

The shortcut is tempting: generate examples from the model you're testing, then use the same model to judge them. This is circular and catches nothing. Ground truth must be independent of the model under test.
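One way to enforce this independence in practice is to record where each ground-truth answer came from and reject circular setups before an eval run. The sketch below is illustrative, not from the lesson: the `EvalCase` structure, the `truth_source` field, and the `check_independence` helper are all hypothetical names, assuming a simple setup where cases are labeled by their annotation source and a separate judge model scores outputs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    # A single golden-dataset entry: the input plus a ground-truth
    # answer written independently of the model under test.
    prompt: str
    expected: str
    truth_source: str  # e.g. "human" or "docs" -- never the tested model


def check_independence(cases, model_under_test: str, judge_model: str) -> bool:
    """Reject the circular setup: ground truth or judge drawn from
    the same model whose outputs are being evaluated."""
    if judge_model == model_under_test:
        raise ValueError("judge model must differ from the model under test")
    tainted = [c for c in cases if c.truth_source == model_under_test]
    if tainted:
        raise ValueError(
            f"{len(tainted)} case(s) have ground truth generated "
            f"by the model under test"
        )
    return True


cases = [
    EvalCase("What is 2 + 2?", "4", truth_source="human"),
    EvalCase("Capital of France?", "Paris", truth_source="human"),
]

# Passes: ground truth is human-written and the judge is a different model.
print(check_independence(cases, model_under_test="model-a", judge_model="model-b"))
```

Running the same check with `judge_model="model-a"` raises immediately, which is the point: the guard makes the circular configuration an error rather than a silent source of useless eval scores.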