Lesson 4 · 10 min

Deterministic evals — structured output and tool use

When your model must produce JSON, call the right tool, or extract a specific field, LLM-as-judge is overkill. Deterministic evals are faster, cheaper, and more reliable. This lesson covers the patterns that handle 80% of use cases.

When deterministic beats probabilistic

If there's a correct answer you can compute, use it. LLM-as-judge costs 5–20× more per eval and introduces noise. Deterministic evals run in milliseconds and never hallucinate a score.

The cases where deterministic wins:

  • Structured output — does the JSON parse? Are required fields present? Are values in range?
  • Classification — does the label match ground truth? Precision / recall / F1 are deterministic.
  • Tool use — did the model call the right tool? With the right arguments?
  • Extraction — is the extracted entity in the source text? Exact match or normalized match.
  • Format constraints — is the response under N tokens? In the required language?
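The structured-output checks above can be layered so a failed parse short-circuits the rest. A minimal sketch, using only the standard library; the schema (a `label` string and a `confidence` float in [0, 1]) is an illustrative assumption, not a fixed convention:

```python
import json

def eval_structured_output(raw: str) -> dict:
    """Run deterministic checks on a model's raw JSON response.

    Assumed schema (illustrative): {"label": str, "confidence": float in [0, 1]}.
    """
    result = {"parses": False, "required_fields": False, "in_range": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result  # unparseable: every downstream check fails
    result["parses"] = True

    required = {"label", "confidence"}
    result["required_fields"] = isinstance(data, dict) and required <= data.keys()
    if result["required_fields"]:
        conf = data["confidence"]
        result["in_range"] = isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0
    return result
```

Returning a dict of booleans rather than a single pass/fail lets you track which check fails most often, which usually points at the prompt fix you need.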
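For the classification case, precision, recall, and F1 are a few lines of counting; no library is required. A sketch, assuming single-label predictions scored one class at a time:

```python
def precision_recall_f1(preds, golds, positive):
    """Compute precision/recall/F1 for one positive class over paired lists."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For multi-class tasks, run this once per class and average, or reach for `sklearn.metrics.precision_recall_fscore_support`; either way the score is fully reproducible.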
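Tool-use and extraction checks reduce to equality tests once you decide what to normalize. A sketch under two assumptions: tool calls arrive as a `{"name", "args"}` dict (the shape your framework emits may differ), and extraction matching tolerates case and whitespace differences only:

```python
def eval_tool_call(call: dict, expected: dict) -> bool:
    """Did the model pick the right tool with exactly the right arguments?

    Assumed call shape (illustrative): {"name": str, "args": dict}.
    """
    return call.get("name") == expected["name"] and call.get("args") == expected["args"]

def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace for a lenient match."""
    return " ".join(text.lower().split())

def extraction_grounded(extracted: str, source: str) -> bool:
    """Is the extracted entity actually present in the source text?"""
    return normalize(extracted) in normalize(source)
```

Strict argument equality is the right default; relax individual fields (e.g. ignore an optional `limit` argument) only when you can justify it per tool.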