Lesson 4 · 9 min

Drift detection on retrieval and outputs

The probe set: a small, labelled, frozen evaluation that runs hourly against production. The single best tool for catching silent decay.

The probe set

30 cases. Frozen. Labelled. Runs against your production pipeline (not staging) every hour. Compare against yesterday's results.

For RAG: 30 representative queries with known-good document IDs. Compute precision@5 each run.
For agents: 30 representative tasks with known-good final answers (or trajectories). Compute success rate.
For prompts: 30 input/expected-output pairs. Compute exact-match or rubric-judge score.

When a probe-set metric drops more than 10% vs the prior 24-hour baseline, page on-call.