Lesson 3 · 10 min
Metrics that correlate with user complaints
Not every metric is worth alerting on. These five correlate most reliably with "this feature feels worse than yesterday".
The five with proven signal
From incident-review data across many teams shipping LLM features:
- Refusal rate. Sudden upward shift = model behavior changed (provider-side update, or your prompt now trips a safety filter). Catches both upstream and self-inflicted regressions.
- Response length p95 / p99. Sudden upward shift = cost going up; sudden downward = quality dropping (truncation, refusal). Either direction is bad.
- Retrieval precision@k against a probe set. A 30-case probe set run hourly catches the silent-decay class of failures.
- Tool-call distribution entropy. When the model starts calling one tool far more often than usual, or stops calling a tool that used to be common, something shifted.
- Schema-validation failure rate. Output formatting drift is a common silent regression.
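The length-percentile signal above needs nothing beyond the standard library: `statistics.quantiles` with `n=100` returns 99 percentile cut points. A minimal sketch (the function name is ours, not from any monitoring library):

```python
import statistics

def length_percentiles(lengths):
    """Return (p95, p99) of response lengths.

    quantiles(n=100) yields cut points at indices 0..98,
    so p95 is index 94 and p99 is index 98.
    """
    cuts = statistics.quantiles(lengths, n=100)
    return cuts[94], cuts[98]
```

Track both numbers per window and alert on a shift in either direction, per the bullet above.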
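The hourly probe-set check can be sketched as follows. `retrieve` and the probe-case shape are assumptions for illustration, not a specific retrieval API:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

def probe_set_precision(probe_cases, retrieve, k=5):
    """Average precision@k over a fixed probe set.

    probe_cases: list of (query, set_of_relevant_doc_ids) pairs.
    retrieve: maps a query to a ranked list of doc ids (hypothetical hook).
    """
    scores = [precision_at_k(retrieve(q), relevant, k)
              for q, relevant in probe_cases]
    return sum(scores) / len(scores)
```

Run it on a schedule and alert when the average drops below a baseline band; a 30-case set keeps the hourly cost negligible.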
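Tool-call distribution entropy is plain Shannon entropy over the per-window tool counts, a sketch:

```python
import math
from collections import Counter

def tool_call_entropy(tool_calls):
    """Shannon entropy (bits) of the tool-call distribution in a window.

    A sharp drop means calls are concentrating on fewer tools;
    a sharp rise means calls are spreading unusually widely.
    """
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

Compare each window's value against a rolling baseline; either direction of drift is the signal described above.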
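A schema-validation failure rate can be approximated with the standard library alone; this sketch only checks JSON parseability and required top-level keys, whereas a production setup would likely use a full schema validator:

```python
import json

def schema_failure_rate(responses, required_keys):
    """Fraction of raw responses that fail to parse as JSON
    or are missing required top-level keys."""
    if not responses:
        return 0.0
    failures = 0
    for raw in responses:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            failures += 1
            continue
        if not isinstance(obj, dict) or not required_keys.issubset(obj):
            failures += 1
    return failures / len(responses)
```

Because formatting drift is usually gradual, alert on the trend of this rate rather than on single bad responses.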