Lesson 7 · 10 min
Production monitoring — catching drift before users do
Your eval suite passes in CI. Then the model provider updates its weights, or the real-world input distribution shifts. Production monitoring runs your evals continuously against live traffic so you catch silent degradation before users notice it.
The gap between CI and production
CI evals run on a static, curated dataset. Production inputs are messy, evolving, and never quite what you curated. The two failure modes CI misses:
- Input distribution shift — users start asking different kinds of questions. Your eval dataset doesn't cover the new pattern.
- Model drift — provider updates model weights without a version bump. Behavior changes silently.
Production monitoring catches both by running eval logic against a sample of live traffic.
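A minimal sketch of this pattern, assuming a hypothetical pass/fail check (`length_check`) and a rolling-window monitor (`DriftMonitor`) standing in for your real eval scorers and alerting backend:

```python
import random


def should_sample(rate: float = 0.05) -> bool:
    """Decide whether to score this request (5% of live traffic by default)."""
    return random.random() < rate


def length_check(output: str, max_chars: int = 2000) -> bool:
    # Hypothetical eval: output should be non-empty and within a length budget.
    return 0 < len(output) <= max_chars


class DriftMonitor:
    """Track a rolling pass rate over live traffic and flag drops below baseline."""

    def __init__(self, baseline: float, window: int = 200, tolerance: float = 0.05):
        self.baseline = baseline      # pass rate measured on your CI eval set
        self.window = window          # number of recent live samples to keep
        self.tolerance = tolerance    # allowed drop before alerting
        self.results: list[bool] = []

    def record(self, passed: bool) -> None:
        self.results.append(passed)
        if len(self.results) > self.window:
            self.results.pop(0)       # keep only the most recent window

    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def drifted(self) -> bool:
        # Only alert once the window is full, to avoid noise from small samples.
        return (
            len(self.results) >= self.window
            and self.pass_rate() < self.baseline - self.tolerance
        )
```

In a request handler you would call `should_sample()` per response, run the eval on sampled outputs, and `record()` the result; when `drifted()` turns true, that is the signal that either the input distribution or the model's behavior has shifted since the baseline was set.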