Lesson 5 · 10 min
The on-call playbook for AI features
A real on-call playbook for AI features: what you check first, what you can fix in 5 minutes, and what requires a postmortem.
First 5 minutes
A page fires. Open the dashboard. Walk this order:
- Check the four LLM signals — refusal rate, response length, retrieval precision, tool-call distribution. Which one tripped?
- Check the model. Did the provider release a new model version recently? Did your code path's `model` parameter change?
- Check recent deploys. Any prompt edit, retrieval change, tool definition change in the last 24h?
- Check upstream. Provider status page, vector DB status, any tool dependency.
- Sample 5 traces from the affected window. Eyeball them. The bug is often visible in the first one.
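Three of the steps above (which signal tripped, what changed in the last 24h, and pulling 5 traces) are mechanical enough to script. The following is a minimal sketch, not a definitive implementation. It assumes each signal has already been reduced to a single number (for tool-call distribution, for example, the share of calls going to the most-used tool), and that deploys and traces are plain dicts with an `at` timestamp. The function names, dict fields, and threshold values are all hypothetical; substitute whatever your dashboard and tracing store actually expose.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical limits: maximum relative deviation from baseline before a
# signal counts as "tripped". Tune these to your own traffic.
THRESHOLDS = {
    "refusal_rate": 0.5,
    "response_length": 0.3,
    "retrieval_precision": 0.2,
    "tool_call_distribution": 0.25,
}


def tripped_signals(baseline, current, thresholds=THRESHOLDS):
    """Return names of signals that moved past their limit, worst first."""
    tripped = []
    for name, limit in thresholds.items():
        base = baseline[name]
        if base == 0:
            continue  # relative change is undefined; inspect this one by hand
        change = abs(current[name] - base) / abs(base)
        if change > limit:
            tripped.append((name, change))
    tripped.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in tripped]


def recent_changes(deploys, now, window=timedelta(hours=24)):
    """Return deploys that landed inside the window before `now`, newest first."""
    cutoff = now - window
    recent = [d for d in deploys if cutoff <= d["at"] <= now]
    return sorted(recent, key=lambda d: d["at"], reverse=True)


def sample_traces(traces, start, end, k=5, seed=None):
    """Pick up to k random traces whose timestamp falls in [start, end]."""
    in_window = [t for t in traces if start <= t["at"] <= end]
    rng = random.Random(seed)
    return rng.sample(in_window, min(k, len(in_window)))


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    baseline = {"refusal_rate": 0.02, "response_length": 400,
                "retrieval_precision": 0.8, "tool_call_distribution": 0.5}
    current = {"refusal_rate": 0.08, "response_length": 410,
               "retrieval_precision": 0.5, "tool_call_distribution": 0.5}
    deploys = [
        {"kind": "prompt edit", "at": now - timedelta(hours=2)},
        {"kind": "tool definition change", "at": now - timedelta(hours=30)},
    ]
    traces = [{"id": i, "at": now - timedelta(minutes=i)} for i in range(10)]

    print("tripped:", tripped_signals(baseline, current))
    print("recent deploys:", [d["kind"] for d in recent_changes(deploys, now)])
    picked = sample_traces(traces, now - timedelta(hours=1), now, seed=0)
    print("trace ids to read:", [t["id"] for t in picked])
```

Sampling at random rather than taking the first 5 traces avoids reading only the earliest requests in the window, which may predate the regression. The model-version and upstream-status checks are left out because they depend on your provider's own status page and changelog.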