Lesson 5 · 10 min

The on-call playbook for AI features

A practical on-call playbook for AI features: what to check first, what you can fix in 5 minutes, and what requires a postmortem.

First 5 minutes

A page fires. Open the dashboard and work through these checks in order:

  1. Check the four LLM signals — refusal rate, response length, retrieval precision, tool-call distribution. Which one tripped?
  2. Check the model. Did the provider release a new model version recently? Did your code path's model parameter change?
  3. Check recent deploys. Any prompt edit, retrieval change, or tool-definition change in the last 24 hours?
  4. Check upstream. Provider status page, vector DB status, any tool dependency.
  5. Sample 5 traces from the affected window. Eyeball them. The bug is often visible in the first one.
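
Steps 1 and 5 above can be partly scripted. The following is a minimal Python sketch, not a definitive implementation: it assumes each of the four signals has been reduced to a single number with a known baseline, and that you can list trace IDs for the affected window. The `Signal` class, the `tripped` and `sample_traces` helpers, the tolerance values, and the `top_tool_share` reduction of the tool-call distribution are all hypothetical names and numbers chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    baseline: float   # typical value before the incident (assumed known)
    current: float    # value in the affected window
    tolerance: float  # allowed relative deviation, e.g. 0.25 means 25%


def tripped(signals):
    """Return (name, relative deviation) for signals outside tolerance, worst first."""
    out = []
    for s in signals:
        if s.baseline == 0:
            # Avoid dividing by zero: any movement off a zero baseline counts as unbounded.
            deviation = float("inf") if s.current != 0 else 0.0
        else:
            deviation = abs(s.current - s.baseline) / abs(s.baseline)
        if deviation > s.tolerance:
            out.append((s.name, deviation))
    return sorted(out, key=lambda pair: pair[1], reverse=True)


def sample_traces(trace_ids, k=5, seed=None):
    """Pick up to k trace IDs at random from the affected window."""
    rng = random.Random(seed)
    return rng.sample(trace_ids, min(k, len(trace_ids)))


# Illustrative values only. The tool-call distribution is reduced here to the
# share of calls going to the most-used tool, which is an assumption.
signals = [
    Signal("refusal_rate", baseline=0.02, current=0.09, tolerance=0.5),
    Signal("response_length_tokens", baseline=400, current=410, tolerance=0.25),
    Signal("retrieval_precision", baseline=0.80, current=0.78, tolerance=0.10),
    Signal("top_tool_share", baseline=0.60, current=0.62, tolerance=0.20),
]

for name, deviation in tripped(signals):
    print(f"{name} deviates {deviation:.0%} from baseline")

window_trace_ids = [f"trace-{i}" for i in range(20)]  # hypothetical IDs
print(sample_traces(window_trace_ids, k=5, seed=0))
```

Sorting by deviation puts the most likely culprit first, and random sampling avoids reading only the earliest traces in the window, which may predate the regression.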