Lesson 1 · 9 min
Why observability is different for LLMs
Standard SRE observability (request rate, error rate, latency) is necessary but nowhere near sufficient. This lesson covers the four LLM-specific signals you need on day one.
What standard observability misses
A classic three-pillar setup (logs, metrics, traces) catches when your service is down. It does not catch:
- The model started refusing 12% of requests after a prompt edit.
- Average response length silently doubled, taking the cost with it.
- Retrieval precision@5 dropped from 0.78 to 0.52 over the past two weeks.
- 4% of tool calls now go to a tool that was never useful before.
All four are real production failures that look fine on a Datadog dashboard. None of them sets off your error-rate alert. All of them produce angry users.
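Each of the four failures above maps to a number you can compute from per-request logs. The following is a minimal sketch, not a reference implementation: the `LLMCall` record, its field names, and the keyword-based `is_refusal` heuristic are all assumptions made for illustration (production systems often use a classifier for refusals and labeled relevance judgments for retrieval).

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class LLMCall:
    """Hypothetical per-request log record; field names are illustrative."""
    response_text: str
    output_tokens: int
    retrieved_ids: list = field(default_factory=list)  # ranked doc ids
    relevant_ids: set = field(default_factory=set)     # labeled relevant docs
    tool_name: Optional[str] = None                    # tool invoked, if any


# Assumption: a naive keyword heuristic stands in for a real refusal detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def is_refusal(text: str) -> bool:
    """Flag a response whose opening looks like a refusal."""
    head = text.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def precision_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def summarize(calls: list) -> dict:
    """Compute the four signals over one window of logged calls."""
    if not calls:
        return {}
    with_retrieval = [c for c in calls if c.retrieved_ids]
    tool_counts = Counter(c.tool_name for c in calls if c.tool_name)
    total_tool_calls = sum(tool_counts.values())
    return {
        # Signal 1: share of responses that look like refusals.
        "refusal_rate": sum(is_refusal(c.response_text) for c in calls) / len(calls),
        # Signal 2: mean output length, a direct driver of cost.
        "mean_output_tokens": mean(c.output_tokens for c in calls),
        # Signal 3: retrieval quality, averaged over calls that retrieved anything.
        "mean_precision_at_5": (
            mean(precision_at_k(c.retrieved_ids, c.relevant_ids) for c in with_retrieval)
            if with_retrieval else None
        ),
        # Signal 4: how tool calls are distributed across tools.
        "tool_share": {
            name: count / total_tool_calls for name, count in tool_counts.items()
        },
    }
```

Running `summarize` over consecutive time windows and comparing the results is what surfaces the drift: a refusal rate that jumps after a prompt edit, or a tool whose share climbs from zero, shows up as a change between windows even though every request returned HTTP 200.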