Lesson 1 · 9 min

Why observability is different for LLMs

Standard SRE observability (request rate, error rate, latency) is necessary but nowhere near sufficient. This lesson covers the four LLM-specific signals you need on day one.

What standard observability misses

A classic three-pillar setup (logs, metrics, traces) catches when your service is down. It does not catch:

The model started refusing 12% of requests after a prompt edit.
Average response length silently doubled, taking the cost with it.
Retrieval precision@5 dropped from 0.78 to 0.52 over the past two weeks.
4% of tool calls now go to a tool that was never useful before.

All four are real production failures that look fine on a Datadog dashboard. None of them set off your error-rate alert. All of them produce angry users.
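
To make the four failures above concrete, here is a minimal sketch of how each one could be computed from logged calls. It assumes you already record, per request, whether the model refused, how many output tokens it produced, relevance labels for the retrieved documents, and which tools were called. The `LLMCall` record and every function name are hypothetical, chosen for illustration; they do not come from any particular library or vendor.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LLMCall:
    """Hypothetical per-request log record (field names are assumptions)."""
    refused: bool                      # did the model decline to answer?
    output_tokens: int                 # length of the response in tokens
    # Relevance label for each retrieved document, in rank order.
    retrieved_relevant: list[bool] = field(default_factory=list)
    # Names of the tools the model invoked during this request.
    tool_calls: list[str] = field(default_factory=list)


def refusal_rate(calls):
    """Fraction of requests the model refused."""
    return sum(c.refused for c in calls) / len(calls)


def mean_output_tokens(calls):
    """Average response length, a rough proxy for per-request cost."""
    return sum(c.output_tokens for c in calls) / len(calls)


def precision_at_k(calls, k=5):
    """Mean precision@k over requests that performed retrieval.

    Returns None when no request in the window used retrieval.
    """
    scores = [sum(c.retrieved_relevant[:k]) / k
              for c in calls if c.retrieved_relevant]
    return sum(scores) / len(scores) if scores else None


def tool_share(calls):
    """Share of all tool calls that went to each tool."""
    counts = Counter(t for c in calls for t in c.tool_calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


# Made-up sample window, for illustration only.
calls = [
    LLMCall(False, 200, [True, True, False, True, False], ["search"]),
    LLMCall(True, 40, [True, False, False, False, False], []),
    LLMCall(False, 360, [True, True, True, True, False], ["search", "calculator"]),
    LLMCall(False, 200, [], ["search"]),
]

signals = {
    "refusal_rate": refusal_rate(calls),
    "mean_output_tokens": mean_output_tokens(calls),
    "precision_at_5": precision_at_k(calls, k=5),
    "tool_share": tool_share(calls),
}
```

On this sample window the refusal rate is 0.25, the mean response length is 200 tokens, precision@5 is about 0.53, and three quarters of tool calls go to `search`. None of these values is useful on its own; the point is to compute them per time window and alert on the change, since every failure listed above is a shift relative to a baseline.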