Lesson 7 · 10 min
Monitoring — what to actually watch
Prometheus dashboards lie. The right four metrics catch 90% of incidents.
The four golden signals (LLM edition)
Google's SRE book names four golden signals: latency, traffic, errors, and saturation. For LLM serving, keep those four (traffic becomes throughput) and add cost, quality, and drift:
- Latency — p50, p95, p99. Track first-token latency separately from total latency: the first is dominated by queueing and prompt prefill, the second by how many tokens are generated.
- Throughput — requests/sec, tokens/sec.
- Errors — 4xx (client), 5xx (server), timeouts, OOMs.
- Saturation — GPU util, memory util, queue depth.
- Cost — $ per 1k requests, $ per user/day.
- Quality — accuracy on a held-out probe set, hallucination rate, refusal rate.
- Drift — input distribution shift detection.
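As a rough illustration of the latency, throughput, and error signals above, here is a minimal sketch that summarizes a window of request logs. The `RequestRecord` fields and the `summarize` helper are hypothetical names chosen for this example, not part of any particular serving stack; in practice these numbers would usually come from your metrics system rather than from in-process lists.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    """One served request. All field names are hypothetical."""
    ttft_s: float        # time to first token, in seconds
    total_s: float       # total request latency, in seconds
    output_tokens: int   # number of tokens generated
    status: int          # HTTP status code returned to the client


def percentile(values, pct):
    """Return the pct-th percentile (1..99) with linear interpolation."""
    if len(values) < 2:
        # quantiles() needs at least two points on older Python versions.
        return values[0] if values else 0.0
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[pct - 1]


def summarize(records, window_s):
    """Summarize latency, throughput, and errors over a window of window_s seconds."""
    n = len(records)
    ttft = [r.ttft_s for r in records]
    total = [r.total_s for r in records]
    client_errors = sum(1 for r in records if 400 <= r.status < 500)
    server_errors = sum(1 for r in records if r.status >= 500)
    summary = {
        "requests_per_s": n / window_s,
        "tokens_per_s": sum(r.output_tokens for r in records) / window_s,
        "client_error_rate": client_errors / n if n else 0.0,
        "server_error_rate": server_errors / n if n else 0.0,
    }
    # First-token and total latency are reported separately, as the list suggests.
    for pct in (50, 95, 99):
        summary[f"ttft_p{pct}"] = percentile(ttft, pct)
        summary[f"total_p{pct}"] = percentile(total, pct)
    return summary


if __name__ == "__main__":
    sample = [
        RequestRecord(0.2, 1.5, 120, 200),
        RequestRecord(0.3, 2.0, 200, 200),
        RequestRecord(1.1, 6.4, 512, 500),
    ]
    for name, value in summarize(sample, window_s=60.0).items():
        print(f"{name}: {value:.3f}")
```

Reporting the two latency families side by side makes it easier to tell a queueing problem (first-token latency rises) from a long-generation problem (only total latency rises). Saturation, cost, quality, and drift need data this sketch does not have (GPU counters, billing data, probe-set results, input statistics), so they are left out here.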