Lesson 7 · 10 min

Monitoring — what to actually watch

Prometheus dashboards lie. The right four metrics catch 90% of incidents.

The four golden signals (LLM edition)

Google's SRE book has "latency, traffic, errors, saturation". For LLM serving, expand to:

Latency — p50, p95, p99. Track first-token latency separately from total — they have different causes.
Throughput — requests/sec, tokens/sec.
Errors — 4xx (client), 5xx (server), timeouts, OOMs.
Saturation — GPU util, memory util, queue depth.
Cost — $ per 1k requests, $ per user/day.
Quality — accuracy on a held-out probe set, hallucination rate, refusal rate.
Drift — input distribution shift detection.