Lesson 7 · 10 min

Monitoring — what to actually watch

Prometheus dashboards lie. A handful of the right metrics catches 90% of incidents.

The four golden signals (LLM edition)

Google's SRE book names four golden signals: latency, traffic, errors, and saturation. For LLM serving, expand the list to seven:

  1. Latency — p50, p95, p99. Track first-token latency separately from total — they have different causes.
  2. Throughput — requests/sec, tokens/sec.
  3. Errors — 4xx (client), 5xx (server), timeouts, OOMs.
  4. Saturation — GPU util, memory util, queue depth.
  5. Cost — $ per 1k requests, $ per user/day.
  6. Quality — accuracy on a held-out probe set, hallucination rate, refusal rate.
  7. Drift — input distribution shift detection.
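
To make the first few signals concrete, here is a minimal sketch that derives latency percentiles, throughput, error rates, and cost per 1k requests from a window of request records. It assumes you already log one record per request; the `RequestRecord` fields, the `percentile` helper, and the `summarize` function are hypothetical names for illustration, not part of Prometheus or any serving framework. Timeouts, OOMs, saturation, quality, and drift need their own data sources and are left out.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One served request. All field names are hypothetical."""
    ttft_s: float       # time to first token, in seconds
    total_s: float      # total request latency, in seconds
    output_tokens: int  # tokens generated for this request
    status: int         # HTTP status code returned to the client
    cost_usd: float     # cost attributed to this request


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples to summarize")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(records, window_s):
    """Compute latency, throughput, error, and cost signals for one window."""
    if not records or window_s <= 0:
        raise ValueError("need at least one record and a positive window")

    # Latency is measured on successful requests only; raises if there are none.
    ok = [r for r in records if r.status < 400]
    summary = {}
    for pct in (50, 95, 99):
        summary[f"ttft_p{pct}_s"] = percentile([r.ttft_s for r in ok], pct)
        summary[f"total_p{pct}_s"] = percentile([r.total_s for r in ok], pct)

    n = len(records)
    summary["requests_per_s"] = n / window_s
    summary["tokens_per_s"] = sum(r.output_tokens for r in records) / window_s
    summary["client_error_rate"] = sum(400 <= r.status < 500 for r in records) / n
    summary["server_error_rate"] = sum(r.status >= 500 for r in records) / n
    summary["cost_per_1k_requests_usd"] = 1000 * sum(r.cost_usd for r in records) / n
    return summary
```

One design choice worth noting: the sketch computes latency percentiles over successful requests only, so that fast failures do not make latency look better than it is. Whether that is the right cut for your service is a judgment call.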