Lesson 5 · 10 min
Autoscaling & traffic patterns
Bursty traffic + slow GPU cold-starts = the canonical MLOps headache.
The asymmetry
A web app autoscales in seconds. A GPU container autoscales in minutes — 30s to provision a node, 60-120s to pull weights, 30s to warm vLLM — roughly 2-3 minutes end to end. That delay is brutal under sudden load.
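The breakdown above can be turned into a quick back-of-envelope budget. A sketch, using the illustrative numbers from this lesson (your provisioning and pull times will vary by cloud, model size, and registry locality):

```python
# Back-of-envelope cold-start budget for a GPU container.
# Numbers are the illustrative ones from the lesson, not measurements.
provision_s = 30            # provision a GPU node
pull_weights_s = (60, 120)  # pull model weights: best / worst case
warm_server_s = 30          # warm the inference server (e.g. vLLM)

best = provision_s + pull_weights_s[0] + warm_server_s
worst = provision_s + pull_weights_s[1] + warm_server_s
print(f"cold start: {best}-{worst}s")  # → cold start: 120-180s
```

The point of writing it down: weight-pull time dominates, so caching weights on the node or baking them into the image attacks the largest term first.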
Mitigations:
- Floor at 1 instance — never scale to zero in production unless cold-start is sub-15s.
- Pre-warm before peaks — schedule scale-up before known traffic surges.
- Buffer with queues — accept the request, return a job ID, process async. Trades latency for capacity.
- Multi-tier serving — small model handles overflow when the big one is saturated.
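The queue-buffering mitigation can be sketched in a few lines. This is a minimal single-process illustration using Python's stdlib — `run_model` is a hypothetical stand-in for the real GPU inference call, and a production version would use a durable queue (e.g. Redis, SQS) rather than an in-memory one:

```python
import queue
import threading
import uuid

# Queue-buffering pattern: accept the request immediately, hand back
# a job ID, and let a worker drain the queue at whatever rate the
# GPU can sustain.
jobs = queue.Queue()
results = {}

def run_model(prompt):
    # Hypothetical placeholder for the actual GPU inference call.
    return prompt.upper()

def submit(prompt):
    """Enqueue the request and return a job ID instead of blocking."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, prompt))
    return job_id

def worker():
    # Drains the queue serially; throughput is bounded by the model,
    # but submissions are never rejected under burst load.
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = run_model(prompt)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = submit("hello")
jobs.join()               # in practice the client polls by job ID
print(results[job_id])    # → HELLO
```

The trade is explicit in the code: `submit` returns instantly regardless of load, and the caller pays for it by polling for the result later.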