Lesson 5 · 10 min
Autoscaling & traffic patterns
Bursty traffic + slow GPU cold-starts = the canonical MLOps headache.
The asymmetry
A web app autoscales in seconds. A GPU container autoscales in minutes — 30s to provision a node, 60-120s to pull weights, 30s to warm vLLM — roughly 2-3 minutes end to end. That delay is brutal under sudden load.
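The breakdown above can be turned into a quick back-of-envelope budget. A sketch, using the illustrative numbers from this lesson (your provisioning and pull times will vary by cloud, model size, and registry locality):

```python
# Back-of-envelope cold-start budget for a GPU container.
# Numbers are the illustrative ones from the lesson, not measurements.
provision_s = 30            # provision a GPU node
pull_weights_s = (60, 120)  # pull model weights: best / worst case
warm_server_s = 30          # warm the inference server (e.g. vLLM)

best = provision_s + pull_weights_s[0] + warm_server_s
worst = provision_s + pull_weights_s[1] + warm_server_s
print(f"cold start: {best}-{worst}s")  # → cold start: 120-180s
```

The point of writing it down: weight-pull time dominates, so caching weights on the node or baking them into the image attacks the largest term first.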
Mitigations:
- Floor at 1 instance — never scale to zero in production unless cold-start is sub-15s.
- Pre-warm before peaks — schedule scale-up before known traffic surges.
- Buffer with queues — accept the request, return a job ID, process async. Trades latency for capacity.
- Multi-tier serving — small model handles overflow when the big one is saturated.
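The queue-buffering mitigation can be sketched in a few lines. This is a minimal single-process illustration using Python's stdlib — `run_model` is a hypothetical stand-in for the real GPU inference call, and a production version would use a durable queue (e.g. Redis, SQS) rather than an in-memory one:

```python
import queue
import threading
import uuid

# Queue-buffering pattern: accept the request immediately, hand back
# a job ID, and let a worker drain the queue at whatever rate the
# GPU can sustain.
jobs = queue.Queue()
results = {}

def run_model(prompt):
    # Hypothetical placeholder for the actual GPU inference call.
    return prompt.upper()

def submit(prompt):
    """Enqueue the request and return a job ID instead of blocking."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, prompt))
    return job_id

def worker():
    # Drains the queue serially; throughput is bounded by the model,
    # but submissions are never rejected under burst load.
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = run_model(prompt)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = submit("hello")
jobs.join()               # in practice the client polls by job ID
print(results[job_id])    # → HELLO
```

The trade is explicit in the code: `submit` returns instantly regardless of load, and the caller pays for it by polling for the result later.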