Lesson 3 · 10 min
Token budgets per request
Every customer-facing LLM call gets a written budget: input tokens, output tokens, max latency. The budget is reviewed in the PR like any other resource concern.
A real budget for a summarization call
Here's how it breaks down for a typical call:
- System prompt: 800 tokens (cached, $0.0008/1k input)
- Retrieved context: max 4,000 tokens
- User input: max 500 tokens
- Output: max 600 tokens
- Total budget: ~5,900 tokens, ~$0.012 per call at frontier-tier prices
- p99 latency budget: 4 seconds
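One way to keep the budget reviewable is to check it into the repo as data. Below is a minimal sketch: the `CallBudget` dataclass and its field names are hypothetical, the limits mirror the list above, and the cost helper takes per-1k-token prices as parameters rather than hard-coding any provider's rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallBudget:
    """Per-request token and latency budget, reviewed in PR like any config."""
    system_prompt_tokens: int   # fixed, cacheable portion
    max_context_tokens: int     # retrieved context
    max_user_tokens: int        # raw user input
    max_output_tokens: int      # completion cap passed to the API
    p99_latency_s: float        # here treated as an alerting threshold

    @property
    def max_total_tokens(self) -> int:
        return (self.system_prompt_tokens + self.max_context_tokens
                + self.max_user_tokens + self.max_output_tokens)

    def worst_case_cost(self, cached_per_1k: float, input_per_1k: float,
                        output_per_1k: float) -> float:
        """Upper-bound dollar cost per call, given per-1k-token prices."""
        return (self.system_prompt_tokens * cached_per_1k
                + (self.max_context_tokens + self.max_user_tokens) * input_per_1k
                + self.max_output_tokens * output_per_1k) / 1_000


# The summarization budget from the list above.
SUMMARIZE_BUDGET = CallBudget(
    system_prompt_tokens=800,
    max_context_tokens=4_000,
    max_user_tokens=500,
    max_output_tokens=600,
    p99_latency_s=4.0,
)

assert SUMMARIZE_BUDGET.max_total_tokens == 5_900
```

Because the file is plain data, a reviewer can see a limit change in the diff the same way they'd see a changed connection-pool size.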
When the budget is blown in production (overlong retrieved context, abusive user input), the request is either truncated with a logged warning or routed to a different model.
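A sketch of that enforcement path, building on the `CallBudget` above. The split between rerouting and truncating is one plausible policy, not the only one, and the tokenizer is assumed to be tiktoken-style (`.encode(str) -> list[int]`, `.decode(list[int]) -> str`); the model names are placeholders.

```python
import logging

log = logging.getLogger("llm.budget")


def enforce_budget(budget, tokenizer, context: str, user_input: str,
                   primary_model: str, fallback_model: str):
    """Apply the budget; returns (model, context, user_input) to actually send."""
    # Oversized user input: don't silently drop what the user wrote;
    # reroute to a fallback model instead.
    if len(tokenizer.encode(user_input)) > budget.max_user_tokens:
        log.warning("user input over budget; rerouting to %s", fallback_model)
        return fallback_model, context, user_input

    # Overlong retrieved context: truncate to the budget with a logged
    # warning so the retrieval side can be fixed upstream.
    context_ids = tokenizer.encode(context)
    if len(context_ids) > budget.max_context_tokens:
        log.warning("retrieved context over budget (%d tokens); truncating",
                    len(context_ids))
        context = tokenizer.decode(context_ids[:budget.max_context_tokens])

    return primary_model, context, user_input
```

The output cap isn't enforced here; `max_output_tokens` is passed straight through as the completion limit on the API call, and the p99 latency figure is monitored rather than checked per request.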
Without budgets you don't have a feature; you have an unbounded cost surface.