
Lesson 4 · 10 min

Prompt caching at every stable prefix

The single largest cost lever modern providers offer: discounts of up to 90% on repeated context prefixes. The catch: the prefix must be byte-identical.

The mechanic

When you send the same prefix tokens across requests, the provider can reuse the key/value (KV) cache it already computed for those tokens instead of rebuilding it during prefill, which is the expensive part of processing long inputs. They pass those savings back to you as a discount.
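
To see the size of the lever, here is a back-of-envelope sketch. The prices, write surcharge, and read discount are assumptions (roughly Anthropic-shaped numbers), not anyone's actual price sheet:

```python
# Assumed prices: $3.00 per million input tokens, 25% surcharge on the
# request that writes the cache, 90% discount on every read after that.
PRICE_PER_MTOK = 3.00
PREFIX_TOKENS = 10_000
CALLS = 100

uncached = CALLS * PREFIX_TOKENS * PRICE_PER_MTOK / 1e6
cached = (
    PREFIX_TOKENS * PRICE_PER_MTOK * 1.25 / 1e6                   # call 1 writes the cache
    + (CALLS - 1) * PREFIX_TOKENS * PRICE_PER_MTOK * 0.10 / 1e6   # calls 2..100 read it
)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")  # uncached $3.00 vs cached $0.33
```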

  • Anthropic: mark the end of the stable prefix with cache_control: {"type": "ephemeral"}. Cache reads are discounted 90%, cache writes cost 25% extra, and entries live on a 5-minute TTL that refreshes on each hit (sketch after this list).
  • OpenAI: automatic for prefixes ≥1024 tokens; cached input tokens are billed at a discount (typically 50%), no opt-in needed.
  • Google Gemini: an explicit cachedContent resource you create, pay hourly storage on, and reference by name on later calls (sketch after this list).
  • Bedrock: caching support and pricing vary per model.
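
A minimal sketch of the Anthropic flow using their Python SDK; the model id, prompt, and question are placeholders:

```python
import anthropic

STABLE_PREFIX = "...several thousand tokens of instructions, schemas, examples..."

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_PREFIX,
            # Everything up to and including this block becomes the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does clause 7 mean?"}],
)

# The usage object reports writes vs. reads, so you can verify you're hitting.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```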

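And a sketch of Gemini's explicit flow, assuming the google-genai SDK; again the model id, TTL, and contents are placeholders:

```python
from google import genai
from google.genai import types

BIG_DOCUMENT = "...the large, stable context you want cached..."

client = genai.Client()

# Create the cache once; you pay storage per hour for as long as it lives.
cache = client.caches.create(
    model="gemini-2.0-flash-001",  # placeholder model id
    config=types.CreateCachedContentConfig(
        system_instruction="Answer questions about the attached document.",
        contents=[BIG_DOCUMENT],
        ttl="300s",
    ),
)

# Reference the cache by name on every subsequent request.
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Summarize section 2.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```
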
The rules are consistent across all of them: the prefix must be byte-identical (a single changed byte invalidates every cached token after it), and the cache expires on a TTL (minutes by default; longer-lived caches cost extra). Keeping the prefix stable is mostly prompt hygiene, as the sketch below shows.
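
In practice "byte-identical" means keeping anything volatile out of the prefix. A sketch of the discipline (the names are illustrative, not tied to any SDK):

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "...your instructions..."  # stable: identical bytes on every call
TOOL_SCHEMAS = "...tool definitions, serialized deterministically (e.g. sorted keys)..."

def build_messages(user_input: str) -> list[dict]:
    return [
        # Stable prefix first: these tokens are cacheable across requests.
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + TOOL_SCHEMAS},
        # Volatile content last: a timestamp here would otherwise change one
        # byte of the prefix and invalidate every cached token behind it.
        {"role": "user",
         "content": f"[{datetime.now(timezone.utc).isoformat()}] {user_input}"},
    ]
```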