What changed
The big one: a 1M-token context window (up from 200k on Sonnet 4.6 and 500k on Opus 4.6). For most teams, that's an entire monorepo, the full corpus of internal docs, or a full day of support transcripts in a single window.
What it means in practice
- RAG calculus shifts. If your corpus fits in 1M tokens (and for most teams under 50 engineers, it probably does), the RAG-vs-stuff-the-context decision now leans harder toward "just stuff it" combined with prompt caching.
- Cost discipline matters more. A 1M-token call is expensive. The cached-prefix pattern (stable system prompt and corpus first, variable user input last) becomes essential rather than optional.
- Latency caveat. Time-to-first-token on 1M-token contexts is substantial; plan for it in user-facing flows (streaming, progress indicators, or async handoffs).
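Whether a corpus actually fits can be sanity-checked before committing to the stuff-the-context approach. A minimal sketch using the rough 4-characters-per-token heuristic (the real ratio varies by tokenizer, language, and content, so treat the result as an estimate; file extensions and the 80% headroom are assumptions):

```python
import os

CHARS_PER_TOKEN = 4        # rough heuristic; real tokenizers vary
CONTEXT_BUDGET = 1_000_000  # 1M-token window

def estimate_corpus_tokens(root: str, exts=(".md", ".py", ".txt")) -> int:
    """Walk a directory tree and estimate total tokens for matching files."""
    total_chars = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    continue  # skip unreadable files
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(root: str, headroom: float = 0.8) -> bool:
    """Leave headroom for the system prompt, conversation turns, and output."""
    return estimate_corpus_tokens(root) <= CONTEXT_BUDGET * headroom
```

If the estimate lands anywhere near the budget, verify with the provider's real token-counting endpoint before deciding.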
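The cached-prefix ordering can be enforced with a request builder that always puts the stable blocks first. A sketch assuming Anthropic-style `cache_control` breakpoints on the Messages API (check field names against the current prompt-caching docs; the model id is whatever you deploy):

```python
def build_request(system_prompt: str, corpus: str, user_input: str, model: str) -> dict:
    """Order the prompt so the stable prefix (system + corpus) is byte-identical
    across calls and ends with a cache breakpoint; only the user turn varies."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": corpus,
                # cache everything up to and including this block
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            {"role": "user", "content": user_input},  # variable part last
        ],
    }
```

Anything upstream of the breakpoint must be byte-identical to hit the cache, so keep the corpus serialization deterministic: stable file ordering, no embedded timestamps.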
What to watch
Whether retrieval quality (does the model actually use what's deep in the context?) holds up at 1M scale. Anthropic's needle-in-a-haystack eval looks good; production traffic with messier prompts will tell us more.
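That eval is also easy to approximate on your own, messier documents: bury a known fact at controlled depths in the filler and check whether answers surface it. A minimal prompt builder for such a harness (the model call and scoring are omitted; all names here are illustrative):

```python
def build_haystack_prompt(filler_docs: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end) of the
    concatenated filler, so retrieval can be measured as a function of position."""
    haystack = "\n\n".join(filler_docs)
    pos = int(len(haystack) * depth)
    # snap back to a paragraph boundary so the needle isn't spliced mid-sentence
    boundary = haystack.rfind("\n\n", 0, pos)
    pos = 0 if boundary == -1 else boundary
    return haystack[:pos] + "\n\n" + needle + "\n\n" + haystack[pos:]
```

Sweeping `depth` over, say, 0.0 to 1.0 in steps of 0.1 at a few context sizes gives a position-by-size grid of recall, which is the shape of result the public haystack evals report.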