Lesson 4 · 11 min
State and memory architecture
An LLM is stateless between requests: everything the model knows must arrive in the request. Choosing where each piece of state lives (context window vs. cache vs. database vs. vector store) determines your application's cost, latency, and correctness.
The five storage tiers
AI applications have five places to keep state, each with different trade-offs:
| Tier | What lives here | Latency | Cost | Persistence |
|---|---|---|---|---|
| Context window | Current turn, retrieved docs, instructions | 0ms | Per-token | Ephemeral |
| Prompt cache | Stable prefix, system docs | 0ms (hit) | 10% of normal | 5 min TTL |
| In-memory / Redis | Session state, rate limit counters, job queue | <1ms | Low | Hours/days |
| Database | User profile, conversation history, preferences | 1–5ms | Lowest | Permanent |
| Vector store | Semantic knowledge base | 5–50ms | Low | Permanent |
Most bugs in production AI apps come from putting the wrong thing in the wrong tier.
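The tier routing above can be sketched in code. This is a minimal illustration, not a production pattern: the dicts stand in for Redis, a relational database, and a vector store, and every name here (`build_request`, `session_store`, `knowledge_base`, and so on) is hypothetical.

```python
# Stand-ins for the storage tiers. In production these would be a Redis
# client, a database client, and a vector-store client respectively.
STABLE_SYSTEM_PROMPT = "You are a support assistant."  # stable prefix: keep byte-identical so a prompt cache can hit

database = {"user-42": {"name": "Ada", "plan": "pro"}}            # database tier: permanent
session_store = {"sess-9": [("user", "Hi!"),                      # in-memory tier: hours/days
                            ("assistant", "Hello, how can I help?")]}
knowledge_base = {"refunds": "Refunds are issued within 5 days."}  # vector-store tier (keyword lookup here)

def build_request(user_id: str, session_id: str, query: str, topic: str) -> list[dict]:
    """Assemble one turn's context window from the slower tiers."""
    profile = database[user_id]              # ~1-5 ms, permanent
    history = session_store[session_id]      # <1 ms, session-scoped
    doc = knowledge_base.get(topic, "")      # ~5-50 ms in a real vector store
    return [
        # Cacheable stable prefix goes first, per-turn state after it.
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        {"role": "system", "content": f"User: {profile['name']} ({profile['plan']}). Docs: {doc}"},
        *({"role": role, "content": text} for role, text in history),
        {"role": "user", "content": query},
    ]

messages = build_request("user-42", "sess-9", "How do refunds work?", "refunds")
```

Note the ordering: the stable system prompt comes before anything that changes per user or per turn, which is what lets a prefix-based prompt cache reuse it.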