Lesson 9 · 10 min

RAG in production: cost, latency, freshness

A working RAG demo is 10% of the work. The rest is keeping it healthy.

The four production levers

1. Latency

Total latency = embed query + ANN search + (rerank?) + LLM generation.

  • Cache query embeddings (LRU on the literal query string).
  • Use a small fast embedding model for queries (BGE-small, voyage-light).
  • Keep the system prompt + few-shot prefix stable across requests so provider-side prompt caching can reuse it, cutting both time-to-first-token and prompt cost substantially.
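The query-embedding cache from the first bullet can be a plain LRU keyed on the literal query string. A minimal sketch, where `embed_query` is a hypothetical stand-in for your real embedding client:

```python
from functools import lru_cache

def embed_query(text: str) -> tuple[float, ...]:
    # Placeholder for a real embedding call (e.g. a small model like BGE-small).
    # Returns a tuple so the result is immutable and safe to cache.
    return (float(len(text)), float(text.count(" ")))

@lru_cache(maxsize=10_000)
def cached_embed(query: str) -> tuple[float, ...]:
    # Keyed on the exact query string: repeated queries skip the embed call.
    return embed_query(query)
```

Exact-string caching only helps when users repeat queries verbatim; normalizing (lowercasing, trimming whitespace) before lookup raises the hit rate at little risk.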

2. Cost

  • Embedding cost is one-time per chunk + recurring per query.
  • LLM generation cost dominates at scale. Reduce it with shorter chunks, fewer retrieved chunks, routing easy queries to a smaller LLM, and caching.
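A back-of-envelope calculator makes the "LLM cost dominates" claim concrete. All prices below are illustrative assumptions, not real vendor rates:

```python
# Illustrative $/1M-token prices (assumptions, not real quotes).
EMBED_PRICE_PER_MTOK = 0.02
LLM_IN_PRICE_PER_MTOK = 0.50
LLM_OUT_PRICE_PER_MTOK = 1.50

def cost_per_query(query_toks: int, k: int, chunk_toks: int, answer_toks: int) -> float:
    """Dollar cost of one RAG query: embed the query, stuff k chunks
    into the LLM prompt, generate an answer."""
    embed = query_toks / 1e6 * EMBED_PRICE_PER_MTOK
    llm_in = (query_toks + k * chunk_toks) / 1e6 * LLM_IN_PRICE_PER_MTOK
    llm_out = answer_toks / 1e6 * LLM_OUT_PRICE_PER_MTOK
    return embed + llm_in + llm_out
```

Plugging in typical numbers shows the embed term is negligible next to the `k * chunk_toks` prompt term, which is why trimming `k` and chunk size is the first cost lever to pull.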

3. Freshness

When the corpus changes:

  • Append-only: cheap to update, but stale vectors for deleted or changed documents accumulate; periodically rebuild the index to purge them.
  • Re-embed all: expensive but clean. Schedule weekly/monthly.
  • Detect changes via hash, re-embed only changed chunks.
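The hash-based option in the last bullet can be sketched as a diff between the stored hashes and the current corpus. The function and field names here are illustrative:

```python
import hashlib

def chunk_hash(text: str) -> str:
    # Content hash of a chunk; any stable hash works.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_chunks(stored: dict[str, str], current: dict[str, str]):
    """stored: {chunk_id: hash} from the last index build.
    current: {chunk_id: text} for the corpus now.
    Returns (ids to re-embed, ids to delete from the index)."""
    changed = [cid for cid, text in current.items()
               if stored.get(cid) != chunk_hash(text)]
    deleted = [cid for cid in stored if cid not in current]
    return changed, deleted
```

Only the `changed` set goes back through the embedding model; `deleted` ids are dropped from the vector store, which avoids the stale-vector problem of pure append-only updates.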

4. Observability

Log every query: the top-k chunk IDs, their retrieval scores, the final answer, and any user feedback. Without this record you cannot tell whether a bad answer came from retrieval or from generation, and debugging becomes guesswork.
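One low-friction way to capture those fields is an append-only JSONL log, one record per query. The schema below is a suggested shape, not a standard:

```python
import json
import time
import uuid

def log_query(query: str, chunk_ids: list[str], scores: list[float],
              answer: str, path: str = "rag_queries.jsonl") -> None:
    """Append one structured record per RAG query (field names illustrative)."""
    record = {
        "id": str(uuid.uuid4()),   # join key for later feedback events
        "ts": time.time(),
        "query": query,
        "chunk_ids": chunk_ids,    # top-k retrieved chunks, in rank order
        "scores": scores,          # parallel list of retrieval scores
        "answer": answer,
        "feedback": None,          # filled in when the user votes
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

JSONL keeps writes cheap and lets you grep or load the log into a dataframe later; the `id` field lets asynchronous user feedback be joined back to the original retrieval.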