Lesson 9 · 10 min

RAG in production: cost, latency, freshness

A working RAG demo is 10% of the work. The rest is keeping it healthy.

The four production levers

1. Latency

Total latency = embed query + ANN search + (rerank?) + LLM generation.

  • Cache query embeddings (LRU on the literal query string).
  • Use a small fast embedding model for queries (BGE-small, voyage-light).
  • Keep the system prompt + few-shot prefix stable across requests so provider-side prompt caching can reuse it, cutting both time-to-first-token and prompt cost substantially.
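The query-embedding cache from the first bullet can be a plain LRU keyed on the literal query string. A minimal sketch, where `embed_query` is a hypothetical stand-in for your real embedding client:

```python
from functools import lru_cache

def embed_query(text: str) -> tuple[float, ...]:
    # Placeholder for a real embedding call (e.g. a small model like BGE-small).
    # Returns a tuple so the result is immutable and safe to cache.
    return (float(len(text)), float(text.count(" ")))

@lru_cache(maxsize=10_000)
def cached_embed(query: str) -> tuple[float, ...]:
    # Keyed on the exact query string: repeated queries skip the embed call.
    return embed_query(query)
```

Exact-string caching only helps when users repeat queries verbatim; normalizing (lowercasing, trimming whitespace) before lookup raises the hit rate at little risk.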

2. Cost

  • Embedding cost is one-time per chunk + recurring per query.
  • LLM generation cost dominates at scale. Reduce it with shorter chunks, fewer retrieved chunks, routing easy queries to a smaller LLM, and caching.
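A back-of-envelope calculator makes the "LLM cost dominates" claim concrete. All prices below are illustrative assumptions, not real vendor rates:

```python
# Illustrative $/1M-token prices (assumptions, not real quotes).
EMBED_PRICE_PER_MTOK = 0.02
LLM_IN_PRICE_PER_MTOK = 0.50
LLM_OUT_PRICE_PER_MTOK = 1.50

def cost_per_query(query_toks: int, k: int, chunk_toks: int, answer_toks: int) -> float:
    """Dollar cost of one RAG query: embed the query, stuff k chunks
    into the LLM prompt, generate an answer."""
    embed = query_toks / 1e6 * EMBED_PRICE_PER_MTOK
    llm_in = (query_toks + k * chunk_toks) / 1e6 * LLM_IN_PRICE_PER_MTOK
    llm_out = answer_toks / 1e6 * LLM_OUT_PRICE_PER_MTOK
    return embed + llm_in + llm_out
```

Plugging in typical numbers shows the embed term is negligible next to the `k * chunk_toks` prompt term, which is why trimming `k` and chunk size is the first cost lever to pull.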

3. Freshness

When the corpus changes:

  • Append-only: cheap to update, but stale vectors for deleted or changed documents accumulate; periodically rebuild the index to purge them.
  • Re-embed all: expensive but clean. Schedule weekly/monthly.
  • Detect changes via hash, re-embed only changed chunks.
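The hash-based option in the last bullet can be sketched as a diff between the stored hashes and the current corpus. The function and field names here are illustrative:

```python
import hashlib

def chunk_hash(text: str) -> str:
    # Content hash of a chunk; any stable hash works.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_chunks(stored: dict[str, str], current: dict[str, str]):
    """stored: {chunk_id: hash} from the last index build.
    current: {chunk_id: text} for the corpus now.
    Returns (ids to re-embed, ids to delete from the index)."""
    changed = [cid for cid, text in current.items()
               if stored.get(cid) != chunk_hash(text)]
    deleted = [cid for cid in stored if cid not in current]
    return changed, deleted
```

Only the `changed` set goes back through the embedding model; `deleted` ids are dropped from the vector store, which avoids the stale-vector problem of pure append-only updates.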

4. Observability

Log every query: the top-k chunk IDs, their retrieval scores, the final answer, and any user feedback. Without this record you cannot tell whether a bad answer came from retrieval or from generation, and debugging becomes guesswork.
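One low-friction way to capture those fields is an append-only JSONL log, one record per query. The schema below is a suggested shape, not a standard:

```python
import json
import time
import uuid

def log_query(query: str, chunk_ids: list[str], scores: list[float],
              answer: str, path: str = "rag_queries.jsonl") -> None:
    """Append one structured record per RAG query (field names illustrative)."""
    record = {
        "id": str(uuid.uuid4()),   # join key for later feedback events
        "ts": time.time(),
        "query": query,
        "chunk_ids": chunk_ids,    # top-k retrieved chunks, in rank order
        "scores": scores,          # parallel list of retrieval scores
        "answer": answer,
        "feedback": None,          # filled in when the user votes
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

JSONL keeps writes cheap and lets you grep or load the log into a dataframe later; the `id` field lets asynchronous user feedback be joined back to the original retrieval.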