Lesson 7 · 10 min
Embedding evaluation: offline metrics and production quality signals
Embedding quality has to be measured, not assumed. nDCG@k and MRR are the standard offline metrics; production proxies (answer faithfulness, user query reformulation rate) tell you whether the retrieval system is working end to end.
The two evaluation layers
Layer 1: Retrieval metrics — does the embedding system return the right documents?
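- nDCG@k (Normalized Discounted Cumulative Gain): measures ranking quality with a position discount, so a relevant document at rank 1 is worth more than the same document at rank 5. Scores are normalized against the ideal ordering, so 1.0 means a perfect ranking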
- MRR (Mean Reciprocal Rank): the average, over all queries, of 1 / (rank of the first relevant result). Simple and interpretable, but it ignores every relevant document after the first
- Recall@k: fraction of all relevant documents for a query that appear in the top-k results. Measures coverage, not ordering
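The three retrieval metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes graded relevance labels stored in a dict (`relevance`), uses the common `rel / log2(rank + 1)` form of the DCG discount, and scores a query with no relevant result as 0. All function and variable names here are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank starts at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k: DCG of the returned ranking divided by DCG of the ideal ranking.

    relevance maps doc id -> graded relevance (assumed label format);
    documents missing from the dict count as relevance 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]   # system output for one query
    labels = {"d1": 1, "d2": 1}         # hypothetical judgments
    print("nDCG@4  :", round(ndcg_at_k(ranked, labels, 4), 4))
    print("RR      :", reciprocal_rank(ranked, set(labels)))
    print("Recall@3:", recall_at_k(ranked, set(labels), 3))
```

In practice these are averaged over a labeled query set. nDCG is the only one of the three that uses graded labels; MRR and Recall@k treat relevance as binary.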
Layer 2: End-to-end quality — does better retrieval produce better LLM answers?
- Faithfulness: does the LLM answer stay grounded in the retrieved content, without contradicting it or adding unsupported claims?
- Answer relevance: does the answer address the query?
- Query reformulation rate: how often do users rephrase a query immediately after getting an answer? A rephrase is an implicit signal that the answer was bad
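Of the three end-to-end signals, query reformulation rate is the easiest to compute from logs alone. The sketch below is one possible operationalization, under assumptions that are not part of the lesson: a hypothetical `QueryEvent` log record, "immediately" defined as within `window_s` seconds in the same session, and "rephrase" defined as a different query string with token-level Jaccard overlap of at least `min_overlap`. Both thresholds are placeholders to tune against your own traffic.

```python
from dataclasses import dataclass


@dataclass
class QueryEvent:
    """Hypothetical log record; field names are assumptions."""
    session_id: str
    timestamp: float  # seconds since session start or epoch
    query: str


def token_jaccard(a, b):
    """Jaccard overlap between the lowercase whitespace tokens of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def reformulation_rate(events, window_s=60.0, min_overlap=0.3):
    """Share of queries followed quickly by a similar-but-different query.

    The denominator is every logged query, including the last one in
    each session (which by definition cannot be reformulated).
    """
    by_session = {}
    for event in sorted(events, key=lambda e: (e.session_id, e.timestamp)):
        by_session.setdefault(event.session_id, []).append(event)

    total = 0
    reformulated = 0
    for session in by_session.values():
        total += len(session)
        for current, nxt in zip(session, session[1:]):
            quick = nxt.timestamp - current.timestamp <= window_s
            similar = token_jaccard(current.query, nxt.query) >= min_overlap
            changed = current.query.strip().lower() != nxt.query.strip().lower()
            if quick and similar and changed:
                reformulated += 1
    return reformulated / total if total else 0.0


if __name__ == "__main__":
    log = [
        QueryEvent("s1", 0.0, "reset password"),
        QueryEvent("s1", 20.0, "how to reset account password"),
        QueryEvent("s1", 300.0, "billing invoice download"),
        QueryEvent("s2", 0.0, "export csv"),
    ]
    print(reformulation_rate(log))
```

Lexical overlap is a crude similarity test: it misses paraphrases that share no words. Comparing the two queries with the embedding model itself is a natural alternative, at the cost of measuring the system with the component under evaluation.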