Lesson 7 · 10 min

Embedding evaluation: offline metrics and production quality signals

Embedding quality has to be measured, not assumed. nDCG@k and MRR are the standard offline metrics; production proxies (answer faithfulness, user query reformulation rate) tell you whether the retrieval system is working end-to-end.

The two evaluation layers

Layer 1: Retrieval metrics — does the embedding system return the right documents?

  • nDCG@k (Normalized Discounted Cumulative Gain): measures ranking quality, accounting for position — a relevant document at rank 1 is worth more than at rank 5
  • MRR (Mean Reciprocal Rank): the average, over all queries, of 1/rank of the first relevant result. Simple and interpretable
  • Recall@k: fraction of relevant documents found in the top-k results. Measures coverage
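
The three retrieval metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and input shapes (a ranked list of document IDs plus a set or dict of relevance judgments) are assumptions made for the example, and nDCG here uses the linear-gain form (gain / log2(rank + 1)); some tools use the exponential form (2^gain - 1) instead.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over queries.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking.

    `relevance` maps doc_id -> graded gain; unjudged documents count as 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

For example, with the ranking ["d3", "d1", "d7", "d2"] and relevant documents {"d1", "d2"}, recall_at_k(..., k=3) is 0.5 (one of two relevant documents is in the top 3) and the reciprocal rank is 0.5 (the first relevant document sits at rank 2).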

Layer 2: End-to-end quality — does better retrieval produce better LLM answers?

  • Faithfulness: does the LLM answer stay grounded in the retrieved content, without adding unsupported claims?
  • Answer relevance: does the answer address the query?
  • Query reformulation rate: how often do users rephrase a query immediately after getting an answer? A rephrase is an implicit signal that the answer was bad
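
Query reformulation rate can be estimated from a query log. The sketch below is one possible heuristic, not a standard definition: it assumes a hypothetical log of (session_id, timestamp_seconds, query_text) tuples, and it counts a query as reformulated when the same session issues a similar but non-identical query within a short time window. The 60-second window and the 0.5 similarity threshold are illustrative values that would need tuning on real traffic.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def reformulation_rate(events, window_s=60.0, min_similarity=0.5):
    """Fraction of queries followed by a quick, similar rephrase in the same session.

    `events` is an iterable of (session_id, timestamp_seconds, query_text).
    `window_s` and `min_similarity` are assumed defaults, not established values.
    """
    by_session = defaultdict(list)
    for session_id, ts, query in events:
        by_session[session_id].append((ts, query.strip().lower()))

    total = 0
    reformulated = 0
    for queries in by_session.values():
        queries.sort(key=lambda item: item[0])  # chronological order
        total += len(queries)
        # Compare each query with the one that immediately follows it.
        for (t0, q0), (t1, q1) in zip(queries, queries[1:]):
            if t1 - t0 > window_s or q0 == q1:
                continue  # too late to be a rephrase, or an exact repeat
            if SequenceMatcher(None, q0, q1).ratio() >= min_similarity:
                reformulated += 1
    return reformulated / total if total else 0.0
```

Character-level similarity is a crude stand-in for "the user is asking the same thing again"; comparing query embeddings would be a natural alternative, at the cost of an extra model call per query pair.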