Lesson 7 · 9 min
Evaluation — precision, recall, faithfulness
Three metrics that tell you whether your retrieval pipeline is working, plus a realistic estimate of the labelling effort required.
The three metrics
- Precision@k. Of the k chunks retrieved, what fraction are actually relevant? The metric for hallucination prevention — irrelevant chunks confuse the LLM.
- Recall@k. Of all relevant chunks in the corpus, what fraction did we retrieve in top-k? The metric for coverage — missing chunks mean wrong answers.
- Faithfulness. Is every claim in the generated answer supported by the retrieved chunks? Typically scored by an LLM-as-judge comparing the answer against the chunks. Measures whether the model is grounding its output rather than inventing.
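Precision@k and recall@k are simple set arithmetic once you have labelled relevant chunks per query. A minimal sketch, assuming chunks are identified by string IDs (the IDs and the example data below are made up for illustration):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are labelled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for c in top_k if c in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all labelled-relevant chunk IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for c in relevant if c in top_k) / len(relevant)

# Hypothetical query: retriever returned 5 chunks, labellers marked 3 as relevant.
retrieved = ["c3", "c7", "c1", "c9", "c4"]
relevant = {"c1", "c3", "c8"}
print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/3
```

In practice you average both numbers over the whole eval set; a high precision with low recall usually means k is too small or the labelled set is broader than what the retriever can surface.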
For a typical RAG eval set: 30-50 queries, each with 3-5 hand-labelled relevant chunks. Yes, that's hours of labelling. Worth it; everything else flows from it.
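Faithfulness needs a judge prompt rather than set arithmetic. A minimal sketch of how such a prompt might be assembled; the prompt wording is an assumption, and the actual LLM call is omitted since it depends on your provider:

```python
def build_faithfulness_prompt(answer: str, chunks: list[str]) -> str:
    """Assemble an LLM-as-judge prompt that asks for a 0.0-1.0 faithfulness score.

    The wording below is a hypothetical template, not a standard prompt.
    """
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "You are grading a RAG answer for faithfulness.\n"
        "Context chunks:\n"
        f"{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "For each claim in the answer, decide whether it is supported by the "
        "context above. Reply with a single number: the fraction of supported "
        "claims, from 0.0 to 1.0."
    )

# Hypothetical example data.
prompt = build_faithfulness_prompt(
    "Refunds are available for 30 days.",
    ["Our refund window is 30 days from purchase.", "Shipping takes 5 days."],
)
print(prompt.splitlines()[0])
```

The returned score would then be averaged over the 30-50 eval queries alongside precision and recall. Numbering the chunks lets a stronger judge also report *which* chunk supports each claim, which makes failures easier to debug.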