Lesson 6 · 10 min
Hybrid retrieval: dense + sparse
Dense embeddings excel at semantic similarity. BM25-style sparse retrieval excels at exact keyword matching. Hybrid retrieval combines both and typically outperforms either alone, often by 5–15% on nDCG@10, though the size of the gain depends on the dataset and the query mix.
The dense-sparse complementarity
Dense embeddings and sparse (BM25) retrieval have complementary failure modes:
Dense fails at:
- Exact keyword matching: a query for "CVE-2024-1234" can return unrelated security content because the embedding captures the topic, not the exact identifier
- Named entities with no semantic context: product names, IDs, error codes, acronyms
- Out-of-vocabulary terms: new product names, emerging jargon the model wasn't trained on
Sparse fails at:
- Semantic equivalence: the query "end the contract" doesn't match a document that says "terminate the agreement", because the two share no content words
- Paraphrase retrieval: a query that shares few or no terms with the indexed text scores low, even when the meaning is the same
- Cross-lingual: BM25 requires lexical overlap, which cross-language queries don't have
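The sparse failure mode is easy to reproduce. The sketch below is a minimal, illustrative BM25 scorer, not a production implementation. The tokenizer, the tiny stop-word set, the parameter defaults (k1=1.5, b=0.75), and the two sample documents are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word set, kept tiny for the example.
STOPWORDS = {"the", "a", "an", "of", "with"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (higher means a better match)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "Either party may terminate the agreement with 30 days notice.",
    "Advisory CVE-2024-1234 affects the parser.",
]

# Same meaning as docs[0], but no shared content words: every score is zero.
print(bm25_scores("end the contract", docs))  # [0.0, 0.0]

# Exact identifier: only docs[1] gets a positive score.
print(bm25_scores("CVE-2024-1234", docs))
```

The first query illustrates the paraphrase problem: BM25 has nothing to score when no query term appears in the document. The second shows where sparse retrieval is strong, since the identifier's tokens match only the advisory.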
Hybrid retrieval uses both: dense for semantic similarity, sparse for exact matching, combined via Reciprocal Rank Fusion (RRF).
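As a rough illustration of the fusion step, here is a minimal RRF sketch. It assumes each retriever returns an ordered list of document IDs, best first. The function name, the sample IDs, and the constant k=60 (a commonly used default) are choices made for this example, not something the lesson prescribes. Each document receives 1 / (k + rank) from every list it appears in, and the contributions are summed.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant; larger values flatten the gap between ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from the two retrievers.
dense_hits = ["d2", "d1", "d3"]
sparse_hits = ["d1", "d4", "d2"]

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['d1', 'd2', 'd4', 'd3']
```

RRF uses only ranks, never raw scores, so cosine similarities and BM25 scores need no normalization onto a common scale. Documents that both retrievers rank highly (d1 and d2 here) rise to the top, while documents found by only one retriever still appear further down.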