Lesson 6 · 10 min

Hybrid retrieval: dense + sparse

Dense embeddings excel at semantic similarity. BM25-style sparse retrieval excels at exact keyword matching. Hybrid retrieval combines both and consistently outperforms either alone by 5–15% on nDCG@10.

The dense-sparse complementarity

Dense embeddings and sparse (BM25) retrieval have complementary failure modes:

Dense fails at:

  • Exact keyword matching: a query for "CVE-2024-1234" can return unrelated security content because the embedding captures the topic, not the identifier
  • Named entities with no semantic context: product names, IDs, error codes, acronyms
  • Out-of-vocabulary terms: new product names, emerging jargon the model wasn't trained on

Sparse fails at:

  • Semantic equivalence: "end the contract" doesn't match "terminate the agreement" because the two phrases share no content words
  • Paraphrase retrieval: any query phrased differently from the indexed text
  • Cross-lingual retrieval: BM25 requires lexical overlap, which a query in one language rarely has with documents in another
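
The sparse behavior described above, both the paraphrase failure and the exact-identifier strength, can be reproduced with a few lines of BM25. This is a minimal sketch, not a production scorer: the whitespace tokenizer, the tiny stopword set, the sample documents, and the parameter defaults k1=1.5 and b=0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "an", "to", "for", "in", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (simplified, in-memory)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Either party may terminate the agreement with notice",
    "Patch released for CVE-2024-1234 in the login service",
]

paraphrase = bm25_scores("end the contract", docs)  # no shared content words
identifier = bm25_scores("CVE-2024-1234", docs)     # exact token match
```

In this toy corpus the paraphrased query scores zero against both documents, while the identifier query scores above zero only for the document that contains the exact token.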

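Hybrid retrieval uses both: dense for semantic similarity, sparse for exact matching. The two ranked lists are combined via Reciprocal Rank Fusion (RRF), which merges on rank position rather than raw score, so the dense and sparse scores never need to be put on a common scale.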
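
As a rough illustration of the fusion step, RRF gives each document a score equal to the sum of 1 / (k + rank) over every ranked list in which it appears, then sorts by that sum. The sketch below assumes each retriever returns document IDs ordered best-first. The function name, the sample IDs, and k=60 (a commonly used default) are assumptions, not values taken from this lesson.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs (each ordered best-first) with RRF."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from each retriever.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
# fused_ids == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Documents found by both retrievers (doc_a, doc_c) rise above those found by only one, which is the behavior a hybrid pipeline relies on.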