Lesson 8 · 18 min

Capstone: build a domain-adapted embedding pipeline

End-to-end walkthrough for a legal document Q&A system: select an embedding model using the MTEB leaderboard, generate fine-tuning triplets, train, evaluate offline, set up hybrid retrieval, and instrument production monitoring.

The task

Build an embedding pipeline for a legal document Q&A system:

Corpus: 5,000 UK contract law documents (case summaries, clauses, precedents)
Query types: point lookup ("what does the termination clause say?"), semantic ("how is liability typically limited?"), and identifier ("see clause 12.4.3")
Baseline: general-purpose embedding model, fixed-size chunking, pure dense retrieval
Target: nDCG@10 above 0.80 on a hand-labeled test set of 200 queries
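
Since the target is stated as nDCG@10, it helps to pin down how that number could be computed before any model is chosen. The sketch below is one common formulation (linear gain, log2 rank discount), averaged over queries. The lesson does not specify the gain function or the label format, so both are assumptions here: `qrels` maps each query ID to graded relevance labels per document, and `runs` maps each query ID to the retriever's ranked document IDs. All function and variable names are illustrative.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    # rank is 0-based, so the discount for position 1 is log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    Assumes ranked_ids contains no duplicate document IDs.
    Documents missing from `relevance` are treated as non-relevant (gain 0).
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    # Ideal ordering: all labeled documents sorted by relevance, best first.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(
    runs: Dict[str, List[str]],
    qrels: Dict[str, Dict[str, float]],
    k: int = 10,
) -> float:
    """Average nDCG@k over every labeled query; a missing run scores 0."""
    scores = [ndcg_at_k(runs.get(qid, []), rels, k) for qid, rels in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy labels for two hypothetical queries (2 = highly relevant, 1 = partially).
    qrels = {
        "q1": {"doc_a": 2.0, "doc_b": 1.0},
        "q2": {"doc_c": 1.0},
    }
    runs = {
        "q1": ["doc_b", "doc_a", "doc_x"],  # relevant docs retrieved, wrong order
        "q2": ["doc_c", "doc_y"],           # relevant doc ranked first
    }
    print(f"mean nDCG@10 = {mean_ndcg(runs, qrels):.3f}")
```

Running the same function on the baseline and on every later variant keeps the comparison against the 0.80 target consistent. Some toolkits use an exponential gain (2^rel - 1) instead of a linear one, which yields different scores for graded labels, so it is worth fixing the formulation once and reporting it alongside the result.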