Lesson 8 · 18 min
Capstone: build a domain-adapted embedding pipeline
End-to-end: select an embedding model from MTEB, generate fine-tuning triplets, train, evaluate offline, set up hybrid retrieval, and instrument production monitoring for a legal document Q&A system.
The task
Build an embedding pipeline for a legal document Q&A system:
- Corpus: 5,000 UK contract law documents (case summaries, clauses, precedents)
- Query types: point lookup ("what does the termination clause say?"), semantic ("how is liability typically limited?"), and identifier ("see clause 12.4.3")
- Baseline: general-purpose embedding model, fixed-size chunking, pure dense retrieval
- Target: nDCG@10 above 0.80 on a hand-labeled test set of 200 queries
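Since the target is stated in nDCG@10, it helps to pin down how that number is computed before changing anything in the pipeline. The sketch below is one minimal way to score a retriever against the labeled queries. It assumes the linear-gain form of DCG (some evaluation tools use the exponential gain 2^rel - 1 instead, which gives different values for graded labels). The names `runs`, `qrels`, `ndcg_at_k`, and `mean_ndcg` are illustrative and not part of any library.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them
                (assumed to contain no duplicates).
    relevance:  hand-labeled relevance grade per document id; unlabeled
                documents are treated as grade 0.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0.0:
        # No relevant documents were labeled for this query.
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


def mean_ndcg(runs: Dict[str, List[str]], qrels: Dict[str, Dict[str, float]], k: int = 10) -> float:
    """Average nDCG@k over every labeled query; a missing run scores 0."""
    scores = [ndcg_at_k(runs.get(qid, []), rels, k) for qid, rels in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `mean_ndcg` over the 200 labeled queries for the baseline first gives the reference point that every later change (fine-tuning, chunking, hybrid retrieval) is compared against.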