Lesson 3 · 13 min
Fine-tuning embedding models for domain-specific retrieval
When off-the-shelf embeddings fail on your domain, fine-tuning on your own (query, positive, hard negative) triplets reliably improves nDCG@10 by 15–30%. The data pipeline matters more than the training code.
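To make the triplet format concrete, here is a minimal training sketch using the sentence-transformers library; the base model name, example texts, output path, and hyperparameters are illustrative, not prescribed by this lesson:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

# Each training example is one (query, positive, hard negative) triplet.
train_examples = [
    InputExample(texts=[
        "grace period for late rent payment",                        # query
        "Tenants have a five-day grace period before late fees...",  # positive
        "Late fees are capped at 5% of monthly rent...",             # hard negative
    ]),
    # ...in practice, 1,000+ triplets
]

train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
# MultipleNegativesRankingLoss treats the third text as an explicit hard
# negative and every other in-batch positive as an additional negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="domain-embedder",  # hypothetical output directory
)
```

Larger batch sizes generally help this loss, since each extra in-batch positive doubles as another negative for every query.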
When to fine-tune vs. when to use off-the-shelf
Fine-tune when:
- Your domain has specialized vocabulary not well-represented in web data (clinical notes, legal contracts, source code, scientific papers)
- Retrieval quality on a domain-specific eval set is > 10% below a top general-purpose model on the MTEB leaderboard
- You have or can generate 1,000+ (query, positive document) pairs (a hard-negative mining sketch follows this list)
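Turning raw pairs into hard-negative triplets is the pipeline step that matters most. A minimal sketch, using the base model itself to mine negatives; the pairs, corpus, and model name are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

# (query, positive) pairs and a corpus; real pipelines have 1,000+ of each.
pairs = [("grace period for late rent payment",
          "Tenants have a five-day grace period before late fees accrue...")]
corpus = [
    "Tenants have a five-day grace period before late fees accrue...",
    "Late fees are capped at 5% of monthly rent in most jurisdictions...",
    "Security deposits must be returned within 21 days of move-out...",
]

corpus_emb = model.encode(corpus, convert_to_tensor=True)
triplets = []
for query, positive in pairs:
    q_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=3)[0]
    # Hard negative: ranked highly by the base model, but not the gold positive.
    for hit in hits:
        candidate = corpus[hit["corpus_id"]]
        if candidate != positive:
            triplets.append((query, positive, candidate))
            break
```

Negatives mined this way are "hard" precisely because the base model confuses them with the positive, which is what gives the fine-tuned model signal to learn from.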
Don't fine-tune when:
- Your retrieval corpus is general web-like text
- You have < 500 training pairs (overfitting risk)
- The off-the-shelf model already achieves > 0.80 nDCG@10 on your domain (a measurement sketch follows this list)
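Measuring that threshold is cheap. A minimal sketch using sentence-transformers' InformationRetrievalEvaluator, which reports nDCG@k among other metrics; the queries, corpus, and relevance labels are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf model

queries = {"q1": "grace period for late rent payment"}
corpus = {
    "d1": "Tenants have a five-day grace period before late fees accrue...",
    "d2": "Security deposits must be returned within 21 days of move-out...",
}
relevant_docs = {"q1": {"d1"}}  # gold labels: docs that answer each query

evaluator = InformationRetrievalEvaluator(
    queries, corpus, relevant_docs, ndcg_at_k=[10], name="domain-eval"
)
results = evaluator(model)
# Depending on the library version, `results` is the primary score (a float)
# or a dict keyed like "domain-eval_cosine_ndcg@10".
print(results)
```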
The most common mistake: fine-tuning on a small domain dataset and overfitting, producing a model that memorizes training pairs but generalizes worse than the base model.
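One guard against this, continuing the sketches above (train_loader, train_loss, and an evaluator built from a held-out split, as defined there): score a dev set during training and keep only the best checkpoint instead of the last one.

```python
# Evaluate on held-out data during training; save_best_model keeps the
# checkpoint with the best dev score rather than the (possibly overfit) final one.
model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=3,
    warmup_steps=100,
    evaluator=evaluator,   # built from a held-out split, never from training pairs
    evaluation_steps=200,  # score the dev set every 200 steps
    output_path="domain-embedder",
    save_best_model=True,
)
```

If the dev score peaks early and then declines while training loss keeps falling, that divergence is the overfitting signature described above.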