Lesson 2 · 11 min
Choosing an embedding model: MTEB and task-specific evaluation
MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for embedding models. Using it correctly means selecting for your specific task type and language — not just picking the top overall score.
The MTEB leaderboard
MTEB originally evaluated embedding models across 8 task categories and 58 datasets (the leaderboard has since grown well beyond that initial set):
| Category | What it measures |
|---|---|
| Retrieval | Find the most relevant document for a query |
| Reranking | Rank candidates by relevance to a query |
| Semantic Textual Similarity (STS) | Predict similarity score between sentence pairs |
| Clustering | Group semantically similar texts |
| Classification | Embed → classify with a linear probe |
| Pair Classification | Duplicate detection, NLI |
| Bitext Mining | Find parallel sentences across languages |
| Summarization | Score generated summaries vs. reference |
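Selecting by task type and language, rather than overall score, amounts to filtering the leaderboard down to the slice that matches your workload. A minimal sketch, using a made-up snapshot of leaderboard rows (the models, categories, and scores below are illustrative, not real MTEB data):

```python
# Hypothetical leaderboard rows: (model, task_category, language, score).
# All values are invented for illustration -- not actual MTEB results.
rows = [
    ("model_a", "Retrieval", "eng", 52.0),
    ("model_a", "STS", "eng", 85.0),
    ("model_b", "Retrieval", "eng", 57.0),
    ("model_b", "Retrieval", "deu", 44.0),
    ("model_c", "Clustering", "eng", 49.0),
]

# Filter to the category and language that match your workload,
# then rank models on that slice only -- ignoring the overall average.
task, lang = "Retrieval", "eng"
candidates = [(m, s) for (m, t, l, s) in rows if t == task and l == lang]
ranked = sorted(candidates, key=lambda x: x[1], reverse=True)
print(ranked)  # best model for English retrieval first
```

The same filtering logic applies whether you read scores off the public leaderboard or run the `mteb` evaluation harness yourself: decide the (category, language) slice first, then compare.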
The top overall score on MTEB is a useful first filter, but it conflates very different task types. A model optimized for STS may underperform on retrieval; a retrieval champion may be mediocre at clustering.
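The conflation is easy to demonstrate with arithmetic. In this sketch (scores are invented, not real MTEB numbers), one model wins on the overall average while losing on the retrieval category, which is exactly the failure mode of picking by headline score:

```python
from statistics import mean

# Hypothetical per-dataset scores for two models (NOT real MTEB numbers).
# Keys are (category, dataset); values are main scores on a 0-100 scale.
scores = {
    "model_a": {
        ("Retrieval", "ds1"): 48.0, ("Retrieval", "ds2"): 50.0,
        ("STS", "ds3"): 86.0, ("STS", "ds4"): 88.0,
    },
    "model_b": {
        ("Retrieval", "ds1"): 55.0, ("Retrieval", "ds2"): 57.0,
        ("STS", "ds3"): 74.0, ("STS", "ds4"): 76.0,
    },
}

def category_means(per_dataset):
    """Average the per-dataset scores within each task category."""
    cats = {}
    for (cat, _), score in per_dataset.items():
        cats.setdefault(cat, []).append(score)
    return {cat: mean(vals) for cat, vals in cats.items()}

for model, per_dataset in scores.items():
    overall = mean(per_dataset.values())  # the headline leaderboard number
    print(model, f"overall={overall:.1f}", category_means(per_dataset))
```

Here `model_a` has the higher overall average (carried by STS) while `model_b` is clearly better at retrieval; a RAG pipeline that picked by the headline number would pick the wrong model.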