Lesson 2 · 11 min

Choosing an embedding model: MTEB and task-specific evaluation

MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for embedding models. Using it correctly means selecting for your specific task type and language — not just picking the top overall score.

The MTEB leaderboard

MTEB evaluates embedding models across 8 task categories and 58 datasets:

| Category | What it measures |
|---|---|
| Retrieval | Find the most relevant document for a query |
| Reranking | Rank candidates by relevance to a query |
| Semantic Textual Similarity (STS) | Predict similarity score between sentence pairs |
| Clustering | Group semantically similar texts |
| Classification | Embed → classify with a linear probe |
| Pair Classification | Duplicate detection, NLI |
| Bitext Mining | Find parallel sentences across languages |
| Summarization | Score generated summaries vs. reference |

The top overall score on MTEB is a useful first filter, but it conflates very different task types. A model optimized for STS may underperform on retrieval; a retrieval champion may be mediocre at clustering.
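This selection logic can be sketched in a few lines. The leaderboard snippet below is a toy example with invented model names and scores, not real MTEB numbers; the point is that ranking by a single task category can pick a different winner than ranking by the overall average.

```python
# Hypothetical leaderboard snippet (illustrative numbers, NOT real MTEB scores).
# Each entry maps a model name to per-category scores plus the overall average.
leaderboard = {
    "model-a": {"overall": 64.1, "Retrieval": 55.2, "STS": 83.0, "Clustering": 46.5},
    "model-b": {"overall": 63.8, "Retrieval": 58.9, "STS": 79.1, "Clustering": 44.0},
    "model-c": {"overall": 62.5, "Retrieval": 52.0, "STS": 81.4, "Clustering": 49.7},
}

def best_for(category: str) -> str:
    """Return the model with the highest score in one MTEB category."""
    return max(leaderboard, key=lambda m: leaderboard[m][category])

print(best_for("Retrieval"))  # model-b, despite its lower overall average
print(best_for("STS"))        # model-a, which also tops the overall ranking
```

Here "model-a" wins the overall average, but for a retrieval workload "model-b" is the better pick — exactly the situation the overall score hides.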