Lesson 2 · 11 min
Embedding models — what to actually use
Closed-source vs open-source, dense vs sparse vs hybrid, multilingual coverage, dimensionality. The 2026 picks and their trade-offs.
The 2026 leaderboard, simplified
For English + general domains:
- OpenAI text-embedding-3-large (3072 dims). Strong on most benchmarks. Paid.
- Cohere embed-english-v3 (1024 dims). Strong on retrieval. Paid.
- Voyage voyage-3 (1024 dims). Currently top of MTEB retrieval. Paid.
- BGE-M3 (1024 dims, open source). Multilingual + dense + sparse + multi-vector in one model. Often best when self-hosting.
- E5-mistral-7b-instruct (4096 dims, open source). Highest quality on niche tasks; expensive to host.
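Whichever of these models you pick, the downstream mechanics are the same: each text becomes a dense vector, and relevance is a cosine similarity between vectors. A minimal pure-Python sketch (the 4-dim vectors are toy stand-ins for real 1024- or 3072-dim model outputs):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding-model outputs.
query = [0.1, 0.9, 0.2, 0.0]
doc_relevant = [0.2, 0.8, 0.1, 0.1]
doc_offtopic = [0.9, 0.0, 0.1, 0.8]

# The relevant document scores higher against the query.
assert cosine_similarity(query, doc_relevant) > cosine_similarity(query, doc_offtopic)
```

Note that dimensionality is purely a storage/latency cost here: scoring 3072-dim vectors works identically to 1024-dim ones, just with ~3x the memory and compute per comparison.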
For multilingual: BGE-M3 is the default. It covers 100+ languages competitively.
For code: voyage-code-3, or BGE-M3 fine-tuned on a code corpus. General-purpose embedding models lose 20-30 points on code-retrieval benchmarks versus code-specialized ones.
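BGE-M3's main draw is emitting dense and sparse (lexical) signals from one model; at query time you typically fuse the two scores. A sketch of simple weighted fusion — the weight `alpha`, the dict-based sparse representation, and the toy data are illustrative assumptions, not BGE-M3's actual output format:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def sparse_dot(q: dict[str, float], d: dict[str, float]) -> float:
    # Sparse lexical score: dot product over terms the query and doc share.
    return sum(w * d[t] for t, w in q.items() if t in d)

def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha: float = 0.7) -> float:
    # alpha weights the dense (semantic) signal, 1 - alpha the sparse
    # (lexical) one. 0.7 is an illustrative value to tune on your data.
    return alpha * cosine(q_dense, d_dense) + (1 - alpha) * sparse_dot(q_sparse, d_sparse)

# Toy example: doc_b matches the query lexically ("kubernetes", "restart"),
# doc_a only semantically, so the sparse term breaks the near-tie.
q_dense = [0.6, 0.8]
q_sparse = {"kubernetes": 1.2, "restart": 0.9}
doc_a = ([0.7, 0.7], {"pod": 1.0, "reboot": 0.8})
doc_b = ([0.5, 0.5], {"kubernetes": 1.1, "restart": 1.0})

score_a = hybrid_score(q_dense, doc_a[0], q_sparse, doc_a[1])
score_b = hybrid_score(q_dense, doc_b[0], q_sparse, doc_b[1])
assert score_b > score_a
```

In practice the fusion is often done as reciprocal-rank fusion over two ranked lists instead of a raw weighted sum, which sidesteps the problem that dense and sparse scores live on different scales.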