OpenAI Whisper v4 — multilingual error rate drops by half on low-resource languages

Whisper v4 lands with consistent improvements on languages whose word error rate (WER) was previously above 30%. The bar for 'production-acceptable ASR' has moved.

What changed

Whisper v4 (released early May) ships meaningful improvements over v3 on the long tail:

Low-resource languages (Welsh, Yoruba, Tamil, Vietnamese): WER dropped 35-55% relative to v3.
High-resource languages (English, French, Spanish): incremental gains. WER on clean speech was already below 5% and is now below 3.5%.
Diarization: vastly improved on overlapping speech, which was v3's weakest area.
Streaming: time to first token (TTFT) remains around 200 ms on the fast tier.
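
For readers who want to check figures like these themselves, here is a minimal sketch of how WER and a relative WER reduction are usually computed. It assumes simple lowercase, whitespace tokenization; real benchmark scoring applies more careful text normalization, and the function names are illustrative rather than taken from any Whisper tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction of the old error rate that was eliminated."""
    if old_wer <= 0:
        raise ValueError("old_wer must be positive")
    return (old_wer - new_wer) / old_wer
```

Under this definition, a language going from 30% to 15% WER has a 50% relative reduction but only a 15-point absolute one, which is why it matters which of the two a headline number refers to.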

Weights are MIT-licensed and run on a single H100 at ~80x real-time throughput.
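
To make the throughput figure concrete, the sketch below converts a real-time factor into capacity. The ~80x value comes from the claim above; the utilization parameter and the function names are assumptions added for illustration, since a real deployment rarely keeps a GPU fully busy.

```python
def gpu_hours_needed(audio_hours: float, realtime_factor: float = 80.0) -> float:
    """GPU-hours required to transcribe a backlog at a given real-time factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_hours / realtime_factor


def audio_hours_per_day(realtime_factor: float = 80.0, utilization: float = 1.0) -> float:
    """Audio-hours one GPU can process in 24 hours at the given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 24 * realtime_factor * utilization
```

At 80x, a 1,000-hour backlog works out to 12.5 GPU-hours, and a single GPU kept busy half the time would cover about 960 audio-hours per day.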

What it means

It is now genuinely viable to transcribe and route multilingual support tickets automatically without a per-language model.
Diarized meeting summaries become reliable enough for legal review without a manual cleanup pass.
Self-hosting math shifts further in your favor — Whisper v4 is the first version where the open-source option is unambiguously better than most paid ASR APIs on coverage and price.
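
One rough way to test the self-hosting claim against your own numbers is to compute the GPU utilization at which self-hosting matches a paid API's price. This is a sketch only: both prices below are placeholders, not quotes from any vendor, and it ignores engineering time, storage, and networking.

```python
# Placeholder prices for illustration only; substitute your own quotes.
GPU_PRICE_PER_HOUR = 3.00        # hypothetical hourly cost of one H100
API_PRICE_PER_AUDIO_HOUR = 0.36  # hypothetical paid-API price per hour of audio


def self_hosted_cost_per_audio_hour(gpu_price_per_hour: float,
                                    realtime_factor: float = 80.0,
                                    utilization: float = 1.0) -> float:
    """GPU cost per hour of audio, given how busy the GPU actually is."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_price_per_hour / (realtime_factor * utilization)


def break_even_utilization(gpu_price_per_hour: float,
                           api_price_per_audio_hour: float,
                           realtime_factor: float = 80.0) -> float:
    """Utilization at which self-hosting costs the same as the API."""
    return gpu_price_per_hour / (realtime_factor * api_price_per_audio_hour)


if __name__ == "__main__":
    u = break_even_utilization(GPU_PRICE_PER_HOUR, API_PRICE_PER_AUDIO_HOUR)
    print(f"Self-hosting is cheaper above {u:.1%} GPU utilization")
```

A break-even value above 1.0 would mean the API stays cheaper on GPU cost alone even with the GPU fully busy.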

What to watch

The step-up on low-resource languages came from training-data improvements as much as architecture. The community benchmarks (FLEURS, Common Voice) confirm the headline numbers, but always run a small eval on your own domain before swapping models in production. Acoustic environment matters more for ASR than provider claims usually admit.
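
A small domain eval can be as simple as the sketch below: score the current model and the candidate on the same hand-transcribed samples, grouped by language and acoustic condition. The Sample fields, the condition labels, and the whitespace tokenization are all assumptions for illustration; producing the two transcripts per sample is left to whatever ASR systems you are comparing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    language: str
    condition: str   # your own label, e.g. "clean" or "call-center"
    reference: str   # human transcript
    old_hyp: str     # output of the model currently in production
    new_hyp: str     # output of the candidate model


def _word_edits(ref: List[str], hyp: List[str]) -> int:
    """Word-level edit distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def compare_by_slice(samples: Iterable[Sample]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {(language, condition): (old_wer, new_wer)}, pooled over each slice."""
    totals = defaultdict(lambda: [0, 0, 0])  # ref words, old errors, new errors
    for s in samples:
        ref = s.reference.lower().split()
        t = totals[(s.language, s.condition)]
        t[0] += len(ref)
        t[1] += _word_edits(ref, s.old_hyp.lower().split())
        t[2] += _word_edits(ref, s.new_hyp.lower().split())
    return {k: (old / n, new / n) for k, (n, old, new) in totals.items() if n > 0}
```

Pooling errors over each slice, instead of averaging per-utterance WER, keeps short utterances from dominating the result. Splitting by condition shows whether a gain on clean speech carries over to your noisier audio.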
