What changed
Whisper v4 (released early May) ships meaningful improvements over v3 on the long tail:
- Low-resource languages (Welsh, Yoruba, Tamil, Vietnamese): WER dropped 35-55%.
- High-resource languages (English, French, Spanish): incremental — already at <5% WER on clean speech, now <3.5%.
- Diarization: vastly improved on overlapping speech, which was v3's weakest area.
- Streaming: TTFT remains around 200ms on the fast tier.
Weights are MIT-licensed and run on a single H100 at ~80x real-time throughput.
What it means
- Multilingual support tickets are now genuinely viable to transcribe and route automatically without a per-language model.
- Diarized meeting summaries become reliable enough for legal review without a manual cleanup pass.
- Self-hosting math shifts further in your favor — Whisper v4 is the first version where the open-source option is unambiguously better than most paid ASR APIs on coverage and price.
What to watch
The step-up on low-resource languages came from training-data improvements as much as architecture. The community benchmarks (FLEURS, CommonVoice) confirm the headline numbers — but always run a small eval on your domain before swapping production. Acoustic environment matters more for ASR than provider claims usually admit.