Lesson 6 · 10 min

Audio — ASR and TTS in production

Whisper-class speech-to-text is solved enough to ship, and so is high-quality text-to-speech. The decisions that actually matter are streaming, diarization, and voice-cloning ethics.

ASR is mostly solved

Whisper, AssemblyAI, Deepgram, and several open-source variants give you word-error rates below 5% on clean English speech and acceptable accuracy in 50+ other languages. The interesting choices in 2026 are operational rather than about model quality:

  • Streaming vs batch: use streaming for live transcription (meeting tools, support-agent assist) and batch for archives (podcast indexing, recorded calls).
  • Diarization: who said what. Critical for support-call transcripts; nice to have for monologues.
  • Forced alignment: matching the transcript to the audio to get word-level timestamps. Required for video captioning, optional otherwise.
  • PII redaction: built in at some providers (e.g. Deepgram), bring your own for others.
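
As a rough illustration of how diarization and word-level timestamps combine into a "who said what" transcript, here is a minimal Python sketch. It assumes the ASR step has already produced words with start and end times, and that a separate diarization step has produced speaker turns. The Word and Turn types, the assign_speakers and to_transcript helpers, and the sample data are all hypothetical and not any provider's API. Each word is given to the turn it overlaps most in time.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class Turn:
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most."""
    labeled = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        # Words that fall outside every turn are marked "unknown".
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0:
            labeled.append(("unknown", w.text))
        else:
            labeled.append((best.speaker, w.text))
    return labeled


def to_transcript(labeled):
    """Collapse consecutive words from the same speaker into one line."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


# Hypothetical sample data standing in for real ASR and diarization output.
words = [
    Word("Hi", 0.0, 0.3), Word("there", 0.3, 0.6),
    Word("Hello", 1.0, 1.4), Word("how", 1.5, 1.7), Word("can", 1.7, 1.9),
    Word("I", 1.9, 2.0), Word("help", 2.0, 2.3),
]
turns = [Turn("caller", 0.0, 0.8), Turn("agent", 0.9, 2.5)]

transcript = to_transcript(assign_speakers(words, turns))
for line in transcript:
    print(line)
```

Maximum-overlap assignment is only one plausible merging rule. Real systems also have to handle overlapping speech and words that straddle a turn boundary, which this sketch resolves simply by picking the larger overlap.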