intermediate · Multimodal · Vision · Audio · Production

Multimodal AI

Beyond text — vision, audio, video, in production.

Covers the four production-ready multimodal workloads in 2026: document understanding, chart and screenshot Q&A, audio (ASR + TTS), and video. Also covers cost-aware routing patterns that keep the cost of multimodal features defensible at scale.


1.8k learners

Course map

Lessons unlock as you complete the previous one. Your progress is saved on this device.

Lesson 1

What multimodal AI actually is

9m · 35 XP

Lesson 2

Vision encoders — how images become tokens

11m · 40 XP

Lesson 3

Document understanding — PDFs, tables, figures

12m · 45 XP

Lesson 4

Chart and diagram Q&A

9m · 35 XP

Lesson 5

Screenshot analysis and UI agents

10m · 40 XP

Lesson 6

Audio — ASR and TTS in production

10m · 40 XP

Lesson 7

Video understanding — what works in 2026

10m · 40 XP

Lesson 8

Cost-aware multimodal routing

11m · 45 XP

Take next

Courses that pair well after — or alongside — Multimodal AI.

AI Safety & Alignment for Engineers

Ship AI features that don't become incidents.

intermediate · 9h

Cost-Aware AI Engineering

Ship AI features with a defensible bill — five habits that cut cost 40-70%.

intermediate · 7h

LLMs & Transformers

See inside the model — without the math wall.

intermediate · 9h