Multimodal AI
Beyond text — vision, audio, video, in production.
Covers the four production-ready multimodal workloads in 2026: document understanding, chart and screenshot Q&A, audio (ASR and TTS), and video. Also covers cost-aware routing patterns that keep multimodal features defensible at scale.
Duration: 7h
Lessons: 8
Learners: 1.8k
Course map
Each lesson unlocks when you complete the previous one. Your progress is saved on this device.
Lesson 1
What multimodal AI actually is
9m · 35 XP
Lesson 2
Vision encoders — how images become tokens
11m · 40 XP
Lesson 3
Document understanding — PDFs, tables, figures
12m · 45 XP
Lesson 4
Chart and diagram Q&A
9m · 35 XP
Lesson 5
Screenshot analysis and UI agents
10m · 40 XP
Lesson 6
Audio — ASR and TTS in production
10m · 40 XP
Lesson 7
Video understanding — what works in 2026
10m · 40 XP
Lesson 8
Cost-aware multimodal routing
11m · 45 XP
Take next
Courses that pair well after — or alongside — Multimodal AI.
AI Safety & Alignment for Engineers
Ship AI features that don't become incidents.
intermediate · 9h
Cost-Aware AI Engineering
Ship AI features with a defensible bill — five habits that cut cost 40-70%.
intermediate · 7h
LLMs & Transformers
See inside the model — without the math wall.
intermediate · 9h