Lesson 8 · 11 min

Cost-aware multimodal routing

Vision tokens are 2-5× the cost of text tokens. The routing patterns that keep multimodal features viable at production volume.

The economics

A single high-resolution image at typical 2026 prices costs the same as a 3,000-token text prompt. Multiply by call volume and the bill stops being a rounding error.

The four levers that keep multimodal economics defensible:

Resize before sending. Most images can be downsized to 768px long-edge with no quality loss for most tasks.
Cache stable visual context. If your prompt includes a logo, a layout reference, or a stable diagram, prefix-cache it.
Tier your models. Cheap vision-capable model for triage, frontier for the hard 20%.
OCR locally when text is the goal. If you only need the text from an image, run Tesseract or a small OCR model and send the text to a text-only LLM.