What's in the box
- Llama 4 70B and 405B, both under Meta's usual community license (commercial use is OK below 700M MAU).
- A mixture-of-experts variant at the small end: Llama-4-Scout, with 17B active / 109B total parameters.
- 256k-token context window on the 70B, 1M tokens on the 405B.
- Native tool-use training; performs comparably to closed-source mid-tier models on common benchmarks.
The economics now
For a single H100 serving Llama 4 70B at 4-bit quantization with vLLM, single-stream decode speed is ~80 tokens/sec under modest load. At ~$2/hr cloud cost, that alone works out to roughly $7 per 1M tokens ($2 divided by 80 × 3,600 tokens per hour); continuous batching across concurrent requests multiplies aggregate throughput and divides cost per token by the same factor, so a well-utilized GPU can land an order of magnitude or more below frontier-model API pricing.
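The arithmetic is worth making inspectable. A minimal sketch follows; the batch factor is an assumption to replace with measured concurrency from your own load tests, not a vLLM guarantee:

```python
# Back-of-envelope serving cost, using the figures above.
GPU_COST_PER_HR = 2.00   # $/hr for one H100 (assumed cloud rate)
TOKENS_PER_SEC = 80      # single-stream decode speed at 4-bit
BATCH_FACTOR = 16        # assumed concurrency gain from continuous batching

aggregate_tps = TOKENS_PER_SEC * BATCH_FACTOR
tokens_per_hr = aggregate_tps * 3600
cost_per_1m = GPU_COST_PER_HR / tokens_per_hr * 1_000_000

print(f"aggregate throughput: {aggregate_tps} tok/s")
print(f"cost per 1M tokens:   ${cost_per_1m:.2f}")
# 80 tok/s alone -> ~$6.94 per 1M tokens; x16 batching -> ~$0.43 per 1M
```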
Breakeven against Sonnet-tier APIs at typical workloads now sits around 30k requests/day once ops overhead is priced in. Below that, APIs win on operational simplicity; above it, self-hosting Llama 4 is genuinely competitive.
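The breakeven claim is easy to stress-test yourself. A minimal sketch, assuming a blended Sonnet-tier rate and an average request size; both inputs are assumptions to replace with your own traffic stats:

```python
# Breakeven: daily GPU cost vs. per-request API cost.
GPU_COST_PER_DAY = 2.00 * 24    # one H100 at ~$2/hr, running 24/7
API_COST_PER_1M_TOKENS = 5.00   # assumed blended Sonnet-tier rate
TOKENS_PER_REQUEST = 800        # assumed prompt + completion size

api_cost_per_request = API_COST_PER_1M_TOKENS * TOKENS_PER_REQUEST / 1_000_000
breakeven_requests_per_day = GPU_COST_PER_DAY / api_cost_per_request
print(f"breakeven: ~{breakeven_requests_per_day:,.0f} requests/day")
# ~12,000/day on raw GPU cost with these inputs; ops overhead (eng time,
# on-call, redundancy) pushes the practical figure toward the ~30k cited above.
```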
What to actually do
- Run a 1-week shadow eval comparing your prompts on Llama 4 against your current API (a minimal harness is sketched after this list).
- If quality is within 5% on your eval set and volume is non-trivial, build a self-host POC.
- Don't self-host as a religious choice. The ops cost is real.
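Here is one way a shadow eval could look, assuming vLLM's OpenAI-compatible server on localhost:8000; the model IDs are placeholders, and you'd add your own scoring pass over the log:

```python
# Minimal shadow-eval harness: replay each production prompt against both
# backends and append the pair to a JSONL log for offline scoring.
# Model IDs and endpoints below are placeholders, not a prescribed stack.
import json
from openai import OpenAI

current = OpenAI()  # your existing provider (reads OPENAI_API_KEY)
candidate = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible server
    api_key="unused",                     # vLLM ignores the key by default
)

def shadow(prompt: str, log_path: str = "shadow_log.jsonl") -> str:
    messages = [{"role": "user", "content": prompt}]
    a = current.chat.completions.create(model="your-current-model", messages=messages)
    b = candidate.chat.completions.create(model="meta-llama/Llama-4-70B", messages=messages)
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "prompt": prompt,
            "current": a.choices[0].message.content,
            "candidate": b.choices[0].message.content,
        }) + "\n")
    # Keep serving the current model; the candidate stays shadow-only.
    return a.choices[0].message.content
```

After the week, score the log offline (exact-match, rubric, or judge-model, whatever fits your task) and apply the 5% threshold from the second bullet.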