What's in the box
- Llama 4 70B and 405B, both under Meta's usual community license (commercial use is OK below 700M MAU).
- A mixture-of-experts variant at the small end: Llama-4-Scout, with 17B active / 109B total parameters.
- 256k-token context window on the 70B, 1M tokens on the 405B.
- Native tool-use training; performs comparably to closed-source mid-tier models on common benchmarks.
The economics now
For a single H100 serving Llama 4 70B at 4-bit quantization with vLLM, single-stream decode speed is ~80 tokens/sec under modest load. At ~$2/hr cloud cost, that alone works out to roughly $7 per 1M tokens ($2 divided by 80 × 3,600 tokens per hour); continuous batching across concurrent requests multiplies aggregate throughput and divides cost per token by the same factor, so a well-utilized GPU can land an order of magnitude or more below frontier-model API pricing.
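The arithmetic is worth making inspectable. A minimal sketch follows; the batch factor is an assumption to replace with measured concurrency from your own load tests, not a vLLM guarantee:

```python
# Back-of-envelope serving cost, using the figures above.
GPU_COST_PER_HR = 2.00   # $/hr for one H100 (assumed cloud rate)
TOKENS_PER_SEC = 80      # single-stream decode speed at 4-bit
BATCH_FACTOR = 16        # assumed concurrency gain from continuous batching

aggregate_tps = TOKENS_PER_SEC * BATCH_FACTOR
tokens_per_hr = aggregate_tps * 3600
cost_per_1m = GPU_COST_PER_HR / tokens_per_hr * 1_000_000

print(f"aggregate throughput: {aggregate_tps} tok/s")
print(f"cost per 1M tokens:   ${cost_per_1m:.2f}")
# 80 tok/s alone -> ~$6.94 per 1M tokens; x16 batching -> ~$0.43 per 1M
```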
Breakeven against Sonnet-tier APIs at typical workloads now sits around 30k requests/day once ops overhead is priced in. Below that, APIs win on operational simplicity; above it, self-hosting Llama 4 is genuinely competitive.
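The breakeven claim is easy to stress-test yourself. A minimal sketch, assuming a blended Sonnet-tier rate and an average request size; both inputs are assumptions to replace with your own traffic stats:

```python
# Breakeven: daily GPU cost vs. per-request API cost.
GPU_COST_PER_DAY = 2.00 * 24    # one H100 at ~$2/hr, running 24/7
API_COST_PER_1M_TOKENS = 5.00   # assumed blended Sonnet-tier rate
TOKENS_PER_REQUEST = 800        # assumed prompt + completion size

api_cost_per_request = API_COST_PER_1M_TOKENS * TOKENS_PER_REQUEST / 1_000_000
breakeven_requests_per_day = GPU_COST_PER_DAY / api_cost_per_request
print(f"breakeven: ~{breakeven_requests_per_day:,.0f} requests/day")
# ~12,000/day on raw GPU cost with these inputs; ops overhead (eng time,
# on-call, redundancy) pushes the practical figure toward the ~30k cited above.
```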
What to actually do
- Run a 1-week shadow eval comparing your prompts on Llama 4 against your current API (a minimal harness is sketched after this list).
- If quality is within 5% on your eval set and volume is non-trivial, build a self-host POC.
- Don't self-host as a religious choice. The ops cost is real.
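Here is one way a shadow eval could look, assuming vLLM's OpenAI-compatible server on localhost:8000; the model IDs are placeholders, and you'd add your own scoring pass over the log:

```python
# Minimal shadow-eval harness: replay each production prompt against both
# backends and append the pair to a JSONL log for offline scoring.
# Model IDs and endpoints below are placeholders, not a prescribed stack.
import json
from openai import OpenAI

current = OpenAI()  # your existing provider (reads OPENAI_API_KEY)
candidate = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible server
    api_key="unused",                     # vLLM ignores the key by default
)

def shadow(prompt: str, log_path: str = "shadow_log.jsonl") -> str:
    messages = [{"role": "user", "content": prompt}]
    a = current.chat.completions.create(model="your-current-model", messages=messages)
    b = candidate.chat.completions.create(model="meta-llama/Llama-4-70B", messages=messages)
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "prompt": prompt,
            "current": a.choices[0].message.content,
            "candidate": b.choices[0].message.content,
        }) + "\n")
    # Keep serving the current model; the candidate stays shadow-only.
    return a.choices[0].message.content
```

After the week, score the log offline (exact-match, rubric, or judge-model, whatever fits your task) and apply the 5% threshold from the second bullet.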