The numbers
- GPT-6 (default): SWE-Bench Verified 76.2%, MMLU-Pro 91.4%. Pricing: $5/$25 per 1M input/output.
- DeepSeek-V4-Coder (open weights): SWE-Bench Verified 73.8%, MMLU-Pro 84.1%. Pricing on DeepSeek's API: $0.18/$0.72.
GPT-6 is genuinely better at broad reasoning (a 7.3-point MMLU-Pro lead). On the narrow code-task eval, the gap is 2.4 points, while the per-token cost gap is roughly 30×.
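The "30×" figure is a blend of the two price directions; the arithmetic, using only the prices and scores quoted above:

```python
# Back-of-envelope check of the headline ratios, from the numbers above
# (prices in USD per 1M tokens; quality is SWE-Bench Verified %).
GPT6_IN, GPT6_OUT = 5.00, 25.00
V4_IN, V4_OUT = 0.18, 0.72

ratio_in = GPT6_IN / V4_IN      # ~27.8x on input tokens
ratio_out = GPT6_OUT / V4_OUT   # ~34.7x on output tokens

quality_retained = 73.8 / 76.2  # ~0.97 of GPT-6's SWE-Bench score
```

Depending on your input/output token mix, the effective ratio sits between ~28× and ~35×, hence "30×" as shorthand.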
What it means
- For companies: the "always use frontier" reflex is harder to justify on coding workflows. A 30× cost reduction at 97% of the quality is the kind of trade most CFOs make happily.
- For benchmarks: SWE-Bench Verified may be saturated as a public eval. The frontier and the open-source long tail are now within noise on it.
- For self-hosting: DeepSeek-V4 weights are MIT-licensed. A single H200 node serves it at ~$0.04/1M tokens internally. The breakeven volume vs DeepSeek's API is around 200k requests/day.
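A sketch of the breakeven arithmetic behind that last bullet. The node cost and per-request token counts here are illustrative assumptions, not vendor quotes; only the API prices come from the post.

```python
# Back-of-envelope breakeven: dedicated node vs. pay-per-token API.
# NODE_COST_PER_DAY and the request shape are assumptions for illustration.
NODE_COST_PER_DAY = 600.0    # assumed USD/day for an H200 node
API_IN_PER_M = 0.18          # DeepSeek API, USD per 1M input tokens
API_OUT_PER_M = 0.72         # DeepSeek API, USD per 1M output tokens
TOKENS_IN = 12_000           # assumed input tokens per request
TOKENS_OUT = 1_500           # assumed output tokens per request

api_cost_per_req = (TOKENS_IN * API_IN_PER_M
                    + TOKENS_OUT * API_OUT_PER_M) / 1_000_000
breakeven_req_per_day = NODE_COST_PER_DAY / api_cost_per_req
```

With these assumed values the crossover lands in the same ballpark as the ~200k requests/day figure; shorter prompts push the breakeven higher, longer ones pull it lower.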
What to actually do
If you're shipping AI features that touch code (review bots, autofix, copilots, refactoring agents), run a 100-task eval this week comparing GPT-6 to DeepSeek-V4-Coder on your own task distribution. The published benchmarks won't match your domain, but the cost gap is large enough that even a meaningful quality drop on your tasks can still favor V4.
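The decision rule at the end of that eval can be made explicit. Everything below is a placeholder sketch: the pass counts, per-request costs, and the 5-point quality tolerance are hypothetical, to be replaced by your own harness output and risk appetite.

```python
# Sketch of the "cheaper unless quality drops too far" decision rule.
# All concrete numbers here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    passes: int            # tasks solved, out of n_tasks
    usd_per_request: float # observed average cost per task

def pick(a: EvalResult, b: EvalResult, n_tasks: int,
         max_quality_drop: float = 0.05) -> str:
    """Prefer the cheaper model unless it gives up more than
    max_quality_drop absolute pass rate on your distribution."""
    cheap, exp = (a, b) if a.usd_per_request <= b.usd_per_request else (b, a)
    drop = (exp.passes - cheap.passes) / n_tasks
    return cheap.model if drop <= max_quality_drop else exp.model

# Hypothetical outcome of a 100-task run:
gpt6 = EvalResult("gpt-6", passes=78, usd_per_request=0.06)
v4 = EvalResult("deepseek-v4-coder", passes=74, usd_per_request=0.002)
choice = pick(gpt6, v4, n_tasks=100)
```

The tolerance is the whole argument in one parameter: at a 30× cost gap, a few points of pass rate is usually a trade worth taking, but that threshold is yours to set.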