The numbers
- GPT-6 (default): SWE-Bench Verified 76.2%, MMLU-Pro 91.4%. Pricing: $5/$25 per 1M input/output.
- DeepSeek-V4-Coder (open weights): SWE-Bench Verified 73.8%, MMLU-Pro 84.1%. Pricing on DeepSeek's API: $0.18/$0.72.
GPT-6 is genuinely better at broad reasoning (a 7.3-point MMLU-Pro lead). On the narrow code-task eval, the gap is 2.4 points, while the per-token cost gap is roughly 30×.
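The "30×" figure is a blend of the two price directions; the arithmetic, using only the prices and scores quoted above:

```python
# Back-of-envelope check of the headline ratios, from the numbers above
# (prices in USD per 1M tokens; quality is SWE-Bench Verified %).
GPT6_IN, GPT6_OUT = 5.00, 25.00
V4_IN, V4_OUT = 0.18, 0.72

ratio_in = GPT6_IN / V4_IN      # ~27.8x on input tokens
ratio_out = GPT6_OUT / V4_OUT   # ~34.7x on output tokens

quality_retained = 73.8 / 76.2  # ~0.97 of GPT-6's SWE-Bench score
```

Depending on your input/output token mix, the effective ratio sits between ~28× and ~35×, hence "30×" as shorthand.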
What it means
- For companies: the "always use frontier" reflex is harder to justify on coding workflows. A 30× cost reduction at 97% of the quality is the kind of trade most CFOs make happily.
- For benchmarks: SWE-Bench Verified may be saturated as a public eval. The frontier and the open-source long tail are now within noise on it.
- For self-hosting: DeepSeek-V4 weights are MIT-licensed. A single H200 node serves it at ~$0.04/1M tokens internally. The breakeven volume vs DeepSeek's API is around 200k requests/day.
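A sketch of the breakeven arithmetic behind that last bullet. The node cost and per-request token counts here are illustrative assumptions, not vendor quotes; only the API prices come from the post.

```python
# Back-of-envelope breakeven: dedicated node vs. pay-per-token API.
# NODE_COST_PER_DAY and the request shape are assumptions for illustration.
NODE_COST_PER_DAY = 600.0    # assumed USD/day for an H200 node
API_IN_PER_M = 0.18          # DeepSeek API, USD per 1M input tokens
API_OUT_PER_M = 0.72         # DeepSeek API, USD per 1M output tokens
TOKENS_IN = 12_000           # assumed input tokens per request
TOKENS_OUT = 1_500           # assumed output tokens per request

api_cost_per_req = (TOKENS_IN * API_IN_PER_M
                    + TOKENS_OUT * API_OUT_PER_M) / 1_000_000
breakeven_req_per_day = NODE_COST_PER_DAY / api_cost_per_req
```

With these assumed values the crossover lands in the same ballpark as the ~200k requests/day figure; shorter prompts push the breakeven higher, longer ones pull it lower.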
What to actually do
If you're shipping AI features that touch code (review bots, autofix, copilots, refactoring agents), run a 100-task eval this week comparing GPT-6 to DeepSeek-V4-Coder on your own task distribution. The published benchmarks won't match your domain, but the cost gap is large enough that even a meaningful quality drop on your tasks can still favor V4.
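The decision rule at the end of that eval can be made explicit. Everything below is a placeholder sketch: the pass counts, per-request costs, and the 5-point quality tolerance are hypothetical, to be replaced by your own harness output and risk appetite.

```python
# Sketch of the "cheaper unless quality drops too far" decision rule.
# All concrete numbers here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    passes: int            # tasks solved, out of n_tasks
    usd_per_request: float # observed average cost per task

def pick(a: EvalResult, b: EvalResult, n_tasks: int,
         max_quality_drop: float = 0.05) -> str:
    """Prefer the cheaper model unless it gives up more than
    max_quality_drop absolute pass rate on your distribution."""
    cheap, exp = (a, b) if a.usd_per_request <= b.usd_per_request else (b, a)
    drop = (exp.passes - cheap.passes) / n_tasks
    return cheap.model if drop <= max_quality_drop else exp.model

# Hypothetical outcome of a 100-task run:
gpt6 = EvalResult("gpt-6", passes=78, usd_per_request=0.06)
v4 = EvalResult("deepseek-v4-coder", passes=74, usd_per_request=0.002)
choice = pick(gpt6, v4, n_tasks=100)
```

The tolerance is the whole argument in one parameter: at a 30× cost gap, a few points of pass rate is usually a trade worth taking, but that threshold is yours to set.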