Open-Weight vs Proprietary LLMs in 2026: The Real Cost Comparison

Model	Weights	Input / 1M tokens	Output / 1M tokens	Context (tokens)
Llama 4 Maverick	Open	$0.15	$0.60	1M
Llama 4 Scout	Open	$0.10	$0.30	10M
Qwen 3.7 Plus	Open	$0.40	$1.60	1M
Mistral Large 2512	Open	$0.50	$1.50	262K
DeepSeek V4 Pro	Open	$0.44	$0.87	1M
GPT-5.4	Proprietary	$2.50	$15.00	1M
Claude Sonnet 4.6	Proprietary	$3.00	$15.00	1M
Gemini 3.5 Flash	Proprietary	$1.50	$9.00	1M

First, the terms: open-weight is not self-hosted

An open-weight model is one whose parameters you can download — Llama, Qwen, Mistral, DeepSeek, GLM. That's separate from how you run it. The comparison most teams actually face isn't 'API versus my own GPUs'; it's 'proprietary API versus an open-weight model served by a host' — Together, Fireworks, Groq, DeepInfra, or an aggregator like OpenRouter that fronts all of them.

That distinction matters because hosted open-weight inference has become a brutally competitive commodity market: any host can serve Llama, so margins compress toward hardware cost. Proprietary models have no such competition — there is exactly one seller of Claude tokens. The price gap in the table above is that market structure, made visible. I touched related ground in the OpenRouter vs direct piece.

How big the discount really is

Run my standard chat workload (1,500 tokens in, 400 out) across the table: Llama 4 Maverick costs $0.000465 per message; Claude Sonnet 4.6 costs $0.0105, about 23x more. At a million messages a month, that's $465 versus $10,500.
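For anyone who wants to rerun that arithmetic on their own traffic shape, here is a minimal sketch. The prices are copied from the table above; the function name and the `PRICES` dictionary are illustrative, not part of any vendor SDK.

```python
# USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Llama 4 Maverick": (0.15, 0.60),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def cost_per_message(model, tokens_in=1_500, tokens_out=400):
    """Return the USD cost of one message at list price."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


maverick = cost_per_message("Llama 4 Maverick")
sonnet = cost_per_message("Claude Sonnet 4.6")

print(f"Maverick: ${maverick:.6f} per message")
print(f"Sonnet:   ${sonnet:.4f} per message")
print(f"Ratio:    {sonnet / maverick:.1f}x")
print(f"Monthly at 1M messages: ${maverick * 1e6:,.0f} vs ${sonnet * 1e6:,.0f}")
```

The ratio comes out at roughly 22.6, which is where the rounded 23x comes from. Swap in your own token counts: output-heavy workloads widen the gap, because the output price differs by 25x while the input price differs by 20x.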

The quality picture is the necessary other half. The strongest open-weight models now post leaderboard scores level with proprietary mid tiers — Qwen 3.7 Max and GLM 5.1 both score 72 on Arena, two points behind GPT-5.4 and within three of GPT-5.5. What open weights still don't reach in mid-2026 is the frontier: nothing open touches Gemini 3.1 Pro, Opus 4.8, or GPT-5.5 on hard reasoning and long-horizon agentic work. So the honest framing: open weights compete with — and underprice — everything except the top shelf. My earlier survey of underrated bargain models digs into specific picks.

What the per-token price doesn't include

Three costs hide outside the token price, and they're why proprietary APIs keep winning deals they lose on paper.

Evals and integration: proprietary vendors ship models heavily tuned for instruction-following, tool use, and structured output. Open-weight deployments more often need prompt iteration, output validation, and occasionally a fine-tune to match that polish. That's engineering time, and at small scale engineering time dwarfs token savings — the same trap as fine-tuning vs prompt engineering.

Operational variance: host quality differs — quantization choices, context handling, throughput, uptime. Two hosts serving 'the same' Llama can behave measurably differently. Budget a bake-off.

Ecosystem features: first-party prompt caching, batch discounts, vision pricing, and enterprise compliance paperwork are mature on the big three and uneven across open-weight hosts. If caching would halve your bill (see the caching guide), check the host supports it before comparing list prices.
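The bake-off suggested under operational variance can be as small as the sketch below. It assumes each host is wrapped in a plain Python callable that takes a prompt and returns text; `host_a`, `host_b`, and the JSON check are hypothetical stand-ins, and a real run would replace them with calls to the hosts' own clients and a validator that matches your output contract.

```python
import json
import statistics
import time


def bake_off(hosts, prompts, is_valid):
    """Run every prompt against every host; report pass rate and median latency."""
    results = {}
    for name, call in hosts.items():
        latencies, passed = [], 0
        for prompt in prompts:
            start = time.perf_counter()
            try:
                reply = call(prompt)
            except Exception:
                reply = None  # count errors and timeouts as failures
            latencies.append(time.perf_counter() - start)
            if reply is not None and is_valid(reply):
                passed += 1
        results[name] = {
            "pass_rate": passed / len(prompts),
            "median_latency_s": statistics.median(latencies),
        }
    return results


def is_json(text):
    """Example validator: does the reply parse as JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# Hypothetical stand-ins for two hosts serving "the same" model.
def host_a(prompt):
    return '{"answer": "ok"}'


def host_b(prompt):
    return "Sure! Here is the answer: ok"


report = bake_off({"host_a": host_a, "host_b": host_b},
                  ["prompt one", "prompt two"], is_json)
for name, stats in report.items():
    print(name, stats)
```

Run it with a few hundred real prompts from your logs rather than synthetic ones; the differences between hosts tend to show up on long contexts and structured output, not on short chat turns.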

The self-hosting math, briefly and honestly

Renting a single H100-class GPU runs roughly $2-3 an hour, $1,500-2,200 a month — before redundancy, autoscaling, and the engineer who owns it. A serious deployment of a large model needs several. Hosted open-weight APIs amortize those same GPUs across thousands of customers at much higher utilization than you'll ever achieve solo.

The crossover where self-hosting wins on cost sits at sustained, predictable volume — think hundreds of millions of tokens a day — or when a hard constraint (data cannot leave your network, sub-50ms latency, custom fine-tunes) forces the issue regardless of cost. Below that, self-hosting is a strategic choice you pay for, not a savings plan. For everyone else, the hosted route captures 90% of the open-weight discount with none of the pager duty.
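A rough way to locate that crossover is to ask how many tokens a day the GPU rental bill alone would buy from a hosted API. The sketch below uses the rental range quoted above and the blended Llama 4 Maverick price for the 1,500-in, 400-out workload. It deliberately ignores redundancy, staffing, and whether one card can actually sustain that throughput, so treat the result as a floor on the real break-even, not an estimate of it. The function name is illustrative.

```python
def breakeven_tokens_per_day(gpu_count, gpu_monthly_usd,
                             hosted_usd_per_million, days=30):
    """Tokens per day at which the hosted bill equals the GPU rental bill."""
    monthly_gpu_bill = gpu_count * gpu_monthly_usd
    tokens_per_month = monthly_gpu_bill / hosted_usd_per_million * 1_000_000
    return tokens_per_month / days


# Blended hosted price in USD per 1M tokens for a 1,500-in / 400-out mix
# at Llama 4 Maverick list prices from the table above.
blended = (1_500 * 0.15 + 400 * 0.60) / (1_500 + 400)

low = breakeven_tokens_per_day(1, 1_500, blended)
high = breakeven_tokens_per_day(1, 2_200, blended)

print(f"Blended hosted price: ${blended:.3f} per 1M tokens")
print(f"Break-even for one GPU: {low / 1e6:,.0f}M to {high / 1e6:,.0f}M tokens/day")
```

With one card this lands at roughly 200M to 300M tokens a day, and it scales linearly with `gpu_count`, which is consistent with the "hundreds of millions of tokens a day" threshold above.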

The strategic argument: open weights as price insurance

There's a second-order benefit that doesn't show up in any monthly bill: exit rights. A proprietary model can be repriced, deprecated, or retired on its vendor's schedule — and migrations off a deprecated model are real, recurring engineering costs. An open-weight model, once your stack runs on it, can't be taken away: if your host raises prices, five competitors serve identical weights tomorrow.

In practice, I see this logic shaping sensible 2026 architectures: proprietary frontier models for the hard, high-stakes 10% of calls, open-weight workhorses for the bulk lane, and the routing layer between them as the real asset (the pattern from multi-model routing). That portfolio gets frontier quality where it pays, commodity prices where it doesn't, and negotiating leverage everywhere. Current prices for both camps sit side by side in the live table.
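To put a number on that portfolio, here is a minimal two-lane sketch. It assumes a 10% hard-call share and uses Claude Sonnet 4.6 as the proprietary lane only because the table above lists no frontier-tier price; a true frontier model would cost more. How a call gets flagged as hard (a classifier, a rule, a retry on failure) is left as an assumption, and the `Lane`, `route`, and `blended_monthly_cost` names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str
    usd_in_per_million: float
    usd_out_per_million: float

    def cost(self, tokens_in, tokens_out):
        """USD cost of one call at list price."""
        return (tokens_in * self.usd_in_per_million
                + tokens_out * self.usd_out_per_million) / 1_000_000


# Prices from the table above. PREMIUM is a stand-in for the proprietary lane.
BULK = Lane("Llama 4 Maverick", 0.15, 0.60)
PREMIUM = Lane("Claude Sonnet 4.6", 3.00, 15.00)


def route(is_hard):
    """Pick a lane; the is_hard flag comes from whatever triage you trust."""
    return PREMIUM if is_hard else BULK


def blended_monthly_cost(messages, hard_share, tokens_in=1_500, tokens_out=400):
    """Monthly USD bill when hard_share of messages go to the premium lane."""
    hard = messages * hard_share
    easy = messages - hard
    return (hard * PREMIUM.cost(tokens_in, tokens_out)
            + easy * BULK.cost(tokens_in, tokens_out))


monthly = blended_monthly_cost(1_000_000, hard_share=0.10)
print(f"Routed bill at 1M messages: ${monthly:,.2f}")
```

Under these assumptions the routed bill for a million messages is about $1,470, against $10,500 for sending everything to the proprietary lane and $465 for sending everything to the open-weight lane. Most of the routed bill is the premium 10%, so the hard-call share is the number worth measuring first.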

Frequently Asked Questions

How much cheaper are open-weight models than proprietary APIs?

Hosted open-weight models run 5-30x cheaper per token in June 2026: Llama 4 Maverick costs $0.15/$0.60 per million tokens versus $3/$15 for Claude Sonnet 4.6 — about 23x cheaper on a typical chat message. Against proprietary budget tiers the gap narrows to 2-7x.

Are open-weight models as good as GPT or Claude?

The best open models (Qwen 3.7 Max, GLM 5.1, both scoring 72 on Arena) match proprietary mid tiers like GPT-5.4 within a couple of points. None currently match the frontier — Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.8 — on hard reasoning and agentic work.

Is it cheaper to self-host an open-weight LLM?

Rarely. GPU rental starts around $1,500-2,200/month per H100-class card before redundancy and staffing, while hosted open-weight APIs amortize hardware across thousands of customers. Self-hosting wins only at very large sustained volume or under hard data-locality constraints.

Which open-weight model has the largest context window?

Llama 4 Scout, with a 10M-token context window — the largest of any model I track, open or proprietary — at $0.10/$0.30 per million tokens. Whether you should actually fill 10M tokens per request is a separate cost question.
