First, the terms: open-weight is not self-hosted
That distinction matters because hosted open-weight inference has become a brutally competitive commodity market: any host can serve Llama, so margins compress toward hardware cost. Proprietary models have no such competition — there is exactly one seller of Claude tokens. The price gap in the table above is that market structure, made visible. I touched related ground in the OpenRouter vs direct piece.
How big the discount really is
The quality picture is the necessary other half. The strongest open-weight models now post leaderboard scores level with proprietary mid tiers — Qwen 3.7 Max and GLM 5.1 both score 72 on Arena, two points behind GPT-5.4 and within three of GPT-5.5. What open weights still don't reach in mid-2026 is the frontier: nothing open touches Gemini 3.1 Pro, Opus 4.8, or GPT-5.5 on hard reasoning and long-horizon agentic work. So the honest framing: open weights compete with — and underprice — everything except the top shelf. My earlier survey of underrated bargain models digs into specific picks.
What the per-token price doesn't include
Evals and integration: proprietary vendors ship models heavily tuned for instruction-following, tool use, and structured output. Open-weight deployments more often need prompt iteration, output validation, and occasionally a fine-tune to match that polish. That's engineering time, and at small scale engineering time dwarfs token savings — the same trap as fine-tuning vs prompt engineering.
Operational variance: host quality differs — quantization choices, context handling, throughput, uptime. Two hosts serving 'the same' Llama can behave measurably differently. Budget a bake-off.
Ecosystem features: first-party prompt caching, batch discounts, vision pricing, and enterprise compliance paperwork are mature on the big three and uneven across open-weight hosts. If caching would halve your bill (see the caching guide), check the host supports it before comparing list prices.
The self-hosting math, briefly and honestly
The crossover where self-hosting wins on cost sits at sustained, predictable volume — think hundreds of millions of tokens a day — or when a hard constraint (data cannot leave your network, sub-50ms latency, custom fine-tunes) forces the issue regardless of cost. Below that, self-hosting is a strategic choice you pay for, not a savings plan. For everyone else, the hosted route captures 90% of the open-weight discount with none of the pager duty.
The strategic argument: open weights as price insurance
In practice I see this shape sensible 2026 architectures: proprietary frontier models for the hard, high-stakes 10% of calls, open-weight workhorses for the bulk lane, and the routing layer between them as the real asset — the pattern from multi-model routing. That portfolio gets frontier quality where it pays, commodity prices where it doesn't, and negotiating leverage everywhere. Current prices for both camps sit side by side in the live table.