The unit: tokens, context, and caps
Per-1M pricing: prices are quoted per million tokens ('$3 / 1M input'). Some older pages quote per 1,000 tokens instead, which is the per-1M figure divided by a thousand ('$0.003 / 1K' is the same price as '$3 / 1M'). When comparing providers, always normalize to per-1M; that's what every table on this site uses.
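The normalization is a one-line multiplication. A minimal sketch, where the function name and the quoted price are illustrative and not taken from any provider's page:

```python
def per_million(price: float, quoted_per_tokens: int) -> float:
    """Convert a price quoted per `quoted_per_tokens` tokens into a per-1M price."""
    return price * (1_000_000 / quoted_per_tokens)


# Hypothetical older page quoting $0.003 per 1,000 input tokens.
normalized = per_million(0.003, 1_000)
print(f"${normalized:.2f} / 1M")
```

Run every quoted price through the same function before putting two providers side by side.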
Context window: the maximum tokens a model can hold in one request — prompt plus history plus its own reply. Current windows run from 131K (some DeepSeek models) through 1M (most flagships) to 10M (Llama 4 Scout). The window is a capacity, not a price: you pay only for tokens you actually send, but see 'context resending' below for the trap. More in understanding context windows.
Output cap (max output tokens): a separate, smaller limit on how long a single reply can be — typically 8K-128K. Hitting it truncates your response mid-sentence; what happens when you exceed limits covers the failure modes.
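The two limits can be checked before a request is sent. A rough sketch, assuming you already have token counts from your provider's tokenizer; the function name and every number here are hypothetical:

```python
def check_request(prompt_tokens: int, history_tokens: int, max_output_tokens: int,
                  context_window: int, output_cap: int) -> list[str]:
    """Return a list of limit problems; an empty list means the request fits."""
    problems = []
    # The output cap is a separate, smaller limit on a single reply.
    if max_output_tokens > output_cap:
        problems.append(
            f"requested {max_output_tokens} output tokens, cap is {output_cap}")
    # The window has to hold prompt, history, and the reply together.
    total = prompt_tokens + history_tokens + max_output_tokens
    if total > context_window:
        problems.append(
            f"request needs {total} tokens, window is {context_window}")
    return problems


# Illustrative limits: a 131K window and an 8K output cap.
print(check_request(4_000, 120_000, 8_000, context_window=131_000, output_cap=8_000))
```

A long history is the usual culprit: the prompt is small, but history plus the reserved reply no longer fits the window.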
The two meters: input, output, and the multiplier
Input tokens: everything you send in a request, including system prompt, conversation history, and attached documents. Billed at the input price.
Output tokens: everything the model writes back. Always pricier per token than input, because generation is the compute-expensive direction.
Output multiplier: the output price divided by the input price — the single most revealing number on a pricing page. In June 2026: Claude runs a uniform 5x, OpenAI and Google mostly 6x, xAI's Grok an unusual 2x. A low multiplier favors generation-heavy work; a low input price favors context-heavy work. The full argument is in the output multiplier piece.
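To see why the multiplier and the input price pull in different directions, here is a small sketch with two made-up price cards. Model names and prices are hypothetical, chosen only so one has the lower input price and the other the lower multiplier:

```python
def output_multiplier(input_price: float, output_price: float) -> float:
    """Output price divided by input price (both per 1M tokens)."""
    return output_price / input_price


def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request at per-1M prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical price cards: (input $/1M, output $/1M).
models = {"model_a": (3.0, 15.0), "model_b": (5.0, 10.0)}

for name, (p_in, p_out) in models.items():
    context_heavy = request_cost(100_000, 1_000, p_in, p_out)
    generation_heavy = request_cost(1_000, 20_000, p_in, p_out)
    print(f"{name}: {output_multiplier(p_in, p_out):.0f}x, "
          f"context-heavy ${context_heavy:.3f}, generation-heavy ${generation_heavy:.3f}")
```

With these numbers the 5x model wins the context-heavy request and the 2x model wins the generation-heavy one, which is the trade-off the multiplier exposes.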
The discount lanes: caching and batch
Cache write / cache read: the two cached-input line items on an Anthropic invoice. Writes cost slightly more than regular input (you're paying to populate the cache); reads cost a small fraction of the regular input price. Caching pays off whenever a prefix is reused more than once or twice.
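The break-even is easy to check by hand. A sketch under stated assumptions: the write and read multipliers below (1.25x and 0.10x of the regular input price) are illustrative defaults, not quoted from this page, and the cache is assumed not to expire between uses. Substitute the figures from your provider's current pricing.

```python
def uncached_cost(prefix_tokens: int, uses: int, input_price: float) -> float:
    """Cost of sending the same prefix `uses` times at the regular input price."""
    return prefix_tokens / 1_000_000 * input_price * uses


def cached_cost(prefix_tokens: int, uses: int, input_price: float,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """One cache write on first use, then cache reads for every later use.

    write_mult and read_mult are assumed multipliers of the regular input price.
    """
    per_pass = prefix_tokens / 1_000_000 * input_price
    return per_pass * write_mult + per_pass * read_mult * (uses - 1)


# Hypothetical 50K-token prefix at $3 / 1M input.
for uses in (1, 2, 10):
    plain = uncached_cost(50_000, uses, 3.0)
    cached = cached_cost(50_000, uses, 3.0)
    print(f"{uses} uses: uncached ${plain:.4f}, cached ${cached:.4f}")
```

Under these assumed multipliers a single use costs more with caching, and the second use already puts you ahead.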
Batch API: submit a file of requests, get results within a deadline (up to 24h), pay a flat 50% off both meters. The easiest discount in the industry for evals, backfills, and nightly jobs — see the batch guide.
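For budgeting, the flat discount applies to both meters equally. A minimal sketch with hypothetical job size and prices:

```python
def job_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float,
             batch_discount: float = 0.0) -> float:
    """Dollar cost of a job at per-1M prices, with an optional flat discount."""
    standard = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return standard * (1.0 - batch_discount)


# Hypothetical nightly eval: 10M input tokens, 2M output tokens at $3 / $15 per 1M.
standard = job_cost(10_000_000, 2_000_000, 3.0, 15.0)
batched = job_cost(10_000_000, 2_000_000, 3.0, 15.0, batch_discount=0.5)
print(f"standard ${standard:.2f}, batch ${batched:.2f}")
```

The only real question is whether the job can wait out the deadline.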
The quiet surcharge: reasoning tokens
Reasoning tokens: tokens a model spends thinking before it answers, typically billed as output but absent from the reply you read. They are the most common source of 'why is my bill 5x my estimate' tickets I hear about, and they're invisible if you estimate from response length alone. Always estimate reasoning workloads from billed usage on real samples, never from output you can read. The cost anatomy is in the extended thinking analysis and whether reasoning models are worth it.
Effort or thinking budget controls — settings that cap how long the model reasons — are billing controls as much as quality controls; turning effort down is often a 50%+ cost cut on reasoning-heavy pipelines.
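Estimating from billed usage can look like the sketch below. The field names `visible_output_tokens` and `billed_output_tokens` are hypothetical placeholders for whatever you log from your provider's usage records, and the sample values are made up:

```python
def reasoning_overhead(samples: list[dict]) -> float:
    """Ratio of billed output tokens to the output tokens you can actually read."""
    visible = sum(s["visible_output_tokens"] for s in samples)
    billed = sum(s["billed_output_tokens"] for s in samples)
    return billed / visible


def estimated_output_cost(visible_tokens: int, overhead: float,
                          output_price: float) -> float:
    """Scale a readable-length estimate by the measured overhead (price per 1M)."""
    return visible_tokens * overhead * output_price / 1_000_000


# Made-up samples from a reasoning-heavy pipeline.
samples = [
    {"visible_output_tokens": 400, "billed_output_tokens": 2_000},
    {"visible_output_tokens": 600, "billed_output_tokens": 3_000},
]
overhead = reasoning_overhead(samples)
print(f"billed output is {overhead:.1f}x the visible output")
print(f"1M visible tokens would bill about ${estimated_output_cost(1_000_000, overhead, 15.0):.2f}")
```

Re-measure the overhead whenever you change the effort or thinking budget setting, since that is exactly the number those controls move.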
Derived metrics: how to compare models on one number
Tokens per dollar: the inverse view — how many tokens a dollar buys. Useful for intuition, hazardous for decisions if computed only from input price; the tokens-per-dollar comparison does it properly.
Cost per request: blended cost applied to your average request shape (say 1,500 in / 400 out). The number that actually belongs in your unit economics, next to cost per user from the SaaS cost-per-user piece.
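Worked through for the 1,500 in / 400 out shape, the calculation looks like this. Prices are hypothetical ($3 / 1M input, $15 / 1M output); the tokens-per-dollar figure is computed on the whole request to show how far it sits from the input-only number:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Dollar cost of one request of the given shape, at per-1M prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def tokens_per_dollar(input_tokens: int, output_tokens: int,
                      input_price: float, output_price: float) -> float:
    """Total tokens processed per dollar for this request shape."""
    cost = cost_per_request(input_tokens, output_tokens, input_price, output_price)
    return (input_tokens + output_tokens) / cost


cost = cost_per_request(1_500, 400, 3.0, 15.0)
blended_tpd = tokens_per_dollar(1_500, 400, 3.0, 15.0)
input_only_tpd = 1_000_000 / 3.0  # the hazardous, input-price-only view

print(f"cost per request: ${cost:.4f}")
print(f"tokens per dollar: {blended_tpd:,.0f} blended vs {input_only_tpd:,.0f} input-only")
```

The request costs about a cent, and the blended tokens-per-dollar figure comes out well under the input-only one because output tokens carry most of the bill.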
Quality per dollar: a leaderboard score divided by blended cost — the closest thing to a value ranking. How TokenRate computes it is documented in the quality score methodology.
The throughput fine print: rate limits and tiers
Usage tiers: providers raise rate limits as your account ages and spends — new accounts start with tight limits that can bottleneck a launch. If you're planning a high-traffic release, check your tier's TPM against expected peak load weeks early; tier upgrades are usually automatic but not instant.
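A back-of-envelope headroom check, assuming a single tokens-per-minute limit that counts input and output together. Some providers meter them separately, so treat this as a sketch; all numbers are hypothetical:

```python
def tpm_utilization(peak_requests_per_minute: float, tokens_per_request: int,
                    tier_tpm_limit: int) -> float:
    """Fraction of the tier's tokens-per-minute limit used at expected peak."""
    return peak_requests_per_minute * tokens_per_request / tier_tpm_limit


# Hypothetical launch: 200 requests/min at 1,900 tokens each, 400K TPM tier.
utilization = tpm_utilization(200, 1_900, 400_000)
print(f"peak load uses {utilization:.0%} of the tier's TPM")
if utilization > 0.8:
    print("little headroom: look into a tier upgrade before launch")
```

The 80% threshold is an arbitrary safety margin; pick one that matches how spiky your traffic is.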
Priority and provisioned tiers: some providers sell guaranteed-latency or reserved-capacity lanes at a premium over standard list price. Relevant once you have strict SLOs, irrelevant before.
With those terms down, every pricing page is the same five questions: per-1M prices on both meters, multiplier, caching mechanics, batch discount, and your tier's TPM. The answers for every major model live in the live table.