What is the difference between input and output tokens?

Input tokens are everything you send the model (prompts, history, documents); output tokens are everything it writes back. They're billed at different rates — output typically costs 2-6x more per token depending on the provider.

What does cached input mean on my API bill?

It's the discounted lane for repeated prompt prefixes (system prompts, reused documents). Cached reads bill at roughly 10% of the normal input rate. On Anthropic you'll also see cache writes, which cost slightly more than regular input the first time a prefix is stored.

Why is my reasoning model bill higher than my output suggests?

Reasoning models bill their internal thinking tokens as output even though you don't see them. A 200-token visible answer can carry thousands of billed thinking tokens behind it. Estimate from real billed usage, not from visible response length.

What do RPM and TPM mean?

Requests per minute and tokens per minute — rate limits that cap how fast you can call the API, independent of budget. Hitting them returns 429 errors. Limits rise with your account's usage tier, so check them before a high-traffic launch.

The LLM API Pricing Glossary: Every Billing Term, Plainly Explained

The unit: tokens, context, and caps

Token: the unit everything is priced in — a chunk of text roughly 4 characters or three-quarters of an English word. 'Unbelievable' is about 3 tokens; 1,000 words is about 1,330 tokens. Full primer in what are AI tokens.
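As a rough illustration of those rules of thumb, the sketch below estimates token counts from character or word counts. The function names are made up for this example, and the ratios are the approximations quoted above, not any provider's tokenizer; real counts vary by model and language.

```python
def estimate_tokens_from_chars(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return round(len(text) / 4)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate using the ~0.75 English words per token rule of thumb."""
    return round(word_count / 0.75)


print(estimate_tokens_from_chars("Unbelievable"))  # 12 characters -> 3
print(estimate_tokens_from_words(1000))            # -> 1333
```

For billing-grade numbers, use the usage counts the API returns rather than an estimate like this.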

Per-1M pricing: prices are quoted per million tokens ('$3 / 1M input'). Some older pages quote per 1,000 tokens instead; multiply a per-1K price by 1,000 to get the per-1M figure ('$0.003 / 1K' is '$3 / 1M'). When comparing providers, always normalize to per-1M; that's what every table on this site uses.
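A minimal sketch of that normalization, assuming a price quoted per some number of tokens. `to_per_million` is a hypothetical helper name and the dollar figures are illustrative only.

```python
def to_per_million(price: float, per_tokens: int) -> float:
    """Convert a price quoted per `per_tokens` tokens into a per-1M price."""
    if per_tokens <= 0:
        raise ValueError("per_tokens must be positive")
    return price * 1_000_000 / per_tokens


# Illustrative numbers: one page quotes per 1,000 tokens, another per 1M.
per_1k_quote = to_per_million(0.003, 1_000)
per_1m_quote = to_per_million(3.00, 1_000_000)
print(f"${per_1k_quote:.2f} / 1M vs ${per_1m_quote:.2f} / 1M")
```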

Context window: the maximum tokens a model can hold in one request — prompt plus history plus its own reply. Current windows run from 131K (some DeepSeek models) through 1M (most flagships) to 10M (Llama 4 Scout). The window is a capacity, not a price: you pay only for tokens you actually send, but see 'context resending' below for the trap. More in understanding context windows.

Output cap (max output tokens): a separate, smaller limit on how long a single reply can be — typically 8K-128K. Hitting it truncates your response mid-sentence; what happens when you exceed limits covers the failure modes.

The two meters: input, output, and the multiplier

Input tokens: everything you send — system prompt, conversation history, retrieved documents, the user's question. Billed at the input rate on every request, which means resent history is re-billed every turn. This 'context resending' is the dominant cost in chat applications, as the math in the chatbot cost guide shows.
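To see why resent history dominates, here is a small sketch that totals billed input tokens over a conversation. It assumes fixed message sizes per turn, the full history resent on every request, and no caching or truncation; the function name and all numbers are hypothetical.

```python
def cumulative_input_tokens(system_tokens: int, user_tokens: int,
                            reply_tokens: int, turns: int) -> int:
    """Total input tokens billed across `turns` requests with full history resent."""
    total = 0
    history = 0  # tokens from earlier user messages and model replies
    for _ in range(turns):
        total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return total


# Illustrative chat: 500-token system prompt, 100-token questions, 300-token replies.
with_resending = cumulative_input_tokens(500, 100, 300, turns=10)
new_content_only = 10 * (500 + 100)
print(with_resending, new_content_only)  # 24000 vs 6000
```

Under these assumptions, ten turns bill four times the input tokens that the new content alone would suggest, and the gap widens with every additional turn.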

Output tokens: everything the model writes back. Priced higher per token than input, typically by 2x to 6x, because generation is the compute-expensive direction.

Output multiplier: the output price divided by the input price — the single most revealing number on a pricing page. In June 2026: Claude runs a uniform 5x, OpenAI and Google mostly 6x, xAI's Grok an unusual 2x. A low multiplier favors generation-heavy work; a low input price favors context-heavy work. The full argument is in the output multiplier piece.
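The trade-off between multiplier and input price can be sketched with two made-up price points. Both "models" below are hypothetical, as are the workloads; the point is only that the cheaper option flips with the input/output mix.

```python
def output_multiplier(input_per_m: float, output_per_m: float) -> float:
    """Output price divided by input price."""
    return output_per_m / input_per_m


def workload_cost(input_m: float, output_m: float,
                  input_per_m: float, output_per_m: float) -> float:
    """Dollar cost for a workload measured in millions of tokens."""
    return input_m * input_per_m + output_m * output_per_m


# Hypothetical prices in $ per 1M tokens: (input, output).
model_a = (3.0, 15.0)   # 5x multiplier, cheaper input
model_b = (5.0, 10.0)   # 2x multiplier, pricier input

print(output_multiplier(*model_a), output_multiplier(*model_b))

# Generation-heavy workload: 1M tokens in, 4M tokens out.
print(workload_cost(1, 4, *model_a), workload_cost(1, 4, *model_b))      # 63.0 vs 45.0
# Context-heavy workload: 10M tokens in, 0.5M tokens out.
print(workload_cost(10, 0.5, *model_a), workload_cost(10, 0.5, *model_b))  # 37.5 vs 55.0
```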

The discount lanes: caching and batch

Cached input (prompt caching): repeated prompt prefixes — system prompts, tool definitions, that 50-page document you keep asking about — can be served from the provider's cache at roughly a tenth of the normal input rate. Anthropic uses explicit cache breakpoints with a small write premium the first time; OpenAI and Google apply caching automatically past a minimum prefix length. On prefix-heavy workloads this is the single biggest lever you control: the caching guide has worked examples.

Cache write / cache read: the two cached-input line items on an Anthropic invoice — writes cost slightly more than regular input (you're paying to populate the cache), reads cost a small fraction of it. Profitable whenever a prefix is reused more than once or twice.
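A quick way to check the break-even point is to express the prefix cost in units of one uncached send. The multipliers below (1.25x for a write, 0.10x for a read) are assumptions chosen to match "slightly more" and "roughly a tenth"; substitute the values from your provider's current pricing page. The sketch also assumes the cache entry stays alive between uses.

```python
def cached_prefix_cost(uses: int, write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """Cost of sending one prefix `uses` times with caching,
    in units of a single uncached send of that prefix."""
    if uses <= 0:
        return 0.0
    return write_mult + (uses - 1) * read_mult


def breakeven_uses(write_mult: float = 1.25, read_mult: float = 0.10,
                   max_uses: int = 1000):
    """Smallest number of uses at which caching is cheaper than not caching.
    Returns None if caching never wins within `max_uses`."""
    for uses in range(1, max_uses + 1):
        if cached_prefix_cost(uses, write_mult, read_mult) < uses:
            return uses
    return None


print(breakeven_uses())             # 2 under the assumed multipliers
print(breakeven_uses(write_mult=2.0))  # 3 with a steeper write premium
```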

Batch API: submit a file of requests, get results within a deadline (up to 24h), pay a flat 50% off both meters. The easiest discount in the industry for evals, backfills, and nightly jobs — see the batch guide.

The quiet surcharge: reasoning tokens

Reasoning tokens (thinking tokens): when a reasoning model (Claude with extended thinking, OpenAI's reasoning modes, DeepSeek R1) works through a problem, its internal chain of thought is metered and billed as output, even though you may never see it. A question with a 200-token visible answer can quietly bill 3,000 thinking tokens behind it: 3,200 billed output tokens in total, making the real output cost 16x the apparent one.

This is the most common source of 'why is my bill 5x my estimate' tickets I hear about, and it's invisible if you estimate from response length alone. Always estimate reasoning workloads from billed usage on real samples, never from output you can read. The cost anatomy is in the extended thinking analysis and whether reasoning models are worth it.
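One way to estimate from billed usage is to measure, on a handful of real requests, how many output tokens were billed per visible token. The sketch below assumes you have already collected (visible tokens, billed output tokens) pairs from your own logs; the sample values and the function name are hypothetical.

```python
def output_inflation(samples: list[tuple[int, int]]) -> float:
    """Ratio of billed output tokens to visible output tokens across samples.
    Each sample is (visible_tokens, billed_output_tokens)."""
    visible = sum(v for v, _ in samples)
    billed = sum(b for _, b in samples)
    if visible == 0:
        raise ValueError("need at least one sample with visible tokens")
    return billed / visible


# Hypothetical samples: billed counts include the hidden thinking tokens.
samples = [(200, 3200), (150, 1650), (400, 2400)]
factor = output_inflation(samples)
print(f"billed output is about {factor:.1f}x the visible output")

# Scale a length-based estimate by the measured factor.
visible_estimate = 500
print(round(visible_estimate * factor), "output tokens to budget for")
```

A single sample of 200 visible tokens with 3,000 thinking tokens behind it gives a factor of 16.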

Effort or thinking budget controls — settings that cap how long the model reasons — are billing controls as much as quality controls; turning effort down is often a 50%+ cost cut on reasoning-heavy pipelines.

Derived metrics: how to compare models on one number

Blended cost per 1M: input and output prices collapsed into one number using your workload's mix — typically 0.75 x input + 0.25 x output for chat-shaped work. The basis of every comparison in the $50 budget guide.
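The blend is a weighted average, sketched here with the 75/25 mix from the definition above. The $3 and $15 prices are placeholders, not a quote for any particular model, and `input_share` should be replaced with your own measured mix.

```python
def blended_cost_per_m(input_per_m: float, output_per_m: float,
                       input_share: float = 0.75) -> float:
    """Weighted average price per 1M tokens for a given input/output mix."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_per_m + (1.0 - input_share) * output_per_m


print(blended_cost_per_m(3.0, 15.0))                   # chat-shaped mix -> 6.0
print(blended_cost_per_m(3.0, 15.0, input_share=0.5))  # generation-heavier mix -> 9.0
```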

Tokens per dollar: the inverse view — how many tokens a dollar buys. Useful for intuition, hazardous for decisions if computed only from input price; the tokens-per-dollar comparison does it properly.
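The hazard is easy to see numerically. In this sketch, with placeholder prices of $3 input and $15 output per 1M and a 75/25 mix, the input-only figure overstates what a dollar buys by a factor of two.

```python
def tokens_per_dollar(price_per_m: float) -> float:
    """How many tokens one dollar buys at a given per-1M price."""
    return 1_000_000 / price_per_m


input_per_m, output_per_m = 3.0, 15.0            # hypothetical prices
blended = 0.75 * input_per_m + 0.25 * output_per_m

print(round(tokens_per_dollar(input_per_m)))  # input price only
print(round(tokens_per_dollar(blended)))      # blended price
```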

Cost per request: blended cost applied to your average request shape (say 1,500 in / 400 out). The number that actually belongs in your unit economics, next to cost per user from the SaaS cost-per-user piece.
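A minimal sketch of that calculation, using the 1,500 in / 400 out shape from the definition. The prices and the 200-requests-per-month figure are assumptions for illustration only.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request given per-1M prices on both meters."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical prices: $3 / 1M input, $15 / 1M output.
per_request = cost_per_request(1_500, 400, 3.0, 15.0)
print(f"${per_request:.4f} per request")

# Hypothetical usage: 200 requests per user per month.
print(f"${per_request * 200:.2f} per user per month")
```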

Quality per dollar: a leaderboard score divided by blended cost — the closest thing to a value ranking. How TokenRate computes it is documented in the quality score methodology.

The throughput fine print: rate limits and tiers

RPM / TPM: requests per minute and tokens per minute — caps on throughput, separate from cost. You can have budget left and still be throttled; a 429 error is the API telling you to slow down, not that you're out of money. Some providers split TPM into input and output limits (ITPM/OTPM).
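A common way to handle a 429 is to retry with exponential backoff and jitter. The sketch below is one generic version of that pattern, not any provider's SDK: `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and the delays are illustrative defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call `fn`, retrying on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Usage with a fake call that is throttled twice, then succeeds.
attempts = {"n": 0}

def fake_api_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

waits = []  # record delays instead of actually sleeping
print(call_with_backoff(fake_api_call, sleep=waits.append), len(waits))
```

If the provider returns a retry-after hint with the 429, honoring it is usually better than a computed delay.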

Usage tiers: providers raise rate limits as your account ages and spends — new accounts start with tight limits that can bottleneck a launch. If you're planning a high-traffic release, check your tier's TPM against expected peak load weeks early; tier upgrades are usually automatic but not instant.
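The pre-launch check can be a few lines of arithmetic. This sketch assumes both input and output tokens count against a single TPM limit, which is not true for every provider (some meter them separately); the function name, traffic figures, and limits are all hypothetical.

```python
def peak_load_check(peak_rpm: int, avg_input_tokens: int, avg_output_tokens: int,
                    rpm_limit: int, tpm_limit: int) -> dict:
    """Compare expected peak traffic against a tier's RPM and TPM limits."""
    needed_tpm = peak_rpm * (avg_input_tokens + avg_output_tokens)
    return {
        "needed_tpm": needed_tpm,
        "rpm_ok": peak_rpm <= rpm_limit,
        "tpm_ok": needed_tpm <= tpm_limit,
    }


# Hypothetical launch: 300 requests/min at 1,500 in / 400 out,
# against a tier allowing 500 RPM and 400,000 TPM.
result = peak_load_check(300, 1_500, 400, rpm_limit=500, tpm_limit=400_000)
print(result)
```

Here the request rate fits but the token rate does not, which is the case that tends to surprise teams on launch day.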

Priority and provisioned tiers: some providers sell guaranteed-latency or reserved-capacity lanes at a premium over standard list price. These are relevant once you have strict SLOs, irrelevant before.

With those terms down, every pricing page is the same five questions: per-1M prices on both meters, multiplier, caching mechanics, batch discount, and your tier's TPM. The answers for every major model live in the live table.