The model conversation, spelled out
One conversation: 6 turns. A 500-token system prompt (instructions, persona, guardrails). Users write about 150 tokens per message; the bot replies with about 200. The crucial mechanic: on every turn, the API receives the system prompt plus the entire conversation so far — LLM APIs are stateless, so history gets resent and re-billed each time.
Add it up across six turns and one conversation consumes roughly 9,150 input tokens but only 1,200 output tokens. That 7.6-to-1 input-to-output ratio is the signature of chat workloads, and it's why the input price and caching policy of your model matter more than the headline output price. If tokens are still fuzzy, start with what tokens are and come back.
Why history resending dominates the bill
This has two practical consequences. First, truncate or summarize history: if your bot only needs the last few turns to stay coherent, dropping older turns cuts real money at scale. Second, watch your turn-length distribution in production — a small share of marathon conversations can eat a surprising share of the budget. I've covered the diagnosis side in why your LLM bill is higher than expected.
Reading the ten-model table honestly
For classic tier-one support — FAQs, order status, password resets — the budget tier handles it: DeepSeek V4 Flash at $11/month if its jurisdictional profile works for you (see the DeepSeek guide), otherwise Gemini Flash-Lite at $41 or GPT-5.4-mini at $122.
For anything customer-visible where tone and judgment matter, the mid tier earns its keep: Claude Haiku 4.5 at $150 or Gemini 3.5 Flash at $243 — Flash scores 73 on Arena, frontier-adjacent, which is remarkable at this price.
Frontier models (Sonnet, Opus, GPT-5.5) at $450-810/month belong in chat products where the conversation is the product — coaching, tutoring, expert assistance — not in deflection-rate support bots. Quality-per-dollar rankings across all tiers live in the showdown piece.
Caching: the lever that changes the rankings
In this scenario, caching the system prompt and prior history cuts the effective input bill by half or more, which compresses the table dramatically: Haiku drops from $150 toward $80, Sonnet from $450 toward $250. Providers differ in mechanics (Anthropic uses explicit cache breakpoints; OpenAI and Google cache automatically above a minimum prefix length), and the details are in the prompt caching guide.
The corollary: structure your prompt so the stable parts come first and the per-turn parts last. Cache-friendliness is a prompt-architecture decision, not a billing checkbox.
From per-conversation cost to a real budget
Retries and overruns: failed JSON parses, timeouts, and safety re-prompts add 5-15% in my experience. Budget it.
RAG context: if each turn retrieves 1,000-2,000 tokens of documentation into the prompt, input costs double or triple — rerun the table with your real per-turn context. The RAG cost piece covers trimming that.
Evals and development: testing prompt changes against a few hundred recorded conversations before each deploy is cheap insurance (a few dollars a run on a mid-tier model) but it's a real recurring cost most budgets forget. For the full production-budgeting picture, see token budgeting for production apps.
Then validate against reality: instrument actual token counts per conversation from day one, because real users never match the model conversation exactly. Paste your observed averages into the calculator and re-rank.