How LLM Quality Scores Are Calculated: Inside TokenRate's Quality Index
An inside look at how TokenRate calculates LLM quality scores: the precedence between Arena AI Elo, the Artificial Analysis Intelligence Index, and a curated static fallback for ~70 models.
Frequently Asked Questions
Why blend two sources instead of using one?
Arena and AA measure different things (live human preference versus a static benchmark composite) and often disagree. Drawing on both captures both signals and covers more models than either source alone. When more than one source has data for a model, the precedence (AA > Arena > static) applies, preferring AA for its broader coverage. If only one source has data for a model, that source determines the Quality value.
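The precedence can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not TokenRate's actual code: the function name `resolve_quality`, the `Quality` tuple, and the dictionary inputs are hypothetical, and the sketch assumes each source's scores have already been put on a scale the calculator can display. The only behavior taken from this page is the order AA > Arena > static and the fact that the static fallback tags itself 'aa'.

```python
from typing import Mapping, NamedTuple, Optional


class Quality(NamedTuple):
    score: float
    badge: str  # 'aa' or 'arena', as shown on the Quality cell


def resolve_quality(
    model_key: str,
    aa_scores: Mapping[str, float],
    arena_scores: Mapping[str, float],
    static_scores: Mapping[str, float],
) -> Optional[Quality]:
    """Pick one Quality value per model using AA > Arena > static."""
    if model_key in aa_scores:
        return Quality(aa_scores[model_key], "aa")
    if model_key in arena_scores:
        return Quality(arena_scores[model_key], "arena")
    if model_key in static_scores:
        # The static fallback uses AA's scale, so it carries the 'aa' badge.
        return Quality(static_scores[model_key], "aa")
    return None  # no Quality data from any source, so no badge


# Hypothetical model keys and scores, for illustration only.
aa = {"model-a": 61.0}
arena = {"model-a": 58.0, "model-b": 55.0}
static = {"model-c": 40.0}

both = resolve_quality("model-a", aa, arena, static)       # AA wins
arena_only = resolve_quality("model-b", aa, arena, static)
fallback = resolve_quality("model-c", aa, arena, static)
missing = resolve_quality("model-d", aa, arena, static)
```

In this sketch a model that appears in both feeds takes its AA value, and a model known to no source returns `None`, which corresponds to a Quality cell with no badge.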
How often does the Quality score update?
Every 60 minutes. Both the Arena and AA feeds revalidate on the same hourly cycle as OpenRouter pricing, so a price change and a quality change can land on the same refresh. The first request after a refresh may take an extra moment while the merge runs; subsequent requests hit the in-memory cache.
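One common way to get this behavior is a time-to-live cache that rebuilds lazily on the first request after expiry. The Python sketch below shows that pattern only; the class name `HourlyCache`, the `loader` callback, and the injectable `clock` are assumptions made for illustration, and the real revalidation mechanism may differ.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class HourlyCache(Generic[T]):
    """In-memory cache that reloads its value once the TTL has elapsed."""

    def __init__(
        self,
        loader: Callable[[], T],
        ttl_seconds: float = 3600.0,  # 60 minutes, per the refresh cycle
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Slow path: the first request after expiry pays for the merge.
            self._value = self._loader()
            self._loaded_at = now
        return self._value  # fast path: served from memory


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
merge_runs = [0]


def merge_feeds() -> dict:
    merge_runs[0] += 1
    return {"refresh": merge_runs[0]}


cache = HourlyCache(merge_feeds, clock=lambda: fake_now[0])
first = cache.get()    # runs the merge
second = cache.get()   # cache hit
fake_now[0] = 3600.0   # one hour later
third = cache.get()    # runs the merge again
```

Passing the clock in as a parameter keeps the example testable without waiting an hour; in production the default `time.monotonic` would be used.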
Can I see which source produced a given Quality score?
Yes. Each Quality cell in the calculator carries a source badge that reads 'arena' or 'aa' (the static fallback also tags itself as 'aa' since its scores match AA's scale). If a model has no badge, it has no Quality data from any source.
Why is a model I expect to score high showing no Quality value?
Three possibilities: (1) the model isn't in Arena's top ~20 or AA's dataset; (2) the key-matching heuristic failed to align the OpenRouter ID with a Quality-source key (a fuzzy-match miss); (3) you're on the static-fallback path but the model post-dates the quarterly update. The Compare Prices view will still show pricing even when quality data is missing.
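To show how a fuzzy-match miss in case (2) can arise, here is a hedged Python sketch of one plausible key-matching approach. This page does not describe the actual heuristic, so everything below is an assumption: the functions `normalize` and `match_quality_key`, the 0.85 cutoff, and the example keys are illustrative. The sketch drops the provider prefix from the OpenRouter ID, strips punctuation and case, tries an exact match, and only then falls back to the standard library's `difflib.get_close_matches`.

```python
import difflib
import re
from typing import Iterable, Optional


def normalize(model_id: str) -> str:
    """Drop a 'provider/' prefix, lowercase, and keep only letters and digits."""
    name = model_id.split("/", 1)[-1]
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def match_quality_key(
    openrouter_id: str,
    quality_keys: Iterable[str],
    cutoff: float = 0.85,  # assumed threshold; higher means stricter matching
) -> Optional[str]:
    """Return the Quality-source key that best matches an OpenRouter ID."""
    by_norm = {normalize(key): key for key in quality_keys}
    target = normalize(openrouter_id)
    if target in by_norm:
        return by_norm[target]  # exact match after normalization
    close = difflib.get_close_matches(target, list(by_norm), n=1, cutoff=cutoff)
    return by_norm[close[0]] if close else None  # None models a fuzzy-match miss


# Illustrative keys; real Quality-source keys may be formatted differently.
keys = ["GPT-4o", "Claude 3.5 Sonnet"]
hit_a = match_quality_key("openai/gpt-4o", keys)
hit_b = match_quality_key("anthropic/claude-3.5-sonnet", keys)
miss = match_quality_key("vendor/totally-unrelated-model", keys)
```

A strict cutoff avoids pairing a model with the wrong score, at the cost of occasional misses like the one described above, where a model has data in a source but still shows no Quality value.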
Try the TokenRate Calculator
Open the TokenRate calculator to see the Quality score in action. The source badge on each cell reads 'arena' for values from Arena AI, or 'aa' for values from Artificial Analysis or the curated fallback.
Open Calculator →