Methodology
How the rankings work
Transparent scoring
A modular scoring system combining benchmark evidence, product characteristics, context length, enterprise readiness, and pricing across 10 ranking categories.
Ranking philosophy
LLM Atlas uses category-specific weighted scores instead of a single monolithic ranking. Each category emphasizes different signals based on what matters most for that use case. A model ranked #1 for coding may rank #10 for value, and that is by design.
Every score is derived from publicly available data: official benchmarks, provider pricing pages, HuggingFace model cards, and measured performance. No score is fabricated or estimated without source attribution.
Score fields
Each model is scored on 8 dimensions. Scores are on a 0-100 scale, calibrated against public benchmarks and measured performance.
Reasoning
Step-by-step logic, problem-solving, math, and analytical capability. Calibrated from MMLU-Pro, GPQA Diamond, AIME, and HLE benchmarks.
Coding
Code generation, debugging, refactoring, and software engineering. Calibrated from LiveCodeBench, HumanEval+, SWE-Bench Verified, and Terminal-Bench.
Vision
Image understanding, OCR, document parsing, and multimodal reasoning. Calibrated from MMMU, Vision Vista, and provider-specific vision benchmarks.
Safety
Content moderation, alignment quality, refusal behavior, and responsible AI practices. Based on red-teaming reports and safety evaluations.
Speed
Inference latency, throughput, and time-to-first-token. Based on provider infrastructure, model architecture, and measured performance.
Enterprise
API reliability, SLA commitments, compliance, documentation quality, and operational readiness for business workloads.
Context window
Effective context length normalized using logarithmic scaling (log10(tokens) × 20), capped at 100. A 32K window scores ~90; windows of roughly 100K tokens and above hit the cap.
Price efficiency
Calculated as max(35, 100 − (input × 0.4 + output × 0.6) × 2500). Models without published pricing default to null, not zero.
Category weights
Each ranking category blends the 8 score fields (plus, for the open-source category, an open-source bonus) with different weights. The weighted average produces the final category score.
| Category | Weight breakdown |
|---|---|
| Overall | Reasoning 22%, Coding 16%, Vision 14%, Safety 11%, Speed 11%, Enterprise 12%, Context 8%, Price 6% |
| Coding | Coding 42%, Reasoning 24%, Speed 12%, Enterprise 8%, Price 8%, Context 6% |
| Reasoning | Reasoning 50%, Safety 12%, Context 15%, Speed 8%, Price 5%, Enterprise 10% |
| Value | Price 36%, Reasoning 18%, Coding 12%, Speed 12%, Enterprise 12%, Safety 10% |
| Long context | Context 45%, Reasoning 18%, Enterprise 12%, Speed 8%, Price 8%, Coding 9% |
| Enterprise | Enterprise 40%, Safety 18%, Reasoning 15%, Price 8%, Context 10%, Speed 9% |
| Safety | Safety 55%, Enterprise 15%, Reasoning 10%, Price 6%, Speed 6%, Context 8% |
| Vision | Vision 55%, Reasoning 14%, Speed 10%, Enterprise 8%, Price 5%, Context 8% |
| Open source | Open-source bonus 20%, Price 20%, Coding 18%, Reasoning 18%, Speed 10%, Context 8%, Enterprise 6% |
| Structured output | Reasoning 25%, Enterprise 18%, Safety 16%, Coding 15%, Speed 12%, Price 14% |
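The blending described above can be sketched as a weighted average. This is a minimal illustration, not the actual LLM Atlas implementation: the dimension keys are illustrative, and the choice to drop null dimensions and renormalize the remaining weights (so a model without a price score is not penalized to zero) is an assumption based on the null-handling described in the price efficiency section.

```typescript
type Scores = Record<string, number | null>;
type Weights = Record<string, number>; // percentages, summing to 100

// Blend per-dimension scores into a single category score.
// Null dimensions (e.g. a null price score) are skipped and the
// remaining weights renormalized — an assumption, not confirmed behavior.
function categoryScore(scores: Scores, weights: Weights): number {
  let total = 0;
  let weightSum = 0;
  for (const [dim, weight] of Object.entries(weights)) {
    const s = scores[dim];
    if (s === null || s === undefined) continue;
    total += s * weight;
    weightSum += weight;
  }
  return weightSum > 0 ? total / weightSum : 0;
}

// Example: the "Coding" category weights from the table above,
// applied to a hypothetical open-weight model with no published pricing.
const codingWeights: Weights = {
  coding: 42, reasoning: 24, speed: 12,
  enterprise: 8, price: 8, context: 6,
};
const model: Scores = {
  coding: 90, reasoning: 85, speed: 80,
  enterprise: 75, price: null, context: 98,
};
```

With the null price dimension dropped, the remaining 92 points of weight are renormalized, so the model's strong coding score still dominates the result.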
Price efficiency scoring
Price efficiency is calculated from per-token input and output pricing using the formula:
priceScore = max(35, 100 − (inputPrice × 0.4 + outputPrice × 0.6) × 2500)
This formula weights output tokens more heavily (60%) because they typically represent the majority of cost in production workloads. The floor of 35 ensures that even expensive models receive some score rather than zero.
Models without published API pricing (open-weight-only, self-hosted) receive a null price score. They are included in rankings but do not benefit from price efficiency in value-focused categories.
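The formula and the null-handling rule above translate directly into code. One caveat: the document does not state the pricing units, so treating `inputPrice` and `outputPrice` as USD per 1K tokens is an assumption (the ×2500 scale factor is consistent with that reading).

```typescript
// Price efficiency per the published formula:
//   priceScore = max(35, 100 − (input × 0.4 + output × 0.6) × 2500)
// Units assumed to be USD per 1K tokens — not confirmed by the source.
function priceScore(
  inputPrice: number | null,
  outputPrice: number | null
): number | null {
  // Models without published pricing get null, not zero.
  if (inputPrice === null || outputPrice === null) return null;
  const blended = inputPrice * 0.4 + outputPrice * 0.6;
  return Math.max(35, 100 - blended * 2500);
}
```

For example, a model priced at $0.003/1K input and $0.015/1K output blends to $0.0102, yielding a score of about 74.5, while any sufficiently expensive model bottoms out at the floor of 35.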
Context window normalization
Raw context windows vary wildly (4K to 10M tokens). We normalize them using logarithmic scaling to prevent extreme outliers from dominating rankings:
normalized = min(100, log10(contextWindow) × 20)
Examples:
- 4K context → score ~72
- 32K context → score ~90
- 128K context → ~102, capped at 100
- 1M context → 120, capped at 100
- 10M context → 140, capped at 100
This means models with roughly 100K or more tokens of context are treated equally in the context dimension, while differences below that threshold are fully captured.
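The normalization is a one-liner; a sketch for reference:

```typescript
// Context normalization: min(100, log10(contextWindow) × 20).
// contextWindow is the effective window in tokens.
function contextScore(contextWindow: number): number {
  return Math.min(100, Math.log10(contextWindow) * 20);
}
```

Because log10(100,000) × 20 = 100, every window from about 100K tokens up lands exactly on the cap.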
Benchmark calibration
Scores are not direct benchmark pass rates. They are calibrated on a 0-100 scale based on relative performance across the model catalog:
- 95-100: Frontier leaders (top 1-3 models globally)
- 85-94: Strong performers (competitive in production)
- 70-84: Capable models (suitable for most applications)
- 55-69: Moderate models (good for specific tasks)
- Below 55: Limited or specialized (niche use only)
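The bands above amount to a simple threshold lookup. This helper is hypothetical (the labels are taken from the list, but LLM Atlas may not expose such a function):

```typescript
// Map a calibrated 0-100 score to its tier label, per the bands above.
function calibrationTier(score: number): string {
  if (score >= 95) return "Frontier leader";
  if (score >= 85) return "Strong performer";
  if (score >= 70) return "Capable";
  if (score >= 55) return "Moderate";
  return "Limited or specialized";
}
```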
Scores are updated as new models release and benchmarks evolve. A model's score may change even without re-testing, as the calibration curve shifts.
Model coverage tiers
Not all models in the catalog have identical data depth. We classify coverage into tiers:
Full-profile
Models with all core specs (context window, release date, pricing), at least one benchmark score, and a non-empty summary. These models are eligible for rankings, comparisons, and the strongest analytical claims.
Verified-listing
Models with a summary and source attribution but missing some core specs. They appear in directory and provider pages but are not used for ranking or comparison claims.
Limited-listing
Models with minimal data — typically name, provider, and category only. Listed for market visibility but without analytical claims.
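The tier definitions above can be read as a classification rule. The record shape below is illustrative, not the actual LLM Atlas schema, and the exact criteria (e.g. whether pricing counts as a core spec for open-weight models) are assumptions drawn from the tier descriptions:

```typescript
// Hypothetical model record; field names are illustrative only.
interface ModelRecord {
  name: string;
  provider: string;
  contextWindow?: number;
  releaseDate?: string;
  pricing?: { input: number; output: number };
  benchmarks: Record<string, number>;
  summary?: string;
  sources?: string[];
}

type CoverageTier = "full-profile" | "verified-listing" | "limited-listing";

// Classify a record into a coverage tier per the definitions above.
function coverageTier(m: ModelRecord): CoverageTier {
  const hasCoreSpecs =
    m.contextWindow !== undefined &&
    m.releaseDate !== undefined &&
    m.pricing !== undefined;
  const hasBenchmark = Object.keys(m.benchmarks).length > 0;
  const hasSummary = (m.summary ?? "").trim().length > 0;
  if (hasCoreSpecs && hasBenchmark && hasSummary) return "full-profile";
  if (hasSummary && (m.sources ?? []).length > 0) return "verified-listing";
  return "limited-listing";
}
```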
Data sources
All model data in LLM Atlas is sourced from publicly available references:
- Benchmarks: LiveCodeBench, MMLU-Pro, GPQA Diamond, AIME 2025, SWE-Bench Verified, HumanEval+, MMMU, GSM8K, HLE, ARC-Challenge, GlobalMMLU, Terminal-Bench
- Model specs: HuggingFace model cards, official provider documentation, API pricing pages
- Performance data: LM Arena leaderboard, published research papers, provider benchmark reports
- Pricing: Official provider pricing pages (OpenAI, Anthropic, Google, Mistral, Cohere, Amazon, etc.)
- Safety: Red-teaming reports, alignment research, provider safety evaluations
When data is estimated rather than measured, it is marked accordingly. We never fabricate benchmark scores or pricing.
Limitations and caveats
No scoring system is perfect. We acknowledge the following limitations:
- Benchmark gaps: Not every model has been evaluated on every benchmark. Scores are calibrated from available data, which may not capture the full picture.
- Rapid change: The AI landscape evolves weekly. Scores reflect the latest available data at the time of update.
- Provider-reported data: Some specs (speed, context window, max output) are provider-reported and not independently verified.
- Regional pricing: Pricing varies by region, volume, and contract terms. We use standard public pricing for comparison.
- Subjectivity in safety: Safety scoring involves editorial judgment about alignment quality and content moderation practices.
We welcome corrections and data contributions. If you find an error in any model's data, please open an issue or submit a pull request.