Methodology
How the rankings work
Transparent scoring
A modular scoring system combining benchmark evidence, product characteristics, context length, enterprise readiness, and pricing across 10 ranking categories.
Ranking philosophy
LLM Atlas uses category-specific weighted scores instead of a single monolithic ranking. Each category emphasizes different signals based on what matters most for that use case. A model ranked #1 for coding may rank #10 for value, and that is by design.
Every score is derived from publicly available data: official benchmarks, provider pricing pages, HuggingFace model cards, and measured performance. No score is fabricated or estimated without source attribution.
Score fields
Each model is scored on 8 dimensions. Scores are on a 0-100 scale, calibrated against public benchmarks and measured performance.
Reasoning
Step-by-step logic, problem-solving, math, and analytical capability. Calibrated from MMLU-Pro, GPQA Diamond, AIME, and HLE benchmarks.
Coding
Code generation, debugging, refactoring, and software engineering. Calibrated from LiveCodeBench, HumanEval+, SWE-Bench Verified, and Terminal-Bench.
Vision
Image understanding, OCR, document parsing, and multimodal reasoning. Calibrated from MMMU, Vision Vista, and provider-specific vision benchmarks.
Safety
Content moderation, alignment quality, refusal behavior, and responsible AI practices. Based on red-teaming reports and safety evaluations.
Speed
Inference latency, throughput, and time-to-first-token. Based on provider infrastructure, model architecture, and measured performance.
Enterprise
API reliability, SLA commitments, compliance, documentation quality, and operational readiness for business workloads.
Context window
Effective context length normalized using logarithmic scaling (log10(tokens) × 20), capped at 100. A 32K window scores ~90; windows of roughly 100K tokens and above hit the cap.
Price efficiency
Calculated as max(35, 100 − (input × 0.4 + output × 0.6) × 2500). Models without published pricing default to null, not zero.
Category weights
Each ranking category blends the 8 score fields (plus, for the open-source category, an open-source bonus) with different weights. The weighted average produces the final category score.
| Category | Weight breakdown |
|---|---|
| Overall | Reasoning 22%, Coding 16%, Vision 14%, Safety 11%, Speed 11%, Enterprise 12%, Context 8%, Price 6% |
| Coding | Coding 42%, Reasoning 24%, Speed 12%, Enterprise 8%, Price 8%, Context 6% |
| Reasoning | Reasoning 50%, Safety 12%, Context 15%, Speed 8%, Price 5%, Enterprise 10% |
| Value | Price 36%, Reasoning 18%, Coding 12%, Speed 12%, Enterprise 12%, Safety 10% |
| Long context | Context 45%, Reasoning 18%, Enterprise 12%, Speed 8%, Price 8%, Coding 9% |
| Enterprise | Enterprise 40%, Safety 18%, Reasoning 15%, Price 8%, Context 10%, Speed 9% |
| Safety | Safety 55%, Enterprise 15%, Reasoning 10%, Price 6%, Speed 6%, Context 8% |
| Vision | Vision 55%, Reasoning 14%, Speed 10%, Enterprise 8%, Price 5%, Context 8% |
| Open source | Open-source bonus 20%, Price 20%, Coding 18%, Reasoning 18%, Speed 10%, Context 8%, Enterprise 6% |
| Structured output | Reasoning 25%, Enterprise 18%, Safety 16%, Coding 15%, Speed 12%, Price 14% |
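The blending described above can be sketched as a weighted average. This is a minimal illustration, not the actual LLM Atlas implementation: the dimension keys are illustrative, and the choice to drop null dimensions and renormalize the remaining weights (so a model without a price score is not penalized to zero) is an assumption based on the null-handling described in the price efficiency section.

```typescript
type Scores = Record<string, number | null>;
type Weights = Record<string, number>; // percentages, summing to 100

// Blend per-dimension scores into a single category score.
// Null dimensions (e.g. a null price score) are skipped and the
// remaining weights renormalized — an assumption, not confirmed behavior.
function categoryScore(scores: Scores, weights: Weights): number {
  let total = 0;
  let weightSum = 0;
  for (const [dim, weight] of Object.entries(weights)) {
    const s = scores[dim];
    if (s === null || s === undefined) continue;
    total += s * weight;
    weightSum += weight;
  }
  return weightSum > 0 ? total / weightSum : 0;
}

// Example: the "Coding" category weights from the table above,
// applied to a hypothetical open-weight model with no published pricing.
const codingWeights: Weights = {
  coding: 42, reasoning: 24, speed: 12,
  enterprise: 8, price: 8, context: 6,
};
const model: Scores = {
  coding: 90, reasoning: 85, speed: 80,
  enterprise: 75, price: null, context: 98,
};
```

With the null price dimension dropped, the remaining 92 points of weight are renormalized, so the model's strong coding score still dominates the result.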
Price efficiency scoring
Price efficiency is calculated from per-token input and output pricing using the formula:
priceScore = max(35, 100 − (inputPrice × 0.4 + outputPrice × 0.6) × 2500)
This formula weights output tokens more heavily (60%) because they typically represent the majority of cost in production workloads. The floor of 35 ensures that even expensive models receive some score rather than zero.
Models without published API pricing (open-weight-only, self-hosted) receive a null price score. They are included in rankings but do not benefit from price efficiency in value-focused categories.
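The formula and the null-handling rule above translate directly into code. One caveat: the document does not state the pricing units, so treating `inputPrice` and `outputPrice` as USD per 1K tokens is an assumption (the ×2500 scale factor is consistent with that reading).

```typescript
// Price efficiency per the published formula:
//   priceScore = max(35, 100 − (input × 0.4 + output × 0.6) × 2500)
// Units assumed to be USD per 1K tokens — not confirmed by the source.
function priceScore(
  inputPrice: number | null,
  outputPrice: number | null
): number | null {
  // Models without published pricing get null, not zero.
  if (inputPrice === null || outputPrice === null) return null;
  const blended = inputPrice * 0.4 + outputPrice * 0.6;
  return Math.max(35, 100 - blended * 2500);
}
```

For example, a model priced at $0.003/1K input and $0.015/1K output blends to $0.0102, yielding a score of about 74.5, while any sufficiently expensive model bottoms out at the floor of 35.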
Context window normalization
Raw context windows vary wildly (4K to 10M tokens). We normalize them using logarithmic scaling to prevent extreme outliers from dominating rankings:
normalized = min(100, log10(contextWindow) × 20)
Examples:
- 4K context → score ~72
- 32K context → score ~90
- 128K context → ~102, capped at 100
- 1M context → 120, capped at 100
- 10M context → 140, capped at 100
This means models with roughly 100K or more tokens of context are treated equally in the context dimension, while differences below that threshold are fully captured.
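The normalization is a one-liner; a sketch for reference:

```typescript
// Context normalization: min(100, log10(contextWindow) × 20).
// contextWindow is the effective window in tokens.
function contextScore(contextWindow: number): number {
  return Math.min(100, Math.log10(contextWindow) * 20);
}
```

Because log10(100,000) × 20 = 100, every window from about 100K tokens up lands exactly on the cap.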
Benchmark calibration
Scores are not direct benchmark pass rates. They are calibrated on a 0-100 scale based on relative performance across the model catalog:
- 95-100: Frontier leaders (top 1-3 models globally)
- 85-94: Strong performers (competitive in production)
- 70-84: Capable models (suitable for most applications)
- 55-69: Moderate models (good for specific tasks)
- Below 55: Limited or specialized (niche use only)
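The bands above amount to a simple threshold lookup. This helper is hypothetical (the labels are taken from the list, but LLM Atlas may not expose such a function):

```typescript
// Map a calibrated 0-100 score to its tier label, per the bands above.
function calibrationTier(score: number): string {
  if (score >= 95) return "Frontier leader";
  if (score >= 85) return "Strong performer";
  if (score >= 70) return "Capable";
  if (score >= 55) return "Moderate";
  return "Limited or specialized";
}
```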
Scores are updated as new models release and benchmarks evolve. A model's score may change even without re-testing, as the calibration curve shifts.
Model coverage tiers
Not all models in the catalog have identical data depth. We classify coverage into tiers:
Full-profile
Models with all core specs (context window, release date, pricing), at least one benchmark score, and a non-empty summary. These models are eligible for rankings, comparisons, and the strongest analytical claims.
Verified-listing
Models with a summary and source attribution but missing some core specs. They appear in directory and provider pages but are not used for ranking or comparison claims.
Limited-listing
Models with minimal data — typically name, provider, and category only. Listed for market visibility but without analytical claims.
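The tier definitions above can be read as a classification rule. The record shape below is illustrative, not the actual LLM Atlas schema, and the exact criteria (e.g. whether pricing counts as a core spec for open-weight models) are assumptions drawn from the tier descriptions:

```typescript
// Hypothetical model record; field names are illustrative only.
interface ModelRecord {
  name: string;
  provider: string;
  contextWindow?: number;
  releaseDate?: string;
  pricing?: { input: number; output: number };
  benchmarks: Record<string, number>;
  summary?: string;
  sources?: string[];
}

type CoverageTier = "full-profile" | "verified-listing" | "limited-listing";

// Classify a record into a coverage tier per the definitions above.
function coverageTier(m: ModelRecord): CoverageTier {
  const hasCoreSpecs =
    m.contextWindow !== undefined &&
    m.releaseDate !== undefined &&
    m.pricing !== undefined;
  const hasBenchmark = Object.keys(m.benchmarks).length > 0;
  const hasSummary = (m.summary ?? "").trim().length > 0;
  if (hasCoreSpecs && hasBenchmark && hasSummary) return "full-profile";
  if (hasSummary && (m.sources ?? []).length > 0) return "verified-listing";
  return "limited-listing";
}
```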
Data sources
All model data in LLM Atlas is sourced from publicly available references:
- Benchmarks: LiveCodeBench, MMLU-Pro, GPQA Diamond, AIME 2025, SWE-Bench Verified, HumanEval+, MMMU, GSM8K, HLE, ARC-Challenge, GlobalMMLU, Terminal-Bench
- Model specs: HuggingFace model cards, official provider documentation, API pricing pages
- Performance data: LM Arena leaderboard, published research papers, provider benchmark reports
- Pricing: Official provider pricing pages (OpenAI, Anthropic, Google, Mistral, Cohere, Amazon, etc.)
- Safety: Red-teaming reports, alignment research, provider safety evaluations
When data is estimated rather than measured, it is marked accordingly. We never fabricate benchmark scores or pricing.
Limitations and caveats
No scoring system is perfect. We acknowledge the following limitations:
- Benchmark gaps: Not every model has been evaluated on every benchmark. Scores are calibrated from available data, which may not capture the full picture.
- Rapid change: The AI landscape evolves weekly. Scores reflect the latest available data at the time of update.
- Provider-reported data: Some specs (speed, context window, max output) are provider-reported and not independently verified.
- Regional pricing: Pricing varies by region, volume, and contract terms. We use standard public pricing for comparison.
- Subjectivity in safety: Safety scoring involves editorial judgment about alignment quality and content moderation practices.
We welcome corrections and data contributions. If you find an error in any model's data, please open an issue or submit a pull request.