Benchmarks

Benchmark explorer

14 tracked

Each benchmark page explains what is measured, where the leaderboard stands, and how to interpret the results.

LiveCodeBench

Competitive programming benchmark built from recently published contest problems to limit training-data contamination. Measures code generation along with related skills such as self-repair, code execution, and test output prediction.

MMLU-Pro

Advanced reasoning and domain breadth benchmark. Tests knowledge and reasoning across 14 disciplines spanning STEM, humanities, social sciences, and professional domains, with up to ten answer options per question.

Math Arena

Structured mathematical reasoning benchmark. Evaluates step-by-step problem solving, proof construction, and mathematical abstraction on competition-level problems.

Vision Vista

Synthetic multimodal benchmark for image understanding and analysis. Tests visual reasoning, OCR, document understanding, and image captioning.

HumanEval+

Function-level code generation benchmark. Tests whether models can write correct Python functions from docstrings, with expanded test coverage.
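
To illustrate why expanded test coverage matters, here is a minimal sketch in Python. The function `candidate_median`, the helper `passes`, and both test lists are hypothetical and are not taken from the benchmark itself; they only show how a plausible-looking solution can pass a small test set and still fail once edge cases are added.

```python
def candidate_median(xs):
    """Return the median of a list of numbers."""
    s = sorted(xs)
    # Deliberate bug for illustration: ignores the even-length case.
    return s[len(s) // 2]


def passes(func, tests):
    """Return True if func returns the expected value for every (args, expected) pair."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test.
            return False
    return True


# Hypothetical small test set: odd-length inputs only.
base_tests = [
    (([3, 1, 2],), 2),
    (([5],), 5),
]

# Hypothetical expanded test set: adds even-length inputs.
extra_tests = base_tests + [
    (([1, 2, 3, 4],), 2.5),
    (([-1, 1],), 0.0),
]

base_ok = passes(candidate_median, base_tests)
extra_ok = passes(candidate_median, extra_tests)
print("passes base tests:", base_ok)
print("passes expanded tests:", extra_ok)
```

In this sketch the candidate passes the base tests but fails the expanded ones, which is the kind of false positive that additional tests are meant to catch.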

SWE-Bench Verified

Real-world software engineering benchmark. Tests ability to resolve actual GitHub issues in large open-source repositories.

GPQA Diamond

Graduate-level science reasoning benchmark. Tests deep reasoning across physics, chemistry, and biology at PhD-level difficulty.

ARC-Challenge

Grade-school science reasoning benchmark. Tests common-sense reasoning and scientific knowledge on multiple-choice questions.

AIME 2025

American Invitational Mathematics Examination. Competition-level math problems testing advanced mathematical reasoning and problem-solving.
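
AIME answers are integers from 0 to 999, so results are commonly scored by exact match. The sketch below shows one way that could look in Python. The extraction rule (take the last integer in the model output) and the sample outputs are assumptions made for illustration, not the method used by any particular leaderboard.

```python
import re


def extract_answer(text):
    """Return the last integer in text if it is a valid AIME answer (0 to 999), else None."""
    matches = re.findall(r"\d+", text)
    if not matches:
        return None
    value = int(matches[-1])
    return value if 0 <= value <= 999 else None


def accuracy(outputs, gold):
    """Fraction of outputs whose extracted answer exactly matches the gold answer."""
    correct = sum(extract_answer(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


# Made-up model outputs and gold answers, for illustration only.
sample_outputs = [
    "Summing the cases gives 204",
    "I get 1000",   # out of range, treated as no answer
    "no idea",      # no integer found
]
sample_gold = [204, 100, 7]

score = accuracy(sample_outputs, sample_gold)
print(f"accuracy: {score:.3f}")
```

Exact match leaves no room for partial credit, so a single arithmetic slip at the end of an otherwise sound solution scores the same as no attempt.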

HLE (Humanity's Last Exam)

Expert-level reasoning benchmark designed to remain difficult for frontier AI systems. Tests closed-ended, expert-written questions across a wide range of academic subjects.

MMMU

Massive Multi-discipline Multimodal Understanding. Tests college-level visual reasoning across 30 subjects with images, charts, and diagrams.

GSM8K

Grade School Math 8K. Tests basic multi-step mathematical reasoning with about 8,500 grade-school math word problems.

GlobalMMLU

Multilingual reasoning benchmark. Tests knowledge and reasoning across 40+ languages and cultural contexts.

Terminal-Bench 2.0

Agentic terminal benchmark. Tests ability to use command-line tools, debug systems, and complete infrastructure tasks.