Benchmark
GPQA Diamond
reasoning
Graduate-level science reasoning benchmark. Tests deep reasoning across physics, chemistry, and biology at PhD-level difficulty.
Interpretation
GPQA Diamond is a reasoning benchmark that evaluates problem-solving capabilities. It ranks 20 models, from Gemini 3.1 Pro (91.9) at the top to Command R+ 2026 (50) at the bottom. Its results contribute to the reasoning score shown on model pages and in rankings.
Methodology: 198 four-choice questions written by domain experts in physics, chemistry, and biology at graduate-level difficulty.
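The page does not say how a score is computed from these questions. As a minimal sketch, assuming the score is plain percentage accuracy over the question set, it could be calculated as below. The function names and the letter labels for answers are hypothetical and chosen only for illustration.

```python
from typing import Sequence

NUM_QUESTIONS = 198  # from the methodology line
NUM_CHOICES = 4      # each question offers four options


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the answer key.

    Assumption: the benchmark score is simple accuracy; the source
    does not confirm the exact scoring rule.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("answer key must not be empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def chance_baseline_percent(num_choices: int = NUM_CHOICES) -> float:
    """Expected score from uniform random guessing among the options."""
    return 100.0 / num_choices


if __name__ == "__main__":
    # Toy answer key with four questions; three predictions are right.
    print(accuracy_percent(["A", "B", "C", "D"], ["A", "B", "C", "A"]))
    print(chance_baseline_percent())
```

Under this assumption, random guessing on four-choice questions yields 25, and each of the 198 questions is worth roughly 0.5 points, so a score of 50 would correspond to 99 correct answers.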