LLM Atlas

Benchmark

GPQA Diamond

reasoning

Graduate-level science reasoning benchmark. Tests deep reasoning across physics, chemistry, and biology at PhD-level difficulty.

Interpretation

GPQA Diamond evaluates reasoning and problem-solving capabilities. This page ranks 20 models on it, from Gemini 3.1 Pro (91.9) at the top to Command R+ 2026 (50) at the bottom. Scores on this benchmark contribute to the reasoning score shown on model pages and in rankings.

Methodology: 198 four-option multiple-choice questions written by domain experts in physics, chemistry, and biology at graduate-level difficulty.
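To put the scores in context, here is a minimal sketch of how a score on a 198-question, four-option test could be computed and read. It assumes the reported score is plain percent-correct, which this page does not state, and the function names are hypothetical. The standard error uses a simple binomial approximation that treats questions as independent, which is also an assumption.

```python
from math import sqrt

NUM_QUESTIONS = 198  # size of the question set, from the methodology note
NUM_CHOICES = 4      # answer options per question


def accuracy_percent(predictions, answers):
    """Percent of questions answered correctly (assumed scoring rule)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def standard_error_percent(score_percent, n=NUM_QUESTIONS):
    """Binomial standard error of a percent score, assuming independent questions."""
    p = score_percent / 100.0
    return 100.0 * sqrt(p * (1.0 - p) / n)


# Expected score from uniform random guessing on four options.
chance_baseline = 100.0 / NUM_CHOICES

# How much a single question moves the score, in percentage points.
points_per_question = 100.0 / NUM_QUESTIONS
```

Under these assumptions, random guessing scores 25, one question is worth about half a point, and a score near the top of the ranking carries a standard error of roughly two points, so small gaps between adjacent models may not be meaningful.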

Source: https://huggingface.co/datasets/Idavidrein/gpqa