Benchmark

ARC-Challenge

reasoning

Grade-school science reasoning benchmark. Tests common-sense reasoning and scientific knowledge on multiple-choice questions.

Interpretation

ARC-Challenge evaluates reasoning and problem-solving on science questions. The leaderboard ranks 13 models, from GPT-5.4 (98.5) at the top to Llama 4 Maverick (95) at the bottom. Results on this benchmark feed into the reasoning score shown on model pages and in rankings.

Methodology: 2,590 challenging multiple-choice science questions drawn from grade-school exams. Answering them requires both knowledge retrieval and reasoning.
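
As a rough illustration of how a multiple-choice benchmark of this kind is commonly scored, the sketch below computes percent accuracy over a set of questions. It assumes the reported scores are plain accuracy percentages, which this page does not state. The `Question` class, the `accuracy` function, and the two sample questions are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Question:
    """One multiple-choice item (hypothetical structure, for illustration)."""
    stem: str
    choices: Dict[str, str]  # choice label -> choice text
    answer_key: str          # label of the correct choice


def accuracy(questions: List[Question], predictions: List[str]) -> float:
    """Return the percentage of questions whose predicted label matches the key."""
    if not questions:
        raise ValueError("need at least one question")
    if len(questions) != len(predictions):
        raise ValueError("one prediction is required per question")
    correct = sum(
        1 for q, p in zip(questions, predictions) if p == q.answer_key
    )
    return 100.0 * correct / len(questions)


# Illustrative questions written for this sketch, not drawn from ARC.
sample_questions = [
    Question(
        stem="Which form of energy does a stretched rubber band store?",
        choices={"A": "thermal", "B": "elastic potential", "C": "nuclear"},
        answer_key="B",
    ),
    Question(
        stem="Which gas do plants take in during photosynthesis?",
        choices={"A": "oxygen", "B": "nitrogen", "C": "carbon dioxide"},
        answer_key="C",
    ),
]

# A model that answers the first question correctly and the second incorrectly.
score = accuracy(sample_questions, ["B", "A"])
print(f"accuracy: {score:.1f}")
```

Under that assumption, a score such as 98.5 would mean the model chose the keyed answer on 98.5% of the questions it was evaluated on.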

Source: https://allenai.org/data/arc