Benchmark
HLE (Humanity's Last Exam)
Category: reasoning
Expert-level reasoning benchmark designed to be extremely difficult for current AI, testing the frontier of model capability across a broad range of domains.
Interpretation
HLE (Humanity's Last Exam) is a benchmark evaluating expert-level reasoning and problem-solving capabilities. It ranks 11 models, from Gemini 3.1 Pro (37.5) at the top down to GPT-4o (8.5). Scores on this benchmark contribute to the reasoning score shown on model pages and rankings.
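The site does not document how a benchmark score feeds into the category score. As a rough illustration only, one plausible approach is min-max normalization across the ranked models; the formula and all names below are assumptions, with only the two endpoint scores taken from the ranking above.

```python
# Hypothetical sketch: min-max normalizing raw HLE scores into a 0-100
# "reasoning" contribution. This aggregation is an assumption, not the
# site's documented method.
hle_scores = {"Gemini 3.1 Pro": 37.5, "GPT-4o": 8.5}  # endpoints from the ranking

lo, hi = min(hle_scores.values()), max(hle_scores.values())

def normalized(score: float) -> float:
    """Scale a raw HLE score to 0-100 relative to the ranked models."""
    return 100.0 * (score - lo) / (hi - lo)

for model, score in hle_scores.items():
    print(f"{model}: raw={score}, normalized={normalized(score):.1f}")
```

Under this sketch, the top-ranked model maps to 100 and the bottom-ranked to 0; any real aggregation may instead use fixed scales or weighting across multiple benchmarks.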
Methodology: Expert-written questions spanning science, math, the humanities, and professional domains, designed to stress-test the limits of AI reasoning.
Source: https://lastexam.ai/