Benchmark
HLE (Humanity's Last Exam)
Category: reasoning
Expert-level reasoning benchmark designed to be extremely difficult for current AI, testing the frontier of model capability across a broad range of domains.
Interpretation
HLE (Humanity's Last Exam) is a benchmark evaluating expert-level reasoning and problem-solving capabilities. It ranks 11 models, from Gemini 3.1 Pro (37.5) at the top down to GPT-4o (8.5). Scores on this benchmark contribute to the reasoning score shown on model pages and rankings.
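The site does not document how a benchmark score feeds into the category score. As a rough illustration only, one plausible approach is min-max normalization across the ranked models; the formula and all names below are assumptions, with only the two endpoint scores taken from the ranking above.

```python
# Hypothetical sketch: min-max normalizing raw HLE scores into a 0-100
# "reasoning" contribution. This aggregation is an assumption, not the
# site's documented method.
hle_scores = {"Gemini 3.1 Pro": 37.5, "GPT-4o": 8.5}  # endpoints from the ranking

lo, hi = min(hle_scores.values()), max(hle_scores.values())

def normalized(score: float) -> float:
    """Scale a raw HLE score to 0-100 relative to the ranked models."""
    return 100.0 * (score - lo) / (hi - lo)

for model, score in hle_scores.items():
    print(f"{model}: raw={score}, normalized={normalized(score):.1f}")
```

Under this sketch, the top-ranked model maps to 100 and the bottom-ranked to 0; any real aggregation may instead use fixed scales or weighting across multiple benchmarks.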
Methodology: Expert-written questions spanning science, math, the humanities, and professional domains, designed to stress-test the limits of AI reasoning.
Source: https://lastexam.ai/