Benchmark

Terminal-Bench 2.0

coding

An agentic terminal benchmark that tests a model's ability to use command-line tools, debug systems, and complete infrastructure tasks.

Interpretation

Terminal-Bench 2.0 is a coding benchmark that evaluates code generation and engineering capabilities. It ranks 11 models, from Gemini 3.1 Pro (54.2) at the top to GPT-4o (28) at the bottom. Its results contribute to the coding score shown on model pages and in rankings.

Methodology: Models complete real-world terminal tasks that require command execution, debugging, and system administration. The benchmark tests agentic coding capability.
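To make the idea of scoring terminal tasks concrete, here is a minimal sketch of how a pass rate over such tasks could be computed. It is not the benchmark's actual harness. The `Task` type, the `run_check` and `pass_rate` helpers, and the convention that a verification command exiting with status 0 counts as a pass are all assumptions made for illustration, as is the reading of the score as a percentage of tasks passed.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical task record: a name plus a verification command (argv form)."""
    name: str
    check: List[str]


def run_check(task: Task, timeout: float = 30.0) -> bool:
    """Run the verification command; treat exit status 0 as a pass (assumed convention)."""
    try:
        result = subprocess.run(task.check, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable check counts as a failure.
        return False
    return result.returncode == 0


def pass_rate(tasks: List[Task]) -> float:
    """Return the percentage of tasks whose check passes (0.0 for an empty list)."""
    if not tasks:
        return 0.0
    passed = sum(run_check(task) for task in tasks)
    return 100.0 * passed / len(tasks)


if __name__ == "__main__":
    # Two stand-in checks: one that succeeds and one that fails.
    demo = [
        Task("always-passes", [sys.executable, "-c", "import sys; sys.exit(0)"]),
        Task("always-fails", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ]
    print(f"pass rate: {pass_rate(demo):.1f}")
```

In a real setup each check would inspect the state the agent left behind (files, services, test results) rather than run a fixed stand-in command.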