Benchmark
Terminal-Bench 2.0
coding
Agentic terminal benchmark. Tests ability to use command-line tools, debug systems, and complete infrastructure tasks.
Interpretation
Terminal-Bench 2.0 is a coding benchmark that evaluates code generation and engineering capabilities. It ranks 11 models, from Gemini 3.1 Pro (54.2) at the top to GPT-4o (28) at the bottom. Its results contribute to the coding score shown on model pages and in rankings.
Methodology: Real-world terminal tasks requiring command execution, debugging, and system administration. Tests agentic coding capability.
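To make the methodology concrete, here is a minimal sketch of how a terminal task might be checked and scored. It is not the Terminal-Bench harness: the names TerminalTask, run_task, and pass_rate are hypothetical, and it assumes a task passes when its command exits with status 0 and prints the expected output, and that a score is the percentage of tasks passed. The source does not state how scores such as 54.2 are computed.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TerminalTask:
    """Hypothetical task: a command to run and the output it should print."""
    name: str
    command: list[str]
    expected_stdout: str


def run_task(task: TerminalTask, timeout: float = 10.0) -> bool:
    """Run the command and report whether it met the (assumed) pass criteria."""
    try:
        result = subprocess.run(
            task.command,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable command counts as a failure.
        return False
    return result.returncode == 0 and result.stdout.strip() == task.expected_stdout


def pass_rate(tasks: list[TerminalTask]) -> float:
    """Percentage of tasks passed (assumed scoring rule, not from the source)."""
    if not tasks:
        return 0.0
    passed = sum(run_task(t) for t in tasks)
    return 100.0 * passed / len(tasks)


# Two toy tasks that use the current Python interpreter as the "tool".
example_tasks = [
    TerminalTask("prints-ok", [sys.executable, "-c", "print('ok')"], "ok"),
    TerminalTask("exits-nonzero", [sys.executable, "-c", "import sys; sys.exit(1)"], ""),
]
```

Calling pass_rate(example_tasks) runs both toy commands and returns the share that passed. A real agentic benchmark would also have the model choose the commands itself, typically inside an isolated environment, which this sketch leaves out.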