Benchmark

SWE-Bench Verified

coding

Real-world software engineering benchmark. Tests ability to resolve actual GitHub issues in large open-source repositories.

Interpretation

SWE-Bench Verified is a coding benchmark that evaluates code generation and software engineering capabilities. It ranks 19 models, from Claude Opus 4.6 at the top (score 80) to Claude Haiku 4.5 at the bottom (score 55). This benchmark contributes to the coding score shown on model pages and in rankings.

Methodology: The benchmark consists of 500 human-verified GitHub issues from popular Python repositories. For each issue, a model must generate a patch that makes the repository's tests pass.
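As a rough illustration of how a score could be derived from this methodology, the sketch below computes the percentage of issues whose patch passed the tests. It assumes the reported score is a resolved-issue percentage, which this page does not state explicitly. The `IssueResult` record, the `resolved_percentage` function, and the sample data are hypothetical and are not part of any official SWE-Bench tooling.

```python
from dataclasses import dataclass


@dataclass
class IssueResult:
    """Hypothetical record: outcome of running the tests after applying a patch."""
    issue_id: str
    tests_passed: bool


def resolved_percentage(results: list[IssueResult]) -> float:
    """Return the percentage of issues whose patch passed the tests.

    Assumption: the benchmark score is resolved issues divided by total issues.
    """
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Illustrative data only: 500 issues, the first 350 marked as resolved.
results = [IssueResult(f"issue-{i}", tests_passed=(i < 350)) for i in range(500)]
print(resolved_percentage(results))  # prints 70.0
```

Under this assumption, a score of 70 would correspond to 350 of the 500 issues being resolved.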

Source: https://www.swebench.com/