Benchmark
SWE-Bench Verified
Category: coding
Real-world software engineering benchmark. Tests ability to resolve actual GitHub issues in large open-source repositories.
Interpretation
SWE-Bench Verified is a coding benchmark that evaluates code generation and software engineering capabilities. It ranks 19 models, with scores ranging from 80 (Claude Opus 4.6) down to 55 (Claude Haiku 4.5). The benchmark contributes to the coding score shown on model pages and in rankings.
Methodology: 500 verified GitHub issues from popular Python repositories. Models must generate patches that pass existing test suites.
Source: https://www.swebench.com/
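As a rough illustration of the methodology above, the sketch below shows how a score could be derived from per-issue outcomes: an issue counts as resolved only if the model's patch applies and every test in that issue's suite passes, and the score is the percentage of resolved issues. The names `InstanceResult`, `is_resolved`, and `resolved_rate` are hypothetical and are not taken from the official SWE-bench harness. Treating the published score as a percentage of resolved issues is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of running one issue's test suite after applying a model's patch.

    Hypothetical record type for illustration; not the official harness format.
    """
    instance_id: str      # illustrative identifier for the GitHub issue
    patch_applied: bool   # whether the generated patch applied cleanly
    tests_passed: int     # number of tests that passed after patching
    tests_total: int      # number of tests in the issue's suite


def is_resolved(result: InstanceResult) -> bool:
    """An issue is resolved only if the patch applied and all tests passed."""
    return (
        result.patch_applied
        and result.tests_total > 0
        and result.tests_passed == result.tests_total
    )


def resolved_rate(results: list[InstanceResult]) -> float:
    """Return the percentage of issues resolved (0.0 for an empty list)."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)


# Made-up sample data: two of four issues are fully resolved.
sample = [
    InstanceResult("issue-a", True, 12, 12),
    InstanceResult("issue-b", True, 9, 10),    # one failing test
    InstanceResult("issue-c", False, 0, 8),    # patch did not apply
    InstanceResult("issue-d", True, 5, 5),
]
print(f"Resolved: {resolved_rate(sample):.1f}%")
```

Under this all-or-nothing rule, a patch that fixes most of an issue but leaves one test failing earns no credit, which is why scores on this kind of benchmark can move noticeably with small changes in patch quality.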