Benchmark
HumanEval+
coding
Function-level code generation benchmark. Tests whether models can write correct Python functions from docstrings, with expanded unit-test coverage over the original HumanEval.
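For illustration, a task in this style pairs a function signature and docstring with hidden unit tests. The example below is a hypothetical problem in the HumanEval format, not one drawn from the benchmark itself:

```python
# Hypothetical HumanEval-style task: the model sees the signature and docstring
# and must generate the body; unit tests then check the result.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i]."""
    # A completion the tests would accept:
    result: list[int] = []
    current = None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result

assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```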
Interpretation
HumanEval+ is a coding benchmark that evaluates code generation capability. It ranks 16 models, from GPT-5.4 (97) at the top to MiMo-V2-Flash (70.7) at the bottom. This benchmark contributes to the coding score on model pages and in the rankings.
Methodology: 164 Python programming problems, each paired with unit tests. Evaluation checks the functional correctness of code generated from natural-language specifications.
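A minimal sketch of how functional-correctness scoring of this kind works (illustrative only, using the toy problem above and a single completion; real harnesses such as EvalPlus sandbox execution, enforce timeouts, and use far larger test suites):

```python
# Illustrative scorer: run one model completion against a problem's unit tests
# and report pass/fail. Problem, completion, and tests are toy stand-ins.
PROMPT = '''
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[0..i]."""
'''

COMPLETION = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

TESTS = '''
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
assert running_max([-2, -5, -1]) == [-2, -2, -1]
'''

def passes_tests(prompt: str, completion: str, tests: str) -> bool:
    """Execute prompt + completion, then the tests; any exception counts as failure."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(tests, namespace)                # run the unit tests against it
        return True
    except Exception:
        return False

if __name__ == "__main__":
    print("passed:", passes_tests(PROMPT, COMPLETION, TESTS))
```

The number reported per model is typically pass@1: the fraction of the 164 problems whose generated solution passes every test on a single attempt.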