Benchmark
HumanEval+
coding
Function-level code generation benchmark. Tests whether models can write correct Python functions from docstrings, with expanded unit-test coverage over the original HumanEval.
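For illustration, a task in this style pairs a function signature and docstring with hidden unit tests. The example below is a hypothetical problem in the HumanEval format, not one drawn from the benchmark itself:

```python
# Hypothetical HumanEval-style task: the model sees the signature and docstring
# and must generate the body; unit tests then check the result.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i]."""
    # A completion the tests would accept:
    result: list[int] = []
    current = None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result

assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```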
Interpretation
HumanEval+ is a coding benchmark that evaluates code generation capability. It ranks 16 models, from GPT-5.4 (97) at the top to MiMo-V2-Flash (70.7) at the bottom. This benchmark contributes to the coding score on model pages and in the rankings.
Methodology: 164 Python programming problems, each paired with unit tests. Evaluation checks the functional correctness of code generated from natural-language specifications.
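A minimal sketch of how functional-correctness scoring of this kind works (illustrative only, using the toy problem above and a single completion; real harnesses such as EvalPlus sandbox execution, enforce timeouts, and use far larger test suites):

```python
# Illustrative scorer: run one model completion against a problem's unit tests
# and report pass/fail. Problem, completion, and tests are toy stand-ins.
PROMPT = '''
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[0..i]."""
'''

COMPLETION = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

TESTS = '''
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
assert running_max([-2, -5, -1]) == [-2, -2, -1]
'''

def passes_tests(prompt: str, completion: str, tests: str) -> bool:
    """Execute prompt + completion, then the tests; any exception counts as failure."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(tests, namespace)                # run the unit tests against it
        return True
    except Exception:
        return False

if __name__ == "__main__":
    print("passed:", passes_tests(PROMPT, COMPLETION, TESTS))
```

The number reported per model is typically pass@1: the fraction of the 164 problems whose generated solution passes every test on a single attempt.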