164 handwritten Python problems for evaluating code generation models.
The dataset consists of 164 handwritten Python problems. Each includes a function signature, a docstring, a reference implementation, and several unit tests. The problems were written by hand so that they would not appear in the training sets of code generation models.
It is useful for researchers evaluating large language models on programming tasks, code completion, and functional correctness in Python.
Run pass@k evaluations on LLMs by prompting them with function signatures and docstrings, then checking the generated code against the included unit tests.
Test new prompting strategies or fine-tuned models on the 164 problems to measure improvements in functional correctness.
Integrate the dataset into CI pipelines to automatically score internal code-completion tools before deployment.
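The pass@k metric mentioned in the use cases above is usually computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that passed its unit tests. The sketch below is one minimal way to write it; the function names `pass_at_k` and `mean_pass_at_k` are illustrative and not part of the dataset or any library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset has a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to the fraction of samples that passed, c / n, which makes a quick sanity check when wiring this into a scoring script.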
from datasets import load_dataset
ds = load_dataset("openai/openai_humaneval")
A benchmark of 164 handwritten Python programming problems with signatures, docstrings, and unit tests for evaluating code synthesis.
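To check a model completion against a problem's unit tests, one common approach is to concatenate the prompt, the completion, and the test code, then call the test function on the entry point. The sketch below assumes each record carries `prompt`, `test`, and `entry_point` fields and that the test code defines `check(candidate)`; verify this against the records you actually load. The `example` record is hypothetical and stands in for a real one so the snippet runs offline, and `build_check_program` and `passes` are illustrative names.

```python
def build_check_program(record: dict, completion: str) -> str:
    """Join prompt, completion, and tests into one runnable source string."""
    return (
        record["prompt"] + completion + "\n\n"
        + record["test"] + "\n\n"
        + f"check({record['entry_point']})\n"
    )


def passes(record: dict, completion: str) -> bool:
    """Return True if the completion satisfies the record's unit tests.

    WARNING: exec runs model-generated code in this process. Use a sandbox
    (separate process, container, timeouts) for anything untrusted.
    """
    program = build_check_program(record, completion)
    try:
        exec(program, {"__name__": "__check__"})
    except Exception:
        return False
    return True


# Hypothetical record mimicking the assumed field layout.
example = {
    "task_id": "Example/0",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}

good = passes(example, "    return a + b\n")
bad = passes(example, "    return a - b\n")
```

Counting the passing completions per problem gives the c value needed for a pass@k calculation.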