- OpenAI’s 2021 Codex paper introduced HumanEval, a benchmark of 164 hand-written Python problems scored by running the generated code against real unit tests.
- It popularized pass@k functional-correctness scoring: Codex solved 28.8% of problems with one sample (pass@1) and 70.2% when allowed 100 samples per problem (pass@100).
- Authored by Mark Chen and 57 colleagues (58 authors total), the paper set the template that nearly every later code-generation benchmark copied.
Overview
“Evaluating Large Language Models Trained on Code” is the paper that introduced Codex, the GPT model fine-tuned on public GitHub code that later powered GitHub Copilot. Its more durable contribution is methodological: it gave the field a way to measure whether generated code actually works, rather than whether it looks plausible.
Before this paper, code models were often graded on text-similarity metrics borrowed from machine translation, like BLEU. The OpenAI team argued that matching reference text is the wrong target for programs, and built an evaluation around execution instead. That choice, and the HumanEval set they released to support it, became the default for the next several years of code-model research.
How HumanEval is built
HumanEval is a set of 164 programming problems written by hand specifically for this evaluation. Each problem is a Python function signature plus a docstring describing the intended behavior, and the model’s job is to fill in the body. The problems cover the kind of work you would expect from an introductory-to-intermediate programmer: string manipulation, simple math, list processing, and basic algorithms.
The team wrote the problems from scratch on purpose. Models trained on GitHub have seen a large share of publicly available code, so any benchmark scraped from existing repositories risks overlapping with the training data. Original, previously unpublished problems reduce that contamination and make the score a closer read on generalization rather than memorization.
Crucially, every problem ships with its own unit tests, an average of 7.7 per problem. A candidate solution counts as correct only if it passes all of them.
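To make the setup concrete, here is a minimal sketch of how one such problem could be assembled and checked. The task, the names (`PROMPT`, `COMPLETION`, `TEST`, `passes_all_tests`), and the candidate body are all hypothetical and are not taken from the benchmark; the sketch only assumes the general shape described above, a signature plus docstring, a generated body, and a set of assertions.

```python
# Hypothetical HumanEval-style task; NOT an actual benchmark problem.
PROMPT = '''def clamp_all(values, low, high):
    """Return a new list with each value limited to the range [low, high].

    >>> clamp_all([-3, 2, 9], 0, 5)
    [0, 2, 5]
    """
'''

# A candidate function body, standing in for what a model might generate.
COMPLETION = "    return [min(max(v, low), high) for v in values]\n"

# Unit tests written as a check function that raises on the first failure.
TEST = '''def check(candidate):
    assert candidate([], 0, 1) == []
    assert candidate([-3, 2, 9], 0, 5) == [0, 2, 5]
    assert candidate([1, 2, 3], 2, 2) == [2, 2, 2]
'''


def passes_all_tests(prompt, completion, test, entry_point):
    """Return True only if prompt + completion passes every assertion in test."""
    namespace = {}
    try:
        # Assemble and define the function and the check() helper.
        exec(prompt + completion + "\n" + test, namespace)
        # Run the assertions against the generated function.
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


print(passes_all_tests(PROMPT, COMPLETION, TEST, "clamp_all"))
```

Note that `exec` runs whatever the model produced with the privileges of the calling process. A real harness would need sandboxing and a timeout around this step; the sketch omits both for brevity.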
Scoring with pass@k
Functional correctness is binary per problem: the code either passes the problem’s unit tests or it does not. The interesting question is how to score a model that samples many candidate solutions, since one good answer among many is still useful to a developer who can pick the working one.
That is what pass@k captures. It estimates the probability that at least one of k generated samples for a problem passes all its tests. The naive way to compute it (generate exactly k samples and check) has high variance. The paper instead generates a larger pool of n samples per problem (n = 200 in its experiments), counts how many of them pass, and applies an unbiased estimator to calculate pass@k for any k up to 100 from that pool. The per-problem estimates are then averaged, which keeps the numbers stable.
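The estimator can be sketched in a few lines. With n samples drawn for a problem, of which c pass, the unbiased estimate is 1 - C(n - c, k) / C(n, k), where C is the binomial coefficient. The version below computes that ratio as a running product to avoid very large intermediate values, which is the same idea as the reference snippet in the paper, rewritten here without NumPy. The function names and the `RESULTS` data are illustrative assumptions, not real benchmark output.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate of pass@k: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed all tests, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    # C(n - c, k) / C(n, k) equals the product of (1 - k / i) for i in n-c+1 .. n.
    ratio = 1.0
    for i in range(n - c + 1, n + 1):
        ratio *= 1.0 - k / i
    return 1.0 - ratio


def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts for three problems, for illustration only.
RESULTS = [(200, 0), (200, 10), (200, 200)]
for k in (1, 10, 100):
    print(k, round(benchmark_pass_at_k(RESULTS, k), 4))
```

For k = 1 the estimate reduces to c / n, the plain fraction of passing samples, which is a handy sanity check on any implementation.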
What the numbers showed
On HumanEval, Codex solved 28.8% of problems with a single sample. For comparison, GPT-3 with no code fine-tuning solved 0%, and the open GPT-J model solved 11.4%. The fine-tuning on code mattered enormously.
The result that reframed the field was pass@100: when Codex was allowed 100 samples per problem and credited if any one of them passed, it solved 70.2% of the set. A model that is right less than a third of the time on its first try is right most of the time if you let it guess repeatedly and have a way to check the answers. Tests give you exactly that check.
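A rough way to see why repeated sampling helps so much is to assume, as a simplification that the paper does not make, that each sample for a given problem passes independently with the same probability p. The chance that at least one of k samples passes is then 1 - (1 - p)^k. The p values below are hypothetical and chosen only to show the shape of the curve.

```python
def any_pass_probability(p: float, k: int) -> float:
    """P(at least one of k independent samples passes), given per-sample rate p."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample pass rates, for illustration only.
for p in (0.01, 0.05, 0.30):
    one = any_pass_probability(p, 1)
    hundred = any_pass_probability(p, 100)
    print(f"p={p:.2f}  k=1: {one:.3f}  k=100: {hundred:.3f}")
```

Under this assumption even a problem the model solves only rarely per attempt becomes likely to be solved somewhere in a pool of 100, while a problem with p = 0 stays unsolved no matter how many samples are drawn.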
Limitations the authors flagged
The paper is unusually candid about where Codex breaks. The authors found that its performance degrades as docstrings describe longer chains of operations, and that it is unreliable at binding operations to the correct variables. They also documented how sample-inefficient training is, the risk of misaligned outputs, and the safety and economic risks of deploying code generation at scale, alongside the obvious point that passing unit tests is not the same as being correct, secure, or maintainable.
Why it matters for code-generation QA
The lasting lesson here is that execution beats resemblance. If you are evaluating an AI coding tool, a similarity score against a reference answer tells you almost nothing; running the output against tests tells you whether it works. HumanEval made that the norm, and any serious code-gen QA pipeline should treat “does it pass tests” as the floor, not the ceiling.
The pass@k framing also has a sharp practical edge. A low pass@1 paired with a high pass@k is the signal behind agentic coding workflows: sample several solutions, run the tests, keep the one that passes. That only works when you have a reliable oracle, which is the catch. HumanEval supplies its own tests; your production code often does not, and writing a trustworthy test harness is frequently harder than generating the code under evaluation.
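The sample-and-filter loop, and its dependence on the oracle, can be sketched as follows. Everything here is a hypothetical illustration: `first_passing`, `make_oracle`, and the hard-coded `CANDIDATES` stand in for a real model and a real test suite, and the use of `exec` assumes trusted input, which generated code is not.

```python
from typing import Callable, Iterable, Optional


def first_passing(candidates: Iterable[str], oracle: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate source the oracle accepts, or None."""
    for source in candidates:
        if oracle(source):
            return source
    return None


def make_oracle(entry_point: str, cases) -> Callable[[str], bool]:
    """Build an oracle from (args, expected) pairs for the named function."""
    def oracle(source: str) -> bool:
        namespace = {}
        try:
            exec(source, namespace)  # unsafe for untrusted code; sandbox in practice
            func = namespace[entry_point]
            return all(func(*args) == expected for args, expected in cases)
        except Exception:
            # Syntax errors, missing functions, and runtime errors all reject.
            return False
    return oracle


# Hypothetical samples standing in for model output.
CANDIDATES = [
    "def add(a, b):\n    return a - b\n",  # wrong logic
    "def add(a, b)\n    return a + b\n",   # syntax error
    "def add(a, b):\n    return a + b\n",  # correct
]

strong = make_oracle("add", [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)])
weak = make_oracle("add", [((0, 0), 0)])

print(first_passing(CANDIDATES, strong) == CANDIDATES[2])
print(first_passing(CANDIDATES, weak) == CANDIDATES[0])
```

With the three-case oracle the loop selects the correct candidate. With the single weak case it accepts the subtracting version, because 0 - 0 also equals 0. The selection step is only as trustworthy as the tests behind it.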
Two caveats are worth carrying forward. First, 164 short, self-contained Python functions are not representative of real engineering, which involves large codebases, ambiguous requirements, and cross-file changes. MBPP and APPS, both released the same year (the Codex paper itself reports results on APPS), widen the pool of standalone problems, while later benchmarks like SWE-bench exist precisely because HumanEval saturated and underrepresents that complexity. Second, passing tests measures functional correctness only. It says nothing about security vulnerabilities, performance regressions, or readability, which is why test-based scoring belongs inside a broader QA strategy rather than standing in for one. HumanEval defined how to ask “does the generated code run correctly,” and that question is necessary but not sufficient.
Read the original: Evaluating Large Language Models Trained on Code — OpenAI, 2021-07-07.
