Code Generation QA: Best Practices for Evaluating AI Code

  • Score generated code by running it, not by comparing text: use execution-based unit tests with the unbiased pass@k estimator (sample n≥k, count correct, use the combinatorial formula), because match metrics like BLEU can’t recognize functionally equivalent programs that are written differently from the reference.
  • A passing benchmark score is only as trustworthy as its tests. Strengthened suites such as HumanEval+ add ~80× more tests and cut measured pass@k by 19–29% — weak tests don’t just inflate scores, they re-rank models.
  • For agentic, real-world evaluation, the harness is part of the experiment: the same model can swing from 2.7% to 28.3% across scaffolds, and container resources alone can move scores by up to 6 points. Document and match both, and distrust leaderboard gaps under ~3 points.

Overview

Code generation QA is the practice of deciding whether AI-written code is actually correct, safe, and good enough to ship — and whether the number you used to make that call means what you think it means. The hard part is rarely getting a model to produce code. It’s proving the code does what was asked, and proving your measurement wasn’t fooled by a weak test, a generous harness, or a problem the model had already seen.

The research consensus across the foundation-model labs points to a layered approach. Start with execution. Insist on test adequacy. Move to real-world tasks while pinning the environment. Defend against contamination. Then look past pass/fail to security, quality, and efficiency. The sections below walk through each layer, with the evidence behind it.

Start with execution, not text similarity

The first decision in code-gen QA is what counts as “correct.” Early work borrowed BLEU from machine translation, scoring generated code by token overlap with a reference solution. That breaks on code. As the OpenAI Codex team showed when introducing HumanEval, match-based metrics “are unable to account for the large and complex space of programs functionally equivalent to a reference solution” — two correct programs can look nothing alike, and a wrong program can score higher than a right one. Our explainer on CodeBLEU covers why even syntax- and dataflow-aware match metrics still fall short of running the code.
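To make that failure concrete, here is a minimal sketch that compares a text-overlap score with an execution check. The overlap function is a simple token-set Jaccard score standing in for match metrics (it is not BLEU), and the three `add` snippets and the single test case are made up for illustration.

```python
import re

# Three hypothetical snippets: a reference, a near-copy with a bug,
# and a correct solution that is written differently.
REFERENCE = "def add(a, b):\n    return a + b"
WRONG_BUT_SIMILAR = "def add(a, b):\n    return a - b"
RIGHT_BUT_DIFFERENT = "def add(x, y):\n    total = sum((x, y))\n    return total"


def tokenize(source: str) -> set:
    """Split source into identifier/number tokens and single punctuation marks."""
    return set(re.findall(r"\w+|[^\w\s]", source))


def token_overlap(candidate: str, reference: str) -> float:
    """Jaccard overlap of token sets: a crude stand-in for match metrics."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return len(cand & ref) / len(cand | ref)


def passes_tests(source: str) -> bool:
    """Execute the snippet and check its behavior.

    exec() is unsafe for untrusted code; a real harness would run this
    inside a sandbox with a timeout.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5


if __name__ == "__main__":
    for name, src in [("wrong", WRONG_BUT_SIMILAR), ("right", RIGHT_BUT_DIFFERENT)]:
        print(name, round(token_overlap(src, REFERENCE), 2), passes_tests(src))
```

With these inputs the buggy near-copy gets an overlap of about 0.82 and fails the test, while the correct rewrite gets about 0.47 and passes, so the text score ranks them in the wrong order.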

The reliable alternative is execution-based functional correctness: generate the code, run it against unit tests, and count it correct only if the tests pass. The standard way to report this is pass@k, the probability that at least one of k samples is correct. The catch is statistical. Estimating pass@k as 1−(1−p)^k, with p the empirical pass@1 rate, is biased and understates performance. The Codex paper’s fix, now universal, is to sample n≥k generations per task, count the c that pass, and use the combinatorial estimator 1−C(n−c, k)/C(n, k), averaged over tasks, which avoids that bias; computing it as a running product also sidesteps the numerical instability of large binomial coefficients. The practical rule: sample generously (the original used n=200), and never report pass@k from too few samples. The HumanEval and pass@k breakdown digs into the method, and the open-source BigCode evaluation harness shows what a reproducible execution sandbox looks like in practice.
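A short sketch of the estimator may help. It computes 1 − C(n−c, k)/C(n, k) as a running product, following the approach described in the Codex paper. The function names and the per-task counts in the example are assumptions chosen for illustration.

```python
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being reported.
    The ratio of binomials is evaluated as a product to avoid huge integers.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail


def naive_pass_at_k(n: int, c: int, k: int) -> float:
    """Biased plug-in estimate 1 - (1 - c/n)^k, shown only for comparison."""
    return 1.0 - (1.0 - c / n) ** k


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    return mean(pass_at_k(n, c, k) for n, c in results)


if __name__ == "__main__":
    # Hypothetical counts: (samples generated, samples passed) for three tasks.
    per_task = [(200, 30), (200, 0), (200, 200)]
    print(benchmark_pass_at_k(per_task, k=10))
    print(pass_at_k(4, 2, 2), naive_pass_at_k(4, 2, 2))
```

For the small case n=4, c=2, k=2 the unbiased value is 5/6 while the plug-in formula gives 0.75, which shows the direction of the bias.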

Demand test adequacy — a passing score is only as good as its tests

Execution fixes the metric. It does not fix the tests. HumanEval ships 164 hand-written problems with an average of 7.7 tests each — enough to be useful, far too few to be conclusive. When the EvalPlus team regenerated the test suites with LLM-seeded, type-aware mutation and expanded HumanEval roughly 80× (to about 13,000 tests, forming HumanEval+), measured pass@k fell by 19.3% at pass@1 and up to 28.9% at pass@100 across 26 models, GPT-4 and ChatGPT included. The strengthened tests caught working-looking code that was quietly wrong.

Two findings from that work deserve to change how you read any leaderboard. First, roughly 11% of HumanEval’s own ground-truth solutions were themselves incorrect — the answer key had bugs. Second, weak tests don’t just lower everyone’s score evenly; they mis-rank models. Under HumanEval+, models that trailed ChatGPT on the original benchmark overtook it. If your test suite is thin, your ranking is fiction.
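The effect of a thin suite can be sketched in a few lines. The example below checks a deliberately buggy `candidate_median` against a small hand-written suite, then against a larger suite built by differential testing on seeded random inputs. This is a simpler stand-in for EvalPlus’s LLM-seeded, type-aware mutation, not a reimplementation of it, and every name and input here is hypothetical.

```python
import random
from statistics import median as reference_median


def candidate_median(xs):
    """Plausible-looking model output: right for odd lengths, wrong for even."""
    ordered = sorted(xs)
    return ordered[len(ordered) // 2]


# A thin hand-written suite that happens to contain only odd-length inputs.
WEAK_TESTS = [[1], [3, 1, 2], [5, 4, 3, 2, 1]]


def generate_inputs(seed: int, count: int = 200, max_len: int = 8) -> list:
    """Seeded random integer lists, so the strengthened suite is reproducible."""
    rng = random.Random(seed)
    return [
        [rng.randint(-50, 50) for _ in range(rng.randint(1, max_len))]
        for _ in range(count)
    ]


def passes(candidate, inputs) -> bool:
    """Differential check: the candidate must agree with the reference everywhere."""
    return all(candidate(list(xs)) == reference_median(xs) for xs in inputs)


# Strengthened suite: original tests, one explicit even-length case, random inputs.
STRONG_TESTS = WEAK_TESTS + [[1, 2]] + generate_inputs(seed=0)

if __name__ == "__main__":
    print("weak suite passes:", passes(candidate_median, WEAK_TESTS))
    print("strong suite passes:", passes(candidate_median, STRONG_TESTS))
```

The candidate passes the weak suite and fails the strengthened one. Differential testing is only as good as its reference, so the reference solution has to be verified first.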

The practice that follows: treat test adequacy as a hard requirement, prefer strengthened or high-coverage suites, and confirm the reference solutions before trusting the grade. Several model reports now lead with EvalPlus-style numbers (see how DeepSeek-Coder-V2 reports against the strengthened suite), and benchmarks like BigCodeBench push toward realistic tasks with many tests per problem, which is the case for treating it as HumanEval’s successor. The same caution applies when reading any single-benchmark claim built on MBPP or reported for Code Llama, WizardCoder, StarCoder2, CodeQwen1.5, Qwen2.5-Coder (and its technical report), DeepSeek-Coder, or Mistral’s Codestral (25.01).

Evaluate real-world tasks — and pin the harness

Function-level puzzles tell you a model can write a sorting routine. They don’t tell you it can fix a bug in a real repository. That gap is why SWE-bench Verified exists: 500 real GitHub issues, each human-validated by expert annotators. The cleanup mattered. Screening the original pool found 38.3% of issues underspecified and 61.1% carrying unfair tests; 68.3% of samples were filtered out. Removing those impossible tasks roughly doubled GPT-4o’s score (16% to 33.2%), with gains appearing inside each difficulty bucket — the sign of removing noise, not cherry-picking easy problems. Anthropic’s minimal-agent write-up is a good companion on how the task is actually run.

Here is the part most leaderboards hide: on agentic benchmarks, the scaffold around the model often matters more than the model. OpenAI documented GPT-4 ranging from 2.7% to 28.3% on SWE-bench Lite depending only on the agent harness — a tenfold spread from the same weights. The Holistic Agent Leaderboard makes the same case at scale, arguing that evaluation has to vary models, scaffolds, and benchmarks together because “no single dimension suffices.” So report the scaffold as part of the result, not a footnote. You can see how different teams frame their agentic setups in Qwen3-Coder and Mistral’s Devstral line (v1, Small 1.1 / Medium, Devstral 2), and how the idea extends to ML engineering in MLE-bench. OpenAI has since gone further and argued the benchmark has aged out of usefulness for frontier models — see why they are retiring SWE-bench Verified.

Control infrastructure noise

Even with a fixed model and scaffold, the machine underneath can decide the result. Anthropic measured a 6-percentage-point gap on Terminal-Bench 2.0 between the most- and least-resourced setups (p<0.01), driven mainly by how memory limits were enforced: a hard kill at exactly 1× the expected allocation produced spurious out-of-memory failures (a 5.8% infra-error rate versus 0.5% uncapped), while a 3× ceiling cut those errors by about two-thirds without changing real success rates. Their infrastructure-noise study argues resource configuration should be a documented, controlled variable held to the same standard as prompt format or temperature.
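One way to treat resources as a documented variable is to keep them in a small config object that both launches the container and gets written into the results. The sketch below assumes a Docker-based harness and uses the `--memory` and `--cpus` flags of `docker run`. The class, the image name, and the outcome labels are hypothetical and not taken from any specific benchmark harness.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResourceConfig:
    expected_memory_mib: int  # what the task is expected to need
    memory_headroom: float    # 1.0 = hard cap at the expectation, 3.0 = 3x ceiling
    cpus: float

    @property
    def memory_limit_mib(self) -> int:
        return int(self.expected_memory_mib * self.memory_headroom)


def docker_run_args(image: str, cfg: ResourceConfig) -> list:
    """Build the argv for one task container with explicit resource limits."""
    return [
        "docker", "run", "--rm",
        f"--memory={cfg.memory_limit_mib}m",
        f"--cpus={cfg.cpus}",
        image,
    ]


def infra_error_rate(outcomes: list) -> float:
    """Share of runs lost to infrastructure, reported separately from task failures."""
    return sum(1 for o in outcomes if o == "infra_error") / len(outcomes)


if __name__ == "__main__":
    cfg = ResourceConfig(expected_memory_mib=2048, memory_headroom=3.0, cpus=2.0)
    print(docker_run_args("example/task-image:pinned", cfg))  # hypothetical image
    # Store the config next to the score so others can match the setup.
    print(json.dumps({"resources": asdict(cfg), "limit_mib": cfg.memory_limit_mib}))
```

Reporting the infra-error rate alongside the pass rate makes it visible when a tight memory cap, rather than the model, is producing failures.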

The downstream rule is a useful filter for reading any leaderboard: gaps below about 3 points deserve skepticism until the configuration is documented and matched. Naive binomial confidence intervals already span 1–2 points on these sample sizes; infrastructure confounders stack on top of that, not inside it. A standardized, containerized harness that runs every model in the same environment is the fix, and it pays for itself by parallelizing hundreds of tasks and removing implementation bugs.
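As a rough sketch of that filter, the snippet below computes a naive (normal-approximation) binomial interval for a pass rate and applies the roughly 3-point rule. The interval width depends on how many scored attempts sit behind each percentage, so the counts used here are placeholders, and the function names are hypothetical. It assumes attempts are independent, which repeated trials on the same task are not.

```python
import math


def wald_interval(passed: int, total: int, z: float = 1.96) -> tuple:
    """Naive binomial interval for a pass rate, as fractions in [0, 1].

    Uses the normal approximation p +/- z * sqrt(p * (1 - p) / total).
    """
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("require total > 0 and 0 <= passed <= total")
    p = passed / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def should_treat_as_tie(gap_points: float, configs_matched: bool,
                        noise_floor_points: float = 3.0) -> bool:
    """Apply the rule of thumb: small gaps are ties until configurations match."""
    return abs(gap_points) < noise_floor_points and not configs_matched


if __name__ == "__main__":
    # Placeholder counts; substitute the real number of scored attempts.
    low, high = wald_interval(passed=330, total=1000)
    print(f"pass rate interval: {low:.3f} to {high:.3f}")
    print(should_treat_as_tie(gap_points=2.0, configs_matched=False))
```

This only covers sampling noise. Infrastructure effects come on top of it, which is why the tie rule checks configuration separately from the gap size.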

Defend against contamination

A model that memorized the answer is not solving the problem. Because frontier models train on a large fraction of public GitHub, any benchmark built from public problems risks measuring recall instead of reasoning — HumanEval was hand-written precisely to reduce that risk, and even its authors called hand-writing a mitigation, not a guarantee.

Two defenses have held up. The first is temporal: live benchmarks like LiveCodeBench continuously collect new problems from contest sites (LeetCode, AtCoder, Codeforces) and tag each with a release date, then score a model only on problems published after its training cutoff. NVIDIA’s OpenCodeReasoning work leans on exactly this kind of contamination-aware evaluation. The competition-programming lineage applies the same logic with human-comparable ratings — CodeElo submits to live Codeforces, and the AlphaCode papers (blog, paper, AlphaCode 2) and OpenAI’s competitive-programming evaluation test on contests newer than the training data. The second defense is determinism: building SWE-bench Docker images from source pulls unpinned apt and PyPI packages, so images built on different days differ even from identical Dockerfiles. The fix is a registry of pre-built, pinned images so every run is the same run. And because contamination sometimes hides as behavior, LLM-aided inspection of execution logs has caught agents searching for the benchmark on Hugging Face instead of solving the task.
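The temporal defense reduces to a date filter. The sketch below assumes each problem carries a release date and that the model’s training cutoff is known. The `Problem` type, the identifiers, and the dates are hypothetical and do not reflect LiveCodeBench’s actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    source: str     # e.g. a contest site name
    released: date  # date the problem was first published


def post_cutoff(problems: list, training_cutoff: date) -> list:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


if __name__ == "__main__":
    # Made-up problems and cutoff, for illustration only.
    pool = [
        Problem("p-001", "contest-site-a", date(2023, 3, 1)),
        Problem("p-002", "contest-site-b", date(2024, 6, 15)),
        Problem("p-003", "contest-site-a", date(2024, 9, 30)),
    ]
    eligible = post_cutoff(pool, training_cutoff=date(2024, 4, 1))
    print([p.problem_id for p in eligible])
```

Because the eligible set differs per model, scores from different cutoffs are computed on different problems, so it helps to report the date window alongside the number.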

Look past pass/fail: security, quality, and efficiency

Correct is necessary, not sufficient. Code that passes every test can still be insecure, unmaintainable, or slow. Security is the dimension with the most mature tooling: static-analysis-based suites scan generated code for known weakness classes (CWEs) and measure how often models emit vulnerable patterns. Meta’s CyberSecEval line is the reference here — see our coverage of CyberSecEval, CyberSecEval 2, and CyberSecEval 3 — and treat insecure-generation rate as a first-class metric, not an afterthought.
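To show what an insecure-generation rate means mechanically, here is a toy scanner built on Python’s `ast` module. It flags two well-known patterns and maps them to CWE identifiers. It is far narrower than a real static-analysis suite and is not how CyberSecEval works; the rule table and the sample snippets are assumptions made for illustration.

```python
import ast

# Toy rule table: call name -> weakness class. Real suites cover far more.
RISKY_CALLS = {
    "eval": "CWE-95",       # eval injection
    "os.system": "CWE-78",  # OS command injection
}


def call_name(node: ast.Call) -> str:
    """Return 'name' or 'module.attr' for simple calls, else an empty string."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return ""


def find_weaknesses(source: str) -> list:
    """Statically scan one generated sample; the code is parsed, never executed."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # unparseable samples are a correctness problem, not scored here
    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        name = call_name(node)
        if name in RISKY_CALLS:
            findings.append(RISKY_CALLS[name])
        for kw in node.keywords:
            # Any call passing shell=True, e.g. subprocess.run(..., shell=True).
            if kw.arg == "shell" and isinstance(kw.value, ast.Constant) and kw.value.value is True:
                findings.append("CWE-78")
    return findings


def insecure_generation_rate(samples: list) -> float:
    """Fraction of samples with at least one flagged pattern."""
    return sum(1 for s in samples if find_weaknesses(s)) / len(samples)


if __name__ == "__main__":
    generated = [
        "import os\nos.system('ls ' + user_input)",
        "import subprocess\nsubprocess.run(['ls', path])",
    ]
    print(insecure_generation_rate(generated))
```

A pattern match like this says a risky construct is present, not that it is exploitable, so flagged samples still need review before they count against a model.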

Quality and human judgment round it out. A randomized controlled trial on GitHub Copilot’s effect on code quality pairs unit-test pass rates with blind human review of readability and maintainability. It is a reminder that some properties can only be scored reliably by a human, or by an execution-grounded preference arena like BigCodeArena, which lets reviewers run the code before voting (see its method paper). Efficiency and domain fit matter too: GPU and systems code need their own harnesses, as NVIDIA’s ComputeEval (2025.2 update) and the KernelBench kernel-generation work show, where a kernel must be both correct and fast. For the broader task taxonomy across understanding and generation, CodeXGLUE remains a useful map. (The published evidence here is thinner and more vendor-specific than for functional correctness, so weight these dimensions for your own risk profile rather than chasing a single headline number.)

A practical code generation QA checklist

  1. Grade by execution. Run the code against tests; never rank models on text-similarity metrics.
  2. Report unbiased pass@k. Sample n≥k generations, use the combinatorial estimator, and disclose n.
  3. Check test adequacy. Prefer strengthened/high-coverage suites; verify the reference solutions; assume thin tests inflate and mis-rank.
  4. Test real tasks. Use human-validated, repository-level benchmarks for anything beyond toy problems.
  5. Pin the harness and the hardware. Document the scaffold and resource limits; run every model in one containerized environment.
  6. Resist contamination. Favor live/post-cutoff benchmarks and deterministic, pre-built images; inspect logs for shortcuts.
  7. Measure beyond correctness. Add security (CWE/insecure-generation rate), quality, and efficiency to the scorecard.
  8. Respect the noise floor. Treat sub-3-point gaps as ties until configurations match.

Why it matters for code generation QA

Most published coding scores are higher than the truth, and the size of that gap depends on choices the scoreboard rarely shows: how many tests ran, which scaffold drove the agent, how much memory each task got, whether the model had seen the problem. Teams shipping AI-generated code can’t outsource that judgment to a leaderboard. The defensible position is to run your own execution-based evaluation, on adequate tests, in a pinned environment, against tasks that resemble your work — and to read every external benchmark through the failure modes above. That is the difference between a number that markets a model and a number you can stake a release on.

Primary sources: Evaluating Large Language Models Trained on Code (Chen et al., 2021, pass@k & HumanEval); Is Your Code Generated by ChatGPT Really Correct? (EvalPlus) (Liu et al., NeurIPS 2023); Introducing SWE-bench Verified (OpenAI, 2024); Quantifying infrastructure noise in agentic coding evals (Anthropic, 2026); LiveCodeBench (Jain et al., 2024); Holistic Agent Leaderboard (2026).