BigCode Eval Harness: pass@k Testing for Code Models

The BigCode Evaluation Harness runs code-generation models against HumanEval, HumanEval+, MBPP, MBPP+, APPS, DS-1000, and MultiPL-E from one command-line interface and scores their output with the pass@k metric.
It ships Dockerfiles for executing model-written code inside containers, so untrusted generations can run in a sandbox instead of directly on your machine.
BigCode (the Hugging Face and ServiceNow group behind SantaCoder and StarCoder) built it as the reference tool for reproducing the Big Code Models Leaderboard, and released it under Apache 2.0 on 2022-08-09.

Overview

The BigCode Evaluation Harness is a single framework for measuring how well an autoregressive model writes code. It bundles the standardized prompts, the execution sandbox, and the scoring math for a dozen-plus benchmarks so that two teams reporting a HumanEval number are measuring the same thing. The design comes from EleutherAI’s lm-evaluation-harness, specialized for the one task text harnesses cannot handle: running the output to see if it works.

Comparability is the hard part of code evaluation. A pass@1 score is only meaningful if everyone uses the same prompt format, test cases, sampling temperature, and execution environment. The harness pins all of those down, which is why it became the engine behind the Big Code Models Leaderboard.

What the harness evaluates

The framework covers two broad families. The first is Python code generation: HumanEval, HumanEval+, InstructHumanEval, MBPP, MBPP+, APPS, and DS-1000. Each gives the model a function signature or a natural-language description and expects executable Python back. The “+” variants keep the original problems but add far more test cases, catching solutions that pass the sparse original tests by luck.
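To make the effect of the extra tests concrete, here is a small sketch. The problem, the buggy solution, and both test lists are invented for illustration; none of them comes from HumanEval or HumanEval+.

```python
# Hypothetical example: not a real HumanEval problem or test suite.

def is_prime(n: int) -> bool:
    """Buggy candidate: it only rules out multiples of 2 and 3."""
    if n < 2:
        return False
    for d in (2, 3):
        if n % d == 0 and n != d:
            return False
    return True


# Sparse tests in the style of an original benchmark: (input, expected).
SPARSE_TESTS = [(2, True), (3, True), (4, False), (9, False), (11, True)]

# Extended tests in the style of a "+" variant: the same cases plus more.
EXTENDED_TESTS = SPARSE_TESTS + [(1, False), (25, False), (49, False), (97, True)]


def passes(candidate, tests) -> bool:
    """Return True only if the candidate is right on every test case."""
    return all(candidate(x) == expected for x, expected in tests)


if __name__ == "__main__":
    print("sparse tests passed:  ", passes(is_prime, SPARSE_TESTS))
    print("extended tests passed:", passes(is_prime, EXTENDED_TESTS))
```

The buggy candidate clears the sparse list because no test there has a smallest prime factor above 3. The extended list includes 25 and 49, which it wrongly reports as prime.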

The second family is multilingual and extended. MultiPL-E translates HumanEval into 18 programming languages, so you can ask whether a model strong in Python also handles Rust, Go, or Lua. HumanEvalPack extends HumanEval into three scenarios — synthesis, fixing, and explanation — across six languages via human translation. The harness also ships ReCode for perturbed-input stress tests, CoNaLa and Concode scored with BLEU, CodeXGLUE tasks, and GSM8K and GSM-HARD for program-aided math.

The pass@k metric

Functional correctness is the whole point, so the harness scores with pass@k rather than text similarity. Sample k candidate solutions for a problem, run them against the problem's unit tests, and count the problem solved if at least one candidate passes. Pass@1 measures whether the model gets it right on a single try; pass@10 and pass@100 measure whether the right answer appears anywhere in a batch of samples. The harness uses the unbiased estimator from the original Codex paper: generate n samples per problem with n at least k, count the c samples that pass, and compute the probability that a random draw of k of those n samples contains at least one passing solution. Averaged over all problems, this avoids the high variance of naively sampling exactly k times.
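The estimator from the Codex paper can be written as pass@k = 1 - C(n - c, k) / C(n, k), where n is the number of samples generated for a problem, c is how many of them pass, and C is the binomial coefficient. The sketch below is a minimal standalone version of that formula, not the harness's own code; the function names are chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that passed all tests
    k: budget being scored (must not exceed n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so every
    # draw of k samples then contains a passing one and the score is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Benchmark-level score: average the per-problem estimates."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Illustrative numbers only: 200 samples, 10 of them correct.
    for k in (1, 10, 100):
        print(f"pass@{k} = {pass_at_k(200, 10, k):.4f}")
```

For k = 1 the formula reduces to c / n, the plain fraction of passing samples. Larger k can only raise the score, which is why pass@100 always looks better than pass@1 for the same set of generations.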

The execution sandbox

Running model output is dangerous by default. A generated solution can contain an infinite loop, delete files, or open a network socket, and you do not get to inspect every line first. The harness handles this with Docker: it provides Dockerfiles that execute candidate code inside a container, isolated from the host. Execution is also gated behind an explicit --allow_code_execution flag, so you never run untrusted code by accident. Together, the flag and the container form the security model: contain the blast radius and make the dangerous step opt-in.
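The opt-in idea can be sketched in a few lines. This is an illustration of the pattern, not the harness's implementation: the function name, its parameters, and the status strings are all hypothetical. A child process with a timeout stops infinite loops but is not a sandbox, so code like this still belongs inside a container.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, tests: str,
                  allow_code_execution: bool = False,
                  timeout: float = 5.0) -> str:
    """Run a candidate plus its tests in a child Python process.

    Returns "passed", "failed", or "timeout". Refuses to run anything
    unless the caller opts in explicitly.
    """
    if not allow_code_execution:
        raise PermissionError(
            "Refusing to execute untrusted code; pass allow_code_execution=True."
        )
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,           # keep relative file writes in the temp dir
                capture_output=True,   # swallow whatever the candidate prints
                timeout=timeout,       # stop infinite loops
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    # A failed assert raises in the child, which exits with a nonzero code.
    return "passed" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    solution = "def add(a, b):\n    return a + b"
    print(run_candidate(solution, "assert add(1, 2) == 3",
                        allow_code_execution=True))
```

Making the default a refusal means a forgotten argument produces an error rather than an execution, which is the same trade-off the command-line flag makes.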

How a run is structured

The harness separates generation from evaluation, which is what makes it practical at scale. Run with --generation_only on a GPU box to produce and save samples, then evaluate them later with --load_generations_path on a separate machine that only needs Docker. Generation scales across multiple GPUs through Hugging Face accelerate. Key knobs include --n_samples for candidates per problem and --max_length_generation for the token budget, prompt included. That budget defaults to 512 and has to be raised by hand, to around 2048, for reasoning-heavy tasks like GSM8K.
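The two-phase workflow might look like the sketch below, which only builds and prints the command lines rather than running them. Several details are assumptions not stated above: the main.py entry point, the --model, --tasks, --save_generations, and --save_generations_path flags, and the model name, sample count, and file path, which are placeholders. Check the project's README or --help output before relying on any of them.

```python
import shlex

# Placeholder values chosen for illustration.
MODEL = "bigcode/santacoder"
TASK = "humaneval"
N_SAMPLES = "20"
GENERATIONS = "generations.json"  # hypothetical path; the file name actually
                                  # written may differ between versions

# Phase 1, on the GPU machine: generate samples and save them, no execution.
generate_cmd = [
    "accelerate", "launch", "main.py",
    "--model", MODEL,
    "--tasks", TASK,
    "--n_samples", N_SAMPLES,
    "--max_length_generation", "512",
    "--generation_only",
    "--save_generations",                    # assumed flag
    "--save_generations_path", GENERATIONS,  # assumed flag
]

# Phase 2, on the machine with Docker: load the samples and execute them.
evaluate_cmd = [
    "python", "main.py",
    "--model", MODEL,
    "--tasks", TASK,
    "--n_samples", N_SAMPLES,
    "--load_generations_path", GENERATIONS,
    "--allow_code_execution",
]

if __name__ == "__main__":
    print(shlex.join(generate_cmd))
    print(shlex.join(evaluate_cmd))
```

Only the second command carries --allow_code_execution, so the GPU machine never runs model output and the evaluation machine never needs a GPU.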

Any autoregressive model on the Hugging Face Hub works as input. The project was tested against code-specialized models of its era — SantaCoder, InCoder, CodeGen — but the interface is model-agnostic, which is why later leaderboards could feed in newer models without touching the evaluation code.

Why it matters for code-generation QA

If you evaluate AI-generated code, the harness is worth studying as a working definition of “do it the same way every time.” Most disagreements about model quality trace back to evaluation drift: a different prompt template, a higher sampling temperature, a missing test case, a host that times out differently than a container. By fixing the prompts, the tests, and the execution environment, the harness turns a fuzzy comparison into a reproducible one, and that reproducibility is the real product, more than any single benchmark.

The caveats live in the benchmarks, not the harness. HumanEval and MBPP are small and well-known, so modern models have almost certainly seen them during training — a high score can reflect memorization as much as skill. That is why HumanEval+ and MBPP+ exist: extra tests expose solutions overfit to the original handful. Pass@k also rewards a model that eventually lands a correct answer somewhere in many samples, which is not the same as being reliable on the first try. For production QA, pass@1 at low temperature is usually the number that matters, not an inflated pass@100.

The broader point: these benchmarks measure self-contained function synthesis, not the work most teams ship — multi-file changes, real dependencies, existing codebases. The harness is the right tool for ranking raw code-writing ability and reproducing published numbers honestly. Treat it as the foundation, then layer execution-based checks against your own code on top, because a model that aces HumanEval has only proven it can pass HumanEval.

Read the original: bigcode-evaluation-harness: A framework for the evaluation of autoregressive code generation language models — Hugging Face / BigCode, 2022-08-09.
