How BigCode Benchmarks StarCoder2: A Code-LLM Eval Guide

StarCoder2 ships in three sizes — 3B, 7B, and 15B — trained by BigCode on The Stack v2, a corpus drawn from Software Heritage spanning 619 programming languages.
StarCoder2-15B posts 72.6% pass@1 on HumanEval, 75.2% on MBPP, and 62.0% on DS-1000, matching or beating CodeLlama-34B despite being less than half its size.
The release is built around a reproducible evaluation suite — HumanEval(+), MBPP(+), DS-1000, and MultiPL-E — that lets teams compare same-size models on the same ground.

Overview

StarCoder2 is an open code-generation model family from BigCode, the open-science collaboration stewarded by Hugging Face and ServiceNow, which built the training data in partnership with Software Heritage. The February 2024 release paired three models (3B, 7B, 15B) with The Stack v2, a training dataset roughly four times the size of the original StarCoder corpus. What makes it useful beyond the weights is the evaluation: the team measured every model on a published set of public benchmarks and released the data provenance behind the training set, so the numbers can be checked rather than taken on faith.

For anyone deciding how to test AI-generated code, the StarCoder2 report doubles as a worked example of how to benchmark a code model honestly.

What the benchmarks actually measure

The four benchmarks in the suite each probe a different slice of code generation, and knowing what they do and do not test is half the value.

HumanEval and MBPP — the functional baselines

HumanEval is 164 hand-written Python problems, each with a function signature, a docstring, and a set of unit tests that the model never sees. The model reads the signature and docstring and has to write the body. Scoring uses pass@1: the fraction of problems where the first generated solution passes all the tests. MBPP (Mostly Basic Python Problems) is scored the same way over a larger, simpler set of around 1,000 tasks, each posed as a short natural-language description. Both reward functional correctness (does the code run and return the right answer), not style or readability.
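For readers who want to compute the metric themselves, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval. It assumes you have drawn n samples per problem and counted how many passed (c); the function names and the toy results are illustrative, and the sampling settings behind the StarCoder2 numbers are not restated here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed every test, k: budget.
    Computes 1 - C(n - c, k) / C(n, k): the chance that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy data (made up): four problems, one greedy sample each.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
toy_score = benchmark_pass_at_k(toy_results, k=1)
```

With a single sample per problem the estimator reduces to the plain fraction of problems solved, which is the pass@1 definition given above.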

The “plus” variants matter. HumanEval+ and MBPP+ (from the EvalPlus project) bolt on far more test cases per problem, exposing solutions that pass the original handful of tests but break on edge cases. StarCoder2-15B drops from 72.6% on HumanEval to 63.4% on HumanEval+, and from 75.2% on MBPP to 61.2% on MBPP+. That gap is the point: thin test suites overstate how good a model really is.
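A toy illustration of why extra tests change the verdict. The candidate function and both test lists below are invented for this sketch and are not taken from HumanEval or EvalPlus; they only show the mechanism of a solution that clears a thin suite and fails an expanded one.

```python
def candidate_median(xs):
    """Hypothetical model output: only correct for odd-length input."""
    s = sorted(xs)
    return s[len(s) // 2]


# A thin suite: every case happens to have an odd number of elements.
BASE_TESTS = [
    ([3, 1, 2], 2),
    ([5], 5),
    ([9, 7, 8, 1, 4], 7),
]

# Added edge cases: even-length inputs, where the median is an average.
EXTRA_TESTS = [
    ([1, 2, 3, 4], 2.5),
    ([-1, 0], -0.5),
]


def passes(func, tests) -> bool:
    """True only if func returns the expected value for every test case."""
    for arg, expected in tests:
        try:
            if func(arg) != expected:
                return False
        except Exception:
            # A crash counts as a failure, as in execution-based scoring.
            return False
    return True


passes_base = passes(candidate_median, BASE_TESTS)
passes_plus = passes(candidate_median, BASE_TESTS + EXTRA_TESTS)
```

Here `passes_base` is `True` and `passes_plus` is `False`: the same solution counts as solved under the thin suite and unsolved under the expanded one.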

DS-1000 — code that does real work

DS-1000 is 1,000 data-science problems pulled from Stack Overflow, covering libraries like NumPy, pandas, scikit-learn, and Matplotlib. These tasks are closer to what a working engineer writes than the algorithmic puzzles in HumanEval, and they require correct use of specific library APIs. StarCoder2-15B scores 62.0% pass@1 here, the strongest result among the large open models in the comparison.

MultiPL-E — beyond Python

HumanEval is Python-only, which hides whether a model can actually write Go, Rust, or Julia. MultiPL-E addresses this by translating the HumanEval prompts and their unit tests into 18 languages, so the same logical task is tested across an entire language matrix. StarCoder2-15B leads on 16 of those 18 languages among large models and edges out DeepSeekCoder-33B on several low-resource ones, including D, Julia, Lua, and Perl.
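A multi-language run is easiest to read as one pass rate per language. The sketch below assumes a flat list of (language, problem id, passed) records with one generation per problem; that record shape and the sample rows are assumptions for illustration, not the output format of the MultiPL-E tooling.

```python
from collections import defaultdict


def pass_at_1_by_language(records):
    """Return {language: fraction of problems whose single sample passed}.

    records: iterable of (language, problem_id, passed) tuples,
    one record per problem per language.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for language, _problem_id, ok in records:
        totals[language] += 1
        passed[language] += int(bool(ok))
    return {language: passed[language] / totals[language] for language in totals}


# Made-up records: the same two problems attempted in two languages.
sample_records = [
    ("python", "problem_0", True),
    ("python", "problem_1", False),
    ("lua", "problem_0", True),
    ("lua", "problem_1", True),
]
per_language = pass_at_1_by_language(sample_records)
```

Reporting the whole dictionary rather than its average keeps a weak low-resource language from hiding behind a strong Python score.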

How the evaluation was kept reproducible

The headline numbers are only useful if someone else can regenerate them. BigCode leaned on three practices worth copying. First, every benchmark is public and execution-based — solutions are run against unit tests, not graded by a model or a human, so the score is deterministic given the same generations. Second, the team released the Software Heritage persistent identifiers (SWHIDs) for the training data, which means the exact source files behind the model are traceable rather than a vague “scraped from the web.” Third, results are reported per model size against same-size peers, so a 3B model is judged against other 3B models instead of being flattered by comparison to something tiny.
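Execution-based scoring can be sketched in a few lines: write the generated solution and its tests to a file, run it in a fresh interpreter, and treat a zero exit code as a pass. This is a minimal sketch, not the harness BigCode used; `run_candidate` and its parameters are hypothetical names, and a subprocess with a timeout is not a security sandbox, so untrusted generations still need real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(solution: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run solution + tests in a separate Python process.

    Returns True only if the process exits with code 0 before the timeout.
    Failed assertions, exceptions, and timeouts all count as failures.
    NOTE: this isolates crashes, not malicious code.
    """
    program = solution + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
    return proc.returncode == 0


# Illustrative inputs: one correct and one buggy "generation".
TESTS = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"
good = run_candidate("def add(a, b):\n    return a + b", TESTS)
bad = run_candidate("def add(a, b):\n    return a - b", TESTS)
```

Because the verdict depends only on the generated text and the test file, anyone holding the same generations can rerun the check and get the same pass or fail result.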

The size-matched framing produces the most quotable findings: StarCoder2-3B outperforms other code LLMs of its size on most benchmarks and even beats the older StarCoderBase-15B, while StarCoder2-15B matches or outperforms CodeLlama-34B, a model more than twice its parameter count. DeepSeekCoder-33B still leads on high-resource language completion, and the paper says so plainly rather than cherry-picking.

Why it matters for code-generation QA

The StarCoder2 evaluation is a clean template for anyone building a QA process around AI-generated code, and its limits are as instructive as its results. The strongest idea here is execution-based scoring with expanded test suites. The HumanEval-to-HumanEval+ drop of nine points is a direct warning: if your acceptance check for generated code is a couple of happy-path tests, you are measuring almost nothing. Real evaluation needs adversarial and edge-case coverage, or it will green-light code that fails in production.

The caveats are real. pass@1 on a fixed benchmark says nothing about security, performance, maintainability, or whether the code matches your codebase’s conventions — all things a human reviewer or a dedicated QA layer still has to catch. Benchmark contamination is a live concern too: as these problem sets age, they leak into training data, and a high score can reflect memorization rather than reasoning. DS-1000 and MultiPL-E partly hedge this by using library-specific and multi-language tasks that are harder to game, which is why a balanced suite beats any single number.
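One rough way to screen for contamination is to measure how many of a benchmark prompt's word n-grams appear verbatim in a sample of training text. The sketch below is a simple heuristic under stated assumptions (lowercased word tokens, exact n-gram matches, a default window of 8 chosen arbitrarily); it is not the decontamination procedure used for The Stack v2, and a low score does not prove a benchmark is clean.

```python
import re


def word_ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams in text (empty if text is too short)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark's n-grams that also occur in corpus_text."""
    bench = word_ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & word_ngrams(corpus_text, n)) / len(bench)


# Made-up strings; n=3 keeps the example short.
prompt = "Return the sum of two numbers"
leaked = "def add(a, b):  # return the sum of two numbers"
clean = "Parse a config file and print each key"
leaked_ratio = overlap_ratio(prompt, leaked, n=3)
clean_ratio = overlap_ratio(prompt, clean, n=3)
```

In this toy case `leaked_ratio` is 1.0 and `clean_ratio` is 0.0. On real data the useful signal is the outliers: prompts with unusually high overlap are the ones whose scores deserve suspicion.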

The practical takeaway: treat published benchmarks as a floor, not a verdict. They tell you which model is worth piloting; they do not tell you whether its output is safe to ship. Pair functional benchmarks with your own execution tests, security checks, and review on the code paths that actually matter to you — the gap between “passes HumanEval” and “passes your test suite” is exactly where QA lives.

Read the original: StarCoder 2 and The Stack v2: The Next Generation — Hugging Face / BigCode, 2024-02-29.
