- Meta’s Code Llama (built on Llama 2) is evaluated on three benchmarks: HumanEval zero-shot, MBPP 3-shot, and MultiPL-E for multilingual code generation.
- At the 34B size the paper's results table lists 53.7% on HumanEval and 56.2% on MBPP (both pass@1) for the Code Llama – Python variant, state-of-the-art among open models at release; the family spans 7B, 13B, and 34B sizes, with 70B added later, in base, Python, and Instruct variants.
- All scores are graded by execution against hidden tests using the pass@k metric — code that runs and passes counts, code that merely looks right does not.
Overview
Code Llama is Meta’s family of open code models, released in August 2023 by Baptiste Rozière and colleagues at Meta GenAI and FAIR. The paper takes Llama 2, trains it further on code, and reports where the result lands against the standard public benchmarks for program synthesis.
The reason it stayed relevant has less to do with the model than with how it was measured. The evaluation recipe — HumanEval zero-shot, MBPP few-shot, MultiPL-E across languages, all execution-graded — became the default scorecard that later open code models were compared against. Reading the methodology tells you what a credible code-model claim should include.
The benchmark suite and how each one works
Three benchmarks carry most of the weight, and each tests a different thing.
HumanEval, zero-shot
HumanEval is 164 hand-written Python problems, each a function signature plus a docstring describing the behavior. The model sees the signature and the description and nothing else: zero-shot, no worked examples in the prompt. It writes the function body, and that body runs against unit tests the model never sees. If every test passes, the problem counts as solved. Zero-shot is the harder setting: the model gets no demonstration of the expected output format, so the score reflects what it can do cold.
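To make the format concrete, here is a minimal sketch of a HumanEval-style task and an execution check. The problem, the candidate body, and the `passes` helper are all invented for illustration; none of them comes from the benchmark or from the paper's harness, and a real harness would run untrusted code in a sandbox with a timeout rather than in-process.

```python
# Hypothetical task in the HumanEval shape: signature plus docstring only.
PROMPT = '''def running_max(xs):
    """Return a list whose i-th element is the largest value in xs[:i+1]."""
'''

# Stand-in for model output: just the function body, indented to match.
COMPLETION = '''    out, best = [], None
    for x in xs:
        best = x if best is None or x > best else best
        out.append(best)
    return out
'''

# Unit tests the model never sees while generating.
TESTS = '''
assert running_max([]) == []
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([-2, -5]) == [-2, -2]
'''


def passes(prompt, completion, tests):
    """Return True only if prompt + completion runs and every assertion holds."""
    namespace = {}
    try:
        # Illustration only: exec on untrusted model output is unsafe.
        exec(prompt + completion + tests, namespace)
    except Exception:
        return False
    return True


print(passes(PROMPT, COMPLETION, TESTS))        # correct body
print(passes(PROMPT, "    return xs\n", TESTS))  # plausible-looking but wrong body
```

The second call shows why execution grading matters: a body that returns the input unchanged satisfies the empty-list test but fails the others, so it scores zero for the problem.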
MBPP, 3-shot
MBPP (Mostly Basic Programming Problems) is a separate set of short Python tasks, each with a natural-language description and assert-style test cases. Code Llama is evaluated 3-shot here: three example problems and solutions go into the prompt before the real task. The few-shot framing is deliberate. It tells the model what a good answer looks like, which isolates synthesis ability from prompt-format confusion and makes MBPP a complement to HumanEval rather than a duplicate.
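A few-shot prompt of this kind can be sketched as below. The exemplar tasks, the `Task:` / `Tests:` / `Solution:` layout, and the helper names are assumptions made for illustration; the paper's exact template may differ.

```python
# Hypothetical exemplars; real evaluations draw these from the benchmark itself.
EXEMPLARS = [
    {"task": "Write a function that adds two numbers.",
     "tests": ["assert add(1, 2) == 3"],
     "solution": "def add(a, b):\n    return a + b"},
    {"task": "Write a function that reverses a string.",
     "tests": ["assert reverse('abc') == 'cba'"],
     "solution": "def reverse(s):\n    return s[::-1]"},
    {"task": "Write a function that checks whether a number is even.",
     "tests": ["assert is_even(4) is True"],
     "solution": "def is_even(n):\n    return n % 2 == 0"},
]


def format_block(task, tests, solution=None):
    """Render one problem; leave the solution slot open when solution is None."""
    lines = [f"Task: {task}", "Tests:"]
    lines += [f"  {t}" for t in tests]
    lines.append("Solution:")
    if solution is not None:
        lines.append(solution)
    return "\n".join(lines)


def build_few_shot_prompt(exemplars, task, tests):
    """Concatenate solved exemplars, then the real task with an empty solution."""
    blocks = [format_block(e["task"], e["tests"], e["solution"]) for e in exemplars]
    blocks.append(format_block(task, tests))
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    EXEMPLARS,
    "Write a function that squares a number.",
    ["assert square(3) == 9"],
)
print(prompt)
```

The prompt ends on an open `Solution:` line, so the three solved blocks show the model the expected answer format before it writes its own.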
MultiPL-E, multilingual
HumanEval and MBPP are Python-only. MultiPL-E translates HumanEval-style problems into many other languages, grading the model on working C++, Java, PHP, TypeScript, and more. This is where Code Llama’s claim was broadest: the paper reports it outperforming the publicly available alternatives across MultiPL-E, a stronger statement than topping one Python benchmark.
What the numbers actually say
The headline figures at the 34B size are 53.7% on HumanEval and 56.2% on MBPP, both pass@1. In the paper's results table these belong to the Code Llama – Python 34B variant; the base 34B model scores several points lower on HumanEval. Read those carefully. Roughly half the problems are solved on a single attempt, which means half are not, and these are self-contained functions, not production tickets. The abstract quotes higher ceilings across the family, up to 67% on HumanEval and 65% on MBPP, reached by specialized variants at the 70B size added after the initial release. That spread is the point: targeted training on one language buys measurable accuracy on that language’s benchmark.
Scores are reported with pass@k: the model generates k samples per problem, and it counts as solved if any sample passes the tests. pass@1 is the strict single-shot number; higher k rewards a model that lands a correct answer somewhere in several tries. Quoting which k a number uses is not a formality — pass@100 looks far better than pass@1 on the same model, and conflating them is the easiest way to inflate a result.
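In practice pass@k is usually estimated rather than measured by drawing exactly k samples. The estimator introduced alongside HumanEval draws n ≥ k samples per problem, counts the c that pass, and computes 1 − C(n−c, k) / C(n, k), the probability that a random subset of k samples contains at least one pass. The sketch below implements that formula; the function name is illustrative, and whether any specific score quoted here was computed this way is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same model output (10 samples, 3 correct), two very different headline numbers.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

A benchmark score is the mean of this value over all problems. The two printed values show how much the choice of k moves the number for identical samples.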
Methodology details that change the reading
Two design choices affect how the evaluation should be read. Code Llama was trained on 16k-token sequences and shows gains on inputs up to 100k tokens, so long-context handling is part of the offering — yet HumanEval and MBPP problems are tiny and never exercise it. It also supports infilling, predicting code in the middle of a file given both sides, which matters for editor-style completion but is orthogonal to the function-from-spec task these benchmarks measure. A strong HumanEval number says nothing about infilling quality.
Why it matters for code-generation QA
The durable lesson is the shape of the evaluation, not the leaderboard spot. A trustworthy code-model claim names the benchmark, the shot count, and the k in pass@k, and it grades by execution. Code Llama’s report does all four, and that is the bar to hold any vendor number against. If a claim omits the k, or grades by string similarity instead of running the code, treat it as marketing.
The caveats matter as much as the scores. HumanEval and MBPP are short, single-function, unambiguous problems — the opposite of most real work, which spans files, carries existing state, and ships with vague requirements. A 53.7% pass@1 on isolated functions tells you little about a model editing a live service, and the long-context and infilling features these benchmarks never touch are exactly what production coding depends on. Use these numbers to rank models against each other, not to predict how one behaves in your codebase.
For a team building an eval pipeline, copy the harness, not the headline. Write the spec, write executable tests, sample more than once, and count what passes — then extend it where these benchmarks stop: multi-file context, edge-case and adversarial inputs, and tasks from your own repository. MultiPL-E is also a reminder to test every language you ship, since per-language scores diverge sharply and a Python number does not transfer.
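A minimal sketch of the "sample more than once and count what passes" step follows. It is not the paper's harness: the `slugify` task, the candidate strings, and the helper names are hypothetical. Each candidate runs in a fresh interpreter with a timeout, which gives process isolation but is not a security sandbox.

```python
import subprocess
import sys


def run_candidate(candidate_src, test_src, timeout_s=5.0):
    """Return True if candidate + tests exit cleanly in a fresh interpreter."""
    program = candidate_src + "\n\n" + test_src
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hanging candidate counts as a failure, not as a stalled pipeline.
        return False
    return result.returncode == 0


def count_passing(candidates, test_src, timeout_s=5.0):
    """Count how many sampled candidates satisfy the executable tests."""
    return sum(run_candidate(c, test_src, timeout_s) for c in candidates)


# Hypothetical spec expressed as executable tests.
TESTS = (
    "assert slugify('Hello World') == 'hello-world'\n"
    "assert slugify('  a  b ') == 'a-b'\n"
)

# Stand-ins for several model samples on the same task.
CANDIDATES = [
    "def slugify(s):\n    return '-'.join(s.lower().split())",  # handles extra spaces
    "def slugify(s):\n    return s.lower().replace(' ', '-')",  # fails the edge case
    "def slugify(s):\n    while True:\n        pass",           # never returns
]

if __name__ == "__main__":
    passed = count_passing(CANDIDATES, TESTS, timeout_s=2.0)
    print(f"{passed} of {len(CANDIDATES)} samples passed")
```

The per-problem counts this produces, n samples and c passes, are the inputs a pass@k calculation needs. The second test line is the kind of edge case worth adding beyond what the public benchmarks cover.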
Read the original: Code Llama: Open Foundation Models for Code — Meta AI, 2023-08-24.
