MBPP: How Google Tests LLM Code Generation

Google Research introduced MBPP (Mostly Basic Programming Problems): 974 short Python tasks, each paired with a natural-language description and three test cases that the generated code must pass.
Correctness is judged by execution, not text similarity: a solution counts only if it runs and passes the task's test cases. The largest model, at 137B parameters, hit 59.6% accuracy on MBPP with few-shot prompting alone.
Synthesis quality scaled log-linearly with model size (244M to 137B params), and a single round of natural-language feedback from a human halved the error rate on failed problems.

Overview

This 2021 paper from Google Research is one of the first large studies to ask a blunt question about AI code generation: does the program actually run and do what was asked? The authors built two benchmarks — MBPP and MathQA-Python — and measured large language models against them by executing the generated code against test cases rather than scoring how the text looks.

It matters because it set the template most code-eval work still follows. The combination of a clean task format, execution-based grading, and a study across model scales gave the field a repeatable way to compare models on real programming, not paraphrasing.

What MBPP actually is

MBPP contains 974 problems aimed at entry-level programmers — string manipulation, list operations, basic math, simple data handling. Each entry has three parts: a one-sentence English description, a reference solution, and three assert-style test cases. The tests are the contract. A model can write any code it wants as long as the asserts pass.
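To make the three-part structure concrete, here is a minimal sketch of what one such entry could look like. The task, the function name `max_of_list`, and the dictionary keys are invented for illustration and are not copied from the dataset.

```python
# Hypothetical MBPP-style entry. The wording, function name, and key names
# are illustrative assumptions, not taken from the real dataset.
task = {
    "description": "Write a function to find the largest number in a list.",
    "reference_solution": "def max_of_list(nums):\n    return max(nums)",
    "tests": [
        "assert max_of_list([1, 2, 3]) == 3",
        "assert max_of_list([-5, -2, -9]) == -2",
        "assert max_of_list([7]) == 7",
    ],
}

# Sanity check: the reference solution should satisfy its own contract.
namespace = {}
exec(task["reference_solution"], namespace)
for test in task["tests"]:
    exec(test, namespace)  # raises AssertionError if a test fails
```

Any other implementation that makes the same three asserts pass would count as equally correct under this format.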

That design choice is the important one. Earlier code-generation work often leaned on metrics like BLEU that compare generated text to a reference string. Two correct programs can look nothing alike, and a near-identical string can be broken. By grading on execution, MBPP sidesteps both failure modes.
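A small sketch of both failure modes, under stated assumptions: the `is_even` task is invented, and `difflib`'s character-level ratio stands in for a text-similarity metric. It is not BLEU, but it shows the same mismatch between looking right and being right.

```python
import difflib

reference = "def is_even(n):\n    return n % 2 == 0\n"
# Correct, but written differently from the reference.
different_but_correct = "def is_even(n):\n    return not (n & 1)\n"
# Almost the same text as the reference, but wrong.
similar_but_broken = "def is_even(n):\n    return n % 2 == 1\n"

tests = ["assert is_even(4)", "assert not is_even(7)", "assert is_even(0)"]


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for BLEU, not BLEU."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def passes(code, tests):
    """Return True only if the code loads and every assert succeeds."""
    namespace = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


for label, code in [("different but correct", different_but_correct),
                    ("similar but broken", similar_but_broken)]:
    print(label, round(similarity(reference, code), 2), passes(code, tests))
```

The broken variant scores higher on similarity yet fails the tests, while the correct variant scores lower and passes.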

How the tasks are scored

The model receives the description (and in few-shot mode, a handful of worked examples) and produces a function. That function runs against the test cases in a sandbox. If it passes all three, the problem counts as solved. The headline metric is the fraction of problems solved this way. There is no partial credit for code that almost works: a single failing assert means zero.
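The scoring rule can be sketched in a few lines. This is an assumption-laden illustration, not the paper's harness: the problems and candidate programs are invented, and a child Python process with a timeout gives only process isolation, not a real security sandbox.

```python
import subprocess
import sys


def solves(candidate_code, tests, timeout_s=10.0):
    """Run the candidate plus its asserts in a child process.

    All-or-nothing: any failing assert, exception, or timeout counts as unsolved.
    """
    program = candidate_code + "\n" + "\n".join(tests) + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def pass_rate(problems, candidates):
    """Fraction of problems whose candidate passes every test."""
    solved = sum(solves(candidates[p["id"]], p["tests"]) for p in problems)
    return solved / len(problems)


# Two hypothetical problems and two hypothetical model outputs.
problems = [
    {"id": 1, "tests": ["assert double(2) == 4",
                        "assert double(0) == 0",
                        "assert double(-3) == -6"]},
    {"id": 2, "tests": ["assert absolute(3) == 3",
                        "assert absolute(0) == 0",
                        "assert absolute(-4) == 4"]},
]
candidates = {
    1: "def double(x):\n    return x * 2",
    2: "def absolute(x):\n    return x",  # passes two of three asserts
}

print(pass_rate(problems, candidates))
```

The second candidate passes two of its three asserts and still scores zero for that problem, which is the no-partial-credit rule in action.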

MathQA-Python and the two evaluation regimes

The second benchmark, MathQA-Python, reframes 23,914 math word problems as code-synthesis tasks: read a more complex natural-language prompt, write Python that computes the answer. It stresses a different skill — translating dense, multi-step text into a correct procedure — and gives a much larger sample than MBPP’s 974.
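As a rough illustration of the task shape, here is an invented word problem and a short program that computes its answer. Neither the problem text nor the variable names come from the benchmark; they are assumptions made for the example.

```python
# Invented word problem in the spirit of the benchmark (not from the dataset).
problem = (
    "A shop sells pencils at 3 for 2 dollars. "
    "How many dollars do 12 pencils cost?"
)


def solve():
    """Translate the word problem into a short, formulaic computation."""
    pencils_per_bundle = 3
    dollars_per_bundle = 2
    pencils_wanted = 12
    bundles = pencils_wanted / pencils_per_bundle
    return bundles * dollars_per_bundle


# Grading is still execution-based: run the program, compare to the answer.
expected_answer = 8
assert solve() == expected_answer
```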

The authors evaluated under two regimes. In few-shot prompting, the model sees a few examples in the prompt and its weights stay fixed. In fine-tuning, the model is further trained on a portion of the task data set aside for that purpose and then evaluated on the remaining problems. Across most model sizes, fine-tuning added roughly 10 percentage points over few-shot. On MathQA-Python, the largest fine-tuned model reached 83.8% accuracy, much higher than on MBPP, likely because the math problems map more directly onto short, formulaic code.

The findings worth remembering

Three results stand out. First, performance scaled log-linearly with parameter count from 244M up to 137B: bigger models were predictably better, with no plateau in the tested range. Second, the best few-shot result on MBPP was 59.6%, meaning even the largest model left roughly four in ten beginner-level problems unsolved without fine-tuning. Third, and most useful for QA work: when a human gave one round of plain-English feedback on a wrong answer, the model’s error rate on those problems dropped by half.

One negative result is just as instructive. The models could write code that passed tests, yet could not reliably predict what their own programs would output for a given input. Generating plausible code and understanding execution semantics are not the same capability — a gap that directly limits how far you can trust a model to reason about its own output.
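A sketch of how such an execution-prediction check could be graded, assuming a predicted output arrives as a string. The program, the call, and the "model prediction" below are all invented for the example.

```python
def actual_output(code, call):
    """Execute the program and return the repr of the call's result."""
    namespace = {}
    exec(code, namespace)
    return repr(eval(call, namespace))


program = "def f(xs):\n    return [x * 2 for x in xs if x % 2]"
call = "f([1, 2, 3])"

# Hypothetical model answer: plausible at a glance, but it ignores the filter.
predicted_by_model = "[2, 4, 6]"

prediction_matches = actual_output(program, call) == predicted_by_model.strip()
print(prediction_matches)
```

The only reliable way to settle the question is the one the harness uses: run the program and compare.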

Why it matters for code-generation QA

The execution-first stance is the durable lesson here. If you are evaluating AI-generated code today, similarity to a reference or a model’s own confidence tells you almost nothing. Run it. MBPP’s three-asserts-per-task format is a minimum viable harness you can copy: write the spec, write the tests, execute, count passes.

The caveats deserve equal weight. MBPP problems are short and self-contained — no external state, no multi-file context, no ambiguous requirements. A 59.6% pass rate on toy problems says little about a model patching a real service. Three test cases per task also under-test edge behavior; code can pass all three and still be wrong on inputs nobody checked. Treat benchmark scores as a floor, not a guarantee, and back them with property-based or adversarial tests on anything you ship.
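The under-testing problem is easy to reproduce. In this sketch, an invented "second largest distinct value" task has a candidate that passes three plausible asserts but breaks on duplicates. A simple randomized comparison against a trusted reference, a lightweight stand-in for a property-based testing library, finds the gap. All function names here are hypothetical.

```python
import random


def second_largest_candidate(nums):
    """Hypothetical generated code: ignores duplicate values."""
    return sorted(nums)[-2]


def second_largest_reference(nums):
    """Trusted reference: second largest among distinct values."""
    return sorted(set(nums))[-2]


# The candidate passes three reasonable-looking asserts.
assert second_largest_candidate([1, 2, 3]) == 2
assert second_largest_candidate([10, 4, 7]) == 7
assert second_largest_candidate([-1, -5]) == -5


def find_counterexample(candidate, reference, trials=500, seed=0):
    """Compare candidate and reference on random small inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        nums = [rng.randint(0, 3) for _ in range(rng.randint(2, 6))]
        if len(set(nums)) < 2:
            continue  # the task is undefined without two distinct values
        if candidate(nums) != reference(nums):
            return nums
    return None


counterexample = find_counterexample(
    second_largest_candidate, second_largest_reference
)
print(counterexample)
```

Small value ranges are deliberate: they make duplicates common, which is exactly the input class the three asserts never exercised.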

The feedback result points at the practical workflow. Models are far stronger as a fast first draft that a human or an automated checker corrects than as a one-shot oracle. A QA pipeline that executes generated code, surfaces the failing test, and feeds that signal back beats one that accepts the first answer — and that loop is exactly what later coding agents were built around. MBPP showed the value of closing it.
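A minimal sketch of that loop, with the caveat that it swaps the paper's human natural-language feedback for an automated failing-test message. The `generate` callable is a hypothetical stand-in for a model call, and `stub_generate` exists only to make the example runnable.

```python
def run_tests(code, tests):
    """Return None if every test passes, else a message about the first failure."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as exc:
        return f"code failed to load: {exc!r}"
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            return f"failed: {test} ({exc!r})"
    return None


def repair_loop(generate, description, tests, max_rounds=3):
    """Generate, execute, and feed the failure back until tests pass or rounds run out."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        code = generate(description, feedback)
        feedback = run_tests(code, tests)
        if feedback is None:
            return code, round_number
    return None, max_rounds


def stub_generate(description, feedback):
    """Hypothetical model: wrong on the first draft, fixed once it sees feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


tests = ["assert add(2, 3) == 5", "assert add(0, 0) == 0", "assert add(-1, 1) == 0"]
code, rounds = repair_loop(stub_generate, "Write a function that adds two numbers.", tests)
print(rounds)
```

Capping the number of rounds keeps a model that never converges from looping forever; the accepted code is whichever draft first passes every test.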

Read the original: Program Synthesis with Large Language Models, Google Research, 2021-08-16.
