How AlphaCode 2 Uses Gemini to Solve 43% of Codeforces Problems

AlphaCode 2, built on Google DeepMind’s Gemini Pro, solved 43% of Codeforces competition problems within 10 attempts, up from AlphaCode 1’s 25% (about a 1.7x gain).
It reached the 85th percentile of human competitors on average, sitting between the Expert and Candidate Master ranks on Codeforces.
The system generates up to one million C++ samples per problem, then prunes them to 10 submissions through test-based filtering, behavioral clustering, and a fine-tuned scoring model.

Overview

The AlphaCode 2 Technical Report, published by the AlphaCode Team at Google DeepMind on December 6, 2023, describes a competitive-programming system powered by Gemini Pro. It is the successor to the 2022 AlphaCode, which was the first AI to reach median-human performance on Codeforces. The report matters because it is one of the clearest public accounts of how a generate-many-then-filter pipeline turns a code-generating model into a system that actually ships correct answers under a hard submission budget.

The headline result is real but worth reading carefully. AlphaCode 2 does not write one good program; it writes up to a million candidate programs per problem and spends most of its engineering effort deciding which ten to submit.

How the system is built

The foundation is Gemini Pro, fine-tuned in two consecutive rounds using the GOLD training objective. The first round runs on an updated CodeContests dataset containing roughly 15,000 problems and 30 million human code samples. The second round runs a few additional steps on a smaller, higher-quality dataset. Rather than producing a single policy model, the team varied hyperparameters to create a family of fine-tuned models, because diversity across models is what lets the system reach problems a single model would miss.

Sampling for breadth

For each problem, AlphaCode 2 generates up to one million code samples, splitting the budget evenly across the family of policy models. Each sample uses a randomized temperature, and the prompt randomizes metadata like the problem’s difficulty rating and category tags. The point is coverage: with a wide enough spread of attempts, at least some are likely to be correct. Notably, AlphaCode 1 sampled in both Python and C++, while AlphaCode 2 dropped Python entirely because the team found its C++ samples were higher quality.
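
The sampling setup described above can be sketched as a plan of per-sample requests. This is a minimal illustration, not the report's implementation: `SampleRequest`, `build_sampling_plan`, the tag pool, and the rating and temperature ranges are all hypothetical, since the report does not publish those values.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRequest:
    model_id: int       # which fine-tuned policy model to query
    temperature: float  # randomized per sample
    rating: int         # randomized difficulty rating placed in the prompt
    tags: tuple         # randomized category tags placed in the prompt


# Hypothetical values: the real ranges and tag vocabulary are not published.
TAG_POOL = ("dp", "greedy", "graphs", "math", "strings")
RATING_RANGE = (800, 3500)
TEMPERATURE_RANGE = (0.1, 1.0)


def build_sampling_plan(total_samples, num_models, seed=0):
    """Spread the sample budget evenly across models with randomized settings."""
    rng = random.Random(seed)
    plan = []
    for i in range(total_samples):
        plan.append(SampleRequest(
            model_id=i % num_models,  # round-robin gives an even split
            temperature=rng.uniform(*TEMPERATURE_RANGE),
            rating=rng.randrange(RATING_RANGE[0], RATING_RANGE[1] + 1, 100),
            tags=tuple(sorted(rng.sample(TAG_POOL, k=rng.randint(1, 3)))),
        ))
    return plan


# Tiny budget for illustration; the report describes up to one million.
plan = build_sampling_plan(total_samples=12, num_models=4)
```

Each request would then be turned into a prompt and sent to the corresponding model. Round-robin assignment is just one simple way to get the even split.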

How a million samples become ten submissions

Generation is the easy part. The interesting engineering is the funnel that reduces a million candidates to the 10 the system is allowed to submit.

Filtering against public tests

Every Codeforces problem ships with at least one public input/output example. AlphaCode 2 executes each sample against that test and discards anything that produces the wrong output, along with the fewer than 5% of samples that fail to compile. On average this removes about 95% of all generated code, leaving roughly 50,000 candidates per problem.
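
A minimal sketch of this filtering step follows. To keep it self-contained, candidates are modeled as Python callables from input text to output text rather than compiled C++ programs, and a `None` result stands in for a compile failure. The names `filter_candidates`, `passes_public_tests`, and `normalize`, and the whitespace-insensitive comparison, are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A candidate maps the problem's input text to output text.
# Returning None stands in for "failed to compile" in this sketch.
Candidate = Callable[[str], Optional[str]]


def normalize(output: str) -> str:
    """Compare token by token so trailing whitespace does not matter (assumption)."""
    return " ".join(output.split())


def passes_public_tests(candidate: Candidate,
                        public_tests: List[Tuple[str, str]]) -> bool:
    for test_input, expected in public_tests:
        try:
            actual = candidate(test_input)
        except Exception:
            return False  # a runtime error counts as a failed test
        if actual is None or normalize(actual) != normalize(expected):
            return False
    return True


def filter_candidates(candidates: List[Candidate],
                      public_tests: List[Tuple[str, str]]) -> List[Candidate]:
    """Keep only candidates that reproduce every public example."""
    return [c for c in candidates if passes_public_tests(c, public_tests)]


# Toy problem: print the sum of two integers.
public_tests = [("2 3\n", "5\n")]
candidates = [
    lambda s: str(sum(map(int, s.split()))) + "\n",        # correct
    lambda s: str(int(s.split()[0]) * int(s.split()[1])),  # wrong answer
    lambda s: None,                                         # "does not compile"
    lambda s: str(int(s)),                                  # crashes on this input
]
survivors = filter_candidates(candidates, public_tests)
```

A real harness would compile each sample and run it in a sandbox with time and memory limits. The point here is only that the filter is a pure execution check with no model judgment involved.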

Clustering by behavior

Fifty thousand candidates is still far too many for 10 slots. To collapse them, the system trains a separate model to generate new test inputs, runs the surviving samples on those inputs, and treats each program’s outputs as a behavioral signature. Programs with matching signatures get grouped into clusters. The clusters are ordered by size, and only the 10 largest are kept. The logic is that programs behaving identically are probably redundant, so submitting one representative per cluster avoids wasting attempts on near-duplicates.
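
The grouping step can be sketched as follows, again treating candidates as Python callables for the sake of a runnable example. `behavioral_signature` and `cluster_by_behavior` are hypothetical names, and the generated inputs are passed in as plain strings rather than produced by a trained test-generation model.

```python
from collections import defaultdict


def behavioral_signature(candidate, generated_inputs):
    """Run a candidate on every generated input and record its outputs."""
    outputs = []
    for test_input in generated_inputs:
        try:
            outputs.append(candidate(test_input))
        except Exception:
            outputs.append("<error>")  # crashes are part of the behavior too
    return tuple(outputs)


def cluster_by_behavior(candidates, generated_inputs, keep=10):
    """Group candidates with identical signatures; keep the largest clusters."""
    clusters = defaultdict(list)
    for candidate in candidates:
        clusters[behavioral_signature(candidate, generated_inputs)].append(candidate)
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return ordered[:keep]


# Toy problem: absolute value. Two candidates agree everywhere, one differs.
candidates = [
    lambda s: str(abs(int(s))),
    lambda s: str(int(s) if int(s) >= 0 else -int(s)),
    lambda s: str(int(s)),  # wrong on negative inputs
]
clusters = cluster_by_behavior(candidates, generated_inputs=["-2", "3"])
```

Note that no expected outputs are needed here: the generated inputs only have to separate programs that behave differently, which is why they are cheaper to produce than full test cases.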

Scoring and final selection

A second Gemini Pro model, fine-tuned to estimate correctness, assigns each remaining sample a score between 0 and 1. From each of the 10 surviving clusters, the system picks the highest-scoring sample. Those become the ordered list of up to 10 submissions, sent to the online judge one at a time until a correct solution lands or the candidates run out.
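
A short sketch of selection and submission, under stated assumptions: samples are represented by string identifiers, `score` is any function returning a number between 0 and 1 (standing in for the fine-tuned scoring model), and submissions are ordered largest cluster first, which is an assumption because the ordering rule is not spelled out here. `pick_submissions` and `submit_until_accepted` are hypothetical names.

```python
def pick_submissions(clusters, score, limit=10):
    """Take the highest-scoring sample from each cluster, in cluster order."""
    return [max(cluster, key=score) for cluster in clusters[:limit]]


def submit_until_accepted(submissions, judge):
    """Submit one at a time; return the 1-based attempt that passed, or None."""
    for attempt, sample in enumerate(submissions, start=1):
        if judge(sample):
            return attempt
    return None


# Toy data: three clusters of sample ids with made-up scores.
clusters = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
scores = {"a1": 0.2, "a2": 0.7, "a3": 0.4, "b1": 0.9, "b2": 0.1, "c1": 0.5}

submissions = pick_submissions(clusters, scores.get)
attempts_used = submit_until_accepted(submissions, judge=lambda s: s == "b1")
```

In this toy run the first submission is rejected and the second is accepted, so two of the available attempts are used.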

What the evaluation actually measured

The team tested on 12 recent Codeforces contests, each with more than 8,000 participants, drawn from division 2 or the harder combined division 1+2. That came to 77 problems. For each, they sampled one million candidates and submitted up to 10. The 43% solve rate and 85th-percentile estimate come from this set, measured on the same platform as AlphaCode 1’s 25%, which is what makes the 1.7x claim credible rather than a benchmark-shopping artifact.
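
The headline figures can be checked with a few lines of arithmetic. The inputs below are the rounded numbers quoted in this article; the derived problem count is an approximation from those rounded percentages, not a figure taken from the report.

```python
problems = 77
solve_rate_alphacode2 = 0.43
solve_rate_alphacode1 = 0.25
samples_per_problem = 1_000_000
fraction_removed_by_filter = 0.95

# Relative improvement in solve rate.
relative_gain = solve_rate_alphacode2 / solve_rate_alphacode1

# Approximate number of problems solved, derived from the rounded percentage.
approx_solved = round(problems * solve_rate_alphacode2)

# Candidates left per problem after filtering on public tests.
survivors = round(samples_per_problem * (1 - fraction_removed_by_filter))

print(f"gain: {relative_gain:.2f}x, solved: ~{approx_solved}, survivors: {survivors}")
```

With only 77 problems, each one is worth about 1.3 percentage points, so small differences in solve rate on this set should be read with that granularity in mind.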

Why it matters for code-generation QA

The most useful lesson here is not the percentage. It is the shape of the pipeline. AlphaCode 2 treats a language model as an unreliable candidate generator and puts its trust in execution-based verification. Filtering against the public test cases removes about 95% of the generated samples before any model judgment is applied. That ordering is the right instinct for any team evaluating AI-generated code: run it before you rank it.

The caveats are sharp, though. Competitive programming is a friendly setting for this approach because problems are self-contained, correctness is decidable from tests, and a wrong answer costs nothing but an attempt. Most production code has none of those properties. There is no oracle that tells you a refactor preserved behavior, and you cannot generate a million pull requests and submit the 10 that compile. The behavioral-clustering trick also depends on being able to manufacture test inputs cheaply, which is itself a hard QA problem in real systems.

There is a quieter point about cost. A million samples and two execution passes per problem is enormous compute spent to extract a handful of trustworthy outputs. For teams building evaluation harnesses, the takeaway is that model quality and verification quality are separate budgets, and AlphaCode 2 spends heavily on the second. A weak generator paired with strong, execution-grounded filtering can beat a strong generator you simply trust. That principle generalizes well beyond contests, even when the million-sample budget does not.

Read the original: AlphaCode 2 Technical Report — Google DeepMind, 2023-12-06.