Qwen2.5-Coder: How Alibaba Benchmarks AI Code

  • Alibaba’s September 2024 report covers the Qwen2.5-Coder family — six open models from 0.5B to 32B parameters, continued-pretrained on more than 5.5 trillion tokens of code-heavy data.
  • The 32B-Instruct model reported 92.7% on HumanEval and 86.3% on EvalPlus, edging past Claude 3.5 Sonnet (92.1 and 85.9) and outscoring DeepSeek-Coder-V2-Instruct at release.
  • Evaluation runs across more than 10 benchmarks covering generation, completion, reasoning, and repair, including HumanEval, MBPP, EvalPlus, LiveCodeBench, and BigCodeBench. The generation benchmarks are graded by executing code against tests rather than by matching text.

Overview

The Qwen2.5-Coder Technical Report, published September 18, 2024, by Binyuan Hui, Jian Yang, Junyang Lin, Jingren Zhou and colleagues at Alibaba, documents a family of open code models and benchmarks them on a broad set of public evaluation suites. The framing claim is direct: state-of-the-art open-source results across more than ten benchmarks, with the flagship matching or beating closed models of similar capability.

What makes the report instructive for evaluation work is its breadth. Most code-model papers lean on one or two headline numbers. This one spreads the claim across generation, completion, reasoning, and repair — four distinct skills that a single benchmark cannot measure — and reports across all six model sizes, which lets you read scaling behavior rather than a single cherry-picked point.

The model family and how it was trained

Six sizes ship in the series: 0.5B, 1.5B, 3B, 7B, 14B, and 32B parameters. Each comes as a base model and an instruction-tuned variant. The wide size range is part of the contribution: the report claims each model outperforms same-size competitors, and in some cases larger ones, a comparison that only means something when you have a clean ladder of sizes to test.

The models were continued from the Qwen2.5 base checkpoints on a corpus exceeding 5.5 trillion tokens. The recipe combines aggressive data cleaning, synthetic data generation, and a deliberate mixing ratio that holds code, math, and general text together. That last choice is the interesting one. Many code-specialized models trade away general reasoning to chase a higher HumanEval number; the report claims Qwen2.5-Coder keeps its math and general skills intact while specializing, which matters because real coding tasks routinely require arithmetic, logic, and natural-language understanding alongside syntax.

The benchmark suite and what each one tests

The report deliberately avoids a single-benchmark verdict. Each suite stresses a different ability, and a model can be strong on one while weak on another.

Generation: HumanEval, MBPP, EvalPlus

HumanEval is 164 hand-written Python problems. Each gives a function signature and a docstring; the model writes the body, which is then run against unit tests it does not see. MBPP is a separate set of short tasks with assert-style checks, drawn from a different distribution so a model cannot win both by overfitting one. EvalPlus is the sharper instrument: it takes HumanEval and MBPP and adds far more test cases per problem, catching solutions that pass the thin original tests but break on edge inputs. A high EvalPlus score is harder to fake than a high HumanEval score, which is why the 86.3% figure carries more weight than the 92.7%.
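The difference between thin and augmented tests can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not an item from HumanEval or EvalPlus: the function `candidate_max`, the two test lists, and the helper names are all made up to show how a solution can pass a small test set and fail a larger one.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[list, int]  # (input list, expected maximum)


def candidate_max(xs: list) -> int:
    """Hypothetical model output: looks right but assumes non-negative inputs."""
    best = 0  # bug: wrong starting value when every element is negative
    for x in xs:
        if x > best:
            best = x
    return best


def pass_rate(fn: Callable[[list], int], tests: List[TestCase]) -> float:
    """Fraction of test cases for which fn returns the expected value."""
    passed = 0
    for args, expected in tests:
        try:
            if fn(args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)


def solves(fn: Callable[[list], int], tests: List[TestCase]) -> bool:
    """A problem counts as solved only if every test passes."""
    return pass_rate(fn, tests) == 1.0


# Thin tests in the style of an original benchmark (made up).
THIN_TESTS: List[TestCase] = [([1, 2, 3], 3), ([5, 1], 5)]
# Augmented tests add edge inputs, in the spirit of EvalPlus (made up).
EXTRA_TESTS: List[TestCase] = THIN_TESTS + [([-3, -1, -2], -1), ([0], 0), ([-7], -7)]

thin_ok = solves(candidate_max, THIN_TESTS)
extra_ok = solves(candidate_max, EXTRA_TESTS)
print(thin_ok, extra_ok)
```

The same candidate counts as a pass under the thin tests and a failure under the augmented ones, which is the effect that pulls an EvalPlus score below the matching HumanEval score.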

Reasoning and contamination resistance: LiveCodeBench

LiveCodeBench pulls problems from live competitive-programming sites and timestamps them, so you can evaluate only on problems published after a model’s training cutoff. That is the cleanest available defense against contamination — if a model has memorized public solutions, fresh problems expose it. Reporting on LiveCodeBench is a signal that the authors took memorization seriously rather than letting a leaderboard number stand unexamined.
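The after-cutoff filter is simple enough to sketch. This is a minimal illustration under stated assumptions, not LiveCodeBench's actual loader: the `Problem` record, its field names, the sample problems, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    problem_id: str  # hypothetical identifier
    published: date  # date the problem first appeared publicly


def after_cutoff(problems: List[Problem], cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the training cutoff."""
    return [p for p in problems if p.published > cutoff]


# Made-up problems and a made-up cutoff, for illustration only.
PROBLEMS = [
    Problem("p-001", date(2023, 11, 2)),
    Problem("p-002", date(2024, 3, 15)),
    Problem("p-003", date(2024, 8, 30)),
]
ASSUMED_CUTOFF = date(2024, 3, 15)

fresh = after_cutoff(PROBLEMS, ASSUMED_CUTOFF)
print([p.problem_id for p in fresh])
```

The comparison is strict on purpose: a problem published on the cutoff date itself may already be in the training data, so it is excluded.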

Real-world dependencies: BigCodeBench

BigCodeBench moves out of self-contained-function territory. Its tasks require calling real libraries across multiple domains, so the model has to know specific APIs and compose them correctly, not just produce clean control flow. It is closer to the messiness of production code than HumanEval’s tidy puzzles.

Completion and repair

Beyond generation, the report measures code completion — fill-in-the-middle style infilling, where the model predicts a missing span given the code before and after it — and code repair, where it fixes broken programs. The base models reported state-of-the-art completion results on suites including HumanEval-Infilling, CrossCodeEval, RepoEval, and SAFIM. These exercise the capabilities an IDE assistant actually leans on, which function-from-spec benchmarks barely touch.
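A fill-in-the-middle task can be sketched as a prompt builder plus a scorer. The sentinel strings below follow the prefix-suffix-middle layout described in the Qwen2.5-Coder model documentation, but treat them as an assumption and check them against the tokenizer you actually use. The helper names, the toy source snippet, and the whitespace-insensitive exact-match rule are hypothetical simplifications, not the scoring code of any named benchmark.

```python
# Assumed sentinel tokens; verify against your model's tokenizer.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Show the model the code before and after the gap, then ask for the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def exact_match(predicted: str, reference: str) -> bool:
    """Simplified scorer: compare spans after trimming surrounding whitespace."""
    return predicted.strip() == reference.strip()


# Toy example: hide the expression in a return statement.
source = "def add(a, b):\n    return a + b\n"
prefix = "def add(a, b):\n    return "
middle = "a + b"  # the span the model is asked to predict
suffix = "\n"

prompt = build_fim_prompt(prefix, suffix)
print(repr(prompt))
print(exact_match(" a + b\n", middle))
```

Exact match is a text comparison, not an execution check, so a functionally equivalent infill such as `b + a` would be scored as wrong under this rule.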

Why it matters for code-generation QA

The transferable lesson is the evaluation design, not the leaderboard placement that newer models have already passed. The report does what a credible code-model claim should: it grades generated code by executing it against tests, reports across four different task types, and includes a contamination-resistant benchmark in LiveCodeBench. The EvalPlus inclusion is the part most worth copying. Adding more test cases to existing benchmarks is cheap, and it converts a soft pass into a real one: the gap between a model’s HumanEval and EvalPlus scores is a direct measure of how brittle its passing solutions are.

The caveats are familiar and load-bearing. A 92.7% on HumanEval still means roughly one in fourteen self-contained functions fails on the first attempt, and these are short, unambiguous problems, the opposite of a vague production ticket that spans files and carries existing state. Pass numbers also need their k: a single-shot pass@1 and a pass@10 are not comparable, and a score without its k is not a score you can trust. The “consistently beats larger models at the same size” claim is strong, but every benchmark in the suite is public, so contamination pressure grows with each month the data sits on the web. That is exactly why the LiveCodeBench-after-cutoff result should be weighted more heavily than the saturated HumanEval one.
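To see why k matters, here is a small sketch of the commonly used unbiased pass@k estimator: with n samples per problem of which c pass, pass@k = 1 - C(n - c, k) / C(n, k). The source does not say which estimator the report uses, so take this as an illustration of the metric, not a description of the report's method; the sample counts are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples drawn per problem, c: samples that passed, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts: 10 samples for one problem, 3 of them pass.
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=10, c=3, k=k), 3))
```

With these counts the same problem scores about 0.30 at k=1, about 0.92 at k=5, and 1.0 at k=10, so two numbers reported at different k cannot be compared directly.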

If you are building an eval pipeline, take the structure and extend it. Grade by running code, sample more than once, report each metric with its k, and never trust a single benchmark to stand in for a skill it does not measure. Then go past where these suites stop: your own repository’s tasks, multi-file context, adversarial inputs, and a private holdout the model could not have seen. The completion and repair benchmarks are the closest proxy for day-to-day assistant use, so weight them if that is your workload — a strong generation number tells you little about infilling quality in your codebase.
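As a starting point for grading by running code, here is a minimal sketch of an execution harness. It is an assumption-laden illustration, not the harness used in the report: `run_candidate`, the sample solutions, and the tests are hypothetical, and a subprocess with a timeout is not a security sandbox, so untrusted model output should run in an isolated container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Return True if the solution followed by its tests exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,  # keep candidate output off the console
                timeout=timeout_s,    # stop runaway or non-terminating code
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # a timeout counts as a failure
        return result.returncode == 0


# Hypothetical model outputs and assert-style tests.
GOOD = "def add(a, b):\n    return a + b"
BAD = "def add(a, b):\n    return a - b"
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

good_ok = run_candidate(GOOD, TESTS)
bad_ok = run_candidate(BAD, TESTS)
print(good_ok, bad_ok)
```

Running each candidate in its own process keeps a crash or an infinite loop from taking down the evaluator, and the boolean result per sample feeds directly into a pass@k calculation.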

Read the original: Qwen2.5-Coder Technical Report — Alibaba Qwen, 2024-09-18.