Qwen2.5-Coder Benchmarks: How Alibaba Tests AI Code

  • Six model sizes (0.5B to 32B) ship under one evaluation harness, so you can compare the same benchmark suite across an entire capability range instead of cherry-picked single checkpoints.
  • The 32B-Instruct model scored 73.7 on Aider code repair and 75.2 on MdEval, the latter ranking first among open-source models — repair, not just greenfield generation, is treated as a first-class metric.
  • McEval spans 40+ programming languages (65.9 overall) and the fill-in-the-middle suite covers five separate benchmarks, pushing evaluation past Python-only Pass@1 toward how code is actually written and patched.

Overview

Alibaba’s Qwen team published the Qwen2.5-Coder launch post on November 12, 2024. The interesting part is not the model weights; it’s the evaluation design. Rather than reporting a single HumanEval number, the post runs the full family (0.5B, 1.5B, 3B, 7B, 14B, 32B) through a battery of benchmarks that each probe a different code-writing skill: generating from scratch, repairing broken code, working across dozens of languages, matching human preference, and filling gaps inside existing files.

For anyone building a QA process around AI-generated code, the methodology is more reusable than the leaderboard. It’s a worked example of how to measure a coding model along the axes that matter in production.

How the generation benchmarks are designed

The generation track uses three suites, and each is built to probe a different failure mode, so their scores can diverge for the same model. EvalPlus takes the original HumanEval and MBPP problems and adds far more test cases per task. The point is to catch solutions that pass the handful of canonical tests but break on edge inputs: a model can look correct on HumanEval and fail EvalPlus on the same prompt. LiveCodeBench draws problems from competitive programming sites and keeps only those published after a cutoff date, which limits the chance the model memorized the answer during training. BigCodeBench pushes toward realistic tasks that require calling external libraries rather than writing self-contained functions.
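The gap between a canonical test set and an augmented one is easy to reproduce on a toy problem. The sketch below is not EvalPlus itself; `candidate_median`, `base_tests`, and `extra_tests` are hypothetical names invented for illustration, and the "model output" is hand-written to contain a typical edge-case bug.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected result)


def candidate_median(values: List[float]) -> float:
    """Hypothetical model output: only correct for odd-length lists."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def passes(func: Callable, tests: List[TestCase]) -> bool:
    """Return True only if every test case passes without raising."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True


# A small canonical test set, in the spirit of the original benchmark tests.
base_tests: List[TestCase] = [
    (([3, 1, 2],), 2),
    (([5],), 5),
]

# Extra edge-case tests, in the spirit of an augmented suite.
extra_tests: List[TestCase] = [
    (([1, 2, 3, 4],), 2.5),  # even length needs the mean of the two middle values
    (([2, 2],), 2),
]

base_ok = passes(candidate_median, base_tests)
plus_ok = passes(candidate_median, base_tests + extra_tests)
print(f"base: {base_ok}, base+extra: {plus_ok}")  # base: True, base+extra: False
```

The same candidate counts as a pass or a fail depending only on which test set is applied, which is why a score means little without knowing how thorough the tests behind it are.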

On these three, the 32B-Instruct model reached the top of the open-source field and, per the post, landed at “competitive performance with GPT-4o.” Read that claim carefully: competitive on these specific suites is not the same as parity everywhere.

Repair, multilingual, and preference evaluation

Code repair

Repair gets its own track because generating correct code and fixing wrong code are different abilities. Aider measures whether a model can apply edits to an existing file and land a working change; the 32B-Instruct model scored 73.7, which the team frames as comparable to GPT-4o. MdEval targets bug detection and repair, and the same model scored 75.2, first among open-source models. If your workflow involves an agent editing a repository, these numbers map to your use case more directly than any from-scratch generation score.
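A repair check differs from a generation check in that it starts from broken code and scores the edit. The sketch below shows one possible shape for such a check. It assumes a simple search/replace edit format and a toy `clamp` function, both chosen for illustration; it does not reproduce how Aider or MdEval are implemented.

```python
from typing import Dict


def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit; fail loudly if the target is missing or ambiguous."""
    if source.count(search) != 1:
        raise ValueError("edit target must appear exactly once")
    return source.replace(search, replace)


def run_tests(source: str) -> bool:
    """Execute the source in a fresh namespace and check clamp() on three inputs."""
    namespace: Dict[str, object] = {}
    try:
        # exec is acceptable for this trusted toy code; real model output needs a sandbox.
        exec(source, namespace)
        clamp = namespace["clamp"]
        return (
            clamp(5, 0, 10) == 5
            and clamp(-1, 0, 10) == 0
            and clamp(11, 0, 10) == 10
        )
    except Exception:
        return False


BUGGY = '''
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return low
    return value
'''

# Hypothetical model-proposed edit for the upper-bound branch.
edit = {
    "search": "    if value > high:\n        return low",
    "replace": "    if value > high:\n        return high",
}

before = run_tests(BUGGY)
after = run_tests(apply_edit(BUGGY, edit["search"], edit["replace"]))
print(f"before: {before}, after: {after}")  # before: False, after: True
```

Rejecting edits whose target is missing or ambiguous matters here: an edit that cannot be applied cleanly should count as a failed repair rather than being silently skipped.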

Multilingual breadth

McEval evaluates across more than 40 programming languages, with a 65.9 aggregate and notably strong results in low-resource languages like Haskell and Racket. Most coding leaderboards are Python with a little JavaScript. A 40-language harness exposes where a model quietly degrades — useful if your codebase isn’t mainstream.
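One way to see quiet degradation in your own results is to keep per-language scores next to the aggregate and flag the outliers. This is a minimal sketch: the pass rates are placeholders invented for illustration (they are not McEval results), and `weak_languages` and its `margin` parameter are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Placeholder pass rates (fraction of tasks solved), for illustration only.
pass_rate_by_language: Dict[str, float] = {
    "python": 0.80,
    "typescript": 0.75,
    "go": 0.70,
    "in_house_dsl": 0.40,
}


def weak_languages(rates: Dict[str, float], margin: float = 0.15) -> List[str]:
    """Return languages whose pass rate is more than `margin` below the macro average."""
    average = mean(rates.values())
    return sorted(lang for lang, rate in rates.items() if rate < average - margin)


macro_average = mean(pass_rate_by_language.values())
print(f"macro average: {macro_average:.3f}")
print("weak languages:", weak_languages(pass_rate_by_language))
```

With these placeholder numbers the aggregate looks respectable while one language sits far below it, which is the pattern a single headline score hides.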

Human preference

Code Arena is an internal preference benchmark. It pits two models’ answers head-to-head and uses GPT-4o as the judge in an A-vs-B win-rate format. This catches something pass/fail tests miss: solutions that are technically correct but awkward, unidiomatic, or hard to read. The tradeoff is obvious — a model-as-judge inherits that judge’s biases, so the score reflects GPT-4o’s taste, not ground truth.
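A head-to-head win rate can be sketched in a few lines. The version below judges each pair twice with the answer order swapped, a common mitigation for position bias in model-as-judge setups. The post as summarized here does not say whether Code Arena does this, so treat it as an assumption. The `Judge` type, `win_rate`, and the toy `shorter_is_better` judge are all hypothetical; a real setup would call an LLM where the toy judge is.

```python
from typing import Callable, List, Tuple

# A judge takes (prompt, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def win_rate(pairs: List[Tuple[str, str, str]], judge: Judge) -> float:
    """Score model A against model B; each pair is (prompt, answer_a, answer_b).

    Every pair is judged twice with the order swapped. A earns 1 point if it
    wins both orderings, 0.5 if the verdicts disagree, and 0 if it loses both.
    """
    total = 0.0
    for prompt, answer_a, answer_b in pairs:
        a_wins_in_first_slot = judge(prompt, answer_a, answer_b) == "first"
        a_wins_in_second_slot = judge(prompt, answer_b, answer_a) == "second"
        total += (a_wins_in_first_slot + a_wins_in_second_slot) / 2
    return total / len(pairs)


def shorter_is_better(prompt: str, first: str, second: str) -> str:
    """Toy stand-in judge: prefers the shorter answer; ties go to the first slot."""
    return "first" if len(first) <= len(second) else "second"


pairs = [
    ("sum a list", "sum(xs)", "total = 0\nfor x in xs:\n    total += x"),
    ("reverse a list", "xs[::-1]", "xs[::-1]"),
]
rate = win_rate(pairs, shorter_is_better)
print(rate)  # 0.75
```

The second pair contains identical answers, and the toy judge's first-slot preference is cancelled out to 0.5 by the swap. A single-order evaluation would have credited model A with a full win there.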

Fill-in-the-middle: testing how code is actually written

Most generation benchmarks ask for a complete function from an empty file. Real editing happens in the middle of a file, with code above and below the cursor. The fill-in-the-middle (FIM) track measures exactly that across five benchmarks: HumanEval-Infilling, CrossCodeEval, CrossCodeLongEval, RepoEval, and SAFIM.

The differences between them matter. HumanEval-Infilling masks a span inside a single function. RepoEval and the CrossCodeEval variants require pulling context from other files in the repository to complete the gap correctly, which is closer to how an IDE autocomplete works against a real codebase. SAFIM focuses on syntax-aware completions. The team reports state-of-the-art results across all five, evaluated with exact match and Pass@1, with the input context capped at 8k tokens. The 32B led this track for open models.
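A single-file infilling check comes down to two pieces: a prompt that presents the code before and after the gap, and a comparison of the generated middle against the reference. The sketch below assumes a prefix-suffix-middle layout with the sentinel strings `<|fim_prefix|>`, `<|fim_suffix|>`, and `<|fim_middle|>`. These come from the model's published usage notes rather than from the launch post, so verify them against the tokenizer you actually use. The whitespace trimming in `exact_match` is also a simplifying assumption.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle prompt; the sentinel strings are assumed."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"


def exact_match(prediction: str, reference: str) -> bool:
    """Compare after trimming surrounding whitespace (a simplifying assumption)."""
    return prediction.strip() == reference.strip()


prefix = "def area(width, height):\n    return "
suffix = "\n\nprint(area(2, 3))\n"
reference = "width * height"

prompt = build_fim_prompt(prefix, suffix)

# A real run would send `prompt` to the model; hard-coded strings stand in here.
print(exact_match(" width * height", reference))  # True
print(exact_match("height * width", reference))   # False
```

The second comparison is functionally correct code that still fails exact match, which is why execution-based Pass@1 is a useful counterweight to reporting exact match alone.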

Why it matters for code-generation QA

The reusable lesson is the shape of the evaluation, not the rankings. A single benchmark number is a weak signal for code quality, and Qwen2.5-Coder’s design quietly argues the case by splitting the question into five tracks that can move independently. A model can ace EvalPlus and still write code your reviewers reject in Code Arena. It can generate clean functions and fumble a multi-file repair.

If you’re building an internal eval for AI-generated code, mirror this structure: separate generation from repair, test more than one language, include a multi-file FIM scenario that reflects your actual repository layout, and add a preference or readability check on top of pass/fail. The caveats travel too. EvalPlus only catches the edge cases someone wrote tests for. LiveCodeBench’s contamination resistance decays as problems age into training sets. Code Arena’s GPT-4o judge is a proxy, not a verdict — and “competitive with GPT-4o” is a claim scoped to these suites at this date, not a general guarantee.
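One way to keep the tracks from collapsing back into a single number is to gate on each one separately. The sketch below is only an illustration: the track names mirror the five areas discussed above, while the thresholds and scores are placeholders, not recommendations or measured results.

```python
from typing import Dict, List

# Hypothetical per-track minimums for an internal gate; choose your own.
THRESHOLDS: Dict[str, float] = {
    "generation": 0.70,
    "repair": 0.60,
    "multilingual": 0.50,
    "fim": 0.60,
    "preference": 0.50,
}


def failing_tracks(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """List tracks that are missing or below their minimum, in sorted order."""
    return sorted(
        track
        for track, minimum in thresholds.items()
        if scores.get(track, 0.0) < minimum
    )


# Placeholder scores for illustration, not measured results.
scores = {
    "generation": 0.85,
    "repair": 0.45,
    "multilingual": 0.65,
    "fim": 0.70,
    "preference": 0.60,
}

average = sum(scores.values()) / len(scores)
failing = failing_tracks(scores, THRESHOLDS)
print(f"average: {average:.2f}")
print("failing:", failing)
```

With these placeholder numbers the average alone looks acceptable while the repair track fails its minimum. A missing track is treated as a failure rather than skipped, so an incomplete eval run cannot pass by omission.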

The broader point: contamination, judge bias, and benchmark saturation are now the hard problems in code evaluation, and a launch post that runs a dozen or more benchmarks instead of one is implicitly acknowledging that no single score is trustworthy on its own. Treat any vendor leaderboard as a starting hypothesis to verify against your own code, not a result.

Read the original: Qwen2.5-Coder Series: Powerful, Diverse, Practical. — Alibaba Qwen, 2024-11-12.