CodeQwen1.5 Explained: How Alibaba Benchmarks Code AI

Alibaba’s CodeQwen1.5-7B (built on Qwen1.5, 64K context, ~3T tokens of code) was launched on the strength of its evaluation suite rather than its size: HumanEval, MBPP, LiveCodeBench, MultiPL-E, SWE-Bench, CodeEditorBench, and Text-to-SQL on Spider and Bird.
The Chat model reported 83.5% pass@1 on HumanEval and 78.7% on HumanEval+, with a self-reported SWE-Bench score of 0.89 that the team says surpasses ChatGPT-3.5 on repository-level issue resolution.
The launch is unusually candid about contamination: it warns that LeetCode data in the pretraining corpus may inflate its LiveCodeBench ranking — a caveat most model launches omit.

Overview

CodeQwen1.5 is a 7-billion-parameter code model the Qwen Team at Alibaba released on April 16, 2024, built on Qwen1.5 and trained on roughly 3 trillion tokens of code across 92 programming languages with a 64K-token context window. The launch blog spends most of its length on measurement rather than architecture, which is what makes it worth reading for anyone who evaluates generated code.

What stands out is the breadth of the scorecard and the willingness to flag its own weak spots. The post runs the model through seven distinct benchmarks, each probing a different capability, and openly names a data-leakage risk in one of them. Wide coverage with stated caveats is closer to how a careful evaluation should read than the single-number leaderboard claims common at the time.

The benchmark suite and what each one measures

The evaluation is layered. Each benchmark isolates a separate skill, and the model is tested in both Base and Chat (instruction-tuned) forms where relevant.

HumanEval and MBPP: single-function synthesis

These are the standard Python program-synthesis tests, graded by running generated code against each problem’s unit tests; pass@1 is the share of problems solved by a single generated sample. CodeQwen1.5-Base scored 51.8% on HumanEval and 72.2% on MBPP zero-shot; the Chat model jumped to 83.5% on HumanEval and 77.7% on MBPP. The plus variants, HumanEval+ and MBPP+, add more test cases to catch code that passes the original tests by luck. The Chat model held 78.7% on HumanEval+ and 67.2% on MBPP+. Those drops of 4.8 and 10.5 points are the interesting part: they show how much a thinner test set can overstate a model.
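To make the effect of a thinner test set concrete, here is a minimal sketch of execution-based grading. The task, the deliberately buggy candidate, and both test lists are hypothetical; none of them come from HumanEval or HumanEval+.

```python
from typing import Callable

# Hypothetical task: return the largest element of a non-empty list.
def candidate_max(xs: list[int]) -> int:
    # Buggy on purpose: starting from 0 silently assumes a non-negative maximum.
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

# A thin "original" test set and a few extra cases in the spirit of a plus variant.
BASE_TESTS = [([1, 2, 3], 3), ([5, 1], 5)]
EXTRA_TESTS = [([-4, -2, -9], -2), ([0], 0)]

def passes(fn: Callable[[list[int]], int], tests: list[tuple[list[int], int]]) -> bool:
    """Return True only if fn gives the expected output on every test case."""
    for args, expected in tests:
        try:
            if fn(args) != expected:
                return False
        except Exception:
            # A crash counts as a failure, as it would in an execution harness.
            return False
    return True

base_ok = passes(candidate_max, BASE_TESTS)
plus_ok = passes(candidate_max, BASE_TESTS + EXTRA_TESTS)
print(base_ok, plus_ok)
```

The same candidate counts as solved under the thin set and unsolved under the extended one, which is the mechanism behind a gap between a base score and its plus variant.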

MultiPL-E: does the skill transfer across languages

HumanEval and MBPP are Python-only. MultiPL-E translates HumanEval-style problems into eight mainstream languages — Python, C++, Java, PHP, TypeScript, C#, Bash, and JavaScript — and grades each by execution, exposing a model that tops Python but collapses on Bash. CodeQwen1.5 reported strong results across all eight, a broader claim than any single Python number.
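One way to keep a weak language from hiding inside an average is to report per-language pass rates alongside the mean. The sketch below assumes a hypothetical list of (language, passed) records; the numbers are invented for illustration and are not CodeQwen1.5 results.

```python
from collections import defaultdict

def per_language_pass_rate(records):
    """records: iterable of (language, passed) pairs, one per graded problem."""
    totals = defaultdict(lambda: [0, 0])  # language -> [passed, attempted]
    for language, passed in records:
        totals[language][0] += int(passed)
        totals[language][1] += 1
    return {lang: p / n for lang, (p, n) in totals.items()}

# Hypothetical outcomes for three languages, ten problems each.
records = (
    [("python", True)] * 8 + [("python", False)] * 2
    + [("java", True)] * 7 + [("java", False)] * 3
    + [("bash", True)] * 2 + [("bash", False)] * 8
)

rates = per_language_pass_rate(records)
mean_rate = sum(rates.values()) / len(rates)
weakest = min(rates, key=rates.get)  # language with the lowest pass rate
print(rates, round(mean_rate, 3), weakest)
```

Reporting the minimum next to the mean makes a collapse on one language visible even when the headline average still looks respectable.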

LiveCodeBench: contamination-aware scoring

LiveCodeBench collects competitive-programming problems with timestamps, so you can score a model only on problems published after its training cutoff, which limits how much of the score comes from memorized solutions. The launch evaluates over a September 2023 to April 2024 window and reports CodeQwen1.5 among the top open-access models. Then it adds the caveat that matters: LeetCode data in the pretraining corpus may have contributed to that performance. Even a contamination-aware benchmark can overstate a model if the training data overlaps the problem source.
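The windowing idea can be sketched in a few lines. The `Problem` record, its field names, and the sample data are assumptions made for illustration and do not reflect LiveCodeBench's actual schema; the exact start and end days are also placeholders for the month-level window the post describes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    published: date   # date the problem first appeared publicly
    solved: bool      # whether the model's solution passed all tests

def windowed_pass_rate(problems: list[Problem], start: date, end: date) -> Optional[float]:
    """Pass rate over problems published within [start, end], or None if there are none."""
    in_window = [p for p in problems if start <= p.published <= end]
    if not in_window:
        return None
    return sum(p.solved for p in in_window) / len(in_window)

# Hypothetical results: two older problems and two inside the window.
problems = [
    Problem("a", date(2023, 5, 1), True),
    Problem("b", date(2023, 8, 15), True),
    Problem("c", date(2023, 10, 2), True),
    Problem("d", date(2024, 2, 20), False),
]

overall = sum(p.solved for p in problems) / len(problems)
windowed = windowed_pass_rate(problems, date(2023, 9, 1), date(2024, 4, 30))
print(overall, windowed)
```

A filter like this only removes problems by date. It cannot detect overlap between the training corpus and the source the problems were drawn from, which is the residual risk the launch flags.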

SWE-Bench and CodeEditorBench: real repositories, real edits

SWE-Bench moves from toy functions to actual GitHub issues: the model reads a repository and a bug report and must produce a patch that resolves the issue, scored by whether the project’s own tests pass afterward. CodeQwen1.5 reported a score of 0.89 and claimed it surpasses ChatGPT-3.5 on this task. CodeEditorBench tests a different muscle, editing existing code across four axes (debugging, translation, requirement switching, and code polishing), where the model claimed state-of-the-art at the 7B scale.
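As a rough sketch of test-based scoring for repository patches: an issue counts as resolved only when the tests that the fix is meant to repair now pass and the tests that passed before still pass. The function names, data layout, and sample instances below are hypothetical, and this is not the official SWE-Bench harness.

```python
def is_resolved(fail_to_pass, pass_to_pass, outcomes):
    """outcomes maps a test name to True if it passed after the patch was applied."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from outcomes (for example, it never ran) counts as a failure.
    return all(outcomes.get(name, False) for name in required)

def resolved_rate(instances):
    """Fraction of (fail_to_pass, pass_to_pass, outcomes) instances that are resolved."""
    if not instances:
        raise ValueError("need at least one instance")
    return sum(is_resolved(*inst) for inst in instances) / len(instances)

# Hypothetical instances: one clean fix, one regression, one incomplete fix, one clean fix.
instances = [
    (["test_bug"], ["test_old"], {"test_bug": True, "test_old": True}),
    (["test_bug"], ["test_old"], {"test_bug": True, "test_old": False}),
    (["test_bug"], ["test_old"], {"test_old": True}),
    (["test_bug"], [], {"test_bug": True}),
]
rate = resolved_rate(instances)
print(rate)
```

Whether a reported figure is a fraction like this or a percentage depends on the evaluation setup, which is one reason to confirm the definition before comparing numbers across sources.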

Spider and Bird: structured Text-to-SQL

Both benchmarks ask the model to turn a natural-language question into a SQL query against a real schema, graded on whether the query executes and returns the right rows. CodeQwen1.5-Chat ranked second on both, close to GPT-4, credited to synthetic SQL data used in pretraining and fine-tuning.
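Execution-based grading for Text-to-SQL can be sketched with Python's built-in `sqlite3` module. The schema, rows, and queries here are hypothetical, and comparing results as unordered multisets is a simplifying assumption; it is not the official Spider or Bird evaluator and it ignores any required `ORDER BY`.

```python
import sqlite3
from collections import Counter

def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute is graded as wrong.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not matter.
    return Counter(predicted_rows) == Counter(gold_rows)

# Hypothetical in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singers VALUES (?, ?)",
    [("Ana", 25), ("Ben", 31), ("Cleo", 44)],
)

gold = "SELECT name FROM singers WHERE age > 30"
same_rows = execution_match(conn, "SELECT name FROM singers WHERE age >= 31", gold)
wrong_rows = execution_match(conn, "SELECT name FROM singers", gold)
broken_sql = execution_match(conn, "SELEC name FROM singers", gold)
print(same_rows, wrong_rows, broken_sql)
```

Grading on returned rows rather than query text means a differently written but equivalent query still scores as correct, while a syntactically plausible query that returns extra rows does not.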

The long-context evaluations

The 64K window gets its own tests rather than being asserted. The team measured perplexity trending down as sequence length grew on real GitHub repositories, and ran a “Needle in Code” retrieval task that buries a function deep in a large codebase and checks the model can find it within 64K tokens. SWE-Bench doubles as a long-context test here, since resolving a real issue means reading far more code than HumanEval ever requires — which is why the repository-level scores are partly a context-length result.
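A retrieval test of this kind can be assembled roughly as follows. This is a sketch of the general needle-in-a-haystack idea, not the team's actual "Needle in Code" setup: the filler functions, the needle, and the answer check are all hypothetical, and a real test would measure length in tokens with the model's tokenizer rather than in functions.

```python
def make_filler(i: int) -> str:
    """A small distractor function; hypothetical filler code."""
    return f"def helper_{i}(x):\n    return x + {i}\n"

# Hypothetical needle: a function whose return value appears nowhere else.
NEEDLE = "def secret_rate_limit():\n    return 4217\n"

def build_haystack(n_fillers: int, depth: float) -> str:
    """Place the needle at a fractional depth (0.0 = start, 1.0 = end) among fillers."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    fillers = [make_filler(i) for i in range(n_fillers)]
    idx = round(depth * n_fillers)
    return "\n".join(fillers[:idx] + [NEEDLE] + fillers[idx:])

def found_needle(model_answer: str) -> bool:
    """Check whether the answer to 'what does secret_rate_limit return?' is right."""
    return "4217" in model_answer

haystack = build_haystack(n_fillers=100, depth=0.5)
print(len(haystack), NEEDLE in haystack)
```

Sweeping `depth` and the amount of filler gives a grid of context lengths and needle positions, which is the usual way to show where retrieval starts to degrade.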

Why it matters for code-generation QA

The takeaway for teams evaluating AI-generated code is the structure of this scorecard, not the model itself. A credible code-model claim should look like this one: multiple benchmarks that each test a distinct skill, execution-based grading rather than string similarity, and an explicit window for contamination. If a vendor reports one HumanEval number and nothing else, you are looking at marketing, not evidence.

The contamination caveat is the most reusable lesson. CodeQwen1.5’s own team flagged that LeetCode data in pretraining could lift its LiveCodeBench score — so when you read any competitive-programming result, ask where the problems came from and whether they could already be in the training set. Timestamped, post-cutoff evaluation is the minimum defense, and even that is not airtight when the data source overlaps.

Treat the headline figures with proportion. The single-function HumanEval and MBPP scores describe a narrow task that looks nothing like shipping a fix to a live service, which is why the SWE-Bench and CodeEditorBench results are the more honest signal for production work. The 0.89 SWE-Bench figure is also self-reported and not fully defined — confirm the evaluation setup before comparing one vendor’s number against another’s. For your own pipeline, copy the layering: synthesis tests, multilingual coverage, repository-level edits, and a contamination window, then add tasks from your own codebase that none of these public benchmarks touch.

Read the original: Code with CodeQwen1.5 — Alibaba Qwen, 2024-04-16.
