WizardCoder: How Microsoft Evolved Code LLM Benchmarks

  • WizardCoder fine-tunes the 15B StarCoder base model on 78k code instructions generated by Evol-Instruct, an automated method that rewrites simple prompts into harder ones.
  • The paper reports 57.3 pass@1 on HumanEval, a +22.3 point jump over the best prior open-source code model (35.0), plus +8.2 on MBPP (51.8 vs 43.6).
  • It is benchmarked across HumanEval, HumanEval+, MBPP, DS-1000, and MultiPL-E, and the released model surpasses Anthropic’s Claude and Google’s Bard on HumanEval despite being smaller.

Overview

WizardCoder is a 2023 paper from Microsoft and Hong Kong Baptist University (Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, et al.) that takes an instruction-tuning trick built for general chat models and applies it to code. The trick is Evol-Instruct: instead of hand-writing training instructions, you start with a seed set and use an LLM to evolve each one into something more complex, then fine-tune on the result. The contribution is not a new model architecture. It is a recipe plus a thorough benchmark evaluation that shows a mid-size open model can close most of the gap to closed systems on standard code tasks.

For anyone who evaluates AI-generated code, the value of the paper is in how it measures correctness across five different benchmarks rather than relying on a single number.

How Evol-Instruct works for code

Evol-Instruct started life in WizardLM as a way to generate harder chat instructions automatically. WizardCoder adapts the prompt to the coding domain. You feed the model an existing instruction and ask it to produce a more demanding version through a fixed set of operations.

  • Add constraints or extra requirements to the task.
  • Replace a commonly used requirement with a less common, more specific one.
  • Add reasoning steps when the original task can be solved in only a few.
  • Provide a piece of erroneous code as a reference, which adds misdirection.
  • Raise the time or space complexity requirements the solution must meet.

Running several rounds of this evolution turns a seed set (the paper starts from the roughly 20k-example Code Alpaca dataset) into about 78k evolved instructions. The team then fine-tunes StarCoder, a 15B-parameter open base model, on that data. The cost is mostly the generation step; the fine-tune itself is conventional supervised tuning.
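
The evolution loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prompt wording, the operation strings, the `evolve` and `build_evolve_prompt` helpers, the choice to pool every round with the seeds, and the `fake_llm` stand-in are all assumptions made for the example. A real run would replace `fake_llm` with a call to an actual model.

```python
import random
from typing import Callable, List

# Paraphrased evolution operations; the exact wording is an assumption.
EVOLUTION_OPS = [
    "Add new constraints or requirements to the task.",
    "Replace a generic requirement with a more specific one.",
    "Require more reasoning steps to reach a correct answer.",
    "Raise the time or space complexity requirements.",
]


def build_evolve_prompt(instruction: str, op: str) -> str:
    """Wrap one instruction in a request to make it harder using one operation."""
    return (
        "Rewrite the programming task below so it is more difficult.\n"
        f"Method: {op}\n\n"
        f"Task:\n{instruction}\n\n"
        "Harder task:"
    )


def evolve(
    seeds: List[str],
    llm: Callable[[str], str],
    rounds: int,
    rng: random.Random,
) -> List[str]:
    """Run `rounds` of evolution; return the seeds plus every evolved instruction."""
    pool = list(seeds)
    current = list(seeds)
    for _ in range(rounds):
        next_round = []
        for instruction in current:
            op = rng.choice(EVOLUTION_OPS)
            evolved = llm(build_evolve_prompt(instruction, op)).strip()
            # Drop empty or unchanged outputs instead of training on them.
            if evolved and evolved != instruction:
                next_round.append(evolved)
        pool.extend(next_round)
        current = next_round
    return pool


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: echoes the task with a marker."""
    task = prompt.split("Task:\n", 1)[1].split("\n\nHarder task:", 1)[0]
    return task + " [harder]"


dataset = evolve(
    ["Write a function that reverses a string."],
    fake_llm,
    rounds=3,
    rng=random.Random(0),
)
print(len(dataset))  # one seed plus one evolved instruction per round
```

Each round feeds on the output of the previous one, so difficulty compounds, while the filter keeps degenerate rewrites out of the training pool.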

How the evaluation is designed

The paper leans on five public benchmarks, each probing a different slice of code generation. This matters because a model can look strong on one and weak on the rest.

The benchmarks

  • HumanEval — 164 hand-written Python problems. The model generates a function body from a docstring, and the answer is judged by running unit tests.
  • HumanEval+ — the same problems with far more test cases added, which catches solutions that pass the original tests but are actually wrong.
  • MBPP — roughly 1,000 entry-level Python problems, again checked by execution.
  • DS-1000 — 1,000 realistic data-science problems using libraries like NumPy and pandas, closer to working code than puzzle-style tasks.
  • MultiPL-E — HumanEval and MBPP translated into many programming languages to test beyond Python.
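
The difference between HumanEval and HumanEval+ is easiest to see with a toy case. The sketch below is hypothetical and not taken from either benchmark: `is_sorted_candidate` is a deliberately wrong solution, and `base_tests` and `extra_tests` are made-up test sets that play the roles of a thin suite and an extended one.

```python
def is_sorted_candidate(xs):
    """Plausible but wrong: only compares the first two elements."""
    return len(xs) < 2 or xs[0] <= xs[1]


def passes(func, tests):
    """True only if func returns the expected value for every (args, expected) pair."""
    return all(func(*args) == expected for args, expected in tests)


# A thin suite: the bug is invisible because every failing input
# already breaks the ordering in its first two elements.
base_tests = [
    (([1, 2, 3],), True),
    (([3, 1, 2],), False),
    (([],), True),
]

# The extended suite adds inputs whose disorder appears later in the list.
extra_tests = base_tests + [
    (([1, 3, 2],), False),
    (([2, 2, 1],), False),
]

print(passes(is_sorted_candidate, base_tests))   # True
print(passes(is_sorted_candidate, extra_tests))  # False
```

The same completion counts as solved under the thin suite and as failed under the extended one, which is why scores on the two benchmarks can differ for an identical model.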

The metric: pass@1

Every score is pass@1. The model gets one attempt per problem, the generated code runs against the test suite, and it counts only if every test passes. There is no partial credit and no human grading. Execution against tests is the whole point: it measures whether the code actually works, not whether it looks plausible.
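
A minimal sketch of execution-based pass@1 scoring follows. It assumes each problem is a dict with a hypothetical `completion` string and a `tests` string of assert statements; neither field name comes from the benchmarks. It also runs code in-process with `exec`, which is only acceptable for a toy example: a real harness would isolate untrusted code in a sandboxed subprocess with a timeout.

```python
from typing import Dict, List


def run_candidate(code: str, test_code: str) -> bool:
    """Run the candidate and its tests in a fresh namespace; any exception is a fail."""
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)       # define the generated function
        exec(test_code, namespace)  # run the assertions against it
    except Exception:
        return False
    return True


def pass_at_1(problems: List[dict]) -> float:
    """Percentage of problems whose single completion passes every test."""
    solved = sum(run_candidate(p["completion"], p["tests"]) for p in problems)
    return 100.0 * solved / len(problems)


problems = [
    {
        "completion": "def add(a, b):\n    return a + b\n",
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        # Wrong on purpose: returns True for odd numbers.
        "completion": "def is_even(n):\n    return n % 2 == 1\n",
        "tests": "assert is_even(4)\nassert not is_even(3)\n",
    },
]

score = pass_at_1(problems)
print(score)  # 50.0
```

A single failed assertion fails the whole problem, which is the all-or-nothing behavior described above.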

Key findings

On HumanEval the paper reports 57.3 pass@1, which is 22.3 points above the strongest open-source code model it compares against (35.0). On MBPP it reports 51.8 versus 43.6 for the prior best, a +8.2 gain. The released 15B model pushes HumanEval to 59.8 and HumanEval+ to 52.4.

The more interesting comparison is against closed models. On HumanEval the released WizardCoder beats Anthropic’s Claude (53.0) and Google’s Bard (44.5), placing it third, behind only GPT-4 and GPT-3.5, when the paper was published. It does this at 15B parameters, far smaller than those systems. The takeaway is that targeted, harder instruction data buys more than raw scale on these particular tasks.

Why it matters for code-generation QA

The honest lesson here is methodological, not just about one model. WizardCoder is evidence that the way you build a training set can move correctness more than parameter count, and that you should never trust a single benchmark. A model that wins HumanEval can still stumble on DS-1000’s library-heavy tasks or fall apart in non-Python languages on MultiPL-E. If your team only checks one suite, you are measuring one narrow skill.

There are real caveats. Pass@1 on a fixed benchmark rewards passing known tests, and HumanEval+ exists precisely because the original test sets were too thin to catch wrong answers. Benchmark contamination is a live risk: as these datasets age, they leak into pretraining corpora, and a high score can reflect memorization rather than reasoning. The benchmarks also test self-contained functions, not the messy work of editing a large repository, handling security, or matching a team’s conventions.

For practitioners evaluating AI-generated code, the practical guidance is to copy the structure, not the leaderboard. Use execution-based tests as the source of truth. Run more than one benchmark so a single strong number cannot hide weaknesses. Add adversarial or extra test cases the way HumanEval+ does. And maintain a private evaluation set the model has never seen, because public benchmarks slowly stop measuring what you think they measure. WizardCoder fits the broader shift in code evaluation away from text-similarity scores and toward running the code and checking whether it is correct.
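
One way to apply this guidance is to report every suite side by side and flag any that trail the best one. The sketch below is illustrative only: the suite names, the boolean per-problem results, and the 15-point `gap_threshold` are assumptions, not values from the paper.

```python
from typing import Dict, List, Tuple


def pass_rate(results: List[bool]) -> float:
    """Percentage of problems that passed all of their tests."""
    return 100.0 * sum(results) / len(results)


def summarize(
    suites: Dict[str, List[bool]],
    gap_threshold: float = 15.0,
) -> Tuple[Dict[str, float], List[str]]:
    """Return per-suite pass rates and the suites trailing the best by the threshold."""
    rates = {name: pass_rate(results) for name, results in suites.items()}
    best = max(rates.values())
    weak = sorted(
        name for name, rate in rates.items() if best - rate >= gap_threshold
    )
    return rates, weak


# Hypothetical per-problem outcomes from three execution-based suites.
suites = {
    "public_python": [True, True, True, False],
    "extra_tests": [True, True, False, False],
    "private_set": [True, False, False, False],
}

rates, weak = summarize(suites)
print(rates)  # {'public_python': 75.0, 'extra_tests': 50.0, 'private_set': 25.0}
print(weak)   # ['extra_tests', 'private_set']
```

Keeping the numbers separate rather than averaging them is the point: a strong public score cannot mask a weak result on the extended or private sets.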

Read the original: WizardCoder: Empowering Code Large Language Models with Evol-Instruct — Microsoft, 2023-06-14.