DeepSeek-Coder-V2: How HumanEval and MBPP Are Scored

DeepSeek-Coder-V2 reports 90.2% on HumanEval and 76.2% on MBPP, both run through the EvalPlus pipeline rather than the original test suites.
It was the first open-source model to clear 10% on SWE-bench, landing at 12.7%, alongside 43.4% on LiveCodeBench.
The model is a Mixture-of-Experts continued from a DeepSeek-V2 checkpoint on 6 trillion more tokens, with 338-language coverage and a 128K-token context.

Overview

This is DeepSeek-AI’s June 2024 technical report for DeepSeek-Coder-V2, and its argument is a benchmark argument: an open-weights code model can match GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding tasks. The paper backs that claim with a spread of evaluations rather than a single headline number, which is the part worth reading closely.

For anyone evaluating code models, the report doubles as a reference for which benchmarks matter and how each one is scored. The numbers only mean something once you know what the harness behind them actually checks.

The architecture behind the numbers

DeepSeek-Coder-V2 is a Mixture-of-Experts (MoE) model, meaning only a fraction of its parameters activate for any given token. The team did not train it from scratch. They took an intermediate DeepSeek-V2 checkpoint and continued pre-training on 6 trillion additional tokens weighted toward source code and math. That continuation pushed language coverage from 86 to 338 programming languages and extended the context window from 16K to 128K tokens.
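To make "only a fraction of its parameters activate" concrete, here is a minimal sketch of generic top-k expert routing for a single token. It is an illustration of the general MoE idea, not DeepSeek's actual router: the function names, the toy scalar "experts", and the choice of k=2 are all assumptions made for the example.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k experts with the highest router scores for one token.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    Generic top-k gating; an assumption, not DeepSeek's exact scheme.
    """
    # Indices of the k largest logits (ties keep their original order).
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the chosen experts only; subtract the max for stability.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by weight."""
    return sum(w * experts[i](token) for i, w in top_k_route(router_logits, k))

# Hypothetical toy setup: four "experts" that are plain scalar functions.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
```

With these toy logits, only experts 1 and 3 run for the token; the other two are never evaluated, which is where the compute saving of an MoE layer comes from.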

The 128K context is a practical detail for code QA. Repository-level tasks need the model to see imports, call sites, and tests at once. A short window forces truncation, which quietly changes what you are actually measuring.

How each benchmark scores generated code

The report leans on execution-based evaluation, where a generated solution is run against tests the model never sees in its prompt and counted correct only if it passes all of them. That is stricter than text-similarity metrics, which can reward code that looks right but does not run. The strictness, though, depends entirely on the test suite.
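A minimal sketch of what execution-based scoring looks like in practice, assuming each problem supplies an entry-point name and a list of input/expected pairs. The function `passes_all` and the sample candidates are hypothetical; real harnesses also sandbox the code and enforce timeouts, which this sketch deliberately leaves out.

```python
def passes_all(candidate_src, entry_point, tests):
    """Return True only if the generated function passes every test case.

    tests is a list of (args_tuple, expected_value) pairs.
    WARNING: exec runs untrusted code; a real harness isolates this step.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)      # define the generated function
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing names, and runtime crashes count as failures.
        return False

# Hypothetical model outputs for a task "add two numbers".
tests = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
crashes = "def add(a, b):\n    return a / 0\n"

results = [passes_all(src, "add", tests) for src in (good, wrong, crashes)]
```

A candidate gets no partial credit: one failing case or one exception and the whole problem counts as unsolved.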

HumanEval and MBPP, run through EvalPlus

HumanEval is 164 hand-written Python problems, each a function signature and docstring with a handful of unit tests. MBPP is around a thousand short crowd-sourced tasks. Both became standard years ago, and both have a known weakness: their original tests are sparse enough that wrong code can slip through. EvalPlus exists to close that gap. It expands the test suites substantially, by roughly 80x for HumanEval and about 35x for MBPP, so a solution has to be correct, not merely plausible. DeepSeek-Coder-V2’s 90.2% on HumanEval and 76.2% on MBPP are EvalPlus numbers, which makes them harder to game than the base figures other reports sometimes quote.
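The effect of a denser test suite is easy to reproduce on a toy problem. The sketch below is not EvalPlus itself; the `median` task, both candidate functions, and the test lists are invented for illustration. It shows a plausible-but-wrong solution passing a sparse suite and failing once edge cases are added.

```python
def plausible_median(xs):
    # Plausible but wrong: ignores the even-length case.
    s = sorted(xs)
    return s[len(s) // 2]

def correct_median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def pass_rate(candidates, tests):
    """Fraction of candidates that pass every (input, expected) pair."""
    passed = sum(all(fn(x) == want for x, want in tests) for fn in candidates)
    return passed / len(candidates)

# Sparse suite: only odd-length inputs, so the bug is invisible.
sparse_tests = [([1, 3, 2], 2), ([5], 5)]
# Extended suite: adds even-length inputs that expose the bug.
extended_tests = sparse_tests + [([1, 2, 3, 4], 2.5), ([2, 2], 2)]

candidates = [plausible_median, correct_median]
sparse_score = pass_rate(candidates, sparse_tests)
extended_score = pass_rate(candidates, extended_tests)
```

Both candidates pass the sparse suite, so it scores them identically; the extended suite separates them. The same mechanism is why a score on an expanded suite and a score on the original suite are not interchangeable.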

LiveCodeBench, SWE-bench, and Aider

The other three benchmarks probe different failure modes. LiveCodeBench collects dated problems from contest sites, so evaluation can be restricted to problems published after a model’s training cutoff and a score there is less likely to reflect memorized solutions; DeepSeek-Coder-V2 reports 43.4%. SWE-bench is the hardest of the set: real GitHub issues where the model must produce a patch that makes a repository’s failing tests pass. At 12.7%, the model was the first open-source release to break 10% on it, a low absolute number that says more about how brutal the task is than about the model. Aider measures whether a model can edit existing files correctly in the exact diff format a tool expects, which is closer to day-to-day assisted coding than greenfield function writing.
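The contamination guard behind post-cutoff sourcing amounts to a date filter. The sketch below assumes each problem carries a release date; the `Problem` record, its field names, the problem IDs, and the cutoff date are all hypothetical and do not reflect LiveCodeBench's actual schema or any model's real cutoff.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the problem was first published

def post_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the training cutoff."""
    return [p for p in problems if p.released > training_cutoff]

# Made-up problems and cutoff, for illustration only.
pool = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2023, 12, 1)),
    Problem("p3", date(2024, 3, 1)),
]
eval_set = post_cutoff(pool, training_cutoff=date(2023, 11, 30))
eval_ids = [p.problem_id for p in eval_set]
```

The filter only helps if the cutoff is known and honest; a model trained on data past its stated cutoff would still see some of the "unseen" problems.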

Why it matters for code-generation QA

The most useful takeaway here is not the GPT-4-Turbo comparison. It is the gap between benchmarks. A model can sit at 90.2% on HumanEval and 12.7% on SWE-bench at the same time, and both are honest. HumanEval asks for one isolated function from a clean spec. SWE-bench asks for a working patch to a real codebase from a vague issue. If your team only tracks the first kind of number, you are measuring something close to autocomplete and reporting it as engineering capability.

The EvalPlus detail carries the same lesson at smaller scale. Two reports can both say HumanEval and mean different things, because the original suite and the EvalPlus suite reward different solutions. When you compare models, confirm the harness, not just the percentage. A number without its pipeline is not a measurement you can trust.
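One way to enforce "confirm the harness" in an internal tracker is to store the pipeline next to every score and refuse to compare mismatched records. This is a sketch under assumptions: the `Score` record, the harness labels, and the model names and numbers below are placeholders, not reported results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    harness: str      # e.g. "original" or "evalplus"; labels are hypothetical
    pass_rate: float

def comparable(a, b):
    """Two scores are comparable only if benchmark and harness both match."""
    return a.benchmark == b.benchmark and a.harness == b.harness

def compare(a, b):
    """Return a's lead over b, or raise if the pipelines differ."""
    if not comparable(a, b):
        raise ValueError(
            f"cannot compare {a.benchmark}/{a.harness} "
            f"with {b.benchmark}/{b.harness}"
        )
    return a.pass_rate - b.pass_rate

# Placeholder records; the numbers are made up for the example.
a = Score("model-a", "HumanEval", "evalplus", 0.80)
b = Score("model-b", "HumanEval", "evalplus", 0.75)
c = Score("model-c", "HumanEval", "original", 0.85)

lead = compare(a, b)
```

Calling `compare(a, c)` raises instead of returning a misleading difference, which turns a silent reporting mistake into a visible error.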

There are caveats the report does not dwell on. Execution-based scoring only works where you have tests, and most production code does not arrive with a hidden suite that defines correctness. Contamination is a permanent risk for HumanEval and MBPP given their age, which is exactly why LiveCodeBench’s post-cutoff sourcing earns its place. And a single aggregate per benchmark hides the distribution: which languages, which problem types, and how the model fails when it fails. For internal evaluation, treat this paper as a menu of complementary tests rather than a leaderboard. The point of running five benchmarks is that no one of them tells you whether generated code is safe to ship.
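To see how a single aggregate hides the distribution, it helps to break results down by a grouping key before averaging. The sketch below assumes per-problem records with a boolean `passed` field and a `language` field; both field names and the toy data are hypothetical.

```python
from collections import defaultdict

def breakdown(results, key):
    """Pass rate per group; each result is a dict with a boolean 'passed'."""
    totals = defaultdict(lambda: [0, 0])   # group -> [passed, attempted]
    for r in results:
        bucket = totals[r[key]]
        bucket[0] += r["passed"]
        bucket[1] += 1
    return {group: p / n for group, (p, n) in totals.items()}

# Made-up per-problem outcomes.
results = [
    {"language": "python", "passed": True},
    {"language": "python", "passed": True},
    {"language": "rust", "passed": False},
    {"language": "rust", "passed": False},
]

overall = sum(r["passed"] for r in results) / len(results)
by_language = breakdown(results, "language")
```

Here the aggregate is 50%, while the per-language view shows every Python problem passing and every Rust problem failing, two facts the headline number cannot express.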

Read the original: DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence — DeepSeek, 2024-06-17.
