DeepSeek-Coder Explained: HumanEval, MBPP and Code Evals

  • DeepSeek-AI’s January 2024 paper introduced DeepSeek-Coder, a family of open code models spanning 1.3B to 33B parameters, trained on 2 trillion tokens of project-level code.
  • The 33B base model reported a 50.3% average pass@1 across the multilingual HumanEval languages and 66% on MBPP, the best open-source figures at release, and the paper claims it surpasses closed models like Codex and GPT-3.5.
  • Evaluation spans HumanEval (Python and multilingual), MBPP, DS-1000, and LeetCode contest problems, all graded by executing the generated code. On the training side, the models use a fill-in-the-blank objective over whole repositories.

Overview

DeepSeek-Coder is the foundational code-model paper from DeepSeek-AI, published January 25, 2024 by Daya Guo, Qihao Zhu, and colleagues. It trains a family of open code models from scratch and then measures them against the standard public benchmarks for program synthesis, claiming state-of-the-art results among open-source models and performance that matches or beats closed models.

What makes it worth studying is the training recipe and the breadth of evaluation. The models were pretrained at the repository level rather than on isolated files, which is closer to how engineers actually work, and the paper reports across four distinct benchmarks instead of one. That combination tells you something about what a credible code-model claim should look like.

How DeepSeek-Coder was trained

The corpus is 2 trillion tokens drawn from a high-quality, project-level code dataset spanning many languages. The project-level choice matters. Most early code datasets shuffled individual files, which severs the relationships between a function and the modules it imports, the tests that cover it, and the configuration it reads. Training on whole repositories keeps those dependencies intact.
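One way to keep those dependencies intact when flattening a repository into a single training sequence is to order files so that each one appears after the local modules it imports. The sketch below is only an illustration of that idea, not a description of the paper's actual pipeline: the `order_files` helper, the regex-based import detection, and the toy repository are all assumptions made for the example.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical, deliberately crude import detector: it only sees
# "import x" and "from x import y" at the start of a line.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)


def order_files(repo: dict[str, str]) -> list[str]:
    """Return file names ordered so each file follows the local files it imports."""
    # Map module name -> file name, e.g. "utils" -> "utils.py".
    modules = {name.removesuffix(".py"): name for name in repo}
    graph = {}
    for name in sorted(repo):
        imported = IMPORT_RE.findall(repo[name])
        # Keep only imports that resolve to another file in this repository.
        graph[name] = {modules[m] for m in imported if m in modules and modules[m] != name}
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(graph).static_order())


# Toy repository (illustrative contents only).
repo = {
    "app.py": "import utils\nimport config\n",
    "utils.py": "import config\n",
    "config.py": "DEBUG = True\n",
}
order = order_files(repo)
print(order)
```

With this toy input the only valid ordering puts `config.py` first and `app.py` last. Circular imports would raise `graphlib.CycleError`, so a real pipeline would need a policy for cycles.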

Fill-in-the-blank pretraining

Rather than only predicting the next token left to right, DeepSeek-Coder uses a fill-in-the-blank objective: given the code before and after a gap, predict the missing span. This is the same idea as fill-in-the-middle, and it maps directly onto the way real editing happens — you rarely write a file top to bottom, you insert and modify in the middle. A 16K-token context window lets the model condition on a meaningful slice of a file or several related files at once.
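A minimal sketch of how a fill-in-the-blank training example can be built from a piece of code: cut out a random span, then present the prefix and suffix as input and the span as the target. The sentinel strings, the prefix-suffix-middle layout, and the `make_fim_example` helper are assumptions chosen for illustration; a real model defines its own special tokens and may use a different layout or sampling scheme.

```python
import random

# Placeholder sentinels for illustration; not any model's real special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<PRE>", "<SUF>", "<MID>"


def make_fim_example(code: str, rng: random.Random) -> tuple[str, str]:
    """Cut a non-empty random span out of `code`; return (model input, target span)."""
    # Two distinct cut points, sorted, so the middle span is never empty.
    start, end = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    # Prefix-suffix-middle layout: the model sees both sides, then emits the gap.
    prompt = f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    return prompt, middle


snippet = "def add(a, b):\n    return a + b\n"
prompt, target = make_fim_example(snippet, random.Random(0))
print(repr(prompt))
print(repr(target))
```

Because the target is placed last, an ordinary left-to-right model can be trained on the concatenated sequence without any architectural change.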

The benchmark suite and what each one tests

Four benchmarks carry the evaluation, and they probe different abilities.

HumanEval, Python and multilingual

HumanEval is 164 hand-written Python problems, each a function signature and a docstring. The model writes the body, and the body runs against hidden unit tests: pass them all and the problem counts as solved. The paper also runs a multilingual version that grades solutions in languages beyond Python, and holding up across languages is a stronger claim than topping a single one.
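The grading loop described above can be sketched in a few lines. This is a simplified illustration that assumes a HumanEval-style record with a prompt, a test block that defines `check(candidate)`, and an entry-point name; the `passes_tests` helper and the toy problem are made up for the example. It runs code in-process with `exec`, which is unsafe for untrusted model output, so a real harness would add sandboxing and timeouts.

```python
def passes_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True only if prompt + completion defines a function that passes every test."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)        # define the candidate function
        exec(test_code, namespace)                  # define check(candidate)
        namespace["check"](namespace[entry_point])  # any failed assert raises
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


# Toy problem (not taken from the benchmark).
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
tests = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)

good_result = passes_tests(prompt, "    return a + b\n", tests, "add")
bad_result = passes_tests(prompt, "    return a - b\n", tests, "add")
print(good_result, bad_result)
```

The all-or-nothing return value is the point: a completion that passes one assertion but fails another scores the same as one that does not parse.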

MBPP

MBPP (Mostly Basic Programming Problems) is a separate set of short Python tasks, each paired with assert-style test cases. It overlaps with HumanEval in spirit but uses a different problem distribution, so reporting both reduces the chance a model is simply tuned to one benchmark’s quirks.

DS-1000 and LeetCode

DS-1000 moves out of toy territory into 1,000 data-science problems that use real libraries like NumPy and pandas, so the model has to know specific APIs, not just control flow. The LeetCode contest set goes further still: problems published after the training cutoff, which is the cleanest available defense against contamination. If a model has memorized the answer, fresh contest problems expose it. All four benchmarks grade by running the code, not by comparing text.
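The after-cutoff idea reduces to a date filter over the problem pool. The sketch below assumes each problem carries a publication date and that the training cutoff is known; the `Problem` class, the slugs, and every date shown are hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    slug: str        # hypothetical identifier
    published: date  # when the problem first became public


def holdout_after_cutoff(problems: list[Problem], cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the training-data cutoff."""
    return [p for p in problems if p.published > cutoff]


# Made-up pool and cutoff, for illustration only.
pool = [
    Problem("old-two-sum-variant", date(2021, 5, 1)),
    Problem("on-the-boundary", date(2023, 1, 1)),
    Problem("fresh-contest-q3", date(2023, 9, 10)),
]
holdout = holdout_after_cutoff(pool, cutoff=date(2023, 1, 1))
print([p.slug for p in holdout])
```

The strict comparison excludes problems published on the cutoff day itself, which is the conservative choice when the exact crawl time is uncertain.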

What the numbers say

The headline figures for the 33B base model are a 50.3% average pass@1 across the multilingual HumanEval languages and 66% on MBPP. Read them honestly. Roughly half the HumanEval problems are solved on the first attempt, which means half are not, and these are self-contained functions rather than production tickets. The paper’s larger claim is comparative: these were the strongest open-source results at the time, and the abstract states DeepSeek-Coder surpasses closed models including Codex and GPT-3.5. The instruction-tuned variants push higher on instruction-following tasks, which is where most chat-style coding assistants actually operate.

Pass@1 is the strict single-shot number: generate once, run the tests, count the pass. Quoting the metric and the k matters because pass@k can only stay flat or rise as k grows, so pass@10 or pass@100 is never lower than pass@1 on the same model and is usually much higher. A score without its k is not a score you can compare.
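To make the effect of k concrete, here is the widely used unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): draw n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The `pass_at_k` function name and the sample counts below are illustrative assumptions; this article does not say how many samples the DeepSeek-Coder evaluation drew.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n with c correct, passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 20 samples for one problem, 5 of them pass.
p1 = pass_at_k(n=20, c=5, k=1)    # 0.25, the plain fraction correct
p10 = pass_at_k(n=20, c=5, k=10)  # about 0.984
print(p1, p10)
```

The same model on the same problem moves from 25% to roughly 98% purely by changing k. The benchmark-level score is the mean of this value over all problems.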

Why it matters for code-generation QA

The reusable lesson here is the evaluation discipline, not the leaderboard position, which newer models have long since passed. DeepSeek-Coder’s report does the things a trustworthy claim should do: it names each benchmark, grades by execution against hidden tests, and reports across four datasets that stress different skills. The LeetCode-after-cutoff choice is the part worth copying most. Contamination is the quiet failure mode of every code benchmark — once HumanEval and MBPP solutions are all over the public web, a high score can mean memorization as easily as capability. Holding out problems the model could not have seen is one of the few real defenses.

The caveats are familiar but load-bearing. HumanEval and MBPP are short, single-function, unambiguous problems, which is the opposite of most engineering work that spans files, carries existing state, and ships with vague requirements. DS-1000 is closer to real work and is the more informative number for teams doing data and ML code. A 50% pass@1 on isolated functions predicts very little about a model refactoring a live service. The repo-level training and fill-in-the-blank objective are exactly the capabilities production coding leans on, yet the function-from-spec benchmarks barely exercise them, so a strong HumanEval figure says nothing about infilling quality.

If you are building an eval pipeline, take the structure: write executable tests, sample more than once, count what passes, and report the metric with its k. Then extend past where these benchmarks stop — multi-file context, your own repository’s tasks, adversarial and edge-case inputs, and a contamination-resistant holdout you control. Per-language scores diverge, so test every language you ship rather than trusting a Python number to transfer.

Read the original: DeepSeek-Coder: When the Large Language Model Meets Programming — The Rise of Code Intelligence — DeepSeek, 2024-01-25.