Codestral 25.01: Inside Mistral’s Code Benchmarks

Codestral 25.01 scores 86.6% on HumanEval (Python) and 80.2% on MBPP, and debuted at #1 on the LMSys Copilot Arena leaderboard.
Its fill-in-the-middle results are the standout: 95.3% average pass@1 across Python, Java, and JavaScript, with 85.89% single-line exact match.
Mistral reports the model generates and completes code about 2x faster than Codestral-2405 (22B), with a 256k-token context and a reworked tokenizer.

Overview

Codestral 25.01 is Mistral’s January 2025 refresh of its dedicated code model, and the launch leans almost entirely on a benchmark table. The headline is a #1 debut on LMSys Copilot Arena, but the more instructive part for anyone building a code-eval pipeline is the spread of numbers underneath it: separate scores for generation, fill-in-the-middle, SQL, and seven programming languages, each measuring a different thing.

The model targets autocomplete and inline completion inside IDEs, not chat. That focus shapes which benchmarks Mistral chose to report and explains why the fill-in-the-middle figures get as much weight as the classic HumanEval line.

What the benchmark table actually measures

Mistral splits its evaluation into three families: from-scratch generation, code editing, and fill-in-the-middle. Each rewards a different behavior, and a single average across them would hide more than it shows.

Generation: HumanEval, MBPP, and the contest benchmarks

HumanEval is 164 hand-written Python problems, each a function signature plus a docstring, scored by running the model's output against unit tests the model never sees. Codestral 25.01 lands at 86.6%. MBPP, a larger set of short crowd-sourced tasks scored the same way, comes in at 80.2%. These two are the oldest benchmarks in the table and the easiest to over-trust, because both have been public long enough to leak into training data.
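To make the scoring concrete, here is a minimal sketch of execution-based grading. It assumes each problem ships a test snippet that defines `check(candidate)`, and the two toy problems are invented for illustration, not taken from HumanEval or MBPP. A real harness would sandbox the execution and enforce timeouts; this one does neither.

```python
def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the generated function passes its unit tests.

    WARNING: exec() on model output is unsafe outside a sandbox.
    This is an illustration only.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)              # define the generated function
        exec(test_src, namespace)                   # define check(candidate)
        namespace["check"](namespace[entry_point])  # run the assertions
    except Exception:
        return False
    return True


def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems solved with one sample per problem."""
    return sum(results) / len(results)


# Two hypothetical problems: the second candidate has a deliberate bug.
problems = [
    {
        "entry_point": "add",
        "candidate": "def add(a, b):\n    return a + b\n",
        "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    },
    {
        "entry_point": "is_even",
        "candidate": "def is_even(n):\n    return n % 2 == 1\n",
        "test": "def check(candidate):\n    assert candidate(4) is True\n",
    },
]

results = [
    passes_tests(p["candidate"], p["test"], p["entry_point"]) for p in problems
]
score = pass_at_1(results)
```

The score is only as strong as the tests: a candidate that passes a thin test suite still counts as solved.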

The harder numbers sit lower. CruxEval, which tests whether a model can predict the output of a given program rather than write one, reaches 55.5%. LiveCodeBench pulls contest problems released after the training cutoff to dodge contamination, and the model scores 37.9% there. RepoBench, which measures completion using context drawn from across a repository, lands at 38.0%. The gap between 86.6% on HumanEval and roughly 38% on the contamination-resistant and repository-aware tasks is the most honest signal in the whole release.
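Output prediction is easy to confuse with generation, so a small sketch may help. The function below grades a predicted output by running the program and comparing values. The function name, the sample program, and the convention that predictions arrive as Python literals are all assumptions made for illustration, not CruxEval's actual harness.

```python
import ast


def output_prediction_correct(
    program_src: str, func_name: str, args: tuple, predicted_literal: str
) -> bool:
    """Check a model's predicted output against the program's real output."""
    namespace: dict = {}
    exec(program_src, namespace)             # trusted benchmark code, not model output
    actual = namespace[func_name](*args)
    try:
        predicted = ast.literal_eval(predicted_literal)  # parse the model's answer
    except (ValueError, SyntaxError):
        return False                         # unparseable answers count as wrong
    return predicted == actual


# Hypothetical benchmark item.
PROGRAM = "def f(s):\n    return s[::-1].upper()\n"

right = output_prediction_correct(PROGRAM, "f", ("abc",), "'CBA'")
wrong = output_prediction_correct(PROGRAM, "f", ("abc",), "'cba'")
malformed = output_prediction_correct(PROGRAM, "f", ("abc",), "CBA")
```

Here the model writes no code at all. It only has to reason about what existing code does, which is why this score can diverge sharply from a generation score.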

SQL and the multi-language spread

Spider, a text-to-SQL benchmark, gets its own line at 66.5%, a reminder that query generation is a distinct skill from writing Python. The multi-language HumanEval results show how unevenly capability distributes across languages: Python 86.6%, JavaScript 82.6%, TypeScript 82.4%, C++ 78.9%, Java 72.8%, C# 53.2%, and Bash 43.0%, averaging 71.4%. The 43.6-point gap between Python and Bash is the kind of detail a single aggregate score erases. If your team ships Bash or C#, the Python headline tells you almost nothing.
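The arithmetic behind that point is short enough to check directly. This sketch uses only the per-language figures quoted above; the variable names are illustrative.

```python
# Per-language HumanEval scores as quoted in the article (percent).
scores = {
    "Python": 86.6,
    "JavaScript": 82.6,
    "TypeScript": 82.4,
    "C++": 78.9,
    "Java": 72.8,
    "C#": 53.2,
    "Bash": 43.0,
}

average = sum(scores.values()) / len(scores)
spread = max(scores.values()) - min(scores.values())

# Languages the aggregate flatters: those scoring below the mean.
below_average = sorted(lang for lang, s in scores.items() if s < average)

print(f"average = {average:.1f}, spread = {spread:.1f}")
print("below average:", below_average)
```

The mean lands within about two points of the Java score, yet it sits roughly 18 points above C# and 28 above Bash.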

Fill-in-the-middle, the metric that matters for autocomplete

Fill-in-the-middle (FIM) is the task an IDE completion engine actually performs: given the code before and after a cursor, predict what goes in the gap. This is harder than left-to-right generation because the model is constrained on both sides. Mistral reports two ways of scoring it.
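As a rough illustration of the task shape, the sketch below builds a single-line FIM example by masking one line of a source file. The `FimExample` type and `make_single_line_fim` helper are hypothetical names. How the prefix and suffix are serialized into a prompt is model-specific and deliberately left out.

```python
from typing import NamedTuple


class FimExample(NamedTuple):
    prefix: str   # code before the cursor
    middle: str   # ground-truth text the model must predict
    suffix: str   # code after the cursor


def make_single_line_fim(source: str, line_index: int) -> FimExample:
    """Mask one line of source and return the surrounding context."""
    lines = source.splitlines(keepends=True)
    return FimExample(
        prefix="".join(lines[:line_index]),
        middle=lines[line_index],
        suffix="".join(lines[line_index + 1:]),
    )


SOURCE = (
    "def area(w, h):\n"
    "    result = w * h\n"
    "    return result\n"
)

example = make_single_line_fim(SOURCE, line_index=1)
```

The suffix already refers to `result`, so a valid completion has to define that name. That is the two-sided constraint in miniature.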

Exact match checks whether the completion is character-for-character identical to the ground-truth line. The single-line average is 85.89% (Python 80.2%, Java 89.6%, JavaScript 87.96%). Pass@1 is more forgiving: it runs the completed code and counts it correct if tests pass, regardless of whether the text matches exactly. That average is 95.3% (Python 92.5%, Java 97.1%, JavaScript 96.1%). The gap between the two is meaningful. A completion can be functionally correct without matching the original line verbatim, which is why pass@1 sits roughly nine points higher (95.3% against 85.89%).
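The two scoring rules can disagree on the same completion, as this minimal sketch shows. The helper names and the toy example are assumptions for illustration. Stripping surrounding whitespace before comparing is a choice made here, not necessarily what Mistral's harness does, and a real harness would run the code in a sandbox.

```python
def exact_match(completion: str, reference: str) -> bool:
    """Text-level comparison (surrounding whitespace ignored by assumption)."""
    return completion.strip() == reference.strip()


def passes(prefix: str, completion: str, suffix: str, test_src: str) -> bool:
    """Execution-level check: stitch the file together and run the tests."""
    namespace: dict = {}
    try:
        exec(prefix + completion + suffix, namespace)  # unsafe outside a sandbox
        exec(test_src, namespace)
    except Exception:
        return False
    return True


PREFIX = "def area(w, h):\n"
REFERENCE = "    result = w * h\n"
SUFFIX = "    return result\n"
TEST = "assert area(3, 4) == 12\n"

# Functionally equivalent, textually different.
reordered = "    result = h * w\n"
# Textually close, functionally wrong.
buggy = "    result = w + h\n"

reordered_em = exact_match(reordered, REFERENCE)
reordered_pass = passes(PREFIX, reordered, SUFFIX, TEST)
buggy_em = exact_match(buggy, REFERENCE)
buggy_pass = passes(PREFIX, buggy, SUFFIX, TEST)
```

The reordered completion fails exact match but passes the test, which is the kind of case that separates the two averages. The buggy one fails both.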

The architecture and speed claim

Codestral 25.01 keeps the same training focus as its predecessor but ships what Mistral describes as a more efficient architecture and an improved tokenizer. The practical payoff is speed: the company reports the model generates and completes code about 2x faster than Codestral-2405. For inline completion that latency number is not cosmetic. A suggestion that arrives after the developer has already typed the next line is worse than no suggestion, so throughput is part of the product, not a footnote.

The context window is 256k tokens. That is what makes RepoBench-style evaluation viable at all: the model can hold imports, sibling files, and call sites in view instead of guessing from a truncated slice.

Why it matters for code-generation QA

The Copilot Arena debut is the marketing line, but Arena is a human-preference ranking of head-to-head completions. It tells you developers liked the suggestions in blind comparisons. It does not tell you the code was correct, secure, or maintainable. Preference and correctness are different axes, and a model can win one while losing the other.

The more durable lesson is in the table’s structure. Mistral reported FIM separately from generation, and that separation is the right instinct to copy. Most teams evaluate code models with HumanEval-style generation suites and then deploy them as autocomplete, which is a FIM task. You end up validating a capability you do not ship and shipping one you never validated. If inline completion is the product, the exact-match and pass@1 FIM numbers should drive the decision, not the HumanEval headline.

Two caveats sit underneath these results. First, every execution-based score here depends on a hidden test suite, and HumanEval and MBPP are old enough that contamination is a real risk; LiveCodeBench, with its post-cutoff sourcing, is the only benchmark partly insulated from it. Second, none of these benchmarks check for security defects, license violations, or whether a completion silently introduces a subtle behavioral change. A 95.3% pass@1 means tests passed, not that the code is safe to merge. For internal QA, treat this release as a well-organized menu of complementary tests, and add the dimension Mistral did not measure: what the generated code does when it is wrong.
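One way to act on the first caveat in an internal suite is to keep only problems published after a model's training cutoff. The sketch below assumes each problem record carries a release date. The cutoff date is a placeholder, since the article does not state Codestral 25.01's actual training cutoff, and the problem records are invented.

```python
from datetime import date


def post_cutoff(problems: list[dict], cutoff: date) -> list[dict]:
    """Keep problems released strictly after the assumed training cutoff."""
    return [p for p in problems if p["released"] > cutoff]


# Hypothetical problem records and a placeholder cutoff date.
problems = [
    {"id": "a", "released": date(2024, 3, 1)},
    {"id": "b", "released": date(2024, 11, 15)},
    {"id": "c", "released": date(2025, 2, 1)},
]
ASSUMED_CUTOFF = date(2024, 10, 1)

fresh_ids = [p["id"] for p in post_cutoff(problems, ASSUMED_CUTOFF)]
```

A date filter only reduces the risk of contamination. It does nothing for the second caveat, which needs separate security and behavior checks.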

Read the original: Codestral 25.01 — Mistral AI, 2025-01-13.

