Codestral: How Mistral Evaluates Its First Code Model

  • Codestral is Mistral’s first code model: 22B parameters, 80+ programming languages, and a 32k-token context window aimed at repository-scale completion.
  • Mistral evaluates it on six distinct code tasks — HumanEval and MBPP for Python, CruxEval for output prediction, RepoBench EM for long-range completion, Spider for SQL, and fill-in-the-middle pass@1 across Python, JavaScript, and Java.
  • On JetBrains’ Kotlin-HumanEval, Codestral scored a 73.75 pass rate at temperature 0.2, ahead of GPT-4-Turbo at 72.05 and GPT-3.5-Turbo at 54.66.

Overview

Mistral’s May 2024 Codestral announcement reads less like a marketing post and more like an eval card. The model is the company’s first dedicated to code, and the launch spends most of its space on how it was measured rather than how it was built. That choice is the interesting part: a code model is only as credible as the tasks it is scored against, and Codestral is scored against six of them.

For anyone building an evaluation pipeline for AI-generated code, the post is a compact tour of the benchmarks that matter and what each one actually checks. The numbers are secondary to the method.

What Codestral is

Codestral is a 22-billion-parameter model trained on a dataset spanning more than 80 programming languages — Python, Java, C, C++, JavaScript, Bash, and less common targets like Swift and Fortran. The headline architectural detail is the 32k-token context window, which at launch was larger than the 4k, 8k, or 16k windows of the code models Mistral compared against. Context length is not a vanity metric for code: repository-level work needs the model to hold imports, call sites, and surrounding files in view at once, and a short window forces truncation that narrows what it can reason about.

It shipped as an open-weight model under the Mistral AI Non-Production License, downloadable from Hugging Face, with a free API endpoint during an eight-week beta.

The six benchmarks, and what each one tests

Mistral groups its evaluation into tasks that probe different abilities. Treating them as one number would hide exactly the distinctions a QA team cares about.

Python correctness: HumanEval and MBPP

These are the long-standing execution-based benchmarks for Python. HumanEval is 164 hand-written problems, each a function signature and docstring with a small set of unit tests. MBPP is a larger pool of short crowd-sourced tasks; Mistral reports the sanitised split. Both are scored with pass@1: a single generated solution is executed against the benchmark's unit tests and counts only if every test passes, with no retries. This is stricter than any text-similarity metric, but the result is only as trustworthy as the test suite is thorough.
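
To make the scoring concrete, here is a minimal sketch of execution-based pass@k using the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples were drawn and c of them passed. The helper names, the toy `add` problem, and its tests are hypothetical, and Mistral's post does not describe its sampling setup, so treat this as an illustration of the metric rather than the harness Mistral ran.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k draw contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Run one candidate against assert-style tests. Not sandboxed: illustration only."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # any failed assert raises
    except Exception:
        return False
    return True


# Hypothetical problem: two sampled completions for the same prompt.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

outcomes = [passes_tests(src, tests) for src in candidates]
score = pass_at_k(n=len(outcomes), c=sum(outcomes), k=1)
print(outcomes, score)  # [True, False] 0.5
```

With k=1 the estimator reduces to c/n, the fraction of samples that pass. A real harness would run candidates in an isolated process with timeouts rather than calling `exec` in the evaluator's own interpreter.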

Reasoning about code: CruxEval

CruxEval is the task most teams overlook. Instead of asking the model to write code, it asks the model to predict what a given piece of code outputs. That separates two skills that benchmarks usually conflate: generating plausible syntax versus actually understanding execution semantics. A model can autocomplete a function correctly and still mispredict what it returns, and CruxEval is built to catch that gap.
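
A rough sketch of how an output-prediction item can be scored: run the code to get the true return value, then compare it with the literal the model predicted. The function name `score_output_prediction`, the toy function `f`, and the prediction strings are all hypothetical; CruxEval's own harness may format and compare items differently.

```python
import ast


def score_output_prediction(source: str, call: str, predicted: str) -> bool:
    """Return True if the predicted literal equals what the code actually returns."""
    namespace: dict = {}
    exec(source, namespace)            # define the function under test
    actual = eval(call, namespace)     # ground truth comes from real execution
    try:
        predicted_value = ast.literal_eval(predicted)  # parse literals only
    except (ValueError, SyntaxError):
        return False                   # unparseable prediction counts as wrong
    return predicted_value == actual


# Hypothetical item: the model sees the source and the call, and must state the output.
source = "def f(s):\n    return s[::-1].upper()\n"
call = "f('abc')"

print(score_output_prediction(source, call, "'CBA'"))  # True
print(score_output_prediction(source, call, "'ABC'"))  # False
```

Parsing the prediction with `ast.literal_eval` rather than `eval` keeps model output from executing as code; only the trusted benchmark source is run.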

Repository scale: RepoBench EM and fill-in-the-middle

RepoBench measures long-range, repository-level completion and is scored with exact match (EM) against the real next lines of code, drawing on context from across files. Fill-in-the-middle (FIM) is the complement: the model is given code before and after a gap and must produce the missing span — the pattern behind real IDE autocomplete, where the cursor sits in the middle of a file, not at the end. Mistral reports FIM pass@1 in Python, JavaScript, and Java, comparing against DeepSeek Coder 33B. These two tasks are where the 32k window earns its keep.
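
The difference between the two scoring styles can be sketched in a few lines. The example below cuts a gap out of a toy function, then scores a hypothetical model prediction two ways: by exact match against the original span, and by reassembling the file and running a test. All names here are illustrative, the whitespace-stripping rule in `exact_match` is an assumption rather than RepoBench's documented normalisation, and real FIM prompts use model-specific sentinel tokens that are not shown.

```python
from typing import Tuple


def make_fim_example(source: str, start: int, end: int) -> Tuple[str, str, str]:
    """Split source into (prefix, suffix, expected middle) around the gap [start, end)."""
    return source[:start], source[end:], source[start:end]


def exact_match(prediction: str, reference: str) -> bool:
    """EM after stripping surrounding whitespace (an assumed normalisation)."""
    return prediction.strip() == reference.strip()


def fim_passes(prefix: str, middle: str, suffix: str, test_source: str) -> bool:
    """Reassemble the file around the predicted middle and run assert-style tests."""
    namespace: dict = {}
    try:
        exec(prefix + middle + suffix, namespace)
        exec(test_source, namespace)
    except Exception:
        return False
    return True


# Toy file with the cursor gap placed over the returned expression.
source = "def area(w, h):\n    return w * h\n"
start = source.index("w * h")
prefix, suffix, middle = make_fim_example(source, start, start + len("w * h"))

prediction = "h * w"  # hypothetical model output: correct, but not verbatim
tests = "assert area(2, 3) == 6\n"

print(exact_match(prediction, middle))               # False
print(fim_passes(prefix, prediction, suffix, tests)) # True
```

The same prediction fails exact match and passes execution, which is why the two metrics are worth reporting side by side.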

SQL and multi-language: Spider plus six more

Spider evaluates text-to-SQL: turning a natural-language question into a query that runs correctly against a database schema. Separately, Mistral reports averaged HumanEval pass@1 across C++, Bash, Java, PHP, TypeScript, and C# to show the model is not just a Python specialist.
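
One common way to check "runs correctly" for text-to-SQL is execution match: run the predicted query and a reference query against the same database and compare the rows. The sketch below does this with an in-memory SQLite database; the `singer` table, its rows, and the queries are made up for illustration, and Mistral's post does not say which Spider metric it reports.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that does not run cannot be correct
    gold_rows = conn.execute(gold_sql).fetchall()
    # Sorting by repr gives a stable order even with mixed types or NULLs.
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical schema and data, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (name TEXT, age INTEGER);
    INSERT INTO singer VALUES ('Ann', 30), ('Bo', 41), ('Cy', 25);
    """
)

gold = "SELECT name FROM singer WHERE age > 28"

print(execution_match(conn, "SELECT name FROM singer WHERE NOT age <= 28", gold))  # True
print(execution_match(conn, "SELECT name FROM singer WHERE age > 40", gold))       # False
print(execution_match(conn, "SELECT nme FROM singer", gold))                       # False
```

Ignoring row order is a simplification: for questions that require an `ORDER BY`, a stricter harness would compare rows in sequence.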

Why it matters for code-generation QA

The structure of Codestral’s evaluation is the lesson, not the leaderboard position. Mistral did not pick one benchmark and optimise for it; it spread the claim across generation (HumanEval, MBPP), comprehension (CruxEval), repository context (RepoBench, FIM), and a domain task (Spider). No single metric captures what the others measure. A model that aces HumanEval can still be useless at filling a gap inside an existing file, and FIM is the workload most developers trigger dozens of times a day.

There are caveats worth holding onto. HumanEval and MBPP are old and widely scraped, so contamination is a permanent risk — strong pass@1 numbers on them say less in 2024 than they did in 2021. Exact match on RepoBench is brittle: it rewards reproducing the original line verbatim and penalises a correct alternative that a human reviewer would accept, so EM understates real capability on open-ended completion. And the cross-vendor comparisons here are Mistral’s own runs; the JetBrains Kotlin-HumanEval result, where Codestral’s 73.75 beat GPT-4-Turbo’s 72.05, is independent and therefore the more interesting data point.

The practical takeaway for a QA team: copy the shape of this eval, not the scores. If you are evaluating a code assistant, you need at least one execution-based generation test, one comprehension test like CruxEval, and one FIM or repository-level test that mirrors how your engineers actually use the tool. A single aggregate number, on any one benchmark, is not enough to decide whether generated code is safe to ship.
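
As a small sketch of what "copy the shape" could look like in a pipeline, the snippet below keeps scores grouped by capability, flags any capability with no benchmark at all, and reports the weakest score per group instead of a blended average. The category names, benchmark labels, and numbers are placeholders invented for the example, not measurements of any model.

```python
from typing import Dict, List

# Capabilities the paragraph above treats as the minimum set (names are illustrative).
REQUIRED_CATEGORIES = ("generation", "comprehension", "repository")

Results = Dict[str, Dict[str, float]]  # category -> {benchmark label: score in [0, 1]}


def coverage_gaps(results: Results) -> List[str]:
    """List required categories that have no benchmark results at all."""
    return [cat for cat in REQUIRED_CATEGORIES if not results.get(cat)]


def weakest_per_category(results: Results) -> Dict[str, float]:
    """Report the lowest score in each covered category rather than one average."""
    return {cat: min(scores.values()) for cat, scores in results.items() if scores}


# Placeholder scores for a hypothetical assistant under evaluation.
results: Results = {
    "generation": {"unit_test_pass_at_1": 0.80},
    "comprehension": {},
    "repository": {"fim_pass_at_1": 0.55, "next_line_em": 0.30},
}

print(coverage_gaps(results))         # ['comprehension']
print(weakest_per_category(results))  # {'generation': 0.8, 'repository': 0.3}
```

Reporting the minimum per category is one design choice among several; the point is that a missing or weak capability stays visible instead of being averaged away.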

Read the original: Codestral — Mistral AI, 2024-05-29.