Devstral 2: SWE-bench Verified Plus Cline Human Evals

Devstral 2 (123B) scores 72.2% on SWE-bench Verified and Devstral Small 2 (24B) scores 68.0%, both with a 256K context window.
Mistral paired the automated benchmark with independent human evaluations run through the Cline agent scaffold, reporting a 42.8% win rate and 28.6% loss rate for Devstral 2 against DeepSeek V3.2.
The launch ships the open-source Mistral Vibe CLI (Apache 2.0), a terminal coding agent powered by Devstral.

Overview

On 9 December 2025 Mistral released two coding models, Devstral 2 and Devstral Small 2, alongside an open-source command-line agent called Mistral Vibe CLI. The announcement leads with SWE-bench Verified numbers, which is standard for a coding launch. The more interesting part for anyone building evals is the second measurement Mistral ran: human reviewers scoring real tasks through a live agent harness.

That pairing, one automated execution benchmark plus one human-judged head-to-head, is the structure worth studying. It is a practical template for how to assess generated code when a single leaderboard number is not enough.

What Mistral shipped

Devstral 2 is a 123B-parameter model under a modified MIT license. Devstral Small 2 is 24B parameters under Apache 2.0 and accepts image inputs, which lets it act inside multimodal agents that read screenshots or diagrams. Both carry a 256K context window, enough to hold a meaningful slice of a repository plus tool output across a multi-turn session.

Mistral frames the size story around efficiency. It puts Devstral 2 at roughly 5x smaller than DeepSeek V3.2 and 8x smaller than Kimi K2, with Devstral Small 2 at 28x and 41x smaller respectively. The pitch is that you can hit competitive coding scores without a frontier-scale model, which matters for self-hosting cost and latency.

How the two evaluations work

The launch reports quality through two different lenses, and they answer different questions.

SWE-bench Verified: does the patch pass the tests

SWE-bench Verified is a human-filtered set of real GitHub issues, each paired with a hidden test suite. The model reads the issue and the repository, produces a patch, and the patch is graded by running the project’s tests. There is no judge model and no similarity scoring. The code either makes the failing tests pass without breaking the passing ones, or it does not. Devstral 2 resolves 72.2% of those tasks; Devstral Small 2 resolves 68.0%. Because the grading is execution-based, the number is hard to game with output that merely looks plausible.
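
The pass/fail rule can be sketched in a few lines. This is a minimal illustration, not the official SWE-bench harness: the names `is_resolved` and `resolve_rate` are hypothetical, and it assumes test outcomes have already been collected as a mapping from test ID to pass or fail. The two groups of tests correspond to what SWE-bench calls FAIL_TO_PASS (tests the patch must fix) and PASS_TO_PASS (tests the patch must not break).

```python
from typing import Mapping, Sequence


def is_resolved(
    results: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Return True only if the patch fixes the target tests without regressions.

    results maps a test ID to True (passed) or False (failed) after the patch
    is applied. A test missing from results is counted as a failure.
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)
    unbroken = all(results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


def resolve_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks resolved across a benchmark run."""
    if not outcomes:
        raise ValueError("no tasks to score")
    return 100.0 * sum(outcomes) / len(outcomes)
```

A patch that fixes the target test but breaks a previously passing one scores as unresolved, which is why a headline resolve rate cannot be inflated by plausible-looking but regressive output.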

Cline-scaffolded human evaluation: which output a developer prefers

SWE-bench tells you whether a patch is correct against a fixed test suite. It does not tell you whether a developer would actually want to merge it. To get at that, Mistral had an independent annotation provider run coding tasks through Cline, an open-source agent scaffold, and had humans compare outputs head-to-head. Against DeepSeek V3.2, Devstral 2 won 42.8% of comparisons and lost 28.6%; the remaining 28.6% were presumably ties. Mistral also ran the comparison against Claude Sonnet 4.5 and reports Devstral 2 as up to 7x more cost-efficient than Claude Sonnet at real-world tasks.
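
The win and loss figures imply two derived numbers worth computing before comparing them with other reports: the share of comparisons with no winner, and the win rate among comparisons that did have one. The sketch below is plain arithmetic on the two published percentages; the function names are hypothetical, and treating the unreported remainder as ties is an assumption, not something the announcement states.

```python
def tie_rate(win_pct: float, loss_pct: float) -> float:
    """Share of comparisons that were neither a win nor a loss."""
    return 100.0 - win_pct - loss_pct


def decisive_win_rate(win_pct: float, loss_pct: float) -> float:
    """Win rate among comparisons that had a winner (remainder excluded)."""
    decided = win_pct + loss_pct
    if decided <= 0:
        raise ValueError("no decisive comparisons")
    return 100.0 * win_pct / decided


ties = tie_rate(42.8, 28.6)
decisive = decisive_win_rate(42.8, 28.6)
print(f"remainder: {ties:.1f}%  wins among decided: {decisive:.1f}%")
# prints: remainder: 28.6%  wins among decided: 59.9%
```

Reports that drop ties would show this result as roughly a 60/40 split, so check which convention a source uses before lining up win rates side by side.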

The key design choice is the scaffold. Running both models inside the same agent harness, with the same tools and task framing, isolates the model as the variable. A win rate captures qualities a pass/fail test misses: readability, whether the change is scoped sensibly, whether the agent flailed before landing the fix.
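
Holding the scaffold constant is one half of a fair comparison; the other is keeping reviewers from knowing which model produced which output. A minimal sketch of blinded, order-randomized pairing follows. Nothing here reflects how Mistral's annotation provider actually presented outputs, which the announcement does not describe; `BlindedPair`, `blind_pair`, and `unblind` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BlindedPair:
    task_id: str
    left: str         # output shown in the first position
    right: str        # output shown in the second position
    left_model: str   # kept out of the annotator's view
    right_model: str  # kept out of the annotator's view


def blind_pair(task_id: str, outputs: Mapping[str, str], rng: random.Random) -> BlindedPair:
    """Shuffle which model appears first so position cannot leak identity."""
    if len(outputs) != 2:
        raise ValueError("expected outputs from exactly two models")
    first, second = rng.sample(sorted(outputs), 2)
    return BlindedPair(task_id, outputs[first], outputs[second], first, second)


def unblind(pair: BlindedPair, choice: str) -> str:
    """Map an annotator's 'left' / 'right' / 'tie' verdict back to a model name."""
    if choice == "tie":
        return "tie"
    if choice == "left":
        return pair.left_model
    if choice == "right":
        return pair.right_model
    raise ValueError(f"unknown choice: {choice!r}")
```

Passing a seeded `random.Random` makes the presentation order reproducible, which helps when a disputed verdict needs to be re-examined later.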

Mistral Vibe CLI as the delivery surface

Vibe CLI is an Apache 2.0 terminal agent powered by Devstral. It handles file edits, code search, multi-file changes with architecture-level reasoning, version control, and command execution, and it speaks the Agent Communication Protocol so editors can drive it. For an eval team the CLI is more than a product feature. It is the same kind of harness the human evaluation relied on, so the conditions Mistral measured under are reproducible on your own tasks.

Why it matters for code-generation QA

The reusable lesson is the two-signal structure, not the headline percentages. SWE-bench Verified answers a binary question with no ambiguity: did the tests pass. The Cline-based human eval answers a softer but equally important one: given two correct-ish solutions, which one does a person prefer. Teams that report only the first are blind to quality differences that never show up as a failing test, and teams that report only the second are trusting human taste over verified correctness. You want both.

The caveats are worth stating plainly. A 42.8% win rate against 28.6% losses is a real edge, but pairwise preference rates depend heavily on the task mix and the annotator instructions, both of which are set on the vendor side and cannot be audited from the outside. SWE-bench Verified remains Python-centric and is drawn from public repositories that may overlap with training data, so a 72.2% score there is a proxy for performance on your private codebase, not a guarantee. And “up to 7x more cost-efficient” is a best-case framing; the realized number depends on your token volume and task length.

The practical move is to copy the method, not the scores. Run the model inside an agent scaffold you control, grade the binary outcome with your own test suite, then add a structured human or model-judged preference pass for the cases where multiple solutions compile and pass. Use a launch like this to decide what to trial. Use your own two-signal harness to decide what to ship.
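
A two-signal harness of this shape can be sketched as follows. It is an outline under stated assumptions, not a description of Mistral's pipeline: `Candidate`, `two_signal_eval`, and the two callbacks are hypothetical, with `passes_tests` standing in for your own test runner and `prefer` standing in for a human or model judge that returns a model name or "tie".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Candidate:
    model: str
    task_id: str
    patch: str


def two_signal_eval(
    pairs: Iterable[Tuple[Candidate, Candidate]],
    passes_tests: Callable[[Candidate], bool],
    prefer: Callable[[Candidate, Candidate], str],
) -> Tuple[Counter, Counter]:
    """Score candidate pairs on correctness first, preference second.

    Returns (passed, preferred):
      passed[model]    -> number of tasks where that model's patch passed
      preferred[label] -> preference verdicts, counted only where both passed
    """
    passed: Counter = Counter()
    preferred: Counter = Counter()
    for a, b in pairs:
        a_ok = bool(passes_tests(a))
        b_ok = bool(passes_tests(b))
        passed[a.model] += int(a_ok)
        passed[b.model] += int(b_ok)
        # Preference is only asked when both patches are verified correct,
        # so taste never overrides a failing test.
        if a_ok and b_ok:
            preferred[prefer(a, b)] += 1
    return passed, preferred
```

Gating the preference pass on both patches passing keeps the two signals separate: the first counter answers whether the code works, and the second answers which working solution people would rather merge.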

Read the original: Introducing: Devstral 2 and Mistral Vibe CLI — Mistral AI, 2025-12-09.
