Code Generation QA
Devstral 2: SWE-bench Verified Plus Cline Human Evals
How Mistral evaluated Devstral 2 and Devstral Small 2: 72.2% and 68.0% on SWE-bench Verified, plus Cline-scaffolded human win-rate comparisons.
How Mistral AI's Devstral Small 1.1 (53.6%) and Devstral Medium (61.6%) score on SWE-bench Verified, and what their no-test-time-scaling claim means for QA.
Devstral, the 24B agentic coding model from Mistral AI and All Hands AI, scored 46.8% on SWE-bench Verified under OpenHands, beating models many times its size.
How Mistral's Codestral 25.01 scores code generation: 86.6% HumanEval, 95.3% fill-in-the-middle pass@1, and a #1 Copilot Arena debut, explained.
How Mistral evaluates Codestral, its first 22B code model, across HumanEval, MBPP, CruxEval, RepoBench, Spider, and fill-in-the-middle benchmarks.