How Container Limits Skew Anthropic Coding Evals
Anthropic found container resource limits alone swing Terminal-Bench 2.0 scores by 6 points. What that means for trusting AI coding eval leaderboards.
Testing, evaluating, and validating AI-generated code and code-generation models.
How Mistral evaluated Devstral 2 and Devstral Small 2: 72.2% and 68.0% on SWE-bench Verified, plus Cline-scaffolded human win-rate comparisons.
NVIDIA's ComputeEval 2025.2 grew to 232 CUDA and CCCL problems. See the pass@1 leaderboard, why every model dropped, and what it means for code QA.
How BigCodeArena (Hugging Face/BigCode) collects 4,700+ execution-grounded human votes and adds the BigCodeReward and AutoCodeArena benchmarks for code eval.
How Hugging Face BigCode's BigCodeArena ranks code-generation models by executing their output, plus the BigCodeReward and AutoCodeArena benchmarks.
How Alibaba Qwen3-Coder is trained and evaluated: execution-driven Code RL plus Agent RL across 20,000 parallel sandboxes, and what it means for code QA.
How Mistral AI's Devstral Small 1.1 (53.6%) and Devstral Medium (61.6%) score on SWE-bench Verified, and what their no-test-time-scaling claim means for QA.
Devstral, the 24B agentic coder from Mistral AI and All Hands AI, scored 46.8% on SWE-bench Verified under OpenHands, beating models many times its size.
How NVIDIA's ComputeEval scores LLM-generated CUDA on 128 handcrafted problems with sandboxed pass@1 tests, and what the baseline model results reveal.
How NVIDIA's OpenCodeReasoning hit 61.8% on LiveCodeBench with SFT distillation alone, and why filtering training data for correct code hurt accuracy.