Code Generation QA

How Container Limits Skew Anthropic Coding Evals

Anthropic found that container resource limits alone can swing Terminal-Bench 2.0 scores by 6 points. What that means for trusting AI coding eval leaderboards.

February 5, 2026 4 min read

Code Generation QA

Devstral 2: SWE-bench Verified Plus Cline Human Evals

How Mistral evaluated Devstral 2 and Devstral Small 2: 72.2% and 68.0% on SWE-bench Verified, plus Cline-scaffolded human win-rate comparisons.

December 9, 2025 4 min read

Code Generation QA

ComputeEval 2025.2: How NVIDIA Tests LLM CUDA Code

NVIDIA's ComputeEval 2025.2 grew to 232 CUDA and CCCL problems. See the pass@1 leaderboard, why every model dropped, and what it means for code QA.

November 7, 2025 4 min read

Code Generation QA

BigCodeArena: How Execution Grounds Code-Gen Evaluation

How BigCodeArena (Hugging Face/BigCode) collects 4,700+ execution-grounded human votes and adds the BigCodeReward and AutoCodeArena benchmarks for code eval.

October 9, 2025 4 min read

Code Generation QA

BigCodeArena: How Hugging Face Evaluates AI Code by Running It

How Hugging Face BigCode's BigCodeArena ranks code-generation models by executing their output, plus the BigCodeReward and AutoCodeArena benchmarks.

October 7, 2025 4 min read

Code Generation QA

Qwen3-Coder Explained: Agent RL and SWE-bench Evaluation

How Alibaba's Qwen3-Coder is trained and evaluated: execution-driven Code RL plus Agent RL across 20,000 parallel sandboxes, and what it means for code QA.

July 22, 2025 4 min read

Code Generation QA

Devstral on SWE-bench: Open Code Agents That Actually Pass

How Mistral AI's Devstral Small 1.1 (53.6%) and Devstral Medium (61.6%) score on SWE-bench Verified, and what their no-test-time-scaling claim means for QA.

July 10, 2025 4 min read

Code Generation QA

Devstral: Mistral’s Agentic Coder on SWE-bench Verified

Devstral, Mistral AI and All Hands AI's 24B agentic coder, scored 46.8% on SWE-bench Verified under OpenHands, beating models many times its size.

May 21, 2025 4 min read

Code Generation QA

ComputeEval: NVIDIA’s CUDA Benchmark for LLM Code

How NVIDIA's ComputeEval scores LLM-generated CUDA on 128 handcrafted problems with sandboxed pass@1 tests, and what the baseline model results reveal.

April 16, 2025 4 min read

Code Generation QA

NVIDIA OpenCodeReasoning: SFT Data Distillation for Code

How NVIDIA's OpenCodeReasoning hit 61.8% on LiveCodeBench with SFT distillation alone, and why filtering training data for correct code hurt accuracy.

April 2, 2025 4 min read