Code Generation QA

CodeElo: How Qwen Rates LLM Coding With CodeForces Elo

CodeElo, the Qwen benchmark, submits LLM-generated code to live CodeForces to produce Elo ratings comparable to those of human contestants. See how it grades 387 problems and what o1-mini scored.

January 2, 2025 · 4 min read

GitHub’s RCT on Copilot Code Quality: What the Data Shows

GitHub ran a randomized controlled trial of Copilot: 202 developers, 1,293 blind reviews. Copilot users were 53.2% more likely to pass all 10 unit tests.

November 18, 2024 · 4 min read

Qwen2.5-Coder Benchmarks: How Alibaba Tests AI Code

A practitioner walkthrough of how Alibaba's Qwen2.5-Coder is evaluated: EvalPlus, Aider repair, McEval across 40+ languages, Code Arena, and fill-in-the-middle.

November 12, 2024 · 4 min read

How Anthropic Evaluates Claude on SWE-bench Verified

How Anthropic measured Claude 3.5 Sonnet at 49% on SWE-bench Verified using a minimal Bash and Edit agent, graded by real repo unit tests.

October 30, 2024 · 4 min read

MLE-bench: How OpenAI Grades AI Agents on Real Kaggle ML

OpenAI's MLE-bench scores AI agents on 75 real Kaggle competitions against human medal thresholds. How it works and what it means for code-gen QA.

October 9, 2024 · 4 min read

Qwen2.5-Coder: How Alibaba Benchmarks AI Code

How Alibaba's Qwen2.5-Coder report evaluates AI-generated code across HumanEval, EvalPlus, LiveCodeBench and more, and what it means for code QA.

September 18, 2024 · 5 min read

SWE-bench Verified: OpenAI’s Cleaned-Up Coding Benchmark

How OpenAI used 93 developers to build SWE-bench Verified, a 500-task coding benchmark that filters out the broken tests and underspecified issues that were distorting AI agent scores.

August 13, 2024 · 4 min read

How Meta’s CyberSecEval 3 Tests LLM Security Code

How Meta's CyberSecEval 3 benchmark evaluates LLM security: 31% of code completions flagged vulnerable by CodeShield, plus offensive uplift testing on Llama 3.

August 2, 2024 · 4 min read

BigCodeBench: Testing How LLMs Handle Real Library Calls

How BigCodeBench from Hugging Face / BigCode tests LLMs on 1,140 tasks across 139 libraries, and why the gap between the best models' 60% and the 97% human score matters for code QA.

June 22, 2024 · 4 min read

BigCodeBench Explained: A Realistic Code-Gen Benchmark

How Hugging Face's BigCodeBench tests AI on 1,140 real Python tasks across 139 libraries, why GPT-4o hit 61.1%, and what it means for code QA.

June 18, 2024 · 4 min read