CodeElo: How Qwen Rates LLM Coding With CodeForces Elo
CodeElo, the Qwen benchmark, submits LLM code to live CodeForces for human-comparable Elo. See how it grades 387 problems and what o1-mini scored.
Testing, evaluating, and validating AI-generated code and code-generation models.
GitHub ran a randomized controlled trial of Copilot: 202 developers, 1,293 blind reviews. Copilot users were 53.2% more likely to pass all 10 unit tests.
A practitioner walkthrough of how Alibaba's Qwen2.5-Coder is evaluated: EvalPlus, Aider repair, McEval across 40+ languages, Code Arena, and fill-in-the-middle.
How Anthropic measured Claude 3.5 Sonnet at 49% on SWE-bench Verified using a minimal Bash and Edit agent, graded by real repo unit tests.
OpenAI's MLE-bench scores AI agents on 75 real Kaggle competitions against human medal thresholds. How it works and what it means for code-gen QA.
How Alibaba's Qwen2.5-Coder report evaluates AI-generated code across HumanEval, EvalPlus, LiveCodeBench and more, and what it means for code QA.
How OpenAI used 93 developers to build SWE-bench Verified, a 500-task coding benchmark that fixes the broken tests inflating AI agent scores.
How Meta's CyberSecEval 3 benchmark evaluates LLM security: 31% of code completions flagged vulnerable by CodeShield, plus offensive uplift testing on Llama 3.
How BigCodeBench from Hugging Face / BigCode tests LLMs on 1,140 tasks across 139 libraries, and why the gap between the best models' roughly 60% and human performance of 97% matters for code QA.
How Hugging Face's BigCodeBench tests AI on 1,140 real Python tasks across 139 libraries, why GPT-4o hit 61.1%, and what it means for code QA.