Alibaba Qwen | Gen.QA

Code Generation QA

Qwen3-Coder Explained: Agent RL and SWE-Bench Evaluation

How Alibaba's Qwen3-Coder is trained and evaluated: execution-driven Code RL plus Agent RL across 20,000 parallel sandboxes, and what it means for code QA.

July 22, 2025 | 4 min read

Code Generation QA

CodeElo: How Qwen Rates LLM Coding With CodeForces Elo

CodeElo, the Qwen benchmark, submits LLM-generated code to the live CodeForces platform to produce human-comparable Elo ratings. See how it grades 387 problems and what o1-mini scored.

January 2, 2025 | 4 min read

Code Generation QA

Qwen2.5-Coder Benchmarks: How Alibaba Tests AI Code

A practitioner walkthrough of how Alibaba's Qwen2.5-Coder is evaluated: EvalPlus, Aider repair, McEval across 40+ languages, Code Arena, and fill-in-the-middle.

November 12, 2024 | 4 min read

Code Generation QA

Qwen2.5-Coder: How Alibaba Benchmarks AI Code

How Alibaba's Qwen2.5-Coder report evaluates AI-generated code across HumanEval, EvalPlus, LiveCodeBench and more, and what it means for code QA.

September 18, 2024 | 5 min read

Code Generation QA

CodeQwen1.5 Explained: How Alibaba Benchmarks Code AI

How Alibaba evaluated CodeQwen1.5-7B across HumanEval, MBPP, LiveCodeBench, SWE-Bench and more, plus the LeetCode contamination caveat it acknowledges.

April 16, 2024 | 4 min read