CodeElo: How Qwen Rates LLM Coding With CodeForces Elo

  • CodeElo grades LLM coding by submitting answers to live CodeForces and reporting a human-comparable Elo, built from 387 problems across 54 contests held May to November 2024.
  • o1-mini topped the field at 1578 Elo (89.2 percentile); QwQ-32B-Preview led open-source models at 1261 (63.6 percentile), while most of the 33 models tested landed in the bottom 25 percent of human competitors.
  • Submitting to the real judge removes false positives that plague LiveCodeBench and USACO, where private tests, special judges, and execution-environment mismatches let wrong code slip through.

Overview

CodeElo is a benchmark from the Qwen team at Alibaba, released January 2, 2025, that measures competition-level code generation by doing what a human contestant does: it submits the model’s solution to CodeForces and reads back the verdict. From those verdicts it computes an Elo rating you can compare directly against the platform’s human rating pool, plus a percentile.

The point is honesty in grading. Earlier competitive-coding benchmarks rebuilt the judging environment offline, and the rebuild introduced grading errors: incomplete test sets, missing checkers, and mismatched runtimes. CodeElo sidesteps that by never leaving the original judge.

How the benchmark grades generated code

The dataset is 387 problems pulled from 54 recent CodeForces contests, each carrying its real metadata: contest division, difficulty rating, and algorithm tags. Problems are bucketed by difficulty into Easy ([800, 1000)), Medium ([1000, 1300)), and Hard ([1300, 3500]). Because every problem keeps its tags, you can read performance per category instead of one blurred average.
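The bucketing and per-tag breakdown described above can be sketched in a few lines of Python. This is an illustration only, not CodeElo's code: the function names and the (tags, passed) record shape are assumptions made for the example, and only the three rating ranges come from the text.

```python
from collections import defaultdict


def difficulty_bucket(rating: int) -> str:
    """Map a CodeForces problem rating to Easy / Medium / Hard.

    Ranges follow the article: [800, 1000), [1000, 1300), [1300, 3500].
    """
    if not 800 <= rating <= 3500:
        raise ValueError(f"rating {rating} is outside [800, 3500]")
    if rating < 1000:
        return "Easy"
    if rating < 1300:
        return "Medium"
    return "Hard"


def pass_rate_by_tag(results):
    """Aggregate pass rate per algorithm tag.

    `results` is an iterable of (tags, passed) pairs. That record shape is
    hypothetical, chosen for this sketch. A problem with several tags
    counts once toward each of them.
    """
    totals = defaultdict(lambda: [0, 0])  # tag -> [solved, attempted]
    for tags, passed in results:
        for tag in tags:
            totals[tag][0] += int(passed)
            totals[tag][1] += 1
    return {tag: solved / seen for tag, (solved, seen) in totals.items()}
```

Reporting per tag rather than as one average is what lets a weak category such as dynamic programming show up instead of being hidden by easier ones.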

When a model produces a solution, CodeElo submits it to CodeForces and lets the platform’s own judge decide. That judge runs the full hidden test set, applies the problem’s time and memory limits, and invokes special checkers where a problem accepts multiple valid outputs. No model judges another model. No offline harness approximates the grading. The verdict is the same one a human would see.

Why submitting to the real judge matters

Three failure modes motivated the design, and each maps to a concrete bug in prior work. First, private test cases: benchmarks like LiveCodeBench can only test against the public examples plus whatever cases they manage to scrape, so a solution that fails a hidden edge case still gets marked correct. Second, special judges: some problems have many acceptable answers and need a custom checker; a naive string-match grader rejects correct outputs or accepts wrong ones. Third, execution-environment misalignment: USACO-style offline setups run code under compiler versions, time limits, and language settings that differ from the contest, so timing and behavior drift. Routing every submission through CodeForces eliminates all three at once.
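To make the special-judge failure concrete, here is a minimal sketch built on an invented toy problem: "given n, print any two positive integers that sum to n". The problem, the function names, and the checker are all hypothetical and are not taken from CodeForces or CodeElo. The point is only that an exact-match grader and a checker can disagree on the same output.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Naive grader: compare whitespace-separated tokens to one reference answer."""
    return expected.split() == actual.split()


def check_pair_sum(n: int, actual: str) -> bool:
    """Hypothetical special checker for the toy problem.

    Accepts any output made of exactly two positive integers summing to n.
    """
    parts = actual.split()
    if len(parts) != 2:
        return False
    try:
        a, b = int(parts[0]), int(parts[1])
    except ValueError:
        return False
    return a > 0 and b > 0 and a + b == n


# For n = 7 the reference answer might be "1 6", but "3 4" is equally valid.
reference, submission = "1 6", "3 4"
print(exact_match(reference, submission))  # naive grader rejects it
print(check_pair_sum(7, submission))       # checker accepts it
```

An offline harness that lacks the problem's real checker has to fall back on something like exact_match, which is where the mis-grading comes from.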

Computing a human-comparable Elo

The rating is not a pass rate dressed up. CodeElo uses an Elo formulation aligned with the CodeForces standard so the resulting number sits on the same scale humans are rated on, while cutting the variance that comes from a single contest’s luck. That is what makes a claim like “89.2 percentile” meaningful: it is a percentile against actual rated competitors, not against other models.
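One way to picture a rank-based Elo estimate is the sketch below. It assumes the standard Elo expected-score form: a player rated r is expected to be beaten by an opponent rated r_i with probability 1 / (1 + 10^((r - r_i) / 400)). Summing that over all human participants gives an expected rank, and the code searches for the r whose expected rank equals the rank the model actually achieved. The function names, the search bounds, and the simple averaging across contests are assumptions for illustration; the paper's exact procedure may differ.

```python
def expected_rank(r: float, human_ratings: list[float]) -> float:
    """Expected number of participants finishing ahead of a player rated r."""
    return sum(1.0 / (1.0 + 10 ** ((r - ri) / 400.0)) for ri in human_ratings)


def solve_rating(rank: float, human_ratings: list[float],
                 lo: float = -1000.0, hi: float = 5000.0,
                 iters: int = 60) -> float:
    """Bisect for the rating whose expected rank equals the observed rank.

    expected_rank is strictly decreasing in r, so bisection converges.
    The rank should lie strictly between 0 and len(human_ratings);
    outside that range the result just sticks to a search bound.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, human_ratings) > rank:
            lo = mid  # expected rank is worse than observed: rating is higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def average_rating(per_contest: list[tuple[float, list[float]]]) -> float:
    """Average the per-contest estimates (an assumed aggregation step)."""
    ratings = [solve_rating(rank, humans) for rank, humans in per_contest]
    return sum(ratings) / len(ratings)
```

As a sanity check on the formula: against ten participants all rated 1500, a rank of 5 solves to a rating of 1500, since each opponent then has a 50 percent chance of finishing ahead. Averaging over many contests is one simple way to damp the effect of a single lucky or unlucky contest.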

What the results showed

The team rated 33 models, 30 open-source and 3 proprietary. o1-mini was the clear leader at 1578 Elo, placing in the 89.2 percentile of human participants. Among open models, QwQ-32B-Preview reached 1261 Elo (63.6 percentile). The rest of the pack struggled badly, many failing even the easiest problems and landing in the lowest quarter of all human competitors. Competition-level coding remains out of reach for most general-purpose models.

Two findings cut against common assumptions. The best language for most models was C++, not Python, even though Python is what LLMs reach for by default and what most earlier benchmarks tested in. That gap suggests prior evaluations may have understated raw capability by locking models into a slower language with tighter time-limit risk. By algorithm category, models did well on math and straightforward implementation problems but fell apart on dynamic programming and tree problems, the categories that demand multi-step state reasoning rather than pattern recall.

Why it matters for code-generation QA

The transferable lesson is about your grader, not the leaderboard. CodeElo’s whole argument is that the environment you grade in is part of the test, and a slightly wrong environment produces confidently wrong scores. If your team evaluates generated code against a handful of example cases, or against a checker you wrote that does not match the real acceptance rule, you are measuring optimism. The private-test and special-judge problems CodeElo names are exactly the holes that make an internal eval read higher than reality.

The C++ versus Python result is a direct warning about evaluation defaults. Pin a model to one language and one runtime and you may be scoring the harness, not the model. Run the same problems in more than one language before you trust a capability number.

There are limits worth stating plainly. Submitting to a live platform means the benchmark depends on CodeForces staying available and consistent, and it cannot be run fully offline or air-gapped. Contest problems are also self-contained: clean specs, deterministic checkers, no legacy code, no ambiguous requirements. A strong Elo says a model reasons well over isolated algorithmic puzzles. It says little about whether that model can edit a large codebase, satisfy fuzzy product requirements, or avoid breaking a downstream service. Read CodeElo as a high-quality measure of one narrow, important skill, and pair it with repo-level and integration evals for the rest.

Read the original: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings — Alibaba Qwen, 2025-01-02.