BigCodeArena: How Execution Grounds Code-Gen Evaluation

  • BigCodeArena collected 14,000+ code-centric conversations and 4,700+ pairwise preference votes across 10 programming languages and 8 execution environments, with evaluators running the generated code before voting.
  • It introduces two benchmarks: BigCodeReward, which measures how well reward models agree with human judgments, and AutoCodeArena, an automatic Elo ranking that scores coding quality without human raters.
  • Among recent models, GPT-5, Claude-Sonnet-4, and Claude-Opus-4 ranked highest — and most LLM judges agreed with humans far more often once execution results were on the table.

Overview

BigCodeArena is an execution-grounded evaluation platform from the BigCode community, released on arXiv on 9 October 2025 with Terry Yue Zhuo as first author and more than 39 contributors. The core idea is simple and overdue: when people compare two pieces of model-generated code, let them actually run it first. BigCodeArena builds on the Chatbot Arena infrastructure that popularized head-to-head LLM voting and adds a live execution layer, so a human sees what the code does before deciding which response is better.

That matters because code is one of the few model outputs you can objectively verify. A snippet either compiles, runs, and produces the right behavior, or it doesn’t. Evaluating it by reading alone throws that signal away.

How the platform is designed

BigCodeArena pairs two anonymous model responses to a coding prompt and gives the evaluator a working execution surface. The platform supports 8 execution environment types and 10 programming languages, so a single session can cover anything from a Python data script to a front-end UI rendered in a browser. The human can interact with the running output — click through a generated interface, inspect a chart, trigger an error path — and only then cast a pairwise preference vote.

The data collection ran wide. The released corpus holds over 14,000 code-centric conversation sessions spanning 10 LLMs, from which the team processed 4,700+ multi-turn samples carrying human preference labels. Each vote is anchored to something the evaluator observed at runtime, not just to which answer reads more cleanly.
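To make the shape of one such vote concrete, here is a minimal sketch of a record that ties a pairwise preference to what was observed at runtime. Every field name below is an assumption made for illustration; the released dataset's actual schema is not described here and may differ.

```python
from dataclasses import dataclass

VALID_WINNERS = ("a", "b", "tie")


@dataclass(frozen=True)
class PreferenceVote:
    """One pairwise vote. All field names are hypothetical, not the dataset schema."""

    prompt: str        # the coding request shown to both models
    model_a: str       # identity of the first model (hidden from the voter)
    model_b: str       # identity of the second model (hidden from the voter)
    language: str      # e.g. "python"
    environment: str   # label for the execution environment that ran the code
    run_a_ok: bool     # did response A execute without error?
    run_b_ok: bool     # did response B execute without error?
    winner: str        # "a", "b", or "tie"

    def __post_init__(self) -> None:
        # Reject labels outside the three outcomes a pairwise vote can take.
        if self.winner not in VALID_WINNERS:
            raise ValueError(f"winner must be one of {VALID_WINNERS}")
```

Keeping the run outcome next to the vote is what lets later analysis ask whether a preference tracked runtime behavior or only how the answer read.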

Why execution changes the vote

Read-only code review rewards style. Execution-grounded review rewards correctness. The paper’s headline finding is that most LLM judges predict human coding preferences noticeably more accurately once execution results are available to them. That is a direct argument against the common practice of scoring generated code purely from the text of the response.
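The difference between the two judging conditions comes down to what the judge is shown. A rough sketch, assuming a hypothetical helper named build_judge_prompt and a prompt wording invented for this example (neither comes from the paper):

```python
from typing import Optional


def build_judge_prompt(
    task: str,
    response_a: str,
    response_b: str,
    exec_a: Optional[str] = None,
    exec_b: Optional[str] = None,
) -> str:
    """Assemble a judge input; execution output is included only when provided.

    Hypothetical helper: the section labels and final question are assumptions.
    """
    parts = [f"Task:\n{task}", f"Response A:\n{response_a}"]
    if exec_a is not None:
        parts.append(f"Execution result A:\n{exec_a}")
    parts.append(f"Response B:\n{response_b}")
    if exec_b is not None:
        parts.append(f"Execution result B:\n{exec_b}")
    parts.append("Which response is better? Answer with A, B, or tie.")
    return "\n\n".join(parts)
```

Calling it without exec_a and exec_b gives the text-only condition; passing captured run output gives the execution-grounded one, so the same judge can be compared under both.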

The two benchmarks: BigCodeReward and AutoCodeArena

BigCodeReward reuses the 4,700+ annotated multi-turn samples to test reward models. The question it answers is narrow and useful: how consistent is a given reward model with real human judgment on code? Reward models drive RLHF pipelines and automated grading, so a benchmark that grades the graders fills a real gap. The study compares judgments made with and without execution feedback, and judgments made with execution feedback agree with human preferences more often.
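One simple way to quantify "consistency with human judgment" is plain agreement rate: the fraction of samples where the judge's label matches the human's. The sketch below assumes labels are the strings "a", "b", or "tie"; the benchmark's actual metric may be defined differently.

```python
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of samples where the judge label equals the human label.

    Illustrative metric only; not necessarily the one BigCodeReward reports.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    if not human:
        raise ValueError("need at least one labeled sample")
    matches = sum(1 for h, j in zip(human, judge) if h == j)
    return matches / len(human)


def execution_gain(
    human: Sequence[str],
    judge_text_only: Sequence[str],
    judge_with_exec: Sequence[str],
) -> float:
    """Change in agreement when the same judge also sees execution results."""
    return agreement_rate(human, judge_with_exec) - agreement_rate(human, judge_text_only)
```

A positive execution_gain for a judge means it tracked human votes better once it could see run output, which is the pattern the study reports for most models.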

AutoCodeArena removes humans from the loop entirely. It produces an automatic Elo rating for coding quality, letting teams rank models without standing up a fresh annotation campaign every time a new release drops. In the reported results, proprietary frontier models lead — GPT-5, Claude-Sonnet-4, and Claude-Opus-4 sit at the top among the recent models evaluated.
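For readers unfamiliar with Elo, the sketch below shows the classic online update applied to a list of pairwise outcomes. It is an assumption that this resembles AutoCodeArena's procedure; the exact rating method is not specified here, and the K-factor of 32 and starting rating of 1000 are conventional defaults chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Score for model A under each outcome label.
SCORE_FOR_A = {"a": 1.0, "b": 0.0, "tie": 0.5}


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(
    battles: Iterable[Tuple[str, str, str]],
    k: float = 32.0,
    initial: float = 1000.0,
) -> Dict[str, float]:
    """Process (model_a, model_b, winner) tuples in order and return ratings."""
    ratings: Dict[str, float] = defaultdict(lambda: initial)
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (SCORE_FOR_A[winner] - e_a)
        ratings[model_a] += delta  # A gains what it outperformed expectation by
        ratings[model_b] -= delta  # B loses the same amount (zero-sum update)
    return dict(ratings)
```

Because the update is sequential, the result depends on battle order; arena-style leaderboards often mitigate this by shuffling and averaging or by fitting all outcomes jointly.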

Slicing the preferences

Because the dataset tags tasks, languages, and frameworks, the authors surface preference patterns that aggregate leaderboards hide. A model that wins overall may lose on a specific language or framework. That fine-grained breakdown is arguably the most practically valuable part of the release for anyone choosing a model for a concrete stack rather than a generic benchmark average.
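The slicing itself is simple to reproduce on any tagged vote log. A minimal sketch, assuming each vote is a (language, model_a, model_b, winner) tuple, a layout invented for this example, with ties counted as half a win:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Vote = Tuple[str, str, str, str]  # (language, model_a, model_b, winner); hypothetical layout


def win_rate_by_language(votes: Iterable[Vote], model: str) -> Dict[str, float]:
    """Per-language win rate for one model; a tie counts as 0.5 of a win."""
    wins: Dict[str, float] = defaultdict(float)
    games: Dict[str, int] = defaultdict(int)
    for language, model_a, model_b, winner in votes:
        if model not in (model_a, model_b):
            continue  # this battle did not involve the model of interest
        games[language] += 1
        if winner == "tie":
            wins[language] += 0.5
        elif (winner == "a" and model_a == model) or (winner == "b" and model_b == model):
            wins[language] += 1.0
    return {language: wins[language] / games[language] for language in games}
```

Comparing the per-language numbers against a model's overall win rate is a quick way to spot a stack where the leaderboard average is misleading.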

Why it matters for code-generation QA

For teams evaluating AI-generated code, the lesson is to stop trusting eyeball review and stop trusting text-only LLM judges. If your evaluation harness scores a model’s output without executing it, you are measuring plausibility, not correctness — and BigCodeArena’s own numbers show that gap is large enough to flip judgments. Any serious QA pipeline for codegen should run the artifact, in a real environment, before assigning a score.
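As a starting point, "run the artifact before scoring" can be as small as the sketch below, which executes a Python snippet in a child process and captures the result. The function name and the returned dictionary are assumptions for illustration. A plain subprocess is not a security sandbox, so real pipelines should add container or VM isolation before running untrusted model output.

```python
import os
import subprocess
import sys
import tempfile
from typing import Any, Dict


def run_python_snippet(code: str, timeout_s: float = 10.0) -> Dict[str, Any]:
    """Run a snippet in a child interpreter and report what happened.

    Illustrative only: this isolates crashes and hangs, NOT malicious code.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "snippet.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,  # keep any files the snippet writes inside the temp dir
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout", "returncode": None}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }
```

The captured stdout, stderr, and exit status are the kind of evidence a human or an LLM judge can then weigh alongside the source text.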

There are caveats worth holding onto. Pairwise human preference is still subjective; it captures which output a person prefers, not whether the code is secure, performant, or maintainable over time. Execution proves a snippet runs in a sandbox, not that it survives production edge cases or adversarial input. And AutoCodeArena’s automatic Elo, convenient as it is, inherits whatever blind spots its judge model carries. Treat it as a fast triage signal, not a final verdict.

Where this fits the broader picture: static benchmarks like HumanEval and MBPP measure pass/fail on fixed test sets, and arena-style human voting measures preference. BigCodeArena’s contribution is welding the two together — preference data grounded in actual execution, plus tooling to automate the ranking. For a QA program, that argues for layering checks: deterministic test suites for correctness, execution-grounded preference or reward scoring for quality, and human review reserved for the cases automation can’t settle. Each catches what the others miss.
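One way to picture that layering is a small routing function: deterministic tests gate first, an automated preference or reward margin settles clear cases, and everything else goes to a person. The function name, the outcome labels, and the 0.2 threshold are all assumptions made for this sketch, not values from the paper.

```python
def triage(tests_passed: bool, reward_margin: float, margin_threshold: float = 0.2) -> str:
    """Route one candidate through layered checks.

    reward_margin is a hypothetical signed score gap between two candidates
    from an execution-grounded reward model; the threshold is illustrative.
    """
    if not tests_passed:
        return "reject"  # deterministic correctness check failed
    if abs(reward_margin) >= margin_threshold:
        return "auto_rank"  # automated scoring is decisive enough to use directly
    return "human_review"  # too close to call automatically
```

Tuning the threshold trades annotation cost against the risk of trusting a judge in exactly the close calls where its blind spots matter most.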

Read the original: BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution — Hugging Face / BigCode, 2025-10-09.