BigCodeArena: How Hugging Face Evaluates AI Code by Running It

BigCodeArena evaluates code-generation models by running their output in real sandboxes, not by reading source. Over five months it gathered 14,000+ conversations and 4,700+ preference votes across 10 frontier LLMs.
Rankings use the Bradley-Terry (Elo-style) model with 100 bootstrap resamples for 95% confidence intervals. Reasoning models like o3-mini and o1-mini topped the leaderboard, with Claude-3.5-Sonnet close behind.
Two companion benchmarks shipped alongside it: BigCodeReward (does seeing execution output make an LLM judge more accurate?) and AutoCodeArena, a 600-prompt automated arena judged by Claude-3.7-Sonnet.

Overview

BigCodeArena is a human-preference evaluation platform from the Hugging Face BigCode project that judges generated code by its behavior, not its appearance. A model writes code, the platform executes it in an isolated sandbox, and a human looks at the actual output before voting. That single design choice separates it from most leaderboards, which score code on text similarity or unit-test pass rates without a person ever seeing the thing run.

The October 2025 writeup reports what five months of community use produced and introduces two derived benchmarks that try to automate the same execution-aware judgment. For anyone building QA around AI-generated code, the methodology is the interesting part.

How execution-based evaluation works

When a user submits a prompt, two anonymous models each produce a response. Instead of showing raw code side by side, the platform compiles and runs both in sandboxed environments and renders the result: a working web page, a running game, a rendered chart, a diagram. The user can interact with the output, edit the code, and re-run it before deciding which side wins.
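A minimal Python sketch of the execute-then-inspect step, assuming a plain child process with a timeout stands in for the platform's sandbox. The function name run_snippet and its return shape are hypothetical, and a child process is not a security boundary; the sketch only shows output being captured for a reviewer rather than source being read.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_snippet(code: str, timeout_s: float = 5.0) -> dict:
    """Run Python source in a child process and capture what it prints.

    Hypothetical helper: a timeout-limited child process is NOT a real
    sandbox. It only illustrates "run the code, then look at the output".
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,            # keep stray files inside the temp dir
                capture_output=True,    # collect stdout and stderr
                text=True,
                timeout=timeout_s,      # stop runaway programs
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

A reviewer, human or model, would then be shown the returned stdout and stderr for both responses before voting.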

The execution layer is broad. It supports 10 languages (Python, JavaScript, TypeScript, HTML, C, C++, Java, Go, Rust, and Markdown) across 8 runtime environments, among them React, Vue, Streamlit, Gradio, PyGame, and Mermaid. That coverage matters because the workloads people actually brought were visual and interactive.

What people used it for

The collected traffic skewed heavily toward things you have to see to judge. Web design accounted for 36% of conversations, problem solving 23%, game development 16%, scientific computing 14%, and creative coding 8%. Most sessions were short: 76% ran to just two turns. The mean conversation still reached 4.12 messages, so a minority of sessions ran long enough to include the debugging loops that real development needs.

Turning votes into rankings

Raw preference votes do not give you a ranking on their own. BigCodeArena fits a Bradley-Terry model to the 4,700+ head-to-head votes, estimating a strength score for each model from which the probability that it beats any other model follows. To keep the numbers honest, it runs 100 bootstrap resamples and reports 95% confidence intervals rather than a single point estimate.
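The fitting step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's code: it uses the classic iterative (minorization-maximization) update for Bradley-Terry, ignores ties, and takes nearest-rank percentiles for the interval. The function names and the toy votes are invented for the example.

```python
import random
from collections import Counter


def fit_bradley_terry(votes, iters=200):
    """Fit strengths from (winner, loser) pairs; strengths are scaled to sum to 1."""
    models = sorted({m for pair in votes for m in pair})
    wins = Counter(w for w, _ in votes)
    pair_games = Counter()  # symmetric count of comparisons per pair
    for w, l in votes:
        pair_games[(w, l)] += 1
        pair_games[(l, w)] += 1

    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                pair_games[(i, j)] / (p[i] + p[j])
                for j in models
                if j != i and pair_games[(i, j)]
            )
            new_p[i] = wins[i] / denom  # a model with zero wins gets strength 0
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p


def win_probability(p, a, b):
    """Bradley-Terry probability that model a beats model b."""
    return p[a] / (p[a] + p[b])


def bootstrap_intervals(votes, n_resamples=100, seed=0):
    """Approximate 95% intervals on strength by resampling votes with replacement."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_resamples):
        resample = rng.choices(votes, k=len(votes))
        for model, strength in fit_bradley_terry(resample).items():
            samples.setdefault(model, []).append(strength)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        lo = values[int(0.025 * (len(values) - 1))]  # nearest-rank percentile
        hi = values[int(0.975 * (len(values) - 1))]
        intervals[model] = (lo, hi)
    return intervals


# Toy data, invented for illustration: (winner, loser) pairs.
toy_votes = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
strengths = fit_bradley_terry(toy_votes)
intervals = bootstrap_intervals(toy_votes, n_resamples=100)
```

With only two models and a 7-3 record, the fitted win probability is exactly 0.7, which is a quick sanity check on the update. Overlapping intervals between two models are the signal that the votes do not yet separate them.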

It also slices the data three ways: all comparisons pooled, comparisons matched by runtime environment, and comparisons matched by language. The matched views control for the fact that some models are simply asked harder questions, which is exactly the confound that makes naive leaderboards misleading.
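A small Python sketch of why slicing matters, assuming each vote is stored as a record with winner, loser, and environment fields. The field names, helper names, and data are all invented for illustration; the platform's matched views may be computed differently.

```python
from collections import defaultdict


def slice_votes(votes, key):
    """Group vote records by a field such as "environment" or "language"."""
    groups = defaultdict(list)
    for vote in votes:
        groups[vote[key]].append(vote)
    return dict(groups)


def win_rate(votes, model):
    """Share of a model's comparisons that it won, or None if it never played."""
    played = [v for v in votes if model in (v["winner"], v["loser"])]
    if not played:
        return None
    return sum(v["winner"] == model for v in played) / len(played)


# Invented records: model X does well on React prompts, evenly on PyGame.
votes = (
    [{"winner": "X", "loser": "Y", "environment": "React"}] * 3
    + [{"winner": "Y", "loser": "X", "environment": "React"}] * 1
    + [{"winner": "X", "loser": "Y", "environment": "PyGame"}] * 1
    + [{"winner": "Y", "loser": "X", "environment": "PyGame"}] * 1
)

pooled = win_rate(votes, "X")
by_environment = {
    env: win_rate(group, "X")
    for env, group in slice_votes(votes, "environment").items()
}
```

Here the pooled rate hides the fact that X's advantage comes entirely from one environment, which is the kind of difference the matched views are meant to surface.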

On results, reasoning-tuned models led. o3-mini and o1-mini held the top tier consistently across environments. Claude-3.5-Sonnet was strong, especially in the language-controlled comparisons. GPT-4o, o1, and the Gemini-2.0 family sat in a competitive middle, and open-weight models like Qwen2.5 and Llama-3.3-70B trailed the proprietary leaders.

BigCodeReward and AutoCodeArena

Human voting does not scale, so the team released two automated companions built from the same data.

BigCodeReward

This benchmark tests whether an LLM can act as a reward model for code, and whether showing it the execution output changes its accuracy. The answer is yes, and the lift is consistent. With execution results visible, GPT-4o’s judging accuracy rose from 54.6% to 63.8%, Claude-Sonnet-4 went from 56.7% to 62.3%, and Qwen2.5-VL-72B climbed from 58.7% to 66.2%. Letting the judge watch the code run is worth roughly six to nine points (9.2, 5.6, and 7.5 respectively).
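The per-model lift can be checked directly from the reported figures. The short Python sketch below also shows one plausible way to score a judge, as the share of its verdicts that match the human vote; the judge_accuracy helper is hypothetical and the benchmark's exact scoring may differ.

```python
def judge_accuracy(judge_verdicts, human_votes):
    """Percentage of comparisons where the judge agrees with the human vote."""
    if len(judge_verdicts) != len(human_votes) or not human_votes:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_votes))
    return 100.0 * matches / len(human_votes)


# Accuracy (%) without and with execution output, as reported in the article.
reported = {
    "GPT-4o": (54.6, 63.8),
    "Claude-Sonnet-4": (56.7, 62.3),
    "Qwen2.5-VL-72B": (58.7, 66.2),
}

# Lift in percentage points per judge.
lift = {
    model: round(with_exec - without_exec, 1)
    for model, (without_exec, with_exec) in reported.items()
}
```

The resulting lifts are 9.2, 5.6, and 7.5 points.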

AutoCodeArena

AutoCodeArena distills the live arena into 600 representative prompts drawn from the crowdsourced traffic. Each candidate model is scored against a GPT-4.1 baseline by an automated judge, Claude-3.7-Sonnet. The top performers were GPT-5, Claude-Opus-4, Claude-Sonnet-4, and a cluster of open models including Qwen3-Coder, Kimi-K2, and GLM-4.5, which narrowed the gap with the proprietary systems.
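The scoring loop can be sketched as a win rate against a fixed baseline. This Python example is an assumption-laden illustration: the candidate, baseline, and judge are passed in as plain callables, the verdict labels are invented, and counting a tie as half a win is a common convention rather than something the writeup confirms.

```python
def win_rate_vs_baseline(prompts, candidate, baseline, judge):
    """Score a candidate against a baseline over a fixed prompt set.

    candidate, baseline: callables mapping a prompt to an output.
    judge: callable (prompt, candidate_output, baseline_output) returning
           "candidate", "baseline", or "tie" (hypothetical labels).
    """
    if not prompts:
        raise ValueError("prompt set is empty")
    score = 0.0
    for prompt in prompts:
        verdict = judge(prompt, candidate(prompt), baseline(prompt))
        if verdict == "candidate":
            score += 1.0
        elif verdict == "tie":
            score += 0.5  # assumption: a tie counts as half a win
    return score / len(prompts)


# Stub models and a stub judge that simply prefers the longer output.
def stub_candidate(prompt):
    return prompt + "!"


def stub_baseline(prompt):
    return {"p3": "p3!", "p4": "p4!!"}.get(prompt, prompt)


def stub_judge(prompt, cand_out, base_out):
    if len(cand_out) > len(base_out):
        return "candidate"
    if len(cand_out) < len(base_out):
        return "baseline"
    return "tie"


rate = win_rate_vs_baseline(
    ["p1", "p2", "p3", "p4"], stub_candidate, stub_baseline, stub_judge
)
```

In a real harness the stubs would be replaced by model calls and by a judge that sees the executed output of both responses.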

Why it matters for code-generation QA

Most teams shipping AI-generated code still grade it the way academic benchmarks do: pass or fail against unit tests, or string similarity to a reference. That misses the failures that hurt in production. Code can compile, pass every test, and still render a broken layout, crash on a real input, or produce a chart that is technically valid and visually useless. BigCodeArena’s contribution is the insistence that a human or a judge model look at what the code does before scoring it.

The BigCodeReward result is the practical takeaway for QA automation. If you are using an LLM as a judge in your pipeline, feeding it the program’s actual output instead of just the source is worth a measurable accuracy gain. That is a cheap change with a real return, and it argues for building execution into your eval harness rather than treating it as optional.
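In a pipeline, that change amounts to adding the captured output to the judge's prompt. The Python sketch below shows one way to do it; the function name, section labels, and prompt wording are hypothetical and not taken from BigCodeReward.

```python
def build_judge_prompt(task, code_a, code_b, output_a=None, output_b=None):
    """Assemble a pairwise judging prompt, optionally with execution output.

    Pass output_a/output_b (for example captured stdout or a rendered-page
    description) to give the judge the program's behavior, not just its source.
    """
    sections = [f"Task:\n{task}", f"Response A source:\n{code_a}"]
    if output_a is not None:
        sections.append(f"Response A execution output:\n{output_a}")
    sections.append(f"Response B source:\n{code_b}")
    if output_b is not None:
        sections.append(f"Response B execution output:\n{output_b}")
    sections.append("Which response better satisfies the task? Answer A, B, or tie.")
    return "\n\n".join(sections)


source_only = build_judge_prompt("Print 6 times 7.", "print(6 * 7)", "print(67)")
with_execution = build_judge_prompt(
    "Print 6 times 7.", "print(6 * 7)", "print(67)",
    output_a="42", output_b="67",
)
```

Running the same judge on both prompt variants over a labeled set is a direct way to measure whether execution output helps on your own workloads.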

The caveats are worth holding onto. The traffic was dominated by visual, front-end-style tasks, so the rankings tell you more about generating an interactive demo than about a backend service or a data migration. Preference votes capture what looks right to a user, which is not the same as what is correct, secure, or maintainable. And the model field is a 2025 snapshot that will date quickly. Treat the platform as a strong template for execution-aware evaluation, then build your own prompts and your own pass criteria around the workloads your team actually ships.

Read the original: BigCodeArena: Judging code generations end to end with code executions — Hugging Face / BigCode, 2025-10-07.
