Hugging Face | Gen.QA

Code Generation QA

BigCodeArena: How Execution Grounds Code-Gen Evaluation

How BigCodeArena (Hugging Face / BigCode) collects 4,700+ execution-grounded human votes and adds the BigCodeReward and AutoCodeArena benchmarks for code evaluation.

October 9, 2025 4 min read

Code Generation QA

BigCodeArena: How Hugging Face Evaluates AI Code by Running It

How BigCodeArena from Hugging Face / BigCode ranks code-generation models by executing their output, and what the BigCodeReward and AutoCodeArena benchmarks add.

October 7, 2025 4 min read

Code Generation QA

BigCodeBench: Testing How LLMs Handle Real Library Calls

How BigCodeBench from Hugging Face / BigCode tests LLMs on 1,140 tasks across 139 libraries, and why the gap between roughly 60% for the best models and 97% for humans matters for code QA.

June 22, 2024 4 min read

Code Generation QA

BigCodeBench Explained: A Realistic Code-Gen Benchmark

How Hugging Face's BigCodeBench tests AI on 1,140 real Python tasks across 139 libraries, why GPT-4o hit 61.1%, and what it means for code QA.

June 18, 2024 4 min read

Code Generation QA

How BigCode Benchmarks StarCoder2: A Code-LLM Eval Guide

How BigCode evaluated StarCoder2 (3B/7B/15B) on HumanEval+, MBPP+, DS-1000, and MultiPL-E, and what reproducible execution-based scoring means for code QA.

February 29, 2024 4 min read

Code Generation QA

BigCode Eval Harness: pass@k Testing for Code Models

How the BigCode Evaluation Harness scores code-generation models on HumanEval, MBPP, APPS, DS-1000, and MultiPL-E using pass@k in Docker sandboxes.

August 9, 2022 4 min read