BigCodeArena: How Execution Grounds Code-Gen Evaluation
How BigCodeArena (Hugging Face/BigCode) collects 4,700+ execution-grounded human votes and adds the BigCodeReward and AutoCodeArena benchmarks for code evaluation.
How Hugging Face BigCode's BigCodeArena ranks code-generation models by executing their output, plus the BigCodeReward and AutoCodeArena benchmarks.
How BigCodeBench from Hugging Face/BigCode tests LLMs on 1,140 tasks across 139 libraries, and why the gap between roughly 60% for the best models and 97% for humans matters for code QA.
How Hugging Face's BigCodeBench tests AI on 1,140 real Python tasks across 139 libraries, why GPT-4o hit 61.1%, and what it means for code QA.
How BigCode evaluated StarCoder2 (3B/7B/15B) on HumanEval+, MBPP+, DS-1000, and MultiPL-E, and what reproducible execution-based scoring means for code QA.
How the BigCode Evaluation Harness scores code-generation models on HumanEval, MBPP, APPS, DS-1000 and MultiPL-E using pass@k in Docker sandboxes.