BigCode Eval Harness: pass@k Testing for Code Models
How the BigCode Evaluation Harness scores code-generation models on HumanEval, MBPP, APPS, DS-1000 and MultiPL-E using pass@k in Docker sandboxes.
Testing, evaluating, and validating AI-generated code and code-generation models.
How DeepMind's AlphaCode ranked within the top 54.3% of Codeforces participants, and how DeepMind built CodeContests to validate AI-generated code against hidden tests rather than lucky passes.
How DeepMind benchmarked AlphaCode on 10 Codeforces contests with a generate-filter-cluster-rerank pipeline, and what it teaches AI code QA.
How Google's MBPP benchmark grades AI-generated Python by execution, not text similarity, across model scales from 244M to 137B parameters.
OpenAI's Codex paper introduced HumanEval, 164 Python problems scored by pass@k functional correctness against unit tests. Here is how it works and why it matters.
A practitioner's guide to CodeXGLUE, Microsoft Research's 14-dataset, 10-task benchmark for code understanding and generation, plus what its metrics miss.
CodeBLEU is Microsoft Research's metric for grading generated code with AST and data-flow matching, correlating with human judgment better than BLEU.