Code Generation QA

BigCode Eval Harness: pass@k Testing for Code Models

How the BigCode Evaluation Harness scores code-generation models on HumanEval, MBPP, APPS, DS-1000 and MultiPL-E using pass@k in Docker sandboxes.

August 9, 2022 4 min read

Code Generation QA

AlphaCode Explained: How DeepMind Tests AI-Written Code

How DeepMind's AlphaCode placed in the top 54.3% on Codeforces, and how DeepMind built CodeContests to validate AI-generated code against hidden tests rather than luck.

February 8, 2022 4 min read

Code Generation QA

How AlphaCode Tested AI Code on Codeforces

How DeepMind benchmarked AlphaCode on 10 Codeforces contests with a generate-filter-cluster-rerank pipeline, and what that teaches about QA for AI-generated code.

February 2, 2022 4 min read

Code Generation QA

MBPP: How Google Tests LLM Code Generation

How Google's MBPP benchmark grades AI-generated Python by execution, not text similarity, across model scales from 244M to 137B parameters.

August 16, 2021 4 min read

Code Generation QA

How OpenAI’s HumanEval and pass@k Score AI Code

OpenAI's Codex paper introduced HumanEval, 164 Python problems scored by pass@k functional correctness against unit tests. Here is how it works and why it matters.

July 7, 2021 4 min read

Code Generation QA

CodeXGLUE Explained: Microsoft’s Code Generation Benchmark

A practitioner's guide to CodeXGLUE, Microsoft Research's 14-dataset, 10-task benchmark for code understanding and generation, plus what its metrics miss.

February 9, 2021 4 min read

Code Generation QA

How CodeBLEU Scores AI-Generated Code (Microsoft Research)

CodeBLEU is Microsoft Research's metric for grading generated code with AST and data-flow matching, correlating with human judgment better than BLEU.

September 22, 2020 4 min read