Code Generation QA

DeepSeek-Coder-V2: How HumanEval and MBPP Are Scored

DeepSeek-Coder-V2 scores 90.2% on HumanEval and 76.2% on MBPP as measured through EvalPlus. A practitioner's guide to what each code benchmark actually measures and where it breaks.

June 17, 2024 4 min read

Code Generation QA

Codestral: How Mistral Evaluates Its First Code Model

How Mistral evaluates Codestral, its first code model (22B parameters), across HumanEval, MBPP, CruxEval, RepoBench, Spider, and fill-in-the-middle benchmarks.

May 29, 2024 4 min read

Code Generation QA

CyberSecEval 2: How Meta AI Tests LLM Code Security

Meta AI's CyberSecEval 2 benchmark tests LLMs for prompt injection, code interpreter abuse, and exploit generation. What its findings mean for code QA.

April 19, 2024 4 min read

Code Generation QA

CodeQwen1.5 Explained: How Alibaba Benchmarks Code AI

How Alibaba evaluated CodeQwen1.5-7B across HumanEval, MBPP, LiveCodeBench, SWE-Bench and more, plus the LeetCode contamination caveat the team acknowledges.

April 16, 2024 4 min read

Code Generation QA

How BigCode Benchmarks StarCoder2: A Code-LLM Eval Guide

How BigCode evaluated StarCoder2 (3B/7B/15B) on HumanEval+, MBPP+, DS-1000, and MultiPL-E, and what reproducible execution-based scoring means for code QA.

February 29, 2024 4 min read

Code Generation QA

DeepSeek-Coder Explained: HumanEval, MBPP and Code Evals

How DeepSeek-Coder was trained and benchmarked on HumanEval, MBPP, DS-1000 and LeetCode, and what its 50.3% pass@1 means for AI code-generation QA.

January 25, 2024 4 min read

Code Generation QA

Inside Meta CyberSecEval: Testing LLMs for Insecure Code

How Meta CyberSecEval measures insecure code from LLMs across 8 languages and 50 CWEs, plus the finding that stronger models write more unsafe code.

December 7, 2023 4 min read

Code Generation QA

How AlphaCode 2 Uses Gemini to Solve 43% of Codeforces

Inside Google DeepMind's AlphaCode 2: how Gemini Pro, million-sample generation, and execution-based filtering hit the 85th percentile on Codeforces.

December 6, 2023 4 min read

Code Generation QA

Code Llama Evals Explained: HumanEval, MBPP, MultiPL-E

How Meta evaluated Code Llama on HumanEval, MBPP, and MultiPL-E with pass@k execution grading, what the 34B scores mean, and the QA caveats.

August 24, 2023 4 min read

Code Generation QA

WizardCoder: How Microsoft Evolved Code LLM Benchmarks

How Microsoft's WizardCoder uses Evol-Instruct to hit 57.3 pass@1 on HumanEval, beating Claude and Bard on code benchmarks at just 15B parameters.

June 14, 2023 4 min read