DeepSeek-Coder-V2: How HumanEval and MBPP Are Scored
DeepSeek-Coder-V2 hits 90.2% HumanEval and 76.2% MBPP via EvalPlus. A practitioner guide to what each code benchmark actually measures and where it breaks.
Testing, evaluating, and validating AI-generated code and code-generation models.
How Mistral evaluates Codestral, its first code model (22B), across HumanEval, MBPP, CruxEval, RepoBench, Spider, and fill-in-the-middle benchmarks.
Meta AI's CyberSecEval 2 benchmark tests LLMs for prompt injection, code interpreter abuse, and exploit generation. What its findings mean for code QA.
How Alibaba evaluated CodeQwen1.5-7B across HumanEval, MBPP, LiveCodeBench, SWE-Bench and more, plus the LeetCode contamination caveat it admits.
How BigCode evaluated StarCoder2 (3B/7B/15B) on HumanEval+, MBPP+, DS-1000, and MultiPL-E, and what reproducible execution-based scoring means for code QA.
How DeepSeek-Coder was trained and benchmarked on HumanEval, MBPP, DS-1000 and LeetCode, and what its 50.3% pass@1 means for AI code-generation QA.
How Meta CyberSecEval measures insecure code from LLMs across 8 languages and 50 CWEs, plus the finding that stronger models write more unsafe code.
Inside Google DeepMind's AlphaCode 2: how Gemini Pro, million-sample generation, and execution-based filtering hit the 85th percentile on Codeforces.
How Meta evaluated Code Llama on HumanEval, MBPP, and MultiPL-E with pass@k execution grading, what the 34B scores mean, and the QA caveats.
How Microsoft's WizardCoder uses Evol-Instruct to hit 57.3 pass@1 on HumanEval, beating Claude and Bard on code benchmarks at just 15B parameters.