Inside Meta CyberSecEval: Testing LLMs for Insecure Code

Meta’s CyberSecEval is among the first benchmarks built to measure how often LLM coding assistants emit insecure code. It spans 8 programming languages and 50 CWEs, with test cases drawn from real open-source repositories.
Across seven models from the Llama 2, Code Llama, and OpenAI GPT families, the LLMs suggested vulnerable code roughly 30% of the time.
The uncomfortable finding: more capable models tended to produce more insecure code, not less, because stronger autocomplete reproduces flawed real-world patterns more faithfully.

Overview

Published by Meta AI in December 2023, CyberSecEval reframes a question most code benchmarks ignore. HumanEval and SWE-bench ask whether generated code works. CyberSecEval asks whether it is safe. Written by a team of 21 authors led by Manish Bhatt, the paper introduces a repeatable way to quantify two things: how often a model writes code containing known security weaknesses, and how readily it complies with requests to help carry out a cyberattack.

For anyone shipping AI-written code, this is the part of the eval picture that a passing unit test will never catch. A patch can be functionally correct and still hand an attacker a SQL injection.

How the benchmark is built

The engine underneath CyberSecEval is the Insecure Code Detector (ICD), a static analyzer the authors wrote to recognize insecure coding practices. The ICD encodes roughly 50 Common Weakness Enumerations (CWEs) as detection rules across 8 languages: C, C++, C#, Java, JavaScript, PHP, Python, and Rust. Rather than hand-author test cases, the team ran the ICD over large volumes of open-source code to find genuine examples of insecure patterns. Those real snippets become the seeds for the test set.
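To make the idea of "CWEs encoded as detection rules" concrete, here is a minimal sketch of a rule-based scanner. It is not the ICD: the `Rule` class, the `RULES` list, and the three regex patterns are hypothetical and chosen only for illustration, and a real analyzer would carry far more rules and far less naive matching.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    cwe: str              # CWE identifier the rule reports
    language: str         # language the pattern applies to
    pattern: re.Pattern   # compiled regex that signals the weakness
    description: str


# Hypothetical rules for illustration only; not taken from the ICD.
RULES = [
    Rule("CWE-328", "python", re.compile(r"\bhashlib\.(md5|sha1)\s*\("),
         "weak hash function"),
    Rule("CWE-78", "python", re.compile(r"\bos\.system\s*\("),
         "possible OS command injection"),
    Rule("CWE-676", "c", re.compile(r"\b(strcpy|gets)\s*\("),
         "use of potentially dangerous function"),
]


def scan(code: str, language: str) -> list[Rule]:
    """Return every rule for `language` whose pattern matches `code`."""
    return [r for r in RULES if r.language == language and r.pattern.search(code)]


for finding in scan("digest = hashlib.md5(password.encode()).hexdigest()", "python"):
    print(finding.cwe, finding.description)
```

Running a scanner like this over a large corpus and keeping the matches is, in spirit, how flagged real-world snippets can become seeds for a test set.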

That sourcing decision matters. The vulnerabilities the benchmark probes for are not synthetic textbook bugs. They are the kinds of mistakes that already exist in public repositories, which is exactly the corpus these models trained on.

Two ways to provoke insecure code

CyberSecEval splits the insecure-code evaluation into two test styles, because models fail differently depending on how you ask.

Autocomplete tests take a real code snippet that leads into an insecure pattern, cut it off, and let the model finish the line or block. This measures the model’s instinct when it is simply continuing existing code, the most common way coding assistants are used.
Instruct tests translate an insecure code example into a natural-language instruction, then ask the model to write code from that prompt. This isolates whether the model introduces the weakness on its own when given a task description rather than a code prefix.
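The two styles can be sketched as two small prompt builders. This is an illustrative approximation, not the paper's tooling: `TestCase`, both function names, and the `context` parameter are assumptions, and the task description is supplied by hand here rather than derived from the insecure snippet.

```python
from dataclasses import dataclass


@dataclass
class TestCase:
    style: str   # "autocomplete" or "instruct"
    prompt: str  # text handed to the model
    cwe: str     # weakness found in the original snippet


def make_autocomplete_case(source: str, flagged_line: int, cwe: str,
                           context: int = 10) -> TestCase:
    """Keep up to `context` lines before the flagged line (0-indexed).

    The flagged line itself is cut off, so the model has to write it.
    """
    lines = source.splitlines()
    start = max(0, flagged_line - context)
    return TestCase("autocomplete", "\n".join(lines[start:flagged_line]), cwe)


def make_instruct_case(task_description: str, language: str, cwe: str) -> TestCase:
    """Wrap a natural-language task description as an instruction prompt."""
    prompt = f"Write {language} code that does the following: {task_description}"
    return TestCase("instruct", prompt, cwe)


source = (
    "import hashlib\n"
    "\n"
    "def store(password):\n"
    "    return hashlib.md5(password.encode()).hexdigest()\n"
)
case = make_autocomplete_case(source, flagged_line=3, cwe="CWE-328")
print(case.prompt)  # everything before the md5 line
```

The autocomplete prompt never contains the insecure line, so any weakness in the completion is the model's own choice of continuation.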

The grading is mechanical and consistent: the same Insecure Code Detector that built the tests re-scans the model’s output. If the ICD flags a CWE in the generated code, that response counts as insecure. Using one analyzer for both construction and scoring keeps the measurement reproducible, though it also ties the results to whatever the ICD can and cannot see.
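The scoring step reduces to a ratio: flagged responses over total responses. A minimal sketch follows, with a toy substring check standing in for the real detector. The function names and the sample outputs are made up for illustration.

```python
from typing import Callable, Iterable


def insecure_rate(responses: Iterable[str],
                  is_insecure: Callable[[str], bool]) -> float:
    """Fraction of responses the detector flags as insecure."""
    responses = list(responses)
    if not responses:
        raise ValueError("no responses to score")
    flagged = sum(1 for r in responses if is_insecure(r))
    return flagged / len(responses)


def toy_detector(code: str) -> bool:
    # Stand-in for a real static analyzer; it sees only these two patterns.
    return "md5(" in code or "os.system(" in code


# Invented model outputs, one of which uses a weak hash.
outputs = [
    "hashlib.sha256(data).hexdigest()",
    "hashlib.md5(data).hexdigest()",
    "subprocess.run(['ls', path], check=True)",
    "secrets.token_hex(16)",
]
rate = insecure_rate(outputs, toy_detector)
print(f"insecure rate: {rate:.0%}")
```

The limitation is visible in the sketch itself: anything `toy_detector` has no pattern for counts as secure, whether or not it is.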

What the models actually did

Across the seven models tested, insecure suggestions landed at about 30% on average. The number alone is striking, but the pattern behind it is the real lesson, and it is counterintuitive: the more capable a model was at coding, the more often it generated insecure code. A better autocomplete model is better at reproducing the patterns in its training data, and a large share of those patterns are themselves insecure. Capability and safety pulled in opposite directions.

The second domain the paper measures is cyberattack compliance: how willing a model is to assist with offensive operations when asked. CyberSecEval probes whether the model helps with tasks mapped to the MITRE ATT&CK framework rather than refusing. Together the two domains give a fuller risk profile than either would alone: one covers accidental harm in everyday coding, the other deliberate misuse.

Why it matters for code-generation QA

The structural insight here is worth more than the 30% figure. Functional correctness and security are independent axes, and most teams only test one of them. If your evaluation pipeline grades AI-generated code purely on whether tests pass, you have no signal at all on whether that code is exploitable. CyberSecEval shows you can automate the second axis: encode known weakness patterns as static-analysis rules, then run those rules over model output as a hard gate. That is a pattern any team can adopt with an off-the-shelf SAST tool.
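One way to picture the two-axis gate is a result type that accepts code only when tests pass and the scanner reports nothing. The sketch below is an assumption-laden illustration: `GateResult`, `evaluate`, and `toy_scan` are hypothetical names, the test runner is a stub that always passes, and the single regex stands in for a real SAST tool.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class GateResult:
    tests_passed: bool
    findings: list[str] = field(default_factory=list)

    @property
    def accepted(self) -> bool:
        # Both axes must be clean: functional AND no detected weakness.
        return self.tests_passed and not self.findings


def evaluate(code: str,
             run_tests: Callable[[str], bool],
             scan: Callable[[str], Iterable[str]]) -> GateResult:
    return GateResult(tests_passed=run_tests(code), findings=list(scan(code)))


def toy_scan(code: str) -> list[str]:
    # Hypothetical rule: SQL passed to execute() as an f-string (CWE-89).
    return ["CWE-89"] if re.search(r"execute\(\s*f[\"']", code) else []


candidate = '''
def find_user(cursor, name):
    cursor.execute(f"SELECT * FROM users WHERE name = '{name}'")
    return cursor.fetchone()
'''

# Stub test runner: pretend the functional suite passes.
result = evaluate(candidate, run_tests=lambda code: True, scan=toy_scan)
print(result.tests_passed, result.findings, result.accepted)
```

Here the candidate is functionally fine and still rejected, which is exactly the case a tests-only pipeline would wave through.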

The inverse-scaling finding should change how teams reason about model upgrades. Swapping in a stronger coding model is usually treated as a pure win. CyberSecEval is evidence that a more capable model can raise your security risk while raising your pass rate, because it mirrors insecure training data more fluently. Re-run your security evals after every model change, not just your functional ones.
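A model-upgrade check can treat the two rates separately, so a gain on one axis cannot hide a loss on the other. The following is a minimal sketch: `EvalReport`, `upgrade_verdict`, and the `tolerance` threshold are hypothetical, and the numbers are invented for illustration rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    model: str
    pass_rate: float      # fraction of functional tests passed
    insecure_rate: float  # fraction of outputs flagged by the security scan


def upgrade_verdict(old: EvalReport, new: EvalReport,
                    tolerance: float = 0.01) -> str:
    """Block an upgrade if either axis regresses by more than `tolerance`."""
    if new.insecure_rate - old.insecure_rate > tolerance:
        return "block: insecure rate regressed"
    if old.pass_rate - new.pass_rate > tolerance:
        return "block: pass rate regressed"
    return "ok"


# Invented figures: the new model passes more tests but is flagged more often.
current = EvalReport("model-a", pass_rate=0.60, insecure_rate=0.20)
candidate = EvalReport("model-b", pass_rate=0.70, insecure_rate=0.30)
print(upgrade_verdict(current, candidate))
```

Checking the security axis first is a design choice in this sketch: a rising pass rate never offsets a rising insecure rate.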

The caveats are real and the authors are clear about them. A static analyzer has finite coverage: the ICD detects the CWEs it has rules for and misses everything else, so a clean score means “no detected weakness,” not “secure.” The benchmark also reflects a December 2023 snapshot of those specific models, and later releases with security-tuned training will behave differently. Treat CyberSecEval as one layer. Pair it with dynamic testing, dependency scanning, and human review on any code path that touches authentication, data, or untrusted input. As a way to make insecure-code risk measurable and repeatable, though, it set the template that later security benchmarks built on.

Read the original: Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models — Meta AI, 2023-12-07.
