CyberSecEval 2: How Meta AI Tests LLM Code Security

Meta AI’s CyberSecEval 2 (April 2024) adds three new attack categories to its LLM security suite: prompt injection, code interpreter abuse, and software exploit generation.
Every model tested fell for prompt injection between 26% and 41% of the time, including GPT-4, Mistral, Llama 3 70B-Instruct, and Code Llama.
The benchmark introduces the False Refusal Rate (FRR), a metric that quantifies how often a model rejects harmless requests in the name of safety.

Overview

CyberSecEval 2 is Meta AI’s expanded benchmark for measuring the security behavior of large language models. The first version focused narrowly on whether models generate insecure code and whether they help with cyberattacks. This release widens the scope to cover how models behave when an attacker tries to hijack them, when they run code in an interpreter, and when they are asked to write working exploits.

The work matters because teams now ship LLMs into agents, IDEs, and code-execution sandboxes where a single susceptible model becomes an attack surface. CyberSecEval 2, published 19 April 2024 with Joshua Saxe among the authors, gives those teams a repeatable way to put numbers on that risk instead of guessing.

What the benchmark actually tests

The suite is organized around four problem areas. Two carry over and sharpen the original focus on cyberattack helpfulness and insecure code; three are new attack surfaces that reflect how LLMs are deployed in production systems.

Prompt injection

This is the headline category. The authors built test cases where untrusted input tries to override the model’s original instructions, the same way a malicious web page or document might subvert an agent reading it. The result was uncomfortable for everyone: across the models evaluated, between 26% and 41% of injection attempts succeeded. That range held across GPT-4, Mistral, Llama 3 70B-Instruct, and Code Llama, which tells you this is not a quirk of one vendor’s alignment work. It is a structural weakness in how current models separate instructions from data.

Code interpreter abuse

Many assistants can execute code in a sandbox. CyberSecEval 2 probes whether a model will use that capability to do something harmful: break out of the sandbox, exfiltrate data, or run privilege-escalation logic. The test measures whether the model complies with requests that abuse interpreter access rather than refusing them.

Software exploit generation

Here the benchmark asks models to produce working exploits for vulnerabilities, then checks whether the generated code actually functions. This is the hardest task in the suite because a half-correct exploit does nothing. The finding is two-sided. Models with coding ability outperform general-purpose ones, which is intuitive. But none of them are good at it yet. The paper is explicit that further work is needed before LLMs become proficient at exploit generation, which is a meaningful safety result on its own.

The False Refusal Rate

The most useful conceptual contribution is the False Refusal Rate. A model can score perfectly on safety by simply refusing everything, which makes it useless. FRR measures the opposite failure: how often a model rejects a benign request because it pattern-matches to something risky. Think of a developer asking how to write a port scanner for their own network, or how a buffer overflow works, and getting stonewalled.

By pairing FRR with the attack-success metrics, CyberSecEval 2 forces the safety-versus-utility tradeoff into the open. The authors report that many models managed to comply with borderline-but-legitimate requests while still rejecting most genuinely unsafe ones, which is the behavior you actually want. A benchmark that only counted refusals would have rewarded the most paranoid model; FRR penalizes that.

Why it matters for code-generation QA

If you evaluate AI-generated code, CyberSecEval 2 reframes what “good” means. Correctness and test-pass rate are table stakes. The harder question is whether the model that writes your code can be turned against you, and the 26-41% injection numbers say the answer is often yes. Any pipeline that feeds external content into a code-writing model, code review on pull requests, documentation ingestion, dependency analysis, inherits that exposure.

Two caveats keep this honest. First, the absolute scores are a snapshot of mid-2024 models; the methodology outlasts the specific numbers, so treat the framework as the durable part and re-run it against your current model. Second, a static benchmark can only approximate adversaries who adapt, so a passing grade here is a floor, not a guarantee. Pair it with red-teaming on your own prompts and data.

The deeper lesson is that security and utility are not separate dashboards. FRR makes that concrete by refusing to let a team optimize one while quietly wrecking the other. For QA teams, that means any safety gate on generated code needs a matching false-positive measurement, or you will ship a model that blocks your own engineers more than it blocks attackers. CyberSecEval 2 is one of the clearer attempts to measure both sides at once, and that balance is what makes it worth adopting.

Read the original: CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models — Meta AI, 2024-04-19.

Overview

What the benchmark actually tests

Prompt injection

Code interpreter abuse

Software exploit generation

The False Refusal Rate

Why it matters for code-generation QA

Related Articles

Building a Culture of Continuous Testing in Startups

Scriptless vs. Scripted Testing: Which Suits Your Startup?

Streamlining QA with AI-Powered Code Generation