- Meta’s CyberSecEval 3 evaluates 8 cybersecurity risks across LLMs, applied to Llama 3 and contemporaneous state-of-the-art models.
- Across the insecure-code tests, 31% of model completions contained a security vulnerability, detected by Meta’s CodeShield static scanner.
- The benchmark measures offensive uplift directly: how much a model improves an attacker’s exploit-generation and social-engineering output, with and without guardrails.
Overview
Released by Meta AI on August 2, 2024, CyberSecEval 3 is the third version of a benchmark suite built to answer a question most code-gen evals skip: not just whether a model writes working code, but whether it writes safe code, and whether it can be turned into a weapon. Shengye Wan, Joshua Saxe, and eleven co-authors applied it to Llama 3 and a set of competing frontier models.
For anyone shipping AI-generated code, this is one of the few public benchmarks that treats security as a first-class measurement rather than a footnote. It pairs defensive risk (insecure code the model produces by accident) with offensive risk (attacks the model can help a human carry out on purpose).
What CyberSecEval 3 measures
The suite spans 8 risks grouped into two buckets: harm to third parties, and harm to the developers and end users running an application built on the model. That split matters. A model that emits code vulnerable to SQL injection endangers the people who deploy it and their users; a model that drafts a convincing phishing email endangers everyone else.
Defensive risk: insecure code generation
The defensive half asks the model to write code in realistic contexts, then checks the result for known weakness patterns. The headline finding: across these completions, 31% contained a vulnerability. The detection is automated. Meta runs every generated snippet through CodeShield, a static analysis layer that scans for insecure patterns across multiple languages before the code is ever returned to a user.
A static scanner beats a model-as-judge here. The grading signal is deterministic and reproducible, and it maps to weakness classes a real security team already tracks. A hardcoded credential or an unsanitized shell call gets flagged the same way every time.
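To make the "flagged the same way every time" point concrete, here is a minimal sketch of a pattern-based scanner. It is not CodeShield and does not use CodeShield's API or rule set: the `RULES` table, the `Finding` type, and the `scan` function are hypothetical names invented for illustration, and the two regexes are deliberately simplistic.

```python
import re
from dataclasses import dataclass

# Hypothetical rules for illustration only; these are NOT CodeShield's rules.
RULES = {
    "hardcoded-credential": re.compile(
        r"""\b(password|passwd|secret|api_key|token)\s*=\s*["'][^"']+["']""",
        re.IGNORECASE,
    ),
    "unsanitized-shell-call": re.compile(r"\bos\.system\s*\(|shell\s*=\s*True"),
}


@dataclass(frozen=True)
class Finding:
    rule: str      # name of the rule that matched
    line_no: int   # 1-based line number in the scanned snippet
    text: str      # the offending line, stripped of surrounding whitespace


def scan(source: str) -> list[Finding]:
    """Return every rule match in `source`, in line order."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append(Finding(rule, line_no, line.strip()))
    return findings


if __name__ == "__main__":
    snippet = 'import os\npassword = "hunter2"\nos.system("ls " + user_input)\n'
    for finding in scan(snippet):
        print(f"{finding.line_no}: {finding.rule}: {finding.text}")
```

Because the output depends only on the input text and the rule table, two runs over the same snippet always agree, which is the property a model-as-judge cannot guarantee.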
Offensive risk: capability uplift
The more novel work in version 3 is the offensive battery. Rather than only asking “will the model say something harmful,” CyberSecEval 3 measures uplift: how much more capable an attacker becomes with the model than without it. Three areas get their own tests.
- Automated social engineering. Can the model generate persuasive spear-phishing content at scale, tailored to a target?
- Scaling manual offensive operations. Does the model help a human operator move faster through the stages of an attack, acting as a force multiplier rather than an autonomous actor?
- Autonomous offensive operations. Can the model run an attack on its own, end to end, without a human in the loop?
How the offensive tests are framed
The key design choice is the comparison baseline. For each capability, Meta evaluates output quality both with the model’s safety guardrails active and with them stripped or bypassed. The gap between those two numbers is the real signal. A model that refuses a malicious request when asked plainly, but complies once the prompt is reframed, has weak guardrails even if its default behavior looks clean.
This matters for exploit generation specifically. The benchmark looks at whether a model can take a vulnerability and produce working exploit code, and how much that capability changes when guardrails are removed. Measuring the delta, rather than a single absolute score, separates a model’s raw capability from the effectiveness of the controls layered on top of it. Both are worth knowing, and they are different things.
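The two quantities described above can be separated with simple arithmetic. The sketch below is an illustration under stated assumptions, not the paper's scoring method: `CapabilityScores`, `guardrail_gap`, and `suppression_ratio` are hypothetical names, the scores are assumed to be success rates in the range 0 to 1, and the example numbers are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapabilityScores:
    guarded: float    # assumed success rate with guardrails active, in [0, 1]
    unguarded: float  # assumed success rate with guardrails stripped or bypassed


def guardrail_gap(scores: CapabilityScores) -> float:
    """Absolute delta: how much capability the guardrails are holding back."""
    return scores.unguarded - scores.guarded


def suppression_ratio(scores: CapabilityScores) -> Optional[float]:
    """Fraction of raw capability the guardrails suppress.

    Returns None when there is no raw capability to suppress.
    """
    if scores.unguarded == 0:
        return None
    return 1.0 - scores.guarded / scores.unguarded


if __name__ == "__main__":
    # Made-up numbers, purely to show the calculation.
    example = CapabilityScores(guarded=0.25, unguarded=0.5)
    print("raw capability:", example.unguarded)
    print("guardrail gap:", guardrail_gap(example))
    print("suppression ratio:", suppression_ratio(example))
```

Reporting the unguarded score and the gap side by side keeps the two questions apart: what the model can do, and how much of that the controls actually block.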
The benchmark also keeps assisting a human separate from operating alone. Helping an operator write a phishing email is a different threat tier from running an unsupervised campaign, and collapsing them into one headline would mislead.
Why it matters for code-generation QA
Most teams evaluating AI-generated code measure functional correctness: does it compile, do the tests pass. CyberSecEval 3 is a reminder that a passing test suite says nothing about whether the code is safe. A function can return the right value and still ship an injection flaw. Internalize the 31% rate: if roughly one in three accepted snippets carries a weakness, a code-gen pipeline without a security gate is shipping known-bad code at scale.
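A back-of-the-envelope calculation shows what that rate implies at volume. This is a rough sketch, not a result from the benchmark: it assumes the 31% benchmark rate carries over to your own pipeline and that snippets are vulnerable independently of one another, and neither assumption is guaranteed to hold. The function names are hypothetical.

```python
# Assumption: each accepted snippet is independently vulnerable with
# probability 0.31 (the benchmark rate, carried over for illustration).
VULN_RATE = 0.31


def expected_vulnerable(n_snippets: int, rate: float = VULN_RATE) -> float:
    """Expected number of vulnerable snippets among n accepted ones."""
    return n_snippets * rate


def prob_at_least_one(n_snippets: int, rate: float = VULN_RATE) -> float:
    """Probability that at least one of n accepted snippets is vulnerable."""
    return 1.0 - (1.0 - rate) ** n_snippets


if __name__ == "__main__":
    for n in (1, 5, 10, 50):
        print(
            f"n={n:>2}  expected vulnerable={expected_vulnerable(n):5.1f}  "
            f"P(at least one)={prob_at_least_one(n):.3f}"
        )
```

Under these assumptions, the chance that a batch of ten ungated snippets contains at least one weakness is above 97%.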
The practical lesson is to copy the architecture, not just the result. Put a static scanner in the path between generation and acceptance, the way Meta uses CodeShield. Deterministic scanning catches the common weakness classes cheaply and consistently, and it gives you a reproducible signal you can track release over release. Reserve human review and slower dynamic analysis for the paths that touch authentication, secrets, or untrusted input.
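One way to wire that architecture is a small gate function that sits between generation and acceptance. The sketch below is hypothetical throughout: `Decision`, `SENSITIVE_MARKERS`, `gate`, and `toy_scanner` are illustrative names, the scanner is passed in as a plain callable rather than tied to any real tool, and the path-based routing rule is an assumption about how a team might identify sensitive code.

```python
from enum import Enum
from typing import Callable, Sequence


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    HUMAN_REVIEW = "human_review"


# Hypothetical markers for paths that touch authentication, secrets,
# or untrusted input; a real project would define its own list.
SENSITIVE_MARKERS = ("auth", "secret", "login", "upload")


def gate(path: str, source: str, scanner: Callable[[str], Sequence]) -> Decision:
    """Decide what happens to a generated snippet before it is accepted."""
    # Any deterministic scanner finding blocks the snippet outright.
    if scanner(source):
        return Decision.REJECT
    # A clean scan is not proof of safety, so sensitive paths still
    # go to slower human review.
    if any(marker in path.lower() for marker in SENSITIVE_MARKERS):
        return Decision.HUMAN_REVIEW
    return Decision.ACCEPT


def toy_scanner(source: str) -> list[str]:
    """Stand-in scanner for the example: flags any use of eval()."""
    return ["eval-call"] if "eval(" in source else []


if __name__ == "__main__":
    print(gate("utils/format.py", "def f(x): return x + 1", toy_scanner))
    print(gate("utils/format.py", "eval(user_input)", toy_scanner))
    print(gate("services/auth/session.py", "def f(x): return x + 1", toy_scanner))
```

Keeping the scanner as an injected callable means the same gate logic works whether the check is a regex pass, a commercial tool, or several scanners chained together.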
The caveats are real. Static analysis has a false-negative problem: a clean CodeShield pass means “no known pattern matched,” not “this code is secure.” Logic flaws, business-rule violations, and novel weaknesses slip through pattern matching entirely. The offensive uplift tests, meanwhile, are a moving target: as jailbreak techniques evolve, a guardrail that looked solid in August 2024 may not hold a year later, so these measurements need re-running, not archiving.
Treat CyberSecEval 3 as one instrument in a security-aware eval portfolio. It pairs a strong defensive scan with an honest offensive baseline. It is not a certificate that AI-written code is safe to deploy.
Read the original: CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models — Meta AI, 2024-08-02.
