SWE-bench Verified: OpenAI’s Cleaned-Up Coding Benchmark

SWE-bench Verified is a 500-task subset of SWE-bench, hand-checked by 93 professional Python developers to remove broken evaluation cases.
The screening exposed how flawed the original set was: 38.3% of sampled tasks had underspecified problem statements and 61.1% had unit tests that could fail correct fixes.
On the cleaned set, GPT-4o with the Agentless scaffold resolved 33.2% of tasks, roughly double its 16% score on the original benchmark, mostly because the noise was gone.

Overview

SWE-bench Verified is a curated slice of SWE-bench, the benchmark that measures whether an AI agent can fix a real GitHub issue by editing a codebase until the project’s hidden tests pass. OpenAI’s Preparedness team built it in August 2024 alongside the original SWE-bench authors, and it answers a question that had been quietly undermining coding-agent scores: how many of the original tasks were even solvable as written?

The answer was uncomfortable. A large share of SWE-bench tasks were either impossible to solve from the issue text alone or graded by tests that would reject a perfectly good fix. Verified strips those out, leaving 500 tasks that a human engineer could plausibly complete and that the grader can score fairly.

How SWE-bench works in the first place

Each SWE-bench task starts from a closed GitHub issue in a popular open-source Python project. The agent sees the repository at the commit before the fix, plus the issue text. Its job is to produce a code change. Grading is automatic: each task carries a set of FAIL_TO_PASS tests, taken from the pull request that fixed the issue, which fail before the real fix and should pass after it, plus PASS_TO_PASS tests that pass both before and after and must stay green so the agent does not break unrelated behavior.
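The grading rule described above can be sketched in a few lines. This is a minimal illustration, not the SWE-bench harness itself: the function name is_resolved, the shape of the results mapping, and the test ids are all assumptions made for the example.

```python
from typing import Dict, Iterable


def is_resolved(
    results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    `results` maps a test id to True (passed) or False (failed).
    The structure is hypothetical, chosen for illustration.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results is treated as a failure.
    return all(results.get(test_id, False) for test_id in required)


# Hypothetical test ids and outcomes.
results = {
    "tests/test_parser.py::test_issue_regression": True,
    "tests/test_parser.py::test_existing_behavior": True,
}
fail_to_pass = ["tests/test_parser.py::test_issue_regression"]
pass_to_pass = ["tests/test_parser.py::test_existing_behavior"]

print(is_resolved(results, fail_to_pass, pass_to_pass))
```

The point of the all-or-nothing rule is that a patch which fixes the issue but breaks one previously passing test still counts as unresolved.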

That design is appealing because nothing about it is synthetic. The bugs are real, the tests are the maintainers’ own, and a passing run means the agent’s patch satisfies the same checks the original human PR did. The weakness is that the benchmark trusts the issue text and the test suite to be a fair specification. Often they are not.

What the 93 developers actually found

OpenAI hired 93 developers experienced in Python and had them review 1,699 SWE-bench samples by hand. Each annotator rated two things on a 0 to 3 scale: whether the issue description was underspecified, and whether the FAIL_TO_PASS tests would reject a valid solution. A score of 2 or 3 on either axis marked the task as too broken to keep.
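As a rough sketch, the filtering rule looks like this. The threshold of 2 comes from the description above; combining several annotators by taking the worst score on each axis is an assumption of this example, and the function and variable names are hypothetical.

```python
from typing import Iterable, Tuple

# A score of 2 or 3 on either axis marks the sample as too broken to keep.
FILTER_THRESHOLD = 2


def keep_sample(ratings: Iterable[Tuple[int, int]]) -> bool:
    """Decide whether a sample survives screening.

    `ratings` holds one (underspecified, unfair_tests) pair per annotator,
    each score in the range 0..3. Taking the worst score across annotators
    is an assumption made for this sketch.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one annotator rating is required")
    worst_spec = max(spec for spec, _ in ratings)
    worst_tests = max(tests for _, tests in ratings)
    return worst_spec < FILTER_THRESHOLD and worst_tests < FILTER_THRESHOLD


print(keep_sample([(0, 1), (1, 0), (1, 1)]))  # every score is below 2
print(keep_sample([(0, 0), (0, 3), (1, 0)]))  # one annotator scored the tests 3
```

Under this rule a single severe rating on either axis is enough to drop a sample, which is a conservative choice: it removes some usable tasks in exchange for more confidence in the ones that remain.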

Two failure modes dominated

The numbers tell the story. Across the reviewed samples, 38.3% were flagged for problem statements too vague to solve from the text, and 61.1% for unit tests that could fail a correct implementation. A test might assert one exact error message, one specific function signature, or behavior the issue never mentioned. Counting other issues the annotators flagged, 68.3% of reviewed samples were filtered out. The 500-task Verified set was drawn from what survived.

A new evaluation harness

Screening tasks was only half the work. The team also rebuilt the execution layer with the SWE-bench authors, moving evaluation into containerized Docker environments. The original setup was notoriously fragile to reproduce, so dependency drift and environment failures were polluting scores in ways that had nothing to do with model skill. Standardized containers made a passing test mean the patch worked, not that someone got the environment to cooperate.
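To make the idea of a standardized container concrete, here is a small sketch that builds a docker run command for one task's tests. It is not the SWE-bench harness's actual invocation: the image name, the mount path, and the use of pytest as the test runner are assumptions for illustration, and some projects use their own runners.

```python
from typing import List


def build_docker_command(image: str, repo_dir: str, test_ids: List[str]) -> List[str]:
    """Build an argument list that runs the given tests inside a container.

    The image name and the /workspace path are hypothetical. Using pytest
    as the runner is also an assumption of this sketch.
    """
    return [
        "docker", "run",
        "--rm",                        # remove the container when it exits
        "--network", "none",           # no network, so installs cannot drift
        "-v", f"{repo_dir}:/workspace",  # mount the patched repository
        "-w", "/workspace",            # run from the repository root
        image,
        "python", "-m", "pytest", *test_ids,
    ]


cmd = build_docker_command(
    image="example/task-env:latest",  # hypothetical pre-built environment image
    repo_dir="/tmp/patched-repo",
    test_ids=["tests/test_parser.py::test_issue_regression"],
)
print(" ".join(cmd))
```

Because the dependencies are baked into the image rather than installed at run time, two people running the same command should see the same test outcome. The list can be passed to subprocess.run on a machine with Docker installed.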

Why the scores jumped

After cleanup, GPT-4o paired with the open-source Agentless scaffold resolved 33.2% of Verified tasks, up from 16% on the original benchmark. Read that carefully. The model did not get better between the two runs. The benchmark stopped penalizing correct answers. The resolve rate roughly doubled once unsolvable and unfairly graded tasks were removed, which means a large share of the apparent failures on old SWE-bench were the benchmark’s fault, not the agent’s. That gap is the single most useful finding here for anyone who reads agent leaderboards.
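The size of the jump is easy to check from the two figures quoted above. The variable names below are only for illustration.

```python
original_rate = 16.0  # % resolved on the original SWE-bench (from the text)
verified_rate = 33.2  # % resolved on SWE-bench Verified (from the text)

ratio = verified_rate / original_rate
point_gain = verified_rate - original_rate

print(f"ratio: {ratio:.1f}x")
print(f"gain: {point_gain:.1f} percentage points")
```

Keep in mind that the two rates are measured over different task sets, so the ratio describes how much the cleanup changed the score, not a per-task improvement.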

Why it matters for code-generation QA

The lesson generalizes well beyond this one benchmark: a coding eval is only as trustworthy as its specification and its grader. If you score AI-generated code with a test suite, two silent failure modes will inflate or deflate your numbers exactly as they did here. Underspecified tasks reward guessing and punish reasonable interpretations. Brittle tests reject valid code because it phrased an error differently or chose a different valid signature. Neither has anything to do with whether the code is correct.

For teams building internal evals, the takeaway is to audit your fixtures the way OpenAI audited SWE-bench. Have humans try to solve a sample of tasks from the prompt alone, and check whether your assertions accept more than one correct answer. A test that pins an exact string is measuring conformance to your phrasing, not the model’s engineering ability.
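A small example shows the difference between a test that pins phrasing and one that checks behavior. Everything here is hypothetical: parse_port stands in for model-generated code, and the "expected" message in the brittle check is an invented string of the kind a reference implementation might happen to use.

```python
from typing import Callable


def parse_port(value: str) -> int:
    """Hypothetical code under test: parse a TCP port number."""
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def brittle_check(func: Callable[[str], int]) -> bool:
    """Pins one exact error message, so a correct fix with other wording fails."""
    try:
        func("70000")
    except ValueError as exc:
        return str(exc) == "Invalid port 70000"
    return False


def behavioral_check(func: Callable[[str], int]) -> bool:
    """Accepts any ValueError, which is the behavior the task actually requires."""
    try:
        func("70000")
    except ValueError:
        return True
    return False


print(brittle_check(parse_port))     # rejects a correct implementation
print(behavioral_check(parse_port))  # accepts it
```

Both checks exercise the same input. Only the second one would pass every reasonable implementation, which is the property to look for when auditing assertions in an internal eval.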

The caveats are worth stating plainly. Verified is still 500 Python tasks drawn from a handful of mature repositories, so a high score does not promise the same agent will handle your TypeScript monolith or your undocumented internal service. It rewards self-contained bug fixes, not multi-file features, design decisions, or ambiguous product requirements. And because it became the dominant coding-agent benchmark almost immediately, there is real pressure to optimize toward it, which makes a single number a poor proxy for production readiness. Treat it as one well-controlled signal in a wider eval suite, not the verdict.

Read the original: Introducing SWE-bench Verified — OpenAI, 2024-08-13.
