How Anthropic Evaluates Claude on SWE-bench Verified

Claude 3.5 Sonnet resolved 49% of SWE-bench Verified, a 500-problem human-reviewed subset of real GitHub issues, beating the prior 45% state of the art.
The result came from a deliberately thin agent: a prompt plus two tools, a Bash tool and an Edit tool, with no rigid workflow imposed on the model.
Every patch was graded by the original repository’s own unit tests, the same tests that the human pull request had to pass to close the issue.

Overview

In October 2024, Anthropic published an account of how it measures Claude’s coding ability on SWE-bench Verified. The headline number, 49%, is easy to quote and easy to misread. What makes the write-up worth studying is the method underneath it: a stripped-down agent that hands almost all decision-making to the model, then checks its work against tests written by real maintainers.

For anyone building evaluation pipelines for AI-generated code, this is a useful reference point. It shows what a credible code-gen benchmark looks like, where the grading signal comes from, and how much compute a single correct answer can actually consume.

What SWE-bench Verified measures

SWE-bench draws its tasks from closed GitHub issues across popular open-source Python projects. Each task gives the model a repository snapshot and an issue description, then asks it to produce a patch that fixes the bug or implements the feature. The original benchmark contained noisy tasks, some unsolvable, some with broken setups. SWE-bench Verified is the cleaned-up answer: a 500-problem subset that human reviewers checked to confirm each task is actually solvable.

This matters because the score is only as honest as the task pool. A benchmark padded with impossible problems lowers the achievable ceiling and makes any score below it hard to interpret. Verifying solvability up front turns the percentage into something closer to a real pass rate.
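To make the ceiling effect concrete, here is a small sketch of the arithmetic. The function names and the 20-in-100 example are hypothetical, chosen only for illustration; the 500-task and 49% figures are the ones quoted above.

```python
def score_ceiling(total_tasks: int, unsolvable_tasks: int) -> float:
    """Best achievable pass rate when some tasks cannot be solved by anyone."""
    if total_tasks <= 0 or not 0 <= unsolvable_tasks <= total_tasks:
        raise ValueError("need total_tasks > 0 and 0 <= unsolvable_tasks <= total_tasks")
    return (total_tasks - unsolvable_tasks) / total_tasks


def resolved_count(total_tasks: int, pass_rate: float) -> int:
    """Number of tasks resolved at a given pass rate, rounded to a whole task."""
    return round(total_tasks * pass_rate)


# Hypothetical unverified pool: 20 of 100 tasks are unsolvable.
print(score_ceiling(100, 20))
# Figures from the write-up: 49% of the 500 verified tasks.
print(resolved_count(500, 0.49))
```

On a verified pool the ceiling is meant to be 100%, so a 49% score reads directly as the share of tasks resolved.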

The minimal agent scaffold

Anthropic’s stated design philosophy was to give as much control as possible to the model and keep the scaffolding minimal. In practice that meant two tools and a prompt.

The two tools

Bash tool: runs commands in a real shell, with access to Linux packages through apt and Python packages through pip. This is how the model explores the codebase, runs scripts, and executes tests.
Edit tool: a custom file editor supporting view, create, str_replace, insert, and undo_edit. It requires absolute paths and uses string-replacement matching so edits land precisely where intended.
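To illustrate why string-replacement matching lands edits precisely, here is a minimal sketch of a str_replace command. It is not Anthropic's implementation: the function signature and the rule that the old string must match exactly once are assumptions made for this example.

```python
from pathlib import Path


def str_replace(path: str, old: str, new: str) -> str:
    """Replace a single occurrence of `old` with `new` in the file at `path`.

    Assumption for this sketch: the edit is rejected unless `old` matches
    exactly once, so an ambiguous edit can never land in the wrong place.
    """
    file_path = Path(path)
    if not file_path.is_absolute():
        raise ValueError(f"path must be absolute: {path}")

    text = file_path.read_text()
    matches = text.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly one match, found {matches}")

    updated = text.replace(old, new, 1)
    file_path.write_text(updated)
    return updated
```

Refusing zero or multiple matches forces the caller to quote enough surrounding context to identify one location, which is the property that makes this style of edit predictable.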

The prompt

The prompt sketched a sensible path rather than a fixed pipeline: explore the repo to learn its structure, write a script that reproduces the bug, edit the source, rerun the script, and think about edge cases. The model chose how to move between those steps. There were no strict, discrete transitions, which let Claude backtrack, re-investigate, or skip ahead as the problem demanded.

The loop ran until the model declared itself finished or hit the 200k-token context limit. Successful runs often took hundreds of turns and consumed more than 100k tokens. That detail is easy to skim past and important to keep: every solved task behind the 49% pass rate carries a real compute cost.
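The two stopping conditions can be sketched as a plain loop. This is an illustrative outline, not Anthropic's harness: StepResult, run_agent, and the step callable are hypothetical names, and the per-turn token accounting is a simplification.

```python
from dataclasses import dataclass
from typing import Callable

CONTEXT_LIMIT_TOKENS = 200_000  # context limit quoted in the write-up


@dataclass
class StepResult:
    tokens_used: int  # tokens added to the context by this turn (simplified)
    finished: bool    # True when the model declares the task done


def run_agent(
    step: Callable[[int], StepResult],
    limit: int = CONTEXT_LIMIT_TOKENS,
) -> tuple[str, int, int]:
    """Run turns until the model finishes or the token limit is reached.

    Returns (stop_reason, turns_taken, total_tokens).
    """
    total_tokens = 0
    turns = 0
    while True:
        result = step(turns)  # one model turn, possibly including a tool call
        turns += 1
        total_tokens += result.tokens_used
        if result.finished:
            return "model_finished", turns, total_tokens
        if total_tokens >= limit:
            return "context_limit", turns, total_tokens
```

Note that nothing in the loop dictates which step comes next. The only control the scaffold keeps is when to stop.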

How the grading works

The grading signal is the strongest part of the design. A patch counts as correct only if it passes the real unit tests from the pull request that closed the original issue. No model-as-judge, no fuzzy similarity score, no human spot-check deciding whether the code “looks right.” The patch either makes the maintainer’s tests go green or it does not.

That is the same bar a human contributor had to clear. It also means the evaluation rewards working behavior, not plausible-looking code. Anthropic was candid that grading is not frictionless: environment setup issues and install patches being applied twice occasionally interfered with scoring. Honest accounting of that noise is part of what makes the report trustworthy.
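The pass/fail rule can be sketched as a single function over per-test results. This is a hypothetical illustration, not the official harness: the parameter names mirror the FAIL_TO_PASS and PASS_TO_PASS fields in the SWE-bench dataset, and treating a missing test result as a failure is an assumption of this sketch.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    results:      test id -> whether it passed with the patch applied
    fail_to_pass: tests the fix is supposed to turn green
    pass_to_pass: tests that were already green and must stay green
    """
    if not fail_to_pass:
        raise ValueError("a task needs at least one test that the fix must turn green")
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test with no recorded result counts as a failure (sketch assumption),
    # for example when the environment never managed to run it.
    return all(results.get(test_id, False) for test_id in required)
```

Because the result is a conjunction over every required test, a patch that fixes the bug but breaks existing behavior still fails. Environment noise of the kind Anthropic describes would show up here as missing or failed results.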

Why it matters for code-generation QA

The lesson for QA teams is not the 49% figure. It is the shape of the evaluation. Two ideas transfer directly to internal pipelines.

First, ground your pass/fail signal in executable tests the AI cannot see or game. The original repository’s unit tests are the closest thing to an objective oracle that code generation offers. If your eval grades generated code by asking another model whether it looks correct, you are measuring persuasiveness, not correctness. Run the code.

Second, keep the scaffold thin and the cost visible. A minimal agent makes results legible: when something fails, you can usually attribute it to the model rather than to a tangle of orchestration logic. The flip side is the compute bill. Hundreds of turns and six-figure token counts per task mean a serious eval suite has a real budget, and that budget should be planned, not discovered.
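A rough budget can be planned with a few lines of arithmetic. The function and the price of 3.0 USD per million tokens are hypothetical placeholders, not quoted rates; the 500 tasks and 100k tokens per task come from the figures above.

```python
def eval_token_budget(
    num_tasks: int,
    tokens_per_task: int,
    usd_per_million_tokens: float,
    runs_per_task: int = 1,
) -> tuple[int, float]:
    """Return (total_tokens, estimated_cost_usd) for one pass over an eval suite."""
    total_tokens = num_tasks * tokens_per_task * runs_per_task
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    return total_tokens, cost_usd


# 500 tasks at 100k tokens each, priced at a hypothetical 3.0 USD per million tokens.
tokens, cost = eval_token_budget(500, 100_000, 3.0)
print(tokens, cost)
```

Treat the result as a floor rather than a forecast. In a multi-turn loop the growing context may be re-sent on each turn, so billed tokens can run well above the final context size, and failed runs cost money too.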

The caveats are worth stating plainly. SWE-bench Verified is Python-only and drawn from open-source projects that may appear in training data, so it is a proxy for performance on your private codebase, not a guarantee. The tasks are bug fixes and small features, not large architectural changes. And a 49% pass rate means more than half of the issues still went unsolved.

As a single data point in a broader eval portfolio, this benchmark is excellent. As the only thing you trust before shipping AI-written code to production, it is not enough. Pair it with domain-specific tests, security review, and human sign-off on anything that touches sensitive paths.

Read the original: Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet — Anthropic, 2024-10-30.
