Anthropic

Code Generation QA

How Container Limits Skew Anthropic Coding Evals

Anthropic found that container resource limits alone swing Terminal-Bench 2.0 scores by 6 points. Here is what that means for trusting AI coding eval leaderboards.

February 5, 2026 · 4 min read

Code Generation QA

How Anthropic Evaluates Claude on SWE-bench Verified

How Anthropic measured Claude 3.5 Sonnet at 49% on SWE-bench Verified using a minimal Bash and Edit agent, graded by real repo unit tests.

October 30, 2024 · 4 min read