Code Generation QA
How Container Limits Skew Anthropic Coding Evals
Anthropic found that container resource limits alone swing Terminal-Bench 2.0 scores by 6 points. Here is what that means for trusting AI coding eval leaderboards.
How Anthropic measured Claude 3.5 Sonnet at 49% on SWE-bench Verified using a minimal Bash and Edit agent, graded by real repo unit tests.