Inside OpenAI’s Case for Retiring SWE-bench Verified
OpenAI dropped SWE-bench Verified after finding that 59.4% of audited o3 failures had flawed tests and that frontier models could recite gold patches verbatim.
How OpenAI evaluated o1, o1-ioi, and o3 on IOI 2024 and Codeforces with formal test-case grading, and why o3 reached gold-medal coding without custom tooling.
OpenAI's MLE-bench scores AI agents on 75 real Kaggle competitions against human medal thresholds. How it works and what it means for code-gen QA.
How OpenAI used 93 developers to build SWE-bench Verified, a 500-task coding benchmark that screens out the broken tests that were understating AI agent scores.
OpenAI's Codex paper introduced HumanEval, 164 Python problems scored by pass@k functional correctness against unit tests. Here is how it works and why it matters.