OpenAI | Gen.QA

Code Generation QA

Inside OpenAI’s Case for Retiring SWE-bench Verified

OpenAI dropped SWE-bench Verified after finding that 59.4% of audited o3 failures involved flawed tests and that frontier models could recite gold patches verbatim.

February 23, 2026 4 min read

How OpenAI’s o3 Hit Gold-Medal Coding at IOI 2024

How OpenAI evaluated o1, o1-ioi, and o3 on IOI 2024 and Codeforces with formal test-case grading, and why o3 reached gold-medal coding without custom tooling.

February 3, 2025 4 min read

MLE-bench: How OpenAI Grades AI Agents on Real Kaggle ML

OpenAI's MLE-bench scores AI agents on 75 real Kaggle competitions against human medal thresholds. How it works and what it means for code-gen QA.

October 9, 2024 4 min read

SWE-bench Verified: OpenAI’s Cleaned-Up Coding Benchmark

How OpenAI used 93 developers to build SWE-bench Verified, a 500-task coding benchmark that filters out the underspecified issues and flawed tests that skewed AI agent scores on the original SWE-bench.

August 13, 2024 4 min read

How OpenAI’s HumanEval and pass@k Score AI Code

OpenAI's Codex paper introduced HumanEval, 164 Python problems scored by pass@k functional correctness against unit tests. Here is how it works and why it matters.

July 7, 2021 4 min read