MLE-bench: How OpenAI Grades AI Agents on Real Kaggle ML

  • OpenAI’s MLE-bench tests AI agents on 75 real Kaggle competitions, grading end-to-end ML engineering work against the same leaderboards human data scientists competed on.
  • Agents earn bronze, silver, or gold based on actual medal cutoffs; the best configuration, o1-preview with the AIDE scaffold, earned at least bronze in 16.9% of competitions.
  • The benchmark also measures how performance scales with compute and attempts, and audits for pre-training contamination, two factors most code-gen evals ignore.

Overview

MLE-bench is OpenAI’s attempt to score AI agents the way you would score a junior machine learning engineer: by dropping them into a real competition and seeing whether the artifact they produce would have won a medal. Released in October 2024 by a team including Lilian Weng, Aleksander Madry, and Tejal Patwardhan, it pulls 75 competitions from Kaggle and asks an agent to do the whole job, from reading the data to submitting a trained model’s predictions.

That framing matters because most code-generation benchmarks stop at “does this function pass its unit tests.” MLE-bench grades the outcome of a multi-step engineering process against humans who spent weeks on the same problem.

How the benchmark is built

Each of the 75 tasks is a self-contained Kaggle competition. The agent receives the competition description, the training data and unlabeled test inputs, and a local tool that checks whether a submission is validly formatted; the score itself is computed afterward by the benchmark's grading code. The agent has to prepare the data, choose and train a model, run experiments, and write out a submission file in the exact format the competition expects. No human picks the model architecture or tunes the hyperparameters; the agent owns the full pipeline.
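The "exact format" requirement is easy to underestimate, so here is a minimal sketch of what a format check could look like. It is not MLE-bench's own validator: the function name `check_submission`, the `id` column, and the idea of comparing against a sample submission file are all assumptions made for illustration.

```python
import csv
import io


def _read(text):
    """Parse CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def check_submission(submission_text, sample_text, id_column="id"):
    """Return a list of format problems; an empty list means the file looks valid.

    Hypothetical helper: compares a submission against a sample submission
    that carries the expected header and the expected set of row ids.
    """
    sub_cols, sub_rows = _read(submission_text)
    ref_cols, ref_rows = _read(sample_text)

    if id_column not in ref_cols:
        return [f"sample file has no '{id_column}' column"]
    if sub_cols != ref_cols:
        # Without matching columns the row checks below are meaningless.
        return [f"columns {sub_cols} do not match expected {ref_cols}"]

    problems = []
    sub_ids = [row[id_column] for row in sub_rows]
    ref_ids = {row[id_column] for row in ref_rows}

    if len(sub_ids) != len(set(sub_ids)):
        problems.append("duplicate ids in submission")
    missing = ref_ids - set(sub_ids)
    if missing:
        problems.append(f"{len(missing)} expected ids are missing")
    extra = set(sub_ids) - ref_ids
    if extra:
        problems.append(f"{len(extra)} unexpected ids are present")
    return problems


sample = "id,target\n1,0\n2,0\n3,0\n"
print(check_submission("id,target\n1,0.2\n2,0.9\n", sample))
```

A check like this only says whether the file can be graded, not how good the predictions are, which is why a format pass is a much weaker signal than a medal.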

The competitions were curated to be diverse, spanning tabular prediction, image classification, natural language, and signal data. They were also chosen to be hard enough that solving them requires genuine experimentation rather than a one-line library call.

Grading against human leaderboards

This is the design decision that gives MLE-bench its teeth. Instead of an absolute accuracy threshold, each submission is scored on the competition’s own metric and placed on the historical Kaggle leaderboard. The agent then receives a medal (bronze, silver, or gold) using the same cutoffs Kaggle applied to real competitors.

So a passing grade is not “the code ran.” It is “this result would have beaten a meaningful fraction of the humans who entered.” The headline metric is the percentage of competitions where the agent earns at least a bronze medal.
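The scoring logic described above can be sketched in a few lines. This is an illustrative sketch, not the benchmark's grading code: it assumes the per-competition cutoff scores have already been read off the leaderboard, and the names `medal_for` and `any_medal_rate` as well as the example numbers are hypothetical.

```python
from typing import Dict, Optional, Sequence

# Checked from best to worst so the highest medal earned is returned.
MEDAL_ORDER = ("gold", "silver", "bronze")


def medal_for(score: float,
              cutoffs: Dict[str, float],
              higher_is_better: bool = True) -> Optional[str]:
    """Return the best medal whose cutoff score the submission meets, or None.

    `cutoffs` maps each medal name to the leaderboard score needed to earn it.
    Some metrics (e.g. error rates) are better when lower, hence the flag.
    """
    for medal in MEDAL_ORDER:
        cutoff = cutoffs[medal]
        meets = score >= cutoff if higher_is_better else score <= cutoff
        if meets:
            return medal
    return None


def any_medal_rate(medals: Sequence[Optional[str]]) -> float:
    """Fraction of competitions where the agent earned at least bronze."""
    if not medals:
        raise ValueError("need at least one competition result")
    return sum(m is not None for m in medals) / len(medals)


# Made-up results for three competitions: (agent score, cutoffs, direction).
results = [
    (0.91, {"gold": 0.95, "silver": 0.92, "bronze": 0.90}, True),
    (0.35, {"gold": 0.20, "silver": 0.25, "bronze": 0.30}, False),
    (0.22, {"gold": 0.20, "silver": 0.25, "bronze": 0.30}, False),
]
medals = [medal_for(score, cutoffs, hib) for score, cutoffs, hib in results]
print(medals, any_medal_rate(medals))
```

Keeping the cutoffs as inputs matters because each competition has its own metric, its own direction, and its own leaderboard, so no single threshold applies across tasks.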

What the results showed

The strongest configuration paired OpenAI’s o1-preview model with AIDE, an open-source agent scaffold that structures the model’s work into an iterative experiment loop. That setup reached at least a bronze medal in 16.9% of the 75 competitions. The team tested several frontier models across multiple open-source scaffolds, and the scaffold choice mattered: the same model performs differently depending on how its actions are orchestrated.

Two secondary experiments are the most useful part of the paper for anyone building evals.

  • Resource scaling. Giving an agent more attempts or more compute raised its medal rate. Performance is not a fixed property of the model; it bends with the budget you hand it, which means any single-number score hides a curve.
  • Contamination. Because these are public Kaggle competitions, solutions and discussion threads almost certainly sat in the models’ pre-training data. The authors probed how much that inflates results, a check that separates real engineering ability from memorized answers.
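To make the "score hides a curve" point concrete, one common way to report performance as a function of attempts is the pass@k estimator used in code-generation evals: given n independent attempts of which c succeeded, it estimates the chance that at least one of k randomly chosen attempts succeeds. Whether MLE-bench computes its attempt-scaling numbers exactly this way is an assumption here; treat the sketch as a generic tool, with "success" meaning "earned any medal".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds.

    n: total attempts run on the task
    c: attempts that succeeded (here: earned at least a bronze medal)
    k: attempt budget being evaluated, with 1 <= k <= n

    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n attempts contains only failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 attempts, 2 of which earned a medal.
curve = {k: pass_at_k(10, 2, k) for k in (1, 2, 5, 10)}
print(curve)
```

Averaging this value over all competitions for each k gives the curve; reporting only k = 1 or only the largest k would each tell a different story about the same agent.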

Why it matters for code-generation QA

If you evaluate AI-generated code with pass/fail unit tests, you are measuring a narrow slice of what an agent does on a real task. MLE-bench is a reminder that the interesting failures happen across many steps: misreading the data schema, picking a model that cannot fit the problem, producing a submission in the wrong format, or quietly overfitting. A test suite that only checks the final function signature would miss all of those.

The leaderboard-relative scoring is worth copying. Grading generated code against what skilled humans actually produced, rather than against a synthetic threshold, gives a far more honest read on whether the output is shippable. A 16.9% bronze rate also sets expectations: today’s best agents handle a minority of genuinely hard engineering tasks unaided, and they need a scaffold and multiple attempts to get there.

Two caveats temper the optimism. The contamination analysis is a reminder that high scores on public problems can reflect recall rather than reasoning, so a benchmark built entirely from publicly available tasks risks measuring memory more and more as newer models train on them. And the resource-scaling result means you cannot compare two agents fairly without fixing their compute budget first.

For teams running their own evals, the takeaways are concrete: score against real human baselines, hold the budget constant, and assume any public test set has leaked. MLE-bench fits a broader shift in the field, away from static one-shot code benchmarks and toward agentic, end-to-end evaluation of whether generated code actually solves the job.

Read the original: MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering — OpenAI, 2024-10-09.