⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data - Latent Space: The AI Engineer Podcast Recap
Podcast: Latent Space: The AI Engineer Podcast
Published: 2026-02-23
Duration: 26 min
Summary
In this episode, Mia Glaese and Olivia Watkins discuss the limitations of the SWE-Bench Verified coding benchmark, arguing that it has become saturated and contaminated and no longer accurately measures improvements in coding performance. They advocate a shift towards newer benchmarks like Superbench Pro.
What Happened
The episode kicks off with Mia Glaese, VP of Research at OpenAI, and Olivia Watkins from the Frontier Evals team discussing their work on coding benchmarks, particularly SWE-Bench Verified. They note that although SWE-Bench Verified was once a leading standard for measuring coding progress, it has recently stopped being a useful signal. Mia explains that saturation and contamination have made the benchmark unreliable for evaluating improvements in coding performance, which is why they see a need to transition to new benchmarks that better reflect current capabilities in the field.
Mia and Olivia then dive into the origins of SWE-Bench Verified, noting that it grew out of SWE-Bench, an academic benchmark from a Princeton lab. The benchmark draws real-world coding tasks from GitHub issues: an agent is given an issue, must produce a fix, and is graded on whether its solution passes specific tests. As they analyzed the benchmark over time, however, it became clear that many failures were due to poorly specified problems rather than deficiencies in the models. To address this, they undertook a significant human data campaign, enlisting nearly 100 software engineers to review and curate tasks, which produced the more robust Verified set of coding challenges.
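To make that setup concrete, here is a minimal sketch of the grading loop the description implies. The helper names, pytest invocation, and test-list arguments are illustrative assumptions, not the actual SWE-Bench harness:

```python
import subprocess

def grade_submission(repo_dir: str, model_patch: str,
                     fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply a model-generated patch and check the task's pass/fail criteria.

    fail_to_pass: tests that fail before the fix and must pass afterwards.
    pass_to_pass: tests that already pass and must keep passing (no regressions).
    """
    # Apply the candidate patch to a clean checkout of the repository.
    subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                   cwd=repo_dir, check=True)

    def tests_pass(test_ids: list[str]) -> bool:
        # Run only the named tests; a non-zero exit code means at least one failed.
        result = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo_dir)
        return result.returncode == 0

    # The task counts as solved only if the fix works and nothing else breaks.
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```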
The conversation also touches on the challenges of ensuring fairness in coding tests, particularly how overly narrow tests can misrepresent a model's capabilities. They discuss how contamination from open-source repositories impacts the integrity of benchmarks and emphasize the need for careful scrutiny and better problem setups. The pair concludes by encouraging the community to explore newer benchmarks that can provide a clearer picture of coding advancements, moving away from outdated metrics.
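A toy illustration of the "overly narrow test" problem, using a hypothetical function and tests rather than any real SWE-Bench task: the model's fix below raises the right exception, but the narrow test rejects it simply because the error message is worded differently.

```python
import pytest

def divide(a: float, b: float) -> float:
    # A model's fix: reject division by zero with a clear error.
    if b == 0:
        raise ValueError("denominator must be nonzero")
    return a / b

def test_narrow():
    # Overly narrow: fails for the fix above because the wording differs,
    # even though the behavior is correct.
    with pytest.raises(ValueError, match="division by zero is not allowed"):
        divide(1, 0)

def test_fair():
    # Fairer: checks the behavior (a ValueError is raised), not the exact message.
    with pytest.raises(ValueError):
        divide(1, 0)
```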
Key Insights
- SWE-Bench Verified has stalled as a benchmark due to saturation and contamination.
- The original SWE-Bench contained poorly specified problem setups that misrepresented model capabilities, which is what the Verified subset set out to fix.
- A significant human data campaign was necessary to curate better coding tasks for evaluation.
- Industry-wide contamination from open-source repositories affects the reliability of coding benchmarks.
Key Questions Answered
What is SWE-Bench Verified?
SWE-Bench Verified is a human-curated version of SWE-Bench, a coding benchmark that originated as an academic project at Princeton. It evaluates coding performance by having agents resolve real-world coding problems sourced from GitHub issues, grading each attempt on whether the resulting fix passes the repository's tests. It aimed to provide a more realistic assessment of coding skills than previous benchmarks.
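For readers who want to inspect the tasks themselves, the Verified split is published as a public dataset. A minimal sketch of loading it with the Hugging Face `datasets` library, assuming the dataset ID `princeton-nlp/SWE-bench_Verified` and the field names shown below:

```python
from datasets import load_dataset

# SWE-Bench Verified: the human-reviewed subset of SWE-Bench tasks.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["repo"])               # source repository the task was drawn from
print(example["problem_statement"])  # the GitHub issue text the agent must resolve
print(example["FAIL_TO_PASS"])       # tests a correct fix must turn from failing to passing
```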
Why has SWE-Bench Verified become ineffective?
Mia and Olivia explain that SWE-Bench Verified has become ineffective due to saturation and contamination: its scores no longer accurately reflect improvements in coding performance. Because the benchmark has stalled as a measure of progress, they argue the field needs to shift to more reliable alternatives.
What was the human data campaign conducted for SWE-Bench Verified?
The human data campaign involved hiring nearly 100 professional software engineers to review the benchmark's problems and ensure they were well specified and fair. This effort aimed to create a curated set of tasks that could provide a more accurate assessment of model capabilities.
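The episode doesn't detail the annotation rubric, but the curation step it describes amounts to filtering tasks on reviewer judgments. A hypothetical sketch of that filtering, where the rubric fields and the keep-only-clean-tasks threshold are assumptions rather than the actual OpenAI annotation schema:

```python
from dataclasses import dataclass

@dataclass
class TaskReview:
    instance_id: str
    underspecified: int    # 0 = clearly specified ... 3 = unsolvable as written
    tests_too_narrow: int  # 0 = tests accept any valid fix ... 3 = tests reject valid fixes

def filter_verified(reviews: list[TaskReview]) -> list[str]:
    """Keep only tasks that reviewers judged well specified and fairly tested."""
    return [r.instance_id for r in reviews
            if r.underspecified == 0 and r.tests_too_narrow == 0]
```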
What are the implications of contamination in coding benchmarks?
Contamination arises when benchmark tasks are sourced from popular open-source repositories, so a model may have seen the issues, fixes, or tests during training, which unfairly inflates its measured performance. This contamination can skew results and misrepresent the abilities of the models being tested.
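One rough heuristic for spotting this, sketched below: compare when each task's underlying issue was created against a model's training-data cutoff. The `created_at` field, the cutoff date, and the example task are illustrative assumptions:

```python
from datetime import datetime, timezone

# Illustrative training-data cutoff for a hypothetical model.
TRAINING_CUTOFF = datetime(2023, 10, 1, tzinfo=timezone.utc)

def possibly_contaminated(task: dict) -> bool:
    """Flag a task whose issue (and fix) predate the model's training cutoff,
    since the repository, patch, and tests could appear in the training corpus."""
    created = datetime.fromisoformat(task["created_at"].replace("Z", "+00:00"))
    return created < TRAINING_CUTOFF

tasks = [{"instance_id": "example__repo-1234", "created_at": "2022-05-17T09:30:00Z"}]
print([t["instance_id"] for t in tasks if possibly_contaminated(t)])
```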
What are newer benchmarks suggested by OpenAI?
OpenAI suggests moving towards newer benchmarks like Superbench Pro, which they believe can better reflect current advancements in coding capabilities. These benchmarks are expected to provide a more accurate measure of progress in the field, addressing the limitations of outdated metrics.