[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap - John Yang - Latent Space: The AI Engineer Podcast Recap
Podcast: Latent Space: The AI Engineer Podcast
Published: 2025-12-31
Duration: 18 min
Guests: John Yang
Summary
John Yang discusses the evolution of coding benchmarks and new approaches for evaluating AI's role in software engineering. He highlights the development of Code Clash and the importance of human-AI collaboration in future benchmarks.
What Happened
John Yang, co-creator of SWE-bench, talks about the trajectory of SWE-bench and its impact on coding benchmarks. He highlights the value of extensions like SWE-bench Pro, even though they are independent projects. Yang elaborates on the multilingual and multimodal extensions, noting that SWE-bench now supports nine languages across 40 repositories, including JavaScript, Rust, Java, C, and Ruby.
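For a concrete sense of what a SWE-bench task looks like, here is a minimal sketch that pulls one instance from the original dataset via the Hugging Face datasets library; the field names follow the published SWE-bench schema, and the multilingual extension is hosted separately under a different dataset id:

```python
# Minimal sketch: inspect one SWE-bench task instance. The dataset id
# "princeton-nlp/SWE-bench" and the field names below follow the published
# SWE-bench schema; the multilingual variant lives in a separate dataset.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
task = ds[0]

print(task["instance_id"])        # e.g. "astropy__astropy-12907"
print(task["repo"])               # GitHub repository the issue comes from
print(task["problem_statement"])  # issue text the model must resolve
print(task["FAIL_TO_PASS"])       # tests that must flip from fail to pass
```

Each instance pairs a real GitHub issue with the tests that gate a correct fix, which is what makes resolution rates automatically verifiable.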
Yang discusses the limitations of unit tests for verification and proposes Code Clash as a way to evaluate long-horizon development and interaction between codebases. He describes how Code Clash has language models participate in programming tournaments, maintaining and evolving their codebases while competing in various arenas.
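The tournament loop can be pictured with a short sketch. Everything below (the Agent and Arena classes, the llm_edit stub) is hypothetical scaffolding to illustrate the round structure Yang describes, not Code Clash's actual API:

```python
# Hypothetical sketch of a Code Clash-style tournament round. Each round
# has a development phase (every model edits its own codebase) followed by
# a competition phase (the arena pits the codebases against each other).
from dataclasses import dataclass


def llm_edit(source: str) -> str:
    """Placeholder for the language-model call that rewrites the program."""
    return source  # a real agent would return an edited codebase here


@dataclass
class Agent:
    name: str
    source: str = ""  # the evolving codebase this agent maintains
    score: int = 0


class Arena:
    """Stand-in for a competitive environment, e.g. a programmed game."""

    def run(self, sources: list[str]) -> list[int]:
        # A real arena would execute the programs head-to-head;
        # this stub awards every agent zero points.
        return [0] * len(sources)


def play_round(agents: list[Agent], arena: Arena) -> None:
    # Development phase: each agent evolves its codebase, informed (in the
    # real setting) by logs and outcomes from previous rounds.
    for agent in agents:
        agent.source = llm_edit(agent.source)
    # Competition phase: scores accumulate across rounds, so success
    # depends on long-horizon maintenance rather than a single patch.
    for agent, points in zip(agents, arena.run([a.source for a in agents])):
        agent.score += points
```

The contrast with SWE-bench-style evaluation is that scoring accumulates across rounds, rewarding maintainable, evolvable code over one-shot patches.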
Highlighting the role of multimodal and multilingual benchmarks, Yang stresses the importance of diversifying repositories beyond Django. He expresses interest in how future benchmarks will be curated, noting a trend toward benchmarks like Code Clash that focus on real-world utility and economic value.
Yang also speaks about his work with Ofir Press's group, mentioning notable projects like SWE-fficiency, which evaluates whether models can optimize code performance without changing its behavior, and SciCode, known for its human-expert evaluation.
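As described, a SWE-fficiency-style check couples two gates: the test suite must still pass (behavior preserved) and a target workload must get faster. The harness below is a hypothetical illustration of that criterion; the pytest invocation and wall-clock timing are assumptions, not the benchmark's actual machinery:

```python
# Hypothetical sketch of a behavior-preserving speedup check in the spirit
# of SWE-fficiency: a patch only counts if tests still pass AND a
# representative workload runs faster than the pre-patch baseline.
import subprocess
import time


def tests_pass(repo_dir: str) -> bool:
    # Behavior gate: the repository's own test suite must still pass.
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0


def time_workload(repo_dir: str, workload_script: str) -> float:
    # Performance gate: wall-clock time of a representative workload.
    start = time.perf_counter()
    subprocess.run(["python", workload_script], cwd=repo_dir, check=True)
    return time.perf_counter() - start


def patch_succeeds(repo_dir: str, workload_script: str,
                   baseline_seconds: float) -> bool:
    if not tests_pass(repo_dir):
        return False  # an optimization that changes behavior is worthless
    return time_workload(repo_dir, workload_script) < baseline_seconds
```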
The conversation touches on the challenges of creating realistic user simulators and the importance of user interaction data in improving AI systems. Yang points out the need for benchmarks that intentionally include impossible paths to challenge models and improve evaluation methods.
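One way to picture such a simulator: an evaluation-side "user" with a persona and, for deliberately impossible tasks, a rubric that rewards the agent for pushing back rather than inventing a fix. The sketch below is entirely hypothetical, including its crude keyword-matching judge, which a real benchmark would replace with a stronger grader:

```python
# Hypothetical sketch of a user simulator for interactive coding evals.
from dataclasses import dataclass


@dataclass
class SimulatedUser:
    persona: str      # e.g. "terse maintainer who answers one question at a time"
    request: str      # the task the simulated user wants done
    impossible: bool  # ground truth: no valid solution exists

    def judge(self, agent_reply: str) -> bool:
        if self.impossible:
            # Credit the agent for recognizing the dead end instead of
            # hallucinating a solution. (Keyword matching is a crude
            # placeholder for a real grading model.)
            reply = agent_reply.lower()
            return "cannot" in reply or "not possible" in reply
        return bool(agent_reply.strip())  # placeholder acceptance check


user = SimulatedUser(
    persona="terse maintainer",
    request="Sort this stream in O(1) memory and a single pass",
    impossible=True,
)
print(user.judge("This cannot be done in one pass with O(1) memory."))  # True
```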
Looking forward, Yang anticipates more SWE-bench-style benchmarks being developed and emphasizes the potential of Terminal-Bench for expanding coding environments. He shares his vision of a long-running SWE-agent that autonomously works on codebases, and the balance needed between human involvement and AI autonomy.
John Yang also reflects on the importance of interactivity in coding tasks and the potential for AI to assist with routine data processing. He emphasizes the need for benchmarks that facilitate human-AI collaboration and adapt to varying levels of human involvement.
Key Insights
- SWE-bench now supports nine programming languages across 40 repositories, including JavaScript, Rust, Java, C, and Ruby, enhancing its multilingual and multimodal coverage.
- Code Clash evaluates long-term software development by having language models participate in programming tournaments, where they maintain and evolve codebases in competitive arenas.
- SWE-fficiency is a project focused on optimizing code performance without altering its behavior, while SciCode is recognized for its human evaluation capabilities.
- Future coding benchmarks are expected to include impossible paths to challenge AI models, aiming to improve evaluation methods and facilitate human-AI collaboration.