The Race to Production-Grade Diffusion LLMs with Stefano Ermon

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) Podcast Recap

Duration: 1 hr 3 min

Guests: Stefano Ermon

Summary

Stefano Ermon discusses the development and advantages of diffusion language models over autoregressive models. The episode examines the capabilities, challenges, and future prospects of diffusion models in AI, highlighting their efficiency and scalability.

What Happened

Stefano Ermon, an associate professor at Stanford University and CEO of Inception, discusses the development of diffusion language models, which he argues scale better than autoregressive models. Ermon's lab has done pioneering work on diffusion models since 2019, applying them to images, video, and music, and is now extending them to text with Inception's Mercury 2 LLM.

Diffusion models are presented as more efficient and cost-effective at inference time than their autoregressive counterparts. They can generate multiple tokens in parallel and condition on context both to the left and to the right of a position, whereas autoregressive models condition only on the tokens that precede the one being generated.
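The episode does not walk through Mercury's decoding algorithm, but the parallel-generation idea can be illustrated with a toy sketch of masked-diffusion decoding: start from a fully masked sequence and, at each step, unmask a batch of positions at once. The names `toy_denoiser` and `diffusion_decode` are hypothetical, and the "model" here is a trivial lookup standing in for a learned network.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens):
    """Hypothetical stand-in for a learned denoiser: proposes a token for
    every masked position at once, in principle conditioning on context to
    both the left and right of each position."""
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: vocab[i % len(vocab)]
            for i, tok in enumerate(tokens) if tok == MASK}

def diffusion_decode(length, steps=4, seed=0):
    """Iteratively unmask batches of positions in parallel -- unlike
    autoregressive decoding, which commits one token per forward pass."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_denoiser(tokens)
        if not proposals:
            break
        # Unmask several positions per step (this is where the speedup
        # over one-token-at-a-time decoding comes from).
        k = max(1, len(proposals) // (steps - step))
        for i in rng.sample(sorted(proposals), k):
            tokens[i] = proposals[i]
    # Fill any positions still masked after the scheduled steps.
    for i, tok in enumerate(tokens):
        if tok == MASK:
            tokens[i] = toy_denoiser(tokens)[i]
    return tokens
```

Because every step commits a batch of tokens, the number of model calls is bounded by the step count rather than the sequence length, which is the intuition behind the latency claims discussed in the episode.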

The Mercury 2 model is highlighted as the first commercial-scale diffusion language model with reasoning capabilities. It is reported to be 5 to 10 times faster than comparable autoregressive models, making it particularly suited to latency-sensitive applications.

Challenges in developing diffusion models for text are discussed, such as the discrete nature of text and the complexity of decoding back into coherent sentences. Techniques from image diffusion research are being adapted to text, with ongoing research efforts at Stanford and collaborations with companies like Nvidia.
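One common way the image-diffusion recipe is adapted to discrete text, which the episode alludes to, is an absorbing "masking" forward process: instead of adding Gaussian noise to pixels, tokens are independently replaced by a mask symbol with a probability that grows with the noise level. The sketch below illustrates only that corruption step; the function name `corrupt` and the details are illustrative assumptions, not Inception's or Stanford's published formulation.

```python
import random

MASK = "<mask>"

def corrupt(tokens, t, seed=0):
    """Toy forward 'noising' for discrete text under an absorbing-mask
    process: each token is independently replaced by MASK with
    probability t in [0, 1]. Reversing this (filling masks back in) is
    the hard decoding problem discussed in the episode."""
    rng = random.Random(seed)
    return [MASK if rng.random() < t else tok for tok in tokens]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
# At t=0 nothing is masked; at t=1 everything is absorbed into MASK.
lightly_noised = corrupt(sentence, 0.3)
fully_noised = corrupt(sentence, 1.0)
```

Training then amounts to teaching a model to invert this corruption at every noise level, which is where the difficulty of decoding back into coherent sentences arises.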

Ermon explains how diffusion models allow for more control over outputs, which could be advantageous in applications requiring precise outcomes. The episode also touches on current limitations and the potential for diffusion models to unify various generative tasks, including images, video, and text.

The podcast notes the challenges of serving diffusion models in production environments, since existing serving engines are optimized for autoregressive models. Support is growing in open-source communities, but the tooling remains immature compared to what exists for traditional autoregressive models.

Stefano Ermon also mentions the broader research landscape, including efforts by companies such as Google, which announced a diffusion language model called Gemini Diffusion. Research into diffusion models is active worldwide, with interest in their application to areas like medical imaging and their potential to challenge autoregressive models at larger scales.
