World Models Are Here - But It's Still the GPT-2 Phase - The Data Exchange with Ben Lorica Recap

Podcast: The Data Exchange with Ben Lorica

Published: 2026-03-19

Duration: 44:26 (2,666 seconds)

Guests: Jeff Hopps

What Happened

World models mark a significant shift in AI: rather than generating text, they generate interactive simulations from visual data. Jeff Hopps, CTO at Odyssey, discusses the company's development of Odyssey 2 Pro, a world simulator trained on large-scale public video data. One contrast with language models is the sheer scale of raw data: the volume of video available exceeds what any single company could process in full.

World models are inherently temporal: they track sequences of visual observations and predict potential futures from them. Where language models use transformers to predict the next token, world models use them to predict future world states. Though still early in development, at a stage Hopps likens to GPT-2, they already suggest applications in gaming, retail, robotics, and live events.
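The next-token vs next-state contrast can be sketched as a toy autoregressive rollout. Nothing below is from Odyssey's codebase: the state representation, the linear dynamics, and the function names are all hypothetical stand-ins, chosen only to illustrate the loop a world model runs at inference time, feeding each predicted state back in as input.

```python
# Toy illustration of a world model's inference loop: instead of
# predicting the next token, the model predicts the next world state
# and feeds it back in, rolling out a possible future.
# The "learned" dynamics here are a hypothetical stand-in (a fixed
# linear update), not a trained network.

def predict_next_state(state):
    """Stand-in for a learned transition model: maps state -> next state."""
    x, v = state             # e.g. position and velocity of an object
    return (x + v, v * 0.9)  # hypothetical dynamics: move, then damp velocity

def rollout(initial_state, horizon):
    """Autoregressively roll the model forward `horizon` steps."""
    states = [initial_state]
    for _ in range(horizon):
        states.append(predict_next_state(states[-1]))
    return states

trajectory = rollout((0.0, 1.0), horizon=3)
print(trajectory)  # each entry is a predicted future state
```

In a real world model, `predict_next_state` would be a transformer conditioned on a history of visual observations (and possibly user actions), but the autoregressive structure of the rollout is the same.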

Current world models struggle to generate long, stable video streams; recent progress has extended stable generation from 15-30 seconds to 1-2 minutes. Even at these lengths, the models demand significant compute, often pushing GPUs to their maximum capacity. Hopps highlights that world models can improve robotics by boosting sample efficiency and reducing the amount of real-world data required.

While world models are still experimental, they have practical applications today, such as creating new interactive experiences and game concepts. Advances in LLM infrastructure could accelerate their development, though video models are unlikely to run on-device soon given their high computational demands. Hopps notes that even the smaller world models remain in the single-digit billions of parameters, reflecting how nascent the field is compared to language models.

The term 'world model' means different things in different fields; the canonical definition, which originates in robotics and reinforcement learning, is a model that learns how the world evolves over time. This distinguishes world models from spatial-intelligence models and generative video models. Odyssey's training recipe includes pre-training, mid-training, and post-training stages, borrowing methods from LLM development such as RLHF and GRPO.

No public world model has reached hundreds of billions of parameters, and the current bottleneck is data selection more than model production. Even so, world models could eventually integrate with foundation models like Gemini or ChatGPT, offering new capabilities. Hopps notes that Odyssey runs its operations on Kubernetes, trains with PyTorch on NVIDIA GPUs, and orchestrates data processing with Ray and Flight.

Key Insights