World Models Are Here - But It's Still the GPT-2 Phase - The Data Exchange with Ben Lorica Recap
Podcast: The Data Exchange with Ben Lorica
Published: 2026-03-19
Duration: 2666 seconds (~44 minutes)
Guests: Jeff Hopps
What Happened
World models represent a significant shift in AI, focusing on generating interactive simulations from visual data. Jeff Hopps, CTO at Odyssey, discusses the company's development of Odyssey 2 Pro, a world simulator trained on large-scale public video data. This contrasts with language models, which are running into data constraints: the volume of public video available far exceeds what any single company could train on.
World models are inherently temporal: they track sequences of visual observations and predict plausible futures. Where language models predict the next token, world models use transformers to predict the next state. Though still early in their development, roughly where language models were at GPT-2, they already point to applications in gaming, retail, robotics, and live events.
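To make the analogy concrete, here is a minimal sketch of next-state prediction over encoded video frames. This is not Odyssey's architecture; every dimension and name below is illustrative, and it assumes frames have already been compressed into fixed-size latent vectors.

```python
import torch
import torch.nn as nn

class NextStateWorldModel(nn.Module):
    """Causal transformer over frame latents: the visual analogue of
    next-token prediction, regressing each step onto the next one."""

    def __init__(self, latent_dim=256, n_heads=8, n_layers=6, max_steps=64):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.zeros(1, max_steps, latent_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents):
        # latents: (batch, time, latent_dim), one vector per encoded frame
        t = latents.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.backbone(latents + self.pos_emb[:, :t], mask=causal_mask)
        return self.head(h)  # predicted latent for the following step

# Training step: each position is regressed onto the next frame's latent.
model = NextStateWorldModel()
frames = torch.randn(2, 16, 256)        # stand-in for encoded video clips
pred = model(frames[:, :-1])
loss = nn.functional.mse_loss(pred, frames[:, 1:])
loss.backward()
```

Real systems also condition on user or agent inputs, which is what makes the simulation interactive rather than a fixed video.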
Current world models struggle to generate long, stable video streams; recent progress has extended stable generation from 15-30 seconds to 1-2 minutes. Even these short streams demand significant computational resources, often pushing GPUs to their limits. Jeff Hopps highlights that world models can nevertheless improve robotics by enhancing sample efficiency and reducing the amount of real-world training data required.
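The stability problem shows up in the generation loop itself: each predicted frame is fed back in as context, so small errors compound over time. A naive rollout, under the same assumptions as the sketch above, looks like this; production systems add stabilization tricks this loop omits.

```python
import torch

@torch.no_grad()
def rollout(model, context, n_steps, window=64):
    """Autoregressively extend a latent video stream.

    The sliding window keeps per-step cost flat, but because every new
    frame is conditioned on previously generated ones, drift accumulates;
    that is why stable streams topped out around 15-30 seconds until
    recently."""
    frames = context                          # (batch, time, latent_dim)
    for _ in range(n_steps):
        recent = frames[:, -window:]          # bounded context
        next_latent = model(recent)[:, -1:]   # prediction at the last step
        frames = torch.cat([frames, next_latent], dim=1)
    return frames
```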
While world models are still experimental, they have practical applications today, such as creating new interactive experiences and game concepts. Advances in LLM infrastructure could accelerate their development, though video models are unlikely to run on-device soon given their computational demands. Jeff Hopps notes that even the smaller world models sit in the single-digit billions of parameters, reflecting how nascent the field is compared to language models.
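A back-of-envelope calculation shows why on-device deployment is out of reach for now. The numbers below are illustrative, not Odyssey's actual model sizes.

```python
# Memory footprint of a hypothetical single-digit-billion-parameter model.
params = 7e9                  # 7B parameters (illustrative)
bytes_per_param = 2           # fp16/bf16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: {weights_gb:.0f} GB")  # ~14 GB before activations

# A flagship phone has roughly 8-12 GB of RAM in total, so even a
# "small" world model exceeds the device before any frames are decoded.
```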
The term 'world model' varies across fields; the canonical definition is a model that learns how the world evolves. It originates in robotics and reinforcement learning, which distinguishes world models from spatial-intelligence models and generative video models. Odyssey's approach spans pre-training, mid-training, and post-training stages, borrowing methods from LLM development such as RLHF and GRPO.
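The episode does not detail Odyssey's post-training recipe, but GRPO's core idea transfers cleanly from LLMs: score several rollouts of the same prompt against one another, so no learned value model is needed. A minimal sketch of that group-relative advantage:

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages, the centerpiece of GRPO.

    rewards: (n_prompts, group_size) tensor, one scalar reward per
    sampled rollout. Each rollout is normalized against the other
    samples drawn from the same prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Four rollouts of one prompt: above-average samples get positive
# advantage and are reinforced, below-average ones are suppressed.
rewards = torch.tensor([[0.1, 0.9, 0.4, 0.6]])
print(grpo_advantages(rewards))
```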
No public world model has yet reached hundreds of billions of parameters, and the current bottleneck is selecting the right training data more than producing it. Even so, world models could eventually integrate with foundation models like Gemini or ChatGPT, adding new capabilities. Jeff Hopps notes that Odyssey runs its operations on Kubernetes, trains with PyTorch on NVIDIA GPUs, and orchestrates data processing with Ray and Flyte.
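For a flavor of what a Ray-based processing stage might look like, here is a minimal sketch. The bucket layout, function, and field names are hypothetical; the episode confirms only that Ray is part of the data-processing stack.

```python
import ray

ray.init()

# Hypothetical shard manifest; Odyssey's real pipeline layout isn't public.
shards = [{"path": f"s3://videos/shard-{i:05d}.mp4"} for i in range(4)]

def decode_and_featurize(row):
    # Placeholder for real work: decode the video, run the encoder that
    # produces frame latents, and record metadata for downstream training.
    row["n_frames"] = 0  # would hold the decoded frame count
    return row

# Ray Data fans the rows out across the cluster's workers.
ds = ray.data.from_items(shards).map(decode_and_featurize)
ds.write_parquet("/tmp/latents")
```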
Key Insights
- World models are in their early stages, akin to the GPT-2 phase for language models. They focus on generating continuous, interactive simulations using visual data, which presents unique challenges and opportunities compared to text-based models.
- Odyssey 2 Pro, developed by Odyssey, utilizes large-scale public video data for training, leveraging the abundant visual data available online. This approach contrasts with the data constraints faced by language models.
- Current limitations of world models include predicting only the near future and difficulty generating long, stable video streams. Recent advances have extended stable video generation from 15-30 seconds to 1-2 minutes.
- World models have potential applications in various industries, such as gaming, retail, and robotics, by offering new interactive experiences. They also promise to improve robotics by reducing the need for extensive training data and enhancing sample efficiency.