[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency - Josh McGrath, OpenAI - Latent Space: The AI Engineer Podcast Recap
Podcast: Latent Space: The AI Engineer Podcast
Published: 2025-12-31
Duration: 28 min
Guests: Josh McGrath
Summary
Josh McGrath from OpenAI discusses the evolution from GPT-4.1 to 5.1, focusing on post-training advancements, particularly in reinforcement learning and token efficiency.
What Happened
Josh McGrath, a post-training researcher at OpenAI, dives into the developments from GPT-4.1 to GPT-5.1, highlighting the transition from non-thinking models to more sophisticated systems. He explains that while pre-training remains crucial, post-training offers a more exciting frontier with the potential for significant behavioral changes in AI models.
McGrath describes the complexities of reinforcement learning (RL) compared to pre-training, emphasizing the higher number of moving parts and infrastructure needed for RL tasks. He discusses the challenges of understanding unfamiliar code, whether it's internally developed or from external partners, and how tools like Codex have transformed his workflow by automating routine tasks.
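The RLVR in the episode title refers to reinforcement learning from verifiable rewards: rewards that are computed programmatically (an exact-match check, a passing test suite) rather than estimated by a human or a learned judge. The sketch below is purely illustrative; the function names and reward shapes are assumptions, not anything described as OpenAI's internal setup.

```python
# Minimal sketch of verifiable reward functions for RL post-training.
# All names (math_reward, code_reward) are hypothetical illustrations.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward: 1.0 only when the final answer matches exactly."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_source: str, tests: list) -> float:
    """Fractional reward: share of unit tests the generated code passes."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
    except Exception:
        return 0.0  # code that doesn't even run earns nothing
    passed = 0
    for test_fn in tests:
        try:
            test_fn(namespace)  # each test asserts on the namespace
            passed += 1
        except Exception:
            pass
    return passed / len(tests)
```

Because these rewards are cheap and unambiguous to compute, they sidestep the noisy-label problems of human preference data, which is part of why verifiable domains like math and code have been the first to benefit.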
The conversation touches on the release of a shopping model during the Black Friday period, highlighting new interaction paradigms, such as interruptibility, that allow users to refine their searches dynamically. McGrath explains why this was launched as a separate model and draws parallels with the deep research model and GPT-5's high reasoning capabilities.
He addresses community debates around the spiritual successor to the deep research model and the introduction of personality toggles, which give users more control over the AI's interaction style. McGrath himself prefers a tool-like interaction, placing him on the 'Anton' side of the 'Anton versus Clippy' divide.
The episode covers the shift from traditional optimization papers to those focusing on input data quality, with McGrath noting that reinforcement learning from human feedback (RLHF) offers a spectrum of signal quality that needs more exploration. He also discusses the ongoing evolution of context windows and token efficiency, which significantly impact the user experience.
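Token efficiency, as discussed above, is straightforward to instrument: how many output tokens a model spends per unit of useful work. A toy comparison might look like the following (the metric name and sample numbers are invented for illustration, not figures from the episode):

```python
# Toy token-efficiency comparison: output tokens spent per correct answer.
# The run data below is fabricated purely to illustrate the metric.

def tokens_per_correct(total_tokens: int, correct: int) -> float:
    """Average output tokens spent per correctly solved task."""
    if correct == 0:
        return float("inf")
    return total_tokens / correct

runs = {
    "model_a": {"total_tokens": 120_000, "correct": 60},
    "model_b": {"total_tokens": 90_000, "correct": 60},
}

for name, r in runs.items():
    cost = tokens_per_correct(r["total_tokens"], r["correct"])
    print(f"{name}: {cost:.0f} tokens per correct answer")
```

Two models with identical accuracy can differ sharply on this axis, which is why a more token-efficient model feels faster and cheaper to users even when benchmark scores are tied.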
Josh McGrath reflects on the long-term future of AI, expressing uncertainty about the ultimate limits of context windows and token usage. He emphasizes the importance of co-designing systems and models and the need for educational systems to produce engineers skilled in both distributed systems and machine learning.
Finally, McGrath highlights the ongoing balance between pre-training and post-training investments, noting that while pre-training is not dead, the industry is witnessing a shift in computational resources towards post-training as both areas evolve.
Key Insights
- Post-training in AI models, particularly through reinforcement learning, is becoming a major focus due to its potential for significant behavioral changes, requiring more complex infrastructure compared to pre-training.
- The introduction of personality toggles in AI models allows users to customize interaction styles, reflecting a growing trend towards user-controlled AI experiences.
- Reinforcement learning from human feedback (RLHF) presents a range of signal quality that is not yet fully understood, suggesting a need for further exploration in optimizing input data quality.
- The AI industry is reallocating computational resources towards post-training as both pre-training and post-training continue to evolve, indicating a shift in investment priorities.