Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath - "The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis Recap
Podcast: "The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
Published: 2026-03-05
Duration: 1 hr 47 min
Summary
Dan Balsam and Tom McGrath of Goodfire discuss their groundbreaking work in interpretability and their new focus on intentional design, aiming to shape AI training processes to control model behavior and reduce errors such as hallucinations.
What Happened
In this episode, the host discusses the rapid progress made by Goodfire, a startup focused on mechanistic interpretability, in the just under two years since its founding. Dan Balsam and Tom McGrath share how they've built an impressive research team and raised a $150 million Series B at a $1.25 billion valuation. They emphasize the importance of their new research pillar, intentional design, which aims to complement existing interpretability methods by shaping the loss landscape to guide what models learn during training.
The conversation dives into the shift in interpretability techniques from sparse autoencoders to newer methods that explore the geometric structures within a model's latent space. They outline their initial proof of concept for reducing model hallucinations: a detection probe that not only steers the model at runtime but also serves as a reward signal for reinforcement learning. Tom and Dan address concerns about models evading detection, stressing the need for careful design and for working with backpropagation rather than fighting it.
As the episode wraps up, Balsam and McGrath reflect on their recent research efforts, including a collaboration with Prima Mente that uncovered unexpected factors in predicting Alzheimer's diagnoses. They discuss the balance between business growth and public benefit in their research dissemination, all while pondering the implications of their findings on future interpretability techniques.
Key Insights
- Goodfire's focus on intentional design seeks to reshape how models learn, particularly in controlling errors like hallucinations.
- The use of a hallucination detection probe on a frozen model during training can improve learning outcomes.
- Understanding the geometric structures of concepts in latent space is crucial for advancing interpretability techniques.
- Correcting model behavior has pitfalls, such as models learning to evade detection, so the way the loss landscape is shaped needs careful design.
Key Questions Answered
What is Goodfire's new research focus?
Goodfire has announced a new pillar in their research agenda called intentional design. This approach aims to expand the scope of interpretability science by complementing reverse engineering methods with an effort to understand and shape the loss landscape. This is intended to control what models learn during training and ultimately improve their generalization abilities.
How does Goodfire plan to reduce AI hallucinations?
Goodfire is working on a technique that uses a probe trained to detect hallucinations. This probe assists in steering the model at runtime and serves as a reward signal for additional reinforcement learning training, aiming to help the model learn to avoid hallucinations more effectively.
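To make the probe-as-reward idea concrete, here is a minimal sketch in plain Python. It is not Goodfire's implementation: the activations are synthetic, the "hallucination direction" is planted by hand, and every name (`make_activations`, `train_probe`, `reward`) is hypothetical. It only illustrates the general pattern of fitting a linear probe on hidden activations and then turning the probe's score into a scalar reward, computed here from a frozen model's activations as the key insights above mention.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_activations(n, direction, rng):
    """Synthetic stand-in for hidden activations (assumption, not real data).

    Label 1 means "hallucinated"; those samples are shifted along `direction`.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 2.0 if label else -2.0
        x = [rng.gauss(0.0, 1.0) + shift * d for d in direction]
        data.append((x, label))
    return data


def probe_score(w, b, x):
    # Probe's estimated probability that this activation is a hallucination.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(data, dim, lr=0.1, epochs=50):
    # Logistic-regression probe fitted with plain stochastic gradient descent.
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            g = probe_score(w, b, x) - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def reward(w, b, frozen_activation):
    # Higher reward when the probe sees less evidence of hallucination.
    return 1.0 - probe_score(w, b, frozen_activation)


rng = random.Random(0)
DIM = 8
direction = [1.0 / math.sqrt(DIM)] * DIM  # hand-planted unit vector (assumption)

train = make_activations(400, direction, rng)
held_out = make_activations(200, direction, rng)
w, b = train_probe(train, DIM)

accuracy = sum(
    (probe_score(w, b, x) > 0.5) == bool(y) for x, y in held_out
) / len(held_out)
clean = [reward(w, b, x) for x, y in held_out if y == 0]
halluc = [reward(w, b, x) for x, y in held_out if y == 1]
mean_clean = sum(clean) / len(clean)
mean_halluc = sum(halluc) / len(halluc)

print(f"probe accuracy: {accuracy:.2f}")
print(f"mean reward (clean): {mean_clean:.2f}, (hallucinated): {mean_halluc:.2f}")
```

In a real reinforcement learning loop, a reward like this would be one term fed to the policy update. The concern raised in the episode, that a model might learn to fool the monitor rather than fix its behavior, is exactly why where the probe reads its activations from matters.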
What challenges are associated with AI alignment research?
Dan and Tom acknowledge the inherent challenges in alignment research, including concerns that models might learn to trick their monitors rather than genuinely correct their behaviors. They emphasize that while intentional design techniques are still immature and not suitable for frontier models, the rapid advancement of AI capabilities necessitates exploring all possible paths to understanding and control.
What recent findings did Goodfire publish in collaboration with Prima Mente?
Goodfire's collaboration with Prima Mente showed that a state-of-the-art model for predicting Alzheimer's diagnoses was basing its predictions on the length of cell-free DNA fragments. This unexpected finding highlights the intricate and sometimes surprising factors that can drive a model's predictions.
How does Goodfire balance business growth with public benefit?
As they advance their research, Goodfire is mindful of the need to balance business growth with their public benefit mission. They consider the timing and nature of their research publications carefully, aiming to contribute positively to the field while also ensuring their business objectives are met.