976: NVIDIA’s Nemotron 3 Super: The Perfect LLM for Multi-Agent Systems - Super Data Science: ML & AI Podcast with Jon Krohn Recap

Podcast: Super Data Science: ML & AI Podcast with Jon Krohn

Published: 2026-03-20

What Happened

NVIDIA's new Nemotron 3 Super model, discussed by Jon Krohn, is designed to support agentic AI systems capable of reasoning, using tools, and operating autonomously over extended workflows. The model uses a mixture-of-experts (MoE) architecture in which only 12 billion of its 120 billion parameters are active at any given time, offering the computational cost of a much smaller model while retaining the knowledge capacity of a larger one.
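To make the sparse-activation idea concrete, here is a toy sketch of MoE routing (purely illustrative, not NVIDIA's implementation; the expert count, top-k value, and dimensions are made up): a router scores every expert for each token, but only the top-k experts actually run, so per-token compute scales with the active parameters rather than the total.

```python
# Toy sketch of sparse mixture-of-experts (MoE) routing.
# NUM_EXPERTS, TOP_K, and D are hypothetical illustration values.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 10   # total experts (total parameters)
TOP_K = 1          # experts that actually run per token (active parameters)
D = 16             # toy hidden dimension

router_w = rng.normal(size=(D, NUM_EXPERTS))
experts = [rng.normal(size=(D, D)) for _ in range(NUM_EXPERTS)]

def moe_forward(x):
    """Route a token vector x to its top-k experts and mix their outputs."""
    scores = x @ router_w                      # one score per expert
    top = np.argsort(scores)[-TOP_K:]          # indices of the chosen experts
    w = np.exp(scores[top])
    w /= w.sum()                               # softmax over the chosen experts
    # Only the selected experts' weights are touched -- the sparse-compute win.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top)), top

token = rng.normal(size=D)
out, chosen = moe_forward(token)
print(f"active experts: {chosen.tolist()} of {NUM_EXPERTS}")
```

With TOP_K = 1 of 10 experts, only a tenth of the expert weights participate in each forward pass, mirroring (in spirit) the 12B-of-120B active-parameter ratio described above.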

Nemotron 3 Super integrates a hybrid architecture that interleaves traditional transformer attention layers with Mamba layers for efficient sequence processing. The Mamba layers run in linear time in the sequence length, which makes the model's 1-million-token context window practical, while the attention layers provide high-fidelity information retrieval. The combination lets the model balance efficiency with precision.
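The complexity trade-off can be sketched with two toy layers (simplified stand-ins, not the real architecture): an attention layer compares every token with every other token (quadratic in sequence length), while a Mamba-style state-space layer carries a running state in a single pass (linear in sequence length), which is what makes very long contexts tractable.

```python
# Toy contrast of the two layer types in a hybrid stack (illustrative only).
import numpy as np

def attention_layer(x):
    """Quadratic-time: builds a full (n, n) token-to-token weight matrix."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # O(n^2) work and memory
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ x

def ssm_scan_layer(x, decay=0.9):
    """Linear-time: one recurrent pass with an exponentially decaying state,
    loosely in the spirit of a state-space (Mamba-style) scan."""
    state = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, token in enumerate(x):                   # single pass -- O(n)
        state = decay * state + (1 - decay) * token
        out[t] = state
    return out

x = np.random.default_rng(1).normal(size=(8, 4))    # 8 tokens, dim 4
h = ssm_scan_layer(attention_layer(x))              # hybrid: alternate layer types
print(h.shape)
```

At 8 tokens the difference is invisible, but at a million tokens the O(n²) attention matrix is the bottleneck the linear-time layers avoid.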

The model also introduces a new technique called Latent Mixture of Experts, in which tokens are compressed into a smaller latent space before being routed to experts, reducing the computational cost of each expert. This allows more experts to be activated simultaneously, improving accuracy by bringing more specialized knowledge to each prediction.
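A toy sketch of that idea (assumed mechanics; the details of NVIDIA's technique may differ, and all dimensions here are invented): tokens are projected down to a small latent space, experts operate in that space, and the result is projected back up. Because each expert is latent-sized, activating more of them costs less.

```python
# Toy sketch of a "latent MoE": compress, route, expand (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
D, LATENT = 32, 8                  # model dim vs. compressed latent dim (4x smaller)
NUM_EXPERTS, TOP_K = 8, 4          # cheaper experts -> more can be active at once

down = rng.normal(size=(D, LATENT)) / np.sqrt(D)       # compression projection
up = rng.normal(size=(LATENT, D)) / np.sqrt(LATENT)    # decompression projection
router_w = rng.normal(size=(LATENT, NUM_EXPERTS))
experts = [rng.normal(size=(LATENT, LATENT)) for _ in range(NUM_EXPERTS)]

def latent_moe(x):
    z = x @ down                                       # D -> LATENT before routing
    scores = z @ router_w
    top = np.argsort(scores)[-TOP_K:]
    w = np.exp(scores[top])
    w /= w.sum()
    z_out = sum(wi * (z @ experts[i]) for wi, i in zip(w, top))
    return z_out @ up                                  # back to model dimension

y = latent_moe(rng.normal(size=D))
print(y.shape)
```

Each expert here is an 8x8 matrix instead of 32x32, so running four of them still costs less than one full-width expert would.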

Nemotron 3 Super's multi-token prediction (MTP) capability lets it predict several future tokens at once, speeding up tasks like code generation by up to three times without requiring a separate draft model. The result is a significant throughput improvement: up to 2.2 times higher throughput than GPT-OSS 120B on benchmark tasks.
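A toy simulation of the speculative-style decoding that MTP enables (assumed mechanics; `drafts` and `verify` are hypothetical stand-ins for the model's draft head and verification pass): the model drafts a few future tokens per step, and the longest prefix the verifier agrees with is accepted, so multiple tokens land per decode step.

```python
# Toy simulation of multi-token-prediction decoding (illustrative only).
# The "model" here is a fixed target string; real systems verify drafts
# against the model's own next-token distribution.
target = list("the quick brown fox")

def drafts(prefix):
    """Hypothetical MTP head: propose the next 3 tokens, occasionally wrong."""
    i = len(prefix)
    guess = target[i:i + 3]
    if i % 5 == 0 and guess:                 # inject an occasional bad draft
        guess = guess[:1] + ["?"] * (len(guess) - 1)
    return guess

def verify(prefix, tok):
    """Would the model itself have emitted this token next?"""
    return target[len(prefix)] == tok

def decode(max_len):
    out, steps = [], 0
    while len(out) < max_len:
        steps += 1                           # one decode step, many tokens
        accepted_any = False
        for tok in drafts(out):
            if verify(out, tok):
                out.append(tok)
                accepted_any = True
            else:
                break                        # reject the rest of the draft
        if not accepted_any:                 # fall back: verifier supplies a token
            out.append(target[len(out)])
    return out, steps

out, steps = decode(len(target))
print(f"{len(out)} tokens in {steps} decode steps")
```

Here 19 tokens arrive in well under 19 decode steps, because most steps accept two or three drafted tokens at once; that gap is where the claimed speedups come from.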

The model was natively pre-trained in NVIDIA's 4-bit NVFP4 precision, which accelerates inference on Blackwell GPUs by up to four times relative to the previous-generation Hopper GPUs, without sacrificing accuracy. These efficiencies are crucial for overcoming context explosion and the "thinking tax" in multi-agent systems, where maintaining workflow state and reasoning at each step is vital.
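A rough sketch of 4-bit floating-point (FP4 E2M1) quantization with a per-block scale, in the spirit of NVFP4 (the real format's details, such as its FP8 block-scale encoding, are more involved): each value is stored as one of 16 codes, and a shared per-block scale preserves dynamic range.

```python
# Rough sketch of blockwise FP4 (E2M1) quantize/dequantize (illustrative only).
import numpy as np

# The 8 non-negative magnitudes representable in E2M1 (sign bit stored separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x):
    """Quantize one block of values to FP4 codes plus a single shared scale."""
    scale = max(np.abs(x).max() / 6.0, 1e-12)      # map the block max to FP4 max
    idx = np.abs(np.abs(x)[:, None] / scale - E2M1).argmin(axis=1)
    return np.sign(x), idx, scale                  # 4-bit codes + one scale

def dequantize_block(sign, idx, scale):
    return sign * E2M1[idx] * scale

x = np.random.default_rng(3).normal(size=16)       # one 16-value block
s, i, sc = quantize_block(x)
x_hat = dequantize_block(s, i, sc)
print(f"max abs error: {np.abs(x - x_hat).max():.3f}")
```

Storing 4-bit codes instead of 16-bit weights quarters memory traffic, which is where much of the inference speedup on FP4-capable hardware comes from.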

Nemotron 3 Super has achieved top rankings on several leaderboards, demonstrating its capabilities in multi-step research tasks. NVIDIA is also releasing the model with open weights and additional resources, such as pre- and post-training datasets and reinforcement learning environments, to facilitate its adoption and development. Companies like Perplexity and Siemens are already integrating the model into their systems for various applications.

Key Insights