The Future of Speech-to-Speech AI: Inside Gradium and Kyutai's Approach to Full Duplex Conversation

Feb 15, 2026

Conversations in Conversational AI - Latest Episode

What if voice AI could interrupt you the moment it figured out your question—not because it's rude, but because it's actually that natural? This week on Conversations in Conversational AI, we sat down with Neil Zeghidour, CEO and co-founder of Gradium and co-founder of Kyutai, to explore the cutting edge of speech-to-speech models and why the future of voice AI might look very different from today's cascaded systems.

The Nonprofit-to-Startup Model: Kyutai and Gradium

Neil's journey started with a deliberate choice. After working on generative models at Google and Meta, he and his co-founders faced a decision: build a startup or create a nonprofit research lab?

They chose the nonprofit route first, launching Kyutai in late 2023 as an open science research lab. The reasoning? Breakthrough innovation happens in research environments where teams can take risks and rethink everything from first principles.

"Only a researcher and a research team can take enough risks to rethink everything from first principles," Neil explains. "You can even look at LLMs. It was invented at Google, but it was invented at Google Brain and not in Google product divisions."

After two years of open research, publications, and millions of model downloads monthly, market traction around real-time voice led them to create Gradium—the commercial arm building production-ready products. Today, both organizations operate from the same offices, with Kyutai as Gradium's shareholder, creating a unique structure that balances fundamental research with applied development.

Moshi: The Full Duplex Breakthrough

Kyutai's first major project was Moshi, a conversational AI model that fundamentally reimagined how voice agents work. Unlike traditional systems with strict turn-taking, Moshi introduced full duplex conversation—both the AI and user can speak simultaneously, with no voice activity detection managing speaker turns.
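
To make "full duplex" concrete, here is a toy sketch of the idea (a simplification for illustration, not Moshi's actual architecture; the model object and its predict method are hypothetical): the user's audio and the agent's audio are two token streams advancing on the same clock, and at every frame the model decides whether to stay silent or speak, conditioned on both histories.

```python
def duplex_step(model, user_tokens, agent_tokens, silence_token=0):
    """One frame of a full-duplex loop (hypothetical interfaces, for illustration).

    user_tokens / agent_tokens: the two parallel token streams so far, one entry
    per audio frame, always the same length. There is no voice activity detector
    deciding whose turn it is: the model simply predicts its own next frame
    given both histories.
    """
    next_token = model.predict(user_history=user_tokens,
                               agent_history=agent_tokens)
    agent_tokens.append(next_token)
    # Emitting the silence token means "keep listening"; anything else means the
    # agent is speaking, possibly while the user is still mid-sentence, which is
    # how the perceived latency can even become negative.
    return next_token != silence_token
```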

"The latency was sometimes negative," Neil notes, "so if the model would figure out the end of your question, it would start answering before you're done, which was amazing and frustrating for some people because they felt it was a bit impolite."

The scale makes it even more impressive: the team built Moshi in just six months with four to six people, roughly 10-20x smaller than similar teams at big tech companies, and around 1,000 GPUs. What made this possible?

The secret was their technical approach: audio language models instead of diffusion models.

Audio Language Models: A Different Architecture

While most audio generation uses diffusion models (treating audio spectrograms like images), Neil's team pioneered audio language models (ALMs) over the past five years, starting with work at Google.

The foundation was SoundStream, a neural codec initially developed for video conferencing that compressed audio incredibly efficiently. The breakthrough came when they realized these compressed audio codes behaved like text tokens—enabling them to train transformer models that predict audio tokens instead of text tokens.
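
To make the token analogy concrete, here is a minimal sketch of the audio-language-model recipe (illustrative only: the fake codec, the AudioLM class, and the codebook size and frame rate are assumptions for this example, not SoundStream's or Kyutai's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 1024   # assumed codec vocabulary size
FRAME_RATE = 50        # assumed audio tokens per second
SAMPLE_RATE = 24_000

def fake_codec_encode(waveform: torch.Tensor) -> torch.Tensor:
    """Stand-in for a neural codec encoder: maps a batch of waveforms to integer
    codes. Real codecs use a convolutional encoder plus residual vector
    quantization; here we only mimic the output shape."""
    n_frames = waveform.shape[-1] * FRAME_RATE // SAMPLE_RATE
    return torch.randint(0, CODEBOOK_SIZE, (waveform.shape[0], n_frames))

class AudioLM(nn.Module):
    """A causal transformer over audio tokens: the same recipe as a text LLM,
    just with codec codes in place of word pieces."""
    def __init__(self, vocab=CODEBOOK_SIZE, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        hidden = self.backbone(self.embed(tokens), mask=causal_mask)
        return self.head(hidden)

# One next-token-prediction training step on 3-second clips.
waveform = torch.randn(2, SAMPLE_RATE * 3)
codes = fake_codec_encode(waveform)                # shape (2, 150)
logits = AudioLM()(codes[:, :-1])                  # predict each next token
loss = F.cross_entropy(logits.reshape(-1, CODEBOOK_SIZE), codes[:, 1:].reshape(-1))
```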

This architecture unlocked instant voice cloning, music generation (MusicLM), podcast dialogue generation (NotebookLM), and translation—all with the same flexibility as text LLMs. Even better, every advancement in text LLM research (distillation, speculative decoding, RLHF) could be applied directly to audio models.

"We were just riding for free the LLM wave, but just with audio," Neil explains.

The Intelligence Gap: Why Speech-to-Speech Still Struggles

Despite Moshi's impressive naturalness and latency, speech-to-speech models face a critical limitation: they're less intelligent than the text models they're based on.

"Even if you ignore the fact that it's less modular, it will be much dumber than the original text model you started from," Neil admits. "And honestly, we don't know why exactly."

The leading theory? Audio training data fundamentally differs from text data. You don't have Wikipedia or Stack Overflow in audio form—mostly just daily conversations. Text LLMs train on trillions of tokens; the equivalent for audio would require hundreds of millions to billions of hours of speech data.
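
A rough back-of-envelope calculation makes the scale concrete (the 150 words per minute and 1.3 tokens per word figures are assumptions for this estimate, not numbers from the episode):

```python
# How many hours of speech would match a 10-trillion-token text corpus?
words_per_minute = 150            # assumed pace of conversational speech
tokens_per_word = 1.3             # assumed text-tokenizer ratio
tokens_per_hour = words_per_minute * 60 * tokens_per_word   # ~11,700

text_corpus_tokens = 10e12        # a 10-trillion-token text training set
hours_needed = text_corpus_tokens / tokens_per_hour
print(f"{hours_needed / 1e6:.0f} million hours of speech")  # ~855 million hours
```

That lands squarely in the hundreds of millions of hours, and it is measured in text-equivalent tokens; actual codec tokens run at a far higher rate because they also carry acoustic detail.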

Additionally, when models learn to speak, they get distracted by irrelevant details: replicating recording conditions, reverberation, specific voice characteristics. "It's going to waste a lot of capacity on things that are irrelevant to your task," Neil notes.

This creates what he calls "the intelligence gap"—a measurable loss of reasoning capability when converting text models to speech models.

Why Cascaded Systems Still Dominate (For Now)

Despite speech-to-speech's advantages in naturalness and latency, cascaded systems (speech-to-text → LLM → text-to-speech) remain the production standard. The reasons are practical:

Modularity: You can easily swap the text LLM backbone without retraining the entire system. When GPT-5 comes out, you plug it in. With speech-to-speech, you need to fine-tune significantly.

Steerability: The text bottleneck makes it easy to filter inappropriate content, implement function calling, and control outputs. Filtering audio representations? Much harder.

Tool use and function calling: Current speech-to-speech models struggle with the structured interactions that make voice agents actually useful.

Neil's team is working to close these gaps. In the meantime, Gradium offers best-in-class streaming speech-to-text and text-to-speech models that bring cascaded systems down to sub-500-millisecond latency, fast enough for natural conversation while keeping all the practical advantages.
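
For readers new to the cascaded setup, here is a minimal sketch of one conversational turn (the stt, llm, tts, and speaker objects and their methods are hypothetical placeholders, not Gradium's API); the plain-text hand-off in the middle is exactly what makes the approach modular and easy to steer:

```python
async def cascaded_turn(stt, llm, tts, mic_audio, speaker):
    """One turn of a cascaded voice agent: speech-to-text -> LLM -> text-to-speech.
    All four stage objects are hypothetical placeholders for illustration."""
    # 1) Stream microphone audio into speech-to-text until it detects the end
    #    of the user's turn and returns a transcript.
    transcript = await stt.transcribe_stream(mic_audio)

    # 2) The transcript is plain text, so guardrails, function calling, and
    #    swapping in a newer LLM all work exactly as in a text chatbot.
    reply_text = await llm.complete(transcript)

    # 3) Stream the reply through text-to-speech and start playback as soon as
    #    the first chunk arrives; with streaming STT and TTS, the end-to-end
    #    delay can stay under roughly half a second.
    async for audio_chunk in tts.stream(reply_text):
        await speaker.play(audio_chunk)
```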

The Emotion and Expression Challenge

One fascinating technical challenge Neil discussed: current TTS models struggle with appropriate emotional expression. The problem traces back to voice cloning.

"For the model, from the point of view of the model and its loss function, replicating consistently the emotion that is in its original voice sample is just doing its job as voice cloning," Neil explains. If you clone a serious voice, the model speaks seriously. Clone an enthusiastic voice, it stays enthusiastic—regardless of context.

The solution requires disentangling intrinsic voice characteristics from sentence-specific emotions. This means better data with better annotations, but also smarter training approaches that teach models when to vary emotional expression naturally.
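
One way to picture that disentanglement (a hypothetical sketch of the general idea, not the specific approach Neil's team uses): give the TTS model two separate conditioning inputs, a speaker embedding that stays fixed for a cloned voice and a per-utterance style embedding that is free to change with context.

```python
import torch
import torch.nn as nn

class ConditionedTTS(nn.Module):
    """Toy TTS decoder with disentangled conditioning (illustrative only):
    the speaker embedding captures who is talking, the style embedding
    captures how this particular sentence should be delivered."""
    def __init__(self, text_vocab=256, d_model=256, d_speaker=64, d_style=32):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.cond_proj = nn.Linear(d_speaker + d_style, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_audio_frame = nn.Linear(d_model, 80)  # assumed mel-style output frames

    def forward(self, text_ids, speaker_emb, style_emb):
        cond = self.cond_proj(torch.cat([speaker_emb, style_emb], dim=-1))
        hidden = self.text_embed(text_ids) + cond.unsqueeze(1)  # broadcast over time
        out, _ = self.decoder(hidden)
        return self.to_audio_frame(out)

# Same cloned voice, two different deliveries: swap only the style embedding.
tts = ConditionedTTS()
text = torch.randint(0, 256, (1, 40))
speaker = torch.randn(1, 64)
calm, excited = torch.randn(1, 32), torch.randn(1, 32)
calm_frames = tts(text, speaker, calm)
excited_frames = tts(text, speaker, excited)
```

Holding the speaker input fixed while varying only the style input is what would let the same cloned voice sound serious in one sentence and enthusiastic in the next.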

This challenge extends beyond just TTS. NotebookLM succeeds partly because it captures podcast-specific speech patterns—utterly different from customer service calls, news anchors, or YouTube creators. Creating models that can adopt appropriate speaking styles for different environments remains an active area of research.

Miniaturization: Running TTS on Your Phone's CPU

Last week, Kyutai open-sourced Pocket TTS—the first text-to-speech model with voice cloning that runs on CPU. Not just on-device. On CPU. On old smartphones.

This represents a strategic direction: miniaturization rather than maximization. While large labs chase bigger and bigger models, Kyutai focuses on smaller, targeted models for specific verticals.

"Right now I think there is a big split between the premium experience of voice and the low-end scalable experience," Neil notes. "You can have robotic but very affordable or high quality but very expensive. Our goal is that in all use cases you can have access to human-like level of interaction."

The reasoning is economic: many voice interactions create minimal value. Resetting a password shouldn't require routing through a massive mixture-of-experts model. But today's alternatives are often robotic and frustrating. Pocket TTS enables natural voice anywhere.

Learning from Babies: The Efficiency Problem

Neil's PhD research studied language acquisition in babies—work that informs his current AI research. The efficiency gap is staggering.

By age five, children have heard less than 5,000 hours of speech, yet they can speak fluently, learn new words, and handle novel situations. Current speech models train on millions of hours.

Even more striking: humans learn to speak before learning to read. We master semantics, word segmentation, phonemes, verb morphology—all in a few hundred hours of audio-only learning.

"There is a way to learn much more efficiently and I think that's a very interesting research problem," Neil reflects.

The Contrarian AI: What's Next

Looking ahead, Neil's vision includes something delightfully specific: the first truly contrarian conversational AI.

"The best way of showcasing that will be the first contrarian conversational AI that just interrupts you and says no, no, no, I disagree fundamentally and explains why you're wrong," he laughs. "It will create value because it creates new use cases where you want to confront an idea you're not sure about with someone who plays devil's advocate. And it will be also much funnier."

This playful vision highlights a serious point: current voice agents are too polite, too disciplined. They wait for their turn, avoid interruptions, and force users to speak carefully so that a pause doesn't trip the voice activity detector and cut their turn short. Natural conversation isn't like that.

"As we fix all of that and allow for arbitrary conditions and the model just understanding that, 'this person is thinking right now, I should just shut up and wait for them to finish their question,' this is going to be much more fun."

The Road Ahead: 2026 and Beyond

For the coming year, Neil sees cascaded systems maintaining dominance while research advances on multiple fronts:

Expressivity and contextualization: Models that understand when to sound excited, serious, empathetic—not just cloning whatever emotion was in the training sample.

Handling LLM latency: As reasoning models take longer to think, how do you keep conversations natural? Fake keyboard clicks work for a few seconds. But 15 seconds of silence? The model needs to communicate what's happening without breaking the flow; one simple pattern is sketched after this list.

Miniaturization: Smaller, faster models that run efficiently for specific use cases rather than throwing massive compute at every interaction.

Speech-to-speech advancement: Closing the intelligence gap, improving steerability, and eventually combining the naturalness of end-to-end models with the modularity of cascaded systems.
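
On the latency point above, here is one simple way to cover a slow reasoning model without dead air (a sketch under assumed interfaces; llm, tts, and speaker are hypothetical placeholders, not any specific product's API):

```python
import asyncio

async def answer_with_fillers(llm, tts, speaker, user_text):
    """Run the slow LLM call concurrently and fill longer waits with speech.
    llm, tts, and speaker are hypothetical placeholders for illustration."""
    reply_task = asyncio.create_task(llm.complete(user_text))

    ticks = 0
    while not reply_task.done():
        await asyncio.sleep(0.5)
        ticks += 1
        if ticks == 4:       # ~2 seconds in: a short acknowledgement holds the floor
            await speaker.play(await tts.synthesize("Hmm, let me think about that."))
        elif ticks == 20:    # ~10 seconds in: explain the delay instead of going silent
            await speaker.play(await tts.synthesize("Still working on it, one moment."))

    # The answer is ready: speak it.
    await speaker.play(await tts.synthesize(await reply_task))
```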

The long-term vision? Speech-to-speech as the standard solution, delivering the best experience while solving the practical challenges that keep cascaded systems dominant today.

Beyond the Phone: Robotics and Spatial Audio

One of the most compelling future applications Neil discussed: robotics. Voice agents work reasonably well in controlled phone environments. But imagine a factory robot interacting with multiple people speaking from various positions, with background noise and machines running.

"There is no framework for that," Neil notes. "There is no model that addresses this kind of challenging environment."

Home robots face even more complex scenarios: people shouting from other rooms, 3D spatial audio, distinguishing TV audio from human speech, understanding who should be listened to based on who's being looked at.

These environments break current speech-to-text and turn-taking models completely. They also represent massive future applications of voice AI—applications that will require fundamentally different approaches than today's phone-based systems.

Key Takeaways

For developers: Cascaded systems aren't going anywhere soon, but invest in understanding speech-to-speech architecture. Cascaded systems' modularity advantages are significant today, yet the gaps that keep speech-to-speech out of production are expected to close, and its edge in naturalness will then be hard to ignore.

For researchers: The intelligence gap between text and speech models remains the critical unsolved problem. Solutions likely involve better data, smarter architectures, and techniques that preserve text model capabilities while adding speech.

For the industry: Miniaturization matters as much as capability. Not every interaction needs GPT-5. Efficient, targeted models for specific use cases will enable voice AI in places impossible with current economics.

For everyone: The future of voice AI isn't just about making robots sound human. It's about creating genuinely natural interactions—interruptions, emotion, spatial awareness, and all the messy richness of actual human conversation.

Listen to the full episode to hear more about audio language models, the technical details of full duplex conversation, and why Neil believes we'll look back at current cascaded systems as "archaic and brittle."

Conversations in Conversational AI explores the cutting edge of voice technology through in-depth interviews with researchers and builders shaping the future of how we interact with AI.