The State of Voice AI Instruction Following in 2026: A Conversation with Kwindla from Pipecat and Zach from Ultravox

Jan 27, 2026

Why are production voice agents still running on 18-month-old models? Why is instruction following the hardest problem to benchmark? And what's actually missing from voice AI evaluation today? We sat down with two of the sharpest minds in the space to find out.

As part of our State of Voice AI 2026 research, we brought together Kwindla Hultman Kramer, co-founder of Daily and creator of the open-source Pipecat framework, and Zach Koch, co-founder and CEO of Ultravox AI, which trains real-time speech-native models. The conversation that followed was one of the most candid discussions we've had about what's actually working in voice AI evaluation—and what's still broken.

Check out the full episode here:

The Benchmark That Actually Matters

Kwin recently published something the voice AI community has desperately needed: a public benchmark for instruction following and function calling in long, multi-turn conversations.

"I wanted to publish something that people could criticize and try to help make better," Kwin explained. "We all have kind of tests and vibes that we do internally, but I wanted something that reflects the hard workloads in voice AI—instruction following, function calling reliability, turn-taking reliability."

The benchmark simulates a real-world voice AI scenario: knowledge dumped into a system prompt, tools that need to be called, and a 30-turn conversation that tests whether the model can maintain coherent behavior deep into the dialogue.
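To make the shape of that setup concrete, here is a minimal Python sketch of what a long-horizon harness like this could look like: a system prompt carrying the knowledge, a tool list, and a scripted conversation scored turn by turn, with accuracy reported separately for early and late turns. Every name in it (the `model_client.chat` interface, the `TurnResult` fields, the 20-turn split) is illustrative scaffolding, not the actual benchmark code.

```python
# Hypothetical sketch of a long-horizon, multi-turn eval harness.
# `model_client` stands in for whatever chat-completion client you use.
from dataclasses import dataclass


@dataclass
class TurnResult:
    turn: int
    followed_instructions: bool
    expected_tool: str | None
    called_tool: str | None

    @property
    def tool_correct(self) -> bool:
        return self.expected_tool == self.called_tool


def run_scenario(model_client, system_prompt: str, tools: list[dict],
                 scripted_user_turns: list[dict]) -> list[TurnResult]:
    """Drive a scripted ~30-turn conversation and score each assistant turn."""
    messages = [{"role": "system", "content": system_prompt}]
    results: list[TurnResult] = []

    for i, turn in enumerate(scripted_user_turns, start=1):
        messages.append({"role": "user", "content": turn["user"]})
        reply = model_client.chat(messages=messages, tools=tools)  # assumed interface
        messages.append({"role": "assistant", "content": reply.text})

        results.append(TurnResult(
            turn=i,
            followed_instructions=turn["check"](reply.text),  # per-turn rule, e.g. regex or judge
            expected_tool=turn.get("expected_tool"),
            called_tool=reply.tool_call.name if reply.tool_call else None,
        ))
    return results


def score_by_depth(results: list[TurnResult], split: int = 20) -> dict:
    """Report accuracy separately for early vs. late turns, since late-turn
    function calling is where models tend to degrade."""
    early = [r for r in results if r.turn <= split]
    late = [r for r in results if r.turn > split]

    def rate(rs: list[TurnResult]) -> float:
        ok = [r for r in rs if r.followed_instructions and r.tool_correct]
        return len(ok) / max(len(rs), 1)

    return {"early_turns": rate(early), "late_turns": rate(late)}
```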

What surprised Kwin most? The frontier models saturated it.

"GPT-5, the latest Claude, Gemini 3—they all saturated what I thought was a really hard benchmark. But here's the thing: they're all too slow to use for a voice agent."

This is the central tension in voice AI today: the smartest models are too slow, and the fast models aren't smart enough.

Why Production Is Still Running 18-Month-Old Models

Here's a reality check that might surprise people outside the voice AI space: most production voice agents are still running on GPT-4o and Gemini 2.5 Flash—models that are now a year and a half old.

"Those are the models that have the right mix of intelligence and latency," Kwin noted. "And because people have gotten prompts optimized for them, they're pretty safe choices that a lot of people are sticking with."

But it's not just about capability. Switching models in voice AI is uniquely painful.

"It's so tricky to switch models," I explained during our conversation. "You have so many models in concert together—you're not just seeing if it performs as expected with your prompts, but how it interacts with all the other models. And the testing is much more expensive. The eval process is often very manual."

This creates a vicious cycle: teams stick with older models because evaluation is hard, which means newer models don't get battle-tested, which means teams stick with older models.

The Hardest Benchmark Problem in Voice AI

We spend a lot of time at Coval thinking about what makes voice AI evaluation uniquely difficult. Instruction following is, without question, the hardest piece.

Why? Because you can't just run the same prompt across different models and call it a fair comparison.

"Different prompts do well on different systems," I explained. "What people actually want to know is: what's the best I can get out of each system? It's not useful to compare something out of the box if there's an obvious optimization."

This is why benchmarks like Kwin's are so valuable—they help you rough-cut which models to even consider, before you invest in the expensive work of testing on your specific data and use case.

But there's a deeper problem. Traditional benchmarks test the first few turns of a conversation. Voice AI conversations are fundamentally long, multi-turn interactions—and that data is massively underrepresented in training datasets.

"I would talk to people at foundation labs and they'd say, 'We fixed function calling,'" Kwin recalled. "And function calling on the first three turns would be noticeably better. But function calling 20 turns into the conversation? No better at all."

The Vibes vs. Evals Debate

One of the most honest moments in our conversation came when Zach admitted something many AI practitioners secretly believe:

"I'm a king of vibes. I haven't figured out any benchmark that I trust fully more than putting in my AirPods and talking to the models for 20 minutes. Nothing is quite as brutal as that test."

But Kwin pushed back—gently—on the idea that vibes are enough:

"For the purpose of this conversation, I'm going to pretend to disagree. The pain point I see is: I got this prompt right, the 20 people at our company tested it and had a good experience, but then I put it in production and people did weird things and it's not good enough."

This is the gap that quantitative evaluation fills. It's not about capturing the entire space of what makes a conversation feel good. It's about drawing a box around expected behavior so you can tell which models are clearly inside the box and which aren't.

"If you can draw a box and say this model is clearly in the box, this model is not—that's a useful point of comparison for what it feels like to deploy these things into production with a wide variety of real-world user behavior."

What's Actually Missing from Voice AI Benchmarks

When I asked Zach what's still not captured in benchmarks, his answer was illuminating:

Back-channeling. Those little "mm-hmm" and "uh-huh" moments that humans do perfectly and AI does awkwardly—or not at all.

"Any attempt to back-channel as a system-level thing has failed catastrophically," Zach said. "They're either exactly correct and on the mark, or they're awkward. And we have no evals for this."

Prosody matching. The way your tone affects my response, and vice versa.

"If I say something in a particular tone, the interpretation of that tone should change how you respond—not just the words, but your prosody. My anger might induce your anger, or slow you down. We have no mechanisms to measure any of this."

The "one beat off" problem. The uncanny valley of voice AI isn't about obviously wrong responses—it's about timing that's slightly off.

"Capturing what makes it really unnatural—things are out of order, or it's repeating itself, or getting stuck in loops—those we can catch," I noted. "But when it's just one beat off? That's the hardest to get."

Multi-Model Architectures: Thinking Fast and Slow

One of the most interesting threads in our conversation was about how production voice AI is evolving toward multi-model architectures.

"We're increasingly living in a world where multiple models and multiple inference loops are really valuable," Kwin explained. "A lot of what we're helping customers deploy now feels like a thinking fast and slow split—a fast voice loop, and then various kinds of async or long-running or parallel inference processes."

This includes (a rough code sketch follows the list):

  • Guardrails running in parallel (though by the time a guardrail kicks in, you may have already moved past the moment)

  • Tool calling pulled out of the fast loop to avoid latency penalties

  • Long-running processes that inject back into the voice context
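Here is a minimal asyncio sketch of that fast/slow split. The coroutine names, timings, and guardrail rule are all invented for illustration, and real deployments (for example, Pipecat pipelines) are considerably more involved. The point is the shape: the fast loop replies immediately, while guardrail and tool work run in parallel and inject their results back into the shared context, sometimes only after the moment has already passed.

```python
# Hypothetical "thinking fast and slow" sketch: fast voice loop plus parallel workers.
import asyncio
from dataclasses import dataclass, field


@dataclass
class Context:
    messages: list[dict] = field(default_factory=list)

    def inject(self, role: str, content: str) -> None:
        # Slower work re-enters the conversation by appending to the shared context.
        self.messages.append({"role": role, "content": content})


async def fast_reply(ctx: Context, user_text: str) -> str:
    """Low-latency loop: a small, fast model answers immediately."""
    ctx.inject("user", user_text)
    await asyncio.sleep(0.2)                     # stand-in for a fast model call
    reply = f"(quick answer to: {user_text})"
    ctx.inject("assistant", reply)
    return reply


async def run_guardrail(ctx: Context, user_text: str) -> None:
    """Runs in parallel; may only flag a problem after the fast reply has shipped."""
    await asyncio.sleep(1.0)                     # stand-in for a slower judge model
    if "refund" in user_text.lower():
        ctx.inject("system", "guardrail: route refund requests to a human")


async def run_tool_call(ctx: Context, user_text: str) -> None:
    """Tool use pulled out of the fast loop; the result feeds later turns."""
    await asyncio.sleep(1.5)                     # stand-in for a tool round-trip
    ctx.inject("tool", f"lookup result for: {user_text}")


async def handle_turn(ctx: Context, user_text: str) -> str:
    # The fast reply is awaited so the user hears something right away;
    # guardrail and tool work continue in the background.
    guard = asyncio.create_task(run_guardrail(ctx, user_text))
    tool = asyncio.create_task(run_tool_call(ctx, user_text))
    reply = await fast_reply(ctx, user_text)
    await asyncio.gather(guard, tool)            # in production these would keep running
    return reply


if __name__ == "__main__":
    ctx = Context()
    print(asyncio.run(handle_turn(ctx, "Can I get a refund on my order?")))
```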

But this creates new evaluation challenges. As Zach pointed out: "The evals can mislead me when I look at them, because you get this boost from thinking performance that helps tool calling, but when I have the actual conversation, it feels awkward."

The text-based evaluation might look accurate, but the user experience of two AI brains trying to coordinate can feel disjointed.

The Chat vs. Voice Trap

One pattern we're seeing—and warning customers about—is trying to reuse the same agents for chat and voice.

"This is where people are running into a lot of issues," I explained. "What you want to see in chat looks very different than what you want to hear in a voice system. You're trying to use the same reasoning for two very different systems, and it just doesn't work."

The benchmarks might say your instruction following is great. But when you add all the layers of abstraction to retrofit a chat agent for voice, the real-world performance falls apart.

The Biggest Signal: Your Problems

We ended the conversation with what might be the most important takeaway for anyone building with voice AI:

Share your problems with your vendors.

"Everyone is trying to figure it out right now," I said. "Hearing from users about what's working and what's not is the biggest signal above all else. We learn so much from our customers."

Zach agreed: "We'd give ourselves a high five on some model performance eval, and then I'd throw it to a customer and they'd be like: garbage, garbage, garbage. There's a gap in our methodology—and we made a lot of mistakes in 2025 training without keeping that applied reality in mind."

The voice AI space is moving fast, but it's still early. The benchmarks are getting better. The models are getting better. But the feedback loop between real production pain and model improvement is still the most valuable signal any of us have.

Key Takeaways

  1. Frontier models saturate hard benchmarks but are too slow for production. The intelligence-latency trade-off is the defining constraint of voice AI in 2026.

  2. Most production systems still run 18-month-old models because switching is expensive and evaluation is hard.

  3. Instruction following is the hardest problem to benchmark because different prompting techniques work for different models, and voice conversations are long multi-turn interactions that training data doesn't represent well.

  4. What's missing from benchmarks: back-channeling, prosody matching, and the subtle timing issues that make conversations feel "one beat off."

  5. Multi-model "thinking fast and slow" architectures are emerging, but they create new evaluation challenges around coordination and user experience.

  6. Don't reuse chat agents for voice. The systems require fundamentally different reasoning and evaluation approaches.

  7. Share your production problems. The feedback loop between real-world deployment and model improvement is the most valuable signal in the industry.

Want to see how your voice agent performs on instruction following? Learn how Coval's simulation and evaluation platform helps teams test before production → Coval.dev

About the Participants:

Brooke Hopkins is the founder of Coval, building simulation and evaluation for voice agents. She previously led the evaluation infrastructure team at Waymo, responsible for all of the company's simulation tooling.

Kwindla Hultman Kramer is co-founder of Daily, which makes global infrastructure for real-time audio, video, and AI. Daily also maintains Pipecat, the most widely used open-source framework for building voice and real-time multimodal AI agents.

Zach Koch is co-founder and CEO of Ultravox AI, which trains real-time speech-native models and runs dedicated inference to achieve increasingly human-like conversations with AI.