What Voice AI Teams Can Learn from Hamel Husain: Beyond Vibe-Checks to Data-Driven Voice AI QA

Nov 17, 2025

Why Evaluations Are Critical for Voice AI Agent Testing

The latest episode of Conversations in Conversational AI featured Hamel Husain, one of the world's leading voices on AI evaluation. Here's what the conversational AI industry needs to hear:

If you're building voice AI products and think evaluations are just fancy unit tests that engineers should worry about later, you're already behind. That's the core message from our latest podcast conversation with Hamel Husain, an independent AI consultant who has trained over 3,000 engineers and PMs from companies like OpenAI, Anthropic, and Google on AI evaluation and conversational AI quality assurance best practices.

Hamel's background spans machine learning engineering at companies like Airbnb and GitHub, where he contributed to early LLM research that influenced OpenAI's code understanding work. Now, through his widely-referenced consulting practice and popular AI Evals course co-taught with Shreya Shankar, he's become the go-to expert for companies struggling with the same fundamental question: how do you test if your voice AI agent actually works?

Voice AI Evals Aren't Optional: They're Your Development Lifecycle

"We have this word evals and I almost feel like it's not really its own distinct thing," Hamel explained. "It's just really part of the software development life cycle for AI."

This reframes everything for voice AI testing teams. Evals aren't something you budget for separately or bolt on after development. They're how you should be developing voice AI agents from day one. For conversational AI teams especially, this shift is critical because voice systems operate in the messy, unpredictable real world where users interrupt, background noise interferes, and timing matters.

The conversational AI space is particularly prone to what Hamel calls "vibe-check development"—where teams test their voice agent once, it seems to work, and they ship it. But as anyone who's deployed a production voice system knows, real conversations are infinitely more complex than your test scripts. Automated voice AI testing at scale is the only way to ensure quality.

The 60-80% Rule: Why Voice AI Evaluation Should Dominate Your QA Time

Hamel advocates that 60-80% of your time building AI products should be spent on evaluation and testing. For engineering teams used to spending maybe 20% of their time on QA, this sounds radical. But here's the key insight: in conversational AI quality assurance, evaluation isn't just testing—it's how you understand your product, prioritize improvements, and make strategic decisions.

As our host Brooke Hopkins (CEO of Coval, a voice AI testing and evaluation platform) pointed out, this mirrors what she experienced at Waymo developing autonomous vehicles. The goal is crystal clear: drive safely from point A to point B anywhere in the city. But how to get there and where to focus engineering effort requires sophisticated evaluation and simulation systems.

Voice agents face the same challenge. Everyone knows the goal: handle 100% of call volume successfully. But which conversation failures should you fix first? Which edge cases matter most for your voice bot testing? That requires systematic evaluation and continuous voice AI agent testing.
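To make "which failures first" concrete, here is a minimal sketch of impact-weighted prioritization: rank the failure modes you observe during review by frequency multiplied by a rough estimate of business impact. The failure modes, counts, and weights below are illustrative assumptions, not data from the episode.

```python
# Rank observed failure modes by frequency x estimated business impact.
# The failure modes, counts, and weights are illustrative assumptions.

failure_counts = {           # occurrences across reviewed calls
    "missed_interruption": 34,
    "wrong_appointment_time": 12,
    "call_dropped_mid_booking": 7,
    "robotic_phrasing": 41,
}

impact_weights = {           # rough business cost per occurrence
    "missed_interruption": 2,
    "wrong_appointment_time": 5,
    "call_dropped_mid_booking": 8,
    "robotic_phrasing": 1,
}

priorities = sorted(
    failure_counts,
    key=lambda mode: failure_counts[mode] * impact_weights[mode],
    reverse=True,
)

for mode in priorities:
    print(mode, failure_counts[mode] * impact_weights[mode])
```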

How to Start Voice AI Testing: Error Analysis Before Automation

One of the most counterintuitive insights from Hamel for voice AI QA teams: don't start by writing automated tests. Start by analyzing your voice conversation data.

"Error analysis is the most important activity in evals," Hamel emphasized. "Error analysis helps you decide what evals to write in the first place."

Voice AI Error Analysis Framework

For conversational AI testing, this means (a rough code sketch of steps 1-3 follows the list):

  1. Gather real conversation traces from your voice system (or generate synthetic test data if you're just starting)

  2. Manually review 50-100 voice conversations and take notes on failures

  3. Look for patterns in how and where your voice agent breaks down

  4. Only then write targeted automated tests for high-value failure modes
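As a rough illustration of steps 1 through 3, the sketch below walks a reviewer through a batch of conversation traces, collects a free-form failure note per call, and tallies recurring failure modes. The directory layout and field names (a `traces/` folder of JSON files with a `turns` list) are assumptions for illustration, not any particular platform's format.

```python
# Minimal error-analysis loop: print each conversation trace for manual review,
# collect a free-form failure note per call, then tally recurring failure modes.
# Directory layout and field names are assumptions for illustration.

import csv
import json
from collections import Counter
from pathlib import Path

notes = []
for path in sorted(Path("traces").glob("*.json"))[:100]:   # review 50-100 calls
    trace = json.loads(path.read_text())
    for turn in trace["turns"]:
        print(f'{turn["speaker"]}: {turn["text"]}')
    note = input(f"Failure mode for {path.name} (blank if none): ").strip()
    notes.append({"trace": path.name, "failure_mode": note})

# Save the raw notes so product, ops, and engineering can all read them.
with open("error_analysis_notes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["trace", "failure_mode"])
    writer.writeheader()
    writer.writerows(notes)

# The pattern counts tell you which automated evals are worth writing first.
print(Counter(n["failure_mode"] for n in notes if n["failure_mode"]).most_common(10))
```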

Hamel shared that through this voice AI testing process, you'll often discover simple bugs you can just fix immediately—no eval needed. You'll also find workflow issues, timing problems, and edge cases that aren't obvious from your test scripts.

For voice AI specifically, Brooke highlighted how seemingly simple requirements like "ask what time they're available, then book the appointment" quickly become complex when users interrupt, take time to check their calendar, or have a TV playing in the background. You only discover these patterns by looking at real voice conversation data.

Voice AI Evals as Living Product Requirements for Conversational AI

Perhaps the most provocative idea from the conversation: "Evals are the new PRDs" for voice AI development.

Well-crafted evaluation prompts effectively become living product requirements documents that continuously test your voice AI system. Instead of a static document that gets written once and forgotten, your voice agent tests actively measure whether your conversational AI meets its requirements—every single day, for every code change.

This is particularly powerful for voice AI quality assurance because requirements are often implicit. What does "natural conversation flow" mean? What's an acceptable interruption handling rate for your voice bot? Your automated voice AI tests make these expectations explicit and measurable.
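As an example of turning an implicit requirement into a living check, here is a minimal sketch that expresses "handle caller interruptions gracefully" as an automated test over recorded traces. The field names and the 90% threshold are placeholders, not Coval's API or a recommended target.

```python
# A "living PRD" style check: the requirement "handle caller interruptions
# gracefully" expressed as a test that runs on every change.
# Field names and the 0.90 threshold are illustrative placeholders.

import json
from pathlib import Path


def interruption_handling_rate(trace_dir: str) -> float:
    """Fraction of barge-in events where the agent yielded the turn."""
    handled = total = 0
    for path in Path(trace_dir).glob("*.json"):
        trace = json.loads(path.read_text())
        for event in trace.get("barge_in_events", []):
            total += 1
            if event.get("agent_yielded"):
                handled += 1
    return handled / total if total else 1.0


def test_interruption_handling():
    # Fails the build if interruption handling regresses below the target.
    assert interruption_handling_rate("traces") >= 0.90
```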

Organizational Shifts Required for Voice AI QA

Hamel was clear that getting voice AI evaluation right isn't just an engineering problem—it requires organizational change across your conversational AI team.

"It's not like, I need to go do evals or budget for evals," he explained. "It's really just how you should be developing."

For voice AI companies implementing quality assurance programs, this means:

  • Product managers need to think in terms of conversation success rates and failure mode prioritization for voice agents

  • Operations teams need access to voice AI evaluation data to understand deployment readiness

  • Engineering teams need to shift from deterministic testing to probabilistic evaluation for conversational AI

  • Sales and strategy teams need voice AI testing insights to position products and set customer expectations

Brooke drew parallels to how multiple teams at Waymo consumed simulation data: onboard teams developing driving models, product managers deciding on roadmap priorities, and operations teams planning deployments. Voice AI needs similar cross-functional evaluation systems.

Why Voice AI Testing Matters More Than Ever in 2025

The conversational AI space is exploding right now. Agentic voice systems that can handle multi-step conversations have only become viable in the past six months. But as Hamel pointed out, "if your system works perfectly, it's a little bit suspect—maybe you should think about, is that defensible?"

The companies winning in voice AI won't be those with perfectly curated demos. They'll be the ones who can:

  • Systematically identify where their voice agents fail through automated testing

  • Prioritize which failures to fix based on user impact using voice AI analytics

  • Iterate rapidly with confidence they're making progress on conversational AI quality

  • Prove to customers exactly how well their voice system performs with data

This requires moving beyond hoping your voice agent works to knowing exactly when and why it succeeds or fails through comprehensive voice AI testing.

Getting Started with Voice AI Agent Testing: Practical Implementation Steps

If you're building conversational AI products and want to adopt evaluation-driven development, Hamel's advice is refreshingly practical for voice AI QA teams:

Voice AI Testing Implementation Roadmap

  1. Don't try to sell your team on "evals" as a concept; instead, show them what you find in the voice conversation data

  2. Start small with manual review of 50-100 conversations from your voice agent

  3. Tell stories with your findings: top failure modes, surprising user behaviors in voice interactions, bugs you prevented

  4. Build trust incrementally by fixing issues and showing the error rate decrease in your voice AI system (see the tracking sketch after this list)

  5. Only then start writing automated evaluations for high-value patterns in conversational AI
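One way to make step 4 visible to the whole team is to track how often each tagged failure mode shows up per review batch, so the trend is obvious as fixes land. A minimal sketch, assuming the review notes from the error-analysis step are collected in a CSV that also records which review batch each call belongs to; the file name and column names are assumptions.

```python
# Failure rate per tagged failure mode, per review batch, so the team can see
# error rates fall as fixes land. CSV columns (batch, trace, failure_mode,
# blank when the call passed) are assumptions.

import csv
from collections import defaultdict


def failure_rates(notes_csv: str) -> dict:
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    with open(notes_csv, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["batch"]] += 1
            if row["failure_mode"]:
                counts[row["batch"]][row["failure_mode"]] += 1
    return {
        batch: {mode: n / totals[batch] for mode, n in modes.items()}
        for batch, modes in counts.items()
    }


if __name__ == "__main__":
    for batch, rates in sorted(failure_rates("review_notes.csv").items()):
        print(batch, {mode: f"{rate:.0%}" for mode, rate in rates.items()})
```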

For voice or chat AI teams specifically, focus on the failures that actually impact business outcomes: dropped calls, incorrect information from your voice bot, poor interruption handling, unnatural conversation flow.
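As one concrete example, dropped calls can often be caught with a purely deterministic check over conversation traces before reaching for any model-graded evaluation. The sketch below flags calls that ended abruptly before the task completed; the `end_reason` and `task_completed` fields are assumed for illustration, not a real schema.

```python
# Deterministic check for calls that ended before the caller's task completed,
# e.g. the caller was mid-booking when the call dropped.
# The end_reason and task_completed fields are assumed, not a real schema.

import json
from pathlib import Path


def dropped_mid_task(trace_dir: str) -> list[str]:
    dropped = []
    for path in Path(trace_dir).glob("*.json"):
        trace = json.loads(path.read_text())
        ended_abruptly = trace.get("end_reason") in {"caller_hangup", "carrier_drop"}
        if ended_abruptly and not trace.get("task_completed", False):
            dropped.append(path.name)
    return dropped


if __name__ == "__main__":
    bad = dropped_mid_task("traces")
    print(f"{len(bad)} calls ended before the task was completed")
```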

The Bottom Line for Voice AI QA

Building production voice AI without systematic evaluation is like developing self-driving cars without simulation: technically possible but practically foolish. The real question isn't whether to do voice AI testing, but how quickly you can shift your organization to evaluation-driven conversational AI development.

As Hamel put it: these are stochastic, non-deterministic systems. The tools to evaluate them exist—machine learning teams have been doing it for years. The conversational AI industry just needs to adopt these voice agent testing practices before shipping products into the messy reality of production voice conversations.

Want to learn more about voice AI evaluation frameworks? Check out Hamel's widely-referenced blog at hamel.dev and his AI Evals course with Shreya Shankar. For voice-specific evaluation and automated testing, Coval applies these same simulation principles from autonomous vehicles to conversational AI quality assurance.

Key Takeaways for Voice AI Testing Teams

  • Error analysis comes first: Review 50-100 real voice conversations before writing automated tests

  • Invest 60-80% of development time in voice AI evaluation and quality assurance

  • Make evals your living PRDs: Use automated tests as continuous requirements validation for voice agents

  • Cross-functional approach: Voice AI testing requires buy-in from product, ops, engineering, and sales teams

  • Focus on impact: Prioritize testing conversation failures that affect business outcomes


This episode of Conversations in Conversational AI featured Hamel Husain, independent AI consultant and creator of the world's #1 AI evals course, in conversation with Brooke Hopkins, CEO of Coval. Listen to the full episode to dive deeper into error analysis techniques, voice AI testing strategies, and the future of conversational AI quality assurance.

© 2025 – Datawave Inc.