
How Krew Runs 10,000+ Voice AI Evaluations Monthly with Coval (And Turns QA Into Sales Acceleration)
Oct 26, 2025
The Challenge: Building Enterprise-Grade Voice AI Without Breaking Compliance
When Mike and his team at Krew set out to build voice agents for credit servicing, they knew they were tackling one of the most regulated industries in AI. Every conversation could impact real consumers. Every edge case needed coverage. Every compliance requirement demanded rigorous testing.
The stakes? Tens of thousands of calls affecting people's financial lives, overseen by strict regulatory bodies that don't forgive slip-ups.
Before partnering with Coval, Krew faced what most voice AI companies eventually hit: the evaluation bottleneck. They had server-side performance QA and basic quality checks, but scaling comprehensive testing across perceived latency, turn-taking, barge-in handling, and compliance metrics required infrastructure that would take months to build.
"Building your own evaluation framework is difficult," Mike explains. "Setting up the infrastructure to deploy at scale and understand perceived latency—which is what a consumer actually hears—is a lot of work."
Watch the full conversation:
Why Traditional QA Falls Short for Voice AI
Voice AI testing isn't like traditional software QA. You're not just checking if a button works. You're evaluating:
Perceived latency across different network conditions
Turn-taking and natural conversation flow
Barge-in handling when users interrupt
Compliance with industry regulations
Empathy and appropriate emotional responses
Edge cases that only emerge at scale
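Some of these dimensions reduce to simple pass/fail checks over call timestamps. As an illustration only (a hypothetical check, not Coval's implementation), a barge-in can be considered "handled" if the agent stops speaking within a short window after the caller starts talking over it:

```python
def barge_in_handled(agent_audio_end: float, user_interrupt_start: float,
                     yield_threshold_s: float = 0.5) -> bool:
    """True if the agent yielded the floor within the threshold after
    the caller began talking over it (a handled barge-in).

    Timestamps are seconds from call start; the 0.5s threshold is an
    illustrative assumption, not an industry standard.
    """
    gap = agent_audio_end - user_interrupt_start
    # Negative gap means the agent had already stopped before the
    # interruption began, so no barge-in actually occurred.
    return 0 <= gap <= yield_threshold_s
```

Checks like this only become meaningful at scale, across thousands of simulated calls under varied network conditions.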
Mike had previously benchmarked speech models for leading AI labs including OpenAI and Amazon while at Artificial Analysis. His insider perspective revealed a harsh truth: most voice AI testing platforms don't truly understand evaluations.
"Other platforms... they don't really understand evals. The results you get are not as consistent. You don't know what you're looking at, essentially. Whereas the consistency with Coval is a big plus."
The Coval Difference: From Setup to Scale in Weeks
Krew needed a partner that understood voice AI deeply enough to help them move fast without breaking things. Within two months, Coval became core infrastructure for their engineering team.
Rapid Implementation
Where building internal evaluation infrastructure could take quarters, Coval's plug-and-play approach got Krew testing in days:
White-labeled phone number testing that integrated with enterprise security requirements
Manual flow testing alongside automated evaluation pipelines
Consistent benchmarking for perceived latency over time
Pre-built best practices from working with cutting-edge voice AI companies
"The ease of use... being able to say, 'these are the phone numbers I need to whitelist so you can call in and test things out,' while restricting it so we're secure and can serve enterprise clients—that was a big plus," Mike notes.
White-Glove Expertise That Scales
Unlike platforms that leave you reading docs alone, Coval combines hands-on guidance with scalable infrastructure.
"Brooke is much more of a voice expert than I am right now," Mike admits. "When you're creating evaluation test cases, knowing what best practice looks like today is incredibly helpful. That hand-holding, that guidance—that's the biggest plus."
This expert partnership helped Krew:
Design comprehensive evaluation test cases
Identify critical metrics for credit servicing
Establish baselines and improvement targets
Scale testing alongside product development
Measurable Results That Matter
Krew runs tens of thousands of evaluations on their agents. With Coval, they achieved:
Consistent improvement tracking across latency, barge-in, and turn-taking
Reduced time-to-insight from days to minutes
Third-party validation they can share with enterprise customers
Engineering velocity without compromising safety
"We definitely rely a ton on Coval for perceived latency benchmarking. This part is really hard—I've done this stuff before. It's a lot of effort to build something that works. I'm glad we have a benchmark we can easily reference as a source of truth over time. That's worth whatever we're paying each month."
The Trust Multiplier: Turning QA into Competitive Advantage
In regulated industries, trust isn't a nice-to-have—it's table stakes. Krew discovered that Coval transformed their evaluation strategy into a sales accelerator.
External Validation for Enterprise Sales
When going through third-party risk management with banks, Krew can now demonstrate active compliance monitoring through an independent platform.
"Coval is a great platform because it is a third party where it's showing us these numbers. It's a great report I can print out and send to customers and say, 'We did this internally, and there is a trusted third party who has access to our systems giving us a check mark of approval.'"
This external validation proves critical when:
Navigating lengthy enterprise procurement processes
Meeting regulatory audit requirements
Differentiating from competitors who can't prove reliability
Reducing customer churn by demonstrating ongoing quality
The Future: AI Assurance as Infrastructure
Mike sees a future where platforms like Coval function like Vanta for voice AI—providing independent audits that become industry standard.
"You're going to see more requirement for third-party auditors to come in. There will be more guidelines and frameworks around this, just like how we audit financial statements. We're going to want to audit AI as well. And it's going to go beyond SOC 2."
For regulated industries like financial services and healthcare, this isn't optional. As autonomous AI agents handle more consequential decisions, the companies that can prove reliability through rigorous, third-party evaluation will win.
Voice AI Testing at Every Stage
Krew uses Coval across their entire development and sales lifecycle:
Engineering: Daily automated evaluations identify regressions and guide improvements
Product: Latency benchmarks and quality metrics inform roadmap prioritization
Sales: Third-party reports accelerate enterprise procurement cycles
Customer Success: Ongoing monitoring reduces churn by demonstrating continuous improvement
"Whenever there's an edge case, we add it into our eval set and run it again," Mike explains. "Being able to demonstrate that trust, show external parties we're actively thinking about this, actively checking it, actively auditing it—that's critical."
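Mike's "add it to the eval set and run it again" loop is a regression-testing pattern. A minimal sketch of the idea, using a hypothetical local JSONL store rather than Coval's actual API:

```python
import json
from pathlib import Path

def add_edge_case(eval_set: Path, transcript: str, expected: str) -> None:
    """Append a newly discovered edge case so every future run replays it."""
    with eval_set.open("a") as f:
        f.write(json.dumps({"transcript": transcript, "expected": expected}) + "\n")

def run_evals(eval_set: Path, agent) -> dict:
    """Replay every stored case against the agent and tally pass/fail.

    `agent` is any callable taking a transcript and returning a reply;
    the substring check stands in for a real grading rubric.
    """
    results = {"passed": 0, "failed": 0}
    for line in eval_set.read_text().splitlines():
        case = json.loads(line)
        reply = agent(case["transcript"])
        ok = case["expected"].lower() in reply.lower()
        results["passed" if ok else "failed"] += 1
    return results
```

The key property is that the eval set only grows: once an edge case is caught, it is checked on every subsequent run, which is what makes the "actively auditing it" claim demonstrable.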
The Build vs. Buy Decision: Why Teams Choose Coval
Many teams initially think they'll build voice AI testing in-house. After all, how hard can it be?
"We get that all the time. People say, 'We can just do this with a horizontal player' or 'How hard is it?' They try it and realize—good God, it's really, really hard."
The underestimation stems from voice AI's deceptive simplicity. You can build a demo in minutes. But building reliable, autonomous voice agents that handle the long tail of edge cases at production scale?
"It's actually not that much easier than building self-driving cars," notes Brooke Hopkins, Founder of Coval. "You just don't have to deal with physics. But it's that long tail problem where all these long tail issues take most of the time."
The difference between a demo and production is exactly where Coval excels—helping teams prove their agents work reliably at scale.
Real-World Impact: When Precision Matters
The importance of rigorous testing becomes visceral in regulated spaces. Mike shares a cautionary example from his benchmarking days:
"I was benchmarking speech-to-text platforms, and there was a big difference between 'hypothermia' and 'hyperthermia.' Being able to analyze that delta and understand with pinpoint accuracy if you're doing the right thing is really important because literally what you're doing is impacting consumers at scale."
In credit servicing, healthcare, financial advising—industries where Coval's customers operate—the stakes are real. A misunderstood word, a turn-taking failure, or an inappropriate response can have consequences far beyond a poor user experience.
What Sets Coval Apart from Competitors
While platforms like Hamming.ai and Cekura.ai offer voice AI testing, Krew chose Coval for specific differentiators:
Deep Voice AI Expertise
Coval was purpose-built for voice AI evaluation by founders who understand the unique challenges of real-time conversational AI, not adapted from general ML testing frameworks.
Perceived Latency Measurement
True end-to-end latency testing that measures what users actually experience, not just server-side metrics. This is notoriously difficult to implement correctly.
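One way to make "perceived" concrete: measure the silence between the end of the caller's utterance and the first agent audio heard on the caller's line, rather than the server's response timestamp. A minimal sketch of that measurement, using hypothetical `Turn` records (not Coval's data model):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "user" or "agent"
    start: float   # seconds from call start, as heard on the caller's line
    end: float

def perceived_latencies(turns: list[Turn]) -> list[float]:
    """Gap between the end of each user turn and the start of the agent
    audio that follows it—what the caller actually experiences as delay."""
    gaps = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev.speaker == "user" and nxt.speaker == "agent":
            gaps.append(nxt.start - prev.end)
    return gaps
```

The hard part in practice is getting those caller-side timestamps reliably over real telephony and network conditions, which is why Krew treats this benchmark as a source of truth rather than rebuilding it.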
Consistent, Interpretable Results
Evaluation consistency that lets teams track improvements over time and make confident decisions, rather than noisy results that require constant interpretation.
Enterprise-Grade Security
White-labeled testing that works within enterprise security requirements, supporting regulated industries from day one.
Hands-On Partnership
Expert guidance on evaluation best practices, test case design, and metric selection—not just platform access.
Third-Party Validation
Independent audit reports that accelerate enterprise sales and satisfy compliance requirements.
Getting Started: From Demo to Production-Ready
Krew's journey shows the path forward for voice AI companies:
Start with baseline metrics: Establish perceived latency, turn-taking, and core quality metrics
Build comprehensive test cases: Cover happy paths, edge cases, and regulatory requirements
Automate continuous testing: Integrate evaluations into your CI/CD pipeline
Track improvements over time: Use consistent benchmarking to measure progress
Leverage for sales: Turn evaluation results into competitive advantage
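Step 3—automating evaluations in CI/CD—can be as simple as a gate script that compares each run's metrics against an agreed baseline and fails the build on regression. A minimal sketch with illustrative thresholds (the metric names and baselines here are assumptions, not Coval's schema):

```python
# Illustrative baselines a team might agree on; tune to your own agents.
BASELINES = {
    "perceived_latency_p95_s": 1.5,   # p95 caller-perceived latency, seconds
    "turn_taking_pass_rate": 0.95,    # fraction of turns judged natural
}

def gate(results: dict) -> list[str]:
    """Return a list of regressions; an empty list means the build may ship."""
    failures = []
    if results["perceived_latency_p95_s"] > BASELINES["perceived_latency_p95_s"]:
        failures.append("p95 perceived latency regressed")
    if results["turn_taking_pass_rate"] < BASELINES["turn_taking_pass_rate"]:
        failures.append("turn-taking pass rate regressed")
    return failures
```

Wired into a CI job (exit non-zero when the list is non-empty), this turns the eval suite from a dashboard into an enforced release criterion.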
"As we continue to scale, what we do impacts the lives of consumers everywhere," Mike reflects. "We take that very seriously. Being able to actively benchmark and evaluate—not just performance, but quality, empathy, and engagement with the consumers we serve—that's really important to us."
The Bottom Line
Voice AI is moving beyond demos into production systems that handle consequential tasks at scale. The companies that win will be those that can prove their agents are reliable, compliant, and continuously improving.
Krew demonstrates what's possible when rigorous evaluation becomes core infrastructure rather than an afterthought. In just 60 days, they went from basic QA to comprehensive, third-party validated testing that accelerates both development and sales.
"I'm really glad we have Coval as a partner in what we do," Mike concludes. "It's been incredible."
Ready to Scale Your Voice AI Testing?
See how Coval can help your team ship production-ready voice agents faster:
Purpose-built for voice AI: Not adapted from general ML testing
Enterprise-ready security: Works within your compliance requirements
Third-party validation: Reports that accelerate enterprise sales
Expert partnership: White-glove guidance on evaluation best practices
Proven at scale: Trusted by leading voice AI companies in regulated industries
Watch the full episode: Mike from Krew on Building Enterprise Voice AI
About Krew
Krew builds voice agents for credit servicing, helping originating creditors and third-party institutions service small-balance accounts toward positive payment outcomes, with empathy and effectiveness.
About Coval
Coval provides simulation and evaluation infrastructure for voice & chat AI companies building production-ready autonomous agents. Founded by experts in evaluation & simulation infrastructure, Coval helps companies ship reliable agents faster while building the trust needed to win in regulated industries.