
How to Evaluate Text-to-Speech Models for Voice AI Applications: Insights from Cartesia
Mar 12, 2025
In the rapidly evolving world of voice AI, choosing the right text-to-speech (TTS) model has become a critical decision for businesses building voice agents. With numerous providers in the market, from OpenAI and ElevenLabs to Google and newer specialized players like Cartesia, understanding how to evaluate these models can make the difference between a robotic-sounding agent and one that crosses the "uncanny valley" into natural conversation.
We recently sat down with Eric, founding PM at Cartesia, to discuss the complexities of TTS model evaluation and what businesses should consider when selecting a voice provider for their AI applications.
The Voice AI Technology Stack: More Complex Than You Think
Unlike traditional LLM applications, voice AI requires decisions across multiple model types:
Voice infrastructure (telephony, SIP connections)
Speech-to-text (transcription)
LLM processing (generating responses)
Text-to-speech (converting text back to voice)
Turn detection and other auxiliary models
This complexity creates what Eric describes as "quite a maze, especially as you're starting out." Each component requires careful evaluation and selection, with text-to-speech often being the most subjective yet customer-facing element.
Key Factors When Evaluating TTS Models
According to our conversation with Cartesia and our extensive benchmarking experience across major providers, businesses should consider several critical factors when evaluating text-to-speech solutions:
1. Voice Naturalism and Quality
Voice quality remains the most crucial factor for customer-facing applications:
Generation quality: Does the model accurately follow the transcript? Does it hallucinate words or miss content? Do the pacing, prosody, and tone match the transcript?
Voice quality: Does the voice match your brand and use case requirements?
Voice selection: How diverse is the voice library? Do you need custom voice cloning capabilities?
"Naturalness is still naturalness, regardless of what that stack looks like," Eric notes. The ability to sound human while maintaining accuracy is paramount, especially for sensitive use cases.
Business Impact: Higher voice naturalism directly correlates with increased user engagement and reduced abandonment rates in voice interfaces. Our research shows that more natural voices demonstrate 15-30% improvements in user satisfaction scores, leading to increased conversion and improved revenue metrics.
2. Performance and Latency
Response time: How quickly does the model generate speech? For real-time conversations, milliseconds matter.
Reliability: Can the provider deliver consistent performance during peak loads?
Scalability: Will the service handle your growth?
"Especially for phone calls, real-time use cases, latency becomes super important," Eric emphasizes. "And reliability is also as important, if not more important."
Business Impact: Lower-latency systems not only improve user experience but also reduce compute costs for high-volume applications. For interactive applications like call center AI or customer service bots, low latency is essential, while applications that generate content ahead of time (podcast narration, video voiceovers) can prioritize other factors.
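To make "milliseconds matter" measurable, one option is to time how long a streaming request takes to return its first audio chunk and then summarize many runs with percentiles. The sketch below is a minimal illustration, not any provider's API: `fake_synthesize` is a hypothetical stand-in for a real streaming client, and the function names are our own.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_chunk(synthesize: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Seconds from request start until the first audio chunk is yielded."""
    start = time.perf_counter()
    for _chunk in synthesize(text):
        return time.perf_counter() - start
    raise RuntimeError("synthesizer produced no audio")


def latency_summary(samples: list[float]) -> dict[str, float]:
    """Median, nearest-rank 95th percentile, and worst case of the samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n) - 1, which avoids float rounding surprises.
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {"p50": statistics.median(ordered), "p95": ordered[p95_index], "max": ordered[-1]}


def fake_synthesize(text: str) -> Iterable[bytes]:
    """Hypothetical stand-in for a streaming TTS client; replace with a real one."""
    time.sleep(0.01)  # simulated network and model delay
    yield b"\x00" * 320


samples = [time_to_first_chunk(fake_synthesize, "Thanks for calling.") for _ in range(5)]
print(latency_summary(samples))
```

Looking at the 95th percentile and the maximum alongside the median reflects Eric's point about reliability: a provider with a fast typical response but a long tail can still produce awkward pauses on a live call.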
3. Controllability and Emotional Intelligence
One of the most underutilized advantages in TTS is controllability—the ability to fine-tune how your voice agent speaks:
Speed/pacing adjustment: Not just speeding up audio, but naturally adjusting the cadence
Emotional control: Explicitly defining the emotional tone and variations between segments
Voice cloning: Creating a custom voice with inherent characteristics
Disfluencies: Adding filler words like "um" and "ah" to cross the uncanny valley
Spelling and pause controls: Using explicit tags to spell out words or add strategic pauses
"We've seen that disfluencies, like filler words, like 'ums', work really, really well with crossing the uncanny valley," Eric shared.
Business Impact: Applications requiring nuanced communication (healthcare, education, customer service) benefit from advanced emotional control. Marketing narratives and storytelling applications see higher engagement with emotionally intelligent voices.
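As an illustration of spelling and pause controls, the sketch below builds a short prompt using SSML, the W3C markup for speech synthesis. It is an assumption that your provider accepts SSML at all: some providers use their own tag syntax or support only a subset, and `characters` is a commonly supported `say-as` value rather than a guaranteed one. The helper names are hypothetical, and the filler word is simply written into the text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text(content: str) -> str:
    """Escape plain text so characters like & and < stay valid inside SSML."""
    return escape(content)


def pause(milliseconds: int) -> str:
    """Insert an explicit pause of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def spell_out(content: str) -> str:
    """Ask the engine to read the content character by character."""
    return f'<say-as interpret-as="characters">{escape(content)}</say-as>'


def speak(*parts: str) -> str:
    """Wrap already-built fragments in the SSML root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    text("Um, let me check. Your confirmation code is "),
    spell_out("QX7"),
    pause(400),
    text("Is there anything else I can help with?"),
)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Check the provider's documentation for the tags it actually honors before relying on any of these controls in production.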
4. Pronunciation Accuracy and Specialized Content
TTS systems differ dramatically in handling specialized terminology:
Numerical data: How accurately does the system pronounce numbers, dates, and financial figures?
Industry-specific terms: Does it handle technical terminology, medical vocabulary, or legal jargon correctly?
Contextual understanding: Can it determine when to spell out an acronym versus reading it as a word?
Business Impact: Better pronunciation accuracy for specialized terminology reduces error rates and support costs. Financial services companies report significant reductions in customer confusion when using TTS systems with superior numerical pronunciation for transaction amounts, dates, and account information. Phonemization and custom pronunciation dictionaries for industry-specific terms improve comprehension further: they let businesses correctly vocalize unique product names, acronyms, and technical terminology instead of relying on default pronunciation rules that may misread them.
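A custom pronunciation dictionary can be approximated, even without provider support, by rewriting the text before it reaches the TTS engine. The sketch below assumes a simple respelling approach; the product name "Zyntrax" and every respelling are invented for illustration, and real deployments may prefer a provider's phoneme or lexicon feature where one exists.

```python
import re

# Hypothetical entries: the product name and respellings are invented for illustration.
LEXICON = {
    "Zyntrax": "ZIN-tracks",  # made-up product name with a made-up respelling
    "SLA": "S L A",           # acronym read letter by letter
    "SIP": "sip",             # acronym read as a word
}


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches with their respellings."""
    if not lexicon:
        return text
    # Try longer terms first so a long entry wins over a shorter one it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b")
    return pattern.sub(lambda match: lexicon[match.group(1)], text)


def spell_digits(number: str) -> str:
    """Render a digit string one digit at a time, e.g. for account numbers."""
    if not number.isdigit():
        raise ValueError("expected digits only")
    return " ".join(number)


print(apply_lexicon("Zyntrax meets its SLA over SIP.", LEXICON))
print("Account ending in " + spell_digits("4821"))
```

Digit spelling is applied explicitly rather than to every number in the text, because amounts and dates usually should be read as quantities, not digit by digit.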
5. Enterprise Readiness
Compliance capabilities (HIPAA, SOC 2, etc.)
On-premise deployment options
SLAs and support infrastructure
Multi-language support
"What if you do really well?" Eric asked. "Then that means you sell to bigger and bigger customers who have more and more stringent requirements. You want to have a provider that scales with you."
Matching TTS Providers to Common Business Use Cases
Different use cases demand different strengths from TTS providers. Based on our experience and research, here are some of the common business applications and their key requirements:
Use Case | Key Requirements |
---|---|
Customer Service and Call Centers | Low latency, accurate pronunciation of customer information, natural conversational flow |
Audiobook and Podcast Production | High naturalism, emotional range, consistent voice quality |
News and Information Briefings | Faster speech rate, proper pronunciation of names/places, segment-based emotion control |
Educational Content | Clear pronunciation, appropriate pacing, voice customization |
Financial Services | Accurate handling of currencies, numbers, and dates; formal tone |
Healthcare Communications | Medical terminology accuracy, empathetic tone, compliance capabilities |
Gaming and Entertainment | Character voice diversity, emotional range, immersive quality |
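One way to turn a row of this table into a comparison is a weighted score: rate each candidate provider on the requirements that matter for your use case, then weight the ratings by importance. The sketch below is only an illustration of that arithmetic. The criteria names, weights, and ratings are placeholders we made up, not measurements of any real provider.

```python
def weighted_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the ratings, using the weights as relative importance."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(ratings)
    if missing:
        raise KeyError(f"no rating for: {sorted(missing)}")
    return sum(ratings[name] * weight for name, weight in weights.items()) / total_weight


# Placeholder weights for a call center use case (illustrative only).
call_center_weights = {"latency": 3, "pronunciation": 2, "naturalness": 1}

# Placeholder ratings on a 1-5 scale for two hypothetical providers.
candidates = {
    "provider_a": {"latency": 5, "pronunciation": 3, "naturalness": 4},
    "provider_b": {"latency": 3, "pronunciation": 5, "naturalness": 5},
}

for name, ratings in candidates.items():
    print(name, round(weighted_score(ratings, call_center_weights), 2))
```

The same ratings can rank differently under another use case's weights, which is the point of the table: an audiobook producer would weight naturalness far above latency.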
Creating Effective Evaluations for Voice Models
Our discussion with Cartesia revealed several key approaches to evaluating voice models effectively:
Benchmark Testing vs. Use Case Testing
While industry benchmarks provide a starting point, Eric cautions against over-reliance: "If those transcripts don't sound like what you're going to use the text-to-speech engine for, does it really matter?"
Instead, he recommends:
Testing with your actual use case transcripts
Listening to 10 samples yourself rather than relying solely on metrics
Combining objective measurements with subjective human evaluation
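For the objective half of that combination, a common check is a round trip: synthesize one of your own transcripts, transcribe the audio with a speech-to-text model, and compute the word error rate (WER) against the original script. The sketch below covers only the WER step, using a standard word-level edit distance; the synthesis and transcription calls are assumed to happen elsewhere, and the tokenizer is a deliberately simple assumption.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


script = "Your balance is forty two dollars."
transcribed = "Your balance is forty dollars."  # stand-in for speech-to-text output
print(word_error_rate(script, transcribed))
```

A nonzero WER can come from the transcription model rather than the TTS model, so treat the number as a prompt to listen to the flagged samples, not as a verdict on its own.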
ROI Considerations for TTS Implementation
When evaluating TTS ROI, businesses should consider these key metrics:
Customer satisfaction impact: More natural voices show measurable improvements in user satisfaction scores
Operational cost impact: Lower-latency systems reduce compute costs for high-volume applications
Comprehension improvement: Better pronunciation accuracy reduces error rates and support costs
Engagement metrics: More emotionally nuanced voice delivery increases user retention and interaction time
Future-Proofing Your Voice AI Strategy
As the industry potentially moves toward speech-to-speech models (direct audio-to-audio without the text intermediate step), how should businesses prepare?
Eric recommends:
Establishing clear evaluation frameworks now that define what "good" looks like for your specific use case
Preserving data from your current systems for future training
Deepening your understanding of your specific market niche
Ensuring enterprise readiness: verify that your technology stack and providers can scale with your business, maintain compliance standards, and deliver the reliability you need as deployments expand beyond initial implementations
"Your moat to some extent is distribution and empathy with that specific use case," Eric noted. "The newcomers cannot catch up to you in the people that you know and your empathy of the problem."
Conclusion: Thinking Beyond the Voice
Selecting the right text-to-speech provider is more than just finding a pleasant voice—it's about creating a comprehensive evaluation strategy that considers quality, service performance, and enterprise readiness.
As Eric from Cartesia highlighted, voice is half art and half science. The technical metrics matter, but so does the subjective experience of how your brand sounds to customers. The most successful implementations will be those that thoughtfully evaluate providers based on their specific use case requirements while building flexible frameworks that can adapt as the technology evolves.
The TTS provider landscape continues to evolve rapidly, with specialized solutions emerging for different business needs. While providers like Cartesia are pushing boundaries in areas like latency, voice naturalism, and specialized features, the optimal choice ultimately depends on your specific business requirements, technical constraints, and budget considerations.
At Coval, we help companies set up rigorous evaluation processes for their voice AI applications, ensuring you select the right components across the entire voice stack. Contact us to learn how we can help you build more effective voice experiences through data-driven testing and optimization.
How Coval Helps: End-to-End Voice AI Quality Assurance
Selecting the right TTS provider is just the beginning of your voice AI journey. To ensure consistent performance and continuous improvement, you need comprehensive testing, simulation, and monitoring solutions. This is where Coval's specialized platform comes in.
Voice AI Simulations: Test Before You Deploy
Our simulation platform allows you to:
Test Different TTS Models: Compare multiple providers with your exact use case scripts and scenarios
Evaluate Edge Cases: Identify weaknesses in your voice agent's handling of complex customer interactions
Validate Performance Metrics: Measure latency, word error rates, and other critical KPIs before deployment
Optimize Cost-Efficiency: Determine the optimal balance between quality and cost for your specific business needs
Simulations provide a safe environment to perfect your voice experience before customer exposure, reducing the risk of negative experiences and brand damage.
Comprehensive Evaluations: Beyond Basic Metrics
Coval's evaluation framework goes beyond standard metrics to assess:
Voice Quality and Naturalness: Using both objective metrics and human evaluation panels
Emotional Appropriateness: Ensuring the emotional tone matches your brand and use case requirements by running sentiment analysis
Pronunciation Accuracy: Testing specialized terminology, numbers, and critical information handling
Conversation Flow: Evaluating interruption handling and word error rates
Our structured evaluation process helps you quantify the subjective aspects of voice interactions, making decision-making more objective and data-driven.
Live Monitoring: Ensuring Continuous Excellence
Once deployed, your voice application needs continuous quality monitoring:
Real-Time Performance Tracking: Monitor latency, uptime, and technical performance across your voice stack
Conversation Quality Alerts: Receive notifications when quality falls below defined thresholds
Continuous Improvement Insights: Identify opportunities to enhance your voice experience over time
With Coval's live monitoring solution, you can detect and address issues before they impact customer experience, while gathering valuable data to drive ongoing optimization.
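To show what a threshold-based quality alert can look like in principle, here is a generic sketch that tracks a rolling average of one metric and flags when it crosses a limit. It does not represent how Coval's monitoring is implemented; the class name, window size, and threshold are assumptions chosen for illustration.

```python
from collections import deque


class RollingThresholdMonitor:
    """Flags when the rolling mean of a metric rises above a threshold."""

    def __init__(self, threshold: float, window: int = 20, min_samples: int = 5) -> None:
        self.threshold = threshold
        self.min_samples = min_samples
        # A bounded deque drops the oldest value once the window is full.
        self.values: deque[float] = deque(maxlen=window)

    def record(self, value: float) -> bool:
        """Add one measurement; return True if the alert condition holds."""
        self.values.append(value)
        if len(self.values) < self.min_samples:
            return False  # not enough data to judge yet
        return sum(self.values) / len(self.values) > self.threshold


# Illustrative latency readings in seconds, with a slowdown at the end.
monitor = RollingThresholdMonitor(threshold=0.5, window=3, min_samples=3)
for latency in [0.2, 0.3, 0.4, 1.2]:
    if monitor.record(latency):
        print(f"alert: rolling mean latency above {monitor.threshold}s")
```

Averaging over a window rather than alerting on single values trades a little detection delay for fewer false alarms from one-off spikes.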
The Coval Advantage: Holistic Voice AI Optimization
What sets Coval apart is our end-to-end approach to voice AI quality:
Comprehensive Coverage: We analyze the entire voice technology stack, not just isolated components
Data-Driven Methodologies: Our recommendations are backed by rigorous testing and real-world performance data
Continuous Learning: As voice technologies evolve, our testing frameworks and best practices evolve with them
Specialized Expertise: Our team understands the nuances of voice AI across industries and use cases
Whether you're just starting your voice AI journey or looking to optimize an existing implementation, Coval provides the tools and expertise to ensure your voice experiences consistently delight your customers while delivering measurable business results.
Contact us today to learn how our simulation, evaluation, and monitoring solutions can help you build and maintain exceptional voice AI experiences.
Coval's Benchmarking Dashboard
Coval TTS continuous benchmarking dashboard: https://app.coval.dev/tts-benchmarks
Code we used for benchmarks: https://github.com/coval-ai/benchmarking