Blog Articles

How to Evaluate Text-to-Speech Models for Voice AI Applications: Insights from Cartesia

Mar 12, 2025

In the rapidly evolving world of voice AI, choosing the right text-to-speech (TTS) model has become a critical decision for businesses building voice agents. With numerous providers in the market, from OpenAI and ElevenLabs to Google and newer specialized players like Cartesia, understanding how to evaluate these models can make the difference between a robotic-sounding agent and one that crosses the "uncanny valley" into natural conversation.

We recently sat down with Eric, founding PM at Cartesia, to discuss the complexities of TTS model evaluation and what businesses should consider when selecting a voice provider for their AI applications.

The Voice AI Technology Stack: More Complex Than You Think

Unlike traditional LLM applications, voice AI requires decisions across multiple model types:

  • Voice infrastructure (telephony, SIP connections)

  • Speech-to-text (transcription)

  • LLM processing (generating responses)

  • Text-to-speech (converting text back to voice)

  • Turn detection and other auxiliary models

This complexity creates what Eric describes as "quite a maze, especially as you're starting out." Each component requires careful evaluation and selection, with text-to-speech often being the most subjective yet customer-facing element.

Key Factors When Evaluating TTS Models

According to our conversation with Cartesia and our extensive benchmarking experience across major providers, businesses should consider several critical factors when evaluating text-to-speech solutions:

1. Voice Naturalism and Quality

Voice quality remains the most crucial factor for customer-facing applications:

  • Generation quality: Does the model accurately follow the transcript? Does it hallucinate words or miss content? Do the pacing, prosody, and tone match the transcript?

  • Voice quality: Does the voice match your brand and use case requirements?

  • Voice selection: How diverse is the voice library? Do you need custom voice cloning capabilities?

"Naturalness is still naturalness, regardless of what that stack looks like," Eric notes. The ability to sound human while maintaining accuracy is paramount, especially for sensitive use cases.

Business Impact: Higher voice naturalism directly correlates with increased user engagement and reduced abandonment rates in voice interfaces. Our research shows that more natural voices demonstrate 15-30% improvements in user satisfaction scores, leading to increased conversion and improved revenue metrics.
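One way to put a number on "does the model follow the transcript" is to transcribe the generated audio with a speech-to-text model and compare that transcription against the script you sent. The sketch below computes word error rate (WER) in plain Python. It assumes you already have the transcription as a string; the `script` and `heard` values are made-up examples, and the normalization rules are our own simplification rather than any provider's method.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped by the TTS model
                            curr[j - 1] + 1,      # word added (hallucinated)
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the script sent to TTS vs. what an ASR model heard back.
script = "Your balance is forty two dollars and ten cents."
heard = "Your balance is forty dollars and ten cents."
print(f"WER: {word_error_rate(script, heard):.3f}")
```

Keep in mind that this measures the TTS model and the transcription model together, so a nonzero score is a prompt to listen to the sample, not proof that the voice model made the mistake.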

2. Performance and Latency

  • Response time: How quickly does the model generate speech? For real-time conversations, milliseconds matter.

  • Reliability: Can the provider deliver consistent performance during peak loads?

  • Scalability: Will the service handle your growth?

"Especially for phone calls, real-time use cases, latency becomes super important," Eric emphasizes. "And reliability is also as important, if not more important."

Business Impact: Lower latency systems not only improve user experience but also reduce compute costs for high-volume applications. For interactive applications like call center AI or customer service bots, low latency is essential, while applications that generate content ahead of time (podcast narration, video voiceovers) can prioritize other factors.
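For real-time use cases, the number that usually matters is how long the caller waits before the first audio arrives, and how bad that wait gets in the tail. Here is a minimal sketch of such a measurement. `fake_tts_stream` is a hypothetical stand-in we made up so the example runs on its own; in practice you would pass in a function that wraps your provider's streaming client.

```python
import math
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(stream: Callable[[str], Iterator[bytes]], text: str) -> float:
    """Seconds from sending the request until the first audio chunk arrives."""
    start = time.perf_counter()
    for _chunk in stream(text):
        return time.perf_counter() - start
    raise RuntimeError("stream produced no audio")


def summarize(latencies: Iterable[float]) -> dict[str, float]:
    """Median and nearest-rank 95th percentile, both in milliseconds."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "p50_ms": statistics.median(ordered) * 1000,
        "p95_ms": ordered[p95_index] * 1000,
    }


def fake_tts_stream(text: str) -> Iterator[bytes]:
    """Hypothetical provider: waits 10 ms, then yields two silent audio chunks."""
    time.sleep(0.01)
    yield b"\x00" * 320
    yield b"\x00" * 320


samples = [time_to_first_chunk(fake_tts_stream, "Thanks for calling.") for _ in range(5)]
print(summarize(samples))
```

Reporting the 95th percentile alongside the median speaks to the reliability point above: a provider with a fast median but a slow tail will still produce noticeable pauses on some calls.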

3. Controllability and Emotional Intelligence

One of the most underutilized advantages in TTS is controllability—the ability to fine-tune how your voice agent speaks:

  • Speed/pacing adjustment: Not just speeding up audio, but naturally adjusting the cadence

  • Emotional control: Explicitly defining the emotional tone and variations between segments

  • Voice cloning: Creating a custom voice with inherent characteristics

  • Disfluencies: Adding filler words like "um" and "ah" to cross the uncanny valley

  • Spelling and pause controls: Using explicit tags to spell out words or add strategic pauses

"We've seen that disfluencies, like filler words, like 'ums', work really, really well with crossing the uncanny valley," Eric shared.

Business Impact: Applications requiring nuanced communication (healthcare, education, customer service) benefit from advanced emotional control. Marketing narratives and storytelling applications see higher engagement with emotionally intelligent voices.
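To make the spelling and pause controls concrete, here is a small sketch that builds SSML, the W3C markup that many TTS engines accept in some form. This is an assumption on our part: providers differ in which tags they support, and some use their own control syntax instead, so check your provider's documentation before relying on it. The `build_ssml` helper and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, spelled: str, pause_ms: int, after: str) -> str:
    """Wrap text in <speak>, spell out one token, then pause before continuing."""
    speak = ET.Element("speak")
    speak.text = before
    # "characters" is a commonly supported interpret-as value, not a guaranteed one.
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "characters"})
    say_as.text = spelled
    say_as.tail = " "
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = after
    # ElementTree escapes characters such as "&" and "<" in the text for us.
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Your confirmation code is ", "QX7", 400, " Please write it down."))
```

Building the markup with an XML library rather than string concatenation avoids malformed requests when customer data contains characters like "&".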

4. Pronunciation Accuracy and Specialized Content

TTS systems differ dramatically in handling specialized terminology:

  • Numerical data: How accurately does the system pronounce numbers, dates, and financial figures?

  • Industry-specific terms: Does it handle technical terminology, medical vocabulary, or legal jargon correctly?

  • Contextual understanding: Can it determine when to spell out an acronym versus reading it as a word?

Business Impact: Better pronunciation accuracy for specialized terminology reduces error rates and support costs. Financial services companies report significant reductions in customer confusion when using TTS systems with superior numerical pronunciation for transaction amounts, dates, and account information. Implementing phonemization and custom pronunciation dictionaries for industry-specific terms further enhances comprehension. With them, businesses can properly vocalize unique product names, acronyms, and technical terminology instead of relying on default pronunciation rules that might misinterpret these specialized words.
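A custom pronunciation dictionary can start out very simply. The sketch below rewrites known terms into a spoken form before the text is sent to the TTS model. It is plain text substitution, not phonemization, and the `LEXICON` entries are hypothetical examples; many providers also offer their own dictionary or phoneme features, which this does not use.

```python
import re

# Hypothetical entries; real ones come from your own product and domain vocabulary.
LEXICON = {
    "SQL": "sequel",
    "APR": "A P R",
    "mg": "milligrams",
}


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches of each term with its spoken form."""
    if not lexicon:
        return text
    # Longest terms first so a longer entry wins over a shorter one it starts with.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"(?<!\w)(" + "|".join(map(re.escape, terms)) + r")(?!\w)")
    return pattern.sub(lambda match: lexicon[match.group(1)], text)


print(apply_lexicon("Take 20 mg daily. Your APR is fixed.", LEXICON))
```

Matching on whole words keeps the rule from firing inside longer words such as "mgmt" or "SQLite". The same boundary check means "20mg" written without a space is left alone, so test against text formatted the way your system actually produces it.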

5. Enterprise Readiness

  • Compliance capabilities (HIPAA, SOC2, etc.)

  • On-premise deployment options

  • SLAs and support infrastructure

  • Multi-language support

"What if you do really well?" Eric asked. "Then that means you sell to bigger and bigger customers who have more and more stringent requirements. You want to have a provider that scales with you."

Matching TTS Providers to Common Business Use Cases

Different use cases demand different strengths from TTS providers. Based on our experience and research, here are some of the common business applications and their key requirements:

| Use Case | Key Requirements |
| --- | --- |
| Customer Service and Call Centers | Low latency, accurate pronunciation of customer information, natural conversational flow |
| Audiobook and Podcast Production | High naturalism, emotional range, consistent voice quality |
| News and Information Briefings | Faster speech rate, proper pronunciation of names/places, segment-based emotion control |
| Educational Content | Clear pronunciation, appropriate pacing, voice customization |
| Financial Services | Accurate handling of currencies, numbers, and dates; formal tone |
| Healthcare Communications | Medical terminology accuracy, empathetic tone, compliance capabilities |
| Gaming and Entertainment | Character voice diversity, emotional range, immersive quality |

Creating Effective Evaluations for Voice Models

Our discussion with Cartesia revealed several key approaches to evaluating voice models effectively:

Benchmark Testing vs. Use Case Testing

While industry benchmarks provide a starting point, Eric cautions against over-reliance: "If those transcripts don't sound like what you're going to use the text-to-speech engine for, does it really matter?"

Instead, he recommends:

  • Testing with your actual use case transcripts

  • Listening to 10 samples yourself rather than relying solely on metrics

  • Combining objective measurements with subjective human evaluation
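For the subjective half, a common approach (our suggestion, not something Eric prescribed) is to have listeners rate each sample on a 1 to 5 scale and average the results into a mean opinion score. The sketch below does that and adds a rough 95% interval using a normal approximation, which is only indicative with as few as 10 ratings. The provider names and ratings are invented for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings: list[int]) -> tuple[float, float]:
    """Return (mean, half-width of an approximate 95% interval) for 1-5 ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(rating < 1 or rating > 5 for rating in ratings):
        raise ValueError("ratings must be on a 1-5 scale")
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * std_err


# Hypothetical ratings for 10 samples of your own transcripts per provider.
ratings_by_provider = {
    "provider_a": [4, 5, 4, 4, 3, 5, 4, 4, 5, 4],
    "provider_b": [3, 4, 3, 4, 3, 3, 4, 2, 4, 3],
}
for name, ratings in ratings_by_provider.items():
    mean, half_width = mean_opinion_score(ratings)
    print(f"{name}: {mean:.2f} +/- {half_width:.2f}")
```

If the intervals for two providers overlap heavily, that is a sign to collect more ratings before declaring a winner.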

ROI Considerations for TTS Implementation

When evaluating TTS ROI, businesses should consider these key metrics:

  • Customer satisfaction impact: More natural voices show measurable improvements in user satisfaction scores

  • Operational cost impact: Lower latency systems reduce compute costs for high-volume applications

  • Comprehension improvement: Better pronunciation accuracy reduces error rates and support costs

  • Engagement metrics: More emotionally nuanced voice delivery increases user retention and interaction time

Future-Proofing Your Voice AI Strategy

As the industry potentially moves toward speech-to-speech models (direct audio-to-audio without the text intermediate step), how should businesses prepare?

Eric recommends:

  1. Establishing clear evaluation frameworks now that define what "good" looks like for your specific use case

  2. Preserving data from your current systems for future training

  3. Deepening your understanding of your specific market niche

  4. Ensuring enterprise readiness of your technology stack and providers to verify they can scale with your business growth, maintain compliance standards, and provide the reliability needed as your deployment expands beyond initial implementations

"Your moat to some extent is distribution and empathy with that specific use case," Eric noted. "The newcomers cannot catch up to you in the people that you know and your empathy of the problem."

Conclusion: Thinking Beyond the Voice

Selecting the right text-to-speech provider is more than just finding a pleasant voice—it's about creating a comprehensive evaluation strategy that considers quality, service performance, and enterprise readiness.

As Eric from Cartesia highlighted, voice is half art and half science. The technical metrics matter, but so does the subjective experience of how your brand sounds to customers. The most successful implementations will be those that thoughtfully evaluate providers based on their specific use case requirements while building flexible frameworks that can adapt as the technology evolves.

The TTS provider landscape continues to evolve rapidly, with specialized solutions emerging for different business needs. While providers like Cartesia are pushing boundaries in areas like latency, voice naturalism, and specialized features, the optimal choice ultimately depends on your specific business requirements, technical constraints, and budget considerations.

At Coval, we help companies set up rigorous evaluation processes for their voice AI applications, ensuring you select the right components across the entire voice stack. Contact us to learn how we can help you build more effective voice experiences through data-driven testing and optimization.

How Coval Helps: End-to-End Voice AI Quality Assurance

Selecting the right TTS provider is just the beginning of your voice AI journey. To ensure consistent performance and continuous improvement, you need comprehensive testing, simulation, and monitoring solutions. This is where Coval's specialized platform comes in.

Voice AI Simulations: Test Before You Deploy

Our simulation platform allows you to:

  • Test Different TTS Models: Compare multiple providers with your exact use case scripts and scenarios

  • Evaluate Edge Cases: Identify weaknesses in your voice agent's handling of complex customer interactions

  • Validate Performance Metrics: Measure latency, word error rates, and other critical KPIs before deployment

  • Optimize Cost-Efficiency: Determine the optimal balance between quality and cost for your specific business needs

Simulations provide a safe environment to perfect your voice experience before customer exposure, reducing the risk of negative experiences and brand damage.

Comprehensive Evaluations: Beyond Basic Metrics

Coval's evaluation framework goes beyond standard metrics to assess:

  • Voice Quality and Naturalness: Using both objective metrics and human evaluation panels

  • Emotional Appropriateness: Using sentiment analysis to check that the emotional tone matches your brand and use case requirements

  • Pronunciation Accuracy: Testing specialized terminology, numbers, and critical information handling

  • Conversation Flow: Evaluating interruption handling and word error rates

Our structured evaluation process helps you quantify the subjective aspects of voice interactions, making decision-making more objective and data-driven.

Live Monitoring: Ensuring Continuous Excellence

Once deployed, your voice application needs continuous quality monitoring:

  • Real-Time Performance Tracking: Monitor latency, uptime, and technical performance across your voice stack

  • Conversation Quality Alerts: Receive notifications when quality falls below defined thresholds

  • Continuous Improvement Insights: Identify opportunities to enhance your voice experience over time

With Coval's live monitoring solution, you can detect and address issues before they impact customer experience, while gathering valuable data to drive ongoing optimization.
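To illustrate the idea of a quality alert in general terms, here is a minimal sketch of a rolling-window threshold check on latency. It is a generic example and does not describe how Coval's monitoring is implemented; the class name, window size, and threshold are all hypothetical.

```python
from collections import deque


class RollingThresholdAlert:
    """Fires when the mean of the last `window` samples exceeds a threshold."""

    def __init__(self, window: int, threshold_ms: float) -> None:
        self.samples: deque[float] = deque(maxlen=window)
        self.threshold_ms = threshold_ms

    def observe(self, latency_ms: float) -> bool:
        """Record one sample; return True if the alert should fire."""
        self.samples.append(latency_ms)
        # Wait for a full window so one slow call at startup does not page anyone.
        window_full = len(self.samples) == self.samples.maxlen
        mean = sum(self.samples) / len(self.samples)
        return window_full and mean > self.threshold_ms


alert = RollingThresholdAlert(window=3, threshold_ms=300)
for latency_ms in [500, 500, 500, 100, 100]:
    print(latency_ms, alert.observe(latency_ms))
```

Averaging over a window trades a little detection speed for fewer false alarms; the same pattern works for other per-call metrics such as word error rate.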

The Coval Advantage: Holistic Voice AI Optimization

What sets Coval apart is our end-to-end approach to voice AI quality:

  1. Comprehensive Coverage: We analyze the entire voice technology stack, not just isolated components

  2. Data-Driven Methodologies: Our recommendations are backed by rigorous testing and real-world performance data

  3. Continuous Learning: As voice technologies evolve, our testing frameworks and best practices evolve with them

  4. Specialized Expertise: Our team understands the nuances of voice AI across industries and use cases

Whether you're just starting your voice AI journey or looking to optimize an existing implementation, Coval provides the tools and expertise to ensure your voice experiences consistently delight your customers while delivering measurable business results.

Contact us today to learn how our simulation, evaluation, and monitoring solutions can help you build and maintain exceptional voice AI experiences.

Coval's Benchmarking Dashboard

Coval TTS continuous benchmarking dashboard: https://app.coval.dev/tts-benchmarks
Code we used for benchmarks: https://github.com/coval-ai/benchmarking

© 2025 – Datawave Inc.
