
Evaluating Realtime Voice-to-Voice AI Agents: A Practical Guide
May 30, 2025
If you’re coming from a cascading architecture—speech-to-text (STT) → LLM → text-to-speech (TTS)—you’re used to having visibility and control at each step. You can simulate with text, inspect intermediate outputs, inject guardrails, and evaluate with transcripts alone.
But in a realtime voice-to-voice system, those layers are collapsed: there is no handoff between STT, LLM, and TTS. Instead, the entire interaction is streamed end to end, audio in and audio out, with no guaranteed access to intermediate representations.
This shift unlocks lower latency and more natural interactions. But it breaks your old evaluation playbook.
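To make the contrast concrete, here is a minimal sketch in Python. The class and method names are illustrative, not from any particular SDK: a cascading agent exposes inspectable text at every stage, while a realtime agent only exposes streamed audio.

```python
from typing import Callable, Iterator

class CascadingAgent:
    """STT -> LLM -> TTS: every stage yields an inspectable intermediate."""

    def __init__(self,
                 stt: Callable[[bytes], str],
                 llm: Callable[[str], str],
                 tts: Callable[[str], bytes]):
        self.stt, self.llm, self.tts = stt, llm, tts

    def respond(self, user_audio: bytes) -> bytes:
        transcript = self.stt(user_audio)   # text you can log, guard, and test on
        reply_text = self.llm(transcript)   # text a guardrail could rewrite
        return self.tts(reply_text)         # audio the caller hears


class RealtimeAgent:
    """Audio in, audio out: no guaranteed text layer to hook into."""

    def stream(self, audio_in: Iterator[bytes]) -> Iterator[bytes]:
        raise NotImplementedError("provider-specific realtime session")
```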
This guide walks through:
What stays the same from traditional voice evals
What fundamentally changes
The new risks to watch for—especially around workflows, tool use, and instruction following
How to adapt your eval strategy for this new paradigm
What’s Still the Same
Most core evaluation practices transfer cleanly from cascading voice stacks:
Multi-turn dialog testing
Probabilistic metrics (not binary pass/fail)
Simulation-driven testing
Task completion as the north star
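These practices fit in a few lines of code. The sketch below assumes a hypothetical run_simulation helper that drives one simulated multi-turn conversation and reports task completion; the metric is a completion rate over repeated runs rather than a single pass/fail.

```python
import random

def run_simulation(scenario: dict) -> bool:
    """Hypothetical helper: drive one simulated multi-turn conversation
    against the agent and report whether the task was completed."""
    return random.random() < 0.8  # stand-in outcome for illustration only

def task_completion_rate(scenario: dict, n_runs: int = 25) -> float:
    """Probabilistic metric: completion rate over repeated simulations,
    not a single binary pass/fail."""
    return sum(run_simulation(scenario) for _ in range(n_runs)) / n_runs

if __name__ == "__main__":
    scenario = {"goal": "reschedule an appointment", "max_turns": 8}
    print(f"Task completion: {task_completion_rate(scenario):.0%}")
```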
What’s Different—and Why It Matters
No Online Guardrails
You can’t intercept or rewrite responses in realtime. Evaluation needs to focus on post-hoc analysis and offline detection of safety or quality issues.
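Since nothing can rewrite the audio mid-stream, checks run after the fact over recorded turns. A minimal sketch, assuming you transcribe recordings offline and scan agent turns against a policy; the blocklist here is a stand-in for a real classifier or LLM judge.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str        # "user" or "agent"
    transcript: str  # offline transcript of the recorded audio turn

# Stand-in policy check; in practice this would be a classifier or LLM judge
# run over recorded calls, since nothing can rewrite the audio in realtime.
POLICY_PHRASES = ("guaranteed refund", "medical diagnosis")

def flag_policy_issues(call: list[Turn]) -> list[Turn]:
    """Post-hoc detection: return agent turns that trip the policy check."""
    return [
        turn for turn in call
        if turn.role == "agent"
        and any(phrase in turn.transcript.lower() for phrase in POLICY_PHRASES)
    ]
```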
No Text-to-Text Simulations
Text-only simulations are insufficient. Realtime systems operate purely on audio, so you need to simulate audio-in / audio-out flows to test behavior realistically.
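In practice that means every scripted user turn has to become audio before it reaches the agent, and the agent's audio reply has to be captured for scoring. A minimal sketch, where synthesize_speech and stream_to_agent are hypothetical adapters for whatever TTS and realtime session you use:

```python
def synthesize_speech(text: str, voice: str = "default") -> bytes:
    """Hypothetical adapter: render one scripted user turn as audio
    with whatever TTS you use for simulation."""
    raise NotImplementedError

def stream_to_agent(user_audio: bytes) -> bytes:
    """Hypothetical adapter: send audio into the realtime session and
    return the agent's reply audio."""
    raise NotImplementedError

def simulate_call(scripted_turns: list[str]) -> list[bytes]:
    """Audio-in / audio-out simulation: no text ever reaches the agent."""
    reply_audio = []
    for turn_text in scripted_turns:
        user_audio = synthesize_speech(turn_text)
        reply_audio.append(stream_to_agent(user_audio))
    return reply_audio
```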
Key Failure Mode: Workflow Execution
Realtime models often perform worse than cascading stacks at structured tasks such as:
Step-by-step instruction following
Form-filling or data capture
API or tool invocation
Why? Because:
They're optimized for low-latency turn-taking, not reasoning depth
Without intermediate text layers, it's harder to catch misunderstandings early
The lack of text-level hooks means tool invocation logic often depends on brittle pattern matching or non-transparent model behavior
What to Test:
Workflow coverage: Can the model reliably complete multi-step flows?
Tool accuracy: Does it call the right tool, with the right inputs, at the right time?
Instruction fidelity: Does it skip steps or hallucinate actions?
Repair behavior: If the user clarifies or corrects, does the agent recover?
Eval Tip: Track not just whether tools were called, but when, how, and why. A tool used too early or with the wrong slot values can be worse than no tool at all.
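One way to make that tip concrete is to diff the expected tool calls against what the agent actually did, including slot values and timing. A sketch with illustrative names:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    turn: int = 0  # which conversation turn the call happened on

def check_tool_calls(expected: list[ToolCall], actual: list[ToolCall]) -> list[str]:
    """Report wrong tools, wrong or missing slot values, and calls made too early."""
    issues = []
    for i, exp in enumerate(expected):
        if i >= len(actual):
            issues.append(f"missing call: {exp.name}")
            continue
        act = actual[i]
        if act.name != exp.name:
            issues.append(f"step {i}: expected {exp.name}, got {act.name}")
            continue
        bad_slots = {k: act.args.get(k) for k, v in exp.args.items()
                     if act.args.get(k) != v}
        if bad_slots:
            issues.append(f"{exp.name}: wrong or missing slots {bad_slots}")
        if act.turn < exp.turn:
            issues.append(f"{exp.name}: called too early (turn {act.turn})")
    issues += [f"unexpected call: {extra.name}" for extra in actual[len(expected):]]
    return issues
```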
Building a Realtime Eval Stack That Works
To effectively evaluate realtime voice-to-voice agents, your stack needs to include:
Audio-Driven Simulation
Synthetic or scripted user prompts in voice
LLM-backed user behavior with varied accents, pacing, and interruption
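A lightweight way to get coverage across accents, pacing, and interruptions is to parameterize the simulated callers and sweep the combinations. Field names below are illustrative:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class SimulatedCaller:
    accent: str
    speaking_rate: float      # 1.0 = normal pacing
    interruption_prob: float  # chance of barging in on an agent turn

def caller_matrix() -> list[SimulatedCaller]:
    """Enumerate the caller variations you want every scenario tested against."""
    accents = ["US", "UK", "Indian English"]
    rates = [0.8, 1.0, 1.3]
    interruptions = [0.0, 0.3]
    return [SimulatedCaller(a, r, p)
            for a, r, p in product(accents, rates, interruptions)]
```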
Behavioral Instrumentation
Tool call tracing
Slot value logging
Turn-by-turn latency and overlap
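If your provider doesn't surface these signals directly, you can log them yourself as structured events per turn. One possible record shape, sketched with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class TurnEvent:
    turn: int
    speaker: str                  # "user" or "agent"
    response_latency_ms: int      # gap between the previous turn ending and this one starting
    overlapped_prev_turn: bool    # barged in before the other side finished
    tool_calls: list[dict] = field(default_factory=list)  # tool name plus slot values

# Example record for one agent turn:
example = TurnEvent(turn=3, speaker="agent", response_latency_ms=420,
                    overlapped_prev_turn=False,
                    tool_calls=[{"name": "lookup_order", "order_id": "A123"}])
```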
Human + LLM Grading
Accuracy of tool usage
Clarity and completeness of instructions
“Felt natural” and “Did what I asked” scoring
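LLM grading can sit alongside human review with a small rubric prompt. A sketch, where call_grader_llm is a placeholder for whatever model API you use and the rubric dimensions mirror the criteria above:

```python
RUBRIC = """Rate the agent's reply from 1 to 5 on each dimension:
- tool_usage: called the right tool with the right inputs
- instruction_clarity: instructions were clear and complete
- felt_natural: pacing and tone felt like a real conversation
- did_what_was_asked: the user's request was actually fulfilled
Return JSON with one integer per dimension."""

def call_grader_llm(prompt: str) -> dict:
    """Placeholder for your LLM provider call; returns the parsed JSON scores."""
    raise NotImplementedError

def grade_turn(user_request: str, agent_transcript: str, tool_log: list[dict]) -> dict:
    prompt = (f"{RUBRIC}\n\nUser request: {user_request}\n"
              f"Agent reply (transcribed): {agent_transcript}\n"
              f"Tool calls: {tool_log}")
    return call_grader_llm(prompt)
```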
Continuous Regression Testing
Focused tests on workflows you care about
Golden paths with strict success criteria
Edge cases for interruptions, ambiguity, or noise
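Golden paths translate naturally into regression tests that run on every agent change. A pytest-style sketch, reusing the hypothetical task_completion_rate helper from earlier (imported here from a hypothetical eval_helpers module):

```python
import pytest

# Hypothetical module containing the task_completion_rate helper sketched earlier.
from eval_helpers import task_completion_rate

GOLDEN_PATHS = {
    "book_appointment": {"goal": "book a cleaning for next Tuesday", "min_rate": 0.90},
    "cancel_order":     {"goal": "cancel order A123",                "min_rate": 0.95},
}

@pytest.mark.parametrize("name,spec", GOLDEN_PATHS.items())
def test_golden_path(name, spec):
    rate = task_completion_rate({"goal": spec["goal"]}, n_runs=20)
    assert rate >= spec["min_rate"], f"{name} regressed: completion rate {rate:.0%}"
```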
How Coval Handles This for You
Coval is built for end-to-end evaluation of realtime voice agents. We help you:
Simulate realistic interactions with audio prompts and dynamic user flows
Evaluate workflows with structured success tracking and tool usage accuracy
Monitor performance in production and identify regressions over time
Debug failures with turn-level audio, latency, and tool call visualizations
TL;DR
Realtime voice-to-voice agents feel magical—but evaluating them requires more than just listening to smooth voices. You need to dig into workflow fidelity, tool call correctness, and instruction execution, all while operating without guardrails or intermediate text.
Coval gives you the full-stack eval platform to do exactly that.
→ Test your realtime voice agent with Coval
Catch failures that transcripts miss. Track what matters. Get better, faster.