Voice AI Latency: What Causes Delays and How to Fix Them
Mar 3, 2026
A 200ms pause in a phone conversation feels natural. A 500ms pause feels like the other person is thinking. An 800ms pause feels like something is wrong. And anything past 1,200ms makes the caller start repeating themselves, talking over the agent, or hanging up.
Voice AI latency is the single most common complaint from teams building conversational AI agents. The technology stack between a caller's question and the agent's response involves five processing layers, each contributing its own delay. When those delays compound, the result is an AI agent that sounds smart but feels broken.
This is not just a user experience problem. High latency directly causes cascading failures in voice AI systems. Callers start interrupting the agent because they think it has stopped listening. The agent's voice activity detection (VAD) picks up the interruption and cuts off its own response. The conversation derails. What should have been a 90-second call turns into a three-minute frustration spiral -- or a hang-up.
Understanding where latency comes from and how to fix it is not optional for teams shipping voice AI to production.
The Voice AI Latency Stack
Every voice AI interaction follows the same basic pipeline. The caller speaks, the system processes, and the agent responds. But "the system processes" hides an enormous amount of complexity.
Here is the full latency chain for a typical voice AI turn:

Caller speech -> network/telephony transport -> voice activity detection (endpointing) -> speech-to-text -> LLM inference (plus any tool calls) -> text-to-speech -> network return -> caller hears the response
Each layer adds latency. The total round-trip time -- from the moment the caller finishes speaking to the moment they hear the first syllable of the response -- is the metric that matters most. In the voice AI community, this is often called "response latency" or "turn latency."
The target? Under 500ms for real-time conversational experiences. Under 800ms is acceptable for most use cases. Anything above 1,200ms is perceptibly broken.
Let's break down each layer.
Layer 1: Network and Telephony Overhead
Before any AI processing happens, audio has to travel from the caller's phone to your system. This is pure network latency, and it varies significantly depending on the connection type.
What Causes Delays
PSTN routing -- Traditional phone network calls traverse multiple carrier switches. Each hop adds 10-30ms. International calls can add 100-200ms just from routing.
Codec transcoding -- Audio often gets re-encoded between network segments. PSTN uses G.711 (narrowband, 8kHz). If your system expects a different format, transcoding adds latency.
SIP trunking overhead -- SIP session establishment and media negotiation add latency on the first turn. Subsequent turns benefit from the established session.
WebRTC jitter buffers -- Browser-based voice AI uses jitter buffers to smooth out network packet variation. These buffers add 40-100ms of intentional delay to prevent choppy audio.
Geographic distance -- If your AI infrastructure is in us-east-1 and your caller is in Tokyo, physics adds ~70ms each way.
What Good Looks Like
| Connection Type | Typical One-Way Latency | Notes |
|---|---|---|
| WebRTC (same region) | 20-50ms | Lowest latency option |
| SIP trunk (domestic) | 30-80ms | Depends on carrier hops |
| PSTN (domestic) | 50-120ms | Standard phone calls |
| PSTN (international) | 100-250ms | Major contributor for global deployments |
| Mobile (cellular) | 60-150ms | Variable based on signal strength |
How to Optimize
Co-locate infrastructure with your telephony provider. If you use Twilio, deploy in the same AWS region as Twilio's media servers.
Use SIP trunking instead of PSTN where possible. Direct SIP connections eliminate carrier hops and give you wideband audio (16kHz vs. 8kHz), which also improves STT accuracy.
Minimize codec transcoding by matching your system's expected audio format to your telephony provider's native codec.
Use regional deployment for global products. Serve callers from the closest geographic region.
Layer 2: Voice Activity Detection (VAD)
VAD determines when the caller has finished speaking so the system can begin processing. This is deceptively tricky, and misconfigured VAD is one of the most common sources of perceived latency.
What Causes Delays
Endpointing delay -- VAD needs to confirm the caller has actually stopped speaking, not just paused mid-sentence. Most VAD systems wait 200-600ms of silence before triggering. Too short and the system cuts the caller off mid-thought. Too long and every turn feels sluggish.
Background noise confusion -- In noisy environments (cars, coffee shops, wind), VAD may not detect the end of speech promptly because ambient noise keeps the audio signal above the silence threshold.
Breathing and filler words -- "Um," "uh," and audible breaths can cause VAD to think the caller is still speaking, adding unnecessary wait time.
What Good Looks Like
Endpointing delay: 200-400ms for conversational applications. Below 200ms risks clipping speech. Above 500ms adds noticeable lag to every turn.
False positive rate: Less than 5% of turns should trigger premature endpointing (cutting off the caller).
Total VAD processing: Under 50ms computational overhead on top of the endpointing delay.
How to Optimize
Tune endpointing thresholds per use case. An appointment scheduling agent can tolerate a longer endpointing delay (400ms) because users speak in complete sentences. A fast-paced transactional agent (checking order status) benefits from shorter thresholds (200-250ms).
Use adaptive VAD that adjusts sensitivity based on background noise levels detected at the start of the call.
Implement semantic endpointing where the LLM helps determine if the user's utterance is complete, rather than relying solely on silence detection. This is more computationally expensive but dramatically reduces both premature cutoffs and unnecessary waiting.
Test with realistic noise conditions. If your callers are truck drivers, test with road noise. If they are in hospitals, test with ambient clinical noise.
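To make the endpointing tradeoff concrete, here is a minimal silence-based endpointer sketch. The class name, energy threshold, and frame interface are illustrative assumptions, not any particular VAD library's API; production systems use trained VAD models, but the threshold-and-timer logic is the part you tune.

```python
class SilenceEndpointer:
    """Minimal silence-based endpointer sketch: fires once frame energy
    stays below a threshold for `endpoint_ms` of continuous silence.
    All values are illustrative, not tuned defaults."""

    def __init__(self, silence_threshold=0.02, endpoint_ms=300):
        self.silence_threshold = silence_threshold  # RMS energy floor
        self.endpoint_ms = endpoint_ms              # required silence duration
        self._silence_start = None

    def process_frame(self, energy, now_ms):
        """Feed one audio frame's RMS energy and timestamp; returns True
        when the caller is judged to have finished speaking."""
        if energy >= self.silence_threshold:
            self._silence_start = None    # caller is (still) speaking
            return False
        if self._silence_start is None:
            self._silence_start = now_ms  # silence just began
        return (now_ms - self._silence_start) >= self.endpoint_ms
```

Lowering `endpoint_ms` below ~200 makes the `return True` fire during natural mid-sentence pauses, which is exactly the premature-cutoff failure described above.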
Layer 3: Speech-to-Text (STT)
Once VAD confirms the caller has stopped speaking, the audio chunk gets sent to an STT service for transcription. This is the first major processing bottleneck.
What Causes Delays
Model size -- Larger, more accurate STT models take longer to process. There is always a tradeoff between transcription accuracy and speed.
Audio chunk size -- Batch STT processes the entire utterance at once. Streaming STT processes audio in small chunks (100-300ms) and returns partial results. Streaming is faster but less accurate.
Network round-trip to STT API -- If you are using a cloud STT service (Deepgram, Google Cloud Speech, AWS Transcribe, Azure Speech), every audio chunk makes a network round-trip. Latency depends on your proximity to the provider's servers.
Language complexity -- Some languages require more processing time. Multilingual models that auto-detect the language add overhead.
Audio quality -- Noisy or low-quality audio takes longer to transcribe accurately. PSTN narrowband audio (8kHz) is harder to transcribe than wideband (16kHz) or WebRTC audio.
What Good Looks Like
| STT Mode | Typical Latency | Accuracy Tradeoff |
|---|---|---|
| Streaming (partial results) | 100-300ms | Lower accuracy, faster |
| Streaming (final result) | 200-500ms | Good accuracy, moderate speed |
| Batch processing | 500-2,000ms | Highest accuracy, slowest |
| On-device / edge | 50-150ms | Model-size limited |
Target: STT time-to-first-byte under 200ms for streaming mode. Final transcription under 500ms.
How to Optimize
Use streaming STT and start LLM processing on partial transcription results. You do not need the complete transcript to begin generating a response -- the first few words often indicate the caller's intent.
Choose the right model size. For simple command recognition ("yes," "no," "transfer me"), a smaller, faster model works fine. For complex medical or legal conversations, invest in a larger model and accept the latency tradeoff.
Self-host for latency-critical deployments. Running Whisper or another open-source STT model on your own GPU eliminates the network round-trip to a cloud API.
Match audio quality to STT expectations. Use wideband audio (16kHz+) where possible. Narrowband PSTN audio hurts both accuracy and speed.
Monitor STT word error rate (WER) alongside latency. A faster STT that produces garbled transcripts forces the LLM to spend more time making sense of the input, negating the speed gain.
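The "start LLM work on partial results" advice can be sketched as a handler over a stream of STT results. The result shape (`text`/`is_final` fields), the word-count trigger, and the `start_llm` callback are assumptions for illustration; real streaming STT APIs differ in field names but expose the same partial/final distinction.

```python
def handle_partials(partials, min_words=3, start_llm=None):
    """Consume streaming STT results (dicts with 'text' and 'is_final')
    and kick off LLM work as soon as a partial has enough words, rather
    than waiting for the final transcript. Returns the text sent early
    and the final transcript."""
    early_sent = None
    final = None
    for result in partials:
        text = result["text"]
        if result["is_final"]:
            final = text
        elif early_sent is None and len(text.split()) >= min_words:
            early_sent = text
            if start_llm:
                start_llm(text)  # begin prompt prefill / intent routing early
    return early_sent, final
```

The early text is often enough to route intent ("where is my..." is clearly an order-status query) while the caller is still finishing the sentence.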
Layer 4: LLM Inference
This is where the agent "thinks" -- understanding the caller's intent and generating a response. For most voice AI systems, LLM inference is the single largest latency contributor.
What Causes Delays
Time to first token (TTFT) -- The time between sending the prompt and receiving the first token of the response. This is the metric that matters for voice AI because TTS can start synthesizing speech as soon as the first tokens arrive. TTFT depends on model size, prompt length, and provider load.
Context window size -- Longer conversations mean larger prompts. Each turn adds tokens to the context window, increasing processing time. A 20-turn conversation sends significantly more tokens than a 3-turn conversation.
Tool call execution -- When the agent needs to call external APIs (look up an order, check availability, transfer a call), tool execution adds latency. Each tool call is a network round-trip, and some tools are slow.
Provider queue time -- Cloud LLM providers (OpenAI, Anthropic, Google) experience variable load. During peak hours, requests may queue before processing begins. This is invisible to you but real.
Reasoning overhead -- Reasoning models (o1, o3, etc.) deliberately "think longer" before responding. This produces better answers but adds seconds, not milliseconds, of latency. Most voice AI applications cannot tolerate this.
What Good Looks Like
| Component | Target | Acceptable | Problematic |
|---|---|---|---|
| Time to first token (TTFT) | <200ms | 200-500ms | >500ms |
| Token generation speed | >80 tokens/sec | 50-80 tokens/sec | <50 tokens/sec |
| Tool call execution | <200ms | 200-500ms | >500ms per call |
| Total LLM turn time | <500ms | 500-1,000ms | >1,000ms |
How to Optimize
Minimize prompt size. Strip conversation history to only the relevant context. Summarize earlier turns instead of including full transcripts. Use system prompts efficiently.
Use streaming token generation. Start TTS synthesis as soon as the first few tokens arrive rather than waiting for the complete response. This is the single highest-impact optimization for most voice AI systems.
Choose the right model. GPT-4o-mini or Claude Haiku-class models offer dramatically faster TTFT than frontier models. For many voice AI use cases -- appointment scheduling, order status, FAQ -- smaller models are both faster and sufficient.
Cache common responses. If 30% of calls ask "What are your hours?", cache the response and skip the LLM entirely for recognized intents.
Parallelize tool calls. If the agent needs to call two APIs, make both calls simultaneously rather than sequentially.
Avoid reasoning models for real-time voice. Save o1/o3-class models for offline analysis, not live conversations.
Use a dedicated inference endpoint rather than shared API endpoints when latency consistency matters. Provisioned throughput from cloud providers eliminates queue time variability.
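The tool-call parallelization advice is a one-line change in most async stacks. A minimal sketch, with two hypothetical tool functions standing in for real API calls:

```python
import asyncio

async def lookup_order(order_id):
    # stand-in for a ~100ms order-lookup API round-trip (hypothetical tool)
    await asyncio.sleep(0.1)
    return {"order": order_id, "status": "shipped"}

async def check_availability(slot):
    # second, independent API call (hypothetical tool)
    await asyncio.sleep(0.1)
    return {"slot": slot, "available": True}

async def run_tools(order_id, slot):
    # gather() runs both independent calls concurrently, so wall time is
    # roughly max(latencies) instead of their sum
    return await asyncio.gather(lookup_order(order_id), check_availability(slot))
```

With two 100ms tools, the sequential version costs ~200ms and the `gather()` version ~100ms. This only works when the calls are genuinely independent; if the second call needs the first call's result, they stay sequential.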
Layer 5: Text-to-Speech (TTS)
The LLM's text response needs to be converted back to audio before the caller hears it. TTS latency is the final major processing delay.
What Causes Delays
Model complexity -- Higher-quality, more natural-sounding TTS models take longer to synthesize. The tradeoff between "sounds like a robot" and "sounds like a human" directly maps to latency.
Sentence length -- Longer responses take longer to synthesize. This is another argument for concise agent responses.
Streaming vs. batch -- Like STT, TTS can operate in streaming mode (synthesizing speech as text tokens arrive) or batch mode (waiting for the complete text before synthesizing). Streaming is dramatically faster for perceived latency.
Voice cloning / custom voices -- Custom voice models often run on specialized infrastructure with limited capacity, adding queue time.
SSML processing -- Speech Synthesis Markup Language (SSML) tags for prosody, emphasis, and pauses add processing overhead.
What Good Looks Like
| TTS Mode | Typical Time to First Audio | Notes |
|---|---|---|
| Streaming (neural) | 100-300ms | Best for real-time voice AI |
| Batch (neural) | 300-1,000ms | Higher quality, higher latency |
| Streaming (basic) | 50-150ms | Fast but less natural |
| Custom voice (streaming) | 200-500ms | Depends on provider and model |
Target: TTS time-to-first-byte under 200ms. Combined with LLM streaming, the caller should hear the first syllable within 400-600ms of the LLM starting to generate tokens.
How to Optimize
Stream LLM tokens directly into TTS. This is the most impactful architectural decision. Instead of waiting for the full LLM response, pipe tokens into TTS as they arrive. The caller starts hearing the response while the LLM is still generating it.
Use sentence-level chunking. Send complete sentences to TTS rather than individual tokens. This produces more natural prosody while maintaining streaming benefits.
Choose the right voice quality vs. speed tradeoff. For transactional calls (order status, appointment confirmation), a slightly less natural voice at half the latency is the right call. For sales or high-touch support, invest in higher-quality synthesis.
Co-locate TTS with your infrastructure. Same as STT -- minimize network distance to the TTS provider.
Pre-generate common phrases. Greetings, hold messages, and transfer announcements can be pre-synthesized and cached as audio files.
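Sentence-level chunking can be sketched as a generator that sits between the LLM token stream and the TTS call. The naive punctuation-based sentence split below is an illustrative assumption; production systems handle abbreviations, numbers, and SSML more carefully.

```python
import re

def sentence_chunks(token_stream):
    """Accumulate streaming LLM tokens and yield complete sentences, so
    TTS receives natural prosody units without waiting for the full
    response. Sentence detection here is a naive punctuation split."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # flush every complete sentence currently in the buffer
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence
```

Each yielded sentence goes to TTS immediately, so the caller hears the first sentence while the second is still being generated.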
The Complete Latency Budget
Here is a realistic latency budget for a well-optimized voice AI system:
| Layer | Optimized | Typical | Unoptimized |
|---|---|---|---|
| Network (one-way) | 30ms | 60ms | 150ms |
| VAD endpointing | 200ms | 350ms | 600ms |
| STT processing | 150ms | 300ms | 800ms |
| LLM TTFT | 150ms | 350ms | 1,000ms |
| LLM generation (streaming overlap) | 0ms* | 0ms* | 500ms |
| TTS synthesis | 100ms | 250ms | 600ms |
| Network (return) | 30ms | 60ms | 150ms |
| Total round-trip | 660ms | 1,370ms | 3,800ms |
*With streaming, LLM generation and TTS synthesis overlap -- the caller starts hearing audio while the LLM is still generating tokens. Without streaming, these are additive.
The difference between the optimized and unoptimized columns is the difference between a voice AI agent that feels natural and one that makes callers hang up.
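A latency budget like this is also worth encoding in code, so CI or monitoring can compare measured per-layer timings against it. A minimal sketch, using the numbers from the table above:

```python
# Per-layer budgets from the table above, in milliseconds
budget = {
    "network_out": {"optimized": 30,  "typical": 60,  "unoptimized": 150},
    "vad":         {"optimized": 200, "typical": 350, "unoptimized": 600},
    "stt":         {"optimized": 150, "typical": 300, "unoptimized": 800},
    "llm_ttft":    {"optimized": 150, "typical": 350, "unoptimized": 1000},
    "llm_gen":     {"optimized": 0,   "typical": 0,   "unoptimized": 500},  # 0 = overlapped with TTS
    "tts":         {"optimized": 100, "typical": 250, "unoptimized": 600},
    "network_ret": {"optimized": 30,  "typical": 60,  "unoptimized": 150},
}

def total(profile):
    """Sum the round-trip budget for one column of the table."""
    return sum(layer[profile] for layer in budget.values())
```

Summing the columns reproduces the table's totals: 660ms optimized, 1,370ms typical, 3,800ms unoptimized.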
Measuring Latency in Production
You cannot optimize what you do not measure. And measuring voice AI latency is harder than measuring web page load time because the latency stack has multiple layers, each owned by different services.
What to Measure
End-to-end response latency -- The gold standard metric. Measured from the end of the caller's speech to the beginning of the agent's audible response. This is what the caller experiences.
Per-layer latency -- Break down the total into STT time-to-first-byte, LLM TTFT, LLM generation time, TTS time-to-first-byte, and network overhead. Without this breakdown, you are optimizing blind.
Latency percentiles -- Average latency is misleading. P50, P90, P95, and P99 matter because the worst-performing calls are the ones that cause hang-ups and complaints. If your P50 is 400ms but your P99 is 3,000ms, one in a hundred calls is badly broken.
Latency by turn -- First-turn latency is often higher than subsequent turns due to cold starts, session establishment, and context loading. Track latency per turn position to identify warm-up issues.
Latency under load -- A system that hits 300ms latency with one concurrent call but 2,000ms with fifty concurrent calls has a scaling problem. Test at production concurrency levels.
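The point about averages hiding tail latency is easy to demonstrate. A small sketch using the nearest-rank percentile method on a synthetic sample (the numbers are invented for illustration): the mean looks healthy at 442ms while the P99 is 1,500ms.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of samples are <= it. Simple and monotonic, which is all an
    alerting threshold needs."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 97 fast calls plus a long tail of three slow ones
latencies_ms = [400] * 97 + [900, 1500, 3000]
p50 = percentile(latencies_ms, 50)   # 400 -- looks fine
p99 = percentile(latencies_ms, 99)   # 1500 -- one in a hundred calls is broken
```

Alert on P95/P99 against your budget, not on the mean.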
How to Measure It
Instrumenting the full latency stack requires observability at every layer. OpenTelemetry (OTel) has become the standard approach -- instrument each component (STT, LLM, TTS, tool calls) with spans that capture timing data, then aggregate and visualize in a trace viewer.
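The span pattern looks like the following sketch. This is a dependency-free stand-in using a context manager; in production you would use OpenTelemetry's actual tracer (`tracer.start_as_current_span(...)`) and an exporter, but the wrap-each-layer shape is the same.

```python
import time
from contextlib import contextmanager

timings = {}  # stand-in for an OTel span exporter

@contextmanager
def span(name):
    """Record wall-clock duration of one pipeline stage in milliseconds,
    mimicking the shape of an OpenTelemetry span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# wrap each layer of a single turn
with span("stt"):
    time.sleep(0.01)   # stand-in for streaming STT
with span("llm_ttft"):
    time.sleep(0.01)   # stand-in for waiting on the first token
with span("tts_ttfb"):
    time.sleep(0.01)   # stand-in for the first synthesized audio chunk
```

With every layer wrapped, "the agent is slow" becomes "TTS TTFB regressed from 150ms to 420ms last Tuesday."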
Platforms like Coval provide built-in latency metrics that measure response time using voice activity detection to identify speaker transitions, plus trace metrics that break down per-layer latency (LLM TTFT, TTS TTFT, STT TTFT) via OpenTelemetry integration. The advantage of automated measurement is scale -- manually timing calls tells you about those specific calls, while automated measurement across hundreds or thousands of simulated conversations gives you statistically significant latency profiles.
Target latency benchmarks from Coval's documentation:
Under 500ms for real-time conversations
Under 2 seconds for complex query responses
Time to first audio under 1,000ms (fast and responsive)
Continuous Latency Monitoring
Latency is not a one-time optimization. It drifts over time due to:
Model provider changes -- Your LLM provider ships a new model version that is 50ms slower on average.
Traffic patterns -- Peak hours cause queue time at STT/LLM/TTS providers.
Infrastructure changes -- A new deployment, a different availability zone, a networking configuration change.
Conversation complexity drift -- As your agent handles more complex use cases, conversations get longer, context windows grow, and latency increases.
Set up automated latency monitoring with alerting thresholds. If P95 latency exceeds your target, you need to know before callers do.
Common Latency Pitfalls
Sequential Processing Instead of Streaming
The single most common mistake is building a sequential pipeline: wait for complete STT transcript, send full transcript to LLM, wait for complete LLM response, send complete text to TTS. This turns every layer into an additive delay.
The fix is streaming at every stage. Stream partial STT results into the LLM. Stream LLM tokens into TTS. Start playing audio as soon as the first TTS chunk is ready. This turns a 2,500ms sequential pipeline into a 600ms streaming pipeline.
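The difference can be modeled with a toy asyncio sketch. The stage latencies below are invented for illustration; the point is only that sequential time-to-first-audio pays for the full LLM generation, while the streaming version pays only for time-to-first-token.

```python
import asyncio

# Invented stage latencies in seconds, for illustration only
STT, LLM_TTFT, LLM_FULL, TTS_TTFB = 0.03, 0.04, 0.12, 0.03

async def sequential_first_audio():
    await asyncio.sleep(STT)        # wait for the complete transcript
    await asyncio.sleep(LLM_FULL)   # wait for the complete LLM response
    await asyncio.sleep(TTS_TTFB)   # only then start synthesis
    # time to first audio ~= STT + LLM_FULL + TTS_TTFB

async def streaming_first_audio():
    await asyncio.sleep(STT)        # final transcript (partials already streamed)
    await asyncio.sleep(LLM_TTFT)   # wait only for the first tokens...
    await asyncio.sleep(TTS_TTFB)   # ...which are piped straight into TTS
    # time to first audio ~= STT + LLM_TTFT + TTS_TTFB
```

With these numbers, sequential first audio arrives at ~180ms of simulated time and streaming at ~100ms; with realistic full-generation times measured in seconds, the gap is far larger.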
Ignoring VAD Tuning
Many teams spend weeks optimizing their LLM and TTS latency but never tune their VAD endpointing threshold. A 600ms endpointing delay adds 600ms to every single turn, no matter how fast the rest of your stack is.
Over-Engineering the System Prompt
A 3,000-token system prompt adds measurable latency to every LLM call. Be ruthless about prompt efficiency. Every unnecessary sentence in the system prompt costs real milliseconds on every turn.
Not Measuring Per-Layer Latency
"The agent is slow" is not a useful diagnosis. Is it slow because STT is taking 800ms? Because the LLM is queuing? Because tool calls are timing out? Without per-layer instrumentation, you are guessing.
Testing Only on Good Network Conditions
Your lab has perfect Wi-Fi. Your callers are on cellular in a moving car. Test latency under realistic network conditions, including packet loss, jitter, and high-latency connections.
FAQ
What is a good response latency for a voice AI agent?
Under 500ms from the end of the caller's speech to the start of the agent's audible response is the target for natural-feeling conversations. Under 800ms is acceptable for most business use cases. Above 1,200ms, callers will start talking over the agent or hanging up. These targets assume the caller is on a domestic connection -- international calls add unavoidable network latency.
Which layer contributes the most latency?
In most voice AI systems, LLM inference is the largest contributor, accounting for 30-50% of total round-trip latency. However, this varies significantly by architecture. Systems using large context windows or reasoning models may see LLM latency dominate at 60-70%. Well-optimized systems with small models may find VAD endpointing is the largest fixed cost.
Does streaming really make that much difference?
Yes. Streaming transforms the latency equation from additive (each layer waits for the previous one to complete) to overlapping (layers process in parallel). A sequential pipeline with 300ms STT + 400ms LLM + 300ms TTS = 1,000ms total. With streaming, the same components produce first audio in approximately 500-600ms because TTS starts synthesizing as soon as the first LLM tokens arrive, while the LLM started generating as soon as partial STT results arrived.
How do I identify which layer is causing my latency issues?
Instrument each layer with timing spans using OpenTelemetry or a similar observability framework. Measure STT time-to-first-byte, LLM time-to-first-token, TTS time-to-first-byte, and any tool call execution time separately. Compare these against the targets in the latency budget table above. The layer that exceeds its budget is your bottleneck.
Can I use a reasoning model (o1, o3) for real-time voice AI?
Generally no. Reasoning models add seconds of "thinking time" before generating output, which is incompatible with sub-second response latency targets. Use standard generation models (GPT-4o, Claude Sonnet, Gemini Flash) for real-time voice interactions and reserve reasoning models for offline analysis, quality evaluation, or non-real-time applications.
How does Coval measure voice AI latency?
Coval provides built-in latency metrics that use voice activity detection to measure the delay between user input and agent response in milliseconds. For deeper analysis, Coval's OpenTelemetry trace integration captures per-layer timing -- LLM time-to-first-byte, TTS time-to-first-byte, STT time-to-first-byte, and tool call latency -- across automated simulations. Running hundreds of simulated conversations produces statistically meaningful latency profiles rather than anecdotal measurements from a few manual test calls.
Ready to measure latency across every layer of your voice AI stack? See how automated simulation and per-layer trace metrics work.
-> coval.dev
