Voice AI Echo Cancellation: Causes, Fixes, and Best Practices

Feb 24, 2026

"We have customers who can't use our voice AI with the speaker on as it hears its own voice and keeps talking."

That complaint, posted by a voice AI founder in a community of 200+ builders, captures the single most frustrating audio issue in voice AI: echo. The agent's TTS output gets picked up by the microphone, fed back into the STT as if the user said it, and the agent starts responding to itself. The conversation spirals. The user gives up.

Echo cancellation in voice AI is a harder problem than it was in traditional telephony. Voice AI agents generate their own audio output (TTS), which loops back through the user's device into the input stream. The TTS voice is often more consistent and clearer than a human voice, making it even harder for echo cancellation algorithms to distinguish it from actual user speech. And the consequences are worse -- an echo in a human-to-human call is annoying, but echo in a voice AI system breaks the conversational loop entirely.

This guide covers why echo happens in voice AI, what WebRTC AEC can and can't do, browser-specific gotchas, architectural solutions, and how to detect and monitor echo issues in production.

What Causes Echo in Voice AI

Echo in voice AI has a specific mechanism that's distinct from traditional telephony echo.

The TTS-to-Mic Feedback Loop

The primary echo path in voice AI applications:

  1. The agent generates a TTS response.

  2. The audio plays through the user's device speaker.

  3. The device microphone picks up the speaker output.

  4. The STT engine transcribes the captured TTS audio as user input.

  5. The LLM processes this as if the user said it.

  6. The agent generates a new response based on its own previous output.

  7. The cycle repeats.

This is fundamentally different from network echo (where delayed audio reflects back through the telephony system) or acoustic echo in conference calls (where room reflections cause heard-back delays). In voice AI, the echo isn't just an auditory annoyance -- it corrupts the input pipeline. The agent doesn't just hear itself; it processes its own words as a new user turn.

Device and Environment Factors

Speakerphone usage: The most common echo trigger. When a user is on their phone's speakerphone or a laptop without headphones, the physical distance between speaker and microphone is minimal, and the omnidirectional microphone captures everything in the room, including the speaker output. One founder reported: "Some of our users are using our app on a laptop with no earbuds, and the TTS is coming back in as user input occasionally."

Speakerphone on specific Android devices: Echo behavior varies dramatically by device. Some Android phones have excellent hardware echo cancellation; others introduce intermittent echo that's nearly impossible to reproduce in testing. The quality of the device's DSP (Digital Signal Processor) and the manufacturer's audio tuning make a significant difference.

Reflective environments: Hard surfaces (glass, tile, concrete) reflect sound waves back to the microphone with minimal absorption. A user in a glass-walled office or a tiled bathroom can experience echo that defeats software-based cancellation entirely. As one expert in the voice AI community noted, reflective environments like glass showers can defeat AEC entirely -- there's no known fix beyond requiring headphones.

High speaker volume: Higher playback volume means more energy hitting the microphone. Users who turn up their speaker volume (common in noisy environments or for hard-of-hearing users) amplify the echo problem proportionally.

Echo cancellation stabilization delay: Echo cancellation algorithms need time to learn the acoustic characteristics of the current environment. At the start of a call, there's typically a 3-4 second window where AEC hasn't yet adapted, and echo is more likely to occur. This is why the first few seconds of a voice AI conversation are particularly vulnerable.

WebRTC AEC: What It Does and Where It Fails

Most browser-based and WebRTC-based voice AI implementations rely on WebRTC's built-in Acoustic Echo Cancellation (AEC). Understanding its capabilities and limitations is essential.

How WebRTC AEC Works

WebRTC AEC operates by maintaining a model of the audio being played through the speaker (the "reference signal") and subtracting a predicted echo component from the microphone input. The process involves:

  1. Reference signal capture: The AEC algorithm has access to the audio being sent to the speaker.

  2. Adaptive filter: A filter model continuously adapts to the relationship between the speaker output and the microphone input, accounting for room acoustics, device characteristics, and speaker/mic positioning.

  3. Echo estimation and subtraction: The filter predicts what the echo component of the microphone input looks like and subtracts it, leaving (ideally) only the actual user's speech.
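The three steps above can be sketched with a toy time-domain NLMS (normalized least mean squares) canceller. This is a simplified stand-in for WebRTC's actual AEC, which works in the frequency domain and adds double-talk detection and nonlinear suppression; the class name and parameters here are illustrative.

```python
class NLMSEchoCanceller:
    """Toy acoustic echo canceller: adaptively models the speaker-to-mic
    path and subtracts the predicted echo from the microphone signal."""

    def __init__(self, taps=64, step=0.5, eps=1e-8):
        self.w = [0.0] * taps   # adaptive filter coefficients (echo path model)
        self.x = [0.0] * taps   # recent reference (speaker) samples
        self.step = step        # adaptation rate, 0 < step <= 1
        self.eps = eps          # avoids division by zero in normalization

    def process(self, mic_sample, ref_sample):
        # Step 1: capture the reference signal (what the speaker is playing).
        self.x = [ref_sample] + self.x[:-1]
        # Step 3: predict the echo component of the mic input...
        echo_est = sum(w * x for w, x in zip(self.w, self.x))
        # ...and subtract it; the residual is (ideally) only user speech.
        err = mic_sample - echo_est
        # Step 2: adapt the filter toward the observed echo path.
        norm = self.eps + sum(x * x for x in self.x)
        gain = self.step * err / norm
        self.w = [w + gain * x for w, x in zip(self.w, self.x)]
        return err
```

Running this against a simulated echo path (a delayed, attenuated copy of the reference) shows the residual shrinking as the filter converges, which is also why the first seconds of a call, before convergence, are the most echo-prone.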

Where WebRTC AEC Works Well

  • Headphone/earphone usage: Echo is minimal or nonexistent because the speaker output doesn't reach the microphone through the air. AEC has almost nothing to do.

  • Quiet environments with the device close to the user: The echo path is predictable and stable. AEC adapts quickly and maintains high cancellation quality.

  • Stable acoustic conditions: If the user isn't moving around and the environment doesn't change, AEC's adaptive filter converges and stays effective.

Where WebRTC AEC Fails

Browser-specific implementations: Not all browsers implement AEC equally. This is one of the most impactful and least discussed issues in voice AI.

| Browser | AEC Quality | Known Issues |
| --- | --- | --- |
| Chrome | Good | Best overall AEC implementation. Regularly updated. Still struggles with speakerphone on some devices. |
| Safari | Good | Generally strong AEC, especially on Apple devices where hardware and software are tightly integrated. |
| Firefox | Poor to Fair | Noticeably worse AEC than Chrome and Safari. Multiple voice AI builders have reported significantly higher echo rates on Firefox. One founder noted that Firefox is "worse than Chrome/Safari for AEC." |
| Edge | Good | Uses Chromium's WebRTC stack, so AEC quality matches Chrome. |
| Mobile browsers | Varies | Depends heavily on the device's hardware AEC and the browser's ability to use it. iOS Safari is generally good; Android Chrome varies by device. |

Speakerphone and speaker mode: Even Chrome's AEC struggles when the user is on speakerphone. The speaker output is loud, the microphone is sensitive, and the acoustic coupling between them is strong. AEC can reduce but not eliminate echo in these conditions.

Non-linear distortion: When speaker output is loud enough to cause distortion in the speaker driver or microphone capsule, the echo signal becomes non-linearly transformed. Linear adaptive filters can't model this distortion, so echo leaks through.

Rapid acoustic changes: If the user moves the phone, turns their head, or if background noise suddenly changes, the adaptive filter needs time to re-converge. During this transition (which can take several hundred milliseconds), echo cancellation degrades.

Double-talk: When both the user and the agent are speaking simultaneously (barge-in scenarios), AEC has the hardest job. It must cancel the echo while preserving the user's speech -- a challenging signal separation problem. False suppression of user speech during double-talk is a common AEC failure mode.

Browser-Specific Issues and Workarounds

Browser differences in AEC implementation deserve special attention because they directly affect the user experience and are completely outside your control.

Firefox

Firefox's WebRTC AEC implementation is the weakest among major browsers. If your voice AI application runs in the browser, Firefox users will likely experience more echo problems than Chrome or Safari users.

Workarounds:

  • Detect Firefox via user agent and display a recommendation to use Chrome or Safari.

  • Implement more aggressive server-side echo suppression for Firefox sessions.

  • Lower the TTS playback volume for Firefox users to reduce the echo signal energy.

  • Default Firefox sessions to push-to-talk mode if echo becomes unmanageable.
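The first workaround can be as simple as a server-side User-Agent check. A minimal sketch follows; the function name is ours, and UA sniffing is inherently best-effort (Chromium-based browsers do not include the `Firefox/` token, so matching it is sufficient for this purpose).

```python
def needs_echo_mitigation(user_agent: str) -> bool:
    """Return True for browsers with known-weak built-in AEC (Firefox).

    Firefox UAs contain 'Firefox/<version>'; Chrome, Edge, and Safari
    do not, so a simple token check is enough for a recommendation banner.
    """
    return "firefox/" in user_agent.lower()
```

A YES result would trigger the mitigations above: a Chrome/Safari recommendation, lower TTS volume, or a push-to-talk default for that session.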

Chrome on Android

Chrome on Android generally has good AEC, but quality varies significantly by device manufacturer. Samsung, Pixel, and OnePlus devices tend to have strong hardware AEC. Some budget Android devices have minimal or no hardware echo cancellation, leaving software AEC to do all the work.

Workaround: If possible, test against a matrix of popular Android devices in your target market. For devices with known poor AEC, implement the same mitigations as Firefox.

Safari on iOS

Safari on iOS generally provides excellent AEC, partly because Apple controls both the hardware and software stack. However, there's a specific issue: iOS may suspend WebRTC audio processing when the app goes to the background or the screen locks, which can reset the AEC adaptive filter and cause echo when the user returns.

Workaround: Implement session keepalive signals and handle the audio context resumption gracefully. Re-mute the agent briefly when detecting a session resume to give AEC time to re-adapt.

Architectural Solutions for Echo

When WebRTC AEC isn't sufficient, architectural changes can reduce or eliminate echo.

Server-Side Echo Cancellation

Instead of relying on the client browser's AEC, process echo cancellation on the server:

  1. The server has the reference signal (the TTS audio it generated).

  2. The server receives the user's microphone audio (with echo).

  3. A server-side AEC algorithm subtracts the echo using the known reference signal.

  4. Only the cleaned audio is sent to STT.

Advantages: You control the AEC algorithm and its configuration. No dependency on browser or device capabilities. Can use more sophisticated algorithms than what's available in WebRTC.

Disadvantages: Adds latency (audio must round-trip to the server for processing). Requires the server to maintain per-session state for the adaptive filter. Higher infrastructure cost.

Libraries and tools for server-side AEC:

  • SpeexDSP: Open-source, widely used, C-based AEC. Available as Python bindings.

  • WebRTC Native Code: Google's WebRTC AEC module can be used server-side outside of a browser context.

  • Pipecat echo cancellation: The Pipecat framework (from Daily) includes built-in echo cancellation capabilities. Daily's own infrastructure handles AEC at the media server level.

Audio Ducking

Audio ducking reduces the microphone input level while the agent is speaking, preventing the TTS output from being captured as user input.

How it works:

  1. When the agent starts speaking (TTS begins), reduce the microphone gain or apply a gate.

  2. When the agent stops speaking, restore normal microphone gain.

  3. The microphone is effectively "listening less" while the agent talks.

Advantages: Simple to implement. Eliminates the echo source rather than trying to cancel it.

Disadvantages: The user can't interrupt the agent (barge-in is disabled). Any speech the user produces during the agent's turn is lost. This is acceptable for some use cases (IVR, appointment confirmations) but unacceptable for conversational agents.

Partial ducking variant: Instead of fully muting the microphone, reduce the gain by 10-20 dB while the agent speaks. This still allows barge-in detection for loud, intentional interruptions while suppressing quieter echo.
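The partial-ducking variant reduces to a one-line gain computation applied per frame. A minimal sketch, assuming float sample frames; the 15 dB default is an assumption within the 10-20 dB range mentioned above.

```python
def ducking_gain(agent_speaking, duck_db=15.0):
    """Linear mic gain: unity while the agent is silent, attenuated by
    duck_db (converted from decibels to a linear factor) while it speaks."""
    return 10 ** (-duck_db / 20.0) if agent_speaking else 1.0

def apply_duck(frame, agent_speaking, duck_db=15.0):
    """Scale one frame of float samples by the current ducking gain."""
    g = ducking_gain(agent_speaking, duck_db)
    return [s * g for s in frame]
```

A 20 dB duck multiplies samples by 0.1, which leaves loud, deliberate interruptions detectable while pushing quieter echo below typical VAD thresholds.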

Barge-In Detection with Echo Awareness

A more sophisticated approach than ducking: maintain an echo-aware barge-in detector that can distinguish between echo and genuine user speech during agent output.

How it works:

  1. While the agent is speaking, continue capturing microphone input.

  2. Use the known TTS output as a reference signal.

  3. Apply AEC to the microphone input.

  4. If significant speech energy remains after echo cancellation, classify it as a genuine user interruption (barge-in).

  5. If the residual after AEC is near silence, classify it as echo and ignore it.

Advantages: Supports barge-in while suppressing echo. Best of both worlds.

Disadvantages: Requires a reliable AEC algorithm on the server side. Double-talk detection adds complexity. Latency for barge-in detection may be slightly higher.
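The classification in steps 4-5 boils down to comparing energy before and after AEC. A sketch follows; the -20 dB threshold and function name are assumptions to be tuned against real call audio.

```python
import math

def classify_frame(mic_frame, aec_residual, threshold_db=-20.0):
    """Classify one mic frame captured while the agent is speaking.

    If AEC cancelled most of the captured energy, the frame was echo;
    if substantial residual energy remains, treat it as user barge-in.
    threshold_db is the residual-to-raw energy ratio separating the two.
    """
    raw = sum(s * s for s in mic_frame)
    res = sum(s * s for s in aec_residual)
    if raw <= 0.0 or res <= 0.0:
        return "echo"
    residual_db = 10.0 * math.log10(res / raw)
    return "barge_in" if residual_db > threshold_db else "echo"
```

In practice the decision would be smoothed over several consecutive frames before cutting off the agent's TTS, to avoid triggering barge-in on a single noisy frame.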

Muting the Agent's Microphone Input During TTS Playback

For architectures where the agent controls the audio pipeline (server-to-server, no browser), simply don't send microphone input to STT while TTS is playing.

This is essentially audio ducking implemented at the pipeline level. After TTS playback completes, add a brief delay (200-500 ms) before resuming STT input to allow any residual echo to decay.
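A pipeline-level gate with this decay window can be sketched as a small state machine; class and method names here are illustrative, and timestamps are assumed to come from the pipeline clock in milliseconds.

```python
class SttGate:
    """Drops mic audio while TTS plays, plus a post-playback hangover
    window so residual acoustic echo can decay before STT resumes."""

    def __init__(self, hangover_ms=300):
        self.hangover_ms = hangover_ms
        self.tts_playing = False
        self.tts_ended_at = float("-inf")

    def on_tts_start(self):
        self.tts_playing = True

    def on_tts_end(self, now_ms):
        self.tts_playing = False
        self.tts_ended_at = now_ms

    def should_forward(self, now_ms):
        """True if the current mic frame should be sent to STT."""
        if self.tts_playing:
            return False
        return (now_ms - self.tts_ended_at) >= self.hangover_ms
```

The pipeline calls `should_forward` on every captured frame and silently discards frames while it returns False.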

Headphone Detection and Prompting

The simplest and most reliable echo prevention: headphones. If the user wears headphones, echo is virtually eliminated because the speaker output never reaches the microphone through the air.

Implementation:

  • Detect headphone connection via the browser's MediaDevices API.

  • If no headphones are detected, prompt the user: "For the best experience, please use headphones or earbuds."

  • Optionally, gate the conversation behind headphone usage for applications where echo would be particularly disruptive.

This approach is practical for applications with a captive audience (internal tools, patient portals) but less practical for consumer-facing voice AI where you can't dictate device usage.

Testing Echo Scenarios

Echo is notoriously difficult to test because it depends on physical device characteristics, acoustic environments, and real-time audio processing. However, systematic testing can surface most issues before they reach production.

What to Test

| Scenario | Why It Matters | What to Look For |
| --- | --- | --- |
| Speakerphone at high volume | Most common echo trigger | Agent responding to its own output; conversation loops |
| First 5 seconds of call | AEC stabilization window | Echo in initial greeting exchange |
| Long agent response followed by short user response | Maximum echo energy from extended TTS | User's short response being missed or mixed with echo |
| Quiet room, no headphones | Baseline non-headphone scenario | Any echo presence at all |
| Noisy environment + speakerphone | Compound challenge: AEC + noise | Echo misclassified as user speech |
| Firefox browser | Known weak AEC | Higher echo rate vs. Chrome baseline |
| User interrupts agent mid-speech | Double-talk scenario | Both user and echo present simultaneously |

Simulating Echo in Test Environments

Real echo testing requires physical audio setups -- a speaker playing agent output near a microphone capturing for the agent's input. This is inherently difficult to automate.

Approaches:

  1. Physical test rig: Set up a phone or laptop with speaker on in a test room. Play agent output through the speaker and verify that the pipeline handles the acoustic feedback correctly. Labor-intensive but the most realistic test.

  2. Loopback injection: Feed a portion of the agent's TTS output back into the input stream digitally (simulating echo). This tests the pipeline's ability to handle echo but doesn't capture the acoustic transformations that real echo undergoes.

  3. Background noise simulation: While not echo itself, testing with background noise that includes speech-like characteristics (crowd noise, office chatter) exercises similar code paths. If your pipeline handles speech-like background noise correctly, it's more likely to handle echo correctly.
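Approach 2, loopback injection, can be sketched as mixing a delayed, attenuated copy of the TTS output into the captured stream. Parameter defaults here are illustrative; as noted, this does not model the speaker/room/microphone transformations real echo undergoes.

```python
def inject_loopback_echo(mic, tts, delay_samples=800, gain=0.3):
    """Simulate acoustic echo digitally: add a delayed, attenuated copy
    of the agent's TTS samples into the microphone sample stream."""
    out = list(mic)
    for n, s in enumerate(tts):
        i = n + delay_samples
        if i < len(out):
            out[i] += gain * s
    return out
```

Feeding the result into the pipeline verifies that AEC, ducking, or barge-in logic suppresses the injected echo instead of transcribing it as user speech.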

Coval's persona system supports background noise simulation with 19 environment options -- from cafe chatter to construction noise -- and adjustable volume. While this tests noise robustness rather than echo specifically, the overlapping failure modes mean noise testing catches many of the same pipeline issues that echo exposes, particularly in how the VAD and STT handle non-user audio.

Audio Metrics for Echo Detection

Several metrics help identify echo issues in recorded conversations:

Interruption Rate: A high interruption rate where the agent is interrupting the user (especially shortly after finishing its own turn) can indicate echo. The agent "hears" its own echo, classifies it as user speech, and starts responding.

Agent Repeats Itself: If the agent's response is semantically similar to its previous response, it may be processing its own echo as user input and generating the same response again. A pattern of the agent saying similar things in consecutive turns is a strong echo indicator.

Background Noise / Signal-to-Noise Ratio: Poor SNR can indicate acoustic conditions that are likely to produce echo. An SNR below 10 dB suggests the environment may overwhelm AEC.

Latency Anomalies: Echo can cause unusual latency patterns. If the agent responds extremely quickly (because the "user input" was just the end of its own TTS output already present in the audio buffer), this latency anomaly can be detected.

Monitoring Echo in Production

Echo issues that escape testing will show up in production. The key is detecting them before too many users are affected.

Signals That Indicate Echo Problems

Conversation loops: The agent and user exchange increasingly similar messages, or the conversation length grows abnormally long without resolution. Monitor for conversations where the agent's response in turn N+1 is highly similar to its response in turn N.

High interruption rate from the agent: If your production metrics show the agent interrupting users at a high rate, especially in the moments immediately after the agent finishes speaking, echo is a likely cause.

User hangups during or immediately after agent speech: Users who experience echo often hang up quickly because the conversation becomes unintelligible. Track call abandonment rates and correlate with conversation length and agent speech duration.

Unusual STT transcription patterns: Echo manifests in STT output as the agent's own words appearing in user turns. Monitor for cases where the user's transcribed speech closely matches the agent's immediately preceding output.

Building Echo-Specific Monitoring

Create custom monitoring metrics that specifically target echo:

  • LLM-as-a-Judge metric: "Did any of the user's turns contain text that is nearly identical to the agent's immediately preceding response?" A YES answer strongly suggests echo contamination.

  • Transcript pattern detection: Use regex or string similarity to compare each user turn against the agent's previous turn. Flag conversations where similarity exceeds a threshold (e.g., 70% word overlap).

  • Interruption clustering: Track whether interruptions cluster in the first 5 seconds of a call (suggesting AEC stabilization issues) or immediately after long agent responses (suggesting echo from extended TTS output).
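The transcript pattern detection above can be sketched as a word-overlap check; function names are ours, and the 0.7 threshold mirrors the 70% suggestion in the bullet.

```python
def echo_overlap(agent_turn, user_turn):
    """Fraction of the user turn's words that also appear in the agent's
    preceding turn; values near 1.0 suggest the STT transcribed echo."""
    agent_words = set(agent_turn.lower().split())
    user_words = user_turn.lower().split()
    if not user_words:
        return 0.0
    hits = sum(1 for w in user_words if w in agent_words)
    return hits / len(user_words)

def flag_echo_turns(turns, threshold=0.7):
    """turns: ordered list of (speaker, text) pairs. Returns indices of
    user turns whose overlap with the prior agent turn meets threshold."""
    flagged = []
    for i in range(1, len(turns)):
        prev_speaker, prev_text = turns[i - 1]
        speaker, text = turns[i]
        if speaker == "user" and prev_speaker == "agent":
            if echo_overlap(prev_text, text) >= threshold:
                flagged.append(i)
    return flagged
```

A production version would normalize punctuation and use a fuzzier similarity measure (e.g., token-level edit distance), but even this bag-of-words check catches the blatant cases where the agent's sentence reappears verbatim as "user" input.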

Push production call transcripts and audio into your evaluation pipeline to run these metrics continuously. When echo is detected, convert the affected conversation into a regression test case to ensure the issue is caught by future test runs.

Coval's monitoring system supports this workflow -- production conversations evaluated with custom metrics, alerts for threshold violations, and direct conversion of problem conversations into regression test sets.

Practical Decision Framework

Not every echo mitigation is appropriate for every use case. Here's how to decide:

| If your use case... | Recommended approach |
| --- | --- |
| Is browser-based with general users | WebRTC AEC + headphone detection + browser recommendation + partial ducking for non-headphone sessions |
| Is a phone/PSTN-based agent | Server-side AEC with known reference signal + pipeline-level muting during TTS |
| Requires barge-in support | Echo-aware barge-in detection with server-side AEC |
| Is an IVR or structured flow (no barge-in needed) | Full audio ducking during agent speech |
| Targets elderly or less tech-savvy users | Headphone prompt + higher echo tolerance + longer AEC stabilization window |
| Is used in known noisy environments | Server-side AEC + noise suppression + adaptive gain control |
| Must work on Firefox | Aggressive server-side processing + lower TTS volume + consider push-to-talk fallback |

FAQ

Why does echo cancellation take a few seconds to work at the start of a call?

Echo cancellation algorithms use adaptive filters that need to learn the acoustic characteristics of the current environment -- the relationship between what's played through the speaker and what the microphone picks up. This includes room reflections, device speaker response, microphone sensitivity, and the physical positioning of speaker and mic. The adaptation process typically takes 3-4 seconds, during which echo cancellation is less effective. This is why the opening seconds of a voice AI call are most vulnerable to echo.

Why is Firefox worse than Chrome for echo cancellation?

Firefox uses its own WebRTC implementation rather than Google's. The AEC algorithm in Firefox's implementation is less sophisticated and less frequently optimized than Chrome's. Chrome benefits from Google's investment in WebRTC as a core technology (used in Google Meet, Android, and other products), which drives continuous improvement to its AEC. Safari benefits from Apple's tight hardware-software integration. Firefox, with fewer resources dedicated to media processing, lags behind on AEC quality.

Can I fix echo purely in software, or do I need hardware changes?

In most cases, software solutions are sufficient. Server-side AEC with a known reference signal can eliminate most echo. Audio ducking eliminates it entirely (at the cost of barge-in). However, in extreme cases -- reflective environments like glass rooms, very high speaker volumes, or devices with poor hardware audio processing -- no amount of software processing can fully solve the problem. In those cases, headphones are the only reliable solution.

How do I test for echo without a physical audio setup?

Fully realistic echo testing requires physical audio (a speaker near a microphone). However, you can approximate it by: (1) injecting a portion of the agent's TTS output back into the input stream digitally, (2) testing with speech-like background noise to exercise similar code paths, and (3) monitoring production conversations for echo indicators (conversation loops, high interruption rates, agent repeating itself). The production monitoring approach catches real-world echo that no test setup perfectly replicates.

My users report intermittent echo. How do I debug it?

Intermittent echo is usually caused by: (1) specific device models with poor hardware AEC, (2) users switching between headphones and speakerphone mid-call, (3) acoustic environment changes (moving from a quiet room to a hallway), or (4) AEC adaptation loss after a pause in the conversation. Collect device information (browser, OS, device model), audio recordings, and transcripts from affected users. Look for patterns in the device or environment data. Monitor the interruption rate and agent-repeats-itself metrics to quantify how often it occurs across your user base.

Should I just require headphones?

If you can, it's the most reliable solution. Headphones eliminate the acoustic echo path entirely. For applications where you control the user experience (internal tools, healthcare patient portals, employee-facing agents), requiring headphones is reasonable. For consumer-facing products, it's not practical -- you'll lose users who don't have headphones handy. A middle ground: detect headphone usage, recommend headphones if absent, and apply more aggressive echo mitigation (partial ducking, server-side AEC) for non-headphone sessions.

Echo cancellation is one of those problems that's easy to ignore in development (when you're always wearing headphones) and impossible to ignore in production (when your users aren't). Proactive testing with varied acoustic conditions and continuous production monitoring are the only reliable ways to stay ahead of it.

-> Coval's audio metrics and background noise simulation help voice AI teams test and monitor the acoustic scenarios that trigger echo and other audio quality failures. Learn more at coval.dev