Voice Load Testing: How to Simulate 10,000 Concurrent Calls
Mar 7, 2026
Your voice AI agent handles five concurrent calls beautifully. Ten calls, no problem. Fifty calls during a product demo and everything still looks sharp. Then you launch, a marketing campaign drives 2,000 inbound calls in the same hour, and the whole thing collapses. Latency spikes from 600ms to 8 seconds. Half the calls drop. Your STT provider starts rate-limiting. The LLM queue backs up. Users hear silence, assume the call is dead, and hang up.
This is not a hypothetical scenario. It is the most common failure pattern in voice AI deployments, and it happens because teams validate correctness without validating scale. Standard functional tests confirm that the agent says the right things. Voice load testing confirms that it still says the right things when 10,000 people are listening.
Why Standard Load Testing Tools Fail for Voice
If you have built web applications, you are probably familiar with tools like k6, Locust, or Apache JMeter. They generate HTTP requests at scale, measure response times, and report throughput. They work well for REST APIs and web pages. They do not work for voice AI.
Voice AI load testing is fundamentally different from HTTP load testing for three reasons.
Real-Time Audio Streams, Not Request-Response
An HTTP load test sends a request and waits for a response. A voice call maintains a continuous, bidirectional audio stream for the entire duration of the conversation. Each simulated caller needs to send audio in real time (not faster, not slower), process incoming audio, and maintain conversational timing. A single five-minute call generates roughly 4.8 MB of raw 16-bit audio at telephony-standard 8kHz mono. Multiply that by 10,000 concurrent callers, and your load generation infrastructure alone needs to handle roughly 48 GB of audio every five minutes in each direction.
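The arithmetic is worth making explicit as a back-of-the-envelope capacity check. This sketch assumes 16-bit PCM at 8 kHz mono (standard telephony quality), which is what the ~4.8 MB-per-call figure corresponds to:

```python
# Back-of-the-envelope audio volume for load generation capacity planning.
# Assumes 16-bit PCM at 8 kHz mono (standard telephony quality).
SAMPLE_RATE_HZ = 8_000
BYTES_PER_SAMPLE = 2
CALL_SECONDS = 5 * 60
CONCURRENT_CALLS = 10_000

bytes_per_call = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CALL_SECONDS
total_bytes = bytes_per_call * CONCURRENT_CALLS  # one direction, per 5 minutes

print(f"Per call: {bytes_per_call / 1e6:.1f} MB")              # 4.8 MB
print(f"All calls: {total_bytes / 1e9:.1f} GB per direction")  # 48.0 GB
```

Higher sample rates or stereo capture scale these numbers linearly, so it pays to recompute for your actual codec before sizing load generators.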
Stateful, Multi-Turn Conversations
HTTP load tests can fire independent requests. Voice load tests cannot. Each simulated call is a stateful, multi-turn conversation where the next utterance depends on what the agent just said. The simulated caller needs to listen, understand, and respond contextually -- not just replay a script. This means each concurrent call consumes compute resources on the load generation side for STT processing, response generation, and TTS synthesis.
Protocol Complexity
Voice AI systems rarely use plain HTTP. They use WebSocket connections for real-time audio streaming, SIP for telephony signaling, RTP for media transport, and often WebRTC for browser-based interactions. Each protocol has its own connection lifecycle, keepalive requirements, and failure modes. A load test that does not exercise the actual protocol stack is testing a fiction.
What Breaks That HTTP Tests Miss
| Failure Mode | Why HTTP Tests Miss It | Impact at Scale |
|---|---|---|
| Jitter accumulation | No real-time audio stream to measure | Audio becomes choppy, users hear garbled speech |
| Codec negotiation failures | No SIP/RTP handshake | Calls fail silently at connection |
| Turn-taking collapse | No conversational state | Agent talks over users or stops responding |
| STT queue saturation | No actual audio to transcribe | Latency compounds across every turn |
| TTS concurrency limits | No audio synthesis under load | Agent responses arrive seconds late |
| WebSocket connection exhaustion | Many tools use HTTP/1.1 only | New calls cannot connect |
The Metrics That Actually Matter Under Load
Measuring load test results with the same metrics you use for functional tests is a mistake. A passing functional metric (did the agent resolve the issue?) can mask a failing experience metric (did the user wait 6 seconds between every turn?).
P95 Response Latency
Median latency is deceptive. If your median response time is 700ms but your p95 is 4.2 seconds, one in twenty callers is having a terrible experience. Under load, latency distributions widen. Track p50, p95, and p99 separately, and set alerts on the p95.
Target thresholds under load:
P50 (median): Under 800ms for real-time conversation
P95: Under 2 seconds -- beyond this, users perceive the agent as broken
P99: Under 4 seconds -- this catches the worst-case tail
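Computing these percentiles correctly matters more than it looks: averaging hides the tail entirely. A minimal nearest-rank implementation (the thresholds above are the article's; the sample data is illustrative):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of latency samples in milliseconds."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))  # stand-in for measured turn latencies
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Check against the thresholds above (all in milliseconds).
healthy = p50 < 800 and p95 < 2_000 and p99 < 4_000
```

In production you would feed this from per-turn latency logs; the point is that p50, p95, and p99 are tracked and alerted on independently.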
Jitter
Jitter is the variation in packet arrival times. Low average latency with high jitter produces an experience where some responses arrive instantly and others arrive after an uncomfortable pause. Users interpret inconsistency as unreliability. Measure jitter as the standard deviation of inter-packet arrival times across your audio streams.
Target: Under 30ms jitter for audio streams at peak load.
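Measured the way the text describes, jitter is just the spread of inter-packet gaps. A sketch using packet arrival timestamps (the timestamps here are invented for illustration):

```python
import statistics

def jitter_ms(arrival_times_ms):
    """Jitter as the standard deviation of inter-packet arrival deltas."""
    deltas = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return statistics.pstdev(deltas)

# A perfectly paced 20ms stream has zero jitter...
steady = jitter_ms([0, 20, 40, 60, 80])
# ...while irregular arrivals produce a nonzero spread.
irregular = jitter_ms([0, 20, 45, 60, 85])
```

Note that RTP stacks often report a smoothed running jitter estimate instead; the standard-deviation form here is the simple offline version suited to post-test analysis.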
Packet Loss
Voice audio is real-time and unrecoverable. Unlike TCP-based data transfer, lost audio packets cannot be retransmitted because the conversation has already moved on. Even 1% packet loss causes perceptible audio degradation. At 3%, speech becomes difficult to understand.
Target: Under 0.5% packet loss at peak load.
Call Completion Rate
What percentage of calls that connect actually complete the full conversation flow? This is the most business-relevant metric. Track it as a ratio: calls that reach a natural endpoint (resolution, transfer, scheduled callback) divided by total calls that successfully connected.
Target: Above 95% completion rate at peak load, with no more than a 5% drop from your baseline completion rate.
Component-Level Latency Breakdown
Aggregate latency tells you there is a problem. Component-level breakdown tells you where the problem is. Track each pipeline stage independently:
STT processing time: Audio-in to transcript-out
LLM inference time: Prompt submission to first token (TTFT)
TTS synthesis time: Text-in to first audio byte
Network transport time: Everything else (routing, queuing, serialization)
When total latency spikes, you need to immediately identify whether your STT provider is saturated, your LLM is queuing, or your TTS is falling behind. Without component-level data, you are guessing.
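One lightweight way to get that component-level data is to wrap each pipeline stage in a timer. This is a generic sketch (the stage names and sleeps are placeholders for real STT/LLM/TTS calls, not any particular SDK's API):

```python
import time
from contextlib import contextmanager

stage_times = {}  # stage name -> list of durations in milliseconds

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times.setdefault(stage, []).append(
            (time.perf_counter() - start) * 1000)

# Hypothetical pipeline; replace the sleeps with real provider calls.
with timed("stt"):
    time.sleep(0.01)       # audio-in to transcript-out
with timed("llm_ttft"):
    time.sleep(0.02)       # prompt submission to first token
with timed("tts_first_byte"):
    time.sleep(0.005)      # text-in to first audio byte

for stage, samples in stage_times.items():
    print(f"{stage}: {sum(samples) / len(samples):.1f} ms avg")
```

In a real deployment you would emit these as OpenTelemetry spans rather than an in-process dict, but the breakdown is the same: anything not attributed to a stage is your network and queuing overhead.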
Concurrent Connection Ceiling
At what exact number of concurrent calls does your system start degrading? This is not a binary threshold -- it is a curve. Plot your key metrics (p95 latency, completion rate, error rate) against concurrent call count and identify the inflection point where quality begins to decline.
The Five-Phase Load Testing Methodology
Phase 1: Baseline Profiling (Day 1-2)
Before you can identify degradation, you need to know what "normal" looks like.
What to do:
Run 10 concurrent calls with representative conversation scenarios
Record all metrics: latency per component, completion rate, audio quality scores, resource utilization
Run each scenario at least 5 times for statistical stability
Document your baseline numbers as the reference for all future comparisons
What you are establishing:
Component-level latency at minimal load (this is your floor)
Resource consumption per concurrent call (CPU, memory, network bandwidth)
Baseline quality metrics (transcription accuracy, response relevance, completion rate)
The baseline profile also reveals your theoretical maximum. If each call consumes 200MB of memory and your infrastructure has 64GB available, your hard ceiling is roughly 320 concurrent calls before you need to scale horizontally -- regardless of what other metrics say.
Phase 2: Ramp Testing (Day 3-5)
Ramp testing gradually increases load to find where degradation begins.
The ramp protocol:
Start at your baseline (10 concurrent calls)
Increase by 2x every 10 minutes: 10 --> 20 --> 40 --> 80 --> 160 --> 320 --> 640
At each level, hold for a full 10 minutes to let the system stabilize
Record all metrics at each level
Continue until either your target load or your quality floor is reached
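The ramp protocol above is mechanical enough to encode directly. A sketch of the schedule generator (the defaults mirror the article's example ramp):

```python
def ramp_schedule(start=10, factor=2, target=640, hold_minutes=10):
    """Return (concurrency, hold_minutes) levels, doubling up to the target."""
    schedule = []
    level = start
    while level <= target:
        schedule.append((level, hold_minutes))
        level *= factor
    return schedule

plan = ramp_schedule()
# [(10, 10), (20, 10), (40, 10), (80, 10), (160, 10), (320, 10), (640, 10)]
```

A load orchestrator would iterate this plan, hold each level for the stated duration, snapshot metrics, and abort early if the quality floor is breached.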
What you are looking for:
The inflection point where p95 latency starts climbing nonlinearly
The first component to degrade (this is your bottleneck)
Whether quality degradation is gradual (manageable) or cliff-like (dangerous)
Auto-scaling behavior: does new capacity come online before quality degrades?
Common findings during ramp testing:
LLM inference is almost always the first bottleneck in cascaded voice pipelines. Token throughput has hard limits, and queuing adds latency that compounds across turns.
TTS concurrency limits are often discovered here. Many TTS providers enforce per-account or per-API-key concurrency caps that are not documented prominently.
STT providers that perform well at low concurrency sometimes have per-region capacity limits that cause latency spikes at moderate scale.
Phase 3: Spike Testing (Day 6-7)
Production traffic does not ramp gradually. It spikes. A news mention, a marketing email, a seasonal event -- traffic can jump from baseline to 10x in minutes. Spike testing validates that your system survives sudden load.
The spike protocol:
Run at steady-state baseline load (e.g., 50 concurrent calls) for 15 minutes
Instantly jump to 5x (250 concurrent calls) and hold for 10 minutes
Return to baseline and hold for 10 minutes
Jump to 10x (500 concurrent calls) and hold for 10 minutes
Return to baseline and measure recovery time
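The same idea works for the spike protocol, except the plan has abrupt steps instead of a smooth ramp. A config sketch using the article's example baseline of 50 concurrent calls:

```python
# Phase plan for the spike protocol described above.
BASELINE = 50  # steady-state concurrent calls (article's example)

spike_plan = [
    # (phase name, concurrency, minutes to hold)
    ("steady",    BASELINE,       15),
    ("spike_5x",  BASELINE * 5,   10),
    ("recover",   BASELINE,       10),
    ("spike_10x", BASELINE * 10,  10),
    ("recover",   BASELINE,       10),
]
```

Keeping the plan as data rather than code makes it easy to rerun the identical spike shape after each fix and compare recovery times across runs.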
What you are looking for:
How quickly auto-scaling responds to the spike
Whether the system gracefully queues or ungracefully drops calls during the spike
Recovery behavior: does the system return to baseline performance after the spike, or does degradation persist?
Connection storms: does a burst of simultaneous connection attempts cause additional failures (SIP registration storms, WebSocket handshake timeouts)?
The gap between spike and ramp results reveals your scaling architecture's responsiveness. If ramp testing shows you can handle 500 concurrent calls but spike testing fails at 300, your auto-scaling is too slow.
Phase 4: Soak Testing (Day 8-10)
Soak testing runs sustained load over extended periods to catch issues that only emerge over time.
The soak protocol:
Run at 70-80% of your identified capacity ceiling
Maintain this load continuously for 8-24 hours
Monitor for gradual degradation in any metric
Pay particular attention to memory usage trends and connection pool behavior
What you are looking for:
Memory leaks: does memory usage climb steadily? Even a small leak (50MB/hour) will crash your system overnight.
Connection pool exhaustion: are WebSocket connections, database connections, or API client connections being properly released?
Token or rate limit accumulation: some providers have daily or hourly quotas that are not hit during short tests.
Garbage collection pauses: in JVM or managed-runtime environments, GC pauses become more frequent and longer under sustained load.
Log storage: are logs filling disk? Production voice systems generate significant log volume.
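Memory leaks in particular are easiest to catch by fitting a trend line to periodic RSS samples rather than eyeballing a dashboard. A sketch with invented hourly readings that climb at roughly the 50 MB/hour rate mentioned above:

```python
def leak_slope_mb_per_hour(samples):
    """Least-squares slope of (hour, rss_mb) samples; a steady positive
    slope over a soak run suggests a leak."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    return num / den

# Hourly RSS readings (hour, MB) during a soak run -- illustrative data.
readings = [(0, 4000), (1, 4052), (2, 4099), (3, 4151), (4, 4203)]
slope = leak_slope_mb_per_hour(readings)
```

A slope of ~50 MB/hour looks harmless over a 1-hour test but adds over a gigabyte across a 24-hour soak, which is exactly why short tests miss it.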
Phase 5: Scale to Target (Day 11-14)
With bottlenecks identified and resolved from phases 2-4, now you push to your actual target: 10,000 concurrent calls.
Infrastructure considerations at this scale:
Telephony capacity: You need SIP trunking that supports 10,000 concurrent channels. Most standard Twilio or Telnyx accounts are provisioned for far less. Contact your provider to pre-provision capacity and confirm rate limits on SIP INVITE messages.
LLM token throughput: At 10,000 concurrent calls with an average of 4 turns per minute, you need to process 40,000 LLM requests per minute. At an average of 200 output tokens per response, that is 8 million tokens per minute. Confirm your LLM provider's throughput limits and pre-provision capacity.
TTS concurrency: Each concurrent call needs its own TTS synthesis stream. At 10,000 calls, you need 10,000 concurrent TTS sessions. Most TTS providers require enterprise agreements and pre-provisioned capacity at this scale.
STT concurrency: Same as TTS -- 10,000 concurrent audio streams all need real-time transcription. Confirm your STT provider's concurrent session limits.
Network bandwidth: At roughly 32 kbps per compressed audio stream (typical for voice codecs such as Opus), 10,000 bidirectional streams require roughly 640 Mbps of sustained bandwidth just for audio. Add protocol overhead, and you need at least 1 Gbps of dedicated capacity.
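The provisioning numbers above all derive from the same few inputs, so it is worth keeping the arithmetic in one place where you can plug in your own call profile:

```python
# Capacity planning math from the figures above; adjust the inputs
# to match your own traffic profile and codec.
CONCURRENT_CALLS = 10_000
TURNS_PER_MIN = 4          # average conversational turns per call per minute
TOKENS_PER_RESPONSE = 200  # average LLM output tokens per turn
KBPS_PER_STREAM = 32       # compressed voice audio bitrate
DIRECTIONS = 2             # audio flows both ways

llm_requests_per_min = CONCURRENT_CALLS * TURNS_PER_MIN          # 40,000
llm_tokens_per_min = llm_requests_per_min * TOKENS_PER_RESPONSE  # 8,000,000
audio_mbps = CONCURRENT_CALLS * KBPS_PER_STREAM * DIRECTIONS / 1000  # 640.0
```

Run this against each provider's published limits before the test, not during it; at this scale the limiting factor is usually a contract, not a server.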
Load generation infrastructure: You cannot generate 10,000 realistic voice calls from a single machine. You need a distributed load generation cluster -- typically 50-100 machines, each handling 100-200 concurrent simulated callers. Coordinating these machines while maintaining realistic conversational timing is the hardest part of large-scale voice load testing.
How Simulation Platforms Replace Custom Tooling
Building voice load testing infrastructure from scratch takes months. You need to build simulated callers that generate realistic audio, orchestrate thousands of concurrent conversations, handle the full SIP/RTP/WebSocket protocol stack, and build measurement and analysis pipelines.
Platforms like Coval provide this out of the box. Coval's simulation engine generates realistic voice conversations using configurable AI personas -- complete with accents, background noise, interruption patterns, and natural turn-taking. Because each simulated caller is an AI-driven conversational agent (not a script replay), the conversations are realistic enough to stress-test not just infrastructure capacity but conversational quality under load.
The concurrency controls in Coval's evaluation templates let you configure exactly how many parallel simulations run simultaneously. Combined with the platform's built-in latency metrics, interruption rate tracking, and component-level trace analysis via OpenTelemetry, you get load testing results that show both whether your infrastructure scales and whether your agent still performs well at scale.
This matters because a system that handles 10,000 calls but delivers terrible responses at that scale has not actually passed the test.
Building Your Load Testing Playbook
Pre-Test Checklist
Before running any load test:
Baseline metrics documented (latency per component, resource consumption per call)
Monitoring dashboards set up with real-time visibility into all components
Provider rate limits confirmed (STT, LLM, TTS, telephony)
Auto-scaling policies configured and tested at small scale
Alerting thresholds set for p95 latency, error rate, and resource utilization
Rollback plan documented in case load testing destabilizes shared infrastructure
Test scenarios represent actual production traffic mix (not just happy paths)
Post-Test Analysis Framework
After each load test phase:
Compare against baseline: Which metrics degraded, by how much, and at what concurrency level?
Identify the bottleneck: Which component degraded first? Was it infrastructure (CPU, memory, network) or external (provider rate limits, API timeouts)?
Calculate headroom: What is your current maximum capacity with acceptable quality? How much headroom does that provide over expected peak traffic?
Prioritize fixes: Rank bottlenecks by impact. Fix the one that unlocks the most capacity first.
Retest: After each fix, re-run the relevant load test phase to confirm the improvement and reveal the next bottleneck.
Ongoing Load Testing Cadence
Voice load testing is not a one-time activity. Run it:
Before every major release that changes conversation logic, provider configurations, or infrastructure
Monthly as a baseline check, even without changes (provider performance drifts)
After any provider change (new STT model, LLM version upgrade, TTS voice switch)
Before expected traffic events (marketing campaigns, seasonal peaks, product launches)
Frequently Asked Questions
How is voice load testing different from IVR testing?
IVR testing validates that a voice system's functional logic works correctly -- menu options route properly, inputs are recognized, and callers reach the right destination. Voice load testing specifically focuses on performance under concurrent load. You need both: IVR testing catches logic bugs, load testing catches scaling failures.
What hardware do I need to generate 10,000 concurrent calls?
You cannot generate 10,000 realistic voice calls from a single machine. Each simulated caller needs CPU for audio generation and processing, memory for conversational state, and network bandwidth for audio streams. Expect to need 50-100 machines, each running 100-200 simulated callers. Cloud instances (c5.2xlarge or equivalent) work well as load generators.
Can I load test against my production environment?
You can, but proceed with caution. Load testing in production risks impacting real users if the test pushes infrastructure beyond capacity. The safer approach is to test against a staging environment that mirrors production infrastructure as closely as possible. If you must test in production, do so during low-traffic windows, ramp gradually, and have someone watching dashboards with a kill switch.
How long should a soak test run?
Minimum 8 hours for catching common issues like memory leaks and connection pool exhaustion. Ideally 24 hours if you can allocate the infrastructure. Some issues -- like daily rate limit resets or garbage collection degradation -- only surface over longer periods.
What is an acceptable quality degradation under load?
There is no universal answer, but a common industry benchmark is: p95 latency should not exceed 2x your baseline at peak load, and call completion rate should not drop more than 5% from baseline. If your baseline p95 is 800ms, anything above 1,600ms at peak load signals a problem. If your baseline completion rate is 97%, anything below 92% at peak load signals a problem.
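That rule of thumb is simple enough to automate as a pass/fail gate in a load test pipeline. A sketch (the function name and numbers are illustrative, following the benchmark just described):

```python
def within_budget(baseline_p95_ms, peak_p95_ms,
                  baseline_completion_pct, peak_completion_pct):
    """Rule of thumb above: peak p95 at most 2x baseline, and completion
    rate drops no more than 5 percentage points."""
    return (peak_p95_ms <= 2 * baseline_p95_ms
            and baseline_completion_pct - peak_completion_pct <= 5)

ok = within_budget(800, 1_500, 97, 94)       # within both budgets
too_slow = within_budget(800, 1_700, 97, 94) # p95 exceeded the 1,600ms limit
```

Wiring a check like this into CI means a regression in either latency or completion rate fails the build instead of surfacing in production.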
Do I need to test with real audio or can I use synthetic input?
Real audio or AI-generated conversational audio. Pre-recorded script replay is better than nothing but misses the conversational dynamics that stress a voice system -- natural pauses, interruptions, variable utterance lengths, and contextual responses. AI-driven simulated callers produce more realistic load patterns.
Ready to validate your voice AI scales before your users discover the limits? Coval simulates thousands of concurrent voice conversations with realistic AI personas, measuring both infrastructure performance and conversation quality under load.
-> coval.dev
