Why Latency Kills Voice AI (And How We Got It Under 800ms)

A text chatbot can take 3 seconds to respond. A voice agent? You've already lost the conversation.

Tech Deep Dive · Apr 26, 2026 · 4 min read
Abstract visualization of real-time voice communication latency

It's 2:17am. A potential patient calls your clinic. The phone rings twice, then an AI voice answers: "Hi, this is Sarah from the clinic, how can I help you tonight?" The caller asks a question. Then waits.

One second. Two seconds. Three seconds of silence.

They hang up. You just lost a $4,000 hair transplant booking because your voice AI had the conversational rhythm of a dial-up modem.

Text chatbots get away with 2-3 second response times because users expect to wait. They're typing. They're multitasking. But voice obeys different physics. In a phone conversation, 1.5 seconds of dead air feels like an eternity. By 3 seconds, the human brain assumes the line dropped.

The Voice AI Latency Budget: Every Millisecond Counts

A voice agent's response has four stages, each with its own latency tax:

- Speech-to-text: finalizing the caller's transcript (typically 100-300ms)
- LLM inference: time to the first tokens of a reply (typically 300-800ms)
- Text-to-speech: first audio byte of the synthesized response (typically 100-300ms)
- Network and telephony: audio transport in both directions (typically 200-250ms)

Add it up and you're looking at 700ms to 1,650ms total. That's before any business logic, database lookups, or calendar API calls. A poorly architected voice agent can easily hit 4+ seconds, which in voice terms is a conversation ender.
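A minimal sketch of that budget as arithmetic. The per-stage ranges below are illustrative typical figures, not measurements from any specific stack; only the 700-1,650ms total comes from the text above.

```python
# Illustrative latency budget for one voice-agent turn.
# Per-stage ranges are typical figures, not measured values.
STAGES = {
    "speech-to-text (final transcript)": (100, 300),
    "LLM inference (first tokens)": (300, 800),
    "text-to-speech (first audio byte)": (100, 300),
    "network + telephony transport": (200, 250),
}

def budget(stages):
    """Return (best_case_ms, worst_case_ms) summed across all stages."""
    lo = sum(r[0] for r in stages.values())
    hi = sum(r[1] for r in stages.values())
    return lo, hi

print(budget(STAGES))  # -> (700, 1650)
```

Any business logic (database lookups, calendar API calls) adds on top of this floor, which is why a serial, unstreamed pipeline blows past 2 seconds so easily.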

Why Text-First Companies Fail at Voice

Most AI chat platforms (Intercom, Drift, Tidio, Crisp) started with text widgets. Their entire stack is optimized for async messaging. Text tolerates latency. You can batch requests, cache aggressively, queue tasks, retry failures gracefully.

Voice doesn't forgive any of that. A 2-second delay in a voice conversation isn't "slow" — it's broken. The caller assumes:

- The call dropped
- The agent didn't hear them
- Something on the line is broken

So they repeat themselves, talk over the agent when it finally responds, or just hang up.

If your voice AI can't respond in under 1 second, you don't have a voice product — you have a science experiment.

How We Got Softnode Voice Agents Under 800ms

We built Softnode voice-first from day one. That meant making architectural choices that text-only platforms can't retrofit:

1. Streaming STT with aggressive endpoint detection. We don't wait for perfect silence to decide the user is done speaking. Our VAD (voice activity detection) triggers after 400ms of probable endpoint, not 800ms of guaranteed silence. Occasionally we cut off a slow talker — they just keep talking and we pick it up in the next turn. That's better than everyone else waiting an extra half-second.
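The core of that tradeoff can be sketched in a few lines. This is a toy energy-threshold detector standing in for a real VAD model; the frame size, threshold, and sample values are assumptions for illustration, not Softnode's implementation.

```python
# Toy endpoint detector: end the user's turn after ENDPOINT_MS of
# consecutive "probably silent" frames instead of waiting for a
# longer guaranteed-silence window. Energy thresholding stands in
# for a real VAD model; all constants are illustrative.
FRAME_MS = 20          # one VAD decision per 20 ms audio frame
ENDPOINT_MS = 400      # aggressive endpoint: 400 ms, not 800 ms
SILENCE_RMS = 0.02     # hypothetical per-frame energy threshold

def detect_endpoint(frames_rms):
    """Return the index of the frame where the turn ends, or None."""
    silent_run = 0
    for i, rms in enumerate(frames_rms):
        silent_run = silent_run + 1 if rms < SILENCE_RMS else 0
        if silent_run * FRAME_MS >= ENDPOINT_MS:
            return i  # commit the transcript; later speech opens a new turn
    return None

# 10 speech frames followed by 25 silent frames (500 ms of silence):
frames = [0.3] * 10 + [0.005] * 25
print(detect_endpoint(frames))  # -> 29 (endpoint fires mid-silence)
```

Halving `ENDPOINT_MS` is exactly the "cut off a slow talker occasionally" trade: the detector fires 400ms sooner on every single turn, at the cost of a rare false endpoint.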

2. LLM streaming + speculative TTS. We don't wait for the full LLM response before starting TTS. As soon as we have 8-10 tokens (usually a complete sentence fragment), we stream that to TTS and start playing audio. The LLM keeps generating in parallel. The caller hears a response in 500-600ms, and by the time the agent finishes the first sentence, the rest is already buffered.
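The "flush early" rule can be sketched as a generator that chunks an LLM token stream for TTS handoff. `token_stream` is a stand-in for a real streaming LLM client; the boundary characters and token threshold are assumptions, not Softnode's exact heuristic.

```python
# Chunk an LLM token stream for early TTS handoff: emit as soon as
# we have at least MIN_TOKENS tokens AND a sentence/clause boundary,
# instead of waiting for the full response.
MIN_TOKENS = 8
BOUNDARY = (".", "!", "?", ",")

def speculative_chunks(token_stream):
    """Yield text chunks sized for low-latency TTS handoff."""
    buf = []
    for tok in token_stream:
        buf.append(tok)
        at_boundary = tok.rstrip().endswith(BOUNDARY)
        if len(buf) >= MIN_TOKENS and at_boundary:
            yield "".join(buf)   # hand this fragment to TTS now
            buf = []
    if buf:                      # flush whatever remains at end of stream
        yield "".join(buf)

tokens = ["Sure", ",", " I", " can", " book", " that", " for", " you", ".",
          " What", " day", " works", " best", "?"]
for chunk in speculative_chunks(tokens):
    print(repr(chunk))
```

In a real pipeline each yielded chunk goes to a streaming TTS call while the LLM keeps generating, so the caller hears the first sentence while the second is still being written.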

3. Model selection is a latency/quality tradeoff we make per-language. For English, we use OpenAI's gpt-4o-mini for most turns (fast, cheap, good enough) and only escalate to gpt-4o when the conversation history shows complexity. For Turkish, gpt-4o is non-negotiable because smaller models butcher grammar. That's a 200ms tax we accept.
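As a sketch, that routing policy is a small function. The model names follow the post; the complexity heuristic below (turn count and message length) is a placeholder assumption, not the actual escalation logic.

```python
# Per-language model routing: fast/cheap model by default, escalate
# on conversation complexity, Turkish pinned to the larger model.
# The complexity heuristic here is a placeholder assumption.
FAST, STRONG = "gpt-4o-mini", "gpt-4o"

def pick_model(language: str, history: list[str]) -> str:
    if language == "tr":   # smaller models butcher Turkish grammar
        return STRONG
    # Placeholder signal: long or many-turn conversations escalate.
    complex_convo = len(history) > 6 or any(len(t) > 400 for t in history)
    return STRONG if complex_convo else FAST

print(pick_model("en", ["Hi", "Hello!"]))  # -> gpt-4o-mini
print(pick_model("tr", ["Merhaba"]))       # -> gpt-4o
```

The point of routing per-turn rather than per-conversation is that most turns in a booking call are simple, so the 200ms tax is paid only where it buys quality.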

4. TTS model: OpenAI tts-1 with the nova voice. It's not the most expressive TTS on the market (ElevenLabs sounds more human), but it's consistently sub-200ms for first-byte and the quality is "good enough" that 95% of callers don't notice it's AI in the first 10 seconds. We'll accept "good enough + fast" over "incredible + slow" every time.

SOFTNODE NOTE
Our median end-to-end latency across all voice calls in March 2026 was 760ms (p50) and 1,100ms (p95). We're targeting sub-700ms p50 by June with a new VAD model and regional TTS edge caching.
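For readers tracking their own numbers: p50/p95 figures like these come straight from raw per-call latency samples. A minimal sketch using the standard library (the sample values below are made up):

```python
# Compute p50/p95 from raw per-call end-to-end latency samples.
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return (p50, p95) using inclusive quantiles over 100 cuts."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]   # 50th and 95th percentile

samples = [620, 700, 740, 760, 790, 830, 900, 980, 1050, 1180]
p50, p95 = latency_percentiles(samples)
print(p50, p95)
```

Tracking p95 alongside p50 matters because callers judge you by your worst turns, not your typical ones.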

The Business Impact of 800ms vs 2,000ms

We ran an A/B test in February with one of our Turkish hair transplant clinics. Same agent script, same LLM, same voice. The only difference: one variant had an artificial 1,200ms delay injected into every response.

Across those 400 calls, the delayed variant converted 13 percentage points fewer callers into bookings. That 1.2 second difference cost the clinic 13 percentage points of conversion. For a clinic doing 200 calls/month at $4,000 AOV, that's $416,000 in annual revenue lost to latency.

Voice AI Is Not a Feature — It's a Different Product

If you're evaluating AI agents for your SaaS, clinic, or service business, don't treat voice as a checkbox feature. Ask:

- What is your end-to-end latency at p50 and p95, measured on real calls?
- Was the platform built voice-first, or was voice added onto a text stack?
- Do you stream STT, LLM output, and TTS, or process each stage serially?

Text-only platforms will tell you "voice is also available." That's a red flag. Voice isn't an add-on — it's a ground-up architectural decision. If they built the platform for text and bolted voice on later, the latency will betray that.

At Softnode, we built voice-first because we knew the clinics, SaaS companies, and service businesses we serve can't afford to lose customers to dead air. Every millisecond is a micro-decision point where the caller either trusts the agent or reaches for the hang-up button.

We optimized to keep callers from ever reaching that button. You should too.

Try a Voice Agent That Actually Responds

Softnode voice and chat agents answer in under 800ms, in your customer's language, 24/7. Set up in 5 minutes, no credit card required.

Start Free Trial
Engin Ferahli · Founder, Softnode.ai