Why Voice AI Latency Kills Conversions (And How We Got It Under 800ms)

A customer calls your clinic at 2:13am. Your AI voice agent picks up. There's a pause. Then another. By second three, they've hung up.

Voice AI lives and dies by latency. Text chat can get away with a two-second delay—users expect to wait for typing. But in voice? Anything over a second feels broken. Over two seconds feels like the line dropped.

We've spent six months obsessing over every millisecond in our voice AI stack. Here's what we learned, what broke, and how we got total round-trip response time under 800ms for 90% of queries.

The Latency Budget: Where Every Millisecond Goes

A voice AI interaction has four choke points:

Speech-to-text (STT): User speaks → transcription. Typically 200-400ms with streaming models.
LLM inference: Transcribed text → agent decides what to say. 300-700ms depending on prompt size and model.
Text-to-speech (TTS): Text response → audio. 150-400ms for the first audio chunk.
Network + codec overhead: Audio streaming, WebRTC handshakes, packet loss recovery. 100-200ms baseline.

Add those up and you're looking at 750ms in the best case and 1,700ms in the worst. That's why most voice AI demos feel sluggish: nobody's optimizing the full pipeline.
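To make the arithmetic concrete, here is a minimal sketch that sums the four stage ranges listed above. It assumes the stages run strictly one after another with no overlap, and the names (BUDGET_MS, total_range, over_target) are illustrative, not part of any real tooling.

```python
# Per-stage latency ranges in milliseconds, copied from the four choke points above.
BUDGET_MS = {
    "stt": (200, 400),
    "llm": (300, 700),
    "tts_first_chunk": (150, 400),
    "network": (100, 200),
}


def total_range(budget):
    """Sum the lower and upper bounds, assuming the stages run back to back."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_target(budget, target_ms):
    """Worst-case milliseconds that must be removed to meet target_ms."""
    _, high = total_range(budget)
    return max(0, high - target_ms)


low, high = total_range(BUDGET_MS)
print(f"sequential total: {low}-{high} ms")
print(f"worst case exceeds an 800 ms target by {over_target(BUDGET_MS, 800)} ms")
```

Even the best-case total leaves only 50ms of headroom under an 800ms target, which is why overlapping the stages matters more than shaving any single one.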

Why Text-Only Chatbots Don't Prepare You for This

Text chat hides latency crimes. A typing indicator buys you three seconds of user patience. You can run a complex RAG query, invoke three tools, regenerate twice—nobody notices.

Voice doesn't give you that cushion. Silence is violence. A two-second pause mid-conversation signals "the system is confused" or "the call dropped." Users don't wait. They hang up.

This is why we built Softnode voice-first. Competitors bolt voice onto text-chat architectures and wonder why adoption is low. The stack is different. The latency budget is different. The user expectation is completely different.

"In voice AI, anything over 1 second of silence feels like a technical failure, not a thoughtful pause."

How We Cut 600ms Without Sacrificing Quality

We made four architectural changes that collapsed our P90 response time from 1.4 seconds to 780ms:

1. Streaming TTS from first token. We switched from waiting for the full LLM response to streaming TTS as soon as the first few tokens arrive. OpenAI's tts-1 model with the nova voice supports chunk-by-chunk generation. The user hears the first syllable while the LLM is still thinking about the end of the sentence.

2. Prompt preloading and context caching. For high-traffic agents (clinic booking, SaaS onboarding), we preload the system prompt and knowledge base into the LLM context before the call connects. OpenAI's prompt caching cuts 200-300ms off cold-start inference.

3. Regional TTS endpoints. We run TTS inference from the edge region closest to the caller. A user in Prague hits our EU-central instance; someone in San Francisco hits US-west. Cuts 80-150ms of network overhead.

4. Predictive turn-taking. We start transcribing and queuing the LLM call while the user is still finishing their sentence. Voice activity detection (VAD) gives us a 200ms head start. Risky if you jump the gun, but with good VAD tuning it's nearly invisible.
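One way to picture the first change is a small buffer that sits between the LLM token stream and the TTS request and releases text at sentence boundaries. The sketch below is an assumption about how such a buffer could look, not our production code: speakable_chunks and min_chars are hypothetical names, the boundary rule is deliberately naive (it would split after "Dr."), and the actual TTS call is left out.

```python
import re
from typing import Iterable, Iterator, Optional

# A sentence boundary here means ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s+")


def _next_cut(text: str, min_chars: int) -> Optional[int]:
    """Return the index just past the first boundary that yields a chunk of min_chars or more."""
    for match in _SENTENCE_END.finditer(text):
        if match.end() >= min_chars:
            return match.end()
    return None


def speakable_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield text for TTS as soon as a full sentence has arrived in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            cut = _next_cut(buffer, min_chars)
            if cut is None:
                break
            yield buffer[:cut].strip()
            buffer = buffer[cut:]
    # Flush whatever is left once the LLM stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


# Hypothetical token stream standing in for a streaming LLM response.
stream = ["Sure", ", I can", " book that", ". What", " day works", " for you", "?"]
for chunk in speakable_chunks(stream):
    print(chunk)  # each chunk would be handed to a TTS request here
```

The min_chars floor keeps very short fragments such as "Hi." from being synthesized on their own, trading a little latency for smoother prosody.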

SOFTNODE NOTE

We use OpenAI's tts-1 model with the nova voice for all production agents. It's fast (sub-300ms first-chunk), sounds natural in English/Turkish/Czech, and costs ~$15 per 1M characters. ElevenLabs is higher quality but adds 100-200ms latency—we only use it for pre-recorded greetings.

The Voice-First Difference in Real Numbers

We A/B tested two versions of the same agent on a Turkish hair transplant clinic's booking line over 4 weeks:

Version A (text-optimized stack, 1.6s avg response): 41% of callers hung up before completing the booking flow.
Version B (our optimized voice stack, 790ms avg response): 19% hang-up rate.

Same script. Same LLM. Same knowledge base. The only difference was latency. Cutting 810ms more than halved the hang-up rate and lifted call completion from 59% to 81%.
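For anyone who wants to check the numbers, this short sketch derives completion rates and relative changes from the two hang-up rates reported above. The function names are illustrative; the only inputs are the 41% and 19% figures from the test.

```python
def completion_pct(hangup_pct):
    """Callers who did not hang up, as a percentage of all callers."""
    return 100 - hangup_pct


def relative_change(before, after):
    """Change from before to after, as a fraction of before."""
    return (after - before) / before


hangup_a, hangup_b = 41, 19  # hang-up rates for Version A and Version B, in percent
done_a, done_b = completion_pct(hangup_a), completion_pct(hangup_b)

print(f"completion: {done_a}% -> {done_b}%")
print(f"hang-ups changed by {relative_change(hangup_a, hangup_b):+.1%}")
print(f"completions changed by {relative_change(done_a, done_b):+.1%}")
```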

This isn't an edge case. It's the entire game. If your voice AI feels slow, users will assume it's broken—even if it's giving perfect answers.

What This Means for SaaS Founders and Clinic Owners

If you're evaluating voice AI vendors, ask one question: "What's your P90 round-trip response time, measured from end-of-user-speech to first-audio-byte?"

If they don't know, or if the answer is over 1.2 seconds, walk away. Voice AI that feels slow is worse than no voice AI—it trains users not to trust your automation.
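If you want to verify a vendor's answer yourself, the metric is simple to compute from call logs. The sketch below assumes each turn is logged with two timestamps (end of user speech and first response audio byte) and uses the nearest-rank definition of a percentile. The Turn type, its field names, and the sample latencies are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Turn:
    speech_end_ms: float   # when the caller stopped speaking
    first_audio_ms: float  # when the first byte of response audio was sent


def response_latency_ms(turn):
    """Round-trip latency as defined above: end of speech to first audio byte."""
    return turn.first_audio_ms - turn.speech_end_ms


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample data: ten turns, one of them a slow outlier.
SAMPLE_MS = [620, 700, 710, 740, 750, 760, 770, 780, 790, 1400]
turns = [Turn(1000.0, 1000.0 + delay) for delay in SAMPLE_MS]
latencies = [response_latency_ms(t) for t in turns]

print("mean:", sum(latencies) / len(latencies))
print("P50: ", percentile(latencies, 50))
print("P90: ", percentile(latencies, 90))
```

An average alone can hide or exaggerate outliers, so ask for the percentile and the exact start and stop points of the measurement.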

At Softnode, we show you latency metrics in real time on every call. Because agents that speak fast feel smarter, even when they're saying the same thing.

And in a world where your competitor's AI picks up in 800ms, a 2-second delay isn't a technical detail. It's a lost customer.

Want sub-second voice AI for your business?

Softnode agents answer in under 800ms, in your customer's language, 24/7. Set up in 5 minutes—no code, no integrations, just a script tag.

Engin Ferahli · Founder, Softnode.ai