A patient calls your clinic at 11:42pm. Your AI voice agent picks up in two rings. She asks about appointment availability. The gap between her question and your agent's response is 1.8 seconds.
She hangs up. Not because the answer was wrong — she never heard it. The silence felt broken.
This is the latency problem in voice AI. And it's killing more conversations than bad transcription ever will.
The 300 Millisecond Rule
Human conversation has a natural rhythm: we expect responses within 200-400 milliseconds. Anything longer triggers discomfort. Linguists call it the "floor transfer offset" — the gap between one speaker stopping and another starting.
When you build AI voice agents for customer-facing scenarios — clinic intake, SaaS support, service booking — you're not competing against chatbots. You're competing against human phone conversations. Your latency budget is brutally tight:
- Speech-to-text (STT): 80-150ms for streaming transcription
- LLM inference: 200-800ms depending on model size and prompt complexity
- Text-to-speech (TTS): 100-300ms for first audio chunk
- Network overhead: 50-150ms round-trip
Add those up and you're looking at 430ms to 1,400ms total. The difference between the low end and high end is the difference between "seamless" and "this feels like a robot."
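Those ranges are worth sanity-checking against your own stack. A quick back-of-the-envelope script, using the component ranges from the list above (swap in measured numbers from your own providers):

```python
# Voice-agent latency budget, using the component ranges listed above (ms).
BUDGET_MS = {
    "stt": (80, 150),      # streaming speech-to-text
    "llm": (200, 800),     # inference; varies with model size and prompt
    "tts": (100, 300),     # time to first audio chunk
    "network": (50, 150),  # round-trip overhead
}

best = sum(lo for lo, _ in BUDGET_MS.values())
worst = sum(hi for _, hi in BUDGET_MS.values())
print(f"best case: {best}ms, worst case: {worst}ms")  # 430ms, 1400ms
```

Run it with your real numbers and you'll know immediately which side of the "seamless" line you're on.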
Why Latency Beats Accuracy in Voice
Accuracy is forgiving. Latency is not. If your voice agent mishears "Thursday" as "Tuesday," the customer will correct it immediately. Natural conversation has built-in error recovery — humans do it constantly.
But if your agent pauses for 2 seconds before every response? There's no recovery. The conversation feels wrong from the first exchange. The customer's cognitive load spikes. Trust drops. Hang-up rates climb.
A 95% accurate agent with 400ms latency will outperform a 99% accurate agent with 1,200ms latency every single time.
We've seen this in production across Turkish, English, and Czech voice deployments. When we optimized one clinic's agent from 980ms average latency down to 340ms, their completion rate (calls that reached booking) jumped from 62% to 81%. Same model. Same accuracy. Just faster.
The Stack Matters: Where Latency Hides
Most voice AI latency isn't in the models — it's in the architecture. If you're building a voice agent by chaining together separate API calls (Deepgram → GPT-4 → ElevenLabs), you're adding network round-trips at every step.
Here's what we learned building Softnode's voice pipeline:
- Use streaming everywhere. Don't wait for full-utterance transcription. Start LLM inference as soon as you detect sentence boundaries (see the sketch after this list).
- Parallelize where possible. You can often start TTS on the first sentence of a response while the LLM is still generating the rest.
- Regional deployment matters. A customer in Istanbul talking to a server in us-east-1 pays 150ms+ in latency just from physics. We run inference in eu-central-1 for European customers.
- Pick fast TTS. OpenAI's tts-1 with the nova voice gives us 120-180ms to first audio chunk. ElevenLabs sounds slightly more natural but adds 200ms+ on average. For customer-facing voice, we pick speed.
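Here's a minimal sketch of the streaming-and-parallelizing pattern. The stt_stream, llm_stream, and tts_speak functions are hypothetical stand-ins for your STT/LLM/TTS clients, not a real SDK; what matters is the control flow: start responding at the first sentence boundary, and speak finished sentences while the model is still generating.

```python
import asyncio
import re

# Hypothetical stand-ins for streaming STT/LLM/TTS clients (not a real SDK).
# Each one streams results incrementally instead of returning a full payload.

async def stt_stream(audio: bytes):
    # A real streaming STT client yields growing partial transcripts.
    for partial in ["when can", "when can I book", "when can I book a visit?"]:
        await asyncio.sleep(0.05)
        yield partial

async def llm_stream(prompt: str):
    # A real LLM client yields tokens as they are generated.
    for word in "We have openings Thursday at 2pm. Want me to book one?".split():
        await asyncio.sleep(0.02)
        yield word + " "

async def tts_speak(sentence: str):
    # A real TTS client starts playback on the first audio chunk.
    print(f"[speaking] {sentence.strip()}")

SENTENCE_END = re.compile(r"[.!?]\s")

async def handle_turn(audio: bytes) -> None:
    # 1. Streaming: don't wait for the full utterance. Hand the transcript
    #    to the LLM as soon as a sentence boundary appears.
    transcript = ""
    async for partial in stt_stream(audio):
        transcript = partial
        if SENTENCE_END.search(transcript + " "):
            break

    # 2. Parallelizing: speak each finished sentence while the LLM is still
    #    generating the rest of the response.
    buffer, tasks = "", []
    async for token in llm_stream(transcript):
        buffer += token
        if match := SENTENCE_END.search(buffer):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            tasks.append(asyncio.create_task(tts_speak(sentence)))
    if buffer.strip():
        tasks.append(asyncio.create_task(tts_speak(buffer)))
    await asyncio.gather(*tasks)

asyncio.run(handle_turn(b"caller audio"))
```

The overlap is the whole trick: chain the same providers with blocking calls and all three stages run in series, which is exactly where the extra round-trips pile up.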
Our production pipeline: OpenAI's tts-1 model with the nova voice, streaming STT via Deepgram, and GPT-4o-mini for most conversational turns (escalating to GPT-4o only when tool use or complex reasoning is required). Setup takes 5 minutes.

The Solo Founder Latency Test
If you're building a voice AI product, here's the simplest quality test: Call your own agent. Have a normal conversation. Count the seconds of silence after you finish speaking.
If it feels awkward to you — the person who built it, who knows it's AI, who's rooting for it to work — it will feel 10x more awkward to a customer who just wants to book an appointment.
Latency is the thing your customers can't articulate but will absolutely feel. It's why they hang up. It's why they say "I'll just use the website." It's why voice AI demos impress but production deployments disappoint.
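If you want a number instead of a gut feel, record a test call and measure the longest silence. A rough sketch, assuming a mono 16-bit WAV recording; the filename and amplitude threshold are placeholders to tune for your line:

```python
import wave

import numpy as np

def longest_pause_seconds(path: str, threshold: int = 500) -> float:
    # Load a mono 16-bit WAV recording of the call (int32 to avoid overflow).
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.int32)

    # Classify 20ms windows as speech or silence by peak amplitude.
    window = rate // 50
    loud = [np.abs(samples[i:i + window]).max() > threshold
            for i in range(0, len(samples) - window, window)]

    # The longest run of silent windows is the pause your caller sat through.
    longest = run = 0
    for is_loud in loud:
        run = 0 if is_loud else run + 1
        longest = max(longest, run)
    return longest * window / rate

print(f"worst pause: {longest_pause_seconds('test_call.wav'):.2f}s")
```

If the worst pause lands much above the 400ms ceiling from earlier, that's the silence your caller is hearing.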
Voice-First Means Speed-First
Text-only chat widgets don't have this problem. A 2-second delay in a chat message is barely noticeable. You can use slower, more accurate models. You can add multiple retrieval steps. You can overthink the response.
Voice doesn't give you that luxury. When you commit to AI voice agents — real-time, spoken, phone-call-quality interactions — you're committing to a much harder performance bar.
This is why most "AI chatbot" platforms (Intercom, Drift, Tidio, Crisp) don't offer real voice. It's not a feature you bolt on. It's a different architecture, a different stack, a different set of tradeoffs.
At Softnode, we built voice-first from day one. Our agents speak Turkish, English, and Czech with the same sub-400ms latency target. They handle clinic intake, SaaS support, and service booking — scenarios where hanging up is one awkward pause away.
Because in voice AI, speed isn't a nice-to-have. It's the entire experience.
Build voice agents that feel instant
Softnode's AI voice and chat agents keep latency under 400ms, speak your customer's language, and set up in 5 minutes. No engineering required.
Start free trial