A customer calls your clinic at 11:43pm. Your AI agent picks up. They ask, "Do you have appointments tomorrow?"
Two seconds pass.
In a text chat, two seconds is barely noticeable. On a phone call, it's an awkward silence. The customer assumes the line dropped. They hang up before your agent even starts speaking.
Voice AI lives or dies in the gap between speech-end detection and first audio byte. If you're building conversational agents, latency isn't a performance optimization — it's the entire product.
What Latency Actually Means in Voice AI
The total round-trip for a voice interaction has five stages:
- Voice Activity Detection (VAD): Detecting the user stopped speaking (20–100ms)
- Speech-to-Text (STT): Transcribing audio to text (150–400ms depending on model and streaming)
- LLM inference: Generating the response (200–800ms to first token for GPT-4-class models, 50–150ms for smaller models)
- Text-to-Speech (TTS): Rendering the reply as audio (100–300ms for first chunk, depending on streaming support)
- Network overhead: Round trips, buffering, jitter (50–200ms depending on geography and protocol)
Add those up. Best case: ~520ms. Typical real-world case: 1,200–1,800ms. That's one to two seconds of dead air after every customer sentence.
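To sanity-check that math, here's the same budget as a short script. The ranges are the illustrative figures from the list above, not measured benchmarks:

```python
# Rough per-stage latency budget in milliseconds, using the
# illustrative ranges quoted above (not measured benchmarks).
STAGES_MS = {
    "vad": (20, 100),
    "stt": (150, 400),
    "llm": (200, 800),   # GPT-4-class; smaller models run ~50-150ms
    "tts": (100, 300),   # time to first audio chunk
    "network": (50, 200),
}

best = sum(lo for lo, hi in STAGES_MS.values())
worst = sum(hi for lo, hi in STAGES_MS.values())
print(f"best case ~{best}ms, worst case ~{worst}ms")  # ~520ms / ~1800ms
```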
Humans expect turn-taking latency of 200–500ms in natural conversation. Anything over 600ms feels robotic. Over 1,000ms feels broken.
If your voice agent takes more than a second to respond, customers will assume it didn't hear them and talk over it. You've just created an interrupt loop.
Why Text Chat Metrics Don't Translate
Most SaaS founders come from text-first support tools. Intercom, Drift, Zendesk: all text. In that world, a 2–3 second response time is fine. Users are multitasking. They expect a little lag.
Voice removes that forgiveness buffer entirely. A phone call is synchronous. The customer isn't doing anything else. They're standing in their kitchen at midnight, phone to ear, waiting.
When we built Softnode's voice pipeline, we learned this the hard way. Our first prototype used a standard REST API pattern: record full utterance → upload WAV → wait for transcription → call LLM → wait for TTS → stream back audio. Total latency: 2.1 seconds average.
Callers hung up during the pause. We had a 34% hang-up rate before the agent's first reply.
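For the record, that first pipeline looked roughly like the sketch below. The function bodies are stubs with sleeps standing in for our actual vendor calls, but the shape, every stage blocking on the previous one, is accurate:

```python
import time

def transcribe(wav: bytes) -> str:
    time.sleep(0.40)                  # stand-in for a batch STT call
    return "do you have appointments tomorrow"

def generate_reply(text: str) -> str:
    time.sleep(0.80)                  # stand-in for a non-streaming LLM call
    return "Yes, we have openings at 9am and 2pm."

def synthesize(text: str) -> bytes:
    time.sleep(0.30)                  # stand-in for a full TTS render
    return b"<wav>"

def handle_turn(utterance_wav: bytes) -> bytes:
    # Every stage blocks on the previous one, so latencies add serially:
    # 400 + 800 + 300 ms of model time before the caller hears anything,
    # plus upload, download, and buffering on top.
    text = transcribe(utterance_wav)
    reply = generate_reply(text)
    return synthesize(reply)
```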
The Streaming Solution (and Its Limits)
The obvious fix: stream everything. Modern STT providers like Deepgram support streaming transcription (the hosted OpenAI Whisper API is batch-only, so streaming pipelines typically pair it with a realtime STT provider). GPT-4 supports streaming tokens. TTS engines like OpenAI's tts-1 and ElevenLabs support streaming audio chunks.
In theory, you can pipeline the whole flow: start TTS as soon as the LLM emits the first few tokens, start playing audio as soon as you receive the first chunk.
In practice, you're still bounded by the slowest stage. If your LLM takes 600ms to emit the first token, streaming TTS doesn't help. If your STT model waits for end-of-speech before returning anything, you've added 200–400ms of unavoidable latency.
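Here's a minimal sketch of that pipelining idea using async generators. `stream_tokens`, `stream_audio`, and `play` are stubs standing in for your streaming LLM, TTS, and telephony APIs; the flush-at-clause-boundary trick is the part that matters:

```python
import asyncio

async def stream_tokens(prompt: str):
    # Stand-in for a streaming LLM API: yields tokens with a delay.
    for tok in "Yes , we have openings at 9am and 2pm .".split():
        await asyncio.sleep(0.05)
        yield tok + " "

async def stream_audio(text: str):
    # Stand-in for a streaming TTS API: yields audio chunks.
    await asyncio.sleep(0.08)   # simulated time-to-first-chunk
    yield b"<pcm-chunk>"

async def play(chunk: bytes) -> None:
    # Stand-in for writing audio to the telephony stream.
    pass

async def handle_turn(transcript: str) -> None:
    buffer = []
    async for token in stream_tokens(transcript):
        buffer.append(token)
        # Flush to TTS at clause boundaries so the caller hears audio
        # while the LLM is still generating the rest of the reply.
        if token.rstrip().endswith((".", ",", "?", "!")):
            async for chunk in stream_audio("".join(buffer)):
                await play(chunk)
            buffer.clear()
    if buffer:   # flush whatever is left after the final token
        async for chunk in stream_audio("".join(buffer)):
            await play(chunk)

asyncio.run(handle_turn("do you have appointments tomorrow"))
```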
The real wins come from:
- Using faster base models (GPT-4o mini instead of full GPT-4, or fine-tuned Llama models)
- Aggressive VAD tuning to detect speech-end as early as possible without cutting off the speaker (see the endpointing sketch after this list)
- Choosing TTS engines that support true streaming (OpenAI's `tts-1` with the `nova` voice streams first audio in ~80ms)
- Geographic proximity: if your customer is in Prague and your inference runs in us-east-1, you've added 150ms of round-trip before you even start
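To make the VAD bullet concrete, here's a minimal endpointing check built on the open-source `webrtcvad` package. The key knob is how much trailing silence you require before declaring the turn over; the 240ms below is an illustrative starting point, not a recommendation:

```python
import webrtcvad

SAMPLE_RATE = 16000   # webrtcvad supports 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 20         # webrtcvad frames must be 10, 20, or 30 ms long
END_SILENCE_MS = 240  # illustrative: trailing silence that ends the turn

vad = webrtcvad.Vad(3)  # aggressiveness 0-3; higher trims silence faster

def is_end_of_turn(frames: list[bytes]) -> bool:
    """True once the last END_SILENCE_MS of audio contains no speech.

    Lowering END_SILENCE_MS shaves real latency off every turn, but go
    too low and you clip slow speakers mid-sentence. Tune it against
    recordings of real calls, not synthetic test audio.
    """
    needed = END_SILENCE_MS // FRAME_MS
    if len(frames) < needed:
        return False
    return not any(vad.is_speech(f, SAMPLE_RATE) for f in frames[-needed:])
```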
In production, Softnode uses `tts-1` with the `nova` voice for voice generation. Typical end-to-end latency: 480–620ms in the EU, 550–720ms in US East. We consider anything over 800ms a production incident.

The Hidden Cost: Interruptions and Barge-In
Low latency isn't just about feeling smooth. It's about making barge-in (user interrupting the agent) actually work.
If your agent has 1,500ms of lag, users will start talking again before the agent finishes. Your STT pipeline now has overlapping audio. Your agent is still responding to the previous question while the user asks a new one.
You end up with conversation drift: the agent says something, the user clarifies, the agent responds to the original question because the clarification hasn't been processed yet. It's a UX nightmare.
Fast agents allow natural interruptions. The user says "Wait—" and the agent stops. Just like a human conversation.
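Mechanically, barge-in is a cancellation problem. Here's a sketch assuming an asyncio pipeline, with `speak` and `user_started_speaking` as stubs for your TTS playback and VAD watcher:

```python
import asyncio

async def speak(text: str) -> None:
    # Stand-in for streaming TTS playback, one word at a time.
    for _ in text.split():
        await asyncio.sleep(0.2)

async def user_started_speaking() -> None:
    # Stand-in for the VAD watcher; here it fires after 500ms.
    await asyncio.sleep(0.5)

async def respond_with_barge_in(reply_text: str) -> None:
    playback = asyncio.create_task(speak(reply_text))
    interrupt = asyncio.create_task(user_started_speaking())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done:
        playback.cancel()   # stop mid-sentence, like a human being cut off
        # ...route the new user audio straight back into STT here
    else:
        interrupt.cancel()  # agent finished; stop watching for barge-in

asyncio.run(respond_with_barge_in("We have openings at 9am and 2pm tomorrow."))
```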
Voice-First Means Latency-First
Here's the thing most AI chat widget companies miss: voice isn't a feature you add to a text chatbot. It's a different product with different constraints.
Text chatbots can get away with 2-second response times, markdown formatting, and "let me look that up" stalling messages. Voice agents can't. Every millisecond of silence is a moment your customer might hang up.
If you're building AI agents for customer support, sales, or clinic intake, and you're not measuring P95 latency by pipeline stage, you're flying blind.
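The instrumentation is cheap. Record per-stage timings on every turn and compute P95 per stage, not just end-to-end; a sketch with made-up numbers:

```python
import statistics

# Per-stage timings in milliseconds, one dict per conversational turn.
# In production these come from your tracing pipeline; the numbers
# here are made-up placeholders.
turns = [
    {"vad": 35, "stt": 210, "llm": 340, "tts": 120, "network": 80},
    {"vad": 28, "stt": 450, "llm": 610, "tts": 140, "network": 95},
    {"vad": 41, "stt": 260, "llm": 920, "tts": 180, "network": 70},
]

def p95(values: list[float]) -> float:
    return statistics.quantiles(values, n=20)[-1]  # 95th-percentile cut

for stage in ("vad", "stt", "llm", "tts", "network"):
    print(f"{stage:8s} p95: {p95([t[stage] for t in turns]):.0f}ms")
```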
The bar is simple: 500ms or faster, or don't ship voice.
Ship voice agents that actually feel real
Softnode handles the latency stack for you — streaming STT, optimized inference, sub-600ms responses in 15+ languages. Set up your AI voice agent in five minutes.
Start free trial → softnode.ai