It's 2:17am. A customer calls your support line. The AI agent picks up. Three hundred milliseconds later, they hear "Hello, how can I help you today?" The customer doesn't notice the gap. The conversation feels human.
Now imagine that same call with 2.5 seconds of silence. The customer is already checking their phone to see if the call dropped. By the time the agent speaks, trust is gone.
In voice AI, latency isn't a technical curiosity—it's the entire user experience. Text chat can hide processing time behind typing indicators. Voice cannot. Every millisecond of silence is audible, awkward, and expensive.
The Voice AI Latency Stack
When someone speaks to an AI agent, four things have to happen in sequence:
- Speech-to-text (STT) transcribes the audio into text—typically 80-150ms with streaming STT like Deepgram or AssemblyAI.
- LLM inference processes the intent and generates a response—100-800ms depending on model size, tokens, and whether you're streaming.
- Text-to-speech (TTS) converts the response back into audio—50-200ms for first-byte latency with modern neural TTS.
- Network round-trips add overhead at every step—20-100ms depending on geography and CDN.
Add those up and you're looking at 250ms on a very good day, or 1,200ms+ if any component is slow. Most conversational AI systems today land between 400ms and 700ms end-to-end.
Under 500ms, users perceive the agent as responsive. Over 800ms, they perceive it as broken.
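As a sanity check, the stack above can be summed into a rough budget. A minimal sketch, using the illustrative ranges from this post rather than measured numbers:

```python
# Back-of-envelope latency budget for a single voice turn.
# Ranges are the illustrative (low_ms, high_ms) figures from the stack above.

STACK_MS = {
    "stt_streaming": (80, 150),    # speech-to-text, streaming
    "llm_inference": (100, 800),   # intent + response generation
    "tts_first_byte": (50, 200),   # neural TTS first-byte latency
    "network": (20, 100),          # round-trip overhead across hops
}

def budget(stack: dict) -> tuple:
    """Return (best_case_ms, worst_case_ms) for the whole pipeline."""
    best = sum(lo for lo, _ in stack.values())
    worst = sum(hi for _, hi in stack.values())
    return best, worst

best, worst = budget(STACK_MS)
print(f"best case: {best}ms, worst case: {worst}ms")
# prints: best case: 250ms, worst case: 1250ms
```

The best case lands exactly on the 250ms "very good day" figure, and the worst case shows how quickly one slow component pushes you past the 800ms "broken" threshold.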
Why Text-Only Chatbots Hide This Problem
Most AI chat widgets—Intercom, Drift, Tidio, Crisp—don't expose latency because text interfaces have built-in forgiveness. A typing indicator buys you 2-3 seconds before users get impatient. You can batch tokens and show them all at once. You can fake "thinking."
Voice AI has no such mercy. Silence is the only indicator, and silence feels like failure. This is why we obsess over latency at Softnode: AI agents that speak can't fake it.
We use OpenAI's tts-1 model with the nova voice for sub-200ms first-byte TTS latency in production. Paired with streaming STT and optimized LLM calls, our agents consistently hit 350-500ms end-to-end response times across 8 languages.
What Causes Voice AI to Feel Slow
If you've ever built or tested a voice agent and felt it was "off," latency is usually the culprit. Here are the common bottlenecks:
1. Non-streaming STT or TTS. Waiting for the entire user utterance before starting transcription adds 500ms+ of dead time. Streaming STT (Deepgram, AssemblyAI) fixes this.
2. Cold LLM API calls. First request to a model can take 1-2 seconds if the endpoint is cold. Keep-alive pings or reserved capacity help, but cost money.
3. Multi-hop tool use. If your agent needs to call an external API (check inventory, pull CRM data), every network round-trip adds 100-300ms. Cache aggressively.
4. Geographic distance. An EU customer calling a US-hosted agent can add 80-120ms per leg. Deploy TTS/STT endpoints regionally if you serve global users.
The best voice AI agents feel like talking to a human who's fully present. That presence is a function of latency, not just personality.
How We Think About Latency at Softnode
We treat voice response time as a core product metric, not an infra afterthought. Every agent deployment gets latency monitoring in production. If P95 response time crosses 600ms, we investigate.
Our architecture choices reflect this obsession:
- Streaming STT + TTS everywhere, no batch processing on the critical path.
- Regional TTS inference in US-East, EU-Central, and Asia-Pacific to keep network hops under 50ms.
- LLM prompt caching so repeated intents ("What are your hours?") hit warm cache instead of cold inference.
- Tool call result caching for common queries (pricing, availability) that don't change minute-to-minute.
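The regional-routing idea above reduces to a lookup at call setup. A minimal sketch, assuming hypothetical per-region hostnames; real deployments would typically use latency-based DNS routing instead of a static map:

```python
# Hypothetical per-region TTS endpoints, one per deployment region.
REGIONAL_TTS = {
    "us": "tts.us-east.example.com",
    "eu": "tts.eu-central.example.com",
    "ap": "tts.ap-southeast.example.com",
}

# Partial country -> region map for illustration.
COUNTRY_TO_REGION = {
    "US": "us", "CA": "us", "MX": "us",
    "DE": "eu", "FR": "eu", "GB": "eu",
    "JP": "ap", "SG": "ap", "AU": "ap",
}

def tts_endpoint(country_code: str) -> str:
    """Pick the nearest TTS endpoint to keep network hops short."""
    region = COUNTRY_TO_REGION.get(country_code.upper(), "us")
    return REGIONAL_TTS[region]
```

Routing an EU caller to `eu-central` instead of `us-east` recovers the 80-120ms per leg mentioned earlier, for the cost of one dictionary lookup.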
For a solo SaaS founder, this matters even more than for enterprises. You don't have a 24/7 support team to catch frustrated users. Your AI agent is the first impression. If it feels sluggish, the customer leaves. If it feels instant, they convert.
Measuring Latency in Your Own Voice Agents
If you're building or evaluating voice AI, instrument these three metrics:
- Time to first audio byte (TTFAB): how long from user silence → agent starts speaking. Target: <400ms.
- End-to-end turn latency: user stops talking → agent finishes response. Target: <3 seconds for a one-sentence reply.
- Perceived responsiveness: qualitative—does it feel like talking to a person? Run user tests. Record calls. Listen for awkward gaps.
Most voice platforms won't give you millisecond-level telemetry out of the box. You'll need to log timestamps at each stage (STT done, LLM done, TTS started) and calculate deltas. We expose this in Softnode's agent analytics dashboard because it's that important.
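The timestamp-and-delta approach can be sketched in a few lines. This assumes you can capture a monotonic clock reading at each stage boundary; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    """Monotonic clock readings (seconds) logged at each pipeline stage."""
    user_stopped: float    # end of user speech detected
    stt_done: float        # final transcript available
    llm_done: float        # response text ready
    tts_first_byte: float  # first audio byte sent to the caller
    tts_done: float        # agent finished speaking

def latency_report(t: TurnTimestamps) -> dict:
    """Per-stage deltas in milliseconds, plus the two headline metrics."""
    ms = lambda start, end: round((end - start) * 1000, 1)
    return {
        "stt_ms": ms(t.user_stopped, t.stt_done),
        "llm_ms": ms(t.stt_done, t.llm_done),
        "tts_ms": ms(t.llm_done, t.tts_first_byte),
        "ttfab_ms": ms(t.user_stopped, t.tts_first_byte),  # target < 400ms
        "turn_ms": ms(t.user_stopped, t.tts_done),         # target < 3000ms
    }
```

Log one `TurnTimestamps` per turn and you can compute P50/P95 for every stage, which tells you exactly which component to fix when the headline number drifts.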
The 2026 Latency Landscape
The good news: latency is improving fast. OpenAI's gpt-5.5 has lower TTFT (time to first token) than GPT-4. ElevenLabs and Azure TTS both ship sub-150ms first-byte latency now. Deepgram's Nova-2 STT is under 100ms streaming.
The bad news: most companies still ship voice AI with 1+ second lag because they treat it like a chatbot with audio I/O. They batch. They wait. They add tool calls without caching. The tech is ready; the discipline isn't.
If you take one thing from this post, take this: voice AI is not a feature you bolt onto a text chatbot. It's a different discipline, and latency is the difference between a product that delights and one that frustrates.
Ship Voice AI That Feels Instant
Softnode agents are live in under 5 minutes, with sub-500ms response time in 8 languages. No engineering team required.
Start Free Trial