It's 2:14am. A SaaS customer hits your site, clicks the voice button, asks "Can I upgrade my plan right now?" and waits. One second. Two seconds. Three seconds. They close the tab.
You just lost a sale to latency. Not because your AI was dumb. Not because your product was wrong. Because your voice agent felt broken.
In human conversation, the gap between one speaker finishing and the next starting averages roughly 200ms. When an AI voice agent takes longer than about 300ms to start responding, users perceive it as lagging, confused, or, worst of all, not listening. And for voice AI, perception is reality.
The Five Layers of Voice AI Latency
Most people think latency is just "the API call." It's not. Every voice interaction goes through at least five distinct bottlenecks:
- Network RTT: Audio packets traveling from the user's device to your server. 50-150ms depending on geography.
- Speech-to-Text (STT): Transcribing audio into text. OpenAI Whisper averages 200-400ms; Deepgram can hit sub-100ms with streaming.
- Intent & LLM processing: Running the transcribed text through your agent logic and LLM. 300-800ms for GPT-4, 100-200ms for GPT-4o-mini.
- Text-to-Speech (TTS): Generating audio from the AI's response. OpenAI tts-1 with nova voice is ~200ms for first audio chunk; ElevenLabs can run 400-600ms.
- Network return: Streaming audio back to the user. Another 50-150ms.
Add them up. Taking the slower option at each stage, you're looking at roughly 800ms to 2 seconds per turn in a typical non-optimized stack. That's enough time for a user to decide your product is broken.
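To make the arithmetic concrete, here is a minimal sketch that sums the ranges quoted above. The stage names are ours, and it assumes every stage runs strictly one after another with the slower option at each step (Whisper, GPT-4, and a TTS range that spans tts-1 up to the ElevenLabs upper bound).

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
# Keys are illustrative labels, not API names.
STAGES_MS = {
    "network_rtt": (50, 150),
    "stt_whisper": (200, 400),
    "llm_gpt4": (300, 800),
    "tts": (200, 600),          # ~200 for tts-1 first chunk, up to 600 for ElevenLabs
    "network_return": (50, 150),
}

def total_range(stages):
    """Sum best and worst case, assuming the stages never overlap."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high

low, high = total_range(STAGES_MS)
print(f"sequential turn latency: {low}-{high} ms")  # sequential turn latency: 800-2100 ms
```

The only way to get well below that floor is to swap in faster components or let stages overlap, which is what the rest of this post is about.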
Why Voice Latency Matters More Than Text Chat
In text chat, users expect to wait. They're typing. They're multitasking. A 2-second delay feels normal because they just spent 5 seconds composing the message.
Voice is different. When you speak out loud, your brain expects an immediate response. Silence after a question triggers social anxiety. It's the same reason Zoom lag is exhausting—your conversational rhythm breaks.
"If your voice agent doesn't start speaking within 300ms of silence, users assume it didn't hear them. They repeat themselves. The interaction collapses."
This is why text-only chat widgets from Intercom, Drift, or Crisp can get away with slower backends. Voice can't. Voice AI agents have to be architected for real-time from day one.
How Softnode Keeps Voice Response Under 400ms
We've obsessed over every millisecond. Here's the technical stack we use to keep voice interactions feeling instant:
- Streaming STT: We use Deepgram's streaming endpoint, which returns partial transcriptions as the user speaks. No waiting for silence detection.
- Parallel intent detection: While the user is still finishing their sentence, we're already querying context (CRM, docs, knowledge base). By the time STT is done, we have 90% of what we need.
- GPT-4o-mini for most turns: We reserve GPT-4 for complex decisions. For "What are your hours?" or "Can you book me for Tuesday?" GPT-4o-mini at 100-150ms is perfect.
- OpenAI tts-1 with nova voice: Fast, natural-sounding, and crucially—streamable. We start playing audio to the user before the entire response is generated.
- Edge caching for common queries: "What's your pricing?" doesn't need to hit the LLM every time. We cache voice responses for high-frequency questions and serve them in under 50ms.
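Two of these ideas, overlapping the context lookup with the tail of STT and short-circuiting on a cached answer, can be sketched with plain asyncio. This is an illustration under stated assumptions, not Softnode's production code: `finish_transcript`, `fetch_context`, `AUDIO_CACHE`, and the sleep durations are hypothetical stand-ins for real STT, CRM, and cache calls.

```python
import asyncio

EVENTS = []  # records step order so the overlap is visible

# Hypothetical cache of pre-rendered audio for high-frequency questions.
AUDIO_CACHE = {"what's your pricing?": b"cached-audio"}

def normalize(text):
    """Lowercase and collapse whitespace so near-identical questions share a key."""
    return " ".join(text.lower().split())

async def finish_transcript(final_text):
    """Stand-in for the tail of streaming STT."""
    EVENTS.append("stt:start")
    await asyncio.sleep(0.05)  # simulated delay, arbitrary
    EVENTS.append("stt:done")
    return final_text

async def fetch_context(partial_text):
    """Stand-in for CRM / docs lookups keyed off a partial transcript."""
    EVENTS.append("context:start")
    await asyncio.sleep(0.05)  # simulated delay, arbitrary
    EVENTS.append("context:done")
    return {"hint": partial_text}

async def handle_turn(partial_text, final_text):
    # Run both coroutines concurrently instead of waiting for STT first.
    transcript, context = await asyncio.gather(
        finish_transcript(final_text), fetch_context(partial_text)
    )
    cached_audio = AUDIO_CACHE.get(normalize(transcript))
    if cached_audio is not None:
        return {"source": "cache", "audio": cached_audio}  # skip the LLM entirely
    return {"source": "llm", "transcript": transcript, "context": context}

result = asyncio.run(handle_turn("can I upgrade", "Can I upgrade my plan right now?"))
print(result["source"], EVENTS)
```

Because the two awaits run concurrently, the turn costs roughly the slower of the two steps rather than their sum. A real system would also need to handle the case where the partial transcript turns out to be misleading and the prefetched context has to be discarded.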
The Latency Budget Every Voice AI Product Needs
If you're building or buying voice AI, demand a latency budget. Here's ours:
- STT: <100ms (streaming)
- Intent + LLM: <150ms (cached or mini model for simple turns)
- TTS first chunk: <200ms
- Network overhead: ~50ms (CDN + WebRTC)
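- Total target: <400ms to first audio. The per-stage ceilings above sum to about 500ms, so hitting this target depends on stages overlapping (streaming STT, prefetched context) rather than each one spending its full allowance.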
Anything over 500ms and you're in the danger zone. Over 1 second and you might as well not offer voice at all.
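A budget is only useful if something checks it. Here is a minimal sketch of a per-turn check against the numbers above. The stage keys and the sample trace are made up for illustration, and the limits are treated as inclusive for simplicity. The total is measured separately as wall-clock time to first audio, since the per-stage ceilings add up to 500ms while the overall target is 400ms.

```python
# Per-stage ceilings in milliseconds, taken from the budget above.
BUDGET_MS = {"stt": 100, "llm": 150, "tts_first_chunk": 200, "network": 50}
TOTAL_TARGET_MS = 400

def check_trace(trace_ms, total_ms):
    """Return human-readable budget violations for one turn.

    trace_ms: measured duration per stage.
    total_ms: measured wall-clock time from end of speech to first audio.
    """
    problems = [
        f"{stage}: {trace_ms[stage]}ms exceeds {limit}ms"
        for stage, limit in BUDGET_MS.items()
        if trace_ms.get(stage, 0) > limit
    ]
    if total_ms > TOTAL_TARGET_MS:
        problems.append(f"total: {total_ms}ms exceeds {TOTAL_TARGET_MS}ms")
    return problems

# Made-up trace: TTS is slow, but overlap keeps the total under target.
sample = {"stt": 90, "llm": 140, "tts_first_chunk": 230, "network": 45}
print(check_trace(sample, total_ms=380))  # ['tts_first_chunk: 230ms exceeds 200ms']
```

Running a check like this on every turn in staging makes regressions visible per stage, instead of surfacing only as a vague "voice feels slow" report.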
What This Means for Your Product
If you're a solo founder or SaaS team evaluating voice AI vendors, ask these questions:
- What's your P95 latency for STT → first audio?
- Do you support streaming TTS, or do you wait for the full response?
- Can I see a latency trace for a sample interaction?
- What model are you using for LLM inference, and can I control it?
If the vendor can't answer these, they haven't thought about real-time voice. They're building a text chatbot with voice bolted on.
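If you want to verify a vendor's P95 claim yourself, the calculation is simple. Below is a minimal sketch using the nearest-rank method (one of several common percentile definitions). The latency samples are invented for illustration and stand for time from end of speech to first audio.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Made-up latencies in milliseconds for 20 turns, including two slow outliers.
latencies = [310, 295, 340, 1200, 320, 305, 330, 315, 300, 325,
             335, 290, 345, 350, 298, 312, 322, 318, 308, 980]

print("P50:", percentile(latencies, 50))  # P50: 318
print("P95:", percentile(latencies, 95))  # P95: 980
```

The median here looks healthy, while the P95 shows that one turn in twenty takes about a second. That gap is why the question asks for P95 rather than an average.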
Voice-first AI agents require voice-first architecture. You can't fake it. You can't "add voice later" to a text-based system and expect it to feel natural. The latency constraints are fundamentally different.
The 300ms Rule
Here's the heuristic we use internally: If a human couldn't respond that fast, your AI doesn't need to either. But if a human could respond that fast ("What time do you close?"), your AI must match or beat it.
300ms is the threshold. Faster than that, users perceive the agent as sharp and responsive. Slower, and you're in the uncanny valley of "is this thing working?"
Latency isn't a nice-to-have. It's the difference between a voice AI agent that converts and one that frustrates. Build for speed, or don't build voice at all.