Why 400ms Kills Your Voice AI Agent (and How to Fix It)

It's 2:34am. A potential customer calls your support line. Your AI agent picks up on the second ring. They ask a question. Then they wait.

One second. Two seconds. At three seconds, they're already annoyed. At five, they've hung up.

Text chatbots get away with slow responses—users expect to wait a beat when typing. But voice is different. Voice AI lives or dies in the pause. If your agent takes more than 400 milliseconds to respond, it stops feeling like a conversation and starts feeling like a broken IVR system.

The Voice AI Latency Budget

Human conversation has a natural rhythm: 200-300ms between turns. When you ask someone a question face-to-face, they usually respond within a third of a second. Any longer and the silence becomes noticeable. Awkward.

For AI voice agents, you're working with a latency chain:

Speech-to-text (STT): 80-150ms for streaming recognition
LLM inference: 200-800ms depending on model size and prompt complexity
Text-to-speech (TTS): 50-200ms for first audio chunk
Network transit: 20-100ms each way

Add those up and you're already at 350-1,250ms, and that counts the network only once. Count both directions and it's 370-1,350ms. The slow end of that range is unacceptable. Your users will leave.
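
If you want to sanity-check that arithmetic, here is a minimal sketch that sums the stage ranges listed above. The stage names and the `network_legs` parameter are made up for illustration, and it assumes a strictly sequential pipeline with no overlap between stages.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
STAGES_MS = {
    "stt": (80, 150),
    "llm": (200, 800),
    "tts_first_chunk": (50, 200),
    "network_one_way": (20, 100),
}


def total_budget(stages, network_legs=1):
    """Return (best_case, worst_case) in ms for a strictly sequential pipeline."""
    low = high = 0
    for name, (lo, hi) in stages.items():
        # Count the network figure once per leg; every other stage counts once.
        legs = network_legs if name == "network_one_way" else 1
        low += lo * legs
        high += hi * legs
    return low, high


one_leg = total_budget(STAGES_MS, network_legs=1)
two_legs = total_budget(STAGES_MS, network_legs=2)
print(one_leg)   # (350, 1250)
print(two_legs)  # (370, 1350)
```

Even the best case leaves almost no headroom under a 400ms target, which is why the rest of the work is about overlapping stages instead of running them back to back.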

Where Most Voice Agents Fail

The biggest bottleneck is almost always the LLM. Developers pick GPT-4 because it's the "best" model, then wonder why their agent feels sluggish. GPT-4 can take 600-900ms to generate a response, especially with long context windows or complex system prompts.

Here's what kills latency:

Overloaded system prompts (1,500+ tokens of instructions)
Non-streaming LLM calls (waiting for the full response before TTS starts)
Cold starts on serverless functions
Sequential API calls instead of parallel processing
Using slower TTS models for "higher quality" that users can't distinguish in real time

The irony: most teams obsess over model accuracy and ignore the metric that actually determines whether users stay on the call.

SOFTNODE NOTE

We use OpenAI's gpt-4o-mini with streaming enabled and the tts-1 model with the nova voice. The first audio chunk arrives in under 200ms. The agent starts speaking before the LLM finishes generating, so users hear a response in 250-350ms total.

How to Architect for Sub-300ms Response Time

Stream everything. Don't wait for the LLM to finish the full response. Start sending text tokens to TTS as soon as the first few words arrive. The user hears the beginning of the answer while the model is still generating the end.
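
Here is one way the hand-off from LLM to TTS could look. It is a rough sketch, not a production pipeline: `fake_llm_stream` stands in for a real streaming LLM response, `phrases_from_tokens` is a hypothetical helper, and the sentence-boundary rule is deliberately naive (it will split on abbreviations like "Dr. ").

```python
import re
from typing import Iterable, Iterator

# Flush at sentence-ending punctuation so TTS receives speakable phrases.
_BOUNDARY = re.compile(r"[.!?]\s")


def phrases_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into phrases as soon as each one is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    # Whatever is left when the stream ends is the final phrase.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    # Stand-in for a streaming LLM response (assumption for this example).
    yield from ["Sure", ", I can", " help. ", "Your order", " ships", " Monday."]


spoken = []
for phrase in phrases_from_tokens(fake_llm_stream()):
    spoken.append(phrase)  # a real agent would hand the phrase to its TTS client here

print(spoken)  # ['Sure, I can help.', 'Your order ships Monday.']
```

The first phrase is ready for TTS after three fragments, long before the model has finished the second sentence.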

Use a faster base model. Smaller models such as gpt-4o-mini or claude-3-haiku are 3-5x faster than their larger siblings and roughly 90% as good for structured voice tasks. Users rarely notice the quality gap between a perfect answer delivered in 800ms and a great answer delivered in 250ms, but they do feel the extra wait.

Parallelize your API calls. If you need to check a database or call an external API, do it while the STT is still processing, not after. Use WebSockets or persistent connections to eliminate connection overhead.
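
As a sketch of what "while the STT is still processing" can mean in practice, the example below runs a lookup concurrently with transcription using `asyncio.gather`. Both `transcribe` and `fetch_customer` are hypothetical stand-ins that just sleep, and the approach assumes the lookup does not depend on the transcript (here it is keyed on the caller's number).

```python
import asyncio

events = []  # records the order in which things happen


async def transcribe(audio: bytes) -> str:
    # Hypothetical stand-in for streaming STT finalising a transcript.
    events.append("stt_start")
    await asyncio.sleep(0.05)
    events.append("stt_end")
    return "where is my order"


async def fetch_customer(caller_id: str) -> dict:
    # Hypothetical stand-in for a database or CRM lookup keyed by caller ID.
    events.append("lookup_start")
    await asyncio.sleep(0.02)
    events.append("lookup_end")
    return {"caller_id": caller_id, "name": "Sam"}


async def handle_turn(audio: bytes, caller_id: str):
    # Both coroutines run concurrently, so the wait is roughly the slower one,
    # not the sum of the two.
    transcript, customer = await asyncio.gather(
        transcribe(audio), fetch_customer(caller_id)
    )
    return transcript, customer


transcript, customer = asyncio.run(handle_turn(b"\x00\x01", "+15550100"))
print(events)  # ['stt_start', 'lookup_start', 'lookup_end', 'stt_end']
```

By the time the transcript is final, the customer record is already in hand and can go straight into the LLM prompt.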

Trim your system prompt. Every extra token adds latency. A 2,000-token system prompt can add 100-200ms compared to a 400-token one. Be ruthless: does the agent actually need that instruction, or are you over-engineering?
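
To put a rough number on a prompt trim, you can turn the figures above into a per-token estimate. This sketch assumes the added latency scales linearly with prompt length, which is a simplification, and the function names are illustrative only. Measure your own stack before trusting the output.

```python
def added_latency_per_token_ms(short_tokens, long_tokens, added_ms_range):
    """Spread an observed latency increase evenly across the extra prompt tokens."""
    extra_tokens = long_tokens - short_tokens
    low_ms, high_ms = added_ms_range
    return low_ms / extra_tokens, high_ms / extra_tokens


def estimated_savings_ms(tokens_removed, per_token_range):
    """Estimate the latency saved by cutting tokens (linear assumption)."""
    low, high = per_token_range
    return tokens_removed * low, tokens_removed * high


# Figures from the text: 2,000 tokens vs 400 tokens adds 100-200ms.
per_token = added_latency_per_token_ms(400, 2000, (100, 200))
savings = estimated_savings_ms(800, per_token)

print(per_token)  # (0.0625, 0.125)
print(savings)    # (50.0, 100.0)
```

Under that assumption, cutting 800 tokens of instructions buys back somewhere between 50 and 100ms.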

Latency is a feature. If your voice agent responds in under 300ms, users describe it as "shockingly human." At 600ms, they say it's "pretty good." At 1,000ms, they hang up.

Why Text-Only Chatbots Don't Teach You This

Most SaaS chat tools (Intercom, Drift, Tidio, Crisp) are text-only, so they never had to solve for conversational latency. A chat widget can take two seconds to respond and users barely notice because they're multitasking, checking another tab, scrolling.

Voice is unforgiving. There's no other tab. The user is holding a phone to their ear, waiting. Silence is loud.

If you're building voice AI, you need to think like a telecom engineer, not a chatbot vendor. Measure p50, p95, and p99 latency. Set up alerts when response time exceeds 400ms. Treat every millisecond like it matters—because it does.
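
A minimal sketch of that measurement, using the nearest-rank method for percentiles. The sample values are invented, and alerting on p95 specifically is an assumption for this example. You might prefer to alert on p99 or on individual slow calls.

```python
import math

ALERT_THRESHOLD_MS = 400  # threshold named in the text


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples):
    """Summarise response times and flag when p95 exceeds the threshold."""
    report = {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
    report["alert"] = report["p95"] > ALERT_THRESHOLD_MS
    return report


# Made-up response times in ms for ten calls; one slow outlier.
samples_ms = [210, 240, 250, 260, 280, 290, 310, 330, 350, 900]
report = latency_report(samples_ms)
print(report)  # {'p50': 280, 'p95': 900, 'p99': 900, 'alert': True}
```

Notice that the median looks healthy at 280ms while the tail is awful. That is the caller who hangs up, and an average would have hidden them.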

The Boring Truth About Latency Optimization

It's not one big fix—it's twenty small ones. You shave 30ms by switching TTS models. Another 50ms by enabling streaming. Another 80ms by moving your inference to the same region as your users. Another 40ms by caching frequent responses.
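
Caching is the easiest of these to sketch. The class below is a hypothetical in-memory cache keyed on a normalised question, with a time-to-live so stale answers expire. It assumes exact-match lookups after lowercasing and whitespace cleanup, and a real deployment would likely need something smarter for paraphrases.

```python
import time


class ResponseCache:
    """Tiny in-memory cache keyed on a normalised question, with a time-to-live."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    @staticmethod
    def _key(question):
        # Collapse case and whitespace so trivial variations share an entry.
        return " ".join(question.lower().split())

    def get(self, question):
        key = self._key(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            # Entry is too old: drop it and report a miss.
            del self._entries[key]
            return None
        return answer

    def put(self, question, answer):
        self._entries[self._key(question)] = (self._clock(), answer)


cache = ResponseCache()
cache.put("What are your hours?", "We're open 9 to 5.")
hit = cache.get("  what are YOUR hours? ")
miss = cache.get("Where is my order?")
print(hit)   # We're open 9 to 5.
print(miss)  # None
```

On a hit the agent can skip the LLM call entirely and go straight to TTS, or even to pre-rendered audio.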

None of these changes are exciting. None of them make good demo videos. But together they're the difference between a voice agent users trust and one they avoid.

The companies winning in voice AI right now aren't using secret models or magic prompts. They're measuring latency in every API call, optimizing the boring stuff, and shipping agents that feel instant.

Because in voice, instant isn't a nice-to-have. It's the entire product.

Engin Ferahli · Founder, Softnode.ai