Why 300ms Matters: Latency in Voice AI That Customers Actually Use

A patient calls your clinic at 11:32pm. The voice agent picks up. They ask if Dr. Mehmet has availability next Tuesday. Then: silence. One second. Two seconds. Three—

They've already hung up.

Latency is the silent killer of conversational AI. Not hallucinations, not accent recognition, not even multilingual support. It's the 2.7 seconds of dead air that makes a caller think the line dropped, or worse—that they're talking to a bad robot.

We obsess over this at Softnode because voice AI lives or dies in those milliseconds. Text chat gives you grace; people expect a typing indicator, a brief pause. But voice? Humans expect conversation to flow at conversation speed. Anything over 500ms starts to feel wrong. Over 1 second, and you've broken the illusion entirely.

Where Voice AI Latency Actually Comes From

Most founders think latency is just LLM speed. It's not. In a real voice interaction, you're chaining at least five operations:

Speech-to-text (STT) — Whisper or Deepgram turn audio into text. Streaming STT helps, but you still need enough audio context to avoid mis-transcriptions.
End-of-turn detection — Your agent needs to know when the user has finished speaking. Voice Activity Detection (VAD) adds 100-300ms here.
LLM inference — GPT-4, Claude, or whatever you're using generates the response. Streaming helps, but first-token latency is what you feel.
Text-to-speech (TTS) — Turning text into natural audio. OpenAI's tts-1 is fast; ElevenLabs sounds better but adds 400-800ms.
Network + audio buffering — The round-trip for audio packets, especially on mobile networks. WebRTC helps, but cell connections vary wildly.

Add those up naively and you're looking at 2-4 seconds per turn. That's unusable.
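
The end-of-turn step above is the easiest one to picture in code. Here is a minimal sketch, not our production logic: it assumes a VAD that emits one speech/no-speech flag per 20ms frame, and the function name, frame size, and thresholds are illustrative.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed frame size; VADs commonly work on 10-30ms frames


def end_of_speech_ms(
    frames: Iterable[bool],
    silence_threshold_ms: int,
    frame_ms: int = FRAME_MS,
) -> Optional[int]:
    """Return the time (ms from stream start) at which end-of-speech fires.

    `frames` holds one boolean per audio frame: True if the VAD heard speech.
    Returns None if the caller never pauses for long enough.
    """
    heard_speech = False
    silence_ms = 0
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
            continue
        if not heard_speech:
            continue  # ignore leading silence before the caller starts talking
        silence_ms += frame_ms
        if silence_ms >= silence_threshold_ms:
            return (index + 1) * frame_ms
    return None


# A caller who speaks for 1s, pauses 300ms mid-sentence, speaks 0.5s, then stops.
frames = [True] * 50 + [False] * 15 + [True] * 25 + [False] * 30
short_threshold = end_of_speech_ms(frames, silence_threshold_ms=250)
long_threshold = end_of_speech_ms(frames, silence_threshold_ms=500)
```

With this input, the 250ms threshold fires during the mid-sentence pause, while the 500ms threshold waits for the real end of the sentence. That is the whole trade-off: a shorter threshold buys speed at the cost of occasionally cutting someone off.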

How We Keep Voice Response Under 500ms

Our production agents regularly hit 300-450ms turn latency. Here's how:

1. Streaming STT with aggressive VAD tuning. We use Deepgram's Nova-2 model with a custom silence threshold (250ms instead of the default 500ms). Yes, this occasionally cuts off a slow talker mid-sentence, but in practice users adapt—and the snappiness is worth it.

2. Parallel processing where possible. The moment VAD signals end-of-speech, we finalize the STT transcript and start warming the LLM connection at the same time. Saves ~80ms.

3. OpenAI tts-1 with the nova voice. It's not the most expressive TTS on the market, but it's fast—typically 150-200ms for a 2-sentence response. We experimented with ElevenLabs and the quality bump wasn't worth the 400ms penalty for our use case (clinic booking, not audiobook narration).

4. Response streaming to the user. We don't wait for the full TTS audio file. As soon as we have the first 0.5 seconds of audio, we start playing it back. Perceived latency drops dramatically.
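
Step 4 can be sketched as a small buffering generator. This is an illustration under stated assumptions, not the actual playback code: it assumes the TTS service streams raw 16-bit mono PCM at 24kHz, and `playable_chunks` is a hypothetical helper name.

```python
from typing import Iterable, Iterator

SAMPLE_RATE_HZ = 24_000  # assumed output sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM


def playable_chunks(
    tts_chunks: Iterable[bytes], min_buffer_s: float = 0.5
) -> Iterator[bytes]:
    """Hold audio back until `min_buffer_s` seconds are buffered, then pass chunks through."""
    min_bytes = int(min_buffer_s * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    buffered = bytearray()
    started = False
    for chunk in tts_chunks:
        if started:
            yield chunk  # playback already running: forward immediately
            continue
        buffered.extend(chunk)
        if len(buffered) >= min_bytes:
            started = True
            yield bytes(buffered)  # first 0.5s is ready: start playing
    if not started and buffered:
        yield bytes(buffered)  # very short reply: play whatever arrived


# Ten 100ms chunks of silence stand in for a streamed TTS response.
fake_stream = [b"\x00" * 4800] * 10
output = list(playable_chunks(fake_stream))
```

Playback can begin once the fifth 100ms chunk arrives instead of waiting for all ten. The small initial buffer is there so a brief network stall doesn't cause an audible gap right after the agent starts talking.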

SOFTNODE NOTE

We run our voice stack on dedicated EU instances (Hetzner + Cloudflare) to minimize round-trip time for our Czech and Turkish clinic customers. AWS eu-central-1 added an average of 40ms vs. our current setup—small, but measurable when you're chasing sub-500ms.

The Latency Budget: What You Can Actually Control

Here's our real-world breakdown for a typical agent turn handling "Do you have availability on Tuesday?" in Turkish:

STT (streaming, 2s of speech): ~180ms
LLM first token (GPT-4 Turbo): ~220ms
TTS generation start (OpenAI tts-1): ~160ms
Network + buffering: ~60ms
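Total perceived latency: ~380ms (the stages above sum to ~620ms, but streaming lets them overlap, so the caller waits for less than the sum)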

That's the target. Some calls hit 280ms. Some hit 650ms when the user is on a rural 3G connection in Anatolia. But the median needs to stay under 500ms or you lose trust.

The non-negotiable rule: if your P95 latency is over 1 second, 1 in 20 turns will feel broken, and a call with several turns is even more likely to hit one. That's enough to kill word-of-mouth for a clinic.
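
To see why the tail matters more than the average, here is a small sketch using the nearest-rank percentile definition. The latency samples are synthetic and exist only to illustrate the arithmetic; they are not measurements from our system.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data: 90 fast turns and 10 slow ones.
latencies_ms = [250.0] * 90 + [2000.0] * 10

average_ms = mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
```

Here the average works out to 425ms and the median to 250ms, both of which look healthy, while the P95 is 2000ms. A dashboard that only shows the average would hide every one of those slow turns.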

Why Text-Only Widgets Don't Teach You This

If you're building a text chat widget (Intercom, Drift, Crisp, Tidio—the usual suspects), latency over 1 second is fine. Users see a typing indicator. They're multitasking anyway. Two seconds feels like a thoughtful pause.

Voice has no such affordance. Silence is death. You can't show a "thinking" indicator in audio without it sounding ridiculous ("Please hold while I process your request" is a trust-killer). The agent has to respond at human speed or the caller's brain flags it as non-human and disengages.

This is why most chatbot companies struggle when they bolt on voice as an afterthought. The stack assumptions are completely different.

The Tooling We Actually Use to Measure This

You can't optimize what you don't measure. Here's our monitoring setup:

Per-turn latency logs — Every agent interaction writes a structured log with timestamps for STT start, STT end, LLM first token, TTS start, audio playback start. We parse these into Prometheus and graph P50/P95/P99.
Sentry for anomaly detection — If any single turn exceeds 1.5 seconds, we get an alert. Usually it's an LLM timeout or a TTS service hiccup.
Real user monitoring — We instrument the WebRTC client to report perceived latency from the user's side (not just server-side). Mobile Safari on a weak connection can add 300ms we'd never see server-side.
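
The per-turn log can be pictured as one small record per turn. This is a sketch with hypothetical field names, not our actual schema. It also assumes all timestamps come from one monotonic clock, and it treats the end of STT as the moment the caller stopped speaking, which is an approximation.

```python
from dataclasses import dataclass

ALERT_THRESHOLD_MS = 1500  # the 1.5 second alert threshold mentioned above


@dataclass(frozen=True)
class TurnTimestamps:
    """Timestamps in milliseconds for one agent turn (field names are hypothetical)."""
    stt_start: float
    stt_end: float
    llm_first_token: float
    tts_start: float
    playback_start: float

    def stage_durations(self) -> dict[str, float]:
        """Gap between each pair of consecutive timestamps."""
        return {
            "stt_ms": self.stt_end - self.stt_start,
            "llm_first_token_ms": self.llm_first_token - self.stt_end,
            "tts_start_ms": self.tts_start - self.llm_first_token,
            "playback_ms": self.playback_start - self.tts_start,
        }

    def turn_latency_ms(self) -> float:
        # Assumption: STT end approximates the moment the caller stopped speaking.
        return self.playback_start - self.stt_end

    def should_alert(self) -> bool:
        return self.turn_latency_ms() > ALERT_THRESHOLD_MS


# Made-up timestamps for illustration.
normal_turn = TurnTimestamps(0, 2000, 2220, 2300, 2380)
slow_turn = TurnTimestamps(0, 2000, 3100, 3400, 3600)
```

Keeping the raw timestamps rather than precomputed durations means you can change the definition of "turn latency" later without losing data. The per-stage gaps also tell you which stage to blame when an alert fires.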

This isn't exotic tooling. It's just boring diligence.

What This Means for Solo Founders Building Voice Products

If you're shipping a voice AI agent in 2026, here's the checklist:

1. Pick your TTS based on latency, not just quality. ElevenLabs sounds incredible. But if your use case is transactional (booking, support, sales qualification), speed wins.

2. Stream everything you can. STT, LLM output, TTS playback. Every stage that can start before the previous one finishes is free latency reduction.

3. Tune your VAD aggressively. The default silence thresholds are too conservative. Real humans tolerate being cut off once in a conversation if it means the agent feels alive.

4. Measure P95, not average. A 300ms average with a 2-second P95 means 1 in 20 turns takes two seconds or more, and the callers who hit those turns think your product is broken.

5. Don't build voice if you can't commit to the latency budget. Seriously. A slow voice agent is worse than no voice agent. Just ship text chat and do it well.
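
Item 2 of the checklist is easiest to see with two tasks that don't depend on each other. The sketch below uses `asyncio.gather` to run a stand-in for STT finalization alongside a stand-in for warming the LLM connection. Both coroutines and their 80ms sleeps are hypothetical placeholders, not real provider calls.

```python
import asyncio
import time


async def finalize_transcript() -> str:
    """Stand-in for waiting on the STT provider's final transcript."""
    await asyncio.sleep(0.08)  # hypothetical 80ms finalization delay
    return "do you have availability on tuesday"


async def warm_llm_connection() -> bool:
    """Stand-in for opening the connection to the LLM provider."""
    await asyncio.sleep(0.08)  # hypothetical 80ms connection setup
    return True


async def on_end_of_speech() -> tuple[str, bool, float]:
    """Run both tasks concurrently and report how long the pair took."""
    started = time.perf_counter()
    transcript, connection_ready = await asyncio.gather(
        finalize_transcript(), warm_llm_connection()
    )
    elapsed_ms = (time.perf_counter() - started) * 1000
    return transcript, connection_ready, elapsed_ms


transcript, connection_ready, elapsed_ms = asyncio.run(on_end_of_speech())
```

Run one after the other, the two placeholders would take at least 160ms. Run together, the pair takes roughly as long as the slower of the two. The same pattern applies to any stage whose setup work doesn't need the previous stage's output.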

Voice-First Means Latency-First

We built Softnode voice-first because we knew text-only widgets were already a crowded market. But going voice-first meant inheriting a much harder engineering problem: making AI feel human requires inhuman precision in timing.

Every millisecond you shave off the response loop compounds. 300ms feels like talking to a sharp human. 800ms feels like talking to a call center IVR from 2003. It's the difference between a patient booking an appointment and hanging up in frustration.

That's why we measure it. Why we argue about VAD thresholds. Why we switched TTS providers twice in six months. Because in voice AI, latency isn't a technical detail—it's the entire user experience.

Ship a voice agent that actually feels fast

Softnode AI voice and chat agents go live in 5 minutes. Sub-500ms response times, multilingual support in 30+ languages, and a stack built for speed. Built for clinics, SaaS, and service businesses that care about the details.

Start free at softnode.ai

Engin Ferahli · Founder, Softnode.ai