Voice quality — 280ms median latency
An AI voice agent succeeds or fails on latency. If the agent replies two-thirds of a second after the customer finishes speaking, the conversation feels laggy and unnatural, and the customer hangs up. The industry benchmark for natural turn-taking is 200-400 milliseconds. We optimised for a median end-to-end latency of 280ms and have held it through six months of live traffic.
The pipeline where milliseconds go to die
A single conversational turn travels through five stages:
- SIP ingress (Telnyx) — customer audio arrives via Telnyx SIP trunk. Telnyx EU region average: 18ms.
- STT — OpenAI Realtime (or fallback ElevenLabs / Web Speech) — audio becomes text. Streaming, not batch. Partial hypothesis after 80-120ms.
- LLM — OpenAI Realtime (or fallback) — reply generated from text. Token-level streaming. First token 90-140ms.
- TTS — reply becomes audio. Pre-warmed cached voice profile. 60-100ms to first audio chunk.
- SIP egress (Telnyx) — audio travels back to the customer. 18ms.
Total median: 280ms. P95: 410ms. P99: 680ms (we actively hunt these).
Most of that time is not spent on the network; the largest single stage is the wait for the first LLM token. Squeezing that down was the biggest optimisation effort.
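As a rough check on the budget, the per-stage figures from the list above can be summed. This is a minimal sketch that uses only the numbers quoted there; the stage names and the `STAGES_MS` table are illustrative and not part of the production system.

```python
# Per-stage latency ranges in milliseconds, copied from the stage list above.
# A fixed figure (SIP ingress/egress) is written as a range with equal ends.
STAGES_MS = {
    "sip_ingress": (18, 18),
    "stt_partial": (80, 120),
    "llm_first_token": (90, 140),
    "tts_first_chunk": (60, 100),
    "sip_egress": (18, 18),
}


def budget(stages: dict) -> tuple:
    """Return the (best-case, worst-case) sum over all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def slowest_stage(stages: dict) -> str:
    """Return the stage with the largest worst-case latency."""
    return max(stages, key=lambda name: stages[name][1])


if __name__ == "__main__":
    low, high = budget(STAGES_MS)
    print(f"stage sum: {low}-{high} ms")                 # stage sum: 266-396 ms
    print(f"slowest stage: {slowest_stage(STAGES_MS)}")  # slowest stage: llm_first_token
```

The quoted ranges sum to 266-396ms, which brackets the 280ms median, and the two network hops account for only 36ms of that.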
The three big optimisation wins
1. Streaming STT, not batch. The first iteration waited for the entire sentence before transcribing, which added ~600ms to the pipeline. We moved to streaming mode, where STT emits partial hypotheses every 80ms, so the LLM receives partial text and starts planning the answer early. Gain: ~400ms.
2. Half-duplex barge-in. If the customer starts speaking while the agent is talking, the agent stops immediately and listens. In v1 the barge-in had 800ms latency (TTS buffer drain). We switched to instant abort + chunk discard. Gain: 600ms faster reaction.
3. Pre-warmed TTS cache. The 30 most common phrases ("one moment", "thank you for waiting", "I understand") are pre-synthesised and stored as audio. Zero TTS latency on those. Gain: ~80ms on 60% of calls.
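The instant abort plus chunk discard in item 2 can be sketched as a playback queue that is emptied the moment the customer starts speaking. This is a minimal illustration, not the production code: `PlaybackQueue` and its methods are hypothetical names, and a real system would also cancel the in-flight TTS and LLM requests.

```python
from collections import deque
from typing import Optional


class PlaybackQueue:
    """Holds synthesised audio chunks waiting to be sent to the caller."""

    def __init__(self) -> None:
        self._chunks = deque()
        self._aborted = False

    def enqueue(self, chunk: bytes) -> None:
        # After a barge-in, late chunks from the interrupted reply are
        # dropped instead of being queued behind the customer's speech.
        if not self._aborted:
            self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is pending."""
        return self._chunks.popleft() if self._chunks else None

    def barge_in(self) -> int:
        """Customer started speaking: discard everything not yet played.

        Returns the number of discarded chunks. Draining the buffer instead
        would keep the agent talking until the queue is empty.
        """
        discarded = len(self._chunks)
        self._chunks.clear()
        self._aborted = True
        return discarded

    def start_reply(self) -> None:
        """Re-arm the queue for the agent's next reply."""
        self._aborted = False


if __name__ == "__main__":
    queue = PlaybackQueue()
    for part in (b"chunk-1", b"chunk-2", b"chunk-3"):
        queue.enqueue(part)
    queue.next_chunk()          # the first chunk is played
    print(queue.barge_in())     # 2 chunks discarded
    queue.enqueue(b"late")      # dropped: belongs to the interrupted reply
    print(queue.next_chunk())   # None
```

The design choice is that discarding costs nothing in time, whereas draining makes the reaction time grow with the amount of buffered audio.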
The empty response.done bug — the 30-second silence
During the six-month run we had one ugly live incident worth sharing because it's likely useful to others.
Over two weeks we got scattered complaints: "sometimes the agent goes silent for 30 seconds." We couldn't reproduce it. Logs showed nothing unusual — no error code, no timeout. The OpenAI Realtime API returned response.done cleanly — except there wasn't a single text or audio chunk inside it.
The fix: detect the empty response.done and instantly restart generation with a fallback prompt ("one moment, let me check"). The restart takes 1-1.5 seconds instead of 30 seconds of silence. The bug is on OpenAI's side, but the workaround has to live on ours.
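The detection can be sketched as a per-turn counter of content chunks that is checked when response.done arrives. This is a sketch under assumptions: events are modelled as plain dicts with a `type` field, any event whose type starts with `response.` and ends with `.delta` is treated as a content chunk, and `EmptyResponseDetector` is a hypothetical name. Check the exact event names against the Realtime API version you use.

```python
FALLBACK_PROMPT = "one moment, let me check"  # phrase quoted in the text above


class EmptyResponseDetector:
    """Flags a response.done event that arrived without any content chunk."""

    def __init__(self) -> None:
        self._chunks_seen = 0

    def on_event(self, event: dict) -> bool:
        """Feed one server event; return True if a fallback should start."""
        event_type = event.get("type", "")
        # Assumption: content chunks arrive as "response.<something>.delta".
        if event_type.startswith("response.") and event_type.endswith(".delta"):
            self._chunks_seen += 1
            return False
        if event_type == "response.done":
            empty = self._chunks_seen == 0
            self._chunks_seen = 0  # reset the counter for the next turn
            return empty
        return False


if __name__ == "__main__":
    detector = EmptyResponseDetector()
    # Hypothetical event stream for one turn that produced no content.
    for event in ({"type": "response.created"}, {"type": "response.done"}):
        if detector.on_event(event):
            print("empty response, restarting with:", FALLBACK_PROMPT)
```

Counting chunks as they stream in avoids depending on the exact shape of the response.done payload.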
After the fix, the "silent gap" incidents disappeared. We also tightened monitoring: any gap longer than 5 seconds between two turns is logged and raises an alert.
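The gap check can be sketched as a comparison of consecutive turn timestamps. This is an illustration only: `find_silent_gaps` is a hypothetical helper, timestamps are assumed to be seconds since the call started, and the threshold is the 5-second figure quoted above.

```python
from typing import Iterable, List, Tuple

GAP_THRESHOLD_S = 5.0  # alert threshold quoted in the text


def find_silent_gaps(turn_times: Iterable[float],
                     threshold: float = GAP_THRESHOLD_S) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive turns are more than `threshold` apart."""
    gaps = []
    previous = None
    for current in turn_times:
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    # Hypothetical turn timestamps, in seconds since the call started.
    print(find_silent_gaps([0.0, 2.1, 4.0, 34.5, 36.0]))  # [(4.0, 34.5)]
```

In a live system the same comparison would run incrementally on each new turn, with the returned pairs feeding the log and the alert.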
Hunting the P99
The median is a nice number, but user experience depends on the worst case. If 1% of your calls stutter with 2-second delays, that's 10 unhappy customers per 1,000 calls. Three sources cause our P99 spikes:
- GC pauses in the STT worker (Java GC); fixed with G1GC tuning.
- Cold-start TTS voice profiles (rarely used voices that are not in the cache); fixed with a nightly pre-warm.
- Throttling when burst traffic approaches the OpenAI rate limit; fixed by having the load balancer rotate keys.
After the three fixes, P99 dropped from 1,100ms to 680ms.
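To see why the median hides the tail, the percentiles can be computed from raw per-turn latencies. This is a minimal sketch using the nearest-rank method; the sample values are made up for illustration and are not our production data.

```python
import math
from typing import List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in ms: mostly fast turns plus one slow outlier.
    latencies = [250, 260, 270, 280, 280, 290, 300, 310, 400, 1100]
    print(percentile(latencies, 50))  # 280
    print(percentile(latencies, 99))  # 1100
```

In this sample a single slow turn leaves the median untouched but sets the P99, which is why the tail has to be tracked separately.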
Provider failover
Nortinia always keeps three voice providers ready: primary OpenAI Realtime, secondary ElevenLabs, tertiary Web Speech (browser-side). If OpenAI doesn't respond for 3 seconds, the system switches to ElevenLabs automatically, mid-call and without the customer noticing. The switch is circuit-breaker-driven: 5 failures in 60 seconds take the provider out of rotation for 5 minutes.
In six months we switched to a fallback provider 14 times. Zero full outages.
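The circuit-breaker rule can be sketched as a sliding window of failure timestamps. This is a minimal sketch, not our production implementation: `CircuitBreaker`, `record_failure`, `allow` and `pick_provider` are hypothetical names, the thresholds are the ones quoted above, and the clock is passed in explicitly so the logic is easy to test.

```python
from collections import deque
from typing import Dict, List, Optional


class CircuitBreaker:
    """Opens after `max_failures` failures within `window_s`, for `cooldown_s`."""

    def __init__(self, max_failures: int = 5, window_s: float = 60.0,
                 cooldown_s: float = 300.0) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._failures = deque()
        self._open_until = 0.0

    def record_failure(self, now: float) -> None:
        """Register a failed or timed-out request at time `now` (seconds)."""
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._open_until = now + self.cooldown_s
            self._failures.clear()

    def allow(self, now: float) -> bool:
        """True if the provider may be used; False while the breaker is open."""
        return now >= self._open_until


def pick_provider(breakers: Dict[str, CircuitBreaker], order: List[str],
                  now: float) -> Optional[str]:
    """Return the first provider in priority order whose breaker is closed."""
    for name in order:
        if breakers[name].allow(now):
            return name
    return None


if __name__ == "__main__":
    order = ["openai", "elevenlabs", "web_speech"]
    breakers = {name: CircuitBreaker() for name in order}
    for second in range(5):  # five failures within five seconds
        breakers["openai"].record_failure(now=float(second))
    print(pick_provider(breakers, order, now=5.0))    # elevenlabs
    print(pick_provider(breakers, order, now=400.0))  # openai
```

Once the cooldown expires the primary is eligible again, so traffic returns to it without manual intervention.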
What voice quality doesn't fix
Latency and audio quality alone don't make the agent better if the content is wrong. A fast reply that misreads the customer is worse than a slow reply that's accurate. Latency optimisation has reached the natural-conversation threshold; from here, prompt engineering and intent-classification tuning drive the CSAT gains.