The Challenge
A client needed to replace a rigid, menu-driven phone system with a voice AI that could understand natural language, query live data, and respond in real time — without the latency that makes voice AI feel broken.
Script-based IVR failure — Traditional IVR trees force users into predefined paths. When a customer’s question didn’t match a menu option, the system failed them entirely. The client was losing calls and customer trust at the IVR stage.
Latency as a hard constraint — On telephone channels, response delay is felt immediately. Anything above ~600ms feels like a broken line. The system needed to complete STT, query a database, generate a response, and begin streaming TTS — all within a sub-500ms window.
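To make the constraint concrete, a per-turn budget has to cover every stage before the caller hears audio. The numbers below are illustrative assumptions for this sketch, not measured figures from the project:

```python
# Illustrative latency budget for one conversational turn.
# All stage timings are assumptions, not measurements from the project.
BUDGET_MS = 500

budget = {
    "stt_finalize": 120,     # streaming STT settles on a final transcript
    "db_query": 80,          # live account lookup
    "llm_first_token": 200,  # time to first generated token
    "tts_first_audio": 80,   # time to first synthesized audio chunk
}

total = sum(budget.values())
headroom = BUDGET_MS - total
print(f"total={total}ms headroom={headroom}ms")  # total=480ms headroom=20ms
```

The point of the exercise: with only tens of milliseconds of slack, no stage can wait for the previous one to fully finish, which is what forces the streaming design described below.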
Live data requirement — The voice agent couldn’t serve canned answers. Every response needed to reflect the caller’s actual account state, pulled from live databases mid-conversation. Static knowledge bases weren’t an option.
Dual-channel delivery — The client needed the same voice AI accessible both via standard telephone (PSTN) and via a browser-based click-to-call button — without running two separate systems.
The Solution
Superteams built a unified voice AI pipeline optimized for latency, with a single core that serves both telephone and WebRTC channels.
Streaming STT/TTS pipeline — Rather than waiting for a full utterance before processing, the system uses streaming STT to begin understanding the user’s query as they speak. TTS output begins streaming as soon as the first tokens are generated — eliminating the pause-then-speak pattern that makes voice AI feel robotic.
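The overlap can be sketched with async generators. The STT, LLM, and TTS stubs below are simulations standing in for real services (the function names are hypothetical); the structure shows the key idea: TTS frames are emitted as soon as the first response tokens exist, not after the full reply is generated.

```python
import asyncio

async def stt_stream(audio_chunks):
    # Simulated streaming STT: yields a growing partial transcript
    # as audio chunks arrive, instead of waiting for the full utterance.
    text = ""
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # stand-in for network/processing latency
        text += chunk
        yield text              # partial hypothesis

async def llm_stream(prompt):
    # Simulated token-by-token LLM response.
    for token in ["Your", " order", " shipped", " today."]:
        await asyncio.sleep(0)
        yield token

async def tts_stream(tokens):
    # Simulated TTS: emits one audio frame per token, as tokens arrive.
    async for token in tokens:
        yield f"<audio:{token.strip()}>"

async def handle_turn(audio_chunks):
    final = ""
    async for partial in stt_stream(audio_chunks):
        final = partial         # keep the latest hypothesis
    frames = []
    async for frame in tts_stream(llm_stream(final)):
        frames.append(frame)    # in production, each frame is sent immediately
    return final, frames

transcript, frames = asyncio.run(handle_turn(["where ", "is ", "my order"]))
print(transcript)   # where is my order
print(frames[0])    # <audio:Your>
```

Because `tts_stream` consumes `llm_stream` incrementally, the first audio frame exists after one token rather than after the whole sentence, which is what removes the pause-then-speak pattern.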
Live database integration — The agent connects directly to user databases. When a caller asks about their account, order, or status, the system queries live data and incorporates the result into the response — in real time, within the same latency budget.
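A minimal sketch of that mid-conversation lookup, using an in-memory SQLite table as a stand-in for the client's live database (the table, field, and function names are hypothetical):

```python
import sqlite3

# In-memory stand-in for the client's live order database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (caller_id TEXT, status TEXT)")
db.execute("INSERT INTO orders VALUES ('+15550100', 'shipped')")

def lookup_order_status(caller_id: str) -> str:
    # Parameterized query against live data, run inside the turn's
    # latency budget rather than ahead of time.
    row = db.execute(
        "SELECT status FROM orders WHERE caller_id = ?", (caller_id,)
    ).fetchone()
    return row[0] if row else "unknown"

def answer(caller_id: str) -> str:
    # The live result is injected into the spoken response.
    status = lookup_order_status(caller_id)
    return f"Your order is currently {status}."

print(answer("+15550100"))  # Your order is currently shipped.
```

The design choice this illustrates: the response is grounded in a query result fetched during the call, so the same question yields different answers for different callers and different moments.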
Telephony and WebRTC on a single core — A shared voice processing layer handles both channels. PSTN calls route through telephony integration; browser sessions connect via WebRTC click-to-call. Both surfaces use identical STT/TTS and LLM pipelines, so behavior is consistent across channels.
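The single-core idea reduces to one shared pipeline with two thin channel adapters. The class and method names below are illustrative, not from the project:

```python
class VoiceCore:
    # Shared pipeline: in the real system, STT -> LLM -> TTS lives here.
    def respond(self, transcript: str) -> str:
        return f"reply to: {transcript}"  # stubbed for the sketch

class PstnAdapter:
    # Handles telephony framing only; delegates all voice logic to the core.
    def __init__(self, core: VoiceCore):
        self.core = core
    def on_call_audio(self, transcript: str) -> str:
        return self.core.respond(transcript)

class WebRtcAdapter:
    # Handles browser/WebRTC framing only; same delegation.
    def __init__(self, core: VoiceCore):
        self.core = core
    def on_track_audio(self, transcript: str) -> str:
        return self.core.respond(transcript)

core = VoiceCore()
pstn_reply = PstnAdapter(core).on_call_audio("where is my order")
web_reply = WebRtcAdapter(core).on_track_audio("where is my order")
print(pstn_reply == web_reply)  # True: identical behavior on both channels
```

Because only the adapters differ, a change to the STT/TTS or LLM logic lands on both channels at once, which is what keeps behavior consistent without running two systems.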
Voice Activity Detection — The pipeline uses voice activity detection (VAD) to distinguish speech from silence and background noise, reducing false triggers and ensuring the system responds only once the user has finished speaking — critical for natural conversation flow over the telephone.
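The end-of-turn logic can be sketched as an energy threshold plus a "hangover" timer: the turn is judged finished only after several consecutive silent frames, so a brief mid-sentence pause does not trigger a premature response. The thresholds are illustrative, not tuned values from the project:

```python
SPEECH_THRESHOLD = 0.1   # frame energy above this counts as speech
HANGOVER_FRAMES = 3      # consecutive silent frames required for end-of-turn

def end_of_turn(frame_energies):
    """Return the index of the frame where the turn ends, or None."""
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= SPEECH_THRESHOLD:
            heard_speech = True
            silent_run = 0       # speech resets the hangover timer
        else:
            silent_run += 1
            if heard_speech and silent_run >= HANGOVER_FRAMES:
                return i         # sustained silence after speech: turn over
    return None                  # no end-of-turn detected yet

# Speech, a short 2-frame pause, more speech, then sustained silence.
frames = [0.5, 0.6, 0.02, 0.03, 0.7, 0.01, 0.0, 0.0]
print(end_of_turn(frames))  # 7
```

Note that the two-frame pause at indices 2-3 does not end the turn, while the three silent frames at the end do; that asymmetry is what makes barge-in-free conversation feel natural on a phone line.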
Results
The agent launched across both channels — telephone and click-to-call — achieving sub-500ms response latency under normal conditions. Users can ask natural language questions and receive personalized, database-accurate answers without navigating menus.
“Real-time voice AI connected to user databases — accessible via telephone and click-to-call with modern TTS/STT.”
Call completion rates improved as users stopped abandoning the IVR. The client now has a voice interface that handles a meaningful portion of inbound queries without agent involvement.