Technical Workshop · 2-Day Intensive

Voice AI Programming.
From audio bytes to production calls.

Build real-time voice AI systems — ASR, TTS, voice cloning, SIP integration, sub-800ms latency, agentic voice pipelines, and multilingual support for Indian and global languages. Taught by the team behind NextNeural.

2-day intensive (16 hrs)
Up to 30 participants
In-person
India-first, 20+ languages
10 modules · 2 days

What you'll learn.

Day 1 covers the audio and language stack: ASR, TTS, voice cloning, multilingual systems, and turn detection. Day 2 builds the full production pipeline: noise cancellation, diarization, telephony, latency, agentic flows, and campaign architecture.

Day 1 · Audio stack: ASR, TTS, voice cloning, multilingual systems & turn detection
01
Whisper · Faster-Whisper · MMS · Parakeet

ASR: Choosing & Optimising Your Speech Recognition Stack

Whisper vs Faster-Whisper vs MMS vs Parakeet — when to use each, how to fine-tune for domain-specific vocabulary, real-time streaming vs batch transcription, and accuracy vs latency tradeoffs in Indian acoustic conditions.
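Comparing ASR models on accuracy usually comes down to word error rate (WER) on your own domain recordings. The sketch below is a minimal, dependency-free WER calculation; the function name and the lower-casing and whitespace tokenisation are illustrative assumptions, not part of any of the models named above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Running the same held-out call recordings through each candidate model and comparing WER alongside measured latency gives a like-for-like basis for the tradeoff.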

02
Coqui XTTS · F5-TTS · Kokoro · Voice cloning

TTS, Voice Cloning & Synthesis Pipeline

Build a brand voice from a 30-second reference recording. XTTS, F5-TTS, and Kokoro architectures, speaker embeddings, SSML markup for natural prosody, and streaming TTS synthesis for real-time applications.
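As a rough illustration of SSML markup for prosody, the sketch below builds a small SSML document using only the Python standard library. The `speak`, `prosody`, and `break` elements come from the W3C SSML specification; the function name, default rate, and pause length are hypothetical, and which tags a given TTS engine actually honours is something to check per engine.

```python
import xml.etree.ElementTree as ET

def build_ssml(sentences, rate="95%", pause_ms=300):
    """Wrap each sentence in <prosody> and separate sentences with <break>."""
    speak = ET.Element("speak")
    for i, sentence in enumerate(sentences):
        prosody = ET.SubElement(speak, "prosody", rate=rate)
        prosody.text = sentence
        # Add a pause between sentences, but not after the last one.
        if i < len(sentences) - 1:
            ET.SubElement(speak, "break", time=f"{pause_ms}ms")
    return ET.tostring(speak, encoding="unicode")

print(build_ssml(["Your payment is due on Friday.", "Would you like to pay now?"]))
```

Building the markup with an XML library rather than string concatenation means characters such as `&` in the text are escaped correctly.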

03
Hindi · Tamil · Telugu · Hinglish · IndicASR

Multilingual Voice AI & Indian Languages

Fine-tuning ASR for Indian languages and accents, code-switching (Hinglish, Tanglish), language detection mid-conversation, and handling the acoustic variability of real Indian call centre environments.
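One cheap signal for code-switching in an ASR transcript is a change of writing script within a single utterance. The sketch below is a heuristic under stated assumptions, not a language-identification model: it tags each token by the Unicode script of its first letter. The function names are hypothetical, and romanised Hinglish (typed entirely in Latin script) will not be caught this way and needs a model-based detector.

```python
import unicodedata

def token_script(token: str) -> str:
    """Return the script of the first letter, e.g. 'DEVANAGARI', 'TAMIL', 'LATIN'."""
    for ch in token:
        if ch.isalpha():
            # Unicode character names begin with the script name,
            # e.g. 'DEVANAGARI LETTER MA' or 'LATIN SMALL LETTER A'.
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"  # digits, punctuation, empty tokens

def is_code_switched(transcript: str) -> bool:
    """True when the transcript mixes more than one writing script."""
    scripts = {token_script(token) for token in transcript.split()}
    scripts.discard("OTHER")
    return len(scripts) > 1
```

A flag like this can be used to route mixed-script utterances to a multilingual model or to collect fine-tuning data.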

04
Silero VAD · WebRTC VAD · Pipecat

Turn Detection, VAD & Interrupt Handling

Voice activity detection models, end-of-speech detection without awkward pauses, barge-in handling (user interrupting a speaking agent), and building conversation rhythm that feels natural rather than mechanical.
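End-of-speech detection is often built as a per-frame speech/non-speech decision plus a "hangover": the turn ends only after a run of consecutive silent frames. The sketch below uses a plain RMS energy threshold as a stand-in for a VAD model's speech probability (it does not call Silero VAD or WebRTC VAD), and the threshold and hangover values are illustrative assumptions.

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def end_of_speech_frame(frames, threshold=500.0, hangover_frames=25):
    """Return the index of the frame where end of speech is declared, or None.

    With 20 ms frames, hangover_frames=25 means 500 ms of continuous
    silence after speech before the turn is considered finished.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None
```

A shorter hangover makes the agent respond faster but cuts off users who pause mid-sentence; a longer one does the opposite, which is the tuning tradeoff behind "awkward pauses".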

Day 2 · Production: noise cancellation, diarization, telephony, latency, agentic pipelines & campaigns
05
RNNoise · DeepFilterNet · Krisp · Audio pre-processing

Noise Cancellation & Audio Pre-processing

Real-world audio is noisy — background noise, codec artefacts, packet loss. Learn to apply deep-learning noise suppression (RNNoise, DeepFilterNet, Krisp) as a pre-processing stage before ASR to recover accuracy in call-centre and field environments, and understand where in the pipeline it should sit for minimum latency impact.

06
pyannote.audio · NeMo · Speaker embeddings · Post-call analytics

Speaker Diarization

Automatically segment recordings by "who spoke when" using pyannote.audio and NVIDIA NeMo. Covers speaker embedding models, overlap detection, and linking diarization output to ASR transcripts for multi-speaker call analytics, agent/customer separation, and routing decisions in multi-participant calls.
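Linking diarization output to an ASR transcript can be sketched as an overlap match: each timestamped word is given the speaker whose turn overlaps it most. The tuples below are simplified stand-ins for diarization turns and ASR word timestamps; this is not the pyannote.audio or NeMo API, and the function name is hypothetical.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of (text, start_sec, end_sec) from ASR word timestamps
    turns: list of (speaker, start_sec, end_sec) from diarization
    Returns a list of (speaker_or_None, text).
    """
    labelled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            # Length of the intersection of the two time intervals
            # (negative when they do not overlap at all).
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, text))
    return labelled

turns = [("agent", 0.0, 2.0), ("customer", 2.0, 4.0)]
words = [("hello", 0.2, 0.6), ("yes", 1.9, 2.4), ("sure", 3.0, 3.3)]
print(assign_speakers(words, turns))
```

Words that fall in no turn are labelled `None`, which is a useful marker for reviewing diarization gaps.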

07
SIP/PSTN · Twilio · Exotel · FreeSWITCH · LiveKit

SIP, Telephony & WebRTC Integration

SIP trunk setup, PSTN integration, connecting to Twilio / Exotel / Plivo / FreeSWITCH, codec selection (G.711, G.729, Opus) for different network conditions, and the architecture of a production telephony stack.
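Part of codec selection is simple bandwidth arithmetic. The sketch below estimates per-direction IP bandwidth for an RTP stream from the codec bitrate and packetisation interval, assuming IPv4 + UDP + RTP headers (40 bytes per packet) and ignoring link-layer overhead. G.711 at 64 kbps and G.729 at 8 kbps are the codecs' standard rates; the 24 kbps Opus figure is only an example setting, since Opus bitrate is configurable.

```python
HEADER_BYTES = 40  # IPv4 (20) + UDP (8) + RTP (12); link-layer overhead not included

def rtp_bandwidth_kbps(codec_bitrate_kbps: float, packet_ms: int = 20) -> float:
    """Approximate one-way IP bandwidth of an RTP audio stream in kbps."""
    payload_bytes = codec_bitrate_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_second / 1000

for name, bitrate in [("G.711", 64), ("G.729", 8), ("Opus (example 24 kbps)", 24)]:
    print(f"{name}: {rtp_bandwidth_kbps(bitrate):.1f} kbps per direction")
```

With 20 ms packets the header overhead is a fixed 16 kbps, so it dominates low-bitrate codecs: G.729 triples from 8 to 24 kbps on the wire, while G.711 grows from 64 to 80 kbps.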

08
Streaming TTS · Edge ASR · Latency profiling

Sub-800ms Latency: Architecture & Profiling

The full latency budget across ASR → LLM → TTS, edge-deployed ASR for sub-200ms transcription, streaming LLM responses, TTS chunking and progressive audio playback, and where to deploy each component to hit under 800ms end-to-end.
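TTS chunking can be sketched as a generator that sits between a streaming LLM and the TTS engine: it buffers tokens and releases text at sentence boundaries, so synthesis of the first sentence can start while the LLM is still generating the rest. The function name and the `min_chars` default are hypothetical, and the boundary rule is deliberately simple (it would also split on a token ending in a decimal point).

```python
def sentence_chunks(token_stream, min_chars=20):
    """Yield text chunks at sentence boundaries as tokens arrive.

    min_chars avoids sending very short fragments (such as "Hi.") to TTS
    on their own, which tends to sound choppy.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        stripped = buffer.rstrip()
        # "।" is the danda, the sentence terminator in Hindi and related scripts.
        if len(stripped) >= min_chars and stripped.endswith((".", "?", "!", "।")):
            yield stripped.strip()
            buffer = ""
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()

tokens = ["Your ", "EMI ", "of ", "2,000 ", "rupees ", "is ", "due.",
          " Would ", "you ", "like ", "to ", "pay ", "now?"]
for chunk in sentence_chunks(tokens):
    print(chunk)
```

The time to first audio then depends on the first sentence only, not on the full LLM response.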

09
Pipecat · LiveKit agents · LangChain voice

Agentic Voice Pipelines: Tools, Memory & Handoff

Connecting voice agents to databases, document stores, and APIs mid-conversation. Conversation memory across sessions, dynamic tool invocation without interrupting speech flow, and human handoff patterns that preserve full context.
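One way to invoke a tool without a silent gap is to start the call as a background task and play a short filler phrase while it runs. The sketch below shows that pattern with `asyncio`; the tool, the `TOOLS` registry, and the `speak` function are hypothetical stand-ins (a real agent would call a database or API and stream audio through TTS).

```python
import asyncio

async def lookup_balance(account_id: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a database or API call
    return {"account_id": account_id, "balance_due": 2000}

# Hypothetical registry mapping tool names to async callables.
TOOLS = {"lookup_balance": lookup_balance}

async def speak(text: str, spoken: list) -> None:
    spoken.append(text)
    await asyncio.sleep(0.02)  # stands in for audio playback time

async def answer_with_tool(tool_name: str, spoken: list, **kwargs) -> dict:
    # Start the tool call first so it runs while the filler phrase plays.
    task = asyncio.create_task(TOOLS[tool_name](**kwargs))
    await speak("One moment while I check that for you.", spoken)
    result = await task
    await speak(f"Your balance due is {result['balance_due']} rupees.", spoken)
    return result

spoken_log = []
asyncio.run(answer_with_tool("lookup_balance", spoken_log, account_id="A1"))
print(spoken_log)
```

Because the lookup and the filler overlap, the caller hears continuous speech and the tool latency is partly hidden behind playback.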

10
Campaign management · CRM integration · Recording pipelines

Inbound & Outbound Campaign Architecture

Building scalable outbound calling infrastructure, campaign management, recording pipelines, real-time transcript export to CRM, structured data extraction from calls, and monitoring for production voice systems.
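A core piece of outbound infrastructure is capping how many calls are in flight at once, since SIP trunks and providers limit concurrent channels. The sketch below uses an `asyncio.Semaphore` for that cap; `place_call` is a hypothetical coroutine you would supply for your telephony provider, and `fake_call` only simulates one.

```python
import asyncio

async def run_campaign(numbers, place_call, max_concurrent=5):
    """Dial every number, with at most max_concurrent calls active at once.

    Returns a list of (number, outcome) in the same order as `numbers`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def dial(number):
        async with semaphore:  # waits here while the trunk is at capacity
            return number, await place_call(number)

    return await asyncio.gather(*(dial(n) for n in numbers))

async def fake_call(number):
    await asyncio.sleep(0.01)  # stands in for the duration of a real call
    return "answered"

results = asyncio.run(run_campaign([f"+9100000000{i:02d}" for i in range(12)], fake_call))
print(len(results), "calls completed")
```

Retry rules, calling-hour windows, and do-not-call checks would sit around this loop in a production campaign manager.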

What you walk away with

Three deliverables,
not just takeaways.

Every participant leaves with working code and documented architecture decisions — not just slides.

Working voice agent

Every team ships a working real-time voice agent during the workshop — with ASR, TTS, and agentic tool use integrated. Includes a sample telephony integration ready to connect to your provider.

Latency profiling report

A documented analysis of your stack's latency budget — with specific recommendations for hitting under 800ms end-to-end on your chosen infrastructure and telephony provider.

Stack architecture decision guide

A framework for choosing ASR, TTS, and telephony providers for your specific use case — with cost, latency, and accuracy tradeoffs documented so your team can make the right call without starting from scratch.
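As a rough idea of what feeds the latency profiling report described above, the sketch below times named pipeline stages and compares the total against an 800 ms target. The class and method names are hypothetical and the `time.sleep` calls are placeholders for real ASR, LLM, and TTS calls. In a streaming pipeline the stages overlap, so the simple sum here is an upper bound rather than the latency a caller hears.

```python
import time
from contextlib import contextmanager

class LatencyBudget:
    """Record wall-clock time per stage and compare the sum to a target."""

    def __init__(self, target_ms=800.0):
        self.target_ms = target_ms
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000

    def total_ms(self):
        return sum(self.stages.values())

    def report(self):
        lines = [f"{name:<12}{ms:8.1f} ms" for name, ms in self.stages.items()]
        verdict = "within" if self.total_ms() <= self.target_ms else "over"
        lines.append(f"{'total':<12}{self.total_ms():8.1f} ms "
                     f"({verdict} {self.target_ms:.0f} ms target)")
        return "\n".join(lines)

budget = LatencyBudget()
with budget.stage("asr"):
    time.sleep(0.01)   # placeholder for transcription
with budget.stage("llm"):
    time.sleep(0.02)   # placeholder for time to first LLM token
with budget.stage("tts"):
    time.sleep(0.01)   # placeholder for time to first audio chunk
print(budget.report())
```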

Pricing & format

In Person, Hands-On.

In-Person
Pricing: On enquiry (depends on location, travel & customisation)
Duration: 2 full days (9am–5pm), or split across 4 half-days
Group size: Up to 30, small enough for live Q&A and lab help
Format: 70% hands-on labs · 30% concept and architecture review
Pre-work: Brief call + GPU setup guide sent 1 week before. Labs adapted to your stack.
Post-workshop: 30 days of async Q&A with facilitators, included in every booking
Book this workshop
Who attends

Built for engineers who need to deploy voice AI, not just demo it.

This is a technical workshop. Participants should be comfortable with Python and have basic familiarity with audio APIs or LLM integrations. No prior voice AI experience needed.

Backend & platform engineers
Building the infrastructure layer for voice AI systems
AI/ML engineers
Integrating speech models into production pipelines
Product engineers
Building voice-first features into existing products
Telephony engineers
Modernising legacy IVR systems with AI-native voice agents
Engineering managers
Understanding the real architectural constraints of voice AI at scale
Soum Paul
Co-Founder & CTO, Superteams.ai
Built NextNeural — Superteams' production voice AI platform
30,000+ daily calls handled on systems he designed and deployed
10+ enterprise voice AI deployments across BFSI, healthcare, and collections
Your facilitator
Voice AI is one of the highest-leverage automation surfaces in India — collections, customer support, outbound sales, medical transcription. But most teams get stuck at the API integration layer. This workshop teaches you to own the stack: the models, the latency, the telephony, and the agents. That's the difference between a POC and a production system.
India-first

The only voice AI workshop
built for Bharat.

Most voice AI workshops treat Indian languages as an afterthought — a slide at the end that says "multilingual support coming soon." We don't. Module 3 covers the full picture: fine-tuning ASR for regional accents, code-switching, AI4Bharat models, and the acoustic challenges specific to Indian call centres. Because that's the real environment your system will run in.

Indian languages covered

Hindi · Tamil · Telugu · Kannada · Malayalam · Bengali · Marathi · Gujarati · Punjabi · Odia · Hinglish · Tanglish

Models covered

AI4Bharat IndicASR · MMS (Meta) · Whisper fine-tuned · Coqui XTTS multilingual · F5-TTS
Real-world context

Use cases we build
during the workshop.

Labs are structured around real enterprise voice AI deployments — not toy demos. We adapt these to your industry in the pre-workshop brief.

Common questions

Before you reach out.

Do we need telephony infrastructure before the workshop?

No — workshop labs use LiveKit or a simulated environment so participants don't need SIP trunks or phone numbers pre-configured. In the pre-workshop brief, we can plan labs around your actual telephony provider if you want deeper integration practice.

Which ASR and TTS models are used in the labs?

We use open-source models (Faster-Whisper for ASR, XTTS for voice cloning) as the baseline — so participants can run labs without third-party API costs. We also cover commercial options (Deepgram, ElevenLabs, OpenAI TTS) for teams evaluating managed services.

How relevant is this for Indian language voice AI specifically?

Very. Module 3 is dedicated to Indian languages — ASR fine-tuning for Hindi, Tamil, Telugu, and other Indian languages, code-switching, and the specific acoustic challenges of Indian call centres. We also cover IndicASR and AI4Bharat models, which most workshops ignore.

What compute do participants need?

A machine with a modern GPU, or access to a cloud GPU (we send a setup guide). Minimum: NVIDIA T4 or equivalent. Most TTS and ASR models run in 8–16 GB of VRAM. If your team is cloud-only, we'll adapt the labs to Colab or your preferred cloud provider.

How does this fit with NextNeural — your Voice AI platform?

This workshop teaches the underlying engineering — ASR, TTS, telephony, and agent integration. NextNeural is our pre-built platform that assembles these layers. Workshop participants can choose to deploy the stack they build, or adopt NextNeural as their production platform. Either path works.

What if our team has questions during production rollout after the workshop?

30 days of async Q&A with the facilitators is included. Architecture reviews, debugging help, stack-specific questions — covered. For deeper, ongoing implementation support, we also offer structured embed engagements.

Book this workshop

Ready to build voice AI
that actually ships?

Tell us about your team, your telephony setup, and the languages you need to support. We'll tailor the workshop and confirm availability.

We respond within 1 business day. No sales scripts.