10 modules · 2 days

What you'll learn.

Day 1 covers the audio and language stack: ASR, TTS, voice cloning, multilingual systems, and turn detection. Day 2 builds the full production pipeline: noise cancellation, diarization, telephony, latency, agentic flows, and campaign architecture.

Day 1 · Audio stack: ASR, TTS, voice cloning & multilingual systems

Whisper · Faster-Whisper · MMS · Parakeet

ASR: Choosing & Optimising Your Speech Recognition Stack

Whisper vs Faster-Whisper vs MMS vs Parakeet — when to use each, how to fine-tune for domain-specific vocabulary, real-time streaming vs batch transcription, and accuracy vs latency tradeoffs in Indian acoustic conditions.
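
The accuracy side of that tradeoff is usually reported as word error rate (WER). A minimal sketch of the metric, assuming plain whitespace tokenisation and no text normalisation; the function name `word_error_rate` is illustrative and does not come from any of the libraries named above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Running the same reference set through each candidate model, alongside a wall-clock timer, gives one accuracy number and one latency number per model to compare.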

Coqui XTTS · F5-TTS · Kokoro · Voice cloning

TTS, Voice Cloning & Synthesis Pipeline

Build a brand voice from a 30-second reference recording. XTTS, F5-TTS, and Kokoro architectures, speaker embeddings, SSML markup for natural prosody, and streaming TTS synthesis for real-time applications.
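
As a rough illustration of SSML markup for prosody, the sketch below builds a small SSML document with Python's standard XML library. The `speak`, `prosody`, `s`, and `break` elements come from the W3C SSML standard; which of them a given TTS engine honours varies, so treat the element set, the helper name `build_ssml`, and the default values as assumptions.

```python
import xml.etree.ElementTree as ET

def build_ssml(sentences, pause_ms=300, rate="95%"):
    """Wrap sentences in <s> tags with a pause between consecutive sentences."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    for i, text in enumerate(sentences):
        sentence = ET.SubElement(prosody, "s")
        sentence.text = text  # ElementTree escapes &, < and > on output
        if i < len(sentences) - 1:
            ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")
```

Building the markup through an XML library rather than string concatenation keeps customer-supplied text (names, amounts) from breaking the document.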

Hindi · Tamil · Telugu · Hinglish · IndicASR

Multilingual Voice AI & Indian Languages

Fine-tuning ASR for Indian languages and accents, code-switching (Hinglish, Tanglish), language detection mid-conversation, and handling the acoustic variability of real Indian call centre environments.
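
One simple signal for code-switching in a transcript is a change of Unicode script between tokens. The sketch below is a heuristic under stated assumptions: it only sees script, so romanised Hinglish (Hindi typed in Latin letters) passes through undetected and needs a model-based language identifier. The function names are illustrative.

```python
# Unicode block ranges for three Indic scripts.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
}

def script_of(token):
    """Return the dominant script of a token, or 'other' if none is found."""
    counts = {}
    for ch in token:
        if ch.isascii() and ch.isalpha():
            counts["latin"] = counts.get("latin", 0) + 1
            continue
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= ord(ch) <= hi:
                counts[name] = counts.get(name, 0) + 1
                break
    if not counts:
        return "other"
    return max(counts, key=counts.get)

def switch_points(text):
    """Indices of tokens whose script differs from the previous scripted token."""
    switches = []
    prev = None
    for i, token in enumerate(text.split()):
        script = script_of(token)
        if script == "other":  # digits and punctuation do not count as a switch
            continue
        if prev is not None and script != prev:
            switches.append(i)
        prev = script
    return switches
```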

Silero VAD · WebRTC VAD · Pipecat

Turn Detection, VAD & Interrupt Handling

Voice activity detection models, end-of-speech detection without awkward pauses, barge-in handling (user interrupting a speaking agent), and building conversation rhythm that feels natural rather than mechanical.
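
To make end-of-speech detection concrete, here is a minimal energy-threshold endpointer with a hangover window. It is a stand-in for a model-based VAD such as Silero, not a replacement for one, and the threshold and hangover values are illustrative assumptions rather than tuned numbers.

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def detect_end_of_speech(frames, threshold=0.02, hangover_frames=3):
    """Return the index of the frame where end of speech is declared, or None.

    End of speech is declared only after `hangover_frames` consecutive
    quiet frames following some speech, so short pauses do not cut the
    speaker off.
    """
    in_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            in_speech = True
            silent_run = 0
        elif in_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None
```

The hangover length is the knob that trades awkward pauses against premature cut-offs: a shorter window responds faster but clips slow speakers.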

Day 2 · Production: noise cancellation, diarization, telephony, latency, agentic pipelines & campaigns

RNNoise · DeepFilterNet · Krisp · Audio pre-processing

Noise Cancellation & Audio Pre-processing

Real-world audio is noisy — background noise, codec artefacts, packet loss. Learn to apply deep-learning noise suppression (RNNoise, DeepFilterNet, Krisp) as a pre-processing stage before ASR to recover accuracy in call-centre and field environments, and understand where in the pipeline it should sit for minimum latency impact.

pyannote.audio · NeMo · Speaker embeddings · Post-call analytics

Speaker Diarization

Automatically segment recordings by "who spoke when" using pyannote.audio and NVIDIA NeMo. Covers speaker embedding models, overlap detection, and linking diarization output to ASR transcripts for multi-speaker call analytics, agent/customer separation, and routing decisions in multi-participant calls.
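
Linking diarization output to an ASR transcript usually comes down to matching timestamps. A minimal sketch, assuming both outputs have already been converted to plain tuples; pyannote.audio and NeMo each return their own objects, so the tuple shapes and the function name here are hypothetical.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker turn it overlaps most.

    words: list of (text, start_sec, end_sec) from ASR word timestamps.
    turns: list of (speaker, start_sec, end_sec) from diarization.
    Words that overlap no turn are labelled None.
    """
    labelled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, text))
    return labelled
```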

SIP/PSTN · Twilio · Exotel · FreeSWITCH · LiveKit

SIP, Telephony & WebRTC Integration

SIP trunk setup, PSTN integration, connecting to Twilio / Exotel / Plivo / FreeSWITCH, codec selection (G.711, G.729, Opus) for different network conditions, and the architecture of a production telephony stack.
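
Codec choice is partly a bandwidth calculation. The sketch below estimates per-direction bandwidth for a constant-bitrate codec over RTP, assuming 20 ms packets and IPv4 + UDP + RTP headers with no extensions, SRTP tag, or link-layer framing. G.711 (64 kbps) and G.729 (8 kbps) are fixed-rate; Opus is variable, so its bitrate would be passed in as whatever target you configure.

```python
CODEC_BITRATE_BPS = {"G.711": 64_000, "G.729": 8_000}
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP

def rtp_bandwidth(bitrate_bps, ptime_ms=20):
    """Return (payload bytes per packet, bits per second on the wire)."""
    payload_bytes = bitrate_bps * ptime_ms // 8000
    packets_per_second = 1000 // ptime_ms
    wire_bps = (payload_bytes + HEADER_BYTES) * 8 * packets_per_second
    return payload_bytes, wire_bps
```

At 20 ms packetisation, G.711 carries 160 payload bytes per packet and needs about 80 kbps on the wire, while G.729 carries 20 bytes and needs about 24 kbps: the header overhead dominates at low bitrates.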

Streaming TTS · Edge ASR · Latency profiling

Sub-800ms Latency: Architecture & Profiling

The full latency budget across ASR → LLM → TTS, edge-deployed ASR for sub-200ms transcription, streaming LLM responses, TTS chunking and progressive audio playback, and where to deploy each component to hit under 800ms end-to-end.
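
A latency budget can be kept as a simple table of per-stage figures checked against the 800ms target. In the sketch below, the stage names and millisecond values are illustrative placeholders, not measurements; with streaming, the figures that matter are time to first LLM token and time to first TTS audio chunk rather than full completion times.

```python
BUDGET_MS = 800

def check_budget(stage_ms, budget_ms=BUDGET_MS):
    """Summarise a per-stage latency table against an end-to-end budget."""
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "slowest_stage": max(stage_ms, key=stage_ms.get),
        "within_budget": total <= budget_ms,
    }

# Placeholder numbers for illustration only.
example = {
    "network_in": 40,
    "asr_final": 180,
    "llm_first_token": 300,
    "tts_first_chunk": 150,
    "network_out": 40,
}
```

Here `check_budget(example)` reports a 710 ms total with 90 ms of headroom and flags `llm_first_token` as the stage to optimise first.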

Pipecat · LiveKit agents · LangChain voice

Agentic Voice Pipelines: Tools, Memory & Handoff

Connecting voice agents to databases, document stores, and APIs mid-conversation. Conversation memory across sessions, dynamic tool invocation without interrupting speech flow, and human handoff patterns that preserve full context.
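
The pattern behind tool invocation and handoff can be sketched without any framework: a registry of callable tools, a dispatcher for the model's tool-call request, and a payload that carries context to a human agent. The JSON shape, the `lookup_balance` tool, and the in-memory data are all hypothetical; Pipecat, LiveKit agents, and LangChain each define their own formats.

```python
import json

TOOLS = {}

def tool(name):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("lookup_balance")
def lookup_balance(account_id):
    # Hypothetical stand-in for a database or API call.
    fake_db = {"ACC-1": 1250.0}
    return {"account_id": account_id, "balance": fake_db.get(account_id)}

def dispatch(call_json):
    """Run the tool named in a JSON request of the form {"name", "arguments"}."""
    call = json.loads(call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"error": f"unknown tool: {call['name']}"}
    return fn(**call.get("arguments", {}))

def handoff_payload(transcript, tool_results, reason):
    """Bundle what a human agent needs to continue the call with full context."""
    return {
        "reason": reason,
        "transcript": list(transcript),
        "tool_results": list(tool_results),
    }
```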

Campaign management · CRM integration · Recording pipelines

Inbound & Outbound Campaign Architecture

Building scalable outbound calling infrastructure, campaign management, recording pipelines, real-time transcript export to CRM, structured data extraction from calls, and monitoring for production voice systems.
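
One building block of outbound infrastructure is capping the number of simultaneous calls so a campaign stays within trunk or provider limits. A minimal sketch using an `asyncio` semaphore; `place_call` is a hypothetical stand-in for a real telephony provider call, and the returned peak count is there only to make the cap observable.

```python
import asyncio

async def place_call(number):
    # Hypothetical stand-in for a telephony provider API call.
    await asyncio.sleep(0.01)
    return {"number": number, "status": "completed"}

async def run_campaign(numbers, max_concurrent=2, dial=place_call):
    """Dial every number with at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def worker(number):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            try:
                return await dial(number)
            finally:
                active -= 1

    # gather returns results in the same order as the input numbers.
    results = await asyncio.gather(*(worker(n) for n in numbers))
    return results, peak
```

A call such as `asyncio.run(run_campaign(numbers, max_concurrent=50))` would run the whole list while never exceeding 50 live calls.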

What you walk away with

Three deliverables,
not just takeaways.

Every participant leaves with working code and documented architecture decisions — not just slides.

Working voice agent

Every team ships a working real-time voice agent during the workshop — with ASR, TTS, and agentic tool use integrated. Includes a sample telephony integration ready to connect to your provider.

Latency profiling report

A documented analysis of your stack's latency budget — with specific recommendations for hitting under 800ms end-to-end on your chosen infrastructure and telephony provider.

Stack architecture decision guide

A framework for choosing ASR, TTS, and telephony providers for your specific use case — with cost, latency, and accuracy tradeoffs documented so your team can make the right call without starting from scratch.

Pricing & format

In Person, Hands-On.

In-Person

On enquiry: pricing depends on location, travel & customisation

Duration: 2 full days (9am–5pm), or split across 4 half-days

Group size: Up to 30, small enough for live Q&A and lab help

Format: 70% hands-on labs · 30% concept and architecture review

Pre-work: Brief call + GPU setup guide sent 1 week before. Labs adapted to your stack.

Post-workshop: 30 days of async Q&A with facilitators, included in every booking

Book this workshop

Who attends

Built for engineers who need to deploy voice AI, not just demo it.

This is a technical workshop. Participants should be comfortable with Python and have basic familiarity with audio APIs or LLM integrations. No prior voice AI experience needed.

Backend & platform engineers

Building the infrastructure layer for voice AI systems

AI/ML engineers

Integrating speech models into production pipelines

Product engineers

Building voice-first features into existing products

Telephony engineers

Modernising legacy IVR systems with AI-native voice agents

Engineering managers

Understanding the real architectural constraints of voice AI at scale

Real-world context

Use cases we build
during the workshop.

Labs are structured around real enterprise voice AI deployments — not toy demos. We adapt these to your industry in the pre-workshop brief.

BFSI

Common questions

Before you reach out.

Do we need telephony infrastructure before the workshop?

No — workshop labs use LiveKit or a simulated environment so participants don't need SIP trunks or phone numbers pre-configured. In the pre-workshop brief, we can plan labs around your actual telephony provider if you want deeper integration practice.

Which ASR and TTS models are used in the labs?

We use open-source models (Faster-Whisper for ASR, XTTS for voice cloning) as the baseline — so participants can run labs without third-party API costs. We also cover commercial options (Deepgram, ElevenLabs, OpenAI TTS) for teams evaluating managed services.

How relevant is this for Indian language voice AI specifically?

Very. Module 3 is dedicated to Indian languages — ASR fine-tuning for Hindi, Tamil, Telugu, and other Indian languages, code-switching, and the specific acoustic challenges of Indian call centres. We also cover IndicASR and AI4Bharat models, which most workshops ignore.

What compute do participants need?

A machine with a modern GPU, or access to a cloud GPU (we send a setup guide). Minimum: NVIDIA T4 or equivalent. Most TTS and ASR models can run on 8–16GB VRAM. If your team is cloud-only, we'll adapt the labs to Colab or your preferred cloud provider.

How does this fit with NextNeural — your Voice AI platform?

This workshop teaches the underlying engineering — ASR, TTS, telephony, and agent integration. NextNeural is our pre-built platform that assembles these layers. Workshop participants can choose to deploy the stack they build, or adopt NextNeural as their production platform. Either path works.

What if our team has questions during production rollout after the workshop?

30 days of async Q&A with the facilitators is included. Architecture reviews, debugging help, stack-specific questions — covered. For deeper, ongoing implementation support, we also offer structured embed engagements.

Voice AI Programming.
From audio bytes to production calls.

The only voice AI workshop
built for Bharat.

Use cases we build
during the workshop.

Outbound Calling for MFI Collections

Medical Transcription

Call Recording Analysis

Ready to build voice AI
that actually ships?
