This article describes how you can build a voice-enabled assistant for product discovery and support.

If you talk to any D2C founder, one problem comes up again and again: customers add items to their cart, then they disappear.
Most Shopify sellers try to fix this using:
All of these approaches assume the customer is willing to read, scroll, and process information, usually in English.
But in India, that assumption often breaks.
India is a voice-native market. People are comfortable asking questions verbally, especially when shopping:
When answers are buried inside long descriptions written in English, customers don’t always abandon because they dislike the product. Often, they leave because finding the right information takes effort.
This project started from a simple observation:
If customers can speak their questions, why should stores only reply in text?
The goal was not to build a fancy chatbot.
The goal was to build a store-aware voice assistant that:
The end result is an assistant that behaves like a helpful store staff member, not an FAQ page.
At a conceptual level, the system works like this:
This loop continues until the customer says “exit” or “stop”.
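That loop can be sketched as a small driver function. The names `listen`, `answer`, and `speak` stand in for the speech-to-text, RAG, and text-to-speech pieces described below; they are illustrative, not necessarily the repository's exact function names:

```python
def run_assistant(listen, answer, speak):
    # Core interaction loop: capture a spoken question, answer it from
    # store data, speak the reply, and repeat until the customer quits.
    while True:
        query = listen().strip()
        if not query:
            continue  # nothing was heard; listen again
        if query.lower() in ("exit", "stop"):
            speak("Thanks for visiting. Goodbye!")
            break
        speak(answer(query))
```

Keeping the loop free of any Shopify- or model-specific code is what lets each layer be swapped out independently.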

The engagement depends on the quality of the responses. And we need to remember that voice only amplifies wrong answers—it does not fix them.
The assistant is built using real Shopify store information:
This data is fetched from Shopify, cleaned to remove HTML, and converted into plain text.
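The HTML-stripping part of that cleanup can be done with Python's standard-library parser alone. This is a minimal sketch; the `html_to_text` helper is an illustrative name, and the repository's cleaning code may differ:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    # Collects only the visible text nodes, discarding all tags.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html: str) -> str:
    # Strip tags from a Shopify body_html field and normalize whitespace.
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())
```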

Raw product pages are long and messy. Reading them aloud would be useless.
So the content is:
Chunking ensures:
import json
import os

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
CHUNK_SIZE = 500  # characters per chunk; the exact value is not specified in the article

def chunk_text(text, size=CHUNK_SIZE):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size
    return chunks

with open(os.path.join(SCRIPT_DIR, "shopify_data.json"), "r", encoding="utf-8") as f:
    data = json.load(f)

chunks = []
for item in data:
    for chunk in chunk_text(item["content"]):
        chunks.append({
            "title": item["title"],
            "type": item["type"],
            "content": chunk,
        })

with open(os.path.join(SCRIPT_DIR, "rag_chunks.json"), "w", encoding="utf-8") as f:
    json.dump(chunks, f, indent=2)

print("Chunking complete. Total chunks:", len(chunks))

Each chunk is converted into a vector embedding using a sentence-transformer model.
These embeddings are stored in a FAISS index, which allows semantic search.
When a customer asks:
“What sizes do you have in cotton kurta?”
The system retrieves:
This is the core of the RAG system.
Real customers don’t speak like documentation.
They say things like:
To handle this, the assistant uses:
Instead of reading the entire product description aloud, it extracts only what is relevant.
Example:
This keeps voice responses short, focused, and natural.
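One simple way to do that extraction is to score each sentence of a retrieved chunk by word overlap with the question and keep only the best ones. The repository may use something more sophisticated; this keyword-overlap version is just an illustration:

```python
def extract_relevant(chunk: str, question: str, max_sentences: int = 2) -> str:
    # Rank sentences by how many question words they share, keep the top few.
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    scored = sorted(
        sentences,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ". ".join(scored[:max_sentences]) + "."
```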
The Speech-to-Text module converts user voice input into text using the faster-whisper implementation of OpenAI's Whisper speech recognition model.
import numpy as np
import sounddevice as sd
import soundfile as sf
from faster_whisper import WhisperModel

model = WhisperModel("base", compute_type="int8")

def listen(fs=16000, silence_threshold=0.015, silence_duration=1.5, max_duration=12):
    """
    Record audio dynamically until silence is detected after speech starts.
    - silence_threshold: RMS amplitude below which audio is considered silent
    - silence_duration: seconds of silence needed to stop recording
    - max_duration: hard cap on recording length in seconds
    """
    print("Listening...")
    chunk_size = int(fs * 0.1)  # process in 100 ms chunks
    max_chunks = int(max_duration / 0.1)
    silence_chunks_needed = int(silence_duration / 0.1)
    audio_chunks = []
    silent_count = 0
    speech_started = False
    with sd.InputStream(samplerate=fs, channels=1, dtype='float32') as stream:
        for _ in range(max_chunks):
            chunk, _ = stream.read(chunk_size)
            audio_chunks.append(chunk.copy())
            rms = np.sqrt(np.mean(chunk ** 2))
            if rms > silence_threshold:
                speech_started = True
                silent_count = 0
            elif speech_started:
                silent_count += 1
                if silent_count >= silence_chunks_needed:
                    break
    if not audio_chunks or not speech_started:
        return ""
    recording = np.concatenate(audio_chunks, axis=0)
    # Convert float32 [-1, 1] → int16 for a standard WAV that Whisper handles reliably
    recording_int16 = (recording * 32767).astype(np.int16)
    sf.write("input.wav", recording_int16, fs)
    segments, _ = model.transcribe("input.wav")
    return " ".join(segment.text.strip() for segment in segments)
How It Works
Technologies Used
File: scripts/hf_stt.py
Purpose: This module enables natural voice interaction with the AI assistant, eliminating the need for manual typing.
The Text-to-Speech module converts the AI-generated text response back into natural-sounding speech.
import asyncio
import os
import tempfile

import edge_tts
import pygame

pygame.mixer.pre_init(frequency=44100, size=-16, channels=2, buffer=512)
pygame.mixer.init()

# Sweet, natural neural voice — options you can swap:
# "en-US-JennyNeural" — warm and friendly (default)
# "en-US-AriaNeural" — calm and clear
# "en-US-SaraNeural" — bright and polite
VOICE = "en-US-JennyNeural"

async def _generate(text: str, path: str):
    communicate = edge_tts.Communicate(text, voice=VOICE, rate="-5%", pitch="+0Hz")
    await communicate.save(path)

def speak(text: str):
    # Render speech to a temporary MP3, then block until playback finishes.
    path = os.path.join(tempfile.gettempdir(), "tts_reply.mp3")
    asyncio.run(_generate(text, path))
    pygame.mixer.music.load(path)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.wait(100)
How It Works
Technologies Used
File: scripts/hf_tts.py
Purpose: This module provides a fully voice-enabled AI assistant experience by converting responses into speech.
import json
import os

import faiss
from sentence_transformers import SentenceTransformer

from hf_tts import speak  # speak() from the TTS module; import path may differ

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

# Load chunked data
with open(os.path.join(SCRIPT_DIR, "rag_chunks.json"), "r", encoding="utf-8") as f:
    data = json.load(f)
texts = [item["content"] for item in data]

# Load the FAISS index built from the chunk embeddings
index = faiss.read_index(os.path.join(SCRIPT_DIR, "faiss.index"))

# Load embedding model (must match the one used to build the index)
model = SentenceTransformer("all-MiniLM-L6-v2")

welcome = "Shopify Voice Assistant started. Ask your question."
print(welcome)
speak(welcome)
Voice changes how customers interact with a store:
For Indian Shopify sellers, this can improve:
Sometimes, a clear spoken answer is all a customer needs to move forward.


Data Layer
RAG Layer
Logic Layer
Voice Layer
Each layer is independent, making the system:
(Figure: Shopify voice assistant)

Although this project focuses on voice-based support, the same architecture can be extended to:
Once the store knowledge is solid, voice becomes just another interface.
At Superteams, we help you build voice agents customized to your business needs. To learn more, speak to us.
You can explore the project here:
🔗 Project Repository:
https://github.com/AbhinayaPinreddy/Shopify-Voice-AI-Assistant