GenAI
Feb 15, 2024

Using WhisperSpeech - the Voice Synthesis AI for Voice-Over Generation at Scale

Learn how to set up and use WhisperSpeech, an open-source generative AI model for voice generation.


Introduction: Why WhisperSpeech

Are you on the hunt for a text-to-speech solution that not only meets but exceeds the requirements of modern digital content creation?

Whether it's for crafting engaging videos, adding depth to background voice-overs, or injecting life into video game characters through voice cloning, the need for advanced text-to-speech models has never been more pressing. These innovative technologies offer a bridge between the written word and spoken voice, providing a seamless way to produce content that is both accessible and captivating. With the power to transform scripts into natural, human-like speech, text-to-speech models are revolutionizing the way we create, interact with, and consume digital media.

In this blog post, we’ll explore WhisperSpeech as a solution for your text-to-speech needs. WhisperSpeech is a text-to-speech (TTS) model created by inverting OpenAI's speech-to-text model, Whisper. Developed by Collabora, this open-source project aims to provide a versatile and natural-sounding speech generation tool that is both powerful and easily customizable, much as Stable Diffusion did for image generation.

As of now, WhisperSpeech can generate voice-overs in two languages, English and Polish, using a standard female voice. If given a reference audio clip, it can also clone a voice. We’ll first give WhisperSpeech a simple piece of text to generate a voice-over from, and then, to make things more exciting, we’ll clone the voice of Robert Downey Jr. and recreate some of his dialogue from ‘Iron Man’!

Step-by-Step Guide to Generating Voice-Overs with WhisperSpeech

Let’s get started!

Spin up a Colab notebook and install WhisperSpeech by running the command below.


!pip install -Uqq WhisperSpeech

Checking if CUDA is available in our notebook environment.


def is_colab():
    """Return True when the code is running inside Google Colab."""
    try:
        import google.colab  # only importable inside Colab
        return True
    except ImportError:
        return False

import torch

# This walkthrough needs a CUDA GPU, so fail early with a helpful message.
if not torch.cuda.is_available():
    if is_colab():
        raise RuntimeError("Please change the runtime type to GPU. In the menu: Runtime -> Change runtime type (the free T4 instance is enough)")
    else:
        raise RuntimeError("Currently the example notebook requires CUDA, make sure you are running this on a machine with a GPU.")

Importing PyTorch modules and the IPython display helpers.


import torch
import torch.nn.functional as F

from IPython.display import Markdown, HTML

Importing the WhisperSpeech pipeline.


from whisperspeech.pipeline import Pipeline

Let’s load the fast SD S2A model.


pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

Let’s first test with the standard female voice provided by the model.


pipe.generate_to_notebook("""
Hello, welcome to WhisperSpeech! Today, we're excited to guide you through the innovative world
of text-to-speech technology. We'll show you how to breathe life into your text with natural-sounding
voiceovers and explore the cutting-edge features that make WhisperSpeech a game-changer in digital content creation.
""", lang = 'en', cps=15)

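The `cps` argument in the call above controls the speaking rate. We read it as characters per second, which is an assumption, since this post does not spell it out. Under that reading, a rough estimate of clip length is the character count divided by `cps`. The helper name `estimate_duration_seconds` is ours and is not part of WhisperSpeech.

```python
def estimate_duration_seconds(text: str, cps: float = 15) -> float:
    """Rough clip length, ASSUMING cps means characters per second."""
    if cps <= 0:
        raise ValueError("cps must be positive")
    # Collapse runs of whitespace so the line breaks inside a
    # triple-quoted string do not inflate the character count.
    normalized = " ".join(text.split())
    return len(normalized) / cps
```

Under this estimate, lowering `cps` from 15 to 10, as some of the later examples do, makes the same text run about 1.5 times as long.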
Listen to the output below:

Voice Cloning

Now, let’s have some fun with voice cloning. We’ll clone Robert Downey Jr’s voice and generate a voice-over for ‘Iron Man’.


pipe.generate_to_notebook("""
"Good evening, everyone. As Iron Man, I've faced some of the toughest challenges imaginable, but the real strength lies not in the suit, but in the heart and determination of those who strive to make the world a better place. Let's remember, it's not just about being heroes in battle, but being champions of innovation, courage, and unity. Together, we can overcome any obstacle."
""", lang='en', cps=15, speaker='https://tmpfiles.org/dl/4167328/robertdowneyjr.mp3')

Check the output below: 

Let’s try with a different text (the following paragraph has been generated by ChatGPT).


pipe.generate_to_notebook("""
In a world that's constantly changing, the true measure of a person is not just in the ability to adapt, but in our capacity to use our gifts for the greater good. I've seen firsthand that technology, when guided by heart and responsibility, has the power to change the world. Let's be the heroes our future needs, not just in suits of armor, but in our everyday actions.
""", lang='en', cps=10, speaker='https://tmpfiles.org/dl/4168800/robertdowneyjr.mp3')

This generates the following output: 

Next, let’s try to clone the voice of Benedict Cumberbatch and generate a voice-over of him reading passages from ‘Sherlock Holmes’. (Again, the text here has been generated by ChatGPT.) This time, we will use a longer passage to demonstrate how to use WhisperSpeech when you have a lot of content.

Here is the passage, split across three calls:


pipe.generate_to_notebook("""
As the morning mist hung low over Baker Street, Sherlock Holmes, with his characteristic keenness, peered across the sitting room at me, his friend and chronicler, Dr. John Watson. "Watson," he began, his voice carrying the thrill of a new challenge, "consider the peculiar case that has presented itself to us this day. A mystery that, I daresay, will require all our wits to unravel."
I looked up from my newspaper, intrigued by the spark in his eye.
""", lang='en', cps=10, speaker='https://tmpfiles.org/dl/4170772/y2meta.app-sherlockholmesaudiobookreadbybenedictcumberbatch_sherlockholmesaudiobook_freeaudiobook128kbps.wav')

pipe.generate_to_notebook("""
I could not help but lean forward, my interest piqued. "Robbery?" I ventured.
Holmes shook his head. "Too simple, Watson. No, there is more at play here. The constabulary found a peculiar token clenched in his fist—a small, intricately carved chess piece, a knight, to be precise. This, combined with the absence of any physical harm to the body, suggests a game of a far more sinister nature."
""", lang='en', cps=10, speaker='https://tmpfiles.org/dl/4170772/y2meta.app-sherlockholmesaudiobookreadbybenedictcumberbatch_sherlockholmesaudiobook_freeaudiobook128kbps.wav')

pipe.generate_to_notebook("""
As Holmes spoke, the room seemed to grow colder, the fog outside pressing against the windows as if trying to listen in. I knew then that we were on the cusp of an adventure that would test our limits and perhaps reveal the darker corners of the human heart.
""", lang='en', cps=10, speaker='https://tmpfiles.org/dl/4170772/y2meta.app-sherlockholmesaudiobookreadbybenedictcumberbatch_sherlockholmesaudiobook_freeaudiobook128kbps.wav')

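The three calls above split the passage by hand. A minimal sketch of doing that split automatically, breaking at sentence ends and keeping each chunk under a character budget, might look like this. The function name `split_into_chunks` and the 400-character default are illustrative assumptions, not WhisperSpeech settings.

```python
import re
from typing import List

def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Normalize whitespace, then split after sentence-ending punctuation.
    # This is naive: abbreviations such as "Dr." also count as sentence ends.
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `pipe.generate_to_notebook(chunk, lang='en', cps=10, speaker=...)` in a loop, using the same `speaker` reference as in the calls above.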
I stitched the outputs of the three calls above into a single file. It’s slightly jarring at the points where the clips are joined but, overall, it works well. You can listen to the output here:
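One way to do that stitching is with Python's standard `wave` module. The sketch below assumes each clip has already been saved as a PCM WAV file with the same sample rate, sample width, and channel count. The `wave` module cannot read float-encoded WAV files, so clips in that encoding would need converting first. The name `concat_wavs` and the `gap_seconds` pause are our own illustrative choices. A short pause between clips is one simple way to soften the joins.

```python
import wave
from typing import Sequence

def concat_wavs(paths: Sequence[str], out_path: str, gap_seconds: float = 0.0) -> int:
    """Join PCM WAV files end to end; return the number of frames written."""
    if not paths:
        raise ValueError("need at least one input file")
    params = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as clip:
            fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different format: {fmt} vs {params}")
            frames.append(clip.readframes(clip.getnframes()))
    channels, width, rate = params
    # Silence is all-zero bytes for signed PCM (sample widths of 2 bytes and up).
    silence = b"\x00" * (int(gap_seconds * rate) * channels * width)
    with wave.open(out_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        # join() places the silence between clips only, not at the ends.
        out.writeframes(silence.join(frames))
    with wave.open(out_path, "rb") as check:
        return check.getnframes()
```

A call such as `concat_wavs(["part1.wav", "part2.wav", "part3.wav"], "sherlock.wav", gap_seconds=0.3)` would then produce one file with a brief pause at each join; the file names here are placeholders.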

Final Words

So now you can experiment with WhisperSpeech for your own voice-over pipelines! Feel free to reach out to founders@superteams.ai if you need any assistance in setting up your voice-over or any other AI pipeline.