Bringing Videos to Life: From Concept to Video Using Generative AI - An Experiment

Introduction

Generative AI is a field of artificial intelligence that synthesizes images, audio, and video. It can simplify media workflows, streamline stock footage creation, generate audio effects, and more.

For those unfamiliar with the subject, this is how it works: you give a natural language prompt to a suitable model, and it generates the output you describe. For instance, you could give the prompt ‘Create an image of an astronaut walking on the moon’ to a text-to-image tool, and it will return multiple generated images to choose from.

Generative AI has the potential to revolutionize the field of synthetic content generation. The years 2021-23 saw a drastic shift in this technology because of the many open-source tools now available to convert text to images, videos, and audio. Growth has been rapid: a new experiment or model seems to be released every week. It seems likely that media creation will eventually integrate many such models to generate high-quality content that stretches the boundaries of creativity.

In media production specifically, future workflows will increasingly combine footage and audio that have been shot or recorded with material produced using Generative AI techniques. We already see examples of this in several action-packed Hollywood films.

This technology is extremely potent, as it can solve the challenge of finding stock footage and other associated media required to create a video - thereby unleashing a kind of creativity we have never seen before.

At MediaAI Lab, a project of Superteams.ai, we have started exploring the possibilities and limits of this new technology. In this article, we give a step-by-step outline of how we generated a video from a simple concept using only open-source Generative AI technologies.

The Experiment

This is an early experiment we conducted at MediaAI Lab to see what is possible. In this experiment, we have used the following open-source Generative AI tools:

Bark for text-to-audio
Text2Video-Zero for text-to-video

Bark was chosen because it can generate highly realistic, multilingual speech as well as other audio - including music, background noise, and simple sound effects.

Text2Video-Zero was chosen because it needs no additional training. It is built on top of Stable Diffusion, a model that generates high-quality images.

We have also picked a Cloud GPU partner that we found to be highly performant for this experiment - E2E Networks. They are amongst the few who offer NVIDIA A100 in an hourly billing and prepaid model in India, and it worked well for us. You can try this experiment with any other provider or your own set-up.

Note that the video we generated is a first experiment, so it's rudimentary. However, our purpose was not to create a polished final version, but to play with the possibilities of this emerging tech. In a future post, we will take this further and attempt to get closer to production-grade quality.

The Workflow

We followed the workflow below to conduct the experiment:

The Problem Statement

We decided on an image and had to create a story out of it. The image is depicted below:

Using the keywords from the above image, we wrote the following story:

The timeline begins with the Paleolithic era, which lasted from 3 million years ago to 10,000 years ago. During this time, humans were hunter-gatherers who lived in small groups. They were the first humans to use fire, which allowed them to cook food and stay warm.

The Neolithic era began 10,000 years ago, and it was a time of great change for humans. During this time, humans began to practice agriculture, which allowed them to produce more food and live in larger groups. They also began to develop new technologies, such as pottery and writing.

The invention of writing in 3000 BC marked the beginning of history. This allowed humans to record their thoughts and experiences, which helped to preserve their culture and knowledge.

The timeline then divides human history into three main periods: the ancient age, the medieval age, and the modern age.

The ancient age lasted from 3000 BC to 476 AD. During this time, some of the world's most famous civilizations flourished, including the Egyptians, the Greeks, and the Romans.

The medieval age lasted from 476 AD to 1492 AD. During this time, Europe was dominated by the Roman Catholic Church, and there was a lot of conflict between different kingdoms and empires.

The modern age began in 1492 AD, and it is the period that we are currently living in. During this time, there have been many major advances in technology, science, and culture.

The timeline ends in 2011 AD, but it is important to remember that history is still being written. We are constantly making new discoveries and creating new technologies, and the future of humanity is still unwritten.

This is a reminder of how far we have come as a species, and how much we have accomplished. It is also a reminder that we are still learning and growing, and that the future is full of possibilities.

Text to Audio Implementation Using Bark

To convert the above script to audio, we took the following steps:

On the E2E Cloud platform, we created a GPU node with the NVIDIA RAPIDS image:

Then we installed the necessary packages:

pip install nltk
pip install git+https://github.com/suno-ai/bark.git

In the notebook, we ran the following cells:

Imported the necessary libraries:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use the first GPU only

from IPython.display import Audio
import nltk  # used to split the script into sentences
import numpy as np

from bark import generate_audio, SAMPLE_RATE
from bark.generation import preload_models

# sent_tokenize needs the Punkt tokenizer data; newer NLTK releases may ask for "punkt_tab" instead
nltk.download("punkt")

# Download (on first run) and load the Bark models
preload_models()

You should see an output along these lines:

Downloading text_2.pt:   0%|          | 0.00/5.35G [00:00<?, ?B/s]
Downloading (...)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]
Downloading (...)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]
Downloading (...)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]
Downloading coarse_2.pt:   0%|          | 0.00/3.93G [00:00<?, ?B/s]
Downloading fine_2.pt:   0%|          | 0.00/3.74G [00:00<?, ?B/s]
Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
100%|██████████| 88.9M/88.9M [00:00<00:00, 100MB/s]

Let's now use the script we wrote above as the prompt for audio generation.

#PROMPT

script = """
The timeline begins with the Paleolithic era, which lasted from 3 million years ago to 10,000 years ago. During this time, humans were hunter-gatherers who lived in small groups. They were the first humans to use fire, which allowed them to cook food and stay warm.
The Neolithic era began 10,000 years ago, and it was a time of great change for humans. During this time, humans began to practice agriculture, which allowed them to produce more food and live in larger groups. They also began to develop new technologies, such as pottery and writing.
The invention of writing in 3000 BC marked the beginning of history. This allowed humans to record their thoughts and experiences, which helped to preserve their culture and knowledge.
The timeline then divides human history into three main periods: the ancient age, the medieval age, and the modern age.
The ancient age lasted from 3000 BC to 476 AD. During this time, some of the world's most famous civilizations flourished, including the Egyptians, the Greeks, and the Romans.
The medieval age lasted from 476 AD to 1492 AD. During this time, Europe was dominated by the Roman Catholic Church, and there was a lot of conflict between different kingdoms and empires.
The modern age began in 1492 AD, and it is the period that we are currently living in. During this time, there have been many major advances in technology, science, and culture.
The timeline ends in 2011 AD, but it is important to remember that history is still being written. We are constantly making new discoveries and creating new technologies, and the future of humanity is still unwritten.
This is a reminder of how far we have come as a species, and how much we have accomplished. It is also a reminder that we are still learning and growing, and that the future is full of possibilities.
""".replace("\n", " ").strip()

# Tokenize the script into sentences
sentences = nltk.sent_tokenize(script)

SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

# Generate audio and create an array of spoken sentences
pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]

For each sentence in the script, you will see output along these lines:

100%|██████████| 416/416 [00:03<00:00, 108.90it/s]
...

You then concatenate the pieces, set the sample rate, and generate the final audio that has been used in the video below.

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
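The Audio widget only plays the result inside the notebook. To combine the narration with the video later, it helps to have it as a file. Below is a minimal sketch of one way to do that with Python's standard wave module; save_wav and the file name narration.wav are hypothetical, and the code assumes the arrays hold float samples in the range [-1, 1].

```python
import wave

import numpy as np


def save_wav(pieces, path, sample_rate):
    """Concatenate float audio arrays and write them as 16-bit mono PCM."""
    audio = np.concatenate(pieces)
    # Clip to [-1, 1] before scaling so stray peaks do not wrap around.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2")  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate  # duration in seconds
```

In the notebook above, this could be called as save_wav(pieces, "narration.wav", SAMPLE_RATE). The returned duration is handy for checking how long the narration is before matching it against the video.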

Text to Video

Once we were done with audio generation, we moved on to video generation.

To do this, we cloned the Text2Video-Zero repository and moved into its directory:

git clone https://github.com/Picsart-AI-Research/Text2Video-Zero.git
cd Text2Video-Zero/

Then install the requirements, using Python 3.10 and CUDA >= 11.6:

virtualenv --system-site-packages -p python3.10 venv
source venv/bin/activate
pip install -r requirements.txt

Inference API

To run inference, we created an instance of the Model class:

import torch
from model import Model
model = Model(device = "cuda", dtype = torch.float16)

This loads the model.

Generating the Video

Now that we have loaded the model, let's get to the work of actually generating the video.

For this, we made minor changes in the parameters to generate the video from the script. We changed the frames per second to 1 and video_length to 10.
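With these settings, each clip holds 10 frames played at 1 frame per second, so it lasts about 10 seconds. The quick check below assumes video_length counts frames; clip_seconds is a hypothetical helper, and the figure of nine clips corresponds to the nine paragraphs of the script.

```python
def clip_seconds(video_length, fps):
    """Playback time of a clip with `video_length` frames at `fps` frames per second."""
    return video_length / fps


per_clip = clip_seconds(video_length=10, fps=1)  # 10 frames at 1 fps
total = 9 * per_clip                             # nine prompts, one clip each
print(per_clip, total)
```

Comparing this total with the length of the narration gives a rough idea of whether the clips need to be longer or slower to cover the audio.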

The issue with this model is that it does not take long prompts, so we divided the script into smaller chunks and entered each one of them as separate prompts:

The timeline begins with the Paleolithic era, which lasted from 3 million years ago to 10,000 years ago. During this time, humans were hunter-gatherers who lived in small groups. They were the first humans to use fire, which allowed them to cook food and stay warm.
The Neolithic era began 10,000 years ago, and it was a time of great change for humans. During this time, humans began to practice agriculture, which allowed them to produce more food and live in larger groups. They also began to develop new technologies, such as pottery and writing.
The invention of writing in 3000 BC marked the beginning of history. This allowed humans to record their thoughts and experiences, which helped to preserve their culture and knowledge.
The timeline then divides human history into three main periods: the ancient age, the medieval age, and the modern age.
The ancient age lasted from 3000 BC to 476 AD. During this time, some of the world's most famous civilizations flourished, including the Egyptians, the Greeks, and the Romans.
The medieval age lasted from 476 AD to 1492 AD. During this time, Europe was dominated by the Roman Catholic Church, and there was a lot of conflict between different kingdoms and empires.
The modern age began in 1492 AD, and it is the period that we are currently living in. During this time, there have been many major advances in technology, science, and culture.
The timeline ends in 2011 AD, but it is important to remember that history is still being written. We are constantly making new discoveries and creating new technologies, and the future of humanity is still unwritten.
This is a reminder of how far we have come as a species, and how much we have accomplished. It is also a reminder that we are still learning and growing, and that the future is full of possibilities.
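The chunks above follow the paragraphs of the script. A minimal sketch of that split, assuming the script is kept as a string with one paragraph per line, could look like this (split_into_prompts is a hypothetical helper, not part of Text2Video-Zero, and the sample text is shortened for illustration):

```python
def split_into_prompts(script):
    """Return one prompt per non-empty line of the script."""
    return [line.strip() for line in script.splitlines() if line.strip()]


sample_script = """
The ancient age lasted from 3000 BC to 476 AD.

The medieval age lasted from 476 AD to 1492 AD.
"""
prompts = split_into_prompts(sample_script)
print(len(prompts))
```

Note that this needs the script with its line breaks intact, so it should be applied before any step that replaces newlines with spaces.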

Then we ran the following Python code, which stored the result in mnt/text2video/video.mp4:

# Repeat this for each prompt in the list above.
# Change out_path for each prompt, otherwise every run overwrites video.mp4.
prompt = "your_prompt"
params = {"t0": 44, "t1": 47, "motion_field_strength_x": 12, "motion_field_strength_y": 12, "video_length": 10}

out_path, fps = "video.mp4", 1
model.process_text2video(prompt, fps=fps, path=out_path, **params)
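Because the output path in the snippet is fixed, running it once per prompt by hand means renaming the file each time. One way to automate this is sketched below. render_all and the clip file names are hypothetical; the function takes the rendering step as a callable, so the loop itself can be tried without a GPU.

```python
from pathlib import Path


def render_all(prompts, render_fn, out_dir="clips"):
    """Call render_fn(prompt, path) once per prompt and return the clip paths in order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, prompt in enumerate(prompts):
        # Zero-padded names keep the clips in script order when sorted.
        path = out / f"clip_{index:02d}.mp4"
        render_fn(prompt, str(path))
        paths.append(str(path))
    return paths
```

In the notebook, render_fn could be lambda prompt, path: model.process_text2video(prompt, fps=1, path=path, **params), reusing the model and params defined above.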

This generates a series of videos, one for each prompt listed above. We then concatenated the generated videos, combined the result with the audio, and added a title animation. Here's the final version, the outcome of our experiment.
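The article does not name the tool used for the concatenation and audio step. One common option is ffmpeg, and the sketch below shows how that might look under that assumption. All file names are hypothetical. The first command joins the clips with ffmpeg's concat demuxer; the second copies the video stream, encodes the narration as AAC, and stops at the shorter of the two inputs.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_commands(clip_paths, audio_path, list_path="clips.txt",
                          joined_path="joined.mp4", final_path="final.mp4"):
    """Write the concat list file and return the two ffmpeg commands as argument lists."""
    # The concat demuxer reads one "file '<path>'" line per clip.
    lines = [f"file '{Path(p).as_posix()}'" for p in clip_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    concat = ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
              "-i", list_path, "-c", "copy", joined_path]
    # Copy the video stream, encode the narration, stop at the shorter input.
    mux = ["ffmpeg", "-y", "-i", joined_path, "-i", audio_path,
           "-c:v", "copy", "-c:a", "aac", "-shortest", final_path]
    return concat, mux


def run_all(commands):
    """Run each command in order, raising if ffmpeg exits with an error."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_all(build_ffmpeg_commands(clip_paths, "narration.wav")) would then produce final.mp4, provided ffmpeg is installed. Relative clip paths in the list file are resolved relative to the list file's location. The title animation is a separate editing step and is not covered here.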

Try it out yourself and feel free to write to us with your experience.

Conclusion

Generative AI is revolutionizing the field of content generation and media. We are excited to see what the future holds. These technologies are allowing media companies to create content on a low budget and in much less time - letting creators use their creativity to its maximum potential, without having to worry about the technicalities behind it.

At Superteams.ai, we conduct regular experiments to understand the future of media and to take part in it. We are also partnering with cloud GPU providers and media companies to build and test such workflows.

To get in touch with us, write to info@superteams.ai.