DeepDub

We live in a highly connected world. People around the world, of all cultures, are consuming entertainment at a faster pace than ever before in history. Media is quickly being imported and exported between countries via the Internet.

Project Title

DeepDub

Project Area of Specialization

Artificial Intelligence

Project Summary

In this media, the visual medium holds great power. It can convey human presence much as it would be seen and heard in person. However, it brings with it a problem that predates all of this technology: the clash of languages and cultures. However much English may be considered a “default” medium, only 25% of the world has even a faint understanding of it. English-language entertainment is therefore not a silver bullet for global entertainment, as it may not suit viewers of other cultures.

Our Final Year Project, named “DeepDub”, aims to provide seamless translations of videos. It will achieve this by:

Translating videos with just one stream of audio
Generating voices similar to those of the original performers
Capturing and translating the emotion in the voices
Syncing the lips of the actors to the translated audio

This will make dubbed videos much more entertaining and relatable to viewers of different languages. It will also reduce the cost of localization substantially, as it makes retakes and adjustments easier. In the best case, no voice actor would be needed at all, which also saves staffing costs and time.

Project Objectives

Create a working system/engine which can translate videos from one language to another
Support Urdu, English and Turkish
Create practical applications (web app and/or mobile app) which demonstrate the power of the engine
Produce high quality results that are acceptable to a general audience
Write at least 2 research papers on the idea and the implementation (Research Gap)
Create an End-to-End system for quicker results

Project Implementation Method

Our proposed implementation requires a video, a source language and a target language as input from the user. It comprises multiple independent modules which run in sequential order. Following is a brief overview of each module:

The first module is an Automatic Speech Recognition (ASR) model which recognises and transcribes the speech utterances made in the input video. The current implementation uses the open source ASR model wav2vec2. The output of this module is a raw transcription file containing the speech utterances the ASR model has predicted.
This raw transcription file acts as the input for the next module, a Forced Aligner, whose purpose is to timestamp the speech utterances relative to the time at which they were spoken in the input video. The output of this module is a word-by-word Time Stamped Transcript file. The current implementation uses the open source Aeneas forced aligner to perform this alignment.
The time stamped transcript file is then given as input to our custom built Clustering module, which groups the transcribed words into sentences. It does so by calculating the time difference between consecutive timestamped words and clustering together those words that are separated by only a small difference.
At this point, we obtain timestamped sentences, which are essentially the contents of an SRT file. This SRT file is then given as input to a Text-to-Text translation module which converts the original source text into the target language the user has selected. The current implementation uses the open source Opus Text-to-Text translation module, which supports a wide range of languages. The output of this module is a Translated Time Stamped Transcript (SRT) file.
This Translated SRT file, along with the input video, acts as the input to the next module, the Text-to-Speech module, which converts the translated text into audio. The current implementation uses an open source module named Real Time Voice Cloning (an unofficial implementation of the research paper “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”), which not only performs text to speech but also clones the speaker's voice from the input video. The output of this module is a synthesized Audio file based on the translated SRT file.
The final module is a GAN based Speech-to-Lip (lip synchronization) model which takes the synthesized audio and the input video as its input. It uses these inputs to synchronize the lips of the actor in the video with the provided audio file. This module is integrated into the pipeline to make the output much more immersive and natural. The current implementation uses the open source model Wav2Lip, based on the research paper “A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild”. The output of this module is a completely translated and visually synchronized video in the user's specified target language.
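
The clustering step and the SRT output described above can be illustrated with a minimal Python sketch. It is not the project's actual module: the `Word` type, the function names and the 0.6 second pause threshold (`max_gap`) are assumptions chosen for illustration, and the rule used here (start a new sentence whenever the pause between two consecutive words exceeds the threshold) is only one simple way to cluster words by their time differences.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One word from the forced aligner (hypothetical structure)."""
    text: str
    start: float  # seconds from the start of the video
    end: float


def cluster_words(words: List[Word], max_gap: float = 0.6) -> List[List[Word]]:
    """Group consecutive words; a pause longer than max_gap starts a new group.

    max_gap is an assumed threshold, not a value taken from the project.
    """
    clusters: List[List[Word]] = []
    for word in words:
        if clusters and word.start - clusters[-1][-1].end <= max_gap:
            clusters[-1].append(word)
        else:
            clusters.append([word])
    return clusters


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp of the form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(clusters: List[List[Word]]) -> str:
    """Render each cluster as one numbered SRT cue."""
    blocks = []
    for index, cluster in enumerate(clusters, start=1):
        start, end = cluster[0].start, cluster[-1].end
        text = " ".join(w.text for w in cluster)
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    words = [
        Word("hello", 0.0, 0.4),
        Word("world", 0.5, 0.9),
        Word("good", 2.0, 2.3),
        Word("morning", 2.4, 2.9),
    ]
    print(to_srt(cluster_words(words)))
```

With the sample words, the long pause between "world" and "good" splits the input into two cues. In practice the threshold would likely need tuning per speaker and language, since speaking pace varies.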

Benefits of the Project

Statistics, Observations and Benefits

We observed that (Globally):

According to a Cisco forecast, video would make up 82% of all Internet traffic
Only 25% of the world understands English
Many people prefer subtitles over dubbed videos, even though subtitles distract heavily from the viewing experience. (The reason? Dubbing is not realistic enough)
Dubbing a video, at “good” quality, can cost anywhere from $30,000 to $100,000 for a 90-minute program or film (Expensive? That’s just for one language!)

Our observations (Local):

- Increased interest in dubbed Turkish drama serials, especially after Ertugrul
- Immense gap between the original audio and dubbed voice-over audio, leading to desynchronized actor expressions (especially facial and emotional)
- Viewers already relate to foreign actors a lot
- The COVID-19 Pandemic has highly affected the way we consume media and entertainment

Through our proposed implementation method we aim to:

- Make content easy to understand through dubbed language
- Increase audience immersion by generating realistic, synchronized output videos

Target Industries:

This project aims to facilitate faster, more cost-efficient and higher quality production of dubbed video content. Its applications therefore cover a broad range of video content production industries. These industries include:

Film Production industry
Drama Production industry
TV broadcasting channels
News channels
Media streaming platforms such as Netflix, Amazon Prime, TikTok
Educational Platforms such as Udemy, Coursera
Live speech sessions on online platforms
Video Calling/Streaming on Social Media

Technical Details of Final Deliverable

Project architecture:

Engine: A hybrid of Natural Language Processing and Deep Learning models, made up of 6 modules as explained in the Project Implementation Method section. The current implementation uses open source models and tools for five of them; the clustering module is custom built. The open source components, in their sequential order, are as follows:
- Wav2vec2: Used for Automatic Speech Recognition
- Aeneas: Used for Forced Alignment of Raw Transcription file
- Opus: Used for Text-to-Text translation
- Real Time Voice Cloning: Used for Text-to-Speech conversion and actor voice cloning
- Wav2Lip: Used for facial synchronization of actor in accordance with the translated audio
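
The sequential hand-off between the six modules can be sketched in Python as below. This is a structural illustration only: the stage table, the artifact names (`raw_transcript`, `source_srt` and so on) and the `run_pipeline` function are hypothetical, and each stage records a placeholder string instead of calling the real model. What it shows is the order of the modules and which earlier outputs each one depends on.

```python
from typing import Dict, List, Tuple

# (module name, inputs it consumes, output it produces), in pipeline order.
# All artifact names are assumptions made for this sketch.
STAGES: List[Tuple[str, List[str], str]] = [
    ("wav2vec2 ASR", ["video", "source_lang"], "raw_transcript"),
    ("Aeneas forced alignment", ["video", "raw_transcript"], "word_timestamps"),
    ("custom clustering", ["word_timestamps"], "source_srt"),
    ("Opus translation", ["source_srt", "source_lang", "target_lang"], "translated_srt"),
    ("Real Time Voice Cloning", ["video", "translated_srt"], "dubbed_audio"),
    ("Wav2Lip", ["video", "dubbed_audio"], "dubbed_video"),
]


def run_pipeline(video_path: str, source_lang: str, target_lang: str) -> Dict[str, str]:
    """Run the stages in order, checking that each stage's inputs already exist."""
    artifacts: Dict[str, str] = {
        "video": video_path,
        "source_lang": source_lang,
        "target_lang": target_lang,
    }
    for name, inputs, output in STAGES:
        missing = [key for key in inputs if key not in artifacts]
        if missing:
            raise ValueError(f"{name} is missing inputs: {missing}")
        # Placeholder: a real stage would run its model here and store a file path.
        artifacts[output] = f"<{output} produced by {name}>"
    return artifacts


if __name__ == "__main__":
    result = run_pipeline("clip.mp4", "tr", "ur")
    for key, value in result.items():
        print(f"{key}: {value}")
```

Keeping the stages in a table like this makes it straightforward to swap one open source component for another, or to restart from a later stage (for example, from translation onwards after a transcript has been edited), provided the earlier artifacts are still available.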
Frontend: The front end is a web application that lets the user record a live video through the device’s camera and select both the source and the target language. After the recording has been processed by our backend engine, the user is taken to a preview screen that presents both the original video and the dubbed video, each along with its transcript. The user can edit the generated transcriptions through the website and then resubmit the video with the updated transcription, so that the dubbed video is reprocessed to produce better results.
The current frontend implementation is built using ReactJS.