In this paper, we formulate a novel task: synthesizing speech in sync with a silent pre-recorded video, which we denote as automatic voice over (AVO). Unlike traditional speech synthesis, AVO requires not only human-like speech but also precise lip-speech synchronization. A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video. We propose VisualTTS, a novel text-to-speech model conditioned on visual input, for accurate lip-speech synchronization. VisualTTS adopts two novel mechanisms, 1) textual-visual attention and 2) a visual fusion strategy during acoustic decoding, both of which contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence. Experimental results show that VisualTTS achieves accurate lip-speech synchronization and outperforms all baseline systems.
Fig. 1: The typical workflow of automatic voice over: an AVO framework takes a lip image sequence and a text script as input, and generates speech audio in sync with the video.
Fig. 2: Model architecture of the proposed VisualTTS, which consists of a visual encoder, a textual encoder, a visual-guided aligner and an acoustic decoder. Pre-trained blocks are denoted with a lock.
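To make the data flow in Fig. 2 concrete, below is a minimal, illustrative PyTorch sketch of how the four blocks could be composed, including a cross-attention step standing in for textual-visual attention and a simple visual fusion step during decoding. The class name `VisualTTSSketch`, the module internals and the dimensions are assumptions for illustration only, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VisualTTSSketch(nn.Module):
    """Illustrative skeleton only: every block here is a simple stand-in."""

    def __init__(self, vocab_size=100, text_dim=256, visual_dim=512, mel_dim=80):
        super().__init__()
        # Textual encoder: token embeddings -> textual hidden states.
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.textual_encoder = nn.GRU(text_dim, text_dim, batch_first=True)
        # Visual encoder: per-frame lip features -> visual hidden states.
        self.visual_encoder = nn.GRU(visual_dim, visual_dim, batch_first=True)
        # Visual-guided aligner: textual queries attend to visual keys/values
        # (a stand-in for the paper's textual-visual attention).
        self.aligner = nn.MultiheadAttention(
            embed_dim=text_dim, num_heads=4,
            kdim=visual_dim, vdim=visual_dim, batch_first=True)
        # Acoustic decoder: consumes the fused textual-visual representation
        # and predicts mel-spectrogram frames (greatly simplified).
        self.decoder = nn.GRU(text_dim + visual_dim, text_dim, batch_first=True)
        self.mel_out = nn.Linear(text_dim, mel_dim)

    def forward(self, tokens, lip_feats):
        # tokens:    (batch, num_tokens)             integer text/phoneme IDs
        # lip_feats: (batch, num_frames, visual_dim) lip-region features
        text_h, _ = self.textual_encoder(self.embed(tokens))
        visual_h, _ = self.visual_encoder(lip_feats)
        # Textual-visual attention: align each text state with lip frames.
        aligned, attn = self.aligner(text_h, visual_h, visual_h)
        # Visual fusion during decoding (simplified): concatenate a pooled
        # visual context onto every aligned textual state.
        visual_ctx = visual_h.mean(dim=1, keepdim=True).expand(-1, aligned.size(1), -1)
        dec_h, _ = self.decoder(torch.cat([aligned, visual_ctx], dim=-1))
        return self.mel_out(dec_h), attn
```

In the actual model the decoder would operate autoregressively at frame level; here the encoders and decoder are collapsed into single GRUs for brevity.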
Speech Samples:
As the Baseline, we use a modified version of Tacotron [1] that takes no visual input.
We replace the original pre-recorded speech in the videos from the GRID dataset with synthetic speech samples produced by the Baseline and by VisualTTS.
Compared to the Baseline, VisualTTS achieves better lip-speech synchronization, as shown in the videos below.
With the help of visual information, VisualTTS models duration more accurately: the duration distortion between the speech samples of VisualTTS and the Ground Truth is smaller than that of the Baseline (one way such a comparison could be measured is sketched after these notes).
VisualTTS can also generate silent segments and pauses between phonemes where the input lip sequence indicates silence or little lip motion.
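As a rough illustration of the duration comparison mentioned above, here is a minimal sketch that measures duration distortion as the mean absolute difference in utterance length between paired synthesized and ground-truth recordings. The metric definition and the helper names (`utterance_duration`, `mean_duration_distortion`) are assumptions for illustration; the paper's metric may be defined differently (e.g. at phoneme level via forced alignment).

```python
import soundfile as sf

def utterance_duration(path: str) -> float:
    """Duration of an audio file in seconds, read from its header."""
    info = sf.info(path)
    return info.frames / info.samplerate

def mean_duration_distortion(synth_paths, ground_truth_paths) -> float:
    """Mean absolute duration difference (seconds) over paired utterances."""
    diffs = [abs(utterance_duration(s) - utterance_duration(g))
             for s, g in zip(synth_paths, ground_truth_paths)]
    return sum(diffs) / len(diffs) if diffs else 0.0
```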