Orator: LLM-Guided Multi-Shot Speech Video Generation

Additional Video Generation Results

We present additional generated long-form human speech videos with multiple camera shots.

Additional Comparisons of Camera Switching

Abstract

In this work, we propose a novel system for automatically generating multi-shot speech videos with natural camera transitions, driven by input text and reference images captured from various camera angles. Existing human video generation datasets and methods are largely centered on faces or half-body single-shot videos, and thus lack the capacity to produce multi-shot videos with full-body dynamic movements seen from different camera angles. To address this gap, we first introduce TalkCuts, a large-scale dataset containing over 500 hours of human speech videos with diverse camera shots, rich 3D SMPL-X motion annotations, and camera trajectories, covering a wide range of identities. Based on this dataset, we further propose an LLM-guided multi-modal generation framework, named Orator, in which the LLM serves as a multi-role director, generating detailed instructions for camera transitions, speaker gestures, and vocal delivery. These instructions drive a multi-modal video generation module, enabling the system to produce coherent long-form videos. Extensive experiments show that our framework successfully generates coherent and engaging multi-shot speech videos.
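To make the director's role concrete, the sketch below illustrates one way such structured per-shot instructions could be represented and consumed. It is a minimal illustration under our own assumptions, not the paper's implementation: the `ShotDirective` schema, the `direct` function, and the `llm` callable are hypothetical names used only for this example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ShotDirective:
    """One shot planned by an LLM director (hypothetical schema)."""
    text: str         # speech segment to be delivered in this shot
    camera: str       # e.g. "full-body front", "close-up left"
    gesture: str      # high-level gesture cue, e.g. "open palms, step forward"
    vocal_style: str  # delivery cue, e.g. "emphatic, slower pace"


def direct(script: str, llm: Callable[[str], str]) -> List[ShotDirective]:
    """Ask an LLM to split a speech script into shots with camera,
    gesture, and vocal-delivery instructions (illustrative only)."""
    prompt = (
        "Split the following speech into shots. Return a JSON list where each "
        "item has the keys: text, camera, gesture, vocal_style.\n\n" + script
    )
    shots = json.loads(llm(prompt))  # expects a JSON list of shot objects
    return [ShotDirective(**shot) for shot in shots]


if __name__ == "__main__":
    # Stand-in for a real LLM call; a deployed system would query an actual model.
    def fake_llm(prompt: str) -> str:
        return json.dumps([{
            "text": "Thank you all for being here today.",
            "camera": "full-body front",
            "gesture": "open arms toward audience",
            "vocal_style": "warm, measured pace",
        }])

    for shot in direct("Thank you all for being here today. ...", fake_llm):
        print(shot)
```

In a full pipeline, each directive would then condition the multi-modal video generation module; here the directives are simply printed.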

TalkCuts Dataset

Dataset Overview

We present samples from the TalkCuts dataset, which features a diverse collection of videos from talk shows, TED talks, stand-up comedy, and other speech scenarios.

Dataset Annotations

We provide visualizations of the 2D keypoint and 3D SMPL-X annotations in the TalkCuts dataset.
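For readers unfamiliar with SMPL-X, the snippet below sketches how per-frame body parameters could be loaded and posed with the public `smplx` Python package. The file name and the `.npz` layout (`betas`, `global_orient`, `body_pose` arrays) are assumptions made for illustration; the dataset's actual release format may differ.

```python
import numpy as np
import torch
import smplx  # pip install smplx; SMPL-X model files must be downloaded separately

# Hypothetical annotation file: per-frame SMPL-X parameters stored as arrays.
data = np.load("talkcuts_clip_0001_smplx.npz")

# Build a neutral SMPL-X body model from a local model directory.
model = smplx.create(
    model_path="models",   # directory containing the SMPL-X model files
    model_type="smplx",
    gender="neutral",
    use_pca=False,
    batch_size=1,
)

frame = 0
output = model(
    betas=torch.from_numpy(data["betas"][None]).float(),                         # (1, 10) shape coefficients
    global_orient=torch.from_numpy(data["global_orient"][frame][None]).float(),  # (1, 3) root orientation
    body_pose=torch.from_numpy(data["body_pose"][frame][None]).float(),          # (1, 63) axis-angle body pose
    return_verts=True,
)
print("vertices:", output.vertices.shape)  # posed mesh vertices, (1, 10475, 3)
print("joints:", output.joints.shape)      # 3D joint locations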

Video Generation Comparison

We compare our generated human speech videos with state-of-the-art human video generation baselines.

Video Generation Results

We demonstrate our generated human speech videos with multiple camera shots.