We present additional generated long-form human speech videos with multiple camera shots.
We compare generated multi-shot human speech videos against videos with ground-truth camera shot cutting.
In this work, we propose a novel system for automatically generating multi-shot speech videos with natural camera transitions, using input text lines and reference images from various camera angles. Existing human video generation datasets and methods are largely centered on faces or half-body single-shot videos, and thus lack the capacity to produce multi-shot full-body dynamic movements from different camera angles. Recognizing the lack of suitable datasets, we first introduce TalkCuts, a large-scale dataset containing over 500 hours of human speech videos with diverse camera shots, rich 3D SMPL-X motion annotations, and camera trajectories, covering a wide range of identities. Based on this dataset, we further propose an LLM-guided multi-modal generation framework, named Orator, in which the LLM serves as a multi-role director, generating detailed instructions for camera transitions, speaker gestures, and vocal delivery. These instructions drive a multi-modal video generation module, enabling the system to produce coherent long-form videos. Extensive experiments show that our framework generates coherent and engaging multi-shot speech videos.
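To make the director role more concrete, the sketch below shows one way an LLM could be prompted per speech line to emit structured instructions for camera shot, gesture, and vocal delivery that a downstream video generation module could consume. It is a minimal illustration only: the names (DirectorInstruction, query_llm, direct_speech) and the prompt wording are hypothetical and are not taken from the Orator implementation.

```python
# Minimal, illustrative sketch of the "LLM as multi-role director" idea.
# All names and prompts here are hypothetical, not the Orator API.
from dataclasses import dataclass
from typing import List


@dataclass
class DirectorInstruction:
    sentence: str          # the speech line this instruction covers
    camera_shot: str       # e.g. "full-body front", "half-body left"
    gesture: str           # e.g. "open-palm emphasis"
    vocal_delivery: str    # e.g. "slower pace, rising intonation"


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; a real system would query a language model."""
    return "full-body front | open-palm emphasis | slower pace, rising intonation"


def direct_speech(script_lines: List[str]) -> List[DirectorInstruction]:
    """Ask the LLM, acting as a director, for per-line staging instructions."""
    instructions = []
    for line in script_lines:
        reply = query_llm(
            "You are a video director. For the speech line below, propose a "
            "camera shot, a speaker gesture, and a vocal delivery style, "
            f"separated by '|'.\nLine: {line}"
        )
        shot, gesture, delivery = (part.strip() for part in reply.split("|"))
        instructions.append(DirectorInstruction(line, shot, gesture, delivery))
    return instructions
```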
We present samples from the TalkCuts dataset, which features a diverse collection of videos from talk shows, TED talks, stand-up comedy, and other speech scenarios.
We provide visualizations of the 2D keypoint and 3D SMPL-X annotations in the TalkCuts dataset.
We compare our generated human speech videos with state-of-the-art human video generation baselines.
We demonstrate our generated human speech videos with multiple camera shots.