Creating Multimodal Interactive Digital Twin Characters From Videos: A Dataset and Baseline.
In this paper, we introduce a novel framework for creating multimodal interactive digital twin characters from dialogue videos of TV shows. These digital twin characters respond to user inputs with harmonious textual, vocal, and visual content: they not only replicate external characteristics such as appearance and tone, but also capture internal attributes, including personality and habitual behaviors. To support this ambitious task, we collect the Multimodal Character-Centric Conversation Dataset (MCCCD), which provides character-specific, high-quality multimodal dialogue data with detailed annotations, featuring 6.8k utterances and 4.6 hours of audio/video per character. Notably, MCCCD is approximately ten times larger than existing datasets in per-character data volume, enabling detailed modeling of complex character-centric traits. We further propose a baseline framework for creating digital twin characters, consisting of dialogue generation with large language models, voice generation via speech synthesis models, and visual representation with 3D talking-head models. Experimental results demonstrate that our approach significantly outperforms existing methods in generating consistent and character-specific responses, setting a new benchmark for digital character creation. Our collected dataset and proposed baseline pave the way for highly interactive and natural digital avatars, opening the door to extensive and practical applications of digital humans.
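The three-stage baseline (LLM dialogue generation, speech synthesis, 3D talking-head rendering) can be sketched as a simple sequential pipeline. The sketch below is purely illustrative; every function name, signature, and data shape is a hypothetical stand-in, since the abstract does not specify the actual models or APIs used.

```python
from dataclasses import dataclass


@dataclass
class TwinResponse:
    """Bundle of the three aligned modalities a digital twin returns."""
    text: str          # character-specific textual reply
    audio: bytes       # synthesized speech in the character's voice
    video_frames: int  # stand-in for rendered 3D talking-head frames


def generate_dialogue(user_input: str, persona: str) -> str:
    # Stand-in for a large language model conditioned on the
    # character's persona (personality, habitual behaviors).
    return f"[{persona}] reply to: {user_input}"


def synthesize_voice(text: str) -> bytes:
    # Stand-in for a character-specific speech synthesis model
    # that reproduces the character's tone.
    return text.encode("utf-8")


def render_talking_head(audio: bytes) -> int:
    # Stand-in for a 3D talking-head model driven by the audio;
    # returns a fake frame count instead of actual video.
    return max(1, len(audio) // 4)


def respond(user_input: str, persona: str) -> TwinResponse:
    # The stages run sequentially: text conditions the voice,
    # and the voice drives the visual representation.
    text = generate_dialogue(user_input, persona)
    audio = synthesize_voice(text)
    frames = render_talking_head(audio)
    return TwinResponse(text, audio, frames)
```

In a real system each stand-in would be replaced by a trained model fine-tuned on per-character data such as MCCCD; the key design point reflected here is that the modalities are generated in a chain so they stay mutually consistent.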