Conan’s Bow Tie: A Streaming Voice Conversion for Real-Time VTuber Livestreaming
Published in Proceedings of ACM IUI, 2024
Recommended citation: Qianniu Chen, Zhehan Gu, Li Lu*, Xiangyu Xu, Zhongjie Ba, Feng Lin, Zhenguang Liu, Kui Ren. "Conan's Bow Tie: A Streaming Voice Conversion for Real-Time VTuber Livestreaming." Proceedings of ACM IUI. Greenville, SC, USA. pp. 35-50. 2024. doi: 10.1145/3640543.3645146.
The ACM Conference on Intelligent User Interfaces (IUI) is the annual premier venue for work at the intersection of Artificial Intelligence (AI) and Human-Computer Interaction (HCI). ACM IUI is a CCF-B conference.
Abstract: Recent years have witnessed the dramatic growth of Virtual YouTubers (VTubers) as a new business on social media platforms such as YouTube, Twitch, and TikTok. However, a significant challenge arises when VTuber voice actors face health issues or retire, jeopardizing the continuity of their avatars’ recognizable voices. A potential solution reminiscent of Conan’s Bow Tie voice changer in the popular animation Case Closed (i.e., Detective Conan) has inspired our work. To make this a reality, we introduce VTuberBowTie, a user-friendly streaming voice conversion system for real-time VTuber livestreaming. We propose an innovative streaming voice conversion approach that tackles the challenges of limited context modeling and bidirectional context dependence inherent to conventional real-time voice conversion. Rather than processing each data chunk of the voice stream in isolation, our approach adopts a fully sequential structure that leverages contextual information preceding the input chunk, thereby expanding the perceptual range and enabling seamless concatenation. Moreover, we developed a ready-to-use interaction interface for VTuberBowTie and deployed it on various computing platforms. The experimental results show that VTuberBowTie can achieve high-quality voice conversion in a streaming manner with a latency of 179.1 ms on CPU and 70.8 ms on GPU while providing users with a friendly interactive experience.
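The general idea of chunked streaming with preceding context, as described in the abstract, can be illustrated with a small toy sketch. This is not the VTuberBowTie implementation: `stream_with_left_context`, `toy_convert`, and `context_len` are hypothetical names, and a causal moving average stands in for the actual conversion model. The sketch only shows why giving each chunk access to earlier samples lets chunk-wise outputs concatenate without boundary artifacts.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence


def toy_convert(samples: Sequence[float], k: int = 3) -> List[float]:
    """Hypothetical stand-in for a causal model: a k-tap causal moving average."""
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - k + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def stream_with_left_context(
    chunks: Iterable[Sequence[float]],
    convert: Callable[[Sequence[float]], List[float]],
    context_len: int,
) -> Iterator[List[float]]:
    """Convert a stream chunk by chunk, prepending up to `context_len`
    previously seen samples to every chunk (assumed design, for illustration)."""
    history = deque(maxlen=context_len)  # rolling buffer of past samples
    for chunk in chunks:
        window = list(history) + list(chunk)
        converted = convert(window)
        # Emit only the part that corresponds to the new chunk.
        yield converted[len(window) - len(chunk):]
        history.extend(chunk)


signal = [float(i) for i in range(10)]
chunks = [signal[i:i + 3] for i in range(0, len(signal), 3)]

with_context = [y for out in stream_with_left_context(chunks, toy_convert, 2) for y in out]
without_context = [y for out in stream_with_left_context(chunks, toy_convert, 0) for y in out]
offline = toy_convert(signal)
```

In this toy setting, `with_context` matches the offline result on the whole signal, whereas `without_context` deviates at the start of each chunk, which is the boundary problem that context-aware streaming is meant to avoid.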