CSSinger Demo Page

CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder

Abstract

Singing Voice Synthesis (SVS) aims to generate singing voices with high fidelity and expressiveness. Conventionally, an SVS system uses an acoustic model to transform a music score into acoustic features, which a vocoder then reconstructs into a singing voice. Recently, end-to-end modeling approaches have advanced rapidly in both SVS and Text-to-Speech (TTS). This paper presents a fully end-to-end SVS system and implements chunkwise streaming inference to address the latency of such systems. It is the first work to fully implement end-to-end streaming audio synthesis using the latent representations of a VAE. We make specific improvements to enhance the performance of streaming SVS with latent representations. Experimental results demonstrate that the proposed system synthesizes audio with high expressiveness and pitch accuracy in both streaming SVS and TTS tasks.
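To make the idea of chunkwise streaming inference concrete, here is a minimal sketch of decoding a latent sequence chunk by chunk. It is not the CSSinger implementation: the names `stream_chunks` and `toy_decode`, the parameters `chunk_size`, `left_context`, and `hop`, and the stand-in decoder are all assumptions made for illustration. Each chunk is decoded together with a few preceding frames as context, and the samples belonging to that context are dropped before the audio is emitted.

```python
from typing import Callable, Iterator, List, Sequence


def stream_chunks(
    latents: Sequence[float],
    decode: Callable[[Sequence[float]], List[float]],
    chunk_size: int = 4,    # hypothetical: latent frames emitted per chunk
    left_context: int = 2,  # hypothetical: past frames re-fed as context
    hop: int = 2,           # hypothetical: audio samples per latent frame
) -> Iterator[List[float]]:
    """Yield the audio of each chunk as soon as it has been decoded."""
    for start in range(0, len(latents), chunk_size):
        ctx_start = max(0, start - left_context)
        window = latents[ctx_start:start + chunk_size]
        audio = decode(window)
        # Drop the samples that correspond to the context frames.
        yield audio[(start - ctx_start) * hop:]


def toy_decode(frames: Sequence[float]) -> List[float]:
    """Stand-in decoder (assumption): repeats each frame twice as 'audio'."""
    out: List[float] = []
    for frame in frames:
        out.extend([frame, frame])
    return out


latents = [float(i) for i in range(10)]
chunks = list(stream_chunks(latents, toy_decode))
streamed = [sample for chunk in chunks for sample in chunk]
```

With this toy decoder, which has no dependence between frames, the concatenated chunks reproduce the full-sequence output exactly. A real decoder with a receptive field spanning several frames would need enough context to cover it, which is the kind of boundary effect a streaming system has to handle.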

Singing Voice Synthesis

Recording | SiFiSinger | CSSinger-SS | CSSinger-SS-NP | CSSinger-FS

TTS

Recording | SiFiSinger | CSSinger-SS | CSSinger-SS-NP | CSSinger-FS

Ablation

Recording | -Causal Smooth Layer | -Natural Padding | -Causal PostEnc | CSSinger-FS