
Video-to-Audio Generation with Fine-grained Temporal Semantics


Abstract

With recent advances in AI-generated content (AIGC), video generation has attracted a surge of research interest in both academia and industry (e.g., Sora). However, it remains challenging to produce audio that is temporally aligned with the generated video, given the complicated semantic information the video contains. In this work, inspired by the recent success of text-to-audio (TTA) generation, we first investigate a video-to-audio (VTA) generation framework based on the latent diffusion model (LDM). Consistent with recent pioneering explorations of VTA, our preliminary results show the great potential of LDM for the VTA task, but the generated audio still suffers from sub-optimal temporal alignment. To this end, we propose to enhance the temporal alignment of VTA with frame-level semantic information. With the recently popular Grounding Segment Anything Model (Grounding SAM), we extract fine-grained semantics from video frames, enabling VTA to produce better-aligned audio signals. Extensive experiments demonstrate the effectiveness of our system on both objective and subjective evaluation metrics, showing both better audio quality and finer-grained temporal alignment.


Figure: Overview of the proposed VTA-SAM architecture.
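
To make the conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of frame-level semantic conditioning for an LDM denoiser. It is not the actual VTA-SAM implementation: `FrameSemanticEncoder`, `ConditionedDenoiser`, and all dimensions are illustrative stand-ins, and the real system would obtain per-frame semantics from Grounding SAM (grounded object masks and labels) rather than from a linear projection.

```python
import torch
import torch.nn as nn

class FrameSemanticEncoder(nn.Module):
    """Stand-in for Grounding SAM plus a projection head: maps each video
    frame to one semantic embedding. In the actual system, per-frame
    Grounding SAM outputs would replace this placeholder extractor."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, dim)  # placeholder feature extractor

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, 224, 224) -> per-frame embeddings (T, dim)
        return self.proj(frames.flatten(1))

class ConditionedDenoiser(nn.Module):
    """Toy LDM denoiser: noisy audio latents cross-attend to the per-frame
    semantic tokens, so each latent position can align itself with the
    frame(s) whose content should be sounding at that moment."""
    def __init__(self, latent_dim: int = 64, cond_dim: int = 512, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # z_t:  (B, L, latent_dim) noisy audio latents at diffusion step t
        # cond: (B, T, cond_dim) frame-level semantic tokens
        h, _ = self.attn(z_t, cond, cond)
        return self.out(h)  # predicted noise

# Usage: one denoising call on random tensors standing in for real data.
frames = torch.randn(16, 3, 224, 224)               # 16 video frames
cond = FrameSemanticEncoder()(frames).unsqueeze(0)  # (1, 16, 512)
z_t = torch.randn(1, 128, 64)                       # noisy audio latents
eps = ConditionedDenoiser()(z_t, cond)
print(eps.shape)                                    # torch.Size([1, 128, 64])
```

The key design choice this sketch illustrates is that conditioning happens per frame rather than on a single clip-level embedding: cross-attention lets each audio latent position pick out the frames it should correspond to, which is what allows frame-level semantics to improve temporal alignment.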


Video-to-Audio Generation Results


System Racing Car Popping Popcorn
GT
Tango
Diff-Foley
VTA-LDM
FoleyCrafter
VTA-SAM


System Pigeon Dove Cooing Dog Bow-wow
GT
Tango
Diff-Foley
VTA-LDM
FoleyCrafter
VTA-SAM


System Canary Calling Playing Badminton
GT
Tango
Diff-Foley
VTA-LDM
FoleyCrafter
VTA-SAM


System Chicken Crowing Playing Harpsichord
GT
Tango
Diff-Foley
VTA-LDM
FoleyCrafter
VTA-SAM


System Playing Clarinet Ferret Dooking
GT
Tango
Diff-Foley
VTA-LDM
FoleyCrafter
VTA-SAM