Soundify: Matching Sound Effects to Video

Abstract

In the art of video editing, sound is really half the story. A skilled video editor overlays sounds, such as effects and ambients, over footage to add character to an object or immerse the viewer within a space [1]. However, through formative interviews with 10 professional video editors, we found that this process can be extremely tedious and time-consuming. More specifically, video editors identified three key bottlenecks: (1) finding suitable sounds, (2) precisely aligning sounds to video, and (3) tuning parameters such as pan and gain frame-by-frame. To address these challenges, we introduce Soundify, a system that matches sound effects to video. Prior work has largely explored either learning audio-visual correspondence from large-scale data [3, 9, 8] or performing audio synthesis from scratch [5, 10, 4]. In this work, we take a different approach. By leveraging labeled, studio-quality sound effects libraries [2] and extending CLIP [6], a neural network with impressive zero-shot image classification capabilities, into a "zero-shot detector", we are able to produce high-quality results without resource-intensive correspondence learning or audio generation.
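The label-matching idea can be illustrated with a minimal sketch. It assumes that a frame and each sound-effect label have already been embedded into a shared space (for example by CLIP's image and text encoders) and ranks the labels by scaled cosine similarity followed by a softmax, which is how CLIP-style zero-shot classification is commonly done. The function names, the toy 3-dimensional embeddings, the label set, and the scale value of 100 are all illustrative assumptions, not details taken from Soundify itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_labels(frame_embedding, label_embeddings, scale=100.0):
    """Rank sound-effect labels for one frame, best match first.

    `label_embeddings` maps a label string to its (hypothetical) text
    embedding; `scale` is an assumed logit scale, not a value from the paper.
    """
    labels = list(label_embeddings)
    logits = [scale * cosine(frame_embedding, label_embeddings[l]) for l in labels]
    # Numerically stable softmax over the scaled similarities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for real encoder outputs (assumption).
frame = [0.9, 0.1, 0.0]
library = {
    "car engine": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "birds": [0.0, 0.0, 1.0],
}
ranked = rank_labels(frame, library)
best_label, best_prob = ranked[0]
```

In a real pipeline the toy vectors would be replaced by encoder outputs, and the top-ranked labels would be used to look up matching clips in the labeled sound effects library.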

Example Visualization
Example Results