AutoTransition: Learning to suggest video transitions

news7g, 07/30/2022

Video screens. Image credit: Hannu Makarainen via Flickr, CC BY-SA 2.0

Editing video is a challenging task for novices. Fortunately, video editing tools can help. A recent paper on arXiv.org introduces a new task: automatic video transition recommendation. Given a sequence of raw video shots and the accompanying audio, the task is to suggest a video transition for each pair of neighboring shots.
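
The shape of the task can be illustrated with a short Python sketch. It is not the authors' method: `suggest_transitions`, the candidate list, and the `score` callback are hypothetical stand-ins for a trained model. The only point it makes is structural: n shots give n - 1 neighboring pairs, and each pair receives one transition.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a "shot" is any clip identifier; a transition is a name.
Shot = str
Transition = str


def neighboring_pairs(shots: Sequence[Shot]) -> List[Tuple[Shot, Shot]]:
    """Return every pair of adjacent shots: n shots give n - 1 pairs."""
    return [(shots[i], shots[i + 1]) for i in range(len(shots) - 1)]


def suggest_transitions(
    shots: Sequence[Shot],
    candidates: Sequence[Transition],
    score: Callable[[Shot, Shot, Transition], float],
) -> List[Transition]:
    """Pick the highest-scoring candidate transition for each neighboring pair.

    `score` is a hypothetical placeholder for a model that rates how well a
    transition fits the cut between two shots.
    """
    return [
        max(candidates, key=lambda t: score(a, b, t))
        for a, b in neighboring_pairs(shots)
    ]


# Usage: a toy scorer that always prefers "fade".
print(suggest_transitions(["intro", "beach", "outro"], ["fade", "zoom"],
                          lambda a, b, t: 1.0 if t == "fade" else 0.0))
```

Choosing each pair independently like this ignores context; the paper's model, by contrast, is described as capturing contextual signals across the sequence of transitions.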

The researchers present the first large-scale video transition dataset to facilitate future research. The task is formulated as a multimodal retrieval problem, and the researchers propose a framework that learns the correspondence between the input images/audio and video transitions in a feature space.
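
Framing recommendation as retrieval means that inputs and transitions are mapped into the same feature space, and the transitions closest to the input are returned. Below is a minimal sketch of that ranking step, assuming the embeddings already exist; the toy vectors, the name `rank_transitions`, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def rank_transitions(
    query: Vector, library: Dict[str, Vector], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank transition embeddings by similarity to a query embedding.

    Hypothetical setup: `query` stands for the embedding of a shot pair plus
    audio; `library` maps transition names to embeddings in the same space.
    """
    scored = [(name, cosine(query, emb)) for name, emb in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Usage with made-up 2-D embeddings.
library = {"fade": [1.0, 0.0], "zoom": [0.0, 1.0], "spin": [0.7, 0.7]}
print(rank_transitions([0.9, 0.1], library, k=2))
```

In a learned system the quality of the ranking depends entirely on how well the embeddings are trained; the similarity lookup itself stays this simple.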

Quantitative and qualitative evaluations, together with a user study, confirm that the proposed approach successfully learns the correspondence between visual/audio input and transitions and produces reasonable recommendations.

Video transitions are widely used in video editing to join footage into visually engaging and coherent videos. However, choosing the best transition is challenging for non-professionals, who often lack cinematic knowledge and design skills. In this paper, we present the first work on automatic video transition recommendation (VTR): given a sequence of raw video shots and accompanying audio, recommend a video transition for each pair of neighboring shots. To address this task, we collect a large-scale video transition dataset using video templates that are publicly available on editing software. We then formulate VTR as a multimodal retrieval problem from vision/audio to video transitions and propose a new multimodal matching framework consisting of two parts. First, we learn an embedding of video transitions through a video transition classification task. We then propose a model that learns the matching correspondence from the visual/audio input to the video transitions. Specifically, the proposed model uses a multimodal transformer to fuse audio and visual information, as well as to capture contextual cues across the sequential transition outputs. Through quantitative and qualitative experiments, we clearly demonstrate the effectiveness of our method. Notably, in our comprehensive user study, our method received scores comparable to those of professional editors while improving video editing efficiency by 300×. We hope our work will inspire other researchers to work on this new task. The dataset and code are publicly available at this https URL.
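
The core idea behind a multimodal transformer, letting visual and audio features attend to one another, can be shown with a single attention layer in plain Python. This is a heavily simplified sketch under stated assumptions: there are no learned projections, no multiple heads, and no positional or modality embeddings, and it is not the architecture described in the paper. The function `fuse` and its mean-pooling step are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: Sequence[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys and values are the tokens themselves (no
    learned projection matrices), so each output is just a weighted mix of
    all input tokens.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def fuse(visual: Sequence[Vector], audio: Sequence[Vector]) -> Vector:
    """Hypothetical fusion: attend jointly over visual and audio tokens, then mean-pool.

    Putting both modalities in one sequence lets each visual token attend to
    audio tokens and vice versa.
    """
    tokens = [list(t) for t in visual] + [list(t) for t in audio]
    if not tokens:
        raise ValueError("need at least one visual or audio token")
    attended = self_attention(tokens)
    n = len(attended)
    return [sum(t[j] for t in attended) / n for j in range(len(attended[0]))]


# Usage with made-up 2-D features: two visual tokens, one audio token.
print(fuse([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]))
```

Because visual and audio tokens share one sequence, every output is a weighted mix of both modalities; a fused vector like this could then serve as the query in a retrieval step over transition embeddings.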

Research article: Shen, Y., Zhang, L., Xu, K., and Jin, X., "AutoTransition: Learning to Recommend Video Transition Effects", 2022. Link: https://arxiv.org/abs/2207.13479