
MaskViT: Masked visual pre-training for video prediction


Humans can predict future events and sensory signals and use this ability to simulate, evaluate, and choose among different possible actions. If robots had a similar ability, they could plan solutions for many tasks in complex and dynamic environments.

A video camera. Image credit: Pxhere, CC0 Public Domain.

A recent paper on arXiv.org presents MaskViT, a video prediction method based on masked visual modeling.

The researchers use a discrete variational autoencoder to compress each frame into a smaller grid of visual tokens, and they propose a new iterative decoding scheme for video built around a mask scheduling function. They show that masking a variable number of tokens during training is what enables competitive video prediction results. The iterative decoding scheme is significantly faster than competing approaches and makes planning for real-robot manipulation tasks practical.
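To make the decoding idea concrete, here is a minimal sketch (in Python/NumPy, not the authors' code) of confidence-based iterative decoding driven by a cosine mask scheduling function: at each pass the model predicts every masked token, and only the most confident predictions are committed, following the schedule. The function names, the 8-step budget, and the dummy model are illustrative assumptions.

```python
import math

import numpy as np


def cosine_mask_schedule(t: float) -> float:
    """Fraction of tokens that remain masked at progress t in [0, 1]."""
    return math.cos(t * math.pi / 2)


def iterative_decode(predict_tokens, num_tokens: int, num_steps: int = 8) -> np.ndarray:
    """Fill in a grid of visual tokens over a few refinement passes.

    predict_tokens: callable taking a (num_tokens,) int array with -1 at
    masked positions and returning (token_ids, confidences), both
    (num_tokens,) arrays, from the prediction model.
    """
    tokens = np.full(num_tokens, -1, dtype=np.int64)  # -1 marks a masked slot
    for step in range(1, num_steps + 1):
        ids, conf = predict_tokens(tokens)
        # How many tokens the schedule says should still be masked afterwards.
        keep_masked = int(cosine_mask_schedule(step / num_steps) * num_tokens)
        masked = tokens == -1
        num_to_fill = int(masked.sum()) - keep_masked
        if num_to_fill <= 0:
            continue
        # Commit the most confident predictions among the masked positions.
        order = np.argsort(np.where(masked, conf, -np.inf))[::-1]
        chosen = order[:num_to_fill]
        tokens[chosen] = ids[chosen]
    return tokens


# Toy usage with a stand-in "model" that guesses random tokens.
if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def dummy_model(tokens):
        return rng.integers(0, 1024, tokens.shape), rng.random(tokens.shape)

    out = iterative_decode(dummy_model, num_tokens=16 * 16)
    assert (out >= 0).all()  # every token was eventually committed
```

Because each pass predicts many tokens in parallel instead of one at a time, only a handful of forward passes are needed per frame, which is where the large inference speedup comes from.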

The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement, where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets, we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution (256 × 256) videos. Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
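The second design decision, masking a variable percentage of tokens during training, could look like the following minimal sketch (Python/NumPy, not the authors' code; the 0.5-1.0 ratio range and the function name are illustrative assumptions, not values from the paper):

```python
import numpy as np


def sample_training_mask(num_tokens: int, rng: np.random.Generator,
                         min_ratio: float = 0.5, max_ratio: float = 1.0) -> np.ndarray:
    """Boolean mask over token positions; True = replace with the [MASK] token.

    A masking ratio is drawn fresh for every training example instead of
    being fixed, so the model sees the whole range of masking levels that
    iterative decoding will produce at inference time.
    """
    ratio = rng.uniform(min_ratio, max_ratio)
    num_masked = max(1, round(ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask


rng = np.random.default_rng(0)
mask = sample_training_mask(num_tokens=16 * 16, rng=rng)
print(mask.mean())  # realized masking ratio for this example
```

Training across many masking levels matters because, during iterative refinement, the model is asked to predict from inputs that range from almost fully masked to almost fully revealed.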

Research article: Gupta, A., Tian, S., Zhang, Y., Wu, J., Martín-Martín, R., and Fei-Fei, L., "MaskViT: Masked Visual Pre-Training for Video Prediction", 2022. Link: https://arxiv.org/abs/2206.11894
Project page: https://maskedvit.github.io/





