
MaskViT: Masked visual pre-training for video prediction


Humans can predict future events and sensory signals and use this ability to simulate, evaluate, and choose among different possible actions. If robots had a similar ability, they could plan solutions for many tasks in complex and dynamic environments.

A video camera. Image credit: Pxhere, CC0 Public Domain.

A recent paper on arXiv.org presents MaskViT, a video prediction method based on masked visual modeling.

The researchers use a discrete variational autoencoder to compress each frame into a smaller grid of visual tokens, and they propose a new iterative decoding scheme for video built around a mask scheduling function. They show that masking a variable number of tokens during training is what enables competitive video prediction results. The iterative decoding scheme is significantly faster than competing approaches and makes planning for real-robot manipulation tasks practical.
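To make the decoding idea concrete, here is a minimal sketch (in Python/NumPy, not the authors' code) of confidence-based iterative decoding driven by a cosine mask scheduling function: at each pass the model predicts every masked token, and only the most confident predictions are committed, following the schedule. The function names, the 8-step budget, and the dummy model are illustrative assumptions.

```python
import math

import numpy as np


def cosine_mask_schedule(t: float) -> float:
    """Fraction of tokens that remain masked at progress t in [0, 1]."""
    return math.cos(t * math.pi / 2)


def iterative_decode(predict_tokens, num_tokens: int, num_steps: int = 8) -> np.ndarray:
    """Fill in a grid of visual tokens over a few refinement passes.

    predict_tokens: callable taking a (num_tokens,) int array with -1 at
    masked positions and returning (token_ids, confidences), both
    (num_tokens,) arrays, from the prediction model.
    """
    tokens = np.full(num_tokens, -1, dtype=np.int64)  # -1 marks a masked slot
    for step in range(1, num_steps + 1):
        ids, conf = predict_tokens(tokens)
        # How many tokens the schedule says should still be masked afterwards.
        keep_masked = int(cosine_mask_schedule(step / num_steps) * num_tokens)
        masked = tokens == -1
        num_to_fill = int(masked.sum()) - keep_masked
        if num_to_fill <= 0:
            continue
        # Commit the most confident predictions among the masked positions.
        order = np.argsort(np.where(masked, conf, -np.inf))[::-1]
        chosen = order[:num_to_fill]
        tokens[chosen] = ids[chosen]
    return tokens


# Toy usage with a stand-in "model" that guesses random tokens.
if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def dummy_model(tokens):
        return rng.integers(0, 1024, tokens.shape), rng.random(tokens.shape)

    out = iterative_decode(dummy_model, num_tokens=16 * 16)
    assert (out >= 0).all()  # every token was eventually committed
```

Because each pass predicts many tokens in parallel instead of one at a time, only a handful of forward passes are needed per frame, which is where the large inference speedup comes from.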

The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement, where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets, we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution (256 × 256) videos. Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
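The second design decision, masking a variable percentage of tokens during training, could look like the following minimal sketch (Python/NumPy, not the authors' code; the 0.5-1.0 ratio range and the function name are illustrative assumptions, not values from the paper):

```python
import numpy as np


def sample_training_mask(num_tokens: int, rng: np.random.Generator,
                         min_ratio: float = 0.5, max_ratio: float = 1.0) -> np.ndarray:
    """Boolean mask over token positions; True = replace with the [MASK] token.

    A masking ratio is drawn fresh for every training example instead of
    being fixed, so the model sees the whole range of masking levels that
    iterative decoding will produce at inference time.
    """
    ratio = rng.uniform(min_ratio, max_ratio)
    num_masked = max(1, round(ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask


rng = np.random.default_rng(0)
mask = sample_training_mask(num_tokens=16 * 16, rng=rng)
print(mask.mean())  # realized masking ratio for this example
```

Training across many masking levels matters because, during iterative refinement, the model is asked to predict from inputs that range from almost fully masked to almost fully revealed.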

Research article: Gupta, A., Tian, S., Zhang, Y., Wu, J., Martín-Martín, R., and Fei-Fei, L., "MaskViT: Masked Visual Pre-Training for Video Prediction", 2022. Link: https://arxiv.org/abs/2206.11894
Project page: https://maskedvit.github.io/





