Google’s VideoBERT predicts what will happen next in videos

Google researchers have proposed VideoBERT, a self-supervised system that learns to make predictions from unlabeled videos. VideoBERT adapts Google's BERT (Bidirectional Encoder Representations from Transformers), the state-of-the-art model for natural language processing tasks, to model sequences of video frames in much the same way that BERT models sequences of words.
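To make the idea concrete, here is a minimal sketch (not Google's implementation) of applying BERT-style masked-token training to video: per-clip features are quantized into a discrete vocabulary of "visual tokens," and a small BERT then learns to predict masked tokens in the sequence. The feature vectors are random stand-ins for a real video network's output, and the vocabulary size, model dimensions, and use of scikit-learn and Hugging Face Transformers are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans
from transformers import BertConfig, BertForMaskedLM

# 1) Stand-in per-clip feature vectors (a real system would use a
#    pretrained video network; here they are random for illustration).
clip_features = torch.randn(1000, 1024).numpy()

# 2) Quantize features into a discrete vocabulary of visual tokens.
num_visual_tokens = 256                      # assumption; a real vocabulary is far larger
kmeans = KMeans(n_clusters=num_visual_tokens, n_init=10).fit(clip_features)
token_ids = torch.tensor(kmeans.predict(clip_features))

# 3) A small BERT trained with masked-token prediction over token sequences.
MASK_ID = num_visual_tokens                  # reserve one extra id for [MASK]
config = BertConfig(vocab_size=num_visual_tokens + 1,
                    hidden_size=128, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=256)
model = BertForMaskedLM(config)

# 4) Mask 15% of the tokens in one sequence and compute the BERT loss.
seq = token_ids[:64].unsqueeze(0).clone()
labels = seq.clone()
mask = torch.rand(seq.shape) < 0.15
labels[~mask] = -100                         # score only the masked positions
seq[mask] = MASK_ID
loss = model(input_ids=seq, labels=labels).loss
loss.backward()                              # one illustrative training step
```

Because the model is trained to fill in missing tokens anywhere in a sequence, the same machinery can be asked to complete the tokens that follow a partial video, which is how a system of this kind can anticipate what happens next.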

The researchers trained VideoBERT on over one million instructional videos spanning categories such as cooking, gardening, and vehicle repair. In their results, VideoBERT correctly predicted, for example, that a bowl of flour and cocoa powder is likely to become a brownie or cupcake after baking in an oven, and it could generate a set of instructions from a video along with video segments reflecting what is described at each step.
