Reprinted from WeChat Official Account: AI Technology Review
Author: Yang Xiaofan
Since the success of Google’s BERT model in 2018, more and more researchers have drawn on BERT’s ideas for tasks beyond pure text, developing a variety of visual and video BERT models. Here we introduce the original VideoBERT paper along with six other recent V-BERT papers, listed in chronological order.
VideoBERT
VideoBERT: A Joint Model for Video and Language Representation Learning
VideoBERT: A joint learning model for video and language representation
Paper link: https://arxiv.org/abs/1904.01766
Abstract: To leverage the large amounts of unlabeled data on public media platforms such as YouTube, self-supervised learning has become increasingly important. Most current methods learn low-level representations; in this paper the authors instead propose a joint visual-linguistic model that learns high-level features without any additional explicit supervision. Specifically, building on BERT’s success in language modeling, they derive visual tokens by vector-quantizing video data and linguistic tokens from off-the-shelf speech recognition outputs, and learn a bidirectional joint distribution over sequences of these tokens. The authors evaluate the model on several tasks, including action classification and video captioning. They show that it can be applied directly to open-vocabulary classification, and confirm that large-scale training data and cross-modal information have a significant impact on performance. Moreover, the model outperforms the best existing video captioning models, and quantitative results verify that it indeed learns high-level semantic features.
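To make the "visual token" idea concrete, here is a minimal sketch of one way to quantize video features into discrete tokens and mix them with text tokens in a single sequence. It is not the authors' pipeline: the clip features are random stand-ins (the paper uses features from a pretrained video network and hierarchical clustering), and the BERT model itself is assumed to exist elsewhere.

```python
# Sketch: cluster pre-extracted video-clip features with k-means, then treat each
# cluster id as a "visual token" that can be mixed with word tokens in one sequence.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
clip_features = rng.normal(size=(500, 1024))   # stand-in for real clip features

# Vector-quantize clip features into a small "visual vocabulary" of centroids.
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(clip_features)

def video_to_tokens(features):
    """Map each clip feature to the id of its nearest centroid."""
    return [f"[VIS_{c}]" for c in kmeans.predict(features)]

# Build one joint sequence: ASR text tokens, a separator, then visual tokens.
text_tokens = "now add the flour and stir".split()
visual_tokens = video_to_tokens(clip_features[:6])
joint_sequence = ["[CLS]"] + text_tokens + ["[>]"] + visual_tokens + ["[SEP]"]
print(joint_sequence)
```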
ViLBERT
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
ViLBERT: Pretraining task-agnostic visual-linguistic representations for vision-and-language tasks
Paper link: https://arxiv.org/abs/1908.02265
Abstract: In this paper, the authors propose ViLBERT (Vision-and-Language BERT), a model that learns task-agnostic joint representations of image content and natural language. They extend the popular BERT architecture into a two-stream multimodal model that processes visual and textual inputs separately and lets them interact through co-attentional transformer layers. The authors first pretrain the model on the large, automatically collected Conceptual Captions dataset via two proxy tasks, and then transfer it to several existing vision-language tasks, including visual question answering, visual commonsense reasoning, grounding referring expressions, and caption-based image retrieval, making only minor additions to the base architecture. Compared with current task-specific models, their approach yields significant performance improvements, achieving state-of-the-art results on all four tasks. The results also represent a shift in how the connection between vision and language is learned: rather than being learned only during training for a specific task, the vision-language connection is treated as a pretrainable and transferable model capability.
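Below is a rough sketch of a co-attention block in the spirit of ViLBERT's two-stream design: each stream's queries attend to the other stream's keys and values. The dimensions, layer structure, and the absence of feed-forward sublayers are illustrative simplifications, not the paper's exact configuration.

```python
# Sketch of a co-attentional layer: text attends to image regions and vice versa.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # Text queries with image keys/values, and the symmetric direction.
        txt_out, _ = self.txt_attends_img(txt, img, img)
        img_out, _ = self.img_attends_txt(img, txt, txt)
        return self.norm_txt(txt + txt_out), self.norm_img(img + img_out)

block = CoAttentionBlock()
txt = torch.randn(2, 20, 768)   # a batch of 20 word embeddings
img = torch.randn(2, 36, 768)   # a batch of 36 region embeddings
txt, img = block(txt, img)
print(txt.shape, img.shape)
```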
VisualBERT
VisualBERT: A Simple and Performant Baseline for Vision and Language
VisualBERT: A simple and effective baseline model for vision and language
Paper link: https://arxiv.org/abs/1908.03557
Abstract: In this paper, the authors introduce VisualBERT, a simple and flexible framework for modeling a broad range of vision-language tasks. VisualBERT consists of a stack of transformer layers that implicitly align elements of an input text with regions of an associated input image through self-attention. The authors also propose two visually grounded language objectives for pretraining VisualBERT on image caption data. Experiments on four vision-language tasks, VQA, VCR, NLVR2, and Flickr30K, show that VisualBERT matches or outperforms state-of-the-art models while being significantly simpler. Further analysis indicates that VisualBERT can ground language elements to image regions without any explicit supervision, and is even sensitive to syntactic relationships, for example tracking associations between verbs and the image regions corresponding to their arguments.
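A minimal single-stream sketch of this idea follows: project detected region features into the text embedding space, tag each part with a segment embedding, and run one shared transformer over the concatenated sequence. All sizes are illustrative, and position embeddings and the pretraining objectives are omitted.

```python
# Sketch: one encoder over a joint sequence of word embeddings and region features.
import torch
import torch.nn as nn

dim, vocab = 768, 30522
word_emb = nn.Embedding(vocab, dim)
region_proj = nn.Linear(2048, dim)          # e.g. detector region features -> dim
segment_emb = nn.Embedding(2, dim)          # 0 = text, 1 = image
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,
)

token_ids = torch.randint(0, vocab, (1, 16))     # dummy word-piece ids
regions = torch.randn(1, 36, 2048)               # dummy region features

text_part = word_emb(token_ids) + segment_emb(torch.zeros(1, 16, dtype=torch.long))
image_part = region_proj(regions) + segment_emb(torch.ones(1, 36, dtype=torch.long))
joint = torch.cat([text_part, image_part], dim=1)    # one sequence, one encoder
out = encoder(joint)
print(out.shape)    # (1, 52, 768)
```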
B2T2
Fusion of Detected Objects in Text for Visual Question Answering
Fusion of detected objects in text for visual question answering
Paper link: https://arxiv.org/abs/1908.05054
Abstract: The authors develop a simple yet powerful neural architecture that fuses visual and natural-language data, as a further step in improving multimodal models. The model, named B2T2 (Bounding Boxes in Text Transformer), exploits referential information that binds words to regions of the image within a single unified architecture. B2T2 performs very well on the Visual Commonsense Reasoning dataset (http://visualcommonsense.com/), reducing the error rate by 25% relative to previously published baselines, and is currently the best-performing model on the public leaderboard. Detailed ablation experiments show that early fusion of visual features with text analysis is a key reason for the new architecture’s effectiveness.
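The following is a hedged sketch of the "early fusion" idea described above: the visual feature of a referenced bounding box is added to the embedding of the token that mentions it, before any transformer layer sees the sequence. The mapping from token positions to boxes, the projection layer, and all shapes are hypothetical placeholders, not the paper's implementation.

```python
# Sketch: inject bounding-box features into the text token embeddings they refer to.
import torch
import torch.nn as nn

dim = 768
word_emb = nn.Embedding(30522, dim)
box_proj = nn.Linear(2048, dim)   # projects a region feature into token space

token_ids = torch.randint(0, 30522, (1, 12))
token_vecs = word_emb(token_ids)

# Suppose token position 3 refers to detected box 0 and position 7 to box 1.
box_features = torch.randn(2, 2048)
references = {3: 0, 7: 1}

fused = token_vecs.clone()
for pos, box in references.items():
    fused[0, pos] = fused[0, pos] + box_proj(box_features[box])

# `fused` can now be fed to a standard BERT-style encoder.
print(fused.shape)
```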
Unicoder-VL
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Unicoder-VL: A universal encoder for vision and language via cross-modal pre-training
Paper link: https://arxiv.org/abs/1908.06066
Abstract: The authors propose Unicoder-VL, a universal encoder that learns joint representations of vision and language through pre-training. The model borrows the design ideas of cross-lingual pre-trained models such as XLM and Unicoder: during cross-modal pre-training, visual and linguistic content are fed into a multi-layer transformer, and three tasks are used, namely masked language modeling, masked object label prediction, and vision-language matching. The first two tasks teach the model context-aware representations of the input tokens based jointly on the linguistic and visual content; the last task tries to predict whether an image and a text description match each other. After pre-training on a large number of image-caption pairs, the authors transfer Unicoder-VL to image-text retrieval tasks with just one additional output layer, achieving state-of-the-art performance on both the MSCOCO and Flickr30K datasets.
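As an illustration of the third pre-training task, vision-language matching, here is a toy matching head: it scores whether an (image, text) pair matches, trained with mismatched pairs as negatives. The pooled encoder outputs are random stand-ins, not the actual Unicoder-VL model.

```python
# Sketch: binary matching head over pooled [CLS] vectors of image-text pairs.
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)   # match vs. no-match

    def forward(self, cls_vector):
        return self.classifier(cls_vector)

head = MatchingHead()
cls_vectors = torch.randn(4, 768)            # pooled outputs for 4 pairs
labels = torch.tensor([1, 0, 1, 0])          # 1 = matching pair, 0 = mismatched
loss = nn.functional.cross_entropy(head(cls_vectors), labels)
print(loss.item())
```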
LXMERT
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
LXMERT: Learning cross-modality encoder representations from transformers
Paper link: https://arxiv.org/abs/1908.07490
Abstract: Vision-and-language reasoning requires an understanding of visual concepts and language semantics, and above all the ability to align and relate these two modalities. The authors propose the LXMERT framework to learn these connections between language and vision. LXMERT is a large-scale transformer model with three encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder. To give the model the ability to connect visual and linguistic semantics, the authors pre-train it on a large number of image-sentence pairs with five representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn relationships both within a single modality and across modalities. Fine-tuned from the pre-trained parameters, the model achieves state-of-the-art results on the VQA and GQA visual question answering datasets. The authors also adapt the pre-trained cross-modality model to a challenging visual reasoning task, NLVR2, raising the previous best accuracy from 54% to 76%, demonstrating strong generalization. Finally, detailed ablation experiments show that both the newly designed model components and the pre-training strategies contribute significantly to these results. Code and pre-trained models can be found at https://github.com/airsplay/lxmert
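To make the masked object prediction tasks concrete, here is a sketch of one plausible form of the two losses named above: regressing the masked region's original feature and classifying its detector-assigned label. The hidden states, feature dimensions, and label count are random placeholders, and the exact loss functions in the paper may differ.

```python
# Sketch: feature-regression and label-classification losses at masked regions.
import torch
import torch.nn as nn

dim, feat_dim, num_labels = 768, 2048, 1600
feature_head = nn.Linear(dim, feat_dim)   # feature-regression head
label_head = nn.Linear(dim, num_labels)   # detected-label classification head

hidden = torch.randn(8, dim)              # encoder outputs at 8 masked regions
target_feat = torch.randn(8, feat_dim)    # original RoI features (targets)
target_label = torch.randint(0, num_labels, (8,))   # detector's labels

loss = (nn.functional.mse_loss(feature_head(hidden), target_feat)
        + nn.functional.cross_entropy(label_head(hidden), target_label))
print(loss.item())
```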
VL-BERT
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
VL-BERT: Pre-training of generic visual-linguistic representations
Paper link: https://arxiv.org/abs/1908.08530
Abstract: The authors design a new pre-trainable generic representation for vision-language tasks, named VL-BERT. VL-BERT adopts the simple yet powerful transformer model as its backbone and extends it so that visual and linguistic embedded features can be taken as input simultaneously. Each element of the input is either a word from the sentence or a region of interest from the input image. The design is also intended to be compatible with all downstream vision-language tasks. The authors pre-train the model on the large-scale Conceptual Captions dataset with three pre-training tasks: masked language modeling with visual clues, masked region-of-interest classification with linguistic clues, and sentence-image relationship prediction. Extensive empirical analysis shows that the pre-training better aligns visual and linguistic clues and benefits downstream tasks such as visual question answering, visual commonsense reasoning, and referring expression comprehension. Notably, VL-BERT achieved the best single-model result on the VCR leaderboard.
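Here is a hedged sketch of how such an input element might be assembled: every element, whether a word or an image region, gets a token embedding, a visual-feature embedding, a segment embedding, and a position embedding, which are summed. The special token id for regions, the segment scheme, and all sizes are assumptions for illustration rather than the paper's exact recipe.

```python
# Sketch: compose a VL-BERT-style input embedding for words and image regions.
import torch
import torch.nn as nn

dim = 768
word_emb = nn.Embedding(30522, dim)
visual_proj = nn.Linear(2048, dim)     # RoI feature -> embedding space
segment_emb = nn.Embedding(3, dim)     # e.g. sentence A / sentence B / image
position_emb = nn.Embedding(512, dim)

def element_embedding(token_id, visual_feature, segment_id, position):
    """Sum of token, visual-feature, segment, and position embeddings."""
    return (word_emb(torch.tensor([token_id]))
            + visual_proj(visual_feature)
            + segment_emb(torch.tensor([segment_id]))
            + position_emb(torch.tensor([position])))

# A word element: its own token id, paired here with a whole-image feature.
whole_image_feature = torch.randn(1, 2048)
word_elem = element_embedding(2023, whole_image_feature, segment_id=0, position=1)

# A region element: a placeholder [IMG]-style token id plus that region's feature.
roi_feature = torch.randn(1, 2048)
region_elem = element_embedding(100, roi_feature, segment_id=2, position=10)
print(word_elem.shape, region_elem.shape)
```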