Summary of NLP and CV Fusion in Multimodal Systems


Reprinted from | NLP from Beginner to Abandon
Written by | Sanhe Factory Girl
Edited by | zenRRan
My first exposure to multimodal work was a Douyin recommendation project: given videos, titles, user likes, favorites, and so on, recommend works to users. I was responsible for the NLP part. Although the team's overall performance with a wide-and-deep model was acceptable, the A/B test showed that the text part contributed nothing…
Looking back now, the wide-and-deep approach is still too crude for complex information fusion. This article covers basic multimodal concepts and some recent, more sophisticated model designs for image-text fusion, mainly in the VQA (Visual Question Answering) field. A multimodal QA paper is also included, because in recommendation, as seen above, even when NLP contributes nothing the user features alone are sufficient and the results can still be very good, so it is hard to tell whether the fusion helps.

1. A Primer on Concepts

Multimodal (MultiModal)
  • Obtaining and expressing information from multiple different sources (different forms of information)

Five Challenges (only four are covered in this note; the fifth in the usual taxonomy is co-learning)
  1. Representation (multimodal representation), e.g., the translation- and rotation-invariant representations studied for images

  • The redundancy problem of representations

  • Signals come in different forms: some are symbolic (e.g., text), some are waveforms (e.g., audio); the question is which kind of representation makes it convenient for a multimodal model to extract information

Methods of Representation
  • Joint representation maps the information from multiple modalities into a single, shared multimodal vector space

  • Coordinated representation maps each modality into its own representation space, with the mapped vectors constrained to satisfy certain correlations (see the sketch below)
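To make the distinction concrete, here is a minimal PyTorch sketch, with dimensions and module names of my own choosing rather than from any particular paper: a joint representation fuses both modalities into one shared vector, while a coordinated representation keeps separate per-modality embeddings that training ties together through a similarity constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_img, d_txt, d_common = 2048, 768, 512  # illustrative feature sizes

# Joint representation: fuse both modalities into a single multimodal vector
joint_encoder = nn.Sequential(
    nn.Linear(d_img + d_txt, d_common), nn.ReLU(), nn.Linear(d_common, d_common)
)

# Coordinated representation: separate encoders, tied by a correlation constraint
img_encoder = nn.Linear(d_img, d_common)
txt_encoder = nn.Linear(d_txt, d_common)

img_feat = torch.randn(8, d_img)   # e.g. pre-extracted CNN features
txt_feat = torch.randn(8, d_txt)   # e.g. pre-extracted text-encoder features

# joint: one shared multimodal space
joint_vec = joint_encoder(torch.cat([img_feat, txt_feat], dim=-1))

# coordinated: each modality keeps its own space...
img_vec = img_encoder(img_feat)
txt_vec = txt_encoder(txt_feat)
# ...but a cosine-similarity loss pulls matched image-text pairs together
coord_loss = 1 - F.cosine_similarity(img_vec, txt_vec).mean()
```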

2. Translation/Transformation/Mapping
  • Mapping between signals, for example translating an image into text or text into an image, i.e., transforming information into a unified form for downstream use

  • The methods overlap with machine translation: example-based approaches (retrieval, dictionaries/rules, etc.) and generative approaches that directly generate the translated content

3. Alignment
  • Multimodal alignment is defined as finding the relationships and correspondences between components of instances from two or more modalities, i.e., studying how different signals line up (for example, finding which part of a script corresponds to which part of a movie)

  • Alignment methods: there are specialized subfields studying alignment, mainly of two types: explicit alignment (e.g., alignment along the time dimension) and implicit alignment (e.g., language translation, where correspondence is not position-to-position)

4. Fusion
  • For example, fusing tone of voice with the spoken words in sentiment analysis

  • This is the hardest and most heavily researched area, e.g., how to fuse audio syllables with lip movements in video; this note mainly covers fusion methods

2. Applications

Speech recognition, multimedia content retrieval, video understanding, video summarization, event detection, sentiment analysis, video-conference sentiment analysis, media description, visual question answering, and so on. The range of applications is actually very broad, but the current level of intelligence greatly limits them. In any case, I think combining vision and language is one step closer to intelligence than pure NLP.

3. A VQA Primer and Common Methods

VQA (Visual Question Answering)
  • Given an image (or video) and a natural-language question about it, the computer should generate a correct answer. This combines text QA and image captioning, and it usually involves reasoning about the image content, which looks cooler (not in the logical sense, just as an intuitive impression).

The Four Main Approaches to VQA Today
  1. Joint embedding approaches, which fuse information directly at the encoding stage. This naturally leads to the simplest and most direct method of concatenating the text and image embeddings (ps: this crude concatenation actually works quite well; see the sketch after this list). Bilinear fusion is the most commonly used technique here, the logistic regression (LR) of the fusion world

  2. Attention mechanisms. Many VQA papers focus on attention, and attention itself is an act of information extraction. Ever since "attention is all you need," its applications have become fancy; several CVPR 2019 papers are introduced later in this article

  3. Compositional models, which solve the problem through modularization: each module handles a different function, and the answer is reasoned out by assembling the modules

For example, in [1], for the question "What color is his tie?", the model first selects the attend and classify modules, then assembles them according to a reasoning layout to arrive at the final answer.
4. Models using an external knowledge base
Using an external knowledge base for VQA is easy to understand: QA often relies on knowledge bases, which provide a ready-made knowledge reserve. For example, to answer "How many mammals are in the picture?", the model must know what "mammal" means, which is hard to learn from the image alone, so retrieving from an integrated knowledge base is one solution, as in [2].
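As a reference point for approach 1 above, this is roughly what the "crude concatenation" baseline looks like; a hedged sketch that assumes pre-extracted image and question features and the usual answer-classification formulation of VQA, with all names and dimensions being illustrative.

```python
import torch
import torch.nn as nn

class ConcatVQA(nn.Module):
    """Joint embedding by plain concatenation: fuse pre-extracted image and
    question features and classify over a fixed answer vocabulary."""
    def __init__(self, d_img=2048, d_q=1024, hidden=1024, n_answers=3000):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_img + d_q, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, img_feat, q_feat):
        # img_feat: (batch, d_img), q_feat: (batch, d_q)
        return self.fuse(torch.cat([img_feat, q_feat], dim=-1))  # answer logits

model = ConcatVQA()
logits = model(torch.randn(4, 2048), torch.randn(4, 1024))  # (4, 3000)
```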

4. Several CV and NLP Fusion Methods in Multimodal Models

1. Bilinear Fusion and Joint Embedding
Bilinear fusion is one of the most common fusion methods, and many papers use it as a basic building block. In the CVPR 2019 VQA multimodal reasoning paper [3], the proposed cell is built on it: the authors model relational reasoning, capturing not only the interaction between the question and the image regions but also the relationships among the image regions themselves. The derivation proceeds step by step.
In the proposed MuRel cell, bilinear fusion combines each image-region feature with the question-text feature to obtain a multimodal embedding (joint embedding), and pairwise relationships are then modeled over these embeddings.
Part one, bilinear fusion: simply put, a bilinear function is linear in each of its two inputs separately, and its parameters (which encode the association between the two kinds of information) form a third-order tensor. The Tucker decomposition used in the MUTAN model significantly reduces the number of parameters of this bilinear mapping.
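Below is a minimal sketch of what Tucker-decomposed bilinear fusion (MUTAN-style) can look like; the projection sizes, the core-tensor size, and all module names are my own assumptions, and the actual MUTAN model additionally imposes a rank constraint on the core tensor that this sketch omits.

```python
import torch
import torch.nn as nn

class TuckerBilinearFusion(nn.Module):
    """Bilinear fusion of a question and a region feature via a Tucker decomposition.

    Full bilinear fusion needs a (d_q x d_v x d_out) parameter tensor; Tucker
    factorizes it into two input projections, a small core tensor, and an
    output projection, which greatly shrinks the parameter count."""
    def __init__(self, d_q=2048, d_v=2048, d_core=160, d_out=512):
        super().__init__()
        self.proj_q = nn.Linear(d_q, d_core)       # project question feature
        self.proj_v = nn.Linear(d_v, d_core)       # project image-region feature
        self.core = nn.Parameter(torch.randn(d_core, d_core, d_core) * 0.01)
        self.proj_o = nn.Linear(d_core, d_out)     # project to the output space

    def forward(self, q, v):
        # q: (batch, d_q) question feature; v: (batch, d_v) region feature
        tq = self.proj_q(q)                        # (batch, d_core)
        tv = self.proj_v(v)                        # (batch, d_core)
        # contract the core tensor with both projected inputs:
        # z_k = sum_ij tq_i * tv_j * core_ijk
        z = torch.einsum('bi,bj,ijk->bk', tq, tv, self.core)
        return self.proj_o(z)                      # (batch, d_out) fused embedding

# usage: fuse a question vector with one region vector
fusion = TuckerBilinearFusion()
m = fusion(torch.randn(4, 2048), torch.randn(4, 2048))   # (4, 512)
```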
Part two, the pairwise relation module, learns the relationships between the fused nodes (mainly relationships among the image regions), which are then simply (crudely) concatenated with the original text information.
Finally, as shown in the paper's figure, the cell is placed in a network for iterative reasoning. Experimental results show that this structure performs well on questions that require reasoning about locations.
2. Fancy Dynamic Attention Fusion
In this paper [4], the authors attend to both intra-modality and inter-modality relationships (the intra-modality relation and inter-modality relation, in the authors' terms), and they cleverly use attention for all of the fusions.
The authors argue that intra-modality relations complement inter-modality relations: image regions should not only receive information from the question text but also relate to other image regions.
The model first extracts features from the images and the text separately, then models intra-modality attention and inter-modality attention, stacks this module multiple times, and finally concatenates for classification. The inter-modality attention runs in both directions (text to image and image to text), and attention is used in the same way as in the Transformer.
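For intuition, here is a rough sketch of one direction of inter-modality attention (text queries attending to image regions) in the transformer style described above; the symmetric image-to-text direction is the same module with its inputs swapped. Dimensions and names are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn

class InterModalityAttention(nn.Module):
    """One direction of inter-modality attention: the first modality provides
    the queries, the second provides the keys and values."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, other):
        # x: (batch, n_x, dim), other: (batch, n_other, dim)
        attended, _ = self.attn(query=x, key=other, value=other)
        return self.norm(x + attended)   # residual + norm, as in a transformer block

# both directions, which the paper stacks several times
t2i = InterModalityAttention()
i2t = InterModalityAttention()
text = torch.randn(2, 14, 512)     # 14 question tokens
image = torch.randn(2, 36, 512)    # 36 detected regions
new_text = t2i(text, image)        # text tokens updated with visual information
new_image = i2t(image, text)       # regions updated with textual information
```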
The module that models intra-modality relationships is the Dynamic Intra-modality Attention Flow (DyIntraMAF). The biggest highlight of the paper is conditional attention: the attention weights between image regions should not depend on the image alone, but should form different associations conditioned on the specific question.
This conditional attention design is somewhat similar to the gating mechanism of an LSTM: information flow is controlled by a gate. In the paper's diagram, the image self-attention is filtered through a gate generated from the text. Finally, the authors conducted extensive ablation studies and achieved SOTA results.
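The gating idea can be sketched as follows; this is only an illustration of question-conditioned gating of image self-attention under my own simplifications (a single pooled question vector, a sigmoid channel gate), not the paper's exact DyIntraMAF formulation.

```python
import torch
import torch.nn as nn

class ConditionedIntraAttention(nn.Module):
    """Image self-attention whose inputs are gated by a signal derived from
    the question, so different questions induce different region-to-region
    associations."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions, question_vec):
        # regions: (batch, n_regions, dim); question_vec: (batch, dim) pooled question
        g = self.gate(question_vec).unsqueeze(1)          # (batch, 1, dim), gate in [0, 1]
        gated = regions * g                               # the question decides which channels matter
        attended, _ = self.attn(gated, gated, regions)    # self-attention over gated regions
        return self.norm(regions + attended)

# usage
mod = ConditionedIntraAttention()
out = mod(torch.randn(2, 36, 512), torch.randn(2, 512))  # (2, 36, 512) question-aware regions
```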
3. VQA Dialogue Systems
There is also a paper [5] on multimodal dialogue QA. Its fusion method is quite ordinary (again bilinear), but the application scenario is extremely practical: when words alone cannot express something clearly, a picture is worth a thousand words, so the user just sends a picture, "like this, like that…". The paper applies this to commercial customer-service QA.
The model is fairly standard: on the encoder side, a CNN first extracts image features, an attribute classification tree is built from product attributes, the text is processed conventionally, and everything is finally fused through MFB.
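For reference, below is a minimal sketch of MFB (multi-modal factorized bilinear pooling) as it is commonly described: project both modalities into a k·o space, take an elementwise product, sum-pool over the k factors, then apply power and L2 normalization. Dimensions are placeholders, and the sketch assumes both features are already pooled into single vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Multi-modal Factorized Bilinear pooling for two feature vectors."""
    def __init__(self, d_img=2048, d_txt=1024, o=1000, k=5):
        super().__init__()
        self.k, self.o = k, o
        self.proj_img = nn.Linear(d_img, k * o)
        self.proj_txt = nn.Linear(d_txt, k * o)

    def forward(self, img, txt):
        # img: (batch, d_img), txt: (batch, d_txt)
        joint = self.proj_img(img) * self.proj_txt(txt)     # (batch, k*o) elementwise product
        joint = joint.view(-1, self.k, self.o).sum(dim=1)   # sum-pool over the k factors
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)  # power normalization
        return F.normalize(joint, dim=-1)                   # L2 normalization

# usage
mfb = MFB()
fused = mfb(torch.randn(3, 2048), torch.randn(3, 1024))     # (3, 1000) fused vector
```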
During decoding, the text is decoded with an RNN, but the images are, surprisingly, retrieved by cosine similarity. Given the scale of product data in e-commerce, this computation is unrealistic unless substantial work is done on the business side to narrow the candidates first.

In Summary

This article expands the breadth of NLP rather than its depth, and the selected papers are somewhat arbitrary (I am not very familiar with the area), but as an NLPer I think this is also a worthwhile direction in terms of breadth.

References

  1. Neural Module Networks

  2. Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources

  3. MUREL: Multimodal Relational Reasoning for Visual Question Answering

  4. Dynamic Fusion with Intra- and Inter-Modality Attention Flow for Visual Question Answering

  5. User Attention-guided Multimodal Dialog Systems

