Authors: Wu Yang, Hu Xiaoyu, Lin Zijie from Harbin Institute of Technology SCIR
Introduction
With the rapid development of social networks, the ways people express themselves online have become increasingly rich: emotions and opinions are conveyed through images, text, and video. How to analyze the emotion in multimodal data (in this article, audio, images, and text, excluding sensor data) is both an opportunity and a challenge for the field of sentiment analysis. On the one hand, earlier sentiment analysis focused on a single modality; text sentiment analysis, for example, analyzes, mines, and reasons about the emotion contained in text alone. Researchers now need to process and analyze data from multiple modalities at once, which is considerably more demanding. On the other hand, multimodal data carries more information than single-modal data, and the modalities can complement one another. Consider deciding whether the tweet "The weather is great today!" is ironic: the text alone is not, but paired with a photo of an overcast sky it may well be. Information from different modalities thus helps machines understand emotion more accurately.

From the perspective of human-computer interaction, multimodal sentiment analysis also lets machines interact with people more naturally: a machine can infer a user's emotion from facial expressions and gestures in images, tone of voice, and the recognized natural language, and then respond accordingly. In short, multimodal sentiment analysis is driven by real-life needs: people express emotion in natural, mixed ways, and technology should be able to understand and analyze such expression intelligently.

Although multimodal data contains more information, how to fuse the modalities so that they reinforce rather than cancel each other remains a major issue in the field. How to exploit alignment information between modalities, and how to model the associations between them (for example, hearing "meow" brings a cat to mind), are likewise open and interesting questions. To better introduce this line of research, this article organizes the current tasks in multimodal sentiment analysis and summarizes the commonly used datasets and representative methods.
Overview of Related Tasks
This article organizes related research by modality combination (text-image: text + image; video: text + image + audio). Few datasets have been built specifically for the text + audio combination; they are generally derived by running ASR on speech or by dropping the visual channel from text + image + audio datasets. Text + audio is also widely studied within the speech community, so this article does not cover it. As shown in Table 1, the text-image tasks include text-image sentiment classification, text-image aspect-level sentiment classification, and text-image irony recognition. The video tasks include sentiment classification for comment videos, news videos, and dialogue videos, as well as irony recognition for dialogue videos. The corresponding datasets and methods are summarized in the third section.
Table 1 Overview of Multimodal Sentiment Analysis Tasks
Datasets and Methods
This article summarizes 13 public datasets, including 8 video datasets and 5 text-image datasets. This article also summarizes the relevant research methods corresponding to sentiment classification tasks for text-image, aspect-level sentiment classification tasks for text-image, irony recognition tasks for text-image, sentiment classification tasks for comment videos, and sentiment classification tasks for dialogue videos.
Sentiment Classification Task for Text-Image
Dataset
The Yelp dataset is sourced from the review website Yelp.com and collects restaurant and food reviews from five cities: Boston, Chicago, Los Angeles, New York, and San Francisco. There are 44,305 reviews and 244,569 images in total (each review has multiple images), with an average of 13 sentences and 230 words per review. Each review is labeled with a sentiment score of 1, 2, 3, 4, or 5.
The Tumblr dataset is a multimodal emotion dataset collected from Tumblr, a microblogging service where users publish multimedia posts that usually include images, text, and tags. The dataset was built by searching for the tags of fifteen selected emotions and keeping only posts that contain both text and an image, then removing posts whose text already contains the corresponding emotion word, as well as posts that are not primarily in English. The full dataset contains 256,897 multimodal posts, labeled with fifteen emotions such as happiness, sadness, and disgust.
Method
Based on the characteristics of the Yelp dataset, [1] proposed that "images are not independent of textual expressions of emotion, but serve as auxiliary parts to highlight significant content in the text". VistaNet therefore uses images to guide textual attention and determine the importance of different sentences for document sentiment classification. As shown in Figure 1, VistaNet has a three-layer structure: a word encoding layer, a sentence encoding layer, and a classification layer. The word encoding layer encodes the words in each sentence and obtains the sentence representation through soft attention. The sentence encoding layer encodes these sentence representations and obtains the document representation through a visual attention mechanism. The document representation is then fed to the classification layer, which outputs the prediction. Structurally, VistaNet resembles the Hierarchical Attention Network: both are used for document-level sentiment classification, both have a three-layer structure, and the first two layers of both are GRU encoder + attention blocks. The difference is that VistaNet's sentence-level attention is guided by the images (visual attention).
Figure 1 VistaNet Model Architecture
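A minimal sketch of the sentence-level visual attention step described above (not the authors' released code): an image feature vector queries attention over the sentence representations to build the document representation. All module names and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttentionPooling(nn.Module):
    """Image-guided soft attention over sentence vectors (VistaNet-style sketch)."""
    def __init__(self, sent_dim=256, img_dim=2048, att_dim=128):
        super().__init__()
        self.proj_sent = nn.Linear(sent_dim, att_dim)
        self.proj_img = nn.Linear(img_dim, att_dim)
        self.score = nn.Linear(att_dim, 1, bias=False)

    def forward(self, sent_vecs, img_feat):
        # sent_vecs: (batch, num_sentences, sent_dim); img_feat: (batch, img_dim), e.g. a CNN feature
        e = torch.tanh(self.proj_sent(sent_vecs) + self.proj_img(img_feat).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=-1)      # per-sentence weights
        doc_vec = (alpha.unsqueeze(-1) * sent_vecs).sum(dim=1)    # weighted sum = document vector
        return doc_vec, alpha

# usage: 4 documents, 12 sentences each, one image feature per document
pool = VisualAttentionPooling()
doc, weights = pool(torch.randn(4, 12, 256), torch.randn(4, 2048))
```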
Aspect-Level Sentiment Classification Task for Text-Image
Dataset
The Multi-ZOL dataset collects mobile phone reviews from ZOL.com, an IT information and business portal. The raw data consists of 12,587 reviews (7,359 single-modal, 5,288 multimodal) covering 114 brands and 1,318 phone models; the 5,288 multimodal reviews constitute the Multi-ZOL dataset. Each sample contains the review text, an image set, and between one and six evaluation aspects: cost performance, performance and configuration, battery life, appearance and feel, camera quality, and screen. This yields 28,469 aspects in total, each annotated with a sentiment score from 1 to 10.
Twitter-15 and Twitter-17 are multimodal datasets of tweets with images, labeled with the sentiment expressed toward target entities mentioned in the text and images. Twitter-15 contains 3,179/1,122/1,037 (train/dev/test) tweets with images and Twitter-17 contains 3,562/1,176/1,234, with three-way sentiment labels.
Method
The aspect-level sentiment classification task studies the sentiment polarity of a given aspect in a multimodal document. An aspect may consist of multiple words, such as "eating environment", and the information contained in the aspect itself is important for extracting the relevant text and image information. For the Multi-ZOL dataset, [2] proposed the Multi-Interactive Memory Network (MIMN), shown in Figure 2. The model uses an aspect-guided attention mechanism to generate attention vectors for text and images, and a multi-interactive attention mechanism to capture interactions both between modalities and within each modality.
Figure 2 MIMN Model Architecture
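A simplified, single-hop sketch in the spirit of the aspect-guided and multi-interactive attention described above (not the released MIMN implementation): the aspect vector attends over text and image memories, then each modality is re-read using the other's summary. All names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(query, memory, proj):
    # query: (batch, d), memory: (batch, n, d) -> attended summary (batch, d)
    scores = torch.bmm(proj(memory), query.unsqueeze(-1)).squeeze(-1)
    alpha = F.softmax(scores, dim=-1)
    return torch.bmm(alpha.unsqueeze(1), memory).squeeze(1)

class AspectGuidedHop(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.pt, self.pv = nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, aspect, text_mem, img_mem):
        # intra-modal: the aspect attends over each modality's memory
        t = attend(aspect, text_mem, self.pt)
        v = attend(aspect, img_mem, self.pv)
        # inter-modal: each modality is re-read with the other's summary as the query
        t2 = attend(v, text_mem, self.pt)
        v2 = attend(t, img_mem, self.pv)
        return torch.cat([t + t2, v + v2], dim=-1)

hop = AspectGuidedHop()
out = hop(torch.randn(2, 128), torch.randn(2, 30, 128), torch.randn(2, 5, 128))
```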
Irony Recognition Task for Text-Image
The goal of the irony recognition task is to determine whether a document contains ironic expressions. [3] proposed a Hierarchical Fusion Model to model the text-image information for irony recognition.
Dataset
The Twitter irony dataset is built from the Twitter platform: English tweets that contain an image and certain hashtags (e.g., #sarcasm) are collected as positive examples, and English tweets with images but without such tags as negative examples. The data is further cleaned by removing tweets that contain words such as irony, sarcasm, or satire as ordinary words, tweets containing URLs (to avoid introducing extra information), and tweets containing words that frequently co-occur with sarcasm, such as joke and humor. The dataset is split into training, development, and test sets of 19,816, 2,410, and 2,409 image-tweet pairs, respectively, with binary labels: sarcastic / not sarcastic.
Method
HFM (Hierarchical Fusion Model) builds on the two modalities of text and image and adds a third: image attributes, a handful of words describing the components of the image. As shown in Figure 3, the example image has attributes such as "fork", "knife", and "meat". The authors argue that image attributes connect the content of the image and the text, acting as a "bridge".
HFM is divided into three functional levels: an encoding layer, a fusion layer, and a classification layer, where the fusion layer is further divided into a representation fusion layer and a modality fusion layer. The encoding layer first encodes the three modalities, producing for each a set of raw feature vectors (raw vectors), one per element of the modality. Averaging or weighted-summing the raw vectors gives a single guidance vector per modality. The raw vectors and guidance vectors then pass through the representation fusion layer, producing reconstructed feature vectors that incorporate information from the other modalities. Finally, the three reconstructed vectors are combined by the modality fusion layer into the fused vector, which serves as input to the classification layer.
Figure 3 HFM Model Architecture
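A condensed sketch of the fusion pipeline described above (not the authors' code): raw per-element vectors are averaged into guidance vectors, each modality's raw vectors are re-weighted under the other modalities' guidance (representation fusion), and the reconstructed vectors are combined by a learned transformation (modality fusion). Dimensions, gating, and the combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HFMStyleFusion(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.att = nn.ModuleDict({m: nn.Linear(2 * d, 1) for m in ("text", "image", "attr")})
        self.mix = nn.ModuleDict({m: nn.Linear(d, d) for m in ("text", "image", "attr")})

    def forward(self, raw):  # raw: dict of (batch, n_m, d) tensors, one per modality
        guidance = {m: x.mean(dim=1) for m, x in raw.items()}            # guidance vectors
        recon = {}
        for m, x in raw.items():
            # average the other modalities' guidance vectors and use them as the query
            others = torch.stack([g for k, g in guidance.items() if k != m]).mean(0)
            q = others.unsqueeze(1).expand(-1, x.size(1), -1)
            w = F.softmax(self.att[m](torch.cat([x, q], dim=-1)).squeeze(-1), dim=-1)
            recon[m] = (w.unsqueeze(-1) * x).sum(dim=1)                  # reconstructed vector
        fused = torch.stack([torch.tanh(self.mix[m](recon[m])) for m in recon]).sum(0)
        return fused  # input to the classification layer

fusion = HFMStyleFusion()
batch = {"text": torch.randn(2, 20, 128), "image": torch.randn(2, 14, 128), "attr": torch.randn(2, 5, 128)}
print(fusion(batch).shape)  # torch.Size([2, 128])
```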
Sentiment Classification Task for Comment Videos
Dataset
The YouTube dataset collects 47 videos from YouTube on a variety of themes such as toothpaste, camera reviews, and baby products rather than a single topic. Each video features a single speaker facing the camera and presenting their opinions; there are 20 female and 27 male speakers, aged roughly 14 to 60 and from various ethnic backgrounds. The videos are 2 to 5 minutes long, and all sequences are normalized to 30 seconds. Three annotators watched the videos in random order and labeled each as positive, negative, or neutral. Note that the labels reflect the emotional tendency of the speaker in the video, not the viewers' reaction to it. Of the 47 videos, 13 are labeled positive, 22 neutral, and 12 negative.
The ICT-MMMO dataset collects movie review videos from social media. It contains 370 multimodal review videos in which a person speaks directly to the camera, giving their opinion of a movie or stating facts about it. The videos come from YouTube and ExpoTV, all speakers speak English, and the videos are 1 to 3 minutes long. Of the 370 videos, 308 come from YouTube and 62, all negative reviews, come from ExpoTV; overall there are 228 positive, 23 neutral, and 119 negative reviews. As with the YouTube dataset, the labels reflect the emotional tendency of the speaker, not the viewers.
The MOSI dataset collects vlogs, mainly movie reviews, from YouTube. The videos are 2 to 5 minutes long; 93 videos were randomly collected from 89 different speakers, including 41 women and 48 men, most aged between 20 and 30 and from various ethnic backgrounds. Annotations were produced by five annotators on Amazon's crowdsourcing platform and averaged, giving seven-category sentiment intensity labels from -3 to +3. Again, the labels reflect the emotional tendency of the commentator in the video, not the viewers.
The CMU-MOSEI dataset collects monologue videos from YouTube, removing those with too many people on screen. The final dataset contains 3,228 videos, 23,453 sentences, 1,000 speakers, and 250 topics, totaling 65 hours of footage. Each sentence carries both a sentiment label on a seven-point scale (with 2/5/7-class versions also provided by the authors) and emotion labels covering six emotions: happiness, sadness, anger, fear, disgust, and surprise.
Method
Comment videos contain three kinds of information: text (subtitles), images, and audio, so sentiment classification for comment videos must process all three modalities. A video can be viewed as images arranged in temporal sequence; compared with a single image it adds the time dimension, so RNNs and their variants can be used for encoding (see the sketch below). The following sections introduce three multimodal sentiment classification models for comment videos: the Tensor Fusion Network [4] (EMNLP 2017), the Multi-attention Recurrent Network [5] (AAAI 2018), and the Memory Fusion Network [6] (AAAI 2018).
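A minimal sketch of that point, assuming per-frame CNN features are already extracted: the frame features are treated as a temporal sequence and encoded with an LSTM, whose final hidden state serves as the video-modality vector. The feature sizes are illustrative.

```python
import torch
import torch.nn as nn

frame_feats = torch.randn(8, 30, 512)            # (batch, frames, per-frame CNN feature dim)
video_encoder = nn.LSTM(512, 128, batch_first=True)
_, (h_n, _) = video_encoder(frame_feats)
video_vec = h_n[-1]                              # (batch, 128) video-modality representation
```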
TFN (Tensor Fusion Network)
Zadeh et al. [4] proposed a multimodal fusion method based on the tensor outer product, hence the name TFN. In the encoding phase, TFN encodes the text modality with an LSTM followed by a two-layer fully connected network, and the audio and video modalities with three-layer DNNs. In the fusion phase, the outer product of the three encoded vectors (each augmented with a constant 1) produces a multimodal representation containing unimodal, bimodal, and trimodal interaction terms for the subsequent decision layers.
Figure 4 TFN Model Architecture
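A minimal sketch of the outer-product fusion described above (not the authors' implementation): appending a constant 1 to each modality vector and taking the three-way outer product yields a tensor whose entries cover unimodal, bimodal, and trimodal interactions. The dimensions below are illustrative, not those of the paper.

```python
import torch

def tensor_fusion(z_text, z_audio, z_video):
    # z_*: (batch, d_*) encoded single-modality vectors
    add_one = lambda z: torch.cat([z.new_ones(z.size(0), 1), z], dim=1)
    zt, za, zv = add_one(z_text), add_one(z_audio), add_one(z_video)
    # batched outer product: (batch, d_t + 1, d_a + 1, d_v + 1)
    fused = torch.einsum("bi,bj,bk->bijk", zt, za, zv)
    return fused.flatten(start_dim=1)  # flattened for the decision layers

m = tensor_fusion(torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 32))
print(m.shape)  # torch.Size([8, 140481]) = 129 * 33 * 33 interaction features
```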
MARN (Multi-attention Recurrent Network)
MARN is based on the hypothesis, supported by findings in cognitive science, that there are multiple types of interaction between modalities. It therefore uses a multi-attention mechanism to extract several kinds of cross-modal interaction; the architecture is shown in Figure 5. For encoding, the authors propose the Long-short Term Hybrid Memory (LSTHM), an LSTM variant that incorporates a cross-modal representation into its state update, combining modality fusion with encoding. Because fusion happens at every time step, the three modalities' sequences must have equal length, so the modalities must be aligned before encoding.
Figure 5 MARN Model Architecture
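A rough sketch of the multi-attention idea above (simplified, not the paper's exact formulation): at each time step, K separate softmax attentions are applied over the concatenated hidden states of the modalities, and the K attended copies are compressed into a cross-modal code that would be fed back into each modality's hybrid memory. The sizes and the reduction layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttentionBlock(nn.Module):
    def __init__(self, total_dim, k=4, code_dim=64):
        super().__init__()
        self.k = k
        self.att = nn.Linear(total_dim, k * total_dim)    # K attention score maps
        self.reduce = nn.Linear(k * total_dim, code_dim)  # compress into a cross-modal code

    def forward(self, h_cat):
        # h_cat: (batch, total_dim) = concatenated hidden states of all modalities at time t
        scores = self.att(h_cat).view(-1, self.k, h_cat.size(1))
        weights = F.softmax(scores, dim=-1)               # K attention distributions
        attended = weights * h_cat.unsqueeze(1)           # K re-weighted copies of h_cat
        return torch.tanh(self.reduce(attended.flatten(1)))

block = MultiAttentionBlock(total_dim=96 * 3)             # e.g. three 96-dim modality states
code = block(torch.randn(8, 96 * 3))
```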
MFN (Memory Fusion Network)
Whereas MARN considers multiple possible attention distributions, MFN focuses on the temporal range over which attention operates. Like MARN, MFN combines modality fusion with encoding, but during encoding the modalities remain independent: each is encoded by its own LSTM, and no shared hybrid vector enters the computation. Instead, MFN uses Delta-memory attention and a Multi-View Gated Memory to capture interactions across time and across modalities simultaneously, preserving the multimodal interaction information from the previous time step. Figure 6 illustrates MFN's processing at time t.
Figure 6 MFN Model Architecture
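A compact sketch of one MFN step as described above (simplified): the Delta-memory attention scores the concatenated LSTM states of times t-1 and t, and the Multi-View Gated Memory u is updated through retain/update gates. Layer sizes and the exact proposal network are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFNStep(nn.Module):
    def __init__(self, total_dim, mem_dim=64):
        super().__init__()
        self.delta_att = nn.Linear(2 * total_dim, 2 * total_dim)  # attention over [h_{t-1}; h_t]
        self.proposal = nn.Linear(2 * total_dim, mem_dim)         # proposed memory content
        self.gamma1 = nn.Linear(2 * total_dim, mem_dim)           # retain gate
        self.gamma2 = nn.Linear(2 * total_dim, mem_dim)           # update gate

    def forward(self, h_prev, h_now, u):
        # h_prev, h_now: (batch, total_dim) concatenated per-modality LSTM states; u: (batch, mem_dim)
        c = torch.cat([h_prev, h_now], dim=-1)
        a = F.softmax(self.delta_att(c), dim=-1) * c               # Delta-memory attention
        u_hat = torch.tanh(self.proposal(a))                       # candidate memory
        g1, g2 = torch.sigmoid(self.gamma1(a)), torch.sigmoid(self.gamma2(a))
        return g1 * u + g2 * u_hat                                 # gated Multi-View Memory update

step = MFNStep(total_dim=96 * 3)
u = step(torch.randn(4, 288), torch.randn(4, 288), torch.zeros(4, 64))
```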
Sentiment Classification Task for Dialogue Videos
Dataset
The MELD dataset is derived from EmotionLines, a text-only dialogue dataset built from the classic TV show Friends. MELD extends it into a multimodal dataset with video, text, and audio, ultimately comprising 13,709 utterances; each utterance carries both an emotion label (one of seven emotions, including fear) and a sentiment label (positive, negative, or neutral).
The IEMOCAP dataset is somewhat special: it is collected neither from user-uploaded videos on platforms such as YouTube nor from well-known TV shows such as Friends, but from ten actors performing around specified themes. Five professional male and five professional female actors act out conversations, yielding 4,787 improvised and 5,255 scripted utterances with an average duration of 4.5 seconds, for a total of 11 hours. The annotations are emotion labels in ten categories, including fear and sadness.
Method
The goal of dialogue sentiment classification is to determine the sentiment polarity of each utterance in a dialogue; this requires modeling speaker information and the dialogue scene, and is strongly influenced by the preceding turns. DialogueRNN [7] uses three GRUs to model speaker information, the context of preceding turns, and emotion. The model maintains a global context state (global state) and a state for each dialogue participant (party state). Structurally it consists of a Global GRU, a Party GRU, and an Emotion GRU: the Global GRU computes and updates the global state at each turn, the Party GRU updates the party state of the current speaker, and the Emotion GRU computes the emotion representation of the current utterance.
Figure 7 DialogueRNN Model Architecture
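A skeletal sketch of one DialogueRNN turn as described above (simplified from the paper): a global GRU tracks context, a party GRU updates the current speaker's state using attention over past global states, and an emotion GRU produces the emotion representation. The attention form and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DialogueRNNStep(nn.Module):
    def __init__(self, utt_dim=100, d=100):
        super().__init__()
        self.global_gru = nn.GRUCell(utt_dim + d, d)   # global context update
        self.party_gru = nn.GRUCell(utt_dim + d, d)    # current speaker's state update
        self.emotion_gru = nn.GRUCell(d, d)            # emotion representation

    def forward(self, utt, global_hist, party_state, prev_emotion):
        # utt: (batch, utt_dim); global_hist: (batch, T, d) past global states
        # party_state: (batch, d) current speaker's state; prev_emotion: (batch, d)
        scores = torch.bmm(global_hist, utt.unsqueeze(-1)).squeeze(-1)
        context = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), global_hist).squeeze(1)
        g_new = self.global_gru(torch.cat([utt, party_state], dim=-1), global_hist[:, -1])
        p_new = self.party_gru(torch.cat([utt, context], dim=-1), party_state)
        e_new = self.emotion_gru(p_new, prev_emotion)
        return g_new, p_new, e_new

step = DialogueRNNStep()
g, p, e = step(torch.randn(2, 100), torch.randn(2, 5, 100), torch.randn(2, 100), torch.randn(2, 100))
```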
Sentiment Classification Task for News Videos
Dataset
The News Rover Sentiment dataset covers the news domain. Its videos were recorded between August 13 and December 25, 2013 from various news programs and channels in the United States. The data is organized by person and profession, with clip lengths restricted to between 4 and 15 seconds: the authors argue that emotions are hard to judge in very short clips, while clips longer than 15 seconds may contain multiple statements with different emotions. The final dataset contains 929 clips, each labeled with one of three sentiment classes.
Irony Recognition Task for Dialogue Videos
Dataset
The MUStARD dataset is a multimodal sarcasm detection dataset drawn from a range of sources, including well-known TV shows such as The Big Bang Theory, Friends, and The Golden Girls. The authors collected sarcastic clips from these shows and non-sarcastic clips from the MELD dataset. The final dataset contains 690 video segments, 345 sarcastic and 345 not, labeled for the presence of sarcasm.
The information about the above datasets can be summarized in Table 2.
Table 2 Information on Multimodal Sentiment Analysis Datasets
Conclusion
This article has briefly outlined the tasks in multimodal sentiment analysis and summarized the associated datasets and some typical methods. Although multimodal data provides more information, how to process and analyze multimodal information and how to fuse the different modalities remain the main issues to be resolved in the field.
References
[1] Truong T Q, Lauw H W. VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis[C]. National Conference on Artificial Intelligence, 2019: 305-312.
[2] Xu N, Mao W, Chen G, et al. Multi-Interactive Memory Network for Aspect Based Multimodal Sentiment Analysis[C]. National Conference on Artificial Intelligence, 2019: 371-378.
[3] Cai Y, Cai H, Wan X, et al. Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model[C]. Meeting of the Association for Computational Linguistics, 2019: 2506-2515.
[4] Zadeh A, Chen M, Poria S, et al. Tensor Fusion Network for Multimodal Sentiment Analysis[C]. Empirical Methods in Natural Language Processing, 2017: 1103-1114.
[5] Zadeh A, Liang P P, Poria S, et al. Multi-attention Recurrent Network for Human Communication Comprehension[J]. arXiv: Artificial Intelligence, 2018.
[6] Zadeh A, Liang P P, Mazumder N, et al. Memory Fusion Network for Multi-view Sequential Learning[J]. arXiv: Learning, 2018.
[7] Majumder N, Poria S, Hazarika D, et al. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations[C]. National Conference on Artificial Intelligence, 2019: 6818-6825.
[8] Yu J, Jiang J. Adapting BERT for Target-Oriented Multimodal Sentiment Classification[C]. International Joint Conference on Artificial Intelligence, 2019: 5408-5414.
[9] Morency L, Mihalcea R, Doshi P, et al. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web[C]. International Conference on Multimodal Interfaces, 2011: 169-176.
[10] Wollmer M, Weninger F, Knaup T, et al. YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context[J]. IEEE Intelligent Systems, 2013, 28(3): 46-53.
[11] Zadeh A. Micro-opinion Sentiment Intensity Analysis and Summarization in Online Videos[C]. International Conference on Multimodal Interfaces, 2015: 587-591.
[12] Zadeh A B, Liang P P, Poria S, et al. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph[C]. Meeting of the Association for Computational Linguistics, 2018: 2236-2246.
[13] Poria S, Hazarika D, Majumder N, et al. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations[J]. arXiv: Computation and Language, 2018.
[14] Busso C, Bulut M, Lee C, et al. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359.
[15] Ellis J G, Jou B, Chang S, et al. Why We Watch the News: A Dataset for Exploring Sentiment in Broadcast Video News[C]. International Conference on Multimodal Interfaces, 2014: 104-111.
[16] Castro S, Hazarika D, Perezrosas V, et al. Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper)[J]. arXiv: Computation and Language, 2019.