Multimodal Emotion Computing Overview


By Wang Shasha, R&D Center, Agricultural Bank of China

Emotion computing, also known as affective computing, aims to build intelligent systems that can perceive, recognize, and understand human emotions and respond to them intelligently, sensitively, and naturally. Early on, the industry commonly employed unimodal emotion computing technologies such as micro-expression recognition, speech emotion recognition, and text sentiment mining. While effective, unimodal analysis is limited by the diversity of human emotional expression. In recent years, multimodal fusion technology has developed rapidly, driving the transition from unimodal to multimodal emotion computing. Multimodal emotion computing integrates information from two or more modalities and applies techniques such as machine learning and deep learning to recognize and understand human emotional states. Studies have shown that multimodal emotion computing achieves higher accuracy and robustness than its unimodal counterpart.

The “2021 Research Frontier Heat Index” report released by the Chinese Academy of Sciences shows that research hotspots centered on “multimodal emotion computing” rank in the top 10. The “2023 Financial Technology Trend Outlook,” published by the Peking University Guanghua Duxiaoman Financial Technology Laboratory in collaboration with the MIT Technology Review China research team, lists multimodal emotion computing among its top ten technology trends. This indicates that multimodal emotion computing will be a key technology area for development, strategic positioning, and deep application by academia and industry in the coming years.

In the banking sector, multimodal emotion computing has unique value and broad application prospects: it can help optimize the customer service experience, support precision marketing, and refine risk management. Some banking institutions have already begun exploring its application in credit card customer service to provide customers with higher-quality services. In the future, as the technology advances and data processing capabilities improve, the digital transformation of the banking industry will provide a solid foundation and a favorable practical environment for the development of multimodal emotion computing, while the continued evolution of the technology itself will strongly push the financial industry in a more intelligent and personalized direction.

1. Overview of Multimodal Emotion Computing Technology

1. Overall Framework

Multimodal emotion computing mainly involves data collection, feature extraction, feature fusion, and emotion recognition. Taking video datasets as an example, the basic process is shown in Figure 1. First, video data is collected and preprocessed, and three modalities are extracted from it: text, speech, and image. Features that carry emotional information are extracted from each modality, and the features of the different modalities are fused. The fused features are then used to train a multimodal emotion computing model, such as a deep learning model, for classification, regression, or detection. Finally, the trained model predicts the emotions of new samples.


Figure 1 Multimodal Emotion Computing Framework
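To make the process concrete, here is a minimal, runnable Python sketch of this pipeline. The per-modality extractors are hypothetical stubs and the labels are synthetic; only the overall structure, not the stubs, reflects the framework above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stub extractors: each maps a raw sample to a feature vector.
# A real system would use proper text/speech/image feature extractors here.
def extract_text_features(sample):  return np.random.rand(8)
def extract_audio_features(sample): return np.random.rand(8)
def extract_image_features(sample): return np.random.rand(8)

def fuse(text_f, audio_f, image_f):
    # Simplest possible fusion: concatenate the per-modality features.
    return np.concatenate([text_f, audio_f, image_f])

# Synthetic "video clips" with binary emotion labels (0 = negative, 1 = positive).
samples = [f"clip_{i}" for i in range(100)]
labels = np.random.randint(0, 2, size=100)

X = np.stack([fuse(extract_text_features(s),
                   extract_audio_features(s),
                   extract_image_features(s)) for s in samples])

clf = LogisticRegression().fit(X, labels)  # train the emotion model
print(clf.predict(X[:3]))                  # predict emotions of new samples
```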

2. Multimodal Datasets

Currently, multimodal emotion data in China and abroad come mainly from two sources. One part comes from publicly available audiovisual material on the internet, including online films, user comments, and live-stream clips. The other part is collected under specific experimental conditions through multi-sensor systems configured for different scenarios and subjects. Common datasets include CMU-MOSEI, YouTube, Weibo, and DEAP.

3. Feature Extraction

Feature extraction in multimodal emotion computing works the same way as in unimodal computing: features are extracted separately from each modality. For text, the mainstream methods at this stage are deep learning-based, including Word2Vec, BERT, and ELMo. For audio, common features include Mel spectrograms and Mel-frequency cepstral coefficients (MFCCs). For images, common features include gradient, color, texture, and shape features.
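As a small illustration of the audio side, the sketch below computes a Mel spectrogram and MFCCs with the librosa library on a synthetic tone; the sample rate, Mel band count, and coefficient count are arbitrary choices for demonstration.

```python
import numpy as np
import librosa

# Synthetic 1-second, 440 Hz tone standing in for a real speech recording.
sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# Mel spectrogram: short-time power spectrum warped onto the Mel scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)

# MFCCs: a compact cepstral summary of the Mel spectrogram, widely used
# as a low-dimensional emotion feature for speech.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mel.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```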

4. Multimodal Fusion

The biggest difference, and the central challenge, between multimodal and unimodal emotion computing is that the multimodal case requires fusing unimodal information. Multimodal fusion strategies fall into two categories: model-independent fusion and model-dependent fusion. Model-independent fusion includes feature-level, decision-level, and hybrid-level fusion, none of which relies on a specific deep learning method. Model-dependent fusion feeds the features of different modalities into different model structures for further feature extraction and relies on deep learning models to solve the fusion problem. Currently, decision-level fusion is the more common choice in industry.

Feature-level fusion, also known as early fusion, is performed at the feature level. After features are extracted from each modality, the per-modality feature vectors are combined through simple concatenation, addition, multiplication, or composite operations to form a combined representation. Finally, the combined features are input into a classifier for emotion classification (as shown in Figure 2).


Figure 2 Feature-Level Fusion
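A minimal early-fusion sketch using scikit-learn, assuming two synthetic, pre-extracted feature matrices: the per-modality vectors are concatenated and a single classifier is trained on the result.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(200, 16))   # synthetic pre-extracted features
audio_feats = rng.normal(size=(200, 12))
labels = rng.integers(0, 2, size=200)     # synthetic emotion labels

# Early fusion: concatenate per-modality feature vectors into one combined
# representation, then train a single classifier on it.
fused = np.concatenate([text_feats, audio_feats], axis=1)
clf = SVC(probability=True).fit(fused, labels)
print(clf.predict(fused[:5]))
```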

Decision-level fusion, also known as late fusion, performs fusion at the decision level. Each modality’s features are first classified separately, and the per-modality emotion predictions are then fused to obtain the final classification result (as shown in Figure 3).


Figure 3 Decision-Level Fusion
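By contrast, a late-fusion sketch under an analogous synthetic setup trains one classifier per modality and averages their predicted class probabilities; equal-weight averaging is just one simple combination rule among several (voting, weighted averaging, etc.).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(200, 16))   # synthetic pre-extracted features
audio_feats = rng.normal(size=(200, 12))
labels = rng.integers(0, 2, size=200)     # synthetic emotion labels

# Late fusion: train one classifier per modality, then combine decisions.
text_clf = LogisticRegression().fit(text_feats, labels)
audio_clf = LogisticRegression().fit(audio_feats, labels)

# Average the per-modality class probabilities to get the fused decision.
probs = (text_clf.predict_proba(text_feats) +
         audio_clf.predict_proba(audio_feats)) / 2
print(probs[:5].argmax(axis=1))
```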

Hybrid-level fusion combines feature-level and decision-level fusion, drawing on the strengths of each to offset the other’s weaknesses, though model complexity and implementation difficulty increase accordingly (as shown in Figure 4).


Figure 4 Hybrid-Level Fusion
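A hybrid-level sketch can combine the two strategies: an early-fusion classifier trained on concatenated features votes alongside the per-modality classifiers, with their probability outputs averaged at the decision level (the equal weighting is an illustrative assumption).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(200, 16))   # synthetic pre-extracted features
audio_feats = rng.normal(size=(200, 12))
labels = rng.integers(0, 2, size=200)     # synthetic emotion labels
fused_feats = np.concatenate([text_feats, audio_feats], axis=1)

# Hybrid fusion: a feature-level (early-fusion) classifier votes alongside
# the per-modality classifiers; probabilities are averaged at decision level.
paths = [
    (LogisticRegression().fit(fused_feats, labels), fused_feats),  # early-fusion path
    (LogisticRegression().fit(text_feats, labels), text_feats),    # unimodal paths
    (LogisticRegression().fit(audio_feats, labels), audio_feats),
]
probs = sum(clf.predict_proba(x) for clf, x in paths) / len(paths)
print(probs[:5].argmax(axis=1))  # fused emotion predictions
```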

Model-dependent fusion feeds the features of different modalities into different model structures for feature extraction, without requiring an explicit analysis of each modality’s importance; instead, suitable models are built around the characteristics of each modality to jointly learn the relevant information (as shown in Figure 5). Its most distinctive property, compared with feature-level and decision-level fusion, is the flexibility to choose where fusion takes place.


Figure 5 Model-Dependent Fusion
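A minimal PyTorch sketch of model-dependent fusion, with two assumed modalities and arbitrary layer sizes: each modality gets its own encoder, and their hidden representations are joined inside the network, so the fusion point is a design choice rather than being fixed at the input or the output.

```python
import torch
import torch.nn as nn

class ModelFusionNet(nn.Module):
    """Each modality has its own encoder; fusion happens mid-network."""
    def __init__(self, text_dim=16, audio_dim=12, hidden=32, n_classes=2):
        super().__init__()
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Fusion head: joint learning over the concatenated hidden states.
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_classes))

    def forward(self, text_x, audio_x):
        h = torch.cat([self.text_enc(text_x), self.audio_enc(audio_x)], dim=-1)
        return self.head(h)

model = ModelFusionNet()
logits = model(torch.randn(4, 16), torch.randn(4, 12))  # a batch of 4 samples
print(logits.shape)  # torch.Size([4, 2])
```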

5. Emotion Model Description

Emotion model description is the foundation and prerequisite of emotion recognition: a mathematical model of emotional states makes it possible to describe and understand emotions more concretely. Based on how emotions are described, emotion models can be divided into discrete models, dimensional models, and other models (see Table 1); in practice, discrete and dimensional models are the most widely used. Discrete emotion models describe emotion as a set of basic, mutually independent states. They align closely with human intuition and draw clear distinctions between emotions, but are limited when emotions need to be quantified. Dimensional emotion models represent emotional states as vectors in a continuous multi-dimensional emotion space; they can represent a wide range of emotions and describe how emotions evolve. Two- and three-dimensional models are the most common, while four-dimensional models are relatively abstract and complex and see little use. Each type of model has its advantages and disadvantages, and the choice depends on the task and scenario at hand.

Table 1 Emotion Models

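To make the dimensional representation concrete, the sketch below places a few discrete emotion categories in a two-dimensional valence-arousal space and maps a continuous point back to the nearest discrete label. The coordinates are rough illustrative assumptions, not values from any standard emotion model.

```python
# Illustrative only: rough (valence, arousal) coordinates for a few
# discrete emotion categories; both axes range over [-1, 1].
valence_arousal = {
    "happy": ( 0.8,  0.6),
    "angry": (-0.7,  0.8),
    "sad":   (-0.6, -0.5),
    "calm":  ( 0.4, -0.6),
}

def nearest_discrete(v, a):
    """Map a continuous (valence, arousal) point to the closest discrete label."""
    return min(valence_arousal,
               key=lambda e: (valence_arousal[e][0] - v) ** 2 +
                             (valence_arousal[e][1] - a) ** 2)

print(nearest_discrete(0.7, 0.5))  # -> "happy"
```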

2. Potential Financial Application Scenarios of Multimodal Emotion Computing

Multimodal emotion computing has broad application prospects in the banking industry, helping to enhance customer experience, improve marketing effectiveness, and refine risk management. With the promulgation of personal information protection laws and regulations and the gradual formation of a social consensus, banks must collect user information strictly in accordance with the law, adhere to privacy protection principles, and ensure that users are fully informed and have given explicit authorization.

1. Customer Satisfaction Assessment

Traditional customer satisfaction evaluation relies mainly on customers filling out questionnaires and similar forms. By introducing multimodal emotion computing, the system can automatically perceive customers’ emotions during service, enabling automated, unobtrusive assessment. This not only simplifies the evaluation process and saves customers’ time, but also reflects customers’ true feelings more accurately, helping banks improve their services and continuously optimize the customer experience.

2. Intelligent Customer Service

By integrating multimodal emotion computing capabilities, intelligent customer service systems can understand customer needs more deeply and sense and adapt to changes in customer emotion in real time, providing more personalized and attentive service. For instance, when the system detects that a customer is troubled or anxious, it can adjust its service approach, respond more patiently and thoroughly, and hand over seamlessly to a human agent when necessary, ensuring that every customer receives appropriate attention and assistance and enhancing comfort and satisfaction throughout the service process.

3. Precise Marketing

With multimodal emotion computing, banks can grasp customers’ true intentions more comprehensively, enabling precise targeting in areas such as sales-script mining and product recommendation, and building smarter, better-matched marketing strategies that effectively enhance user satisfaction and recognition. For example, when the system identifies that a customer shows strong interest in a particular product or service, the bank can respond quickly and push relevant product information, improving marketing efficiency and helping customers find products that meet their needs more quickly.

4. Intelligent Fraud Prevention

Many banks have begun to adopt video signing services in credit business. However, non-face-to-face communication may make it difficult for banks to accurately judge customers’ actual intentions, increasing fraud risk. Integrating multimodal emotion computing technology into loan application and approval processes as a key component of risk assessment models can assist customer managers in more accurately identifying potential fraud risks, achieving risk warning and prevention, thereby strengthening the safe and sound operation of credit business.

5. Customer Service Quality Inspection

Traditional customer service quality inspection relies mainly on staff manually listening to and evaluating service dialogues, which is labor-intensive and time-consuming. Multimodal emotion computing allows banks to conduct quality inspection automatically and precisely, effectively shortening inspection cycles. It also enables deeper exploration of problem areas in the service process, scientific assessment of agent performance, and identification of the strengths and weaknesses of service strategies, helping banks optimize service processes, mitigate risks, and ultimately enhance customer satisfaction and overall service quality.

3. Challenges and Future Trends of Multimodal Emotion Computing

Although technical research and application practice are advancing in parallel, multimodal emotion computing still faces many challenges in the banking industry. On the technical level, algorithm accuracy and robustness need sustained work so that models can handle the diverse emotional expressions found in complex financial scenarios, and the real-time processing of large volumes of multimodal data demands substantial computing power, placing heavy requirements on existing hardware and software. On the application level, although some financial enterprises have begun using multimodal emotion computing to enhance customer experience, large-scale deployment remains rare; most applications are limited to specific scenarios, and accuracy on complex emotional expressions still needs improvement. In terms of security and compliance, data privacy protection is a particularly prominent concern: when collecting and using customer emotion data, banks must strictly comply with relevant laws and regulations, establish a sound data security management system, and effectively protect customers’ privacy rights.

In the future, multimodal emotion computing is expected to integrate deeply with advanced intelligent technologies, combining sophisticated models, multimodal fusion, and edge computing to make banking services markedly more intelligent and human. First, deep learning, as the core driving force behind multimodal emotion computing, will continue to improve model accuracy and efficiency: with continuously optimized deep neural network architectures, systems can extract deep emotional features from data in modalities such as voice, text, facial expressions, and body language, enabling precise judgment of and real-time response to customers’ emotional states. Second, progress in multimodal fusion technology will further advance the field: combining data from different modalities, such as sound, images, and text, yields more comprehensive emotional analysis. Finally, edge computing is expected to significantly improve speed and efficiency: by performing preliminary processing on local devices and transmitting only the necessary information to the cloud, latency can be reduced and privacy protected, keeping emotional analysis responsive even under poor network conditions.

Recently, multimodal fusion technology has rapidly developed, and its powerful capabilities can assist in improving the quality and efficiency of multimodal emotion computing. Looking ahead, multimodal emotion computing is expected to become a core competitive element for banks, driving them to build a more intelligent and humane service system, providing customers with warmer and more considerate financial services.

This article is reproduced from the WeChat public account “Our Happiness”

