Source: ZHUAN ZHI
This article is a paper introduction; recommended reading time: 5 minutes.
In the real world, information spans diverse modalities. Understanding and exploiting these varied data types to improve retrieval systems is a key research focus. Multimodal composite retrieval integrates multiple modalities such as text, images, and audio to deliver more accurate, personalized, and contextually relevant results. To promote a deeper understanding of this promising direction, this review surveys multimodal composite editing and retrieval, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval methods, and systematically organizes application scenarios, methods, benchmarks, experiments, and future directions. In the era of large models, multimodal learning is a hot topic, and several reviews on multimodal learning and Transformer-based vision-language models have been published in the journal PAMI. To our knowledge, this review is the first comprehensive literature survey on multimodal composite retrieval, providing a timely supplement to existing multimodal fusion reviews. To help readers quickly track progress in this field, we maintain a project page for this review at: https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval

Keywords: multimodal composite retrieval, multimodal fusion, image retrieval, image editing.
Introduction
In today’s digital environment, information is transmitted through diverse channels such as text, images, audio, and radar, leading to a significant increase in data volume and complexity. As data grows exponentially, processing and integrating this heterogeneous information becomes a crucial challenge, and efficiently retrieving personalized, relevant results is increasingly difficult. Traditional single-modal retrieval methods [37], [49], [55], [83], [86], [87], [226]–[228], [237], [239] rely on a single modality, such as images or text, as queries. However, these methods often struggle to fully capture the complexity and nuance of real-world information retrieval scenarios. This limitation has motivated multimodal composite image retrieval [11], [21], [28], [88], [106], [172], [190], a promising framework that transcends the boundaries of any single modality. By leveraging the complementary strengths of different data types, multimodal composite retrieval systems better understand user queries and context, improving retrieval performance and user satisfaction.

As shown in Figure 1, multimodal composite retrieval fuses and analyzes diverse data forms such as text, images, and audio to perform information retrieval. This approach is valuable in many real-world scenarios, including multimedia content [80], social media platforms, and e-commerce [59], [70], [150], [194], [203]. Its applications also extend to specialized fields such as medical image retrieval [19], [65], [144], document retrieval [72], [80], and news retrieval [178]. By supporting diverse multimodal queries, these techniques can return flexible and accurate results, enhancing user experience and supporting more informed decision-making. Multimodal composite retrieval therefore holds substantial potential and research value in information science, artificial intelligence, and interdisciplinary applications.

Most existing multimodal composite retrieval methods [4], [11], [27], [28], [77], [85], [88], [106], [115], [132], [190] focus on integrating images and text to achieve the desired results. Early methods employed Convolutional Neural Networks (CNNs) for image encoding and Long Short-Term Memory (LSTM) networks [108] for text encoding. With the rise of powerful Transformer models such as the Vision Transformer (ViT) [186], Swin Transformer (Swin) [128], and BERT [102], numerous Transformer-based multimodal composite retrieval methods [184], [208] have been proposed to improve image retrieval performance. Furthermore, Vision-Language Pre-training (VLP) [94], [120], [121], [158] has transformed image understanding and retrieval by bridging the semantic gap between textual descriptions and visual content, and various VLP-based multimodal composite image retrieval methods [11], [85], [132] have shown promising results. In addition, image-text composite editing methods [31], [39], [46], [71], [118], [119], [126], [152], [232] let users directly modify images or generate new content through natural language instructions, achieving precise results that align closely with user intent. Exploration of other modalities, such as audio [2] and actions [215], is also accelerating.
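To make the image-text composite retrieval setup concrete, the sketch below shows one common pattern: encode a reference image and a modification text separately, fuse them into a single query embedding (here with a simple gated residual fusion, loosely in the spirit of TIRG-style composition), and rank gallery images by cosine similarity. This is a minimal illustrative sketch with toy stand-in encoders, not the architecture of any specific method covered by the survey; all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositeQueryEncoder(nn.Module):
    """Toy image-text composite query encoder (illustrative only).

    Real systems typically use pretrained backbones (e.g., ViT/BERT or a
    VLP model such as CLIP); tiny MLPs stand in here so the sketch runs
    anywhere without downloads.
    """
    def __init__(self, img_dim=512, txt_dim=300, emb_dim=256):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        # Gated residual fusion: the text decides how much of the reference
        # image to keep and what modification to add.
        self.gate = nn.Sequential(nn.Linear(2 * emb_dim, emb_dim), nn.Sigmoid())
        self.res = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, img_feat, txt_feat):
        v = self.img_enc(img_feat)                  # reference-image embedding
        t = self.txt_enc(txt_feat)                  # modification-text embedding
        joint = torch.cat([v, t], dim=-1)
        q = self.gate(joint) * v + self.res(joint)  # composed query embedding
        return F.normalize(q, dim=-1)

if __name__ == "__main__":
    enc = CompositeQueryEncoder()
    img = torch.randn(1, 512)      # stand-in for a CNN/ViT image feature
    txt = torch.randn(1, 300)      # stand-in for an LSTM/BERT text feature
    query = enc(img, txt)

    gallery = F.normalize(torch.randn(1000, 256), dim=-1)  # candidate embeddings
    scores = gallery @ query.squeeze(0)                    # cosine similarities
    print("top-5 gallery indices:", scores.topk(5).indices.tolist())
```

In practice the gallery embeddings come from the same (or an aligned) image encoder, and the whole pipeline is trained with a contrastive or triplet loss so that the composed query lands near its target image.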
Motivation
Despite extensive research on multimodal composite retrieval models, new challenges continue to emerge and remain to be addressed. In this rapidly evolving field, there is an urgent need for comprehensive and systematic analysis. This review aims to facilitate a deeper understanding of multimodal composite editing and retrieval by systematically organizing application scenarios, methods, benchmarks, experiments, and future directions. We review and categorize over 130 advanced multimodal composite retrieval methods, laying a solid foundation for further research.
Literature Collection Strategy
To ensure a comprehensive overview of multimodal composite retrieval, we adopted a systematic search strategy covering a wide range of relevant literature. Our focus includes innovative methods, applications, and advancements in multimodal retrieval systems. We selected keywords such as “multimodal composite retrieval,” “multimodal learning,” “image retrieval,” “image editing,” and “feature fusion,” which span the foundational concepts, specific techniques, and emerging trends of this field. We searched well-known academic databases, including Google Scholar, DBLP, ArXiv, ACM, and IEEE Xplore, collecting journal articles, conference papers, and preprints. To refine this selection, we excluded studies that primarily focus on single-modal methods or irrelevant modalities and manually reviewed the relevance and quality of the remaining literature. In the final selection step, we evaluated each paper based on its contribution and impact to identify the key works for in-depth analysis. By applying these criteria, we aim to provide a comprehensive perspective on the current landscape and future directions of multimodal composite retrieval.
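As a small illustration of this kind of keyword-driven collection (not the authors' actual tooling), the snippet below queries the public arXiv export API for two of the listed keywords using only the Python standard library and prints the returned titles for a quick relevance pass; the exact query string and result limit are assumptions.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Illustrative keyword query against the public arXiv API.
query = 'all:"multimodal composite retrieval" OR all:"image retrieval"'
params = urllib.parse.urlencode({
    "search_query": query,
    "start": 0,
    "max_results": 20,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})
url = f"http://export.arxiv.org/api/query?{params}"

with urllib.request.urlopen(url) as resp:
    feed = resp.read()

# The API returns an Atom feed; extract entry titles for manual screening.
ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(feed)
for entry in root.findall("atom:entry", ns):
    print(entry.find("atom:title", ns).text.strip())
```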
Classification
To structure the discussion of multimodal composite editing and retrieval, we classify existing work into three categories based on application scenarios: 1) image-text composite editing, 2) image-text composite retrieval, and 3) other multimodal composite retrieval, as shown in Figure 2. Specifically, image-text composite editing modifies images or creates new content through natural language instructions, allowing users to convey their intentions clearly and intuitively. Image-text composite retrieval takes both text and image information as the query to search for personalized results, for example locating relevant images guided by a textual description or generating descriptive text for an image, thereby enhancing the search experience. Other multimodal composite retrieval tasks take combinations of further modalities, such as audio and actions, as inputs, providing a richer and more flexible context-aware retrieval experience.
Contributions
In summary, our contributions are as follows:
- To our knowledge, this article is the first comprehensive review of multimodal composite retrieval, aiming to provide a timely overview and valuable insights for this rapidly evolving field, serving as a reference for future research.
- We systematically organize research findings, technical methods, benchmarks, and experiments to aid understanding of this topic, and provide extensive coverage of existing research through a multi-level classification that serves diverse reader needs.
- We discuss the challenges and unresolved issues in multimodal composite retrieval, identify emerging trends, and propose feasible future research directions to drive innovation in this field.
Organization of the Paper
The rest of this article is organized as follows. Section two introduces the foundational concepts and applications related to multimodal composite retrieval, laying the groundwork for the methods discussed later. Section three delves into the various methods used in this field, categorizing them by their underlying principles and analyzing their advantages and disadvantages. Section four outlines the benchmarks and experimental setups used to evaluate these methods and presents results from recent research. Section five discusses the current state of multimodal composite retrieval, highlighting challenges and proposing future research directions. Finally, Section six summarizes the key findings and emphasizes the importance of this field for future research.