Ant Group’s Technical Exploration in Video Multimodal Retrieval

Introduction

This article shares the research achievements of Ant Group’s multimodal cognitive team in the field of video multimodal retrieval over the past year. It focuses on how to improve the effectiveness of video-text semantic retrieval and how to perform video-to-video source retrieval efficiently.

Main Sections Include:

1. Overview

2. Video-Text Semantic Retrieval

3. Video-Video Source Retrieval

4. Conclusion

5. Q&A

Speaker: Guo Qingpei, Senior Algorithm Expert, Ant Group

Editor: Zhang Jindong

Content Proofreader: Li Yao

Community Produced by: DataFun

01

Overview
Video multimodal retrieval has a wide range of applications within Ant Group. It specifically includes two directions: video-text semantic retrieval and video-video source retrieval.

The video-text semantic retrieval direction aims to retrieve videos semantically related to the search text, even if the search text does not appear verbatim in the video's description. For example, in the Alipay search bar, users expect to find relevant video content through text searches; in security scenarios, sensitive videos can be located the same way. The search text is usually short.
The other direction is video-video source retrieval. Source retrieval can find segments related to the query video in the video database, which has wide applications in real scenarios. For example, during video procurement, it can prevent purchasing existing videos, thus reducing procurement costs; in video copyright protection, when a user provides a short video, it is necessary to check through a massive video database to determine whether the video infringes copyright.

Methods to quickly enhance video-text semantic retrieval include video-text pre-training, focusing on difficult samples, and introducing fine-grained recognition. We use the R@sum metric on the MSRVTT text-video retrieval dataset to measure the effectiveness of semantic retrieval algorithms; R@sum is the sum of top-1 recall (R@1), top-5 recall (R@5), and top-10 recall (R@10). By adopting video-text pre-training, we achieved a 24.5% improvement in R@sum; introducing a focus on difficult samples improved R@sum by a further 8.1%; and introducing fine-grained recognition added another 2.8%. Additionally, in the field of video source retrieval, we independently developed a video infringement detection method. Based on this method, we saved 85% of storage space and achieved an 18-fold speedup in infringement retrieval, while the top-1 F1-score improved by 2.78% compared to traditional video retrieval methods. Next, we elaborate on our improvements to video-text semantic retrieval and video-video source retrieval.
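For concreteness, R@sum is just the sum of the three recall values. A minimal sketch of computing it from the rank of each query's ground-truth video (the function names and toy data are illustrative, not our evaluation code):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth video is ranked within the top-k (as a percentage)."""
    ranks = np.asarray(ranks)          # 0-based rank of the correct video for each text query
    return float(np.mean(ranks < k)) * 100.0

def r_sum(ranks):
    """R@sum = R@1 + R@5 + R@10, the summary metric used for retrieval quality."""
    return sum(recall_at_k(ranks, k) for k in (1, 5, 10))

# Toy example: 4 queries whose correct videos were ranked 0th, 3rd, 7th, and 20th.
print(r_sum([0, 3, 7, 20]))   # R@1=25, R@5=50, R@10=75 -> R@sum=150
```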
02

Video-Text Semantic Retrieval

In the past year, we conducted research in three areas regarding video-text semantic retrieval to improve the effectiveness of video-text semantic retrieval: video-text pre-training, focusing on difficult samples, and introducing fine-grained recognition.
1. Video-Text Pre-Training
The first key advancement is the video-text pre-training technology. Before elaborating on this, let’s first explain what “video-text pre-training” means.

Pre-training is the phase before formal fine-tuning. It mainly uses large-scale, unsupervised video-text pair data for semantic alignment training, so that the learned representations transfer well to downstream tasks. Through pre-training, we expect the model to perform well on a variety of downstream tasks, such as video-text retrieval, VQA (video question answering), and video captioning.
Before diving into the pre-training task, it is essential to understand two things: where the video-text pair data comes from, and how to interpret the text associated with a video. Typically, a video has two text sources. One is the title or description of the video, which usually summarizes the entire video, such as the title text attached to each video in a short-video app. The other is the audio accompanying each segment of the video, which is converted into text through automatic speech recognition (ASR). Based on the start and end times of each ASR segment, the corresponding video clip can be treated as the visual counterpart of that ASR text, thereby establishing a relationship between video and text. We constructed a large-scale unsupervised video-text pair dataset based on these two types of associated data and performed pre-training on it. The pre-trained model can then serve as the initialization for various downstream tasks, significantly improving their performance.
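To make the pairing concrete, here is a minimal sketch of turning a video's title and timed ASR segments into clip-text pairs; the data structures, field names, and the minimum-length filter are illustrative assumptions rather than our actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class AsrSegment:
    start: float   # seconds
    end: float
    text: str

def build_clip_text_pairs(video_id, title, asr_segments, min_chars=4):
    """Turn a video's title and timed ASR segments into (clip, text) training pairs."""
    pairs = []
    # 1) The title describes the whole video -> pair it with the full clip.
    if title:
        pairs.append({"video": video_id, "span": None, "text": title})
    # 2) Each ASR segment describes only the clip inside its time interval.
    for seg in asr_segments:
        if len(seg.text) >= min_chars:   # drop fragments too short to carry meaning
            pairs.append({"video": video_id,
                          "span": (seg.start, seg.end),
                          "text": seg.text})
    return pairs

pairs = build_clip_text_pairs(
    "vid_001", "How to make dumplings",
    [AsrSegment(0.0, 6.2, "first prepare the dough"),
     AsrSegment(6.2, 14.8, "then roll out the wrappers")])
```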

Most short-video scenarios in China primarily target Chinese users, and we currently face two major challenges in Chinese video-text pre-training. First, there is a lack of publicly available Chinese video-text pre-training datasets; the datasets commonly used in academia, such as HowTo100M and WebVid, are mostly in English. To address this, we constructed the first open-access Chinese video-text pre-training dataset in the industry, published at CVPR 2023. Second, model design must focus on cross-modal interaction to achieve deeper interaction and fusion between video and text, thereby improving video-text retrieval. For this we proposed a new model that strengthens cross-modal interaction between video and text, named SNP-S3, published in the IEEE T-CSVT journal in 2023.
Let’s first introduce the main research achievements of the first part. We proposed the first publicly released Chinese video-text pre-training dataset, which significantly improves the performance of Chinese video-text retrieval models when pre-trained on this dataset.

The main work includes three parts: first, constructing a large-scale public Chinese video-text dataset CNVid-3.5M; second, employing effective data processing methods to filter out low-matching video-text pairs, significantly improving data quality; finally, we pre-trained on CNVid-3.5M, validating that our proposed CNVid-3.5M significantly enhances the effectiveness of Chinese video-text pre-training and establishing a benchmark on this dataset. The entire process is illustrated in the image above.

Next, let’s discuss the dataset construction process. We collected original videos from multiple Chinese video websites. When collecting videos, we paid special attention to the category and theme of the current video and tried to maintain balance among various categories and themes. We successfully constructed 4.5 million original Chinese video-text pairs. The image above shows the word cloud generated from the keywords corresponding to the video.

After collecting the data, the next step is data cleaning, i.e., filtering out the relatively low-quality video-text pairs in the dataset. The original video-text pairs are not strictly semantically aligned; for example, text transcribed from background music may have no clear semantic relationship with the visual content and effectively contaminates the training data. We therefore try to filter out such irrelevant video-text pairs. To this end, we proposed a method for cleaning video data with a vision-text pre-training model. The implementation is as follows: we use a trained vision-text relevance model (CLIP) to assess the relevance between the text and each key frame of the video, then aggregate the per-frame relevance scores into an overall relevance score. By thresholding this overall relevance, we filter out videos with low relevance. In this way, we removed nearly 1 million low-quality video-text pairs, retaining about 3.5 million Chinese video-text pairs.
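A minimal sketch of this frame-level relevance filtering, assuming an off-the-shelf CLIP checkpoint from Hugging Face; the model actually used for Chinese data, the pair fields, and the relevance threshold are all placeholders here:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # placeholder checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def video_text_relevance(keyframes, text):
    """Score the text against each key frame with CLIP, then aggregate over frames."""
    inputs = processor(text=[text], images=keyframes, return_tensors="pt",
                       padding=True, truncation=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    frame_scores = (img_emb @ txt_emb.T).squeeze(-1)   # cosine similarity per key frame
    return frame_scores.mean().item()                  # aggregate: mean over frames

def filter_pairs(pairs, threshold=0.2):
    """Keep only pairs whose aggregated relevance exceeds the (illustrative) threshold.

    Each pair is assumed to carry a list of PIL key frames and its text."""
    return [p for p in pairs if video_text_relevance(p["keyframes"], p["text"]) >= threshold]
```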

The image above shows the basic statistics of the CNVid-3.5M dataset we constructed. As of this talk, CNVid-3.5M is the largest publicly available Chinese video-text pre-training dataset in the industry.

Next, based on the CNVid-3.5M dataset, we constructed a benchmark to observe whether various model architectures show improvement when pre-trained on our constructed dataset.

The image above displays detailed experimental results from three stages. During dataset construction, we first pre-trained on the original collected data. The results show that, compared to no pre-training, the overall R@sum metric on the translated Chinese MSRVTT dataset improved significantly, by 17.7%. The table also shows that in the filtering stage, when the dataset was reduced from 4.5 million to 3.5 million pairs, overall model performance actually improved despite the smaller amount of pre-training data.

The second difficulty in Chinese video-text pre-training lies in model design, which needs to focus on cross-modal interaction. To address this issue, we proposed a model called SNP-S3 that enhances cross-modal interaction between video and text. S3 refers to the enhancement of important semantic information, designed to address the following two shortcomings of traditional pre-training.

Traditional pre-training generally uses a masked language modeling task on a cross-modal encoder, along with a global vision-text matching task. As shown in the image above, one issue with the traditional masked language modeling (MLM) task is that when the masked token is an unimportant word, such as a quantifier, the model can restore it from grammatical knowledge alone, without looking at the video. However, when the masked word is a keyword, the model must look at the video to recover it. For instance, if a boy is wearing a red shirt and “red” is masked, the model cannot reconstruct it without seeing the visual input. Forcing the model to infer the masked content from the given visual input therefore strengthens the interaction between modalities.
Another issue with traditional vision-text matching tasks is that they focus more on global alignment, where vision and text align semantically at the sentence level. Sentence-level alignment is a global granularity that lacks local information. For example, if a critical word like “red” is removed from a sentence, the model can still perform well in matching it with the video. This means that the retrieval model does not have fine-grained discrimination capability. Attributes like “red” and some verbs require more fine-grained capability. We hope to make the model more sensitive to these fine-grained pieces of information based on traditional global matching. Thus, we introduced keyword matching to match more critical words in the sentence, such as nouns, verbs, and adjectives, with the video to enhance the model’s fine-grained recognition ability.
These two improvements, masking significant words in masked language modeling (Masked Significant Semantic Modeling, MSSM) and adding a local visual-word matching objective (LVWM) on top of global matching, both serve S3’s goal of emphasizing significant semantics.

Here we introduce the specific implementation of S3. The MSSM task masks key semantic words, forcing the model to rely on the given visual input to reconstruct them; the LVWM task adds a matching objective between the visual input and individual words. Specifically, whereas standard masked language modeling selects words to mask at random with a fixed probability, MSSM needs to select important words. Important words can be identified in two ways: using a part-of-speech tagging tool, or measuring word frequency over the entire dataset with a TF-IDF-like statistic. We combine the two: to count as important, a word must be a noun, verb, or adjective, and its frequency must not be particularly high (the lower the frequency, the higher the IDF and the more informative the word). Words selected this way are the ones we mask. The other improvement is the matching between visual input and words: the keywords selected in the first step are matched against the visual signal, giving each word a similarity score with the visual input. We then aggregate these word-level similarities into a sentence-level alignment score, constructing a similarity matrix. This matrix is added to the loss function alongside the global visual-text matching similarity matrix for joint optimization.
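A minimal sketch of the significant-word selection described above, using NLTK for part-of-speech tagging and a simple IDF statistic; the specific tools and the IDF cut-off are illustrative assumptions, not the exact implementation:

```python
import math
from collections import Counter

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are downloaded

CONTENT_TAGS = ("NN", "VB", "JJ")   # nouns, verbs, adjectives

def build_idf(corpus_texts):
    """Document frequency -> inverse document frequency over the pre-training corpus."""
    df = Counter()
    for text in corpus_texts:
        df.update(set(nltk.word_tokenize(text.lower())))
    n = len(corpus_texts)
    return {w: math.log(n / (1 + c)) for w, c in df.items()}

def significant_words(text, idf, idf_threshold=2.0):
    """Words that are (a) content words by POS and (b) rare enough (high IDF) to be informative."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [w for w, tag in tagged
            if tag.startswith(CONTENT_TAGS) and idf.get(w.lower(), 0.0) >= idf_threshold]

# MSSM would then mask these selected words instead of uniformly random tokens.
```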

We conducted a quantitative analysis of the S3 method, leading to the following conclusions:
  • MSSM outperforms traditional MLM across various model structures, so it can directly replace the previous MLM task.
  • The LVWM task can only supplement the traditional global video-text matching (GVTM) task, not replace it. When LVWM is added on top of GVTM, B3 performs better than B1 and B7 outperforms B5, showing that it is a useful supplement of local information.
  • Moreover, the two core components proposed in S3 are model-agnostic. The comparisons B1 vs. B4 and B5 vs. B8 use different model structures, such as ResNet50 and PVT, so the two strategies can be applied to any backbone. With the S3 strategy, we improve R@sum by 6.8%.

We also conducted a qualitative analysis of the S3 method. After adding S3, given the input text, the model focuses on the regions visually more relevant to the text. The image above shows specific examples, such as an image of a person surfing at sea. We can see that the baseline attention region is dispersed and lacks much semantic relevance, but after adding the S3 method, it focuses on the person and the background waves.
This concludes the introduction to video-text pre-training, covering two main aspects: how to construct a Chinese video-text pre-training dataset and how to enhance the interaction level between video and text in model design. These two optimizations can significantly improve the effectiveness of video-text semantic retrieval.

2. Focusing on Difficult Samples

Next, we will continue to share how focusing on difficult samples can further enhance the effectiveness of video-text semantic retrieval. Focusing on difficult samples can improve R@Sum by nearly 8.1%.

The key point is that the model should gradually learn to pay attention to difficult samples during training. Difficult samples may not help at the beginning; early in training they can even hinder convergence, but once the model has converged to a reasonable degree, focusing on difficult samples further improves performance.
Focusing on difficult samples mainly has two ideas: one is to manually specify the attention level for difficult samples, such as setting different attention levels for difficult samples based on different stages of model training; the other is to allow the model to adaptively learn the attention level for difficult samples. Our team has explored both aspects.
First, let’s introduce the strategy of manually specifying attention levels for difficult samples, which mainly uses a curriculum learning-based difficult sample mining approach, published at CVPR 2023.

The samples during the training process can be roughly divided into good samples, difficult samples, and noise samples. Good samples refer to video-text pairs with relatively high semantic alignment quality, where the text can clearly describe the content corresponding to the current video segment. Difficult samples refer to video-text pairs that are semantically aligned, but the semantics expressed by the text are weakly related to the video, though there is some relevance. Noise samples refer to video-text pairs where the video and corresponding text have almost no relevance, such as when the lyrics in the audio of the video do not significantly relate to the video’s semantics. We define such video-text pairs with low relevance as noise samples.

These three types of samples play different roles during training. First, noise samples negatively impact training, both at the beginning and the end, so they need to be discarded directly; for high-quality samples, the model should focus more on them during the initial training phase to accelerate convergence; for difficult samples, the model should focus more on them once it has converged to a certain extent and already performs well, allowing the model to learn difficult cases better and further enhance performance. However, if difficult samples are focused on too early, it may cause the model to learn incorrectly and fail to converge well.

Based on this observation, we designed a difficult case curriculum learning algorithm. The core idea of the algorithm is to allow the model to focus more on good samples at the beginning, and after the model has developed a certain discriminative ability, it attempts to mine difficult cases, enabling the model to focus more on challenging samples in the later stages of training.
The specific approach is illustrated in the image above. First, we constructed a similarity matrix between videos and texts using contrastive learning, where the diagonal elements are positive samples and the off-diagonal elements are negative samples. Next, we assess the similarity values of the positive samples on the diagonal to determine whether the current positive sample is a difficult sample or a simple sample. Generally speaking, if the similarity of the positive sample is high, it is likely a simple sample. This way, we measure difficult samples along the column dimension. At the same time, we also measure difficult samples along the row dimension, where each row represents the similarity between the current text and all videos in the current batch. We extract all negative samples, and if the current text has a high similarity with a negative sample, we consider the current video-text sample to be a difficult case. Next, we combine the measurements based on rows and columns to construct the weight for the VTM (video-text matching) loss. This weight is derived from the combined row and column weighting, with the weight coefficient adjusted through curriculum learning. At the start, the weight is set to 0, meaning no difficult sample mining loss is added; as training progresses, the weight of the loss part gradually increases, allowing the model to focus more on difficult samples.
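A minimal sketch of the idea, assuming an in-batch text-video similarity matrix with positives on the diagonal; the hardness measure, temperature, and schedule below are illustrative stand-ins for the actual HSCL formulation:

```python
import torch
import torch.nn.functional as F

def curriculum_weight(step, total_steps, max_w=1.0):
    """Curriculum schedule: 0 at the start of training, ramping up to max_w."""
    return max_w * min(1.0, step / max(1, total_steps // 2))

def hard_sample_vtm_loss(sim, step, total_steps, tau=0.05):
    """Contrastive loss whose hard-sample term is phased in by curriculum learning.

    sim: [B, B] text-video similarity matrix; sim[i, i] are the positive pairs."""
    labels = torch.arange(sim.size(0), device=sim.device)
    base = 0.5 * (F.cross_entropy(sim / tau, labels) + F.cross_entropy(sim.t() / tau, labels))

    pos = sim.diag()                                                     # positive similarities
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Row-wise hardness: how far the best negative video rises above the positive, per text.
    row_hard = (sim.masked_fill(~neg_mask, -1e4).max(dim=1).values - pos).clamp(min=0)
    # Column-wise hardness: the same measure per video, over texts.
    col_hard = (sim.masked_fill(~neg_mask, -1e4).max(dim=0).values - pos).clamp(min=0)
    hard_term = (row_hard + col_hard).mean()

    w = curriculum_weight(step, total_steps)                             # 0 early on, larger later
    return base + w * hard_term
```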

We analyzed the performance of the current model after incorporating the HSCL difficult sample curriculum learning loss. We used two datasets: one for Chinese pre-training and fine-tuning, CNVid-3.5M, and the other for English pre-training and fine-tuning, COCO+VG+CC. The experiments showed that after introducing the difficult sample curriculum learning method, the R@SUM metric for text-video retrieval improved by about 5%.
The above introduces the manual method of specifying attention levels for difficult samples, which is not automated and requires hyperparameter adjustments. We hope to allow the model to adaptively learn the attention level for difficult samples, so we designed an adaptive method. The following DMAE and NegNCE methods were published at the ACM Multimedia 2023 conference. This method can lead to a 3.1% improvement in R@SUM.

Next, let’s introduce the motivations for DMAE and NegNCE.
DMAE is a dual-modal attention-enhancement module. Its core idea is to find as many difficult negatives as possible, mainly in two ways; the first is to expand the boundary between simple and difficult samples. As shown in the figure on the right (from b to c), adding DMAE on top of NegNCE introduces more difficult negative samples through hard-example mining, thereby enhancing the model’s discrimination ability. The core idea of NegNCE is that, once all these difficult samples have been found, we need to identify which ones truly deserve attention: some are already handled by the InfoNCE contrastive loss, while the remaining ones can be addressed by adding an auxiliary NegNCE loss to the training objective, letting the model dynamically focus on them.
Traditional InfoNCE concentrates on positive samples, pulling positives closer while pushing negatives away, without treating difficult negatives specially. By introducing NegNCE, we make the model focus on difficult negatives explicitly. In the case illustrated in the image, difficult negatives lie close to the decision boundary; although they are negatives, their similarity to the current anchor may be higher than that of the positive. NegNCE gradually pushes such negatives away. DMAE brings more of these negatives within the scope of the loss, while NegNCE works to separate them, so the cooperation of the two losses lets the model adaptively focus on difficult samples during training.

DMAE primarily has two aspects of work: one is on the text side, as the expression of text sentences contains much redundancy. Therefore, we aim to enable the model to focus more on key words in the sentence. These keywords must first be important words, such as nouns, verbs, and adjectives, and secondly, they should have lower word frequencies. Such representative words carry more information. By combining these two aspects, we select keywords from the text, giving higher weights to these keywords during text attention processing.
The other aspect is on the visual side. Unlike images, a video contains many key frames with a certain amount of redundancy: temporally adjacent frames are often semantically and visually very similar. If a frame yields a difficult negative, a similar neighboring frame is likely to yield similar difficult negatives as well, so we take the union of the difficult negatives mined from the current frame and from its most similar frame.

The specific implementation involves first calculating the similarity matrix between videos and texts, followed by weighting on the text side. The text-side weight is primarily determined by the part of speech and frequency of words. On the visual side, the weight primarily involves calculating the similarity matrix between the current video key frames and itself. Next, we retain the top scores, for example, identifying the most similar frames to the current frame as difficult samples. This way, we can construct a similarity matrix that can identify more difficult cases. The more difficult the sample, the higher the score it will ultimately receive in the similarity matrix.

After identifying more difficult cases, we want the model to distinguish them dynamically during training, so we explicitly introduce the NegNCE loss. The traditional way to learn video-text similarity is the InfoNCE loss, whose numerator contains the positive pair and whose denominator sums over the positive and all negative pairs. In pulling positives closer and pushing negatives away, InfoNCE treats all negatives equally.
During training, NegNCE first identifies which are negative samples. For the same text, if the similarity of the negative sample’s video to the text is greater than that of the positive sample, it is considered a difficult case. This way, during training, we can mine all video-text pairs where the similarity of the negative samples exceeds that of the positive samples, adding an auxiliary loss (as shown in the formula above) specifically for mining difficult negative samples. The auxiliary loss is combined with the previous InfoNCE loss through weighted addition. We can adjust the model’s focus on difficult samples during training by adjusting the weight of r2.
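A minimal sketch of an InfoNCE-plus-hard-negative objective in the spirit of NegNCE; the margin and the weighting coefficient (the weight mentioned above) are illustrative, and this is not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def info_nce(sim, tau=0.05):
    """Standard symmetric InfoNCE over a [B, B] text-video similarity matrix."""
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim / tau, labels) + F.cross_entropy(sim.t() / tau, labels))

def neg_nce(sim, margin=0.0):
    """Auxiliary term over hard negatives: negatives scoring above the positive (plus a margin)."""
    pos = sim.diag().unsqueeze(1)                        # [B, 1] positive similarity per row
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    violation = (sim - pos + margin).masked_fill(~neg_mask, 0.0).clamp(min=0)
    return violation.sum() / neg_mask.sum()

def total_loss(sim, lam=0.5):
    """InfoNCE plus the weighted hard-negative term; lam controls the focus on hard negatives."""
    return info_nce(sim) + lam * neg_nce(sim)
```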

We validated the effectiveness of the difficult sample strategy mentioned above. DMAE is primarily used to expand the range of difficult samples, while NegNCE encourages the model to focus more on difficult negative samples after expanding the range of difficult samples. We observed that after incorporating DMAE and NegNCE, the overall performance improved.

3. Introducing Fine-Grained Recognition

Next, we introduce the third method, introducing fine-grained recognition, which can quickly enhance the effectiveness of video-text semantic retrieval. In experiments, introducing fine-grained recognition can improve the R@Sum metric by 2.8%.

Existing work on video-text semantic retrieval lacks the ability to distinguish finer semantic granularity. For example, commonly used Pairwise loss relies more on binary quantization to determine the similarity between video and text, which is essentially a binary classification. Another method based on Triplet loss does not classify relevance or irrelevance but models the partial order relationship, allowing the model to model the semantic relevance at a finer granularity. However, how to construct video-text pairs with different semantic granularities is our core challenge.

To construct video-text pairs with different semantic granularities, we adopted a generative approach to create such partial order pairs. The specific idea draws on the CSE work in single-modal text. When extending to multimodal contexts, if the current complete text and complete video are entirely relevant, then removing some key frames from the video will gradually weaken the relevance of the video. Therefore, the constructed partial order relationship is: the relevance of the text to the complete video > the relevance of the text to the video after key frames are removed. Similarly, for the text side, the relevance of the current video to the text > the relevance of the current video to the text after key words are removed. Based on this idea, we generate pairs with different semantic granularities. One key difference in generating partial order pairs is that, in multimodal contexts, it is necessary to see the text to determine which frames in the video are important and which are not; similarly, seeing the video is required to identify which words in the text description are important and which are not.
Based on this observation, we proposed two modules: the first is the prediction of token importance across modalities, whose core algorithm is to predict the importance of the tokens in the other modality given one modality input. For example, given global information input from the visual side, the model predicts the importance of the current text tokens, i.e., which words in the text are important; the same applies to the visual side, where the model predicts which tokens are significant given the overall representation of the current text. By this method, we select important text and visual tokens, and further mask these important tokens. The samples generated from masking are weaker in relevance to another modality than the complete text or video before masking. In this way, we can generate triplet samples with partial order relationships.

The specific implementation is divided into two stages: the first is to generate partial order samples, where we first need to predict which tokens are more critical for the other modality; after determining the weights of these critical tokens, we need to identify which tokens to mask that have the most significant semantic impact on the current text tokens. The second step is to directly mask these tokens that have the most substantial semantic influence, generating partial order pairs. Similar to the idea of Triplet loss, the samples after masking will have weaker relevance to the other modality than the unmasked samples. The triplet data establishes relevance among the three samples, meaning that the relevance of the unmasked samples to the text is higher than the masked samples, and the relevance of the unmasked text to the video is higher than that of the masked text to the video.
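A minimal sketch of a margin-based partial-order (triplet) loss over the generated samples, assuming similarity scores for the full pair and for the frame-masked and word-masked variants are already computed; the margin value is illustrative:

```python
import torch
import torch.nn.functional as F

def partial_order_loss(sim_full, sim_masked_video, sim_masked_text, margin=0.2):
    """Enforce: sim(text, full video) > sim(text, video with key frames masked)
               sim(video, full text) > sim(video, text with key words masked).

    All inputs are [B] similarity scores for the B pairs in the batch."""
    loss_v = F.relu(margin - (sim_full - sim_masked_video)).mean()
    loss_t = F.relu(margin - (sim_full - sim_masked_text)).mean()
    return loss_v + loss_t

# Toy example: the loss goes to zero once full pairs beat masked variants by the margin.
sim_full = torch.tensor([0.8, 0.7])
loss = partial_order_loss(sim_full, torch.tensor([0.5, 0.6]), torch.tensor([0.55, 0.4]))
```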

We validated the specific effects of introducing fine-grained recognition. After introducing the fine-grained TPM-CL method, the R@SUM metric on the MSRVTT dataset improved by about 2.8%. It can also be combined with DMAE; by expanding the introduction of more negative samples through DMAE, TPM-CL allows the model to focus more on difficult-to-distinguish negative samples during training. The combination of these two methods resulted in a 4.4% improvement.
To summarize, the third method to quickly enhance video-text semantic retrieval is to introduce fine-grained recognition, specifically through the generation of partial order samples and the introduction of partial order loss.
This concludes the introduction to the three main optimization methods for video-text semantic retrieval. Next, we will discuss the application of video multimodal technology in video-video source retrieval.
03

Video-Video Source Retrieval

The core of video-video source retrieval is how to efficiently and cost-effectively implement video infringement detection. In this field, we proposed a self-developed end-to-end segment matching localization technology that can quickly achieve copyright retrieval from video to video. Compared to traditional methods, it can save 85% of storage space and accelerate video infringement retrieval by 18 times. In terms of retrieval effectiveness, F1 improved by 2.78% compared to existing methods.

1. Challenges in Video-to-Video Source Retrieval

The challenges faced in video-to-video source retrieval mainly include:
  • First, the types of video infringement are complex, and content changes are diverse and drastic. This affects the accuracy of copyright retrieval. The complex infringement types involved include: geometric transformations (e.g., cropping, scaling, rotation, etc.), optical transformations (e.g., hue, noise, contrast, brightness, etc.), temporal transformations (e.g., frame dropping, fast playback, editing, frame rate changes, acceleration/deceleration, etc.), and compound transformations that utilize the above various transformations. These special transformations make source detection in videos extremely challenging. For example, applying various filters to the spatial domain of a video and cropping and blurring the original video results in all these videos infringing upon the original. Similarly, speeding up or stitching the original video also constitutes infringement.
  • On the other hand, the massive amount of data means that every frame of the video needs to be computed, leading to a large computational burden. The large storage and computation requirements pose a high-cost dilemma.
Therefore, to achieve video-to-video source retrieval, the core lies in how to improve retrieval accuracy while reducing costs.

Traditional video-to-video source retrieval methods do not meet the requirements. For example, the research in MultiMedia09 used sequential networks based on dynamic programming to find the longest path for infringing segments. Its advantage lies in being unsupervised and having relatively precise localization. However, its drawback is poor robustness, especially when facing acceleration or deceleration, or temporal and spatial compound transformations, where its performance may not meet expectations. Other works based on deep learning models process video infringement detection by constructing feature similarity matrices, transforming the problem of whether a video infringes copyright into a binary classification of the video under detection and the infringing video. This method cannot achieve segment localization for video infringement detection.

2. Framework and Core Technologies

Based on the inability of existing algorithms to meet the demands, and given the significant business implications of video infringement detection, we developed a complete infringement detection framework to address the aforementioned effectiveness and cost issues.

The overall design of the framework is illustrated in the image above.
First, we process the video database by extracting key frames from the videos, then performing frame-level feature extraction and storing them in a feature library. When processing query videos, we also extract key frames and features from the query video. Next, we match the features of the query video with those in the database, perform fine sorting, and ultimately determine whether the current query video infringes copyright.
The core technologies include the following two aspects. First, how to accurately extract key frames from videos, which is primarily a cost-reduction measure: if we stored every frame of every video, the storage cost would be very high, so we want key frames to stand in for the whole video during copyright retrieval. Second, how to quickly locate the infringing portions of a video, which is a trade-off between accuracy and cost. For example, pairwise-style video infringement detection such as the ICCV work mentioned above, while theoretically feasible, is impractical for real business due to its high comparison cost.
Our self-developed solutions are the self-supervised SKE method and the detection localization SPD module. Next, we will elaborate on these two methods.

First, let’s introduce the SPD module. The core idea of this module is to compare the features of key frames from both candidate videos and query videos pairwise, constructing a similarity matrix. In the feature similarity graph, we can see that some similarity values are higher and exhibit some continuity. Based on this observation, we can transform the problem of segment matching between videos into a pattern detection task in the feature similarity graph. This means we can construct a training set for copyright similarity graphs and label the start and end times of infringement in the feature similarity graph, allowing us to train a YOLO object detection model directly on the feature similarity graph for rapid identification. This helps determine whether the candidate video has any similarities with any video in the database and whether it infringes copyright.
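A minimal sketch of building the frame-level similarity map that the SPD detector consumes; the detector itself, a YOLO-style model trained on annotated infringement boxes, is only stubbed here as a callable:

```python
import numpy as np

def similarity_map(query_feats, ref_feats):
    """Pairwise cosine similarity between query frames [Tq, D] and reference frames [Tr, D].

    The result is treated as a single-channel image; temporally aligned infringing
    segments show up as bright, roughly diagonal patterns that a detector can localize."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return q @ r.T                                       # [Tq, Tr]

def detect_segments(sim_map, detector):
    """Run a trained detector on the similarity map; each box (x1, y1, x2, y2, score) maps back
    to a (query_start, query_end) x (ref_start, ref_end) pair of frame-index ranges."""
    boxes = detector(sim_map[None, None].astype(np.float32))   # add batch/channel dims
    return [dict(query=(y1, y2), reference=(x1, x2), score=s)
            for x1, y1, x2, y2, s in boxes]
```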
The core SPD module outperforms mainstream dynamic programming methods in the industry, achieving an 18-fold speed increase. This is primarily due to the fast object detection capabilities of YOLO. Additionally, for more complex scenarios, such as those involving acceleration/deceleration or filtered videos, we have seen significant improvements compared to mainstream industry solutions.

The second core task of infringement detection is cost reduction. The core idea is to replace traditional uniformly sampled frames with key frames; compared to uniform sampling, the number of key frames is typically reduced by about 70% to 80%, which greatly decreases storage. Key frame detection is the heart of the key frame extraction module. It first preprocesses the video by flattening all the frames and stitching them into one large image, then performs a task similar to image segmentation on that image, outputting a category for each position. In our setting each position corresponds to a frame, so the output is the probability of each frame being a key frame.
To combine the key frame extraction module with the infringement localization module, the crux is key frame selection, but hard selection of key frames is not differentiable. We therefore have the module output a probability mask over key frames, construct a uniform-frame mask as well, and add the two masks together. The combined mask is then applied element-wise to the feature similarity map computed from densely (uniformly) sampled frames, and the resulting similarity map is trained jointly with the SPD module. This way, gradients from the SPD module can backpropagate into the key frame extraction module, so key frame extraction and SPD form a complete end-to-end model trained jointly, rather than training key frame extraction first and the other modules afterwards as in traditional pipelines.
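A minimal sketch of the soft key-frame masking that keeps the pipeline differentiable; how SKE actually produces frame scores and how the two masks are combined are simplified assumptions here:

```python
import torch

def masked_similarity_map(sim_map, keyframe_logits, uniform_stride=8):
    """Blend a learned key-frame probability mask with a fixed uniform-sampling mask,
    then gate the densely sampled similarity map so gradients reach the key-frame module.

    sim_map:         [Tq, Tr] similarity between all query and reference frames
    keyframe_logits: [Tr] per-frame scores from the key-frame extraction module"""
    key_prob = torch.sigmoid(keyframe_logits)             # soft, differentiable key-frame mask
    uniform = torch.zeros_like(key_prob)
    uniform[::uniform_stride] = 1.0                        # guarantee a minimum frame coverage
    mask = (key_prob + uniform).clamp(max=1.0)             # combine the two masks
    return sim_map * mask.unsqueeze(0)                     # element-wise gating along the reference axis

# The gated map is fed to the SPD detector, whose loss backpropagates into keyframe_logits.
```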
Joint training of the key frame extraction and SPD modules shows significant improvements compared to using SPD alone. Additionally, after testing on large-scale datasets, we found that both the costs and storage requirements have significantly decreased. It can be observed that compared to using SPD alone, storing key frames can save 85% of storage space. Furthermore, in terms of infringement detection effectiveness, fewer key frames can achieve better results, with overall performance improving by 2.78%.
04

Conclusion

To review the content shared in this presentation, we primarily introduced two directions of video multimodal retrieval: video-text semantic retrieval and video-video source retrieval.

For video-text semantic retrieval, we proposed three ways to quickly enhance retrieval effectiveness: first, video-text pre-training, achieving a 24.5% improvement; second, focusing on difficult samples, for an 8.1% improvement, using two strategies: manually specified attention levels for difficult samples at different training stages, and letting the model adaptively learn how much attention to pay to difficult samples during training; third, introducing fine-grained recognition, for a 2.8% improvement, mainly by generating partial-order samples and introducing a triplet partial-order loss to model fine-grained semantics.
The video-video source retrieval section introduced our self-developed end-to-end segment matching localization method, which saves 85% of storage and accelerates infringement detection by 18 times, significantly improving retrieval effectiveness compared to using uniform frames.

The work mentioned above represents the research achievements of Ant Group’s intelligent engine multimodal cognitive team over the past year, focusing on advancements in video semantic retrieval and video-to-video copyright detection. If you are interested, we welcome you to learn more about our work. At the same time, we warmly invite more individuals to join us and work together to promote related projects.

05

Q&A

Q1: Does the key frame need to be annotated before training the segmentation model?
A1: The key frame module has two usage methods: if the module is extracted separately, it acts similarly to a segmentation model, which requires annotation. For example, key frames can be manually annotated, and then the model can be trained to extract key frames from videos.
However, if we adopt the end-to-end method used here, the key frame module is used in conjunction with the downstream task of similar frame comparison. In this context, the downstream task is more focused on infringement localization, such as comparing two similar videos by comparing two similar frames, allowing for an adaptive end-to-end approach that filters key frames based on task characteristics without requiring annotation.
Q2: Are there any existing key frame extraction models available on Hugging Face?
A2: Currently, the model has not been open-sourced, but there are plans for open-sourcing, and it is currently undergoing internal open-source processes.
Q3: In multimodal embedding, when applied to downstream recommendation scenarios, it often lacks effectiveness. What are the solutions?
A3: This relates to the earlier content on video-text semantic retrieval. Applying semantic retrieval to search or recommendation involves several closely coordinated stages. First, in the recall stage of search and recommendation, the video-text association can be used to strengthen recall. Second, in the ranking stage, the pre-trained video and text features can be incorporated into the ranking features. Third, in the re-ranking stage, the results often need to be diversified ("scattered"), and the trained embeddings can be used for that diversification. If the results are still unsatisfactory, it may come down to the specific business scenario or to how the multimodal pre-trained representation is being used, and a concrete answer requires understanding the specific scenario and problem.
Q4: The storage savings mentioned earlier, what kind of storage media are the main data stored in?
A4: For small-scale video copyright retrieval, it can be directly stored on a NAS drive, which is an ordinary hard disk. For large-scale storage, we store these features directly in a vector retrieval database. Storing in a database saves more space than NAS, but using key frames brings significant storage savings.
Q5: Can the key frame solution also be applied to video-to-video translation? For translations between different languages?
A5: Video-to-video translation specifically refers to converting English videos into corresponding ASR voice transcripts.
The key to translating spoken content in videos is that not only do the audio tracks need to match, but they also need to correspond with the lip movements. Due to differences in speech rates between different languages, traditional translation methods, such as converting Chinese to English or vice versa, require a certain degree of editing work. From another perspective, I believe this technology essentially solves the alignment problem between two videos. Regarding the translation scenario I described, while I do not have in-depth knowledge, I believe that if alignment issues exist between video segments, this method should be widely applicable.
Q6: Can you provide more details about the team’s recruitment situation?
A6: Thank you sincerely for your close attention to our team. We are the Ant Group’s intelligent engine multimodal cognitive team, and we are always committed to recruiting talent. Our ongoing recruitment covers various fields, not limited to the development directions discussed today. Our main research directions include multimodal large models, video large models, and copyright retrieval, among others. In summary, our work can be divided into two major areas: video processing and image-text processing. In image-text processing, we focus on multimodal and large models; in video processing, we specialize in real-time processing, video-text semantic retrieval, and video-to-video copyright retrieval. We warmly welcome students with a strong interest or relevant experience in these areas to send us your resumes. Our recruitment bases are in Hangzhou and Beijing, and our teams in both locations are eagerly welcoming your joining!
Q7: Does extracting video features refer to using visual input?
A7: If this refers to video features in source retrieval, the process is: we first extract key frames from the video and then extract features for those key frames. For the video-text pre-training mentioned earlier, feature extraction may instead be done directly at the video level, for example with a Video Swin-style model that produces an overall representation of the video. So the two tasks may differ in how they extract video features, one at the frame level and one at the whole-video level, but both require visual input.
Q8: How are video features extracted by fusing key frame features?
A8: The source-retrieval work discussed today operates mainly at the frame level: the key frames of the query video and of the videos in the library are compared to build a similarity matrix, so there is no explicit step that aggregates key-frame features into a single whole-video feature.
Common aggregation methods include non-parametric ones, such as pooling over key-frame features, and parameterized ones, such as adding a temporal encoder on top, treating frame features as tokens (much like Transformer inputs) and using a Transformer for temporal modeling, or strategies borrowed from temporal video modeling such as Token Shift. These approaches convert frame-level features into video-level features, and we have tried them in practice with reasonable results.
Q9: Should we understand that video features are the overall features of the video, which may be artificially assigned, but the actual features, i.e., substantial features, are still reflected?
A9: Yes; in practice it depends on the granularity of the problem being addressed. How should video features be designed? In video-text semantic retrieval, the core issue is retrieving videos from the text side, and since the whole video is treated as a unit, research there focuses on representing the overall content of the video. In video-to-video source retrieval, by contrast, the retrieval results may be segments of the query video that pose infringement risks against certain segments in the database, so research there focuses on representing video segments or even individual frames. In such cases the emphasis is not on an overall video embedding but on frame-level representations. It should be analyzed according to the specific problem.

Q10: Is ASR and key frame OCR information used?

A10: Yes, both are used. Source retrieval does not involve much text information, but in semantic retrieval, when processing video data, we introduce OCR (optical character recognition). When performing semantic retrieval we construct video-text pairs; if a video lacks an overall description, the text usually comes from ASR, and the ASR start and end times are aligned with the corresponding video segment, which serves as the visual input for that ASR text. Meanwhile, we run OCR on the key frames of the segment corresponding to the current ASR and merge the OCR text into the ASR text, so ASR and OCR together form the text content. Note, however, that adding OCR text can introduce a problem: the OCR text of consecutive key frames is often highly similar, so video-level OCR de-duplication is needed.
Q11: Are the questions answered earlier all related to the video-text pre-training part?
A11: Yes. In fact, this area of research is not limited to video-text pre-training but also involves video-text semantic retrieval. The focus of our discussion has been on how to improve the video-text retrieval effectiveness at the model or data level. Additionally, a crucial aspect is how to construct text to enhance its association with the video, such as deriving text from the video’s title, overall description, or corresponding segments via automatic speech recognition (ASR) or optical character recognition (OCR) from key frames, all of which may closely relate to the video. The specific implementation must consider the particular business scenario. For example, if you plan to utilize text from videos for video retrieval, including OCR is undoubtedly necessary.
Q12: How is noise from ASR, such as BGM, filtered? Is it done using Facebook’s library?
A12: Our noise-filtering model is quite robust at recognizing BGM, and mature open-source models are available for this. On the other hand, even if some noise is not filtered out, it is usually acceptable: purely instrumental BGM produces no ASR text at all, and the text that background music does produce is typically lyrics, whose relevance to the video content can be assessed with a trained vision-text relevance model such as a Chinese CLIP.
Q13: Does video retrieval involve online real-time inference? Is it done offline T+1, or is it real-time streaming? If it is real-time inference, how does such a large model handle it?
A13: Real-time inference is possible. Taking video-text semantic retrieval as an example, after training, we compute the overall embedding of each video when it enters the library and store it in a vector retrieval database. For the text side, queries arrive online, so we deploy a lighter-weight solution: the text model is made lightweight (for example through quantization and similar compression techniques) while staying matched to the video tower. At retrieval time, we only need to run this lightweight model to extract the text representation in real time and then search against the previously stored video vectors in the retrieval database. There are many ways to speed up retrieval, such as building real-time retrieval on vector libraries like FAISS.
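As an illustration only, a minimal sketch of this kind of online retrieval against a prebuilt index using the open-source FAISS library; the embedding dimension and the random vectors are placeholders for real model outputs:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 512                                  # embedding dimension (illustrative)
index = faiss.IndexFlatIP(d)             # inner product == cosine similarity on normalized vectors

# Offline: embed videos when they enter the library and add them to the index.
video_embs = np.random.rand(10000, d).astype("float32")   # stand-in for real video embeddings
faiss.normalize_L2(video_embs)
index.add(video_embs)

def search(text_emb, topk=10):
    """Online: embed the query text with the lightweight text tower, then search the index."""
    q = text_emb.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, topk)
    return list(zip(ids[0].tolist(), scores[0].tolist()))

results = search(np.random.rand(d))      # stand-in for a real text embedding
```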
Q14: Which vector database do you typically use?
A14: We internally use a retrieval platform called Qianxun, which is not an open-source product. However, its underlying principles are fundamentally similar to those of the Facebook open-source vector retrieval database FAISS.
That concludes the content of this presentation. Thank you all.


Speaker


Guo Qingpei


Ant Group


Senior Algorithm Expert


Senior Algorithm Expert at Ant Group, responsible for intelligent engine video and multimodal technology.
