Introduction
This article describes how BIGO, a live video streaming company with 400 million users worldwide, uses the Milvus vector search engine to deduplicate a massive volume of short videos. With Milvus accelerating vector search, BIGO’s short video product Likee keeps each search under 200 ms while maintaining a high recall rate. We also scaled Milvus horizontally to raise vector query throughput and keep business queries efficient.
Business Background
Since its establishment in 2014, BIGO has launched a series of audio and video social and content products, such as BIGO LIVE and Likee, built on its audio and video processing technology, global real-time audio and video transmission technology, and artificial intelligence technology. As of the second quarter of 2020, Likee, BIGO’s short video product, had reached 150 million monthly active users on mobile, and the system has to process a massive number of user-uploaded videos every day. To recommend high-quality content to users, the system needs to remove duplicate and low-quality content from this video pool.
Deduplication Process
We use deep learning to deduplicate videos.
First, we sample 15 to 20 frames from each user-uploaded video and convert each frame into a feature vector. For each of these vectors, we then search a database of over 700 million vectors for the top k most similar ones and identify the videos they belong to, which become the candidates for video-level similarity comparison.
These similarity searches run against a full dataset at the billion scale while a large volume of new data arrives every day, which places heavy performance demands on the vector search system.
After a comprehensive analysis and comparison of the options, we adopted Milvus, a distributed vector search engine, for vector similarity retrieval.
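The recall step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a brute-force cosine-similarity scan stands in for Milvus, the tiny in-memory `index` stands in for the real vector database, and the names `top_k` and `candidate_videos` are hypothetical rather than part of BIGO's code or the Milvus API.

```python
import heapq
import math
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def top_k(query: Vector, index: List[Tuple[str, Vector]], k: int) -> List[Tuple[float, str]]:
    """Return the k highest (cosine_similarity, video_id) pairs for one frame vector.

    Brute-force scan: a stand-in for the approximate search a vector engine performs.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), video_id)
        for video_id, vec in index
    )
    return heapq.nlargest(k, scored)


def candidate_videos(frame_vectors: List[Vector], index: List[Tuple[str, Vector]], k: int) -> Set[str]:
    """Search every frame vector of a new video and collect the recalled video IDs."""
    candidates: Set[str] = set()
    for frame_vector in frame_vectors:
        for _, video_id in top_k(frame_vector, index, k):
            candidates.add(video_id)
    return candidates


# Toy index: each entry is (video_id, frame feature vector). Values are made up.
index = [
    ("video_a", [1.0, 0.0, 0.0]),
    ("video_a", [0.9, 0.1, 0.0]),
    ("video_b", [0.0, 1.0, 0.0]),
    ("video_c", [0.0, 0.0, 1.0]),
]
new_video_frames = [[1.0, 0.05, 0.0], [0.8, 0.2, 0.0]]
candidates = candidate_videos(new_video_frames, index, k=2)
```

A linear scan like this is far too slow at the scale described here, which is why the recall step is delegated to a dedicated vector search engine.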
Overall Architecture
This section describes the overall business architecture of our Milvus-based deduplication work.
As shown in the figure below, new videos on the Likee platform are written to Kafka in real time and enter the review process once a Kafka consumer picks them up. A deep learning model then extracts features from the reviewed content, transforming unstructured data (video) into structured data (feature vectors). The system packages the feature vectors and sends them as a request to the video similarity review program.

Video Deduplication Business Architecture
Each video that has gone through feature extraction is represented by multiple feature vectors. These vectors are first indexed in Milvus, then stored in Ceph, and subsequently loaded by Milvus query nodes to serve searches. Meanwhile, we also synchronize video IDs and their corresponding feature vectors to TiDB or Pika, depending on business requirements.
Video Similarity Retrieval
As the process above shows, the core of this solution is similarity retrieval over a massive set of feature vectors.
The similarity-audit component in the figure above uses Milvus’s batch search function to run similarity searches for the multiple feature vectors of each new video, recalling the top 100 similar vectors for each feature vector (each recalled vector is linked to its video ID). Next, we deduplicate the video IDs recalled across all of these searches and query the corresponding feature vectors from TiDB or Pika. Finally, we compute a video-level similarity score between each retrieved video’s feature vectors and those of the requested video, and return the video ID with the highest score, which completes the video similarity retrieval.
The complete process is illustrated in the figure below:

Similarity-Audit Business Process
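The similarity-audit flow described in this section (merge recalled IDs, fetch stored vectors, score, return the best match) can be sketched as follows. This is a minimal illustration, not BIGO's implementation: a plain dictionary stands in for TiDB or Pika, the recalled IDs are assumed to come from an earlier batch search, and the scoring rule (the mean, over the new video's frames, of the best cosine match among a candidate's frames) is an assumption, since the article does not specify the actual video similarity calculation. All function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def video_score(query_frames: List[Vector], candidate_frames: List[Vector]) -> float:
    """Assumed metric: average, over query frames, of the best match in the candidate."""
    if not query_frames or not candidate_frames:
        return 0.0
    best_matches = [max(cosine(q, c) for c in candidate_frames) for q in query_frames]
    return sum(best_matches) / len(query_frames)


def most_similar_video(
    query_frames: List[Vector],
    recalled_ids_per_frame: List[List[str]],
    vector_store: Dict[str, List[Vector]],
) -> Optional[Tuple[str, float]]:
    """Return (video_id, score) for the best-scoring recalled video, or None."""
    # Step 1: deduplicate the video IDs recalled by the per-frame searches.
    candidate_ids = {vid for hits in recalled_ids_per_frame for vid in hits}
    best: Optional[Tuple[str, float]] = None
    # Steps 2 and 3: fetch each candidate's stored vectors and score it.
    for video_id in sorted(candidate_ids):
        frames = vector_store.get(video_id, [])
        score = video_score(query_frames, frames)
        if best is None or score > best[1]:
            best = (video_id, score)
    return best


# Toy data: a dict standing in for the key-value store of video ID -> frame vectors.
vector_store = {
    "video_a": [[1.0, 0.0], [0.0, 1.0]],
    "video_b": [[1.0, 1.0]],
}
query_frames = [[1.0, 0.0], [0.0, 1.0]]
recalled = [["video_a", "video_b"], ["video_b", "video_a"]]
best = most_similar_video(query_frames, recalled, vector_store)
```

In practice the returned score would typically be compared against a threshold to decide whether the new video counts as a duplicate; that threshold is outside the scope of this sketch.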
Conclusion and Outlook
That covers how the Likee business uses Milvus for short video deduplication. As a high-performance, high-recall distributed vector search engine, Milvus has performed remarkably well in Likee’s short video deduplication business and has greatly aided BIGO’s business development.
BIGO hopes to cooperate more deeply with Milvus in the future, in areas such as content review and banning and personalized video recommendation, to advance the business of both parties. We look forward to the continued growth of the Milvus community!
About Likee
With high-quality and diverse entertainment content, Likee has become a pioneer and benchmark in the global internet short video social product space.
- By mid-2020, Likee’s mobile monthly active users reached 150 million.
- By the end of September 2019, Likee’s mobile monthly active users reached 100.2 million. Likee ranked in the top five of the global download chart on Google Play, surpassing well-known applications such as Instagram and Snapchat, with downloads second only to Facebook.
- By mid-2019, Likee’s mobile monthly active users reached 80.7 million.
- In 2017, BIGO established the short video community Likee, which officially launched on the App Store in August of that year, targeting the overseas market, and won the annual best entertainment application award on Google Play that year.
- In 2014, BIGO was founded in Singapore by David Li and Jason Hu, with a focus on artificial intelligence technology.
Author Introduction
Guo Xinyang, Head of the BIGO Machine Learning Platform, Senior Staff Engineer
Han Baoyu, Engineer, BIGO Machine Learning Platform Team
Editor Introduction
Xiong Ye, Zilliz Community Intern
Zang Peng, Zilliz Community Intern
Zilliz built the Milvus vector database to accelerate the development of next-generation data platforms. Milvus is a graduated project of the LF AI & Data Foundation. It can manage large unstructured data sets and is widely applied in areas such as new drug discovery, recommendation systems, chatbots, and more.