New drug development has long been known for its lengthy timelines, high costs, significant risks, and low return on investment. The average cost of developing a new drug has reached $2.6 billion, with an average time frame of ten years. Even with such high R&D costs and long timelines, there is no guarantee that a drug will pass all clinical trials and reach the market. Even generic drugs, which are generally easier to develop, progress very slowly.

With the continuous advancement of AI technologies such as deep learning, integrating AI with drug development can significantly reduce the time and costs associated with new drug development and accelerate the development and market entry of generic drugs. Undoubtedly, artificial intelligence and machine learning will usher in a faster, cheaper, and more efficient era of drug development.

Zilliz, in collaboration with leading global pharmaceutical research and development companies, has developed the MolSearch compound molecular structure analysis software, creating a new technological breakthrough for AI drug development.

| Milvus Vector Search Engine

As an open-source feature vector similarity search engine, Milvus has been widely applied in key areas of AI technology, such as machine vision (image and video processing), natural language processing, and speech recognition, due to its powerful unstructured data processing capabilities.

As AI technology becomes more deeply integrated with drug development, Milvus also has broad application prospects in this area. For example, predicting drug polymorphs during new drug development can draw on the image recognition applications of Milvus to identify suitable polymorphs. Target screening and patient recruitment can be framed as text semantic analysis problems, so that text data related to drug development can be analyzed quickly using the natural language processing applications of Milvus.

Virtual drug screening is a key step in new drug development. It predicts the potential activity of compounds by simulating the screening process, so that physical screening can target only the compounds most likely to become drugs, which significantly reduces development costs. Traditional approaches, limited by algorithms and computing power, take minutes to analyze millions of compound molecules for similarity, substructure, and superstructure. A solution integrated with Milvus can analyze datasets of billions of chemical formulas in seconds, greatly improving the efficiency of new drug development.

Milvus can be widely applied at various stages of drug development. By integrating mature AI models with the Milvus vector search engine, it will undoubtedly bring more disruptive technological breakthroughs to the field of drug development.

| MolSearch Virtual Drug Screening Tool

MolSearch is an open-source compound analysis software developed based on the Milvus vector similarity retrieval engine, with the front-end design referencing the open-source software MolView^[1]. For specific setup and functionality, please refer to https://github.com/zilliztech/MolSearch.

Medicinal chemists typically design new drug structures for subsequent screening by optimizing molecular modules through scaffold hopping. Our initial intention in developing MolSearch was to accelerate virtual screening over massive numbers of compounds. Virtual screening is an essential step in new drug development, and its results significantly influence the success of later mouse experiments and clinical trials. The larger the compound database used during virtual screening, the higher the screening accuracy, and the greater the likelihood of successful new drug development.

By integrating Milvus as its core retrieval engine, the molecular retrieval function in MolSearch can analyze billions of chemical molecular structures within seconds, which is a significant technological breakthrough for MolSearch in the field of drug development. Currently, MolSearch integrates the open ZINC dataset of 820 million chemical formulas^[2]. The chemical formulas are converted into 2048-bit chemical fingerprints (feature vectors), on which Milvus performs high-performance vector calculations to support similarity, substructure, and superstructure retrieval of molecular structures. The end-to-end retrieval performance is as follows:

In the table, p denotes a percentile: the p99 response time is the time within which 99% of retrievals complete.
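To make the p99 definition concrete, here is a minimal sketch that computes a percentile with the nearest-rank method. The function name and the latency values are hypothetical and are not MolSearch measurements; other percentile conventions (for example, interpolation) would give slightly different numbers.

```python
import math

def percentile_nearest_rank(samples, p):
    """Return the smallest sample such that at least p percent of samples are <= it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the sample that covers p percent of the data
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical response times in milliseconds: 1.0, 2.0, ..., 100.0
latencies_ms = [float(i) for i in range(1, 101)]
p99 = percentile_nearest_rank(latencies_ms, 99)
p50 = percentile_nearest_rank(latencies_ms, 50)
print(f"p50 = {p50} ms, p99 = {p99} ms")
```

With these made-up values, 99 of the 100 requests finish within the p99 time, which is the sense in which the article uses the term.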

| System Overview

The virtual compound screening technology used in the MolSearch system first converts the chemical formulas of compound molecules into chemical fingerprints, which are feature vectors, using the RDKit tool^[3]. The distances between these vectors are then calculated to analyze the similarity between compound molecules.

1. Chemical Fingerprint Generation

Chemical fingerprints are typically used for structure retrieval and similarity retrieval. As shown in the figure below, a fingerprint is an ordered list of bits (1/0), where each bit indicates whether a specified element, molecular fragment, or similar feature is present in the chemical structure.

The MolSearch system uses the RDKit tool to generate RDKit fingerprints. The algorithm enumerates all molecular fragments along paths (usually linear) that start at an atom and extend up to a specified number of bonds, and hashes each path to set bits in the fingerprint. The image below displays all paths starting from NH₂ (highlighted) with a length of 6; each path is hashed into binary bits.

The above image only shows the fragments and bits that start from a single atom. For a complete fingerprint, this process is repeated for every atom in the molecule. Such fingerprints can be generated for any molecule, the vector dimension can be adjusted through the fpSize parameter, and the resulting vectors can be imported into Milvus for retrieval.

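from rdkit import Chem, DataStructs

# Parse the SMILES string into an RDKit molecule
mol = Chem.MolFromSmiles(smiles)
# Path-based RDKit fingerprint with VECTOR_DIMENSION bits
fp = Chem.RDKFingerprint(mol, fpSize=VECTOR_DIMENSION)
# Hex-encode the bit vector, then pack it into bytes for Milvus
hex_fp = DataStructs.BitVectToFPSText(fp)
vectors = bytes.fromhex(hex_fp)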
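The path-hashing idea described above can be illustrated without RDKit. The following is a toy sketch, not RDKit's actual algorithm: it ignores bond orders, aromaticity, and hydrogens, and the function name `path_fingerprint`, the SHA-256 hash, and the 64-bit size are all assumptions chosen for illustration.

```python
import hashlib

def path_fingerprint(atoms, bonds, max_bonds=3, fp_size=64):
    """Toy path fingerprint: hash every simple path of up to max_bonds bonds to one bit."""
    # Build an adjacency list from the undirected bond list
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    bits = 0

    def walk(path):
        nonlocal bits
        labels = [atoms[i] for i in path]
        # Pick one direction so a path and its reverse hash to the same bit
        key = "-".join(min(labels, labels[::-1]))
        digest = hashlib.sha256(key.encode()).digest()
        bits |= 1 << (int.from_bytes(digest[:4], "big") % fp_size)
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # never revisit an atom within one path
                walk(path + [nxt])

    # Repeat the enumeration from every atom in the molecule
    for start in range(len(atoms)):
        walk([start])
    return bits

# Heavy-atom skeletons only: ethanol (C-C-O) and 1-propanol (C-C-C-O)
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = path_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

# Pack the 64 bits into 8 bytes, analogous to the byte vectors sent to Milvus
ethanol_bytes = ethanol.to_bytes(8, "big")
print(f"{ethanol:064b}")
print(f"{propanol:064b}")
```

Because every path in the ethanol skeleton also occurs in the propanol skeleton, every bit set for ethanol is also set for propanol. This containment property is what makes fingerprints usable for substructure and superstructure screening.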

2. Compound Retrieval

By importing the generated vectors into Milvus, a compound library can be established, enabling similarity retrieval, substructure retrieval, and superstructure retrieval based on different calculation methods.

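from milvus import Milvus

# Connect to the Milvus server
milvus = Milvus()
# Insert the fingerprint vectors into the collection
milvus.insert(collection_name=MILVUS_TABLE, records=vectors)
# Return the top_k closest fingerprints for each query vector
milvus.search(collection_name=MILVUS_TABLE, query_records=query_list, top_k=topk, params={})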

Similarity Retrieval

Used to find molecules similar to the input reference molecule.

Substructure Retrieval

Detects whether a molecular structure is a substructure of another molecule.

Superstructure Retrieval

Detects whether a molecular structure is a superstructure of another molecule.
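The three retrieval modes can be sketched on toy data with plain Python integers standing in for bit fingerprints. This is a brute-force illustration under stated assumptions, not how Milvus implements search: the library contents, the function names, and the 8-bit fingerprints are all hypothetical. Bit containment is also only a screen, since hash collisions can produce false positives that a real system would need to verify.

```python
def popcount(x):
    """Number of bits set to 1."""
    return bin(x).count("1")

def jaccard_similarity(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0

def most_similar(query, library, top_k):
    """Similarity retrieval: rank library entries by Jaccard similarity to the query."""
    ranked = sorted(
        library.items(),
        key=lambda item: jaccard_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

def entries_containing_query(query, library):
    """Entries whose bits include every query bit (the query may be a substructure of them)."""
    return [name for name, fp in library.items() if query & fp == query]

def entries_contained_in_query(query, library):
    """Entries whose bits are all present in the query (the query may be a superstructure of them)."""
    return [name for name, fp in library.items() if query & fp == fp]

# Hypothetical 8-bit fingerprints
library = {
    "mol_a": 0b00001111,
    "mol_b": 0b00000011,
    "mol_c": 0b11110000,
}
query = 0b00000111

similar = most_similar(query, library, top_k=2)
larger = entries_containing_query(query, library)
smaller = entries_contained_in_query(query, library)
print(similar, larger, smaller)
```

In this sketch the direction of the containment test is what distinguishes the substructure and superstructure modes; similarity retrieval uses the same bit operations but ranks results instead of filtering them.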

3. Chemical Fingerprint Calculation

Milvus supports various commonly used similarity calculation metrics, including Euclidean distance, inner product, Hamming distance, and Jaccard distance. For binary data, the MolSearch system chooses Jaccard/Substructure/Superstructure distance to calculate similarity.

With these metrics, the distance between two chemical fingerprints is calculated from the bits set in each fingerprint (the complete formulas are given in the CSDN version of this article).
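As a worked illustration, the Jaccard distance between two binary fingerprints can be written in terms of bit counts. The notation here is an assumption introduced for this example: A and B are the sets of bits equal to 1 in the two fingerprints, N_A and N_B are their sizes, and N_AB is the number of bits set in both.

```latex
d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
          = 1 - \frac{N_{AB}}{N_A + N_B - N_{AB}}

% Example: A = 00000111, B = 00001111
% N_A = 3, N_B = 4, N_{AB} = 3
d_J(A, B) = 1 - \frac{3}{3 + 4 - 3} = 1 - \frac{3}{4} = 0.25
```

The exact definitions Milvus uses for the Substructure and Superstructure distances are not reproduced here. A containment-style measure, stated as an assumption rather than as the Milvus definition, would compare N_AB with N_A or N_B alone: in the example above N_AB / N_A = 1, meaning every bit of A is also set in B.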

| Conclusion

Backed by optimizations in both software and hardware, Milvus provides stable, enterprise-grade, high-performance vector retrieval for a wide range of AI applications. The MolSearch system builds on this capability to analyze massive collections of molecular formulas at high speed, a technological breakthrough over traditional virtual drug screening solutions. We believe that Milvus has broad application prospects in other areas of drug development, and we look forward to collaborating with like-minded people in the AI drug development field to build the Milvus AI data processing platform.

Finally, we welcome you to visit the Milvus official website to try the MolSearch demo!

| References

[1] MolView. http://molview.org/
[2] Sterling and Irwin, J. Chem. Inf. Model, 2015. https://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00559
[3] Landrum, G. 2010. “RDKit.” Q2. https://www.rdkit.org/

