AI for Science: Redefining Scientific Research Paradigms

AI has developed over more than seventy years, with each technological breakthrough opening new possibilities for humanity. Its deep integration with scientific research promises countless more.
Amid much public attention, the Nobel Prize in Chemistry was announced this October, honoring three chemists for their contributions to the development of "click chemistry and bioorthogonal chemistry": American chemist Carolyn R. Bertozzi, Danish chemist Morten Meldal, and American chemist K. Barry Sharpless.
In fact, before the announcement there was much debate about who would win. In a reader poll predicting the awardees run by the authoritative chemistry journal Chemical Reviews, John Jumper, who leads the DeepMind team behind AlphaFold 2 and its accurate protein structure predictions, received the most votes.
Despite not winning due to "timing issues," John Jumper's team had already received another prestigious honor—the 2023 Breakthrough Prize in Life Sciences, often called the "luxury Nobel" or "the Oscars of science," and the largest monetary prize in biology and medicine to date.
Why has John Jumper and his team’s AlphaFold garnered such acclaim? The main reason is that AlphaFold’s birth solved a classic problem that has plagued the biological community for over half a century, namely the protein folding problem proposed by 1972 Nobel laureate Christian Anfinsen—”the amino acid sequence of a protein should fully determine its structure.”
John Jumper’s team innovatively utilized artificial intelligence technology to finally crack this famous conjecture, not only advancing the research on protein structure prediction to a new stage but also significantly increasing attention towards “AI for Science.”

In simple terms, AI for Science refers to utilizing the powerful data induction and analysis capabilities of artificial intelligence to learn scientific laws and principles, deriving models to solve practical research problems, especially assisting scientists in conducting numerous repetitive verifications and trial-and-error processes under different hypothetical conditions, thereby greatly accelerating the pace of scientific exploration. This method has already achieved significant results in various cutting-edge scientific fields.


Compared to the more familiar and accessible applications of artificial intelligence, the scientific fields involved in AI for Science, such as biopharmaceuticals, energy, and materials research, may seem distant from everyday life. However, the commonality behind them lies in utilizing artificial intelligence to “liberate” productivity—allowing people to be freed from many repetitive and mechanized foundational tasks and conduct more efficient production work with the assistance of artificial intelligence. This is precisely the value and appeal of artificial intelligence.


AI for Science: Catalyzing a New “Scientific Revolution” with Artificial Intelligence

Let's return to AlphaFold and see how integrating AI disrupted the evolution of protein structure analysis.
Proteins, as the material basis of life, underlie virtually all life activities; nearly every human disease involves abnormal protein function. In other words, if we could artificially stimulate or inhibit protein targets, "controlling" protein structure and function, we could significantly accelerate the development of targeted drugs and effective therapies for intractable diseases.
In the past, biologists widely used experimental techniques such as X-ray diffraction and cryo-electron microscopy to determine the three-dimensional structures of proteins, which was time-consuming and costly. So, starting in 1994, research teams have competed at the biennial Critical Assessment of Structure Prediction (CASP), producing protein structure prediction models such as I-TASSER, RaptorX, and RoseTTAFold.
However, most of these computational predictions differed substantially from experimentally observed structures, with accuracy below 40%; further progress required continually improving prediction models to narrow the gap between predicted structures and experimental data.
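To make "accuracy" concrete: structure predictions are scored by comparing predicted atomic coordinates against experimentally determined ones. CASP's headline metric is GDT_TS, but root-mean-square deviation (RMSD) is the simplest such measure. A minimal sketch in Python, assuming the two structures are already superimposed:

```python
import numpy as np

def rmsd(predicted: np.ndarray, experimental: np.ndarray) -> float:
    """Root-mean-square deviation between two (N, 3) coordinate arrays.

    Assumes the structures are already superimposed; a full comparison
    would first align them (e.g. with the Kabsch algorithm).
    """
    diff = predicted - experimental
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Toy example: a "prediction" offset from the reference by 1 Å along x.
reference = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
predicted = reference + np.array([1.0, 0.0, 0.0])
print(rmsd(predicted, reference))  # 1.0
```

AlphaFold 2's "one atom's width" claim, mentioned below, corresponds to median errors on the order of 1 Å on this kind of metric.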
Moreover, transitioning from protein structure prediction to drug development involves significant differences in the principles and application scenarios of various drug design methods. For instance, in the pharmaceutical process, from target discovery and lead compound screening optimization to later ADMET prediction and even clinical effect prediction, each stage faces unique technical challenges. In this process, researchers must conduct high-throughput repetitive experiments, often spending years, with verification counts reaching millions.
Looking back, this problem, which fascinated countless scholars for half a century yet resisted solution, is merely the tip of the iceberg among the barriers facing research. The emergence of AI for Science, born of mature AI technology and interdisciplinary integration, brings new possibilities both for this problem and for exploring the scientific unknown.
Since 2020, AI for Science has entered a concentrated development phase, including the AlphaFold project, whose latest achievement—AlphaFold 2 released by DeepMind in 2021—can successfully predict 98.5% of human protein three-dimensional structures, with predicted results differing from the actual structures of most proteins by only one atom’s width, reaching levels previously achieved through complex experiments like cryo-electron microscopy.
Similar to the life sciences, the molecular dynamics field has seen the emergence of the DeePMD-kit project, which, by combining machine learning, high-performance computing, and physical modeling, can push molecular dynamics to billion-atom scale while maintaining high precision, largely resolving the traditional trade-off of "fast but not accurate" versus "accurate but not fast."
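The key idea behind approaches like DeePMD-kit is that only the interatomic potential is learned; the time integration of Newton's equations stays classical. A minimal velocity Verlet integrator, with the force passed in as a callable (here a toy harmonic force, where a learned potential's gradient would go in practice; actual DeePMD-kit/LAMMPS interfaces differ):

```python
def velocity_verlet(x, v, force, dt, steps):
    """Integrate Newton's equations with velocity Verlet (unit mass).

    `force` is any callable returning the force at position x; in
    machine-learned MD it would evaluate a trained potential's gradient.
    """
    f = force(x)
    for _ in range(steps):
        x = x + v * dt + 0.5 * f * dt ** 2   # position update
        f_new = force(x)
        v = v + 0.5 * (f + f_new) * dt       # velocity update with averaged force
        f = f_new
    return x, v

# Harmonic oscillator F = -k x: total energy should be (nearly) conserved.
k = 1.0
x, v = velocity_verlet(1.0, 0.0, lambda x: -k * x, dt=0.01, steps=1000)
energy = 0.5 * v ** 2 + 0.5 * k * x ** 2
print(energy)  # ≈ 0.5, the initial energy
```

Velocity Verlet is the standard choice here because it is symplectic, which is why the energy drift stays bounded over long trajectories.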
In weather forecasting, FourCastNet, a neural network model based on operator learning, can accelerate forecasts by as much as 45,000 times; in industry, AI methods that fuse data with physical mechanisms have proven to be breakthroughs for complex, high-dimensional physical problems in fluid dynamics and structural mechanics governed by PDEs…
In short, whether it is the currently popular AI applications such as AI painting or AI dialogue models like ChatGPT, or numerous project cases in the field of AI for Science, they all prove that AI is bringing a paradigm shift to various industries and fields. However, the more significant meaning of AI for Science lies in its accelerating effect on cutting-edge research, which will have more fundamental and far-reaching impacts on human society and economic development.
Moreover, AI for Science is not limited to efficient verification and trial-and-error within known scientific principles; it also lets researchers explore more complex scenarios, combining AI with data to infer more accurate physical laws in those scenarios.
It is no exaggeration to say that artificial intelligence will become a new production tool for scientists, following the computer, and is catalyzing a new “scientific revolution.”


Breaking Through Barriers: Starting from Deep Learning Frameworks

However, returning from imagination to reality, for the artificial intelligence industry to achieve substantial development and truly become a new production tool for humanity, it must cross the barrier of practical application. The comprehensive and deep-seated innovative value of AI for Science also makes it face barriers much higher than those of common AI applications.
The main reason is that putting AI for Science into practice requires large amounts of data from industrial scenarios, together with sound modeling of the underlying scientific mechanisms; the high-dimensional, massive data also places greater demands on computing power and memory. Overall, the most significant barriers currently lie in data, platform technology, software-hardware collaboration, domain problem-solving capabilities, and a strong research and development ecosystem.
From the data perspective, data in industrial scenarios is high-dimensional, complex in format, and siloed. Additionally, privacy and legal restrictions make some data difficult to share publicly. How to efficiently govern this multi-feature, multi-source data, and how to handle small-sample and zero-sample modeling, is therefore the foundation for AI to land in research.
From the software-hardware collaboration perspective, the development of AI for Science relies not only on deep learning frameworks but also on underlying high-performance hardware. On the one hand, AI for Science needs to solve real physical problems more scientifically, such as solving high-order PDEs and developing models driven jointly by data and physical mechanisms. On the other hand, traditional scientific computing centers already support a wide range of research tasks; while continuously expanding intelligent computing hardware, they also need to deeply integrate scientific and intelligent computing hardware with AI development frameworks to support new AI for Science computing scenarios at leading performance.
From the research and development ecosystem perspective, AI for Science is an emerging research paradigm built on interdisciplinary collaboration, spanning fields such as biology, molecular dynamics, computational fluid dynamics, and solid mechanics. It requires a large pool of cross-domain research talent, and its continually expanding open-source ecosystem must connect with traditional simulation software and datasets to meet developers' needs for development toolchains, gradually forming a stable, high-quality research ecosystem.
To overcome these barriers and lower the application threshold of AI for Science, scientists and enterprises from various sectors have begun to embark on the path of paradigm innovation and inclusivity in AI for Science.
In the deep learning framework field, foreign AI frameworks like TensorFlow, PyTorch, and MXNet have, since their inception, helped countless scientists and engineers with academic research and engineering implementation, significantly advancing the AI field. As a pioneer in the domestic AI field, Baidu open-sourced PaddlePaddle, China's first deep learning framework, in 2016, and has since evolved toward a comprehensive AI technology stack. Today, the PaddlePaddle platform adapts to a wide range of hardware and can be deployed directly to large-scale scientific computing clusters, integrating closely with the existing scientific computing ecosystem and strongly supporting the deployment of AI for Science solutions.
Similarly, in 2016, Xiang Hui began to engage with the AI industry at Baidu, experiencing the rapid evolution of AI technology applications in computer vision, natural language processing, recommendation, and other fields. She is now the product head of Baidu PaddlePaddle AI for Science.
In an interview with 36Kr, Xiang Hui discussed the challenges of implementing AI for Science. In Baidu PaddlePaddle's view, the core task is building a generalized deep learning platform: one that connects heterogeneous computing power underneath, provides APIs for scientific computing problems along with compilation acceleration mechanisms, and thereby better supports the construction and analysis of typical scientific computing scenarios such as weather forecasting, fluid simulation, and materials discovery. "At the same time, we also need to build a sustainable, open ecosystem that integrates research, scientific computing, platforms, and end users," she said.
To allow scientists from different fields to flexibly use currently popular research models, Baidu PaddlePaddle began planning technology forms and product routes in the AI for Science field as early as 2019, and from early 2020 to the end of 2021, successively released the biocomputing platform “PaddleHelix”, the quantum computing platform “PaddleQuantum”, and the scientific computing platform “PaddleScience” aimed at fluid, solid, electromagnetic, and other fields.
Additionally, Baidu PaddlePaddle provides mainstream models such as PINN, FNO, and DeepONet, plus standard cases that users can reuse directly, such as flow around obstacles in CFD, vortex-induced vibration, and Darcy flow.
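The common thread in models like PINN is penalizing the residual of the governing PDE at collocation points instead of (or alongside) fitting labeled data. A framework-agnostic sketch of that residual loss for the 1D Poisson equation u''(x) = f(x), using finite differences where a real PINN would use a framework's automatic differentiation (this is not PaddleScience's actual API):

```python
import numpy as np

def pde_residual_loss(u, x, f):
    """Mean squared residual of the 1D Poisson equation u''(x) = f(x).

    `u` is any candidate solution; in a PINN it would be a neural
    network differentiated with autodiff. Central finite differences
    keep this sketch dependency-free.
    """
    h = 1e-4
    u_xx = (u(x + h) - 2 * u(x) + u(x - h)) / h ** 2  # approximate u''
    return float(np.mean((u_xx - f(x)) ** 2))

# Collocation points in (0, 1); the exact solution has ~zero residual.
x = np.linspace(0.1, 0.9, 50)
exact = lambda x: np.sin(np.pi * x)
source = lambda x: -np.pi ** 2 * np.sin(np.pi * x)
print(pde_residual_loss(exact, x, source))  # ~0 (finite-difference error only)
```

Training a PINN amounts to minimizing this loss (plus boundary-condition terms) over the network's parameters; a poor candidate solution yields a large residual, steering optimization toward functions that satisfy the physics.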
Baidu PaddlePaddle also supports customized problem reproduction and analysis based on components, supporting various methods that combine data-driven and physical mechanisms, achieving breakthrough progress in scenarios such as physical simulation, compound molecular characterization, and quantum entanglement processing.
To better serve scientific computing users who need to solve various PDEs, Baidu PaddlePaddle is also working toward full model support for the excellent scientific computing repository DeepXDE, having initially completed accuracy alignment for all its models. With PaddlePaddle's latest high-order automatic differentiation mechanism, automated distributed strategies, and compilation acceleration, the solving efficiency of some use cases has surpassed comparable products.
To further promote the practical application of AI for Science, Baidu PaddlePaddle has collaborated with several universities and research institutions to build example cases in fluid dynamics, materials, and biology, forming open, interdisciplinary ecosystem communities. In May of this year, it also launched the "PaddlePaddle AI for Science Co-creation Plan," aiming to jointly develop technology, share resources, and build the ecosystem with its partners.
Reflecting on the development experiences of these communities, Xiang Hui vividly remembers several student team projects. She recalls that a student team from Beihang University conducted a vacuum flow simulation experiment, which could not be replicated on the ground due to the need for a vacuum condition. However, through the PaddlePaddle AI for Science product, the team derived some coefficients of the Boltzmann equation, achieving impressive results. “These cases have proven that in certain scenarios, Baidu PaddlePaddle AI for Science can solve developers’ research problems to some extent,” Xiang Hui said.
As of now, the Baidu PaddlePaddle AI for Science toolkit supports combining AI methods with methods from the foundational disciplines. Its biggest strength is overcoming the challenges those disciplines face, such as high dimensionality, long time scales, cross-scale behavior, and insufficient computing power, by recasting numerical differentiation as data- and physics-driven neural network models.
Launching the AI for Science track is undoubtedly another challenge and leap for Baidu PaddlePaddle’s AI capabilities. While significantly accelerating the resolution of scientific problems, it will also deeply accelerate the exploration of more unknown scientific questions in the industry.
Empowering Software-Hardware Collaborative Development Under the Platform
As mentioned earlier, the acceleration of scientific problem-solving and industrial implementation in AI for Science not only requires support at the framework or software platform level but also requires infrastructure to provide powerful computing and software optimization capabilities.
In the field of scientific computing, numerous chip manufacturers are making corresponding layouts on how to improve AI computing power and accelerate the practical application of AI. Intel is one of the leading companies in this track, committed to “making AI ubiquitous.”


In an interview with 36Kr, Intel AI architect Yang Wei provided a unique perspective and viewpoint on AI for Science from the perspective of a chip company.
Yang Wei believes that the main difficulty in popularizing AI for Science lies in how to reduce the cost of AI hardware and the need for easy-to-use AI software optimization tools.
He emphasized: since the second generation of Xeon Scalable Processors, Intel has achieved AI acceleration built into CPUs. Through AI acceleration technologies such as AVX-512 and DL Boost, “running AI on CPUs” has become possible. This move is significant as it fully activates and utilizes the computing power of CPUs that are more widely deployed and cost-effective, providing the general computing power required for most applications while also accelerating AI inference to promote the practical application of AI. At the same time, Intel has also open-sourced various AI software optimization tools for free, including oneAPI and OpenVINO, which have lower technical thresholds and usage difficulties, helping users unleash the AI acceleration capabilities of Xeon CPUs.
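The speedup from features like DL Boost comes largely from running inference on 8-bit integers instead of 32-bit floats, which requires quantizing weights and activations first. A toy symmetric per-tensor quantization sketch in NumPy, illustrative only and not Intel's actual tooling:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ≈ scale * q."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(weights)
error = float(np.abs(dequantize(q, scale) - weights).max())
print(error <= scale / 2 + 1e-6)  # True: rounding error is at most half a step
```

Int8 tensors are a quarter the size of fp32, and vector instructions can process four times as many of them per cycle, which is where the hardware-level inference acceleration comes from.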
Additionally, considering that models or similar variants in the AI for Science field are very sensitive to memory consumption and that CPU platforms typically have more advantages in computing resources for large memory applications, Intel has further strengthened this capability—its Intel Optane persistent memory, paired with Xeon CPUs, can provide far more capacity than mainstream DRAM, making it easier to achieve TB-level memory configurations while maintaining performance close to DRAM. This means it can minimize latency in the entire link of scientific computing models while breaking through the memory capacity bottleneck limiting AI for Science applications.
Although at this stage Intel's core hardware layout for AI for Science and other AI applications is CPU-centric, with accelerated workloads focused primarily on inference, this is merely the first step in expanding Intel's AI product portfolio in the XPU era. In Intel's "XPU vision," as data types and applications grow and evolve, the underlying hardware architecture will expand from CPUs to GPUs, FPGAs, and ASIC accelerators.
Based on this strategy, in 2023, Intel will not only launch the fourth generation Xeon Scalable Processor, codenamed Sapphire Rapids, but will also release a data center GPU product, codenamed Ponte Vecchio, designed for scientific computing and AI acceleration, thereby forming a more comprehensive layout where CPU is primarily used for AI inference with high cost-effectiveness, while GPU is primarily used for AI training. Moreover, this XPU combination can achieve unified programming and management of heterogeneous hardware through the oneAPI toolkit, featuring flexible allocation, seamless collaboration, and high efficiency.
With the powerful computing support brought by the aforementioned product combinations, Intel is optimizing AI for Science from multiple dimensions, striving to allow more researchers to personally participate in development and customization, achieving true popularization of scientific intelligence. With its ongoing efforts, many partners have already realized product implementation.

For example, in the field of AI small molecule drug design, Intel has collaborated with Jitai Bio to achieve high-throughput molecular generation in small molecule drug optimization, promising to explore more potential candidate molecules in a larger chemical space. In the field of macromolecule drug design, Intel has conducted in-depth cooperation with Baidu PaddlePaddle, Jingtai Technology, Shanghai Jiao Tong University, and other institutions and universities, optimizing high-throughput and long-sequence protein structure prediction inference based on AlphaFold 2, and has introduced TB-level memory technology into AlphaFold 2, achieving cost reduction and efficiency increase overall.


Intel and Baidu PaddlePaddle began their collaboration focused on software-hardware synergy as early as 2017. As both parties continue to expand their layouts in the AI field, the breadth and depth of their cooperation are continuously improving. For instance, Intel and Baidu PaddlePaddle are committed to achieving mutual support between Intel’s full-stack software and hardware and PaddlePaddle, optimizing performance through oneAPI, and co-building deployment ecology through PaddlePaddle + OpenVINO.
Interestingly, the current collaboration between Baidu PaddlePaddle and Intel in the AI for Science field is not only related to these prior collaborations but also intricately linked to the developer ecosystem.
For a long time, Baidu PaddlePaddle has actively cultivated its developer ecosystem, for instance by building the PaddlePaddle Special Interest Groups (PPSIG), hoping to co-create an open, diverse, and architecture-inclusive ecosystem with global developers. One Intel expert happens to be among the earliest members of the PPSIG scientific computing group; he actively participated in building the PaddlePaddle scientific computing open-source community and developed a strong interest in applying molecular dynamics simulation to biological protein molecules and energy materials.
In this context, collaboration between the two parties on AI for Science progressed naturally. Starting in March 2022, after multiple rounds of discussion, Baidu PaddlePaddle and Intel settled on task directions and cooperation content suited to their respective strengths, jointly carrying out substantial AI for Science work in molecular dynamics and the life sciences. The results include: PaddlePaddle became the first domestic deep learning framework to complete integration with the traditional molecular dynamics software LAMMPS and the AI potential-function training software DeePMD-kit, achieving breakthrough "0 to 1" progress across the full training-to-inference workflow on Intel's oneAPI; and the Baidu HelixFold model was optimized using the AVX-512, oneDNN, and large-memory capabilities of the Xeon platform, delivering significant performance improvements and comfortably handling ultra-long protein sequences with inference lengths exceeding 4,000.


Conclusion: The Inclusive Path of AI for Science is Nearing a Critical Point

One is Baidu PaddlePaddle, which has worked in deep learning for many years and grown into China's leading open-source AI framework; the other is Intel, a top player in scientific computing. Drawing on their respective flagship products and ongoing investments in AI, the two are continuously lowering the application threshold of AI for Science through flexible, varied combinations, jointly advancing toward the goals of "making AI ubiquitous and more inclusive across industries" and "spanning industry, academia, and research to help AI for Science bridge theory, experiment, and industrial application."
At this critical juncture, revisiting AI's seventy-plus years of development, we may see more clearly that each explosive phase created ripples of innovation in the long river of history, and today those ripples have finally converged into a great wave driving industrial transformation. AI for Science is pushing scientific research toward the critical point of paradigm innovation, and every participant is eagerly imagining the possibilities this shift will open up for humanity's future.
After all, the possibilities will be endless, akin to a nuclear fission chain reaction or the Cambrian explosion of life.
