
“Combining deep learning with graph computation, and enhancing the interpretability of deep learning with generative models such as Scalable Bayesian DL will be a key direction in the next 3 to 5 years. Moreover, it is crucial for algorithmic model results to interact with users. For example, simply recommending products is unlikely to change consumer perception significantly. However, if we can abstractly learn user group interests through the aggregation of embedding results, reflect these interests onto corresponding products, and express them through text, images, and video generation for explicit interaction with consumers, we can achieve a more successful and profound impact.”
The AliGraph platform began construction around June 2018, aiming to address challenges Alibaba encountered in operating on vast amounts of data. Among the departments facing the most significant data and model challenges are search recommendation and advertising, from which many of AliGraph's requirements have been derived. At the same time, AliGraph is a product of cross-departmental collaboration, with developers drawn from teams including the DAMO Academy's Intelligent Computing Lab and the Computing Platform PAI team. Its outcomes in turn serve many departments within Alibaba Group, such as search recommendation, advertising, security, and entertainment.
Most graph data related to real business scenarios exhibit four characteristics: large-scale, heterogeneous, attribute-rich, and dynamic. For instance, the current e-commerce graph usually contains billions of vertices and edges, which possess various types and rich attributes, rapidly evolving over time. These characteristics pose significant challenges for embedding and representing graph data, which can be summarized into the following four questions:
- How to improve the time and space efficiency of GNNs on large-scale graphs?
- How to elegantly integrate heterogeneous information into a unified embedding result?
- How to uniformly store and define topological structure information and unstructured attribute information?
- How to design effective incremental GNN methods for dynamic graphs?
Existing GNN methods mostly address only one or two of the above issues, while real-world commercial data often faces all of them at once, which is precisely the gap AliGraph aims to fill. It is a comprehensive, systematic GNN solution that provides a matching set of systems and algorithms to tackle these practical problems and to better support a variety of GNN methods and applications.
The AliGraph system consists of a storage layer, a sampling layer, and an operator layer. The storage layer holds large-scale raw data to satisfy the rapid data-access requirements of higher-level operations and algorithms; the sampling layer optimizes the sampling operations that are critical to GNN methods; the operator layer provides optimized implementations of two operators commonly used in GNN algorithms to accelerate computation. In addition, AliGraph offers a flexible interface for designing GNN algorithms, so that existing GNN methods can be implemented on it with little effort.
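The division of labour among the three layers can be sketched with a toy example. All class and function names here are illustrative assumptions, not AliGraph's actual API: a storage object holds adjacency and attributes, a sampler draws bounded neighbourhoods, and an operator (here, a neighbourhood mean) consumes the samples.

```python
import random

random.seed(42)

# Illustrative sketch of the three-layer split; these names are assumptions,
# not AliGraph's real interface.
class GraphStore:
    """Storage layer: adjacency lists plus vertex attributes."""
    def __init__(self, adj, feat):
        self.adj, self.feat = adj, feat

class NeighborSampler:
    """Sampling layer: draws at most k neighbours per vertex."""
    def __init__(self, store, k=2):
        self.store, self.k = store, k

    def sample(self, node):
        neigh = self.store.adj[node]
        return random.sample(neigh, min(self.k, len(neigh)))

def mean_aggregate(store, sampled):
    """Operator layer: the neighbourhood mean, one commonly optimised operator."""
    rows = [store.feat[v] for v in sampled]
    return [sum(col) / len(rows) for col in zip(*rows)]

store = GraphStore(adj={0: [1, 2], 1: [0], 2: [0]},
                   feat={0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]})
sampler = NeighborSampler(store)
print(mean_aggregate(store, sampler.sample(0)))  # → [1.0, 1.5]
```

Keeping sampling behind its own interface is what lets the platform optimize it independently of both storage and the GNN operators built on top.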
According to the team, the architectural design of the AliGraph platform follows two core principles. First, algorithms and systems are tightly integrated: hard problems are abstracted out of the business, foundational models are built for a few very important directions, and the resulting system requirements feed back into platform development. Second, algorithm and model design is considered comprehensively against Alibaba's actual business.
Currently, AliGraph has been practically deployed within Alibaba’s business systems and has also been launched on the Alibaba PAI platform, allowing both internal and external users to customize their algorithm models based on AliGraph to solve problems.
The product recommendation scenario in Taobao e-commerce is a typical case of AliGraph's application to real business. Here, GNNs are mainly applied to mining and aggregating user interests, representing those interests, and making recommendations interpretable. Under the Cloud Theme recommendation business, the AliGraph development team has made preliminary attempts to combine GNNs with automatic text generation, which not only mines and infers user interests but also improves the interpretability of the model. Actual business metrics such as discoverability, fatigue, and payment per thousand impressions improved by between 5% and 90%.
Yang Hongxia stated that the next generation of recommendation systems will not stop at single-item recommendations; it will influence consumers comprehensively through many kinds of content (text, images, videos, and so on), truly understanding and shaping consumer perception. Cloud Theme is a scenario being built for search and recommendation products: after user interests are aggregated through graph embeddings, a personalized group venue is formed. This moves from single-item recommendations to themed venues, cultivating cognitive upgrades and shaping consumer mindsets through many forms, including single-item recommendations, recommendation reasons, shopping guides, knowledge cards, scene-based ad placements, and video recommendations.
The interpretability of machine learning is becoming increasingly important: if an algorithm's results can be explained, decision-makers can judge whether to trust them. In the recommendation context, an interpretable recommendation pairs each result with supporting arguments, i.e., explanations. A common approach is post-hoc processing, in which explanations are generated after the recommendation results have been produced, so the explanation content is independent of the recommendation system itself; even with a different recommender, the same user and item yield the same explanation. Post-hoc work mainly studies the generation of explanatory text and falls into three types: rule-based, retrieval-based, and generative. Rule-based and retrieval-based methods depend on templates, which can make the text feel monotonous and unsurprising to users. The development team instead used marketing copy provided by sellers and the click-through rates of displayed products, applying improved sequence-generation techniques to generate text automatically and learning continuously from data to further improve the diversity and persuasiveness of the explanations.
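To see why template-based explanations feel monotonous, consider a minimal rule-based generator (an illustrative sketch, not the team's production seq2seq model): every user who triggers the same rule sees exactly the same wording, with only the slots filled in differently.

```python
# Minimal rule-based explanation generator (illustrative sketch only).
# Fixed templates mean identical phrasing for every user hitting the same rule.
TEMPLATES = {
    "repurchase": "You bought {item} before; here it is again at a new price.",
    "similar":    "People who viewed {anchor} also liked {item}.",
    "interest":   "Recommended because you follow {category}.",
}

def explain(rule, **slots):
    """Fill a fixed template with item/user slots."""
    return TEMPLATES[rule].format(**slots)

print(explain("similar", anchor="running shoes", item="sports socks"))
# → People who viewed running shoes also liked sports socks.
```

A generative model replaces the fixed `TEMPLATES` table with text sampled from a learned sequence model, which is what allows explanation diversity to keep improving with data.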
Regarding the choice of GNN models, after extensive online and offline experiments, the development team ultimately selected GraphSAGE, extending it to meet the needs of complex scenarios, including multi-edge, sampling, heterogeneous, and multi-modal modules.
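As a rough illustration of the base model the team extended, here is a minimal single-layer GraphSAGE step in NumPy: neighbour sampling, mean aggregation, a linear update, and L2 normalisation, as in the original GraphSAGE formulation. The graph, features, and weights below are random toy data; AliGraph's production extensions (multi-edge, heterogeneous, multi-modal modules) go well beyond this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a 4-node graph with 8-dim random features and random weights.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
feat = rng.normal(size=(4, 8))
W_self = rng.normal(size=(8, 4))    # transforms the node's own features
W_neigh = rng.normal(size=(8, 4))   # transforms the aggregated neighbourhood

def sage_layer(node, num_samples=2):
    """One GraphSAGE step: sample neighbours, mean-aggregate, update, normalise."""
    neigh = adj[node]
    sampled = rng.choice(neigh, size=min(num_samples, len(neigh)), replace=False)
    h_neigh = feat[sampled].mean(axis=0)                         # mean aggregator
    h = np.maximum(feat[node] @ W_self + h_neigh @ W_neigh, 0)   # ReLU update
    return h / (np.linalg.norm(h) + 1e-8)                        # L2 normalisation

emb = np.stack([sage_layer(v) for v in adj])
print(emb.shape)  # (4, 4)
```

Stacking several such layers widens the receptive field to 2-hop and 3-hop neighbourhoods, which is exactly where the sampling cost explodes at industrial scale.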
Ultimately, the GNN model is applied to Taobao's entire monthly active user base and the premium item pool, a graph with billions of nodes and more than a hundred billion edges, making it the largest heterogeneous attributed graph embedding model within Alibaba Group to date. From the learned user and item embeddings, user interests can be readily identified and relevant products recalled.
AliGraph has also achieved some breakthroughs in the following business metrics:
- Graph construction at the tens-of-billions scale completed within minutes
- Sampling performance of over 100 million vertices per second (200 nodes)
- Semi-GCN performance significantly improved: a 6-fold speedup for 2-hop and a 50-fold speedup for 3-hop computation
- A 3-fold performance improvement for the SparseKernel-based GCN algorithm with no loss in accuracy
- At the scale of 100 billion edges (200 workers, 20 parameter servers), lossless training performance, saving 3-10 hours of SQL time and roughly 300 TB of storage
- Model metrics such as discoverability, fatigue, and payment per thousand impressions improved by between 5% and 90%
Yang Hongxia believes that AI 1.0 can be roughly equated to deep learning models. In recent years, deep learning has achieved significant results in many fields, but it also has considerable limitations, the two most criticized bottlenecks being reasoning and interpretability. Both academia and industry have been exploring new directions, with Google's DeepMind and Academician Zhang Bo of Tsinghua University, among others, advocating what is termed AI 2.0. This at least indicates that, after deep learning, we should try some other approaches. The industry currently believes, broadly, that GNNs, which combine graph computation and deep learning, may address precisely these two shortcomings that deep learning cannot resolve.
In the industrial sector, especially among IT companies with vast amounts of data, the focus has traditionally been on making predictions, because a traffic dividend existed: as long as the data flow is large enough, an algorithm's predictive performance will not be too poor. Now, however, many platforms are competing for traffic, countless apps divide up consumers' effective time, and the traffic dividend has become quite limited. In this context, a company that merely makes predictions is doing far too little. More and more people are therefore beginning to attempt causal analysis and reasoning. For example, when recommending products on Taobao, inferring why a user needs a particular product and presenting that rationale alongside the recommendation to optimize the user experience is an important direction for future algorithm development.
GNNs have significant application potential in both prediction and causal reasoning, though they are not limited to these areas. Causal reasoning can also be done with other tools; statistical methods, for example, typically rest on strong assumptions. So how should we proceed when, in big-data settings, those assumptions cannot be satisfied? GNNs are one answer, but much work remains before causal reasoning can truly be put into practice.
Overall, GNNs possess very strong expressive capabilities; wherever deep learning models can be applied, GNNs can also be utilized.
However, just as deep learning heavily relies on massive amounts of data, GNNs similarly require large datasets. The algorithmic complexity of GNNs is higher, and they demand more from the data. Fortunately, the industry is not lacking data. When there is a vast amount of diverse data, the challenge lies in whether the expressive capabilities of the algorithmic models can match such complex data.
The current challenges in constructing large-scale GNNs primarily stem from high system demands. In the earlier era of small data, algorithms could run on a single machine. However, with the advent of the big data era, many distributed algorithms emerged. Yet, graphs have different requirements for systems; traditional distributed algorithm models randomly partition samples and send them to different workers for distributed computation. However, graphs are not suitable for simple random partitioning, as this may send adjacent nodes to different workers.
For GNNs, aggregating information from adjacent nodes is a crucial, indispensable step: neighbour information must be gathered onto the target node for incremental updates. Under random partitioning, those neighbours may be scattered across other workers, producing very high IO communication volumes. The data volume itself is not the problem; Alibaba routinely works at extremely large scales, potentially nearly a hundred billion nodes and over a trillion edges. But with random partitioning the system cannot absorb such massive IO consumption, and that is a significant challenge. It is precisely because GNNs place such high demands on systems that companies like Google, Amazon, and Microsoft are building their own GNN-related systems.
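The IO argument can be made concrete on a toy ring graph by counting cut edges, i.e., edges whose endpoints land on different workers and therefore require network traffic during aggregation. The graph, sizes, and partitioning schemes below are illustrative assumptions; production systems use dedicated graph partitioners rather than this naive block scheme.

```python
import random

random.seed(0)

# Toy ring graph: 100 nodes, each linked to its next two neighbours.
# (Real graphs at this scale reach hundreds of billions of edges.)
N, W = 100, 4
edges = [(v, (v + d) % N) for v in range(N) for d in (1, 2)]

def cut_edges(assign):
    """Edges whose endpoints sit on different workers require network IO."""
    return sum(assign[u] != assign[v] for u, v in edges)

random_assign = {v: random.randrange(W) for v in range(N)}   # random partitioning
block_assign = {v: v * W // N for v in range(N)}             # contiguous blocks

print("random cut:", cut_edges(random_assign), "/", len(edges))
print("block cut: ", cut_edges(block_assign), "/", len(edges))
```

With 4 workers, random assignment cuts roughly 75% of the edges, while locality-aware blocks cut only the few edges at block boundaries; on a graph with a hundred billion edges, that difference is the gap between an infeasible and a workable training job.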
In addition, another challenge in building large-scale GNNs comes from algorithms. How to isolate problems from business scenarios, design targeted algorithms, and then integrate them with systems is not a simple task.
Currently, the algorithms in the GNN platforms launched in the industry are still quite basic. AliGraph has made some achievements in heterogeneous graphs, attribute information, and ultra-large-scale graphs, but dynamic graphs and temporal graphs are still being explored. Because the computational complexity of dynamic graphs and temporal data is very high, it leads to long computation times, while the industry has high demands for latency, making it difficult to achieve both. Therefore, there are currently no particularly good enterprise-level case studies for dynamic GNNs.
In Yang Hongxia’s view, current GNNs resemble deep learning four years ago, having begun to demonstrate their effectiveness and garnering increasing attention, with their future potential already recognized. However, only by reducing computational complexity to a certain extent can they be better promoted in the industry.
It is understood that AliGraph is still in the intensive development process, with the development team optimizing various aspects, including performance and algorithm optimizations, sparse kernel optimizations, and hardware-level optimizations. The development of upper-layer APIs is also underway. Yang Hongxia revealed to AI Frontline that after rigorous testing in Alibaba’s actual business scenarios, AliGraph is planned to be open-sourced in December this year.
Yang Hongxia, Senior Algorithm Expert at the Damo Academy Intelligent Computing Laboratory, Ph.D. from Duke University. Author of over 40 top-tier papers. Former researcher at IBM Watson and Chief Data Scientist at Yahoo!. Currently dedicated to developing the next generation of reasoning systems that combine ultra-large-scale knowledge graphs and graph computation.