In recent years, enthusiasm for research on Graph Neural Networks (GNNs) in the deep learning community has been growing steadily, making GNNs a hot topic at major deep learning conferences. The remarkable ability of GNNs to handle unstructured data has led to breakthroughs in network data analysis, recommendation systems, physical modeling, natural language processing, and combinatorial optimization problems on graphs. There are many good surveys on Graph Neural Networks [1][2][3] that can be consulted, and more papers can be found in the GNN paper list compiled by Tsinghua University [4]. This article provides a brief summary of the currently popular GNN architectures, including GCN, GraphSAGE, GAT, and GAE, as well as graph pooling strategies such as DiffPool.
1. Why Do We Need Graph Neural Networks?
With the development of machine learning and deep learning, significant breakthroughs have been made in speech, image, and natural language processing. However, speech, images, and text are simple sequences or grid data, which are structured data types that deep learning is good at processing (Figure 1).
Figure 1

However, in the real world, not all entities can be represented as sequences or grids; consider social networks, knowledge graphs, and complex file systems (Figure 2). In other words, much real-world data is unstructured.

Figure 2

Compared to simple text and images, this kind of unstructured network data is very complex, and processing it poses several challenges:
The size of the graph is arbitrary, and the topology is complex, lacking spatial locality like images;
Graphs do not have a fixed node order, or in other words, there is no reference node;
Graphs are often dynamic and contain multimodal features.
So how should we model this type of data? Can deep learning be extended to model this type of data? These questions have prompted the emergence and development of Graph Neural Networks.
2. What Are Graph Neural Networks Like?
Compared to the most basic building block of neural networks, the fully connected layer (MLP), which multiplies a feature matrix by a weight matrix, Graph Neural Networks add an adjacency matrix. The computation is quite simple: multiply the three matrices together and apply a nonlinear transformation (Figure 3).

Figure 3

Thus, a common application pattern of Graph Neural Networks is shown in the following figure (Figure 4). The input is a graph, which passes through multiple layers of graph convolution and other operations along with activation functions, ultimately producing a representation for each node, which is then used for tasks such as node classification, link prediction, and graph or subgraph generation.

Figure 4

The above gives a simple, intuitive picture of Graph Neural Networks, but the underlying principles and logic are more involved and will be discussed in detail later. Next, we introduce the development of Graph Neural Networks through several classic GNN models.
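To make this concrete, here is a minimal sketch (not from the original article) of such a layer in PyTorch: a dense adjacency matrix `adj`, a feature matrix `x`, and a weight matrix are multiplied together and passed through a nonlinearity. All names, shapes, and the toy graph are illustrative.

```python
import torch
import torch.nn as nn

class SimpleGraphLayer(nn.Module):
    """One "adjacency x features x weights" layer, as sketched in Figure 3."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, x):
        # adj: (N, N) adjacency matrix, x: (N, in_dim) node feature matrix
        return torch.relu(adj @ self.weight(x))

# Toy 4-node ring graph: each node sums its neighbours' transformed features.
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
x = torch.randn(4, 8)
layer = SimpleGraphLayer(8, 16)
out = layer(adj, x)  # (4, 16) updated node representations
```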
3. Classic Models and Development of Graph Neural Networks
1. Graph Convolutional Networks (GCN) [5]

GCN can be regarded as the “foundational work” of Graph Neural Networks: it was the first to apply the convolution operation from image processing to graph-structured data, and it provides a derivation grounded in spectral graph theory. The derivation is fairly involved, but the final result is very simple (Figure 5).

Figure 5

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

Let’s take a look at this equation. My goodness, isn’t this just aggregating the features of neighboring nodes and applying a linear transformation? Yes, that’s exactly right. For GCN to capture information from K-hop neighbors, the authors stack multiple GCN layers; written per node, layer $k$ computes

$$h_v^{(k)} = \sigma\!\left(W^{(k)} \sum_{u \in \tilde{\mathcal{N}}(v)} \frac{h_u^{(k-1)}}{c_{uv}}\right),$$

where $\tilde{\mathcal{N}}(v)$ denotes the neighborhood of $v$ including $v$ itself and $c_{uv}$ is a normalization constant. The same update can be expressed in matrix form as

$$H^{(k)} = \sigma\!\left(\hat{A}\, H^{(k-1)} W^{(k)}\right),$$

where $\hat{A} = \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix (with $\tilde{A} = A + I$), and $H^{(k-1)} W^{(k)}$ is a linear transformation of the embeddings of all nodes in the previous layer; left-multiplying by the adjacency matrix means that each node’s new representation is the sum of its neighbors’ transformed features. (Note that with $\hat{A}$ in place of $A$, this is exactly the three-matrix multiplication mentioned in Figure 3.)

So how effective is GCN? The authors applied GCN to node classification, running experiments on datasets such as Citeseer, Cora, Pubmed, and NELL, with significant improvements over traditional methods. This is likely because GCN encodes the structural information of the graph and therefore learns better node representations.

Figure 6

Of course, the disadvantages of GCN are also quite evident. First, GCN requires the entire graph to be loaded into memory and GPU memory, which is very memory-intensive and rules out large graphs. Second, GCN needs the structure of the entire graph (including the nodes to be predicted) during training, which is infeasible in some real-world tasks (for example, using a model trained on today’s graph to make predictions on tomorrow’s data, where tomorrow’s nodes are not yet available).
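As a hedged sketch of the propagation rule above (not the authors’ reference implementation), one GCN layer can symmetrically normalize the self-loop-augmented adjacency matrix and then apply the shared weight matrix; stacking two such layers exposes each node to its 2-hop neighborhood. All dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """H' = sigma(D^-1/2 (A + I) D^-1/2 H W) -- a sketch of the rule above."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, h):
        a_tilde = adj + torch.eye(adj.size(0))        # add self-loops
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)     # degree^(-1/2)
        a_hat = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
        return torch.relu(a_hat @ self.weight(h))

class TwoLayerGCN(nn.Module):
    """Stacking K = 2 layers lets each node see its 2-hop neighbourhood."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.layer1 = GCNLayer(in_dim, hidden_dim)
        self.layer2 = GCNLayer(hidden_dim, num_classes)

    def forward(self, adj, x):
        return self.layer2(adj, self.layer1(adj, x))
```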
2. Graph Sample and Aggregate (GraphSAGE) [8]

GraphSAGE was proposed to address these two shortcomings of GCN. Before introducing GraphSAGE, let’s first explain Inductive Learning versus Transductive Learning. Graph data differs from other data in that every node can draw on information from other nodes through its edges. This creates a problem: GCN takes the whole graph as input, so when training nodes gather information from their neighbors, they also use samples from the test and validation sets; this setting is called Transductive Learning. However, most machine learning problems we handle are Inductive Learning, because we deliberately split the samples into training/validation/test sets and use only the training samples during training. The inductive setting is also beneficial for graphs, because it allows us to handle nodes that arrive in the graph later, using information from known nodes to generate embeddings for unseen nodes, and this is exactly what GraphSAGE does.

GraphSAGE is an Inductive Learning framework. In its concrete implementation, only the edges between training samples are retained during training, and the method consists of two major steps: Sample, which determines how to sample a node’s neighbors, and Aggregate, which determines how to combine the neighbors’ embeddings to update the node’s own embedding. The following figure illustrates the learning process of GraphSAGE.

Figure 7

The first step is to sample the neighbors;
The second step is to pass the sampled neighbors’ embeddings to the node and use an aggregation function to combine this neighbor information to update the node’s embedding;
The third step is to predict the node’s label based on the updated embedding.

Next, let’s look in detail at how a trained GraphSAGE model generates the embedding of a new node (i.e., one forward pass), as described in the paper’s embedding-generation algorithm: first (line 1) all node embeddings are initialized with the input feature vectors; then (line 3), for each node $v$, after its sampled neighbors are obtained, (line 4) the aggregation function combines the neighbors’ information, and (line 5) this aggregate is combined with the node’s own embedding and passed through a nonlinear transformation to update the node’s representation.

Note that in the algorithm, $K$ denotes the number of aggregators, which is also the number of weight matrices and the number of layers in the network, because the aggregator and weight matrix are shared within each layer. The number of layers can be understood as the maximum number of hops over which neighbors are visited. For example, in Figure 7 the update of the red node uses information from its first- and second-hop neighbors, so the network has 2 layers. To update the red node, in the first layer (k = 1) information is aggregated from the blue nodes to the red node and from the green nodes to the blue nodes. In the second layer (k = 2) the red node’s embedding is updated again, but this time using the already-updated embeddings of the blue nodes, so the red node’s final embedding contains information from both the blue and the green nodes, i.e., two-hop information.

To make this clearer, Figure 8 shows the update process for node A and node B: updating different nodes shares the same aggregator and weight matrix within each layer of the network.

Figure 8

So how does GraphSAGE sample? It uses fixed-size sampling: it defines the required number of neighbors $S$ and then uses resampling with replacement or negative sampling to reach exactly $S$ neighbors. This guarantees that every node has the same number of (sampled) neighbors, so that multiple nodes and their neighbors can be packed into a tensor for batched training on GPUs.

What aggregators does GraphSAGE offer? There are mainly three: the Mean aggregator, the LSTM aggregator, and the Pooling aggregator. It should be noted that the Mean aggregator behaves much like GCN (GCN effectively sums the values).

At this point the model architecture is complete. How does GraphSAGE learn the parameters of the aggregators and weight matrices? In the supervised case, the cross-entropy between each node’s predicted and true label can serve as the loss function. In the unsupervised case, we can assume that the embeddings of adjacent nodes should be as similar as possible, which leads to the following loss:

$$J_{\mathcal{G}}(z_u) = -\log\!\big(\sigma(z_u^{\top} z_v)\big) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\!\big(\sigma(-z_u^{\top} z_{v_n})\big)$$

How well does GraphSAGE work in practice? The authors report unsupervised and fully supervised results on datasets such as Citation, Reddit, and PPI, showing significant improvements over traditional methods.

This completes the introduction of GraphSAGE.
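Below is a minimal, illustrative sketch of one GraphSAGE layer with fixed-size neighbor sampling and a mean aggregator. The helper `sample_neighbors`, the dictionary-based adjacency list, and all dimensions are assumptions made for this example, not the authors’ code.

```python
import random
import torch
import torch.nn as nn

def sample_neighbors(neighbors, num_samples):
    """Fixed-size sampling: sample with replacement if a node has too few neighbours."""
    if len(neighbors) >= num_samples:
        return random.sample(neighbors, num_samples)
    return [random.choice(neighbors) for _ in range(num_samples)]

class SAGEMeanLayer(nn.Module):
    """h_v = sigma(W . concat(h_v, mean({h_u : u in sampled N(v)})))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj_list, num_samples):
        # h: (N, in_dim) node embeddings; adj_list: dict node id -> list of neighbour ids.
        # Assumes every node has at least one neighbour.
        agg = torch.stack([
            h[sample_neighbors(adj_list[v], num_samples)].mean(dim=0)
            for v in range(h.size(0))
        ])
        return torch.relu(self.weight(torch.cat([h, agg], dim=1)))
```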
Let’s summarize some advantages of GraphSAGE:

(1) Through its sampling mechanism, it neatly sidesteps GCN’s need to know the entire graph, overcoming the memory and GPU limitations of GCN training, and it can produce representations even for previously unseen new nodes;
(2) The parameters of the aggregators and weight matrices are shared across all nodes;
(3) The number of model parameters is independent of the number of nodes in the graph, so GraphSAGE can scale to larger graphs;
(4) It can handle both supervised and unsupervised tasks.

(I like this kind of idea, which solves the problem with a simple method and good results!!!)

Of course, GraphSAGE also has shortcomings: every node has many neighbors, but GraphSAGE’s sampling does not account for the fact that different neighbors have different importance, and during aggregation the neighbors are not weighted differently from the current node either.

3. Graph Attention Networks (GAT) [9]

To address the issue that GNNs do not consider the varying importance of neighboring nodes during aggregation, GAT borrows the idea of the Transformer and introduces a masked self-attention mechanism: when computing the representation of each node in the graph, it assigns different weights to neighbors according to their features. Concretely, for an input graph, a graph attention layer is illustrated in Figure 9:
Figure 9
Here the attention coefficient between node $i$ and its neighbor $j$ is computed by a single-layer feedforward neural network $a$ (note that the weight matrix $W$ is shared across all nodes):

$$\alpha_{ij} = \frac{\exp\!\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\Vert\, W h_j]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\!\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\Vert\, W h_k]\big)\big)}$$

After computing the attention, the new representation of a node is obtained by aggregating the information of its neighbors, weighted by the attention coefficients:

$$h_i' = \sigma\!\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\Big)$$

To enhance the model’s fitting ability, a multi-head self-attention mechanism is also introduced: several attention heads are computed simultaneously and their results are merged (by concatenation or averaging):

$$h_i' = \big\Vert_{m=1}^{M}\, \sigma\!\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{m} W^{m} h_j\Big)$$

Additionally, because of this structure, GAT does not require a pre-constructed full graph, only each node’s neighbors, which makes it suitable for both Transductive Learning and Inductive Learning.

So how effective is GAT? The authors ran experiments on three Transductive Learning tasks and one Inductive Learning task, with the results shown below. In both the transductive and the inductive settings, GAT outperforms traditional methods.

This completes the introduction of GAT. Let’s summarize some advantages of GAT:

(1) Training GAT does not require knowledge of the entire graph structure, only the neighboring nodes of each node;
(2) Computation is fast and can be parallelized across different nodes;
(3) It is suitable for both Transductive Learning and Inductive Learning and can handle unseen graph structures.

(Once again, a simple idea that solves the problem and performs well!!!)
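The attention computation above can be sketched as a single-head graph attention layer. This is an illustrative implementation assuming a dense adjacency matrix that already contains self-loops, not the authors’ released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single head: alpha_ij = softmax_j(LeakyReLU(a^T [W h_i || W h_j]))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared weight matrix W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # single-layer feedforward "a"

    def forward(self, adj, h):
        # adj: (N, N) adjacency matrix with self-loops; h: (N, in_dim) node features.
        wh = self.W(h)                                     # (N, out_dim)
        n = wh.size(0)
        # Build [W h_i || W h_j] for every pair (i, j).
        pairs = torch.cat([wh.unsqueeze(1).expand(n, n, -1),
                           wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)  # (N, N)
        # Masked self-attention: only attend over actual neighbours.
        e = e.masked_fill(adj == 0, float('-inf'))
        alpha = torch.softmax(e, dim=1)
        return F.elu(alpha @ wh)
```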
So far we have introduced the most classic GNN models: GCN, GraphSAGE, and GAT. Next, we introduce some popular GNN models and methods organized by task type.

4. Unsupervised Node Representation Learning

Because labeled data is expensive, being able to learn node representations effectively with unsupervised methods is of great value and significance, for example for discovering communities with similar interests or finding interesting structure in large-scale graphs.

Figure 10

Classic models in this area include GraphSAGE and the Graph Auto-Encoder (GAE). We have already covered GraphSAGE, so we will not repeat it here; next we describe the latter in detail.

Graph Auto-Encoder (GAE) [10]
Before introducing the Graph Auto-Encoder, it helps to understand Auto-Encoders and Variational Auto-Encoders; see [11] for details, which we will not repeat here.

Once you understand Auto-Encoders, the Variational Graph Auto-Encoder is much easier to grasp. As shown in Figure 11, the input is the graph’s adjacency matrix and the node feature matrix; an encoder (a graph convolutional network) learns the mean μ and variance σ of the nodes’ low-dimensional vector representations, and a decoder (link prediction) then regenerates the graph.

Figure 11

The encoder is a simple two-layer GCN, while the decoder computes the probability that an edge exists between each pair of nodes in order to reconstruct the graph. The loss function includes a distance term between the generated graph and the original graph, as well as the KL divergence between the distribution of the node representation vectors and a standard normal distribution. The specific formula is shown in Figure 12.

Figure 12

For comparison, the authors also proposed a (non-variational) Graph Auto-Encoder, which is simpler than the Variational Graph Auto-Encoder: its encoder again consists of two GCN layers, and its loss contains only the reconstruction term.

How effective are the two graph auto-encoders? The authors ran link prediction tasks on the Cora, Citeseer, and Pubmed datasets, with the experimental results shown in the following table. The Graph Auto-Encoder (GAE) and Variational Graph Auto-Encoder (VGAE) generally outperform traditional methods, with VGAE performing better overall; GAE, however, achieves the best results on Pubmed. This may be because the Pubmed network is larger and the more complex VGAE model is harder to tune.
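The following is a compact, illustrative sketch of the VGAE idea as described above: a two-layer GCN encoder (written here with a pre-normalized dense adjacency matrix `a_hat`) produces μ and log σ², an inner-product decoder reconstructs the adjacency matrix, and the loss combines a reconstruction term with a KL term. Module and function names are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VGAE(nn.Module):
    """Encoder: two-layer GCN -> (mu, logvar); Decoder: sigmoid(Z Z^T)."""
    def __init__(self, in_dim, hidden_dim, latent_dim, a_hat):
        super().__init__()
        self.a_hat = a_hat                       # pre-normalized (N, N) adjacency matrix
        self.gcn1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.gcn_mu = nn.Linear(hidden_dim, latent_dim, bias=False)
        self.gcn_logvar = nn.Linear(hidden_dim, latent_dim, bias=False)

    def encode(self, x):
        h = torch.relu(self.a_hat @ self.gcn1(x))
        return self.a_hat @ self.gcn_mu(h), self.a_hat @ self.gcn_logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return torch.sigmoid(z @ z.t()), mu, logvar               # reconstructed adjacency

def vgae_loss(adj_recon, adj_label, mu, logvar):
    # adj_label: (N, N) float tensor with entries in {0, 1}.
    recon = F.binary_cross_entropy(adj_recon, adj_label)           # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to standard normal
    return recon + kl
```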
5. Graph Pooling

Graph pooling is a popular class of operations in GNNs. Its goal is to obtain a representation of an entire graph, and it is mainly used for graph-level classification tasks such as supervised graph classification and document classification.

Figure 13

There are many graph pooling methods, such as simple max pooling and mean pooling, but these two methods are inefficient and ignore the order information of nodes. Here we introduce one method: Differentiable Pooling (DiffPool).

1. DiffPool [12]

In graph-level tasks, many current methods apply global pooling to all node embeddings, ignoring any hierarchical structure that may exist in the graph. This is especially problematic for graph classification, where the goal is to predict the label of the entire graph. To address this, a team from Stanford University proposed DiffPool, a differentiable pooling module for graph classification that can generate hierarchical representations of graphs and can be integrated into various graph neural networks in an end-to-end manner.

The core idea of DiffPool is to aggregate graph nodes hierarchically via a differentiable pooling module. Concretely, this module takes the node embeddings produced by the previous GNN layer together with an assignment matrix, softly assigns the nodes to clusters in the next layer in an end-to-end fashion, and then feeds these clusters into the next GNN layer, thereby stacking GNN layers hierarchically (Figure 14).

Figure 14

So how are the node embeddings and the assignment matrix computed? And once computed, how are nodes assigned to the next layer? This involves two parts: learning the assignment matrix, and pooling with the assignment matrix.
Learning the Assignment Matrix
Two separate GNNs are used to generate the assignment matrix and the new embeddings of the cluster nodes; both take the cluster node feature matrix and the coarsened adjacency matrix of the current layer as input:

$$Z^{(l)} = \mathrm{GNN}_{l,\mathrm{embed}}\big(A^{(l)}, X^{(l)}\big), \qquad S^{(l)} = \mathrm{softmax}\Big(\mathrm{GNN}_{l,\mathrm{pool}}\big(A^{(l)}, X^{(l)}\big)\Big)$$
Pooling with the Assignment Matrix
Once the assignment matrix and the new cluster-node embeddings have been obtained, the DiffPool layer uses the assignment matrix to generate a new coarsened adjacency matrix and a new embedding matrix for each node/cluster:

$$X^{(l+1)} = {S^{(l)}}^{\top} Z^{(l)}, \qquad A^{(l+1)} = {S^{(l)}}^{\top} A^{(l)} S^{(l)}$$

Overall, each DiffPool layer updates the embeddings of the cluster nodes and the cluster-node feature matrix, which can be summarized as

$$\big(A^{(l+1)}, X^{(l+1)}\big) = \mathrm{DiffPool}\big(A^{(l)}, Z^{(l)}\big).$$

A minimal code sketch combining these steps is given at the end of this subsection.

That completes the basic idea of DiffPool. How effective is it? The authors ran experiments on a variety of graph classification benchmarks, including protein datasets (ENZYMES, PROTEINS, D&D), a social network dataset (REDDIT-MULTI-12K), and a scientific collaboration dataset (COLLAB), with the results shown below.

Here, GraphSAGE uses global average pooling; DiffPool-DET is a DiffPool variant that uses a deterministic graph clustering algorithm to generate the assignment matrix; and DiffPool-NOLP is a variant with the link prediction objective removed. Overall, the DiffPool method achieves the highest average performance among the GNN pooling methods.

To further demonstrate DiffPool’s effectiveness for graph classification, the paper also applied it to another GNN architecture, Structure2Vec (S2V), building two variants for comparative experiments, as shown in the following table. DiffPool significantly improves the performance of S2V on the ENZYMES and D&D datasets.

Moreover, DiffPool can automatically learn an appropriate number of clusters.

Now let’s summarize the advantages of DiffPool:

(1) It can learn hierarchical pooling strategies;
(2) It can learn hierarchical representations of graphs;
(3) It can be integrated into various graph neural networks in an end-to-end manner.

It should be noted, however, that DiffPool also has a limitation: the assignment matrix requires a large amount of space to store, with space complexity $O(kV^2)$, where $k$ is the number of pooling layers, so it cannot handle very large graphs.
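To tie the two steps above together, here is a hedged sketch of a single DiffPool layer. The two internal GNNs are replaced by simple linear message-passing stand-ins, and all names and dimensions are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class DiffPoolLayer(nn.Module):
    """One DiffPool step: S = softmax(GNN_pool), Z = GNN_embed,
    X' = S^T Z, A' = S^T A S (a sketch of the equations above)."""
    def __init__(self, in_dim, out_dim, num_clusters):
        super().__init__()
        # Stand-ins for the two GNNs; any message-passing layer could be used here.
        self.gnn_embed = nn.Linear(in_dim, out_dim, bias=False)
        self.gnn_pool = nn.Linear(in_dim, num_clusters, bias=False)

    def forward(self, adj, x):
        z = torch.relu(adj @ self.gnn_embed(x))           # new node embeddings Z
        s = torch.softmax(adj @ self.gnn_pool(x), dim=1)  # soft cluster assignments S
        x_coarse = s.t() @ z                              # (num_clusters, out_dim)
        adj_coarse = s.t() @ adj @ s                      # (num_clusters, num_clusters)
        return adj_coarse, x_coarse
```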
References
[1] Graph Neural Networks: A Review of Methods and Applications. arXiv 2018. https://arxiv.org/pdf/1812.08434.pdf
[2] A Comprehensive Survey on Graph Neural Networks. arXiv 2019. https://arxiv.org/pdf/1901.00596.pdf
[3] Deep Learning on Graphs: A Survey. arXiv 2018. https://arxiv.org/pdf/1812.04202.pdf
[4] GNNPapers. https://github.com/thunlp/GNNPapers/blob/master/README.md
[5] Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017). https://arxiv.org/pdf/1609.02907
[6] How to Understand Graph Convolutional Networks (GCN)? https://www.zhihu.com/question/54504471
[7] GNN Series: The Foundational Work of Graph Neural Networks, the GCN Model. https://mp.weixin.qq.com/s/jBQOgP-I4FQT1EU8y72ICA
[8] Inductive Representation Learning on Large Graphs (NIPS 2017). https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf
[9] Graph Attention Networks (ICLR 2018). https://arxiv.org/pdf/1710.10903
[10] Variational Graph Auto-Encoders (NIPS 2016). https://arxiv.org/pdf/1611.07308
[11] VGAE (Variational Graph Auto-Encoders) Detailed Paper Explanation. https://zhuanlan.zhihu.com/p/78340397
[12] Hierarchical Graph Representation Learning with Differentiable Pooling (NIPS 2018). https://arxiv.org/pdf/1806.08
Highlights of This Article
1. The disadvantages of GCN are quite evident:
First, GCN needs to load the entire graph into memory and GPU memory, which can be very memory-intensive and cannot handle large graphs;
Second, GCN needs to know the entire graph’s structural information (including the nodes to be predicted) during training.
2. Advantages of GraphSAGE:
(1) By utilizing a sampling mechanism, it effectively addresses the issue of GCN needing to know the entire graph’s information, overcoming the memory and GPU limitations during GCN training, and can obtain representations even for unknown new nodes;
(2) The parameters of the aggregators and weight matrices are shared across all nodes;
(3) The number of model parameters is independent of the number of nodes in the graph, allowing GraphSAGE to handle larger graphs;
(4) It can handle both supervised and unsupervised tasks.
3. Advantages of GAT:
(1) Training GAT does not require knowledge of the entire graph structure, only the neighboring nodes of each node;
(2) Fast computation speed, allowing for parallel computation across different nodes;
(3) Suitable for both Transductive Learning and Inductive Learning, capable of processing unseen graph structures.
Source: Machine Learning Algorithms and Related Topics