Overview of Graph Attention Networks (GAT)


Author: Deng Yang





This article is approximately 6300 words long and is recommended for a 10-minute read.
This article briefly introduces the working principles of GAT based on the order discussed in the paper by Velickovic et al. (2017).



When numbers lack form, intuition is scarce; when forms lack number, it is hard to probe the subtleties. – Hua Luogeng

1 Introduction to Graph Attention Networks

1.1 Principles and Characteristics of GAT
A graph, built from points, lines, surfaces, and solids, is an effective tool for understanding and expressing abstract concepts. The advantage of graphical language lies in its ability to transcend linguistic barriers, a capability humanity has developed in order to understand the world. Rapid advances in computer science and artificial intelligence have made it possible to learn deeper objective relationships between things, and the birth of Graph Neural Networks (GNNs) has further helped people understand and solve problems through graphs. Graph Attention Networks (GAT) are neural networks designed specifically for processing graph-structured data. Unlike traditional neural networks, GAT explicitly takes the relationships between data points into account, allowing it to capture correlations in graph-structured data more accurately. The main advantage of GAT is that it learns the relationships between nodes automatically, without requiring them to be specified by hand.
The core working principle of GAT is to compute the relationships between nodes through an attention mechanism. In traditional neural networks, the state update of each node is performed independently; in GAT, the update of each node also takes the states of its neighboring nodes into account: GAT calculates attention weights between a node and its neighbors and updates the node's state based on these weights. By propagating information through these learned weights, GAT can better capture the structural information within the graph. For weight calculation and information aggregation, GAT employs a masked self-attention mechanism similar to that of the Transformer, consisting of stacked graph attention layers. Each graph attention layer takes node embeddings as input and outputs transformed embeddings, with each node's embedding attending to the embeddings of the nodes it is connected to (Velickovic et al., 2017). In practice, the attention scores are computed by structures called "attention heads." Each attention head computes its own set of attention scores, and the outputs of all heads are averaged or concatenated to obtain the final node embedding. This allows different heads to focus on different features or patterns, enabling GAT to capture more information. The specific mathematics is explained in the following sections.
Furthermore, GAT introduces the concept of graph pooling, a method for selecting the most informative subset of nodes, which can generate more distinctive graphs. During the graph pooling process, GAT uses a learnable projection vector to calculate the projection score of each node and selects the nodes to retain based on these scores. This method can further enhance the performance of GAT. Another important feature of GAT is model-level fusion. When dealing with complex problems, GAT can leverage different information sources through model-level fusion. This capability has already demonstrated GAT’s superiority in various fields, including image recognition, natural language processing, and recommendation systems. In image recognition, GAT can effectively handle the relationships between pixels in an image, thereby improving the accuracy of image recognition. In natural language processing, GAT can effectively manage the relationships between words in a text, thereby enhancing the accuracy of text understanding. In recommendation systems, GAT can efficiently process the relationships between users and products, improving the accuracy of recommendations.
1.2 Examples of GAT in Daily Life
To better understand Graph Attention Networks (GAT), we can reveal its working principles and applications through a daily life example.
In traditional Chinese weddings, seating arrangements are an important task. The organizers need to consider the relationships between all guests to ensure that everyone enjoys a pleasant experience at the wedding. This process can be viewed as a graph, where each guest represents a node, and the relationships between guests represent edges. The goal of the organizers is to find an optimal seating arrangement so that guests at each table can get along harmoniously.
Within the framework of GAT, this process is modeled as an attention mechanism. Each node (guest) is represented by a vector, referred to as an embedding, which can be seen as the features or attributes of the node. In this example, a guest's embedding may include their age, gender, interests, and other information. The attention mechanism works by calculating the similarity between each node (guest) and other nodes (other guests) to determine the importance of each node. This similarity is referred to as the attention score; in GAT it is computed by a learnable scoring function (a single-layer feed-forward network, rather than the dot-product attention used in the Transformer). The higher the attention score, the better the relationship between the two nodes, and the greater the likelihood that they will be seated together. In this scenario, if two guests have a high attention score, they may be seated at the same table. In this process, GAT also considers the table host. The host needs a high attention score because they must take care of every guest at the table and make sure everyone enjoys the wedding; this is akin to identifying the most important nodes in the graph.
However, just like in actual wedding seating arrangements, GAT also has some limitations. For instance, if there are a large number of guests, calculating the attention scores for each guest may become very complex. Furthermore, GAT may overlook some important information, such as some guests who may not have a good relationship with others but are significant figures at the wedding. This necessitates the introduction of more information when calculating attention scores, such as the guests’ status, their contributions to the wedding, etc.
Overall, GAT is a powerful tool that can help address some complex problems. However, it is also important to understand its limitations and to consider the specific circumstances of the problem when using it. By relating GAT to everyday experiences, we can better understand and apply this powerful tool. The following sections of this article will focus on explaining GAT’s working principles and some algorithmic design principles and mathematical knowledge.
2 Working Principles of GAT
This article briefly introduces the working principles of GAT according to the order discussed in the paper by Velickovic et al. (2017). If this is your first encounter with knowledge related to Graph Neural Networks, it is recommended to first visit DGL Team (2023) and LabML Team (2023) to understand the foundational work.
GAT is typically built by stacking multiple graph attention layers; below, a single graph attention layer is explained. The input, consisting of N nodes each with F features, can be denoted as:

$\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}, \quad \vec{h}_i \in \mathbb{R}^{F}$   (1)

Here, $\vec{h}_i$ represents the input features of node $i$. The layer produces a new set of node features as output, which can be expressed as:

$\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}, \quad \vec{h}'_i \in \mathbb{R}^{F'}$   (2)

Here, $F'$ is the number of output features per node. To transform the input features into higher-level features, at least one learnable linear transformation is required. In (Velickovic et al., 2017), the authors apply a shared linear transformation to every node, parameterized by a weight matrix $\mathbf{W} \in \mathbb{R}^{F' \times F}$.
2.1 Self-Attention Mechanism
Unlike generic attention between two different sequences, self-attention computes the weights among the nodes of the same graph, deriving the weights from the importance of the relationships between nodes. Combining this with the matrix $\mathbf{W}$ introduced above, (Velickovic et al., 2017) define a shared attention mechanism $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$. For node $i$, the importance of node $j$'s features can then be measured by the following equation:

$e_{ij} = a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_j)$   (3)

In its most general form, this mechanism allows every node to attend to every other node, discarding all structural information. In practice, masked attention is used: $e_{ij}$ is only computed for nodes $j$ in the neighborhood of node $i$, which injects the graph structure into the mechanism.
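To make eq. (3) concrete, here is a minimal sketch (in PyTorch) of how the unnormalized scores $e_{ij}$ could be computed for a small graph. The tensor names (h, adj, W, a) and the toy 4-node adjacency matrix are illustrative assumptions, not code from the original paper.

import torch

N, F_in, F_out = 4, 8, 16
h = torch.randn(N, F_in)                       # input node features h_i
adj = torch.tensor([[1, 1, 1, 1],              # toy adjacency matrix (self-loops included)
                    [1, 1, 0, 0],
                    [1, 0, 1, 1],
                    [1, 0, 1, 1]], dtype=torch.bool)

W = torch.nn.Linear(F_in, F_out, bias=False)   # shared linear transformation W
a = torch.nn.Linear(2 * F_out, 1, bias=False)  # attention mechanism a: R^{2F'} -> R

Wh = W(h)                                      # W h_i for every node, shape (N, F_out)
# Score every ordered pair: e_ij = a([W h_i || W h_j])
pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, F_out),
                   Wh.unsqueeze(0).expand(N, N, F_out)], dim=-1)
e = a(pairs).squeeze(-1)                       # unnormalized attention scores, shape (N, N)
# Masked attention: keep e_ij only where j is a neighbor of i
e = e.masked_fill(~adj, float("-inf"))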
2.2 Multi-head Attention Mechanism
Whereas the single attention mechanism produces one output $\vec{h}'_i$ per node, the multi-head attention mechanism computes a separate $\vec{h}'^{\,k}_i$ in each attention head $k$. The feature vectors produced by the individual heads are concatenated, which can be written as:

$\vec{h}'_i = \big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^{k}_{ij} \mathbf{W}^{k} \vec{h}_j\Big)$   (4)

Under the multi-head attention mechanism, the final output no longer has $F'$ features but $KF'$ features. Alternatively, the results of the $K$ heads can be averaged instead of concatenated, $\vec{h}'_i = \sigma\big(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha^{k}_{ij} \mathbf{W}^{k} \vec{h}_j\big)$, which is typically done in the final (prediction) layer.
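As a small illustration (not taken from the paper), the two ways of combining the outputs of K attention heads can be sketched as follows, assuming each head has already produced its (N, F') output:

import torch

K, N, F_out = 8, 4, 16
head_outputs = [torch.randn(N, F_out) for _ in range(K)]   # one output per attention head

# Hidden layers: concatenate the heads -> K * F_out features per node (eq. (4))
h_concat = torch.cat(head_outputs, dim=-1)                 # shape (N, K * F_out)

# Final (prediction) layer: average the heads -> F_out features per node
h_mean = torch.stack(head_outputs, dim=0).mean(dim=0)      # shape (N, F_out)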
For more detailed explanations and mathematical derivations, interested readers can continue their studies here: (Graph Attention Networks Experiment 2022; Graph Attention Networks 2022).

2.3 Step-by-Step Illustration

This article follows the example from (LaBonne, 2022) to illustrate how the calculations introduced above are applied to a single node. For node 1, shown in Figure 1, the self-attention embedding is computed as:


Figure 1: Example Illustration

$\vec{h}'_1 = \sum_{j \in \mathcal{N}_1} \alpha_{1j} \mathbf{W} \vec{h}_j$   (5)

Here $\alpha_{ij}$ still represents the importance of the relationship between the features of nodes $i$ and $j$, and $\vec{h}_j$ is the attribute (feature) vector of each node. Based on this formula, the graph attention mechanism computes the embedding of node 1. The attention coefficients themselves are obtained through a 'four-step process' (LaBonne, 2022): a linear transformation, an activation function, softmax normalization, and a multi-head attention mechanism, which together learn the attention scores related to node 1.

In the first step, the importance of the connection between two nodes is measured by forming a hidden vector pair: the transformed feature vectors of the two nodes are concatenated, and a linear transformation with a weight matrix $\mathbf{W}_{att}$ is applied to the pair:

$a_{ij} = \mathbf{W}_{att}^{T}\,[\mathbf{W}\vec{h}_i \,\Vert\, \mathbf{W}\vec{h}_j]$   (6)

In the second step, a LeakyReLU activation function is applied:

$e_{ij} = \mathrm{LeakyReLU}(a_{ij})$   (7)

In the third step, the outputs are normalized so that the scores of different neighbors can be compared on a common scale:

$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \dfrac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$   (8)

The normalized attention scores can now be calculated and compared, but a new problem arises: self-attention by itself is very unstable. Velickovic et al. (2017) therefore proposed a multi-head attention mechanism, following the Transformer architecture.

In the fourth step, the multi-head attention mechanism described earlier is applied: each of the $K$ heads computes its own attention scores and aggregation, and the results are averaged (or concatenated) to obtain the final embedding of node 1:

$\vec{h}'_1 = \dfrac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_1} \alpha^{k}_{1j} \mathbf{W}^{k} \vec{h}_j$   (9)
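Putting the four steps together, the sketch below shows one possible minimal implementation of a single graph attention layer in PyTorch. The class name, the dense adjacency-matrix input, and the LeakyReLU slope of 0.2 are choices made here for illustration; this is not the reference code of (Velickovic et al., 2017) or (LaBonne, 2022).

import torch
import torch.nn as nn
import torch.nn.functional as F


class GATLayer(nn.Module):
    """Minimal single graph attention layer following the four-step recipe:
    (1) shared linear transformation, (2) LeakyReLU attention scores,
    (3) softmax normalization over neighbors, (4) multi-head aggregation."""

    def __init__(self, in_features, out_features, n_heads=4, concat=True):
        super().__init__()
        self.concat = concat
        # One weight matrix W and one attention vector a per head
        self.W = nn.Parameter(torch.empty(n_heads, in_features, out_features))
        self.a = nn.Parameter(torch.empty(n_heads, 2 * out_features))
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        # h: (N, in_features) node features, adj: (N, N) bool adjacency with self-loops
        N = h.size(0)
        # Step 1: shared linear transformation, independently per head -> (H, N, out)
        Wh = torch.einsum("nf,hfo->hno", h, self.W)
        # Step 2: e_ij = LeakyReLU(a^T [W h_i || W h_j]), decomposed as a_src^T W h_i + a_dst^T W h_j
        a_src, a_dst = self.a.chunk(2, dim=-1)                     # each (H, out)
        e_src = (Wh * a_src.unsqueeze(1)).sum(-1)                  # (H, N)
        e_dst = (Wh * a_dst.unsqueeze(1)).sum(-1)                  # (H, N)
        e = F.leaky_relu(e_src.unsqueeze(2) + e_dst.unsqueeze(1), negative_slope=0.2)
        # Step 3: keep only neighbors, then normalize each row with softmax
        e = e.masked_fill(~adj.unsqueeze(0), float("-inf"))
        alpha = torch.softmax(e, dim=-1)                           # attention coefficients alpha_ij
        # Step 4: weighted sum of neighbor features per head, then combine the heads
        h_prime = torch.einsum("hij,hjo->hio", alpha, Wh)          # (H, N, out)
        if self.concat:
            return h_prime.permute(1, 0, 2).reshape(N, -1)         # hidden layer: concatenate heads
        return h_prime.mean(dim=0)                                 # output layer: average heads


# Toy usage: 4 nodes, node 0 connected to the other three (plus self-loops)
h = torch.randn(4, 8)
adj = torch.eye(4, dtype=torch.bool)
adj[0, 1:] = True
adj[1:, 0] = True
out = GATLayer(in_features=8, out_features=16, n_heads=4, concat=True)(h, adj)   # shape (4, 64)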

3 Applications of GAT in Combinatorial Optimization Problems

3.1 Combinatorial Optimization Problems

Combinatorial optimization is a core area of operations research and computer science, and a standard entry point for scholars beginning to study operations research. It underlies many practical applications such as logistics, scheduling, and network design. In logistics, for example, combinatorial optimization can help find optimal delivery routes under complex transportation conditions, saving transportation costs and improving freight efficiency. In scheduling problems, it helps allocate resources effectively so that various constraints are satisfied while a desired objective value (commonly referred to as the objective function) is maximized or minimized.

However, traditional combinatorial optimization algorithms often have to be designed from scratch for each new problem and require experts to carefully analyze the problem structure. Solving combinatorial optimization problems also typically requires substantial computational resources; for real-world problems the instance sizes are often enormous, and traditional optimization algorithms may not find solutions within a reasonable time, or at all. How to solve combinatorial optimization problems effectively has therefore long been a focal point for researchers. In recent years, formulating combinatorial optimization problems as single-player games defined by states, actions, and rewards, in the spirit of dynamic programming over Markov decision processes, has made it possible to solve problems such as minimum spanning trees, shortest paths, the traveling salesman problem (TSP), and vehicle routing problems (VRP) without expert knowledge. This approach uses reinforcement learning to train graph (attention) neural networks on unlabeled sets of training graphs; the trained network can then output approximate solutions for new graph instances in linear running time. In TSP, GAT can effectively handle the distance relationships between cities to find the shortest tour; in VRP, it can handle the relationships between vehicles, customers, and depots to find optimal delivery routes. These results indicate that GAT has great potential for solving combinatorial optimization problems.

3.2 Case Study of GAT in Path Planning Papers

(Kool et al., 2018) proposed an attention-based model similar to GAT to solve various routing problems, including the TSP, VRP, OP (orienteering problem), and others. The paper formulates routing problems such as the TSP as problems on a graph, with each customer's location and other information serving as node features. An attention-based encoder-decoder model defines a stochastic policy $p_\theta(\pi \mid s)$ for selecting a solution (tour) $\pi$ given a problem instance $s$; the policy is parameterized by $\theta$ and factorized as:

$p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1})$   (10)

The decoding process is sequential: at each time step $t$, the decoder outputs the next node $\pi_t$ based on the encoder's embeddings and the outputs generated before time $t$. During decoding, a special context node is added to represent the decoding context. The decoder computes an attention (sub)layer on top of the encoder, but for efficiency only the context node receives messages. The final output probabilities are computed with a single-head attention mechanism. At time $t$, the decoder's context comes from the encoder and the outputs prior to $t$; for the TSP this consists of the graph embedding, the previous (last) node $\pi_{t-1}$, and the first node $\pi_1$. To compute the output probabilities, a final decoder layer with a single attention head is added. The model is trained by gradient descent on the loss $L$ using the REINFORCE gradient estimator with a baseline. The authors use a rollout baseline: the baseline is the cost of a deterministic greedy rollout of the best policy found so far, and this baseline policy is updated periodically.
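The structure of the REINFORCE update with a greedy rollout baseline can be sketched as follows. The model(batch, decode_type=...) interface returning a (cost, log_likelihood) pair per instance is a hypothetical stand-in, used only to show the shape of the gradient estimator, not the paper's actual API.

import torch

def reinforce_step(model, baseline_model, batch, optimizer):
    """One policy-gradient update with a greedy rollout baseline (illustrative sketch)."""
    # Sample a solution from the current policy and record its log-probability
    cost, log_likelihood = model(batch, decode_type="sampling")
    # Baseline: cost of a deterministic greedy rollout of the frozen baseline policy
    with torch.no_grad():
        baseline_cost, _ = baseline_model(batch, decode_type="greedy")
    # REINFORCE estimator: grad L = E[(cost - baseline) * grad log p_theta(pi | s)]
    advantage = cost - baseline_cost
    loss = (advantage * log_likelihood).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After each epoch, the baseline policy is replaced by the current policy if the
# current policy performs significantly better on a held-out evaluation set.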

The paper also discusses in detail how different problems are handled. For example, for the prize-collecting traveling salesman problem (PCTSP), the authors use separate parameters in the encoder for the depot node, and provide node prizes and penalties as input features. In the decoder context, they use the current/last position and the remaining prize to collect. In the PCTSP, the depot node may not be visited while the remaining prize to collect is greater than zero and not all nodes have been visited; an ordinary node can only be masked (made inaccessible) once it has been visited.
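The masking rule just described can be sketched as follows; the function and argument names are illustrative, and the depot is assumed to be node 0.

import torch

def pctsp_mask(visited, remaining_prize):
    """Feasibility mask for the PCTSP decoder (illustrative sketch).

    visited:         (batch, n_nodes) bool, True where a node has already been visited
    remaining_prize: (batch,) prize that still has to be collected
    Returns a bool mask where True means the node may NOT be visited next.
    """
    mask = visited.clone()                                    # a node is masked only once visited
    # The depot (node 0) is forbidden while prize remains to be collected
    # and there are still unvisited customer nodes.
    depot_forbidden = (remaining_prize > 0) & ~visited[:, 1:].all(dim=-1)
    mask[:, 0] = mask[:, 0] | depot_forbidden
    return mask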

Due to space limitations, this article focuses on how Kool et al. (2018) pass weighted information between nodes in the TSP using a graph attention mechanism. In the constructed graph, each node receives weighted information from itself and from its neighbors, and the weight of each neighbor's value depends on the compatibility of the node's query with that neighbor's key, as shown in Figure 6. The authors define dimensions $d_k$ and $d_v$ and compute, for every node $i$, a key $k_i \in \mathbb{R}^{d_k}$, a value $v_i \in \mathbb{R}^{d_v}$, and a query $q_i \in \mathbb{R}^{d_k}$. These $q_i$, $k_i$, $v_i$ are obtained by projecting the node embedding $h_i$:

$q_i = W^{Q} h_i, \quad k_i = W^{K} h_i, \quad v_i = W^{V} h_i$   (11)

where $W^{Q}$ and $W^{K}$ are parameter matrices of size $d_k \times d_h$, and $W^{V}$ has size $d_v \times d_h$. (Readers interested in a deeper understanding of how $q$, $k$, $v$ are defined and computed in the Transformer may refer to WMathor (2020).)

The compatibility of two nodes is computed as the value $u_{ij}$ between node $i$'s query $q_i$ and node $j$'s key $k_j$ (Kool et al., 2018):

$u_{ij} = \begin{cases} \dfrac{q_i^{T} k_j}{\sqrt{d_k}} & \text{if node } i \text{ is adjacent to node } j, \\ -\infty & \text{otherwise} \end{cases}$   (12)

Finally, the attention weights $a_{ij}$ are obtained from a softmax over the compatibilities, and node $i$ receives a vector $h'_i$ that is a convex combination of the value vectors $v_j$:

$a_{ij} = \dfrac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}}, \qquad h'_i = \sum_{j} a_{ij} v_j$   (13)
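A minimal sketch of eqs. (11)-(13) for a single attention head is given below; the function name and the toy fully connected graph are assumptions made here for illustration, not the authors' implementation.

import math
import torch

def attention_message_passing(h, adj, W_Q, W_K, W_V):
    """Single-head attention over a graph, following eqs. (11)-(13).

    h:        (N, d_h) node embeddings
    adj:      (N, N) bool adjacency, True where node j is a neighbor of node i
    W_Q, W_K: (d_k, d_h) parameter matrices, W_V: (d_v, d_h)
    """
    d_k = W_Q.shape[0]
    q = h @ W_Q.T                                  # queries q_i, eq. (11)
    k = h @ W_K.T                                  # keys k_i
    v = h @ W_V.T                                  # values v_i
    u = (q @ k.T) / math.sqrt(d_k)                 # compatibilities u_ij, eq. (12)
    u = u.masked_fill(~adj, float("-inf"))         # non-neighbors get u_ij = -inf
    a = torch.softmax(u, dim=-1)                   # attention weights a_ij
    return a @ v                                   # h'_i = sum_j a_ij v_j, eq. (13)

# Toy usage on a fully connected graph (as in the TSP, where every node is a neighbor)
N, d_h, d_k, d_v = 5, 16, 8, 8
h = torch.randn(N, d_h)
adj = torch.ones(N, N, dtype=torch.bool)
W_Q, W_K, W_V = torch.randn(d_k, d_h), torch.randn(d_k, d_h), torch.randn(d_v, d_h)
h_prime = attention_message_passing(h, adj, W_Q, W_K, W_V)   # shape (N, d_v)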

4 Conclusion

4.1 Future Development and Application Prospects of GAT

Graph Attention Networks (GAT) have repeatedly demonstrated their ability to solve combinatorial optimization problems, particularly the traveling salesman problem (TSP) and the vehicle routing problem (VRP). It should be noted, however, that although GAT performs well on these problems, it is not a panacea: some problems may still require purpose-built models or algorithms. When studying a problem, it is therefore important to choose the appropriate tool based on the specifics of the problem and on the strengths and limitations of GAT.

GAT also plays various roles in other fields. For example, Zhang et al. (2022) proposed a new GAT architecture that can capture potential associations between graph knowledge at different scales. This new GAT architecture outperforms traditional GAT models in terms of prediction accuracy and training speed.

Additionally, Shao et al. (2022) proposed a new dynamic multi-graph attention model that can handle long-term spatiotemporal prediction problems. This model constructs new graph models to represent the contextual information of each node and utilizes long-term spatiotemporal data dependency structures. Experiments on two large-scale datasets indicate that it can significantly enhance the performance of existing graph neural network models in long-term spatiotemporal prediction tasks.

In stock market prediction, GAT also has extensive applications. Zhao et al. (2022) proposed a stock movement prediction method based on a dual attention network. First, a market knowledge graph containing two types of entities (including listed companies and related executives) and mixed relationships (including explicit and implicit relationships) is constructed. Then, a dual attention network is proposed, which can learn momentum overflow signals in the market knowledge graph for stock prediction. Experimental results show that this method outperforms nine state-of-the-art baseline methods in stock prediction.

Overall, the graphical perspective provides a new way to understand and solve problems. Thinking about and transforming existing problems into graphical forms not only reveals new aspects and characteristics of the problems but may also spark new innovations. Similarly, thinking about new problems using graphical methods may lead to unexpected gains. The advantage of this approach is that it helps scholars better understand the structure and complexity of problems, thereby finding more effective solutions. It is hoped that readers can gain a better understanding of the principles of graph neural networks from this introduction to GAT and apply it more in their studies and research, providing strong support for problem-solving through the use of GAT.

Author Biography

Author Deng Yang is a joint PhD student from the School of Management at Xi’an Jiaotong University and the School of Systems Engineering at City University of Hong Kong, focusing on the application of reinforcement learning in urban logistics. He graduated with a Master’s degree in Transportation Engineering from the University of Southern California, with honors. He has previously worked at the engineering consulting firm HATCH in Los Angeles, California, and is a registered EIT in California. He has received titles such as ‘Thirty Under Thirty Outstanding Young Entrepreneurs’ from AACYF in 2017 and ‘Top Ten Outstanding Young Chinese Americans’ in 2019. Currently, he serves as the Vice President of the Baotou Overseas Chinese Association and is a member of the Overseas Chinese Federation and the European and American Alumni Association.

References
1. DGL Team. Graph Attention Network (GAT). Deep Graph Library (DGL). https://docs.dgl.ai/en/0.8.x/tutorials/models/1_gnn/9_gat.html (2023).
2. Graph Attention Networks LabML. https://nn.labml.ai/graphs/gat/index.html (2023).
3. Graph Attention Networks Experiment LabML. https://nn.labml.ai/graphs/gat/experiment.html (2023).
4. Khalil, E., Dai, H., Zhang, Y., Dilkina, B. & Song, L. Learning combinatorial optimization algorithms over graphs. Advances in neural information processing systems 30 (2017).
5. Kool, W., Van Hoof, H. & Welling, M. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475 (2018).
6. LabML Team. Graph Neural Networks LabML. https://nn.labml.ai/graphs/index.html (2023).
7. LaBonne, M. Graph Attention Networks: Theoretical and Practical Insights https://mlabonne.github.io/blog/posts/2022-03-09-graph_attention_network.html (2023).
8. Shao, W., Jin, Z., Wang, S., Kang, Y., Xiao, X., Menouar, H., Zhang, Z., Zhang, J. & Salim, F. Long-term Spatio-Temporal Forecasting via Dynamic Multiple-Graph Attention. arXiv preprint arXiv:2204.11008 (2022).
9. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P. & Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
10. WMathor. Graph Attention Networks https://wmathor.com/index.php/archives/1438/ (2023).
11. Zhang, W., Yin, Z., Sheng, Z., Li, Y., Ouyang, W., Li, X., Tao, Y., Yang, Z. & Cui, B. Graph attention multilayer perceptron in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022), 4560–4570.
12. Zhao, Y., Du, H., Liu, Y., Wei, S., Chen, X., Zhuang, F., Li, Q. & Kou, G. Stock Movement Prediction Based on Bi-Typed Hybrid-Relational Market Knowledge Graph Via Dual Attention Networks. IEEE Transactions on Knowledge and Data Engineering (2022).
Editor: Wang Jing

