Analyzing the Interpretability of Neural Networks: 14 Attribution Algorithms

Source: Machine Heart
For academic sharing only, please contact for deletion if infringing

Despite the widespread success of DNNs in various practical applications, their processes are often viewed as black boxes because it is difficult to explain how DNNs make decisions. The lack of interpretability undermines the reliability of DNNs, thereby hindering their widespread application in high-risk tasks, such as autonomous driving and AI healthcare. Therefore, interpretable DNNs have received increasing attention.

As a typical perspective for explaining DNNs, attribution methods aim to compute an attribution/importance/contribution score of each input variable to the network output. For example, given a pre-trained DNN for image classification and an input image, the attribution score of each input variable refers to the numerical influence of that pixel on the classification confidence score.

Although many attribution methods have been proposed in recent years, most of them are built on different heuristics. There is currently no unified theoretical perspective for examining the correctness of these attribution methods, or at least for mathematically clarifying their core mechanisms.

Researchers have attempted to unify different attribution methods, but these studies only cover a few methods.

In this paper, we propose a unified explanation of the internal mechanisms of 14 input-unit importance attribution algorithms.


Paper link: https://arxiv.org/pdf/2303.01506.pdf

Both the “12 algorithms to enhance adversarial transferability” and the “14 input-unit importance attribution algorithms” come from fields flooded with engineering-driven algorithms. In these two areas, most algorithms are empirical, designed from experimental experience or intuitive understanding, which has produced many seemingly plausible engineering algorithms. Most studies have not rigorously defined or theoretically demonstrated what “input unit importance” actually is, and the few that offer some justification are often incomplete. Of course, the problem of “lacking rigorous definitions and proofs” pervades the entire field of artificial intelligence, but it is particularly prominent in these two directions.

  • First, in an environment where numerous empirical attribution algorithms flood the field of interpretable machine learning, we hope to prove that “the intrinsic mechanisms of all 14 attribution algorithms (algorithms that explain the importance of input units in neural networks) can be represented as an allocation of the interaction effects modeled by the neural network, with different attribution algorithms corresponding to different allocation ratios of these effects”. Thus, although different algorithms have completely different design focuses (some have overarching objective functions, while others are purely engineering pipelines), we find that, mathematically, all of them fit into the same narrative of “allocating interaction effects”.
  • Based on this interaction-effect allocation framework, we further propose three evaluation criteria for input-unit importance attribution algorithms, which assess whether the importance values they assign to input units are reasonable.

Of course, our theoretical analysis is not limited to these 14 attribution algorithms; in principle it can unify more related studies. Due to limited manpower, we discuss only these 14 algorithms in this paper.

The real difficulty of this research lies in the fact that different empirical attribution algorithms are built on different intuitions, and each paper merely strives to “justify itself” from its own perspective, while the field lacks a standardized mathematical language in which the essence of the various algorithms can be described uniformly.

Algorithm Review

Before turning to the mathematics, this article briefly reviews previous algorithms at an intuitive level.

1. Gradient-Based Attribution Algorithms. This type of algorithm generally assumes that the gradient of the neural network output with respect to each input unit reflects the importance of that input unit. For instance, the Gradient*Input algorithm models the importance of an input unit as the element-wise product of the gradient and the input value. Since the gradient only reflects the local importance of an input unit, the Smooth Gradients and Integrated Gradients algorithms model importance as the element-wise product of an average gradient and the input value (for Integrated Gradients, the input value relative to the baseline point), where the average gradient is taken over a neighborhood of the input sample (Smooth Gradients) or over the linear interpolation points between the input sample and the baseline point (Integrated Gradients). Similarly, the Grad-CAM algorithm uses the average gradient of the network output with respect to all features in each channel to compute channel-wise importance scores. Furthermore, the Expected Gradients algorithm argues that selecting a single baseline point often leads to biased attribution results, and therefore models importance as the expectation of the Integrated Gradients attribution over different baseline points.
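As a concrete illustration of these gradient-based methods, below is a minimal PyTorch sketch of Gradient*Input and Integrated Gradients. It assumes a classifier that takes a batched input and returns class scores; the zero baseline and the number of interpolation steps are illustrative choices, not details taken from the paper.

```python
import torch

def gradient_x_input(model, x, target_class):
    """Gradient*Input: element-wise product of the gradient and the input."""
    x = x.detach().clone().requires_grad_(True)
    score = model(x)[0, target_class]          # scalar class score (batch size 1)
    grad = torch.autograd.grad(score, x)[0]    # d(score) / d(x)
    return grad * x

def integrated_gradients(model, x, target_class, baseline=None, steps=50):
    """Integrated Gradients: average gradient along the straight path from the
    baseline to the input, multiplied by (input - baseline)."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = model(point)[0, target_class]
        total_grad += torch.autograd.grad(score, point)[0]
    return (x - baseline) * (total_grad / steps)
```

Expected Gradients can be obtained from the same sketch by averaging `integrated_gradients` over baselines sampled from a reference data set.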

2. Layer-Wise Backpropagation-Based Attribution Algorithms. A deep neural network as a whole is often extremely complex, whereas each individual layer is relatively simple (for example, a deep feature is usually a linear combination of shallow features followed by a nonlinear activation), which makes it easier to analyze the importance of shallow features to deep features. This type of algorithm therefore estimates the importance of intermediate-layer features and propagates this importance layer by layer back to the input layer to obtain the importance of the input units. Algorithms in this category include LRP-ε, LRP-αβ, Deep Taylor, DeepLIFT Rescale, DeepLIFT RevealCancel, DeepShap, etc. The fundamental difference between these backpropagation algorithms lies in the rules they use to propagate importance from layer to layer.
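For intuition about how such layer-wise rules work, here is a minimal NumPy sketch of the LRP-ε rule for a single linear layer; the function name, shapes, and ε value are assumptions for the example, and a full implementation applies a rule of this kind to every layer, from the output back to the input.

```python
import numpy as np

def lrp_epsilon_linear(a, W, b, R_out, eps=1e-6):
    """Redistribute the relevance R_out of a linear layer's outputs to its inputs
    using the LRP-epsilon rule.
    a: (n_in,) input activations, W: (n_in, n_out) weights, b: (n_out,) biases,
    R_out: (n_out,) relevance assigned to the layer's outputs."""
    z = a @ W + b                              # pre-activations of the layer
    z = z + eps * np.where(z >= 0, 1.0, -1.0)  # epsilon stabilizer against division by ~0
    s = R_out / z                              # relevance per unit of pre-activation
    c = W @ s                                  # c_j = sum_k W[j, k] * s[k]
    return a * c                               # relevance of each input unit
```

The other rules (LRP-αβ, DeepLIFT Rescale, etc.) differ only in how this per-layer redistribution step is defined.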

3. Occlusion-Based Attribution Algorithms. These algorithms infer the importance of an input unit from the impact that occluding it has on the model output. For example, the Occlusion-1 (Occlusion-patch) algorithm models the importance of the i-th pixel (pixel block) as the change in output between occluding and not occluding pixel i, while all other pixels remain unoccluded. The Shapley value algorithm instead considers all possible occlusion patterns of the other pixels and models importance as the average change in output caused by pixel i across these occlusion patterns. It has been shown that the Shapley value is the only attribution method satisfying the linearity, dummy, symmetry, and efficiency axioms.
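The following Python sketch makes the two occlusion-based definitions concrete for a generic model f that takes a list of feature values; the baseline values used for “occlusion” and the toy model in the usage note below are illustrative assumptions.

```python
import itertools
import math

def occlusion_1(f, x, baseline, i):
    """Occlusion-1: output change when only feature i is replaced by its baseline
    value, with all other features kept at their original values."""
    x_occluded = list(x)
    x_occluded[i] = baseline[i]
    return f(x) - f(x_occluded)

def shapley_value(f, x, baseline, i):
    """Exact Shapley value of feature i, enumerating all occlusion patterns of the
    remaining features (feasible only for a handful of features)."""
    n = len(x)
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for size in range(n):
        for S in itertools.combinations(others, size):
            present = set(S)
            with_i = [x[j] if (j in present or j == i) else baseline[j] for j in range(n)]
            without_i = [x[j] if j in present else baseline[j] for j in range(n)]
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            phi += weight * (f(with_i) - f(without_i))
    return phi
```

For instance, for the toy model `f = lambda v: 2 * v[0] + 3 * v[0] * v[1]` with baseline `[0, 0]` and input `[1, 1]`, the Shapley values are 3.5 and 1.5, which sum to f(x) − f(baseline) = 5, illustrating the efficiency axiom.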

Unified Internal Mechanisms of 14 Empirical Attribution Algorithms

After in-depth study of various empirical attribution algorithms, we cannot help but ask: at a mathematical level, what problem is neural network attribution actually solving? Behind the numerous empirical attribution algorithms, is there a unified mathematical model and paradigm? To answer this question, we start from the definition of attribution. Attribution refers to the importance score/contribution of each input unit to the neural network output. Therefore, the key to the question above lies in (1) mathematically modeling how input units influence the network output, and (2) explaining how the various empirical attribution algorithms use this influence mechanism to design their importance attribution formulas.

Regarding the first key point, we discovered that each input unit typically influences the neural network output in two ways. On one hand, a specific input unit can independently act and influence the network output without relying on other input units; this type of influence is called the “independent effect”. On the other hand, an input unit needs to collaborate with other input units to form a certain pattern, thereby influencing the network output; this type of influence is called the “interaction effect”. We theoretically proved that the neural network output can be rigorously decomposed into the independent effects of different input variables and the interaction effects among different sets of input variables.

$$v(\boldsymbol{x}) \;=\; \sum_{i \in N} \phi(i) \;+\; \sum_{S \subseteq N,\, |S| \ge 2} I(S)$$

where $v(\boldsymbol{x})$ denotes the network output on input $\boldsymbol{x}$, $N$ denotes the set of all input units, $\phi(i)$ denotes the independent effect of the $i$-th input unit, and $I(S)$ denotes the interaction effect among the input units in the set $S$.
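To make the decomposition concrete, consider an illustrative toy model (not an example from the paper): for $f(x_1, x_2) = w_1 x_1 + w_2 x_2 + w_3 x_1 x_2$ evaluated against a zero baseline, the independent effects are $\phi(1) = w_1 x_1$ and $\phi(2) = w_2 x_2$, while the cross term $I(\{1, 2\}) = w_3 x_1 x_2$ is an interaction effect, since it contributes to the output only when both input units are present together.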

Regarding the second key point, we found that the intrinsic mechanisms of all 14 existing empirical attribution algorithms can be expressed as an allocation of the aforementioned independent effects and interaction effects, with different attribution algorithms allocating these effects to the input units in different proportions. Specifically, let $a_i$ denote the attribution score of the $i$-th input unit. We rigorously proved that the results of all 14 empirical attribution algorithms can be uniformly written in the following mathematical paradigm (i.e., a weighted sum of independent effects and interaction effects):

$$a_i \;=\; w_i\,\phi(i) \;+\; \sum_{S \subseteq N,\, |S| \ge 2} w_{i,S}\, I(S)$$

where $w_i$ reflects the proportion of the independent effect $\phi(i)$ of the $i$-th input unit that is allocated to the $i$-th input unit, and $w_{i,S}$ reflects the proportion of the interaction effect $I(S)$ among the input units in $S$ that is allocated to the $i$-th input unit. The “fundamental difference” among the various attribution algorithms lies in these allocation ratios.
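To make this paradigm concrete, the hypothetical Python sketch below assembles the attribution of unit i as such a weighted sum; the dictionary-based data structures are illustrative assumptions, not code or notation from the paper.

```python
def attribution(i, phi, interactions, w_indep, w_inter):
    """Weighted sum of independent and interaction effects allocated to unit i.
    phi:          dict unit -> independent effect phi(unit)
    interactions: dict frozenset S -> interaction effect I(S)
    w_indep:      dict unit -> ratio of its own independent effect it receives
    w_inter:      dict (unit, frozenset S) -> ratio of I(S) allocated to that unit"""
    a_i = w_indep.get(i, 0.0) * phi.get(i, 0.0)
    for S, I_S in interactions.items():
        a_i += w_inter.get((i, S), 0.0) * I_S
    return a_i
```

In this formulation, an attribution algorithm is fully specified by its choice of allocation ratios.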

Table 1 shows how the fourteen different attribution algorithms allocate independent effects and interactive effects.

Table 1. All fourteen attribution algorithms can be written as the above mathematical paradigm, i.e., a weighted sum of independent effects and interaction effects. The table entries are stated in terms of the Taylor independent effect and the Taylor interaction effect, finer-grained terms whose sums recover the independent effect $\phi(i)$ and the interaction effect $I(S)$, respectively.


Three Evaluation Criteria for Assessing the Reliability of Attribution Algorithms

In attribution explanation research, it is impossible to empirically evaluate the reliability of a specific attribution explanation algorithm due to the lack of ground truth for neural network attribution explanations. The fundamental flaw of “lack of objective evaluation criteria for the reliability of attribution explanation algorithms” has led to widespread criticism and skepticism in academia regarding the field of attribution explanation research.

However, the revelation of the common mechanism of attribution algorithms in this study enables us to fairly evaluate and compare the reliability of different attribution algorithms within the same theoretical framework. Specifically, we propose the following three evaluation criteria to assess whether a given attribution algorithm fairly and reasonably allocates independent effects and interactive effects.

(1) Criterion 1: Coverage of All Independent Effects and Interaction Effects in the Allocation Process. When we decompose the neural network output into independent effects and interaction effects, a reliable attribution algorithm should cover as many of these effects as possible during allocation. For example, when attributing the sentence “I’m not happy”, it should cover the independent effects of the three words I’m, not, and happy, as well as all possible interaction effects, such as I(I’m, not), I(I’m, happy), I(not, happy), and I(I’m, not, happy).

(2) Criterion 2: Avoid Allocating Independent Effects and Interactions to Unrelated Input Units. The independent effect of the i-th input unit should only be allocated to the i-th input unit and not to other input units. Similarly, the interactive effects among input units within set S should only be allocated to the input units within set S and not to those outside of set S (not involved in the interaction). For example, the interaction effect between not and happy should not be allocated to the word I’m.

(3) Criterion 3: Complete Allocation. Each independent effect (interaction effect) should be allocated completely to its corresponding input units. In other words, the attribution values that a given independent effect (interaction effect) contributes across all of its corresponding input units should sum exactly to the value of that effect. For example, the interaction effect I(not, happy) allocates part of its value to the word not and part to the word happy, so the allocation ratios should satisfy $w_{\text{not},\{\text{not},\text{happy}\}} + w_{\text{happy},\{\text{not},\text{happy}\}} = 1$. (A toy check of these three criteria is sketched below.)
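For intuition, the hypothetical sketch below checks the three criteria for an attribution method described by its allocation ratios, in the same illustrative dictionary notation used earlier; the paper carries out this analysis analytically rather than in code.

```python
def check_criteria(units, interactions, w_indep, w_inter, tol=1e-8):
    """units: list of unit ids; interactions: list of frozensets S with |S| >= 2
    w_indep[i]: ratio of unit i's independent effect allocated to unit i
    w_inter[(i, S)]: ratio of interaction effect I(S) allocated to unit i"""
    # Criterion 1: every independent effect and every interaction effect is covered.
    covers_all = (all(abs(w_indep.get(i, 0.0)) > tol for i in units)
                  and all(any(abs(w_inter.get((i, S), 0.0)) > tol for i in units)
                          for S in interactions))
    # Criterion 2: no interaction effect leaks to units outside its set S.
    no_leak = all(abs(w_inter.get((i, S), 0.0)) <= tol
                  for S in interactions for i in units if i not in S)
    # Criterion 3: every effect is allocated completely (ratios sum to 1).
    complete = (all(abs(w_indep.get(i, 0.0) - 1.0) <= tol for i in units)
                and all(abs(sum(w_inter.get((i, S), 0.0) for i in S) - 1.0) <= tol
                        for S in interactions))
    return covers_all, no_leak, complete
```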

Next, we use these three evaluation criteria to assess the above 14 attribution algorithms (as shown in Table 2). We find that Integrated Gradients, Expected Gradients, Shapley value, DeepShap, DeepLIFT Rescale, and DeepLIFT RevealCancel satisfy all three reliability criteria.

Table 2. Summary of whether the 14 different attribution algorithms meet the three reliability evaluation criteria.

Author Introduction

The author of this article, Deng Huiqi, is a PhD student in Applied Mathematics at Sun Yat-sen University. During her PhD, she visited the Department of Computer Science at Hong Kong Baptist University and Texas A&M University for research. Currently, she is conducting postdoctoral research in Professor Zhang Quanshi’s team. Her research focuses on trustworthy/interpretable machine learning, including explaining the attribution importance of deep neural networks and the expressive power of neural networks.

Deng Huiqi did a great deal of the work in the early stages; Professor Zhang only helped her reorganize the theory after the initial work was completed, making the proof methods and overall framework more coherent. Before graduation, Deng Huiqi had not published many papers, but after joining Professor Zhang’s team at the end of 2021, she completed three projects within a year: (1) discovering and theoretically explaining the common representation bottleneck of neural networks, proving that neural networks are less adept at modeling interactive representations of moderate complexity; this work was selected as an ICLR 2022 oral paper, ranking in the top five by reviewer scores (8, 8, 8, 10); (2) theoretically proving the tendencies of concept representations in Bayesian networks, providing a new perspective for explaining their classification performance, generalization ability, and adversarial robustness; (3) explaining, at a theoretical level, how neural networks learn interactive concepts of different complexities during training.

Further reading, “A Unified Explanation of 12 Algorithms to Enhance Adversarial Transferability”:

https://zhuanlan.zhihu.com/p/546433296
