Is the Attention Mechanism Interpretable?

Author: Gu Yuxuan, Harbin Institute of Technology (SCIR)

References

NAACL 2019 “Attention is Not Explanation”

ACL 2019 “Is Attention Interpretable?”

EMNLP 2019 “Attention is Not Not Explanation”

This article will explore the interpretability of the attention mechanism.

Introduction

Since Bahdanau introduced Attention as soft alignment in neural machine translation in 2014, a large amount of natural language processing work has incorporated it as an important module to enhance model performance. Numerous experiments have shown that the Attention mechanism is computationally efficient and effective. This has led to discussions and research on its interpretability, as people hope to better understand its underlying mechanisms in order to optimize models; on the other hand, some scholars have raised doubts about it. As a prospective PhD student at SCIR, I have written this note based on my understanding of the Attention mechanism, hoping to inspire readers. Due to my personal limitations, readers are welcome to point out any errors in this article.

1 Attention Mechanism

1.1 Background

The Attention mechanism is currently one of the most commonly used methods in the field of natural language processing due to its significant performance enhancement on a range of tasks, especially in seq-to-seq models based on recurrent neural networks. Coupled with the extensive use of Google’s Transformer model, which is entirely based on Attention [1], and the BERT model [2], the Attention mechanism is almost a textbook technique. On one hand, Attention intuitively simulates the human behavior of focusing on certain keywords when understanding language, as Bahdanau [3] introduced it as soft alignment in neural machine translation. On the other hand, numerous experiments over the years have shown that Attention is indeed a feasible and efficient method to improve model performance. Therefore, further exploration of the intrinsic principles of this mechanism, explaining its effectiveness, and providing proof is a valuable research direction.

1.2 Structure

Despite the different implementation methods in various papers, the Attention mechanism fundamentally follows the same paradigm. It consists of three components:
  • Key corresponds to the value and is used to calculate similarity with the query as the basis for Attention selection.
  • Query is the query during a single execution of Attention.
  • Value is the data that is attended to and selected.
The corresponding formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \sum_i \alpha_i v_i, \qquad \alpha_i = \frac{\exp\!\big(\mathrm{sim}(Q, k_i)\big)}{\sum_j \exp\!\big(\mathrm{sim}(Q, k_j)\big)}$$
The value (Value) is usually the output of the previous layer and generally remains unchanged, while the other components, such as the key, the query, and the similarity function, are implemented differently in different papers. Here we first introduce the Attention structure implemented by Bahdanau [3] and Yang [7], since it serves as the foundation for the subsequent interpretability explorations by Serrano [4] and Jain [5].
The formula for Attention is as follows:
$$u_i = \tanh(W h_i + b), \qquad \alpha_i = \frac{\exp\!\left(u_i^{\top} u_w\right)}{\sum_j \exp\!\left(u_j^{\top} u_w\right)}, \qquad s = \sum_i \alpha_i h_i$$
$h_i$ (as Value) is the output tensor at the i-th position of the sequence from the previous layer; if the previous layer is a bidirectional RNN, then $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$. $u_i$ is the key (Key) at the i-th position, obtained from the value (Value) through a fully connected layer; $u_w$ is the query (Query) of this Attention layer, randomly initialized and updated along with training. If Attention is not an independent layer but is built on the decoder, then $u_w$ is related to the decoder output at the corresponding decoding position. The similarity function is the dot product, the resulting $\alpha_i$ are the Attention weights, and $s$ is the output of the Attention layer.
The core idea is as follows:
  1. Calculate a non-negative normalized weight for each input element.
  2. Multiply these weights by the corresponding component representations.
  3. Sum the resulting values to produce a fixed-length representation.
This is the original form of Attention, and experiments on its interpretability are also based on this model.
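To make this structure concrete, here is a minimal PyTorch sketch of the Yang-style Attention layer described above; the class name, tensor shapes, and the usage example at the end are illustrative assumptions rather than the original authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Yang-style attention: key u_i = tanh(W h_i + b), query u_w is a learned vector."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.key_proj = nn.Linear(hidden_dim, hidden_dim)   # produces keys from the values
        self.query = nn.Parameter(torch.randn(hidden_dim))  # randomly initialized query u_w

    def forward(self, h):                         # h: (batch, seq_len, hidden_dim), the values
        u = torch.tanh(self.key_proj(h))          # keys, same shape as h
        scores = u @ self.query                   # dot-product similarity: (batch, seq_len)
        alpha = F.softmax(scores, dim=-1)         # non-negative, normalized attention weights
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)  # weighted sum -> fixed-length representation
        return s, alpha

# Example: 2 sequences of length 5 with 128-dimensional (e.g. BiRNN) outputs
attn = AdditiveAttention(hidden_dim=128)
s, alpha = attn(torch.randn(2, 5, 128))
print(s.shape, alpha.shape)  # torch.Size([2, 128]) torch.Size([2, 5])
```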

2 Definition of Interpretability

There are various definitions of interpretability, and most related articles often begin with differences in their arguments, leading to different conclusions. However, there is some consensus that can be summarized.
The conceptual consensus on interpretability can be described as follows: if a model is interpretable, it is understandable to humans. This breaks down into two aspects: first, the model is transparent to humans [9], meaning that its parameters and decisions can be anticipated before it is trained on a specific task; second, after the model makes a decision, humans can understand the reasons behind that decision. There are also definitions from other perspectives, for example: interpretability means being able to manually reconstruct the model's decision-making process.
Specifically, for the interpretability of the Attention model, it can generally be refined to:
  • The magnitude of an Attention weight should be positively correlated with the importance of the information at the corresponding position.
  • Input units with high weights have a decisive effect on the output results.

3 Specific Arguments

3.1 Not Explanation

Regarding non-interpretability, Serrano [4] and Jain [5] each proposed experiments and arguments, and their work overlaps and complements each other. The former explores a relatively shallow level, asking only whether the Attention layer's weights are positively correlated with the importance of the corresponding input positions; it uses intermediate representation erasure, repeatedly nullifying some weights and observing how the model changes, and mainly explores different erasure schemes. The latter goes slightly deeper, examining the impact of removing key weights on the model and introducing the idea of constructing adversarial Attention weights to test how the model changes.

3.1.1 Intermediate Representation Erasure

Intermediate representation erasure considers a relatively shallow understanding of Attention, where the main logic is that more important weights have a greater impact on output results, and setting them to zero will directly affect the results.
First, we introduce the evaluation metrics, which measure how much the weights and output distributions change:
  • Total Variation Distance (TVD) as a metric for the difference between output distributions, defined as:
    $$\mathrm{TVD}(\hat{y}_1, \hat{y}_2) = \frac{1}{2} \sum_{i} \left| \hat{y}_{1i} - \hat{y}_{2i} \right|$$
    where $\hat{y}_1$ and $\hat{y}_2$ are two different output distributions.
  • Jensen-Shannon Divergence (JSD) as a metric for the difference between two distributions, applicable both to output distributions and to Attention weights, defined as:
    $$\mathrm{JSD}(\alpha_1, \alpha_2) = \frac{1}{2}\,\mathrm{KL}\!\left(\alpha_1 \,\middle\|\, \bar{\alpha}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(\alpha_2 \,\middle\|\, \bar{\alpha}\right), \qquad \bar{\alpha} = \frac{\alpha_1 + \alpha_2}{2},$$
    where $\alpha_1$ and $\alpha_2$ are two different distributions, which can be either output distributions or Attention weights, and
    $$\mathrm{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$$
    is the Kullback-Leibler Divergence (a code sketch of these metrics follows below).
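For reference, the two metrics (and the KL divergence they rely on) can be written as a short NumPy sketch; the clipping constant used to avoid log(0) is my own assumption, and logarithms are natural, so the JSD upper bound is ln 2 ≈ 0.69, as noted later in the text.

```python
import numpy as np

def tvd(p, q):
    """Total Variation Distance: 0.5 * sum |p_i - q_i|."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q), clipping to avoid log(0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float((p * np.log(p / q)).sum())

def jsd(p, q):
    """Jensen-Shannon Divergence: average KL of p and q to their mixture m."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two completely disjoint distributions approach the upper bound ln(2) ≈ 0.69
print(jsd([1.0, 0.0], [0.0, 1.0]))  # ≈ 0.693
```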
The specific implementation is shown in Figure 1. The model consists of two parts: the first is the embedding and encoding part, for example word embedding implemented with a fully connected layer followed by a bidirectional LSTM for encoding (the specific models used in the experiments vary). The second is the decoding part, where the output tensor obtained from the Attention layer is decoded into the result required by the specific task; for text classification this is a dimension transformation implemented by a fully connected layer. It is important to note that here the Attention layer does not act on the decoder as in seq-to-seq models; it is tested as an independent layer.
Figure 1. Overall model of intermediate representation erasure
The entire model runs twice: the first time with normal input and output, retaining the results obtained; the second time, the selected weights in the Attention layer are set to zero and the remaining weights are renormalized with Softmax, the subsequent computation continues to obtain results, and the TVD metric is calculated against the data from the first run. Erasing at the Attention layer rather than at the input side isolates its influence from the preceding encoding part. The purpose of renormalizing is that when some high weights are set to zero, the output tensor of the Attention layer tends toward zero, a situation the model has not encountered during training, which would lead to unpredictable decision behavior.
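A simplified sketch of this erase-and-renormalize step is given below; the helper names and the toy setup are assumptions, and renormalizing the remaining probabilities to sum to 1 is equivalent to re-applying Softmax over the remaining logits.

```python
import numpy as np

def erase_and_renormalize(alpha, zero_idx):
    """Zero out the selected Attention weights and renormalize the rest so they sum to 1
    (equivalent to re-applying Softmax over the remaining logits)."""
    alpha = np.asarray(alpha, dtype=float).copy()
    alpha[list(zero_idx)] = 0.0
    return alpha / alpha.sum()

def attention_output(alpha, H):
    """Recompute the Attention layer output s = sum_i alpha_i * h_i with the modified weights."""
    return np.asarray(alpha) @ np.asarray(H)

alpha = np.array([0.5, 0.3, 0.1, 0.1])
H = np.random.randn(4, 8)                                    # intermediate representations (seq_len, hidden)
alpha_erased = erase_and_renormalize(alpha, zero_idx=[0])    # erase the highest-weight position
print(alpha_erased, attention_output(alpha_erased, H).shape)
```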
Serrano [4] designed experiments using text classification tasks with four datasets, as shown in Figure 2, and also designed four models for comparative testing:
Figure 2. Four text classification datasets
  • Hierarchical Attention Network proposed by Yang [7], as shown on the left side of Figure 3, is divided into word-level and sentence-level parts. The experiment only tests the sentence-level Attention layer, while the previous parts are treated as the encoding phase.
  • Flat Attention Network, modified from the previous model, which only has a word-level Attention layer operating over the entire document without sentence segmentation.
  • Flat Attention Network with CNN replaces the bidirectional RNN structure with CNN as the encoder in the previous model, as shown on the right side of Figure 3, referencing Kim [8]’s implementation.
  • No Encoder does not use an encoder and directly feeds the word embeddings into the Attention layer. This control group eliminates the encoder, preventing individual tokens from obtaining contextual information.
Figure 3. Two structures of Hierarchical Attention Network using BiRNN and CNN as encoding parts
The experiments mainly have two modules: nullifying a single weight and nullifying a group of weights. The difference is that the former tests the change in the overall model output after erasing the intermediate representation corresponding to the highest weight, while the latter tests how many intermediate representations need to be erased, and in what order, to change the model's final decision, thus looking for interpretability evidence in the experimental data.
Single Attention Weight's Importance First, the intermediate representation corresponding to the highest Attention weight is erased. The normal result and the result after erasure are computed following Figure 1, and the JS divergence between the two output distributions, $\Delta JS_{i^*}$, is calculated, where $i^*$ is the position of the highest weight. To gauge how large this distance is, a random position $r$ is also selected and erased in the same way, yielding $\Delta JS_r$; the two can then be compared via $\Delta JS_{i^*} - \Delta JS_r$. Intuitively, if items with high weights are indeed more important, this difference should always be positive, and it should grow as the gap between the two erased weights, $\alpha_{i^*} - \alpha_r$, grows. The chart presented in the paper, shown on the left of Figure 4, has $\alpha_{i^*} - \alpha_r$ on the horizontal axis and $\Delta JS_{i^*} - \Delta JS_r$ on the vertical axis. The author reports that the data show almost no negative values, and the few that exist are close to zero; moreover, as the weight gap increases, the change in the model's output distribution also increases. However, the author notes that "even when the difference between the two weights reaches 0.4 (weights are normalized), the majority of positive values of $\Delta JS_{i^*} - \Delta JS_r$ are still very close to zero," and thus preliminarily suggests that the Attention mechanism exhibits counterintuitive behavior.
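Reusing the jsd, erase_and_renormalize, and attention_output helpers sketched above, and assuming a hypothetical model_decode function that maps the Attention output to an output distribution, the quantity plotted in Figure 4 could be computed roughly as follows:

```python
import numpy as np

def delta_js(alpha, H, idx, model_decode):
    """JS divergence between the normal output distribution and the one after erasing position idx."""
    base = model_decode(attention_output(alpha, H))
    erased = model_decode(attention_output(erase_and_renormalize(alpha, [idx]), H))
    return jsd(base, erased)

def highest_vs_random(alpha, H, model_decode, rng=np.random.default_rng(0)):
    """Return the (vertical, horizontal) coordinates of one point in Figure 4."""
    i_star = int(np.argmax(alpha))                                      # highest-weight position
    r = int(rng.choice([i for i in range(len(alpha)) if i != i_star]))  # a random other position
    vertical = delta_js(alpha, H, i_star, model_decode) - delta_js(alpha, H, r, model_decode)
    horizontal = float(alpha[i_star] - alpha[r])
    return vertical, horizontal
```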
However, the appendix of the paper reports results for the non-hierarchical model without an encoder, shown on the right of Figure 4, where the supposed problem of the change being too small is much less severe; under this set of data, the author's conclusion is therefore not rigorous. The paper hypothesizes that when the encoder processes the token at each position it mixes in contextual information, so the information has effectively been averaged: the encoder output at every token already contains, to some degree, the information that needs to be attended to, which weakens the influence of the Attention layer's weights. The paper does not, however, run experiments to verify this hypothesis.
Figure 4. Comparison of results based on hierarchical models using BiRNN and non-hierarchical models without encoders
Subsequently, Serrano [4] designed a set of experiments to strengthen the discussion of the single-nullification operation, using the probability of a model decision flip as the evaluation metric: the intermediate representations at the highest-weight position $i^*$ and at a randomly selected position $r$ are each erased, and whether the model's decision flips is recorded. The experimental data are shown in Figure 5, where the orange area indicates cases in which erasing $r$ causes a flip while erasing $i^*$ does not, and the blue area is the opposite. In most cases, however, the model decision does not flip at all, which means this experiment is not particularly informative; put differently, erasing a single intermediate representation does not affect the robustness of the Attention layer, especially when a context-dependent encoder is present.
Figure 5. Situation of model decision reversal
Jain [5] also designed a set of leave-one-out experiments, using a model framework similar to Serrano [4]'s but with more datasets, covering text classification, question answering, and natural language inference tasks; the datasets are shown in Figure 6, and the text classification tasks were converted into binary classification to simplify the model requirements.
Figure 6. Text classification, question answering, and natural language inference datasets
Jain [5] argues that the Attention weights learned by the model should agree with feature importance measures. The two measures used for comparison are (1) the set of TVD values between the model output after removing the intermediate representation at each position and the original output, the rationale being that removing an important intermediate representation should change the model's decision substantially and thus yield a large TVD, and (2) gradient values, which reflect what the model's decision focuses on. Computing the JSD between each measure and the Attention weights, converted into a correlation where 0 indicates no correlation and 1 complete consistency, gives the distributions shown in Figure 7, which displays only the gradient measure; the data for the other measure are similar, so they are not elaborated further. Generally speaking, merely erasing one intermediate representation cannot serve as a strong argument for or against the interpretability of Attention.
Figure 7. Correlation between Attention distribution and gradient distribution
The idea of Importance is to find a minimal set of intermediate representations whose erasure flips the model's decision. If the Attention weights at the corresponding positions are the largest ones, this indicates that ranking intermediate representations by Attention weight is a reasonable importance ordering. The author proposes the following verification procedure:
  1. Add weights to the nullification set in descending order of Attention weights.
  2. If the model’s decision reverses, stop expanding the set.
  3. Check if there is a proper subset that can also reverse the model’s decision.
However, the problem is that finding such a proper subset requires exponential time, which is practically infeasible, so the requirement is relaxed to finding some smaller set. The author designed three other ranking schemes to see whether, when erasing from largest to smallest, they flip the model's decision with fewer set elements than the ranking by Attention weights. The three ranking schemes are as follows:
  • Random sorting
  • Sorting by the size of gradient values backpropagated to the Attention layer
  • Sorting by the product of gradient values and Attention weights
These three candidate rankings are not intended to find a gold-standard erasure combination, but to serve as comparisons against the ranking given by the Attention weights. Additionally, some data items need to be discarded to avoid skewing the results, for example cases where the model's decision does not flip even when only one intermediate representation remains, or where the input sequence has length 1 and cannot be ranked. A sketch of the erase-until-flip procedure is given below.
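Reusing the helpers above (and the same hypothetical model_decode), the erase-until-flip procedure behind Figure 8 can be sketched as follows; this is a schematic reconstruction, not the paper's implementation.

```python
import numpy as np

def fraction_to_flip(alpha, H, order, model_decode):
    """Erase intermediate representations in the given order until the argmax decision flips.
    Returns the fraction of positions erased, or None if the decision never flips
    (such items are discarded, as described above)."""
    original = int(np.argmax(model_decode(attention_output(alpha, H))))
    erased = []
    for idx in order[:-1]:                       # always keep at least one representation
        erased.append(idx)
        new_alpha = erase_and_renormalize(alpha, erased)
        if int(np.argmax(model_decode(attention_output(new_alpha, H)))) != original:
            return len(erased) / len(alpha)
    return None

# The orderings compared in Figure 8, given gradients g w.r.t. the Attention layer:
# np.argsort(-alpha), np.argsort(-np.abs(g)), np.argsort(-(np.abs(g) * alpha)), and a random permutation.
```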

The experimental results are shown in Figure 8, where each box represents the range of the data points and the horizontal line inside it marks the mean; if no line is marked, the mean is close to 1. In most cases random ranking performs far worse than the other three methods, and while the ranges of the other three overlap, in terms of the mean the product of gradient values and Attention weights generally performs best, followed by ranking by gradient values, with ranking by Attention weights performing worst.
Figure 8. Proportion of Attention weights nullified to reverse model decisions under different sorting rules
The author's conclusion is that Attention does not necessarily provide the best description of the model's decision behavior: using Attention weights as a basis is effective but not optimal. The issue, however, is that the link between nullifying Attention weights and flipping model decisions is itself only an intuitive hypothesis, not a strong prior or axiom.

3.1.2 Constructing Adversarial Weights

Jain [5] believes that changes in Attention weights should correspondingly affect the model’s predicted results. This idea is similar to the reversal of model decisions, but the difference is that here the aim is to deceive the model into making the same decision by constructing a set of counterintuitive weight distributions, thereby proving the unreliability of the Attention mechanism.
The adversarial effect the author aims to construct is shown in Figure 9, which displays a negative review example. On the left side are the original weights learned by the model, with the maximum weight corresponding to the word waste, followed by asking. The classifier gives a result of 0.01 (0 indicates negative, 1 indicates positive, and 0.5 indicates neutral), which coincides with the human judgment of the sentence’s emotional polarity. On the right side are the constructed adversarial weights, with the maximum weight corresponding to the word was, followed by myself. This counterintuitive Attention method can still yield the same result from the classifier.
Figure 9. Text sentiment analysis example
The author designed two construction methods for experimentation. The first method is to randomly shuffle the positions of the original Attention weights, while the second is to design a target function to train adversarial weights that yield a model result distribution similar to the original while differing as much as possible in Attention weight distribution.
Attention Permutation involves randomly shuffling the Attention weights and observing how the model output changes, using the TVD metric to evaluate the degree of change. Because of the randomness of the process and limits on computation, each sample is tested with only 100 randomly shuffled weight groups; the TVD is computed for each and the median is used for evaluation. Figure 10 is a heatmap whose horizontal axis is the median change in model output and whose vertical axis is the maximum of the original Attention weights, with colors distinguishing classes; by conditioning on the range of the maximum Attention weight, the author aims to check whether only a few features are needed to explain an output. Treating the SST dataset as the standard, the author concludes that shuffling Attention has little impact on the model's output. This conclusion deserves debate, however: the author simply attributes inconsistent results to the datasets, for instance claiming that in the Diabetes dataset only a few key features decisively affect the result, while in fact different and even completely opposite results appear across the datasets. Choosing the dataset that matches one's expectations as the standard, without thorough investigation and experimentation, is not advisable. A sketch of the permutation test follows.
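Again reusing the earlier helpers and the hypothetical model_decode, the per-instance statistic of this permutation test could be computed roughly as:

```python
import numpy as np

def permutation_median_tvd(alpha, H, model_decode, n_perm=100, rng=np.random.default_rng(0)):
    """Median TVD between the original output and the outputs obtained from randomly permuted weights."""
    base = model_decode(attention_output(alpha, H))
    scores = [tvd(base, model_decode(attention_output(rng.permutation(alpha), H)))
              for _ in range(n_perm)]
    return float(np.median(scores))   # Figure 10 plots this value against max(alpha) per instance
```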
Figure 10. Changes in model output results after randomly shuffling weights
Adversarial Attention The key experiment is a method for constructing adversarial weights: explicitly finding weight distributions that are as different as possible from the original Attention weights while keeping the model's predictions effectively unchanged. The mathematical expression of the construction method is as follows:
$$\max_{\alpha^{(1)},\dots,\alpha^{(k)}} f\big(\{\alpha^{(i)}\}\big) \quad \text{s.t.} \quad \forall i:\ \mathrm{TVD}\big(\hat{y}^{(i)}, \hat{y}\big) \le \epsilon,$$
$$f\big(\{\alpha^{(i)}\}\big) = \frac{1}{k}\sum_{i} \mathrm{JSD}\big(\alpha^{(i)}, \hat{\alpha}\big) + \frac{1}{k(k-1)}\sum_{i<j} \mathrm{JSD}\big(\alpha^{(i)}, \alpha^{(j)}\big)$$
where $\epsilon$ is a preset small value: the decision made with a constructed weight distribution may differ from the decision made with the original weights by a TVD of at most $\epsilon$; $\epsilon$ is 0.01 for the sentiment classification tasks and 0.05 for the question answering tasks (because the output space of question answering is larger, a smaller perturbation already produces a larger TVD distance). $\hat{\alpha}$ denotes the original Attention weights and $\hat{y}$ the model output obtained with them; $\alpha^{(i)}$ is the i-th constructed weight distribution and $\hat{y}^{(i)}$ the model output obtained with it. The goal of the optimization is that each constructed weight distribution keeps the model output within a TVD of $\epsilon$ of the original, while the JS distance between each constructed distribution and the original weights, and among the constructed distributions themselves, is as large as possible. In the practical implementation the objective is relaxed to $f\big(\{\alpha^{(i)}\}\big) - \lambda \sum_i \max\!\big(0, \mathrm{TVD}(\hat{y}^{(i)}, \hat{y}) - \epsilon\big)$, where $\lambda$ is a hyperparameter set to 500 during training.
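A rough PyTorch sketch of this relaxed objective is shown below; the decode function (mapping candidate weight distributions to output distributions), the parameterization of the candidates as a softmax over free logits, and the exact penalty form are my assumptions about the implementation.

```python
import torch
import torch.nn.functional as F

def tvd_t(p, q):
    """Total Variation Distance along the last dimension."""
    return 0.5 * (p - q).abs().sum(dim=-1)

def jsd_t(p, q, eps=1e-12):
    """Jensen-Shannon Divergence along the last dimension."""
    kl = lambda a, b: (a * (a.clamp_min(eps) / b.clamp_min(eps)).log()).sum(dim=-1)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adversarial_objective(logits, alpha_orig, y_orig, decode, eps=0.01, lam=500.0):
    """Relaxed objective: push k candidate weight distributions away from the original (and from
    each other) in JS distance, while penalizing any output TVD that exceeds eps."""
    alphas = F.softmax(logits, dim=-1)                      # (k, seq_len) candidate distributions
    k = alphas.shape[0]
    js_to_orig = jsd_t(alphas, alpha_orig.expand_as(alphas)).mean()
    pair_terms = [jsd_t(alphas[i], alphas[j]) for i in range(k) for j in range(i + 1, k)]
    js_pairwise = torch.stack(pair_terms).mean() if pair_terms else torch.tensor(0.0)
    outputs = decode(alphas)                                # (k, n_classes), model outputs per candidate
    penalty = F.relu(tvd_t(outputs, y_orig.expand_as(outputs)) - eps).mean()
    return -(js_to_orig + js_pairwise - lam * penalty)      # negate so a standard optimizer can minimize
```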
Figure 11 shows the maximum JS distance between the constructed weight distributions and the original weight distributions on the various datasets, with the vertical axis giving the corresponding proportion of instances. Since the experiment only selected a portion of samples from each dataset, the columns of the histogram may not sum to 1; color is used to distinguish the classification results. It is noteworthy that the JSD metric has an upper bound: two completely different distributions reach a JSD of at most 0.69. The data show that, for every dataset, a large number of weight distributions can be constructed whose JS distance from the original Attention weights is close to this upper bound while the change in the output stays within $\epsilon$. This indicates that Attention weights can easily be constructed to be completely counterintuitive without affecting the results, which the author takes as key evidence of the Attention mechanism's non-interpretability.
Figure 11. Histogram of JS distance distributions that satisfy constraints
The author continues to consider the relationship between the maximum Attention weight value (aiming to consider the impact of feature effectiveness, that is, the model focuses more on features with larger maximum weights) and the maximum JS distance of the adversarial weights that can be constructed. Intuitively, if the original Attention weight distribution is steeper, it should be more difficult to generate effective (large JSD) adversarial weights. As shown in Figure 12, this is also a heatmap where the horizontal coordinate is the maximum JS distance and the vertical coordinate is the maximum original Attention weight. The author believes that although there is indeed a trend as previously stated, it is undeniable that there are indeed many examples in each dataset where high original Attention weights can still construct adversarial weights with large JS distances. This means that the assertion that a specific set of features can completely influence the result is misleading.
Figure 12. Relationship between JS distance and maximum original Attention weight distribution
To summarize the above conclusions, Serrano [4] designed multiple experiments around intermediate representation erasure to observe changes in model decisions, concluding that "Attention does not necessarily correspond to importance," because in the experiments Attention often fails to identify the intermediate representations most important to the model's decision. Additionally, an important but unverified hypothesis is that "attention magnitudes do seem more helpful in uncontextualized cases," which suggests that a context-dependent encoder may be what makes the Attention mechanism hard to interpret, although the author did not pursue this further.
Jain [5], on the other hand, raised two questions regarding the contribution of Attention to the model’s transparency [9]:
  1. How much correlation is there between Attention weights and the importance metrics of features?
  2. Do different weight distributions necessarily lead to different model decisions?
The answers provided are as follows:
  1. "Only weakly and inconsistently": the correlation between Attention weights and the importance measures (gradient distributions, TVD distributions after removing individual intermediate representations) is weak and unstable.
  2. "It is very possible to construct adversarial attention distributions that yield effectively equivalent predictions": it is easy to construct adversarial weights that focus on features completely different from those of the original weights, and moreover random weight assignments often do not significantly affect the model's decisions.

3.2 Not Not Explanation

This section, titled Not Not Explanation, is a rebuttal to the previous arguments for the non-interpretability of the Attention mechanism rather than a proof that Attention is interpretable. It mainly covers Wiegreffe [6]'s rebuttal to Jain [5]. Wiegreffe [6] offers two reasons:
  • Attention Distribution is not a Primitive. The distribution of Attention weights does not exist independently; due to the entire process of forward and backward propagation, the parameters of the Attention layer cannot be isolated from the entire model, otherwise, it loses practical significance.
  • Existence does not Entail Exclusivity. The existence of Attention distributions does not imply exclusivity; Attention merely provides an explanation, not the only explanation. Especially when the dimensionality of the intermediate representations is large and the number of output classes is small, the dimensionality-reducing mapping has considerable freedom.
The author designed four sets of control experiments for the rebuttal, summarized in Figure 13. The right side shows the model used, which is consistent with those introduced earlier (an embedding layer, an encoding layer, and an independent Attention layer), and the task is again text classification. The left side marks the scope of each experiment; J&W refers to Jain [5]'s experiments, which only modify the Attention weights. Note that in the figure, Attention Parameters refers to the trained parameters of the Attention layer, from which the final α weights are computed, and §3.2 etc. refer to the sections of Wiegreffe [6]'s paper, with the brackets indicating the layers involved in each experiment. From right to left, the four experiments are:
  1. Uniform as Adversary: training a baseline model in which the Attention weights are frozen to a uniform average.
  2. Variance within a Model: re-initializing and retraining the model with different random seeds, giving a normal baseline for how much the Attention weight distribution deviates.
  3. Diagnosing Attention Distribution by Guiding Simpler Models: implementing a diagnostic framework that uses fixed, pre-trained Attention weights.
  4. Training an Adversary: designing an end-to-end adversarial weight training method; note that this is not an independent experiment but the concrete implementation of the adversarial weight generation used in the previous experiment.
Figure 13. Model results and control experiments
Uniform as Adversary This experiment verifies whether the Attention mechanism is actually necessary on each dataset. If a task is so simple that it does not need Attention to perform well, its data are not persuasive; conversely, if Attention is necessary, removing it will cause a clear drop in performance. The experimental design is straightforward: the Attention weights are set to a uniform average and frozen, and only the remaining parts are trained, so all intermediate representations from the encoding layer are simply averaged to compute the output of the Attention layer. The resulting F1 scores for text classification are shown in Figure 14, where the leftmost column is the result reported by Jain [5], the middle is Wiegreffe [6]'s replication, and the rightmost is the result of this experiment. Intuitively, the performance gain from Attention is not significant on most datasets, and on the last two there is almost no gain at all. The author summarizes this as "Attention is not explanation if you don't need it." A minimal sketch of the uniform baseline follows.
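The uniform baseline essentially replaces the learned Attention layer with a frozen averaging layer; here is a minimal PyTorch sketch (the mask handling is my own addition).

```python
import torch
import torch.nn as nn

class UniformAttention(nn.Module):
    """Baseline Attention layer: every (unmasked) position gets the same weight and nothing is trained."""
    def forward(self, h, mask=None):                         # h: (batch, seq_len, hidden_dim)
        batch, seq_len, _ = h.shape
        alpha = h.new_full((batch, seq_len), 1.0 / seq_len)  # frozen uniform weights
        if mask is not None:                                 # optionally ignore padding positions
            alpha = alpha * mask
            alpha = alpha / alpha.sum(dim=1, keepdim=True)
        return (alpha.unsqueeze(-1) * h).sum(dim=1), alpha   # simple average of the encoder outputs
```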
Figure 14. F1 metrics for text classification tasks
Variance within a Model The author re-initializes and retrains the model with different random seeds, obtaining different Attention weights that are treated as normal, i.e., perturbed only by noise rather than by human intervention. Computing the JSD between each of these weight distributions and the original weights yields a normal baseline; weights whose distance exceeds this baseline are considered abnormal. Figure 15 is a set of comparison plots whose horizontal axis is the JS distance and whose vertical axis is the range of the maximum Attention weight: panels (a) to (d) are baselines generated with random seeds on the various datasets, while (e) and (f) are the adversarial weights from Jain [5]'s experiments. On the SST dataset, the generated adversarial weights have a JSD from the original weights that far exceeds the baseline, which indicates that the artificially constructed adversarial data are indeed abnormal, far beyond the level of noise. This result is consistent with the earlier phenomenon: the smaller the role Attention plays, the more unpredictable its weight distribution becomes. Up to this point, however, the discussion remains theoretical, without concrete data describing the constructed Attention weights.
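Once attention distributions from several reseeded runs have been collected, the baseline itself is just a set of per-instance JS distances, for example (reusing the jsd sketch from earlier):

```python
import numpy as np

def seed_variance_baseline(base_alphas, seed_alphas_list):
    """Per-instance JS distances between the base model's weights and each reseeded model's weights.
    base_alphas: list of per-instance weight arrays; seed_alphas_list: one such list per random seed."""
    distances = [jsd(a, b)                                   # reuses the jsd() sketch from Section 3.1.1
                 for seed_alphas in seed_alphas_list
                 for a, b in zip(base_alphas, seed_alphas)]
    return np.asarray(distances)   # adversarial weights with JSD far above this range count as abnormal
```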
Figure 15. Comparison of weight distributions based on random seed initialization against adversarial weight distributions to the original distributions’ JS distances
Diagnosing Attention Distribution by Guiding Simpler Models This model differs from the previous ones: to test the Attention weight distribution more precisely and eliminate the influence of context-dependent structures, the LSTM encoding layer used earlier is removed and a simple MLP completes the affine transformation from the embedding layer to the classification result, directly using preset, untrained Attention weights to compute the model's decision, as shown in Figure 16 (a model sketch follows the figure). The author designed four control settings on this basis:
  1. Setting the preset Attention weights to the mean as a baseline.
  2. Not fixing the weights, allowing them to be trained along with the MLP.
  3. Using the original Attention weights used when the LSTM was the encoder.
  4. Using adversarially generated Attention weights (implemented in the next section).
Figure 16. Model using preset Attention weights
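A minimal sketch of this diagnostic setup follows; the class name, dimensions, and MLP depth are assumptions, and the four control settings correspond to different choices of preset_alpha.

```python
import torch
import torch.nn as nn

class FrozenAttentionDiagnostic(nn.Module):
    """Diagnostic model: word embeddings pooled with preset, non-trainable Attention weights,
    followed by an MLP head; no contextual encoder (LSTM) in between."""
    def __init__(self, vocab_size, emb_dim, n_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.Tanh(),
                                 nn.Linear(emb_dim, n_classes))

    def forward(self, tokens, preset_alpha):        # tokens: (batch, seq_len); preset_alpha: (batch, seq_len)
        e = self.emb(tokens)                        # (batch, seq_len, emb_dim)
        pooled = (preset_alpha.detach().unsqueeze(-1) * e).sum(dim=1)  # frozen weights: no gradient into them
        return self.mlp(pooled)                     # class scores

# preset_alpha can be uniform weights, weights transferred from the LSTM model, trained jointly
# with the MLP (by not detaching), or the adversarially generated weights of the next section.
```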
The results of the four control settings are shown in Figure 17. Replacing the weights trained with the MLP by the original Attention weights improves the results, and on the datasets where the first experiment showed the Attention weights to be meaningful, the adversarial weights perform clearly worse than the original weights. This indicates that the importance information encoded by the Attention mechanism is not tied to a single model; it can be transferred between models trained on the same dataset, whereas the constructed adversarial weights do not extract additional information from the data but merely exploit specific vulnerabilities of the model (cases where the model's capacity exceeds what the task requires).
Figure 17. F1 metrics obtained from four control experiments
Train an Adversary This section details the model for generating adversarial weights. The main difference from Jain [5]'s proposal lies in the loss function:
$$\mathcal{L}\big(\mathcal{M}_{adv}, \mathcal{M}_{base}\big)^{(i)} = \mathrm{TVD}\big(\hat{y}^{(i)}_{adv}, \hat{y}^{(i)}_{base}\big) - \lambda\,\mathrm{KL}\big(\alpha^{(i)}_{adv} \,\|\, \alpha^{(i)}_{base}\big)$$
where $\mathcal{M}_{base}$ is the model using the original weights and $\mathcal{M}_{adv}$ is the model using adversarial weights, so the adversary is trained end to end to keep its predictions close to the base model while pushing its Attention distribution away from it. As Figure 18 shows, good adversarial weights can indeed be obtained, but the previous experiment has already demonstrated that such weights merely exploit vulnerabilities in the model rather than uncovering more important information.
Figure 18
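Under the reconstruction of the loss given above, a per-batch version could be sketched as follows; the KL direction, the λ default, and the epsilon clipping are my assumptions, and y_base / alpha_base are treated as fixed targets from the pretrained model.

```python
import torch

def adversary_loss(y_adv, y_base, alpha_adv, alpha_base, lam=1.0, eps=1e-12):
    """Per-batch adversarial training loss: keep the predictions close to the base model (small TVD)
    while pushing the attention distribution away from it (large KL term, hence the minus sign)."""
    tvd_term = 0.5 * (y_adv - y_base).abs().sum(dim=-1)
    kl_term = (alpha_adv * (alpha_adv.clamp_min(eps) / alpha_base.clamp_min(eps)).log()).sum(dim=-1)
    return (tvd_term - lam * kl_term).mean()
```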

4 Conclusion

The interpretability of the Attention mechanism is currently a hot topic. Many authors have designed experiments to prove or disprove it, yet there are always incomplete aspects, such as issues with the datasets or the model designs, and sometimes even disagreement about the definition of interpretability itself; still, step-by-step exploration is necessary. Whether it is Serrano [4]'s intermediate representation erasure experiments or Jain [5]'s adversarial weight generation experiments, the ultimate goal is to find examples in which the Attention weights do not accurately express the importance of tokens. Both perspectives are limited to the Attention weights themselves and overlook factors that strongly influence the results, such as whether the dataset needs the Attention mechanism at all and how a context-dependent encoder shapes the Attention weights, any of which may cause the experiments to fail. Wiegreffe [6] effectively rebutted the flaws in Jain [5]'s experimental design, but the reliability of the method for testing the transfer of Attention weights remains debatable.
In addition, many questions have been raised but not explored:
  • What types of tasks require the Attention mechanism, and is it universal?
  • What impact does the choice of encoding layer have on Attention?
  • Is it reasonable to transfer Attention weights between different models on the same dataset?
  • How does the complexity of the model and dataset affect the ease of constructing adversarial weights?
Moreover, Serrano [4] also raised some directions for future work: current research is still limited to importance defined via the argmax, that is, everyone only considers the impact of Attention weights on the model's final decision. This is not the whole picture, since there may also be features that decrease the probability of a certain class appearing; every item in the Softmax output is worth studying, not just the maximum-probability item taken as the result.
Finally, we look forward to what new developments ACL 2020, whose submission deadline is in December, will bring us.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, pp. 5998-6008.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.
[4] Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
[5] Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
[6] Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
[7] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[8] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[9] Zachary C Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.

This issue’s editor: Zhang Weinan

This issue’s editor: Li Zhaopeng


Chief Editor: Che Wanxiang

Deputy Editors: Zhang Weinan, Ding Xiao

Executive Editor: Li Jiaqi

Editors: Zhang Weinan, Ding Xiao, Cui Yiming, Li Zhongyang

Editors: Lai Yongkui, Li Zhaopeng, Feng Zixian, Wang Ruoke, Gu Yuxuan


