Author: Yang Chongyang, Harbin Institute of Technology SCIR
1. Introduction
Traditional natural language processing methods, including rule-based methods, decision tree models, hidden Markov models, and logistic regression, are interpretable and are often referred to as white-box techniques. In recent years, deep learning models that use language embeddings as features (black-box techniques) have become dominant; while they substantially improve performance in many cases, they also make models hard to interpret. Users struggle to understand how the data leads to a given result, which causes a range of problems, such as weakening the interaction between users and systems (e.g., chatbots and recommendation systems). The machine learning community has increasingly recognized the importance of interpretability, giving rise to the emerging field of Explainable Artificial Intelligence (XAI). Because interpretability has been defined in many different ways, related articles often argue from different premises. Here we focus on the interpretability that explainable AI provides to users about how a model arrives at its results, known as the outcome explanation problem[1]. In explainable AI, explanations help users build trust in NLP-based AI systems. This article discusses the classification of explanations following previous surveys[2], introduces techniques that can provide explanations, and briefly describes each technique along with its representative papers.
2. Classification of Explanations
Explanations can be classified in two main ways[1][3].
2.1 Local vs Global Explanations
Local explanations provide explanations for the prediction results of specific inputs, while global explanations reveal the process of how predictions are made, independent of specific inputs.
2.2 Self-Explaining vs Post-Hoc Explanations
In addition to the local/global distinction, explanations can also be classified by whether they are produced as part of the prediction process or require post-processing of the prediction results, i.e., self-explaining versus post-hoc explanations. Self-explaining methods, also known as directly interpretable methods[4], use information generated during the prediction process as the explanation, so the explanation is produced together with the model's prediction. Decision trees and rule-based models are global self-explaining models, while methods such as attention mechanisms are local self-explaining models. Post-hoc explanation methods require additional operations after the prediction is made; for example, LIME explains a complex model with a simple model, and because fitting the simple model is an extra step performed after the complex model has produced its prediction, LIME is a local post-hoc explanation method.
Table 1 Classification of Explanations
          | Self-Explaining                     | Post-Hoc
Local     | attention mechanisms                | surrogate models such as LIME
Global    | decision trees, rule-based models   | global surrogate models
3. Explainable Techniques
This section surveys five explainable techniques, each of which provides explanations to users through a different mechanism.
3.1 Feature Importance
The main idea is to derive explanations by studying the extent to which different features influence the final prediction results. These methods can be built on different types of features, such as manually extracted features using feature engineering[6], lexical features like words/tokens and n-grams[7][8], or latent features learned by neural networks[9]. Attention mechanisms[10] and first-derivative saliency[11] are two widely used explainable techniques based on feature importance. Text-based features are inherently easier for humans to interpret than general features, which is why attention-based methods are widely used in NLP.
3.2 Surrogate Models
Model predictions can be explained by learning a second, interpretable model as a surrogate. A typical example is LIME[5], which uses a technique called input perturbation to learn a surrogate model that approximates the original model around a given input. Surrogate-model methods are model-agnostic and can be used to produce local[12] or global[13] explanations. However, the learned surrogate may arrive at its predictions through a mechanism entirely different from that of the original model, which raises doubts about whether surrogate-based explanations are faithful to the original model.
3.3 Example-Driven Explanations
This method explains the prediction for an input instance by identifying and presenting other labeled instances that are semantically similar to it. Such methods are conceptually similar to nearest-neighbor approaches[14] and have been applied to various NLP tasks, such as text classification[15] and question answering[16].
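As a hedged illustration (a toy example, not taken from the cited papers), the sketch below retrieves the labeled training instances closest to an input in embedding space and presents them as the explanation; the embeddings, labels, and texts are random placeholders.

```python
# Example-driven explanation sketch: explain a prediction by showing the
# nearest labeled training examples in embedding space (all data is synthetic).
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 16

train_vecs = rng.normal(size=(100, emb_dim))        # hypothetical sentence embeddings
train_labels = rng.integers(0, 2, size=100)          # hypothetical labels
train_texts = [f"training sentence #{i}" for i in range(100)]

def explain_by_examples(query_vec, k=3):
    """Return the k most similar labeled training examples as the explanation."""
    sims = train_vecs @ query_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    top = np.argsort(-sims)[:k]                       # indices of the k nearest neighbors
    return [(train_texts[i], int(train_labels[i]), float(sims[i])) for i in top]

query = rng.normal(size=emb_dim)                      # embedding of the input sentence
for text, label, sim in explain_by_examples(query):
    print(f"{text}  (label={label}, similarity={sim:.2f})")
```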
3.4 Provenance-Based Explanations
This is an intuitive and effective explainability technique that shows part or all of the derivation process of predictions when the final prediction results from a series of reasoning steps. Many works in the question-answering field have adopted this explanatory approach[16][17][18].
3.5 Declarative Induction
This method directly uses human-readable and understandable representations, such as rules[19], tree structures[6], and programs[20] as explanations.
4. Operations for Implementing Explanations
4.1 First-Derivative Saliency
The magnitude (absolute value) of the derivative indicates how sensitive the final result is to a change in a specific input dimension, i.e., how much a specific dimension of that word contributes to the final decision. The saliency score is given by the following formula:
$S(e) = \left| \dfrac{\partial S_c(e)}{\partial e} \right|$

where $e$ is the embedding of an input word (or one of its dimensions) and $S_c(e)$ is the model's score for the predicted class $c$.
In the three models compared (a standard recurrent network, an LSTM, and a bidirectional LSTM), "hate" receives the highest saliency score, dwarfing the influence of the other tokens. The LSTM focuses on "hate" more clearly than the standard recurrent model, and the bidirectional LSTM shows the most pronounced effect, with almost no influence from words other than "hate". This may be because the gating mechanisms in LSTMs and Bi-LSTMs control the flow of information, allowing these architectures to filter out less relevant information more effectively. First-derivative saliency is also widely used with neural network models because it can be computed simply by invoking the automatic differentiation provided by deep learning frameworks. Recent work has proposed improvements to first-derivative saliency[21]. As its name and definition suggest, first-derivative saliency provides interpretability in terms of feature importance, especially for token-level features.
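As a hedged illustration of this recipe (not code from the cited papers), the following sketch computes first-derivative saliency for a toy classifier using PyTorch autograd; the embedding layer, linear "encoder", and token ids are placeholders standing in for a real trained model.

```python
# First-derivative saliency sketch: differentiate the predicted-class score
# with respect to the input embeddings and take the magnitude per token.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, num_classes = 100, 16, 2

embedding = nn.Embedding(vocab_size, emb_dim)
classifier = nn.Linear(emb_dim, num_classes)     # toy stand-in for an RNN/LSTM encoder

token_ids = torch.tensor([[5, 17, 42, 3]])       # one sentence of four tokens
embs = embedding(token_ids)                      # shape (1, 4, emb_dim)
embs.retain_grad()                               # keep gradients w.r.t. the embeddings

logits = classifier(embs.mean(dim=1))            # shape (1, num_classes)
pred = int(logits.argmax(dim=1))                 # predicted class index
logits[0, pred].backward()                       # autodiff gives d(score)/d(embeddings)

# Saliency per token: absolute gradient, aggregated over the embedding dimensions.
saliency = embs.grad.abs().sum(dim=-1).squeeze(0)    # shape (4,)
print(saliency)   # larger value = token the predicted score is more sensitive to
```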
4.2 Layer-Wise Relevance Propagation
Layer-wise relevance propagation (LRP) starts from the model's output score for the predicted class and redistributes it backwards through the network, layer by layer, until every input feature has been assigned a relevance score; the total relevance is (approximately) conserved from one layer to the next. Like first-derivative saliency, LRP explains neural models in terms of feature importance, and evaluations of such explanation methods on NLP models include [22].
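As a hedged illustration, the sketch below applies the standard epsilon rule of LRP to a tiny two-layer ReLU network; the weights are random stand-ins, not a trained NLP model.

```python
# LRP (epsilon rule) sketch on a toy two-layer ReLU network with random weights.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # input features (e.g., pooled token features)
W1, b1 = rng.normal(size=(4, 6)), np.zeros(6)
W2, b2 = rng.normal(size=(6, 2)), np.zeros(2)

# Forward pass of the toy network.
a1 = np.maximum(0.0, x @ W1 + b1)       # hidden ReLU activations
out = a1 @ W2 + b2                      # class scores
c = int(out.argmax())                   # predicted class

def lrp_epsilon(a_prev, W, R_next, eps=1e-6):
    """Redistribute the relevance R_next of one layer onto its inputs a_prev."""
    z = a_prev[:, None] * W                               # each input's contribution
    s = z.sum(axis=0)
    denom = s + eps * np.where(s >= 0, 1.0, -1.0)         # epsilon stabilizer
    return (z / denom) @ R_next                           # relevance per input unit

# Start from the predicted class score and propagate it back to the input.
R_out = np.zeros_like(out)
R_out[c] = out[c]
R_hidden = lrp_epsilon(a1, W2, R_out)
R_input = lrp_epsilon(x, W1, R_hidden)

print(R_input)                    # relevance of each input feature
print(R_input.sum(), out[c])      # total relevance is approximately conserved
```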
4.3 Input Perturbations
Taking LIME[5] as an example, input perturbation explains the output for an input x by generating random perturbations of x and training an interpretable model (usually a linear model) on them. The main idea of LIME is to explain how the original model makes a particular prediction by training a simpler model that imitates the original model around that input; this simpler model (the surrogate) is then used to explain the original model's prediction. The method does not look at the model's internal mechanisms at all. Instead, it perturbs the input in various ways (e.g., removing different words), observes how the black-box model's outputs change, and trains a simple, interpretable model on these input-output observations. Since the surrogate is a simple linear classifier that predicts by assigning a weight to each word, inspecting the learned weights shows how each word affects the final prediction.

As exemplified by LIME, such operations are commonly used in surrogate models[5][12].
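A minimal, hypothetical sketch of this recipe follows: perturb the input by dropping words, query a stand-in black box, and fit a linear surrogate whose per-word weights serve as the local explanation. Real LIME additionally weights the perturbed samples by their proximity to the original input; that detail is omitted here.

```python
# LIME-style input perturbation sketch (the black box and sentence are made up).
import numpy as np

rng = np.random.default_rng(0)
tokens = ["I", "really", "hate", "this", "movie"]

def black_box(words):
    """Stand-in for the original model: returns P(negative) for a list of words."""
    return 0.9 if "hate" in words else 0.2

# 1. Sample binary masks: which words are kept in each perturbed input.
n_samples = 500
masks = rng.integers(0, 2, size=(n_samples, len(tokens)))
masks[0] = 1                                        # include the original sentence

# 2. Query the black box on every perturbed input.
y = np.array([black_box([t for t, m in zip(tokens, mask) if m]) for mask in masks])

# 3. Fit a linear surrogate (ridge regression via the normal equations).
X = np.hstack([masks, np.ones((n_samples, 1))])     # keep/drop indicators + bias
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)

# Per-word weights of the surrogate = local feature importances.
for tok, weight in zip(tokens, w[:-1]):
    print(f"{tok:>6s}: {weight:+.3f}")              # "hate" gets by far the largest weight
```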
4.4 Attention Mechanism
The attention mechanism[10] is one of the most commonly adopted explainable methods in natural language processing, and it also significantly improves model performance across a range of tasks, especially in sequence-to-sequence models built on recurrent neural networks. The attention mechanism mimics how humans focus on certain key words when understanding language, and it does so by assigning weights to the inputs. Prior work has added attention layers to neural networks precisely to obtain feature-importance interpretability[8][9]. However, the extent to which attention actually provides interpretability is not well understood; the note "Attention! Is the attention mechanism interpretable?" discusses the interpretability of attention in detail.
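As a hedged illustration (a toy example, not taken from the cited papers), the sketch below computes attention weights over hypothetical encoder states with a simple scoring vector and reads the weights out as token importances.

```python
# Attention-as-explanation sketch: softmax weights over token states are inspected
# directly as importance scores (encoder states and parameters are random stand-ins).
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 8
tokens = ["the", "movie", "was", "great"]
H = rng.normal(size=(len(tokens), emb_dim))   # hypothetical encoder states, one per token

w = rng.normal(size=emb_dim)                  # scoring vector of the attention layer
scores = H @ w                                # one unnormalized score per token
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                          # softmax -> attention weights, sum to 1

context = alpha @ H                           # weighted sum that would feed the classifier

# The weights themselves are the (local, self-explaining) explanation:
for tok, a in sorted(zip(tokens, alpha), key=lambda p: -p[1]):
    print(f"{tok:>6s}: {a:.3f}")
```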
4.5 Explainability-Aware Architecture Design
This approach designs neural network architectures that mimic the human problem-solving process: components that simulate human cognition are built into the network, making the learned model (at least partially) interpretable. Such architectures have been used for math word problems such as MathQA[18][20] and for sentence simplification[24], and the design can also be applied to explainable surrogate models. Take MathQA as an example. Test-takers are usually required to give a step-by-step derivation of how they reached an answer; if key steps are missing, they are unlikely to receive full credit. The MathQA model tries to reproduce this derivation process through an explainable architectural design. MathQA is a large English dataset of multiple-choice math word problems whose training data is annotated through crowdsourcing. For each problem, annotators repeatedly perform the following steps: first select an operation (addition, subtraction, multiplication, division, squaring, etc.), then label the arguments of that operation, taken from the problem statement or from previous intermediate results, until the resulting sequence of operations yields the answer. The model trained on this data is an encoder-decoder architecture that encodes the problem statement and decodes it into a sequence of such operations, from which the entire solution process can be read off.
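As a hypothetical sketch (the operation names and the example problem are invented, not taken from the MathQA dataset), the snippet below executes a decoded operation sequence step by step, so that the sequence itself doubles as a human-readable derivation.

```python
# Executing an operation sequence so the trace serves as the explanation.
from typing import List, Tuple, Union

Arg = Union[float, str]          # a literal number or a reference like "#0"

OPS = {
    "add":      lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide":   lambda a, b: a / b,
}

def execute(program: List[Tuple[str, Arg, Arg]]) -> float:
    """Run a sequence of (operation, arg1, arg2) steps; "#i" refers to step i's result."""
    results: List[float] = []
    for op, a, b in program:
        resolve = lambda v: results[int(v[1:])] if isinstance(v, str) else v
        value = OPS[op](resolve(a), resolve(b))
        results.append(value)
        print(f"step {len(results) - 1}: {op}({a}, {b}) = {value}")
    return results[-1]

# "A worker earns 20 per hour and works 8 hours; after spending 50, what is left?"
program = [("multiply", 20, 8), ("subtract", "#0", 50)]
answer = execute(program)        # the printed trace is the step-by-step derivation
print("answer:", answer)
```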

5. Conclusion
This article has discussed the main classifications of explanations (local vs. global, self-explaining vs. post-hoc), common techniques for achieving explainability, and the concrete operations behind those techniques. There are also other explainability operations, such as using reinforcement learning to learn simple negation rules[19]. Explainable AI still faces many open issues, including clearer definitions of the field's core concepts, a more explicit understanding of what interpretability means, and how to connect explanations to users' actual goals. We look forward to future work that advances this field and improves the interaction between users and AI.
References
[1]Guidotti R, Monreale A, Ruggieri S, et al. A Survey of Methods for Explaining Black Box Models[J]. ACM Comput. Surv., New York, NY, USA: Association for Computing Machinery, 2018, 51(5).
[2]Danilevsky M, Qian K, Aharonov R, et al. A Survey of the State of Explainable AI for Natural Language Processing[J]. 2020(Section 5).
[3]Tjoa E, Guan C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI[J]. IEEE Transactions on Neural Networks and Learning Systems, 2020: 1–21.
[4]Arya V, Bellamy R K E, Chen P-Y,et al. One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques[J]. 2019.
[5]Ribeiro M T, Singh S, Guestrin C. “Why should I trust you?” Explaining the predictions of any classifier[A]. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining[C]. 2016: 1135–1144.
[6]Voskarides N, Meij E, Tsagkias M, et al. Learning to explain entity relationships in knowledge graphs[J]. ACL-IJCNLP 2015 – 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference, 2015, 1: 564–574.
[7]Godin F, Demuynck K, Dambre J, et al. Explaining Character-Aware Neural Networks for Word-Level Prediction: Do They Discover Linguistic Rules?[J]. 2018.
[8]Mullenbach J, Wiegreffe S, Duke J, et al. Explainable Prediction of Medical Codes from Clinical Text[J]. 2018.
[9]Xie Q, Ma X, Dai Z, et al. An Interpretable Knowledge Transfer Model for Knowledge Base Completion[J]. 2017.
[10]Bahdanau D, Cho K H, Bengio Y. Neural machine translation by jointly learning to align and translate[A]. 3rd International Conference on Learning Representations, ICLR 2015[C]. 2015.
[11]Li J, Chen X, Hovy E H, et al. Visualizing and Understanding Neural Models in NLP[A]. HLT-NAACL[C]. 2016.
[12]Alvarez-Melis D, Jaakkola T. A causal framework for explaining the predictions of black-box sequence-to-sequence models[A]. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing[C]. 2017: 412–421.
[13]Liu N, Huang X, Li J, et al. On interpretation of network embedding via taxonomy induction[A]. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining[C]. 2018: 1812–1820.
[14]Dudani S A. The distance-weighted k-nearest-neighbor rule[J]. IEEE Transactions on Systems, Man, and Cybernetics, IEEE, 1976(4): 325–327.
[15]Croce D, Rossini D, Basili R. Auditing deep learning processes through kernel-based explanatory models[A]. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)[C]. 2019: 4028–4037.
[16]Abujabal A, Roy R S, Yahya M, et al. Quint: Interpretable question answering over knowledge bases[A]. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations[C]. 2017: 61–66.
[17]Zhou M, Huang M, Zhu X. An interpretable reasoning network for multi-relation question answering[J]. arXiv preprint arXiv:1801.04726, 2018.
[18]Amini A, Gabriel S, Lin S, et al. MathQA: Towards interpretable math word problem solving with operation-based formalisms[J]. arXiv, 2019: 2357–2367.
[19]Pröllochs N, Feuerriegel S, Neumann D. Learning interpretable negation rules via weak supervision at document level: A reinforcement learning approach[A]. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies[C]. Association for Computational Linguistics, 2019, 1: 407–413.
[20]Ling W, Yogatama D, Dyer C, et al. Program induction by rationale generation: Learning to solve and explain algebraic word problems[J]. arXiv preprint arXiv:1705.04146, 2017.
[21]Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks[A]. International Conference on Machine Learning[C]. PMLR, 2017: 3319–3328.
[22]Poerner N, Roth B, Schütze H. Evaluating neural network explanation methods using hybrid documents and morphological agreement[J]. arXiv e-prints, 2018: arXiv-1801.
[23]Croce D, Rossini D, Basili R. Explaining non-linear classifier decisions within kernel-based deep architectures[A]. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP[C]. 2018: 16–24.
[24]Dong Y, Li Z, Rezagholizadeh M, et al. EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing[A]. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics[C]. 2019: 3393–3402.