
Hierarchical Attention-based Framework Introduction
Hierarchical Text Classification (HTC) is the task of predicting the label path of a text given a hierarchical label system (typically a tree or a directed acyclic graph), where along each path the parent-node label subsumes its child-node labels. Generally, there is at least one label at each level, which makes HTC a path-based multi-label classification task.
▲Tree-like Hierarchical Label System
In addition to the main framework description above, there are some optional components:
● Units can either share parameters or not.
● The results from the Units are local results for each layer, and there can also be a global result, which is combined for the final output.
● Generally, the order of Units is top-down, but a bottom-up sequence of Units can also be added for a bidirectional approach.
▲ Suppose labels A and B are predicted at the upper level and labels C and D at the lower level; if the correct paths are A->C and B->D, the question of how to decode valid paths arises
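To make the path-decoding issue concrete, here is a minimal, hypothetical sketch (not taken from any of the papers below) of greedy top-down decoding: a child label is kept only if its parent was also kept, so the output always forms valid root-to-leaf paths.

```python
# A minimal sketch (not from the papers) of greedy path decoding:
# given per-level label probabilities and a parent map, keep only child
# labels whose parent was also predicted, so every output forms a valid path.

def decode_paths(level_probs, parents, threshold=0.5):
    """level_probs: list of dicts {label: prob}, one dict per level (top-down).
    parents: dict mapping each non-root label to its parent label."""
    kept = set()
    paths = []
    for depth, probs in enumerate(level_probs):
        for label, p in probs.items():
            if p < threshold:
                continue
            if depth == 0 or parents.get(label) in kept:
                kept.add(label)
    # Reconstruct root-to-leaf paths from the kept labels.
    children = {}
    for child, parent in parents.items():
        if child in kept and parent in kept:
            children.setdefault(parent, []).append(child)

    def expand(label, prefix):
        prefix = prefix + [label]
        if label not in children:
            paths.append(prefix)
        else:
            for c in children[label]:
                expand(c, prefix)

    for label in level_probs[0]:
        if label in kept:
            expand(label, [])
    return paths

# Example: A->C and B->D are kept; a C predicted under B would be dropped.
probs = [{"A": 0.9, "B": 0.8}, {"C": 0.7, "D": 0.6}]
print(decode_paths(probs, parents={"C": "A", "D": "B"}))  # [['A', 'C'], ['B', 'D']]
```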
Paper: Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach
Paper link:
http://base.ustc.edu.cn/pdf/Wei-Huang-CIKM2019.pdf
Open-source code:
https://github.com/RandolphVI/Hierarchical-Multi-Label-Text-Classification
▲ Overall Structure of the Model
● Token Representation Vectors: given a text sequence of length N,
■ each token is first mapped to its word2vec embedding vector;
■ the embeddings then pass through a BiLSTM, and the forward and backward hidden states are concatenated to obtain the representation V for each token (dimension 2u), where u is the size of the BiLSTM's hidden state;
■ for the entire text, average pooling over the token vectors gives the overall text representation.
● Hierarchical Label Representation Vectors
■ Randomly generate an embedding for each label in the multi-layer label system.
■ Then assemble them into hierarchical category representation vectors S = {S^1, …, S^H} according to the label hierarchy, where H is the number of layers, S^i is the matrix of embeddings of all labels at the i-th level (shape |C^i| × d), |C^i| is the number of labels at the i-th level, and d is the embedding dimension. (A rough sketch of both representation steps follows below.)
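As a rough illustration of the two representation steps above, here is a PyTorch sketch with my own module and parameter names (the official HARNN code linked above is in TensorFlow): word2vec embeddings feed a BiLSTM whose forward and backward states are concatenated, the text vector comes from average pooling, and one randomly initialized embedding matrix is kept per level of the hierarchy.

```python
# A rough PyTorch sketch (module/variable names are my own, not the paper's)
# of the two representation steps: word2vec embeddings -> BiLSTM token vectors
# -> average-pooled text vector, plus randomly initialized label embeddings
# per level of the hierarchy.

import torch
import torch.nn as nn

class TextAndLabelEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_u, labels_per_level, label_dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)   # would be initialized from word2vec
        self.bilstm = nn.LSTM(emb_dim, hidden_u, batch_first=True, bidirectional=True)
        # One randomly initialized embedding matrix S^i per hierarchy level.
        self.label_embs = nn.ParameterList(
            [nn.Parameter(torch.randn(n_i, label_dim)) for n_i in labels_per_level]
        )

    def forward(self, token_ids):                # token_ids: (batch, N)
        x = self.word_emb(token_ids)             # (batch, N, emb_dim)
        V, _ = self.bilstm(x)                    # (batch, N, 2u): fwd/bwd states concatenated
        v_avg = V.mean(dim=1)                    # (batch, 2u): average pooling over tokens
        return V, v_avg, list(self.label_embs)

enc = TextAndLabelEncoder(vocab_size=10000, emb_dim=100, hidden_u=128,
                          labels_per_level=[9, 70], label_dim=256)
V, v_avg, S = enc(torch.randint(0, 10000, (2, 50)))
print(V.shape, v_avg.shape, S[0].shape)          # (2, 50, 256) (2, 256) (9, 256)
```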
Here, they consider using recurrent neural networks to model the relationships between levels. Below is an introduction to the Unit proposed in their RNN structure: Hierarchical Attention-based Memory (HAM)
▲HAM Unit
- Text-Category Attention (TCA): obtains the relationship between the text and the current level
- Class Prediction Module (CPM): based on the results of TCA, generates the representation for the current level and predicts the current categories
- Class Dependency Module (CDM): generates the information passed from the current layer to the next layer
The inputs to the HAM unit are:
- the text representation sequence
- the current layer's label representation
- the information passed down from the previous layer
In the above formula:
> Inputs:
● the information passed from the previous layer, i.e., the attention weight of each token carried over from the previous layer;
● the token representations;
● the current layer's label representation.
> Outputs:
● the text-label attention scores at the h-th layer;
● the text-label representation at the h-th layer, i.e., the text representation influenced by the h-th layer's labels.
CPM is responsible for prediction at the current layer.
CDM models the dependencies between different layers.
> Inputs:
● the text-label attention scores at the h-th layer (the output of TCA);
● the sigmoid value for each label at the h-th layer (the output of CPM).
> Output:
● the attention values for all tokens at the h-th layer, which are passed on to the next layer.
Next, we will detail the three components:
▲TCA Structure
● Finally, this is broadcast to obtain the information passed to the next layer, which is multiplied with the text representation in the next layer's TCA. (A simplified sketch of a full HAM step is given below.)
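Putting the three components together, below is a deliberately simplified sketch of one HAM step. The tensor names (V, S_h, omega) and the exact way attention, prediction, and dependency information are combined are my own assumptions, not the paper's equations.

```python
# A simplified sketch of one HAM step under my own naming assumptions
# (not the authors' exact equations): TCA attends text to the current level's
# labels, modulated by the previous level's token weights; CPM predicts the
# level's labels; CDM turns the predictions back into per-token weights for
# the next level.

import torch
import torch.nn as nn

class HAMUnit(nn.Module):
    def __init__(self, token_dim, label_dim, num_labels):
        super().__init__()
        self.proj = nn.Linear(token_dim, label_dim, bias=False)  # for TCA scores
        self.cpm = nn.Linear(2 * token_dim, num_labels)          # CPM classifier
        self.cdm = nn.Linear(num_labels, 1)                      # CDM -> per-token weight

    def forward(self, V, S_h, omega_prev):
        # TCA: weight tokens by the previous level's information, then score
        # each (token, label) pair.
        V_mod = V * omega_prev.unsqueeze(-1)                      # (batch, N, token_dim)
        att = torch.softmax(self.proj(V_mod) @ S_h.t(), dim=1)    # (batch, N, num_labels)
        V_att = (att.transpose(1, 2) @ V_mod).mean(dim=1)         # (batch, token_dim) label-aware text repr.
        # CPM: predict this level's labels from text + label-aware representations.
        r_h = torch.cat([V.mean(dim=1), V_att], dim=-1)           # (batch, 2*token_dim)
        P_h = torch.sigmoid(self.cpm(r_h))                        # (batch, num_labels)
        # CDM: combine attention scores with predictions into next-level token weights.
        omega_h = torch.sigmoid(self.cdm(att * P_h.unsqueeze(1))).squeeze(-1)  # (batch, N)
        return P_h, V_att, omega_h

ham = HAMUnit(token_dim=256, label_dim=256, num_labels=9)
V = torch.randn(2, 50, 256)                                       # BiLSTM token representations
P, rep, omega = ham(V, S_h=torch.randn(9, 256), omega_prev=torch.ones(2, 50))
print(P.shape, rep.shape, omega.shape)                            # (2, 9) (2, 256) (2, 50)
```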
Although the prediction results for each layer have been obtained in the above RNN structure, they believe this is a local result. Therefore, they put all the results together and predict again to generate a global result:
● All representations from each layer are combined, and then average pooling is performed across layers:
● Finally, the local and global results are merged as a weighted combination P = α · P_local + (1 − α) · P_global, with α = 0.5. They conducted experiments varying α and found that α = 0.5 performs significantly better than using only the global or only the local results.
▲ Experiment on Alpha
● The final loss is the sum of the local and global BCE losses plus an L2 regularization term (a short sketch follows).
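A short sketch of the merge and the loss under the assumptions above (all names are illustrative):

```python
# A short sketch (assumed names) of the local/global merge and the loss:
# P_final = alpha * P_local + (1 - alpha) * P_global, with BCE on both the
# local and global predictions plus L2 regularization on the parameters.

import torch
import torch.nn.functional as F

def harnn_style_loss(p_local, p_global, targets, params, alpha=0.5, l2=1e-4):
    p_final = alpha * p_local + (1 - alpha) * p_global
    loss = F.binary_cross_entropy(p_local, targets) \
         + F.binary_cross_entropy(p_global, targets) \
         + l2 * sum(p.pow(2).sum() for p in params)
    return p_final, loss
```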
The results are quite good, significantly better than the previous HMCN.
▲ 1) HARNN-LG does not consider the dependencies between different layers; 2) HARNN-LH does not consider the information from the entire multi-layer structure; 3) HARNN-GH does not consider Local information
Paper: Concept-Based Label Embedding via Dynamic Routing for Hierarchical Text Classification
Paper link:
https://aclanthology.org/2021.acl-long.388/
Open-source code:
https://github.com/wxpkanon/CLEDforHTC
▲ Model Structure
- Text Encoder: obtains token representations
- Concept-based Classifier Module: responsible for encoding and classification at each layer; this is the main module of the paper
- Concept Share Module: generates label representations based on concepts and the label hierarchy
- Label Embedding Attention Layer: generates the text representation with respect to this layer's labels
- Classifier: performs the classification
After obtaining the word embeddings, a CNN is used to extract n-gram features, and a BiGRU then extracts contextual features, yielding a representation for each token.
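A loose sketch of this encoder (the hyperparameters are my own choices, not the paper's): a 1D convolution over the word embeddings extracts n-gram features, and a BiGRU then adds context, producing one vector per token.

```python
# A loose PyTorch sketch of the text encoder described above (hyperparameters
# are my own): a 1D convolution extracts n-gram features from the word
# embeddings, then a BiGRU adds context, giving one vector per token.

import torch
import torch.nn as nn

class CnnBiGruEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, conv_dim=128, hidden=128, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel, padding=kernel // 2)
        self.bigru = nn.GRU(conv_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                     # (batch, |d|)
        x = self.emb(token_ids).transpose(1, 2)       # (batch, emb_dim, |d|) for Conv1d
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, |d|, conv_dim) n-gram features
        S, _ = self.bigru(x)                          # (batch, |d|, 2*hidden) token representations
        return S
```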
The input to this module is the token representation S of a text sequence of length |d| and the |C^i| label embeddings for the i-th layer. The specific steps are as follows:
● Calculate the cosine similarity matrix G between the text representation and the label embeddings, where G has dimensions |d| × |C^i|.
● Use a convolution kernel F of width k over G to aggregate, for the p-th word, the similarity features within a window of k surrounding words.
● Next, apply max pooling over the labels to obtain the maximum correlation of the p-th word with the i-th layer's labels (even if two labels are related to the p-th word, only the strongest one is needed). A sketch of these steps is given below.
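The three steps can be sketched roughly as follows; note that the learnable convolution kernel F is replaced here by a fixed averaging window, and the shapes and names are my assumptions.

```python
# A sketch (assumptions on shapes/names) of the attention steps listed above:
# cosine similarity between tokens and labels, a width-k aggregation along the
# word axis (a fixed averaging window standing in for the learnable kernel F),
# then a max over labels to get each word's relevance to this level.

import torch
import torch.nn.functional as F

def label_attention_weights(S, C_i, k=3):
    # S: (|d|, dim) token representations; C_i: (|C_i|, dim) label embeddings
    G = F.normalize(S, dim=-1) @ F.normalize(C_i, dim=-1).t()       # (|d|, |C_i|) cosine similarities
    # Aggregate along the word axis so position p sees a window of k words.
    G_conv = F.avg_pool1d(G.t().unsqueeze(0), kernel_size=k,
                          stride=1, padding=k // 2).squeeze(0).t()  # (|d|, |C_i|)
    # Max over labels: strongest relevance of word p to any level-i label.
    relevance = G_conv.max(dim=-1).values                           # (|d|,)
    return torch.softmax(relevance, dim=0)                          # attention weights over words

w = label_attention_weights(torch.randn(50, 256), torch.randn(7, 256))
print(w.shape)                                                      # torch.Size([50])
```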
● External Knowledge is obtained through the EK Encoder, and attention is applied between it and the text representation. Here, External Knowledge refers to the textual descriptions of the labels, and the EK Encoder simply averages the word embeddings of each label's description.
● The predicted soft label embedding is the sum of the EK embeddings of the labels predicted at the previous layer.
● Concept Share Module: produces the concept-based label embeddings for this layer, which are then attended with the text representation. How the concept representations are derived is introduced in the next section; a small sketch of the first two label-embedding sources follows below.
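A tiny illustrative sketch of the first two label-embedding sources (the thresholding of the previous layer's predictions is my own simplification):

```python
# A tiny sketch of the two label-embedding sources described above
# (illustrative only): the EK encoder averages the word embeddings of each
# label's textual description, and the soft label embedding sums the EK
# embeddings of the labels predicted at the previous level.

import torch

def ek_encode(desc_token_embs):
    """desc_token_embs: list of (len_j, dim) tensors, one per label description."""
    return torch.stack([t.mean(dim=0) for t in desc_token_embs])   # (|C_i|, dim)

def soft_label_embedding(prev_probs, prev_ek_embs, threshold=0.5):
    """Sum the EK embeddings of the labels predicted at the previous level.
    prev_probs: (|C_{i-1}|,) predicted probabilities; prev_ek_embs: (|C_{i-1}|, dim)."""
    mask = (prev_probs >= threshold).float()                        # which labels were predicted
    return (mask.unsqueeze(-1) * prev_ek_embs).sum(dim=0)           # (dim,)
```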
This section mainly introduces how the concept representations are derived. In CSM, concept embeddings are first obtained using a Concepts Encoder, and then concept-based label embeddings are derived via Dynamic Routing.
We know that a concept is represented by n keywords, and the method for obtaining the concept has been introduced above.
※2.2.1 Concepts Encoder
※2.2.2 Concepts Sharing via Dynamic Routing
They modified the dynamic routing mechanism from CapsNet so that routing happens only between a category (layer l) and its subcategories (layer l+1).
The general idea is that a particular subclass may only be related to a portion of the parent class’s concept words. Therefore, each concept word from the parent class is allowed to perform attention with the subclass, attempting to generate the subclass’s concept-based label embedding using only the relevant concept words from the parent class.
▲Dynamic Routing Steps
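The routing idea can be sketched as follows; the agreement update and normalization are simplified relative to CapsNet-style routing, and all names and shapes are my own.

```python
# A rough sketch of the routing idea described above (simplified, with my own
# shapes): each parent concept-word embedding routes to the child classes of
# that parent, and the routing coefficients are refined for a few iterations
# by agreement between concept words and the emerging child embeddings.

import torch
import torch.nn.functional as F

def route_concepts_to_children(concept_embs, num_children, iters=3):
    """concept_embs: (n_concepts, dim) embeddings of the parent's concept words.
    Returns (num_children, dim) concept-based child label embeddings."""
    n, dim = concept_embs.shape
    b = torch.zeros(n, num_children)                     # routing logits
    for _ in range(iters):
        c = torch.softmax(b, dim=-1)                     # (n, num_children) coupling coefficients
        child = c.t() @ concept_embs                     # (num_children, dim) weighted sums
        child = F.normalize(child, dim=-1)
        b = b + concept_embs @ child.t()                 # agreement update
    return child

children = route_concepts_to_children(torch.randn(10, 256), num_children=4)
print(children.shape)                                    # torch.Size([4, 256])
```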
The classifier’s input comes from the outputs of the three Attention Layers, which are:
● the text representation obtained via attention with the predicted soft label embedding passed from the previous layer;
● the text representation obtained via attention with the label embeddings built by averaging the word embeddings of the external knowledge (the class definitions);
● the text representation obtained via attention with the concept-based label embeddings.
All three inputs are concatenated, passed through a linear mapping, and a softmax is applied to obtain the final output (sketched below).
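A one-screen sketch of this classifier (names assumed):

```python
# A small sketch of the classifier described above (names assumed):
# concatenate the three attention outputs, apply a linear map, then softmax
# over this level's labels.

import torch
import torch.nn as nn

class LevelClassifier(nn.Module):
    def __init__(self, dim, num_labels):
        super().__init__()
        self.fc = nn.Linear(3 * dim, num_labels)

    def forward(self, v_prev, v_ek, v_concept):           # each (batch, dim)
        logits = self.fc(torch.cat([v_prev, v_ek, v_concept], dim=-1))
        return torch.softmax(logits, dim=-1)              # (batch, num_labels)
```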
▲ Much better than HARNN and also significantly better than HIAGM
It can be seen that CLED performs much better than previous models, achieving SOTA on these two datasets at the time. During discussions with peers, several points were noted:
● On WOS, other papers such as HIAGM report Micro-F1 (which equals accuracy in the multi-class setting) of 85%+ and Macro-F1 of around 79%. However, comparing the two papers reveals that the data splits differ considerably: CLED uses less training data but more test data. The HIAGM numbers reported here are presumably from the authors' re-run experiments.
● HARNN performs well on DBpedia but poorly on WOS. I suspect the data volume plays a role, since DBpedia has 278K training examples while WOS has only 28K. In addition, HARNN's label embeddings are randomly initialized, which I speculate may also have some influence.
- w/o CL: removes the concept-based label embedding
- w/o EK: removes the extra external-knowledge attention module
- w/o PRE: removes the information passed from the previous layer
- w/o reference in CSM: removes the initialization of the CSM module with EK
- w/o DR: removes dynamic routing
▲ Although removing different modules results in some degree of performance degradation, it can be seen that removing the input from the previous layer has the most significant impact.
Although removing each component causes some drop in performance, the largest drop comes from removing the PRE component, indicating that the results passed from the previous layer are quite important. Removing the concept-based label embedding proposed in this paper also hurts performance, but not the most. The concept idea itself seems appealing (it essentially compensates for a label hierarchy that is not refined enough in the dataset), yet its effect may be limited. It also cannot be ruled out that in some datasets the upper and lower layers are not closely related, in which case the concept hypothesis may not hold.
Finally, here are some thoughts:
● Why not add a step to generate the current Label embedding using the current label’s concept? It seems worth trying.
● If the label embeddings were further encoded with graph networks, TreeLSTM, etc., the label hierarchy structure could be incorporated more effectively (see HIAGM, HGCRL, etc.), which might further improve the results.