1. The purpose of this library is to explore methods for NLP text classification using deep learning.
2. It has various benchmark models for text classification.
3. It also supports multi-label classification, where multiple labels are associated with sentences or documents.
Although many of these models are quite simple and may not give you top results on this text classification task, some of them are true classics and are well suited as baseline models.
Each model has a test function in its model class.
We also explored text classification using two seq2seq models: the seq2seq model with attention, and the Transformer (“Attention Is All You Need”). Both of these models can also be used for sequence generation and other tasks. If your task is multi-label classification, you can transform the problem into sequence generation.
We implemented a memory network: the Recurrent Entity Network (“Tracking the State of the World”). It uses blocks of key-value pairs as memory and runs them in parallel to obtain new states. It can be used for question answering with a context (or history). For example, you can have the model read some sentences (as context) and pose a question (as a query), then ask the model to predict the answer; if you provide the story itself as the query, it can also perform classification tasks.
If you want to learn more about text classification, or about the dataset of the task these models can be applied to, you can follow this link; we have selected one: https://biendata.com/competition/zhihu/
1. fastText
2. TextCNN
3. TextRNN
4. RCNN
5. Hierarchical Attention Network
6. seq2seq model with attention
7. Transformer (“Attention Is All You Need”)
8. Dynamic Memory Network
9. Entity Network: Tracking the State of the World
1. BiLstm Text Relation
2. Two CNN Text Relation
3. BiLstm Text Relation Two RNN
(Multi-label prediction task; the top 5 labels must be predicted; 3 million training examples; full score: 0.5)
Note: “HierAtteNetwork” refers to Hierarchical Attention Network
1. The model is in xxx_model.py
2. Run python xxx_train.py to train the model
3. Run python xxx_predict.py for inference (testing).
Each model has a test method in its model class. You can run this test method first to check whether the model works properly.
Python 2.7 + TensorFlow 1.1
(TensorFlow 1.2 also works; most models should also work well with other TensorFlow versions, since we use very few version-specific features. If you are using Python 3.5, it will work as well, as long as you change the print statements and try/except blocks.)
Some utility functions are in data_util.py. A typical input line is: “x1 x2 x3 x4 x5 label 323434”, where “x1, x2, ...” are words and “323434” is the label. It also has a function that loads pre-trained word vectors (trained with word2vec or fastText) and assigns them to the model’s embeddings.
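To make the format concrete, here is a minimal, hypothetical sketch of parsing one such line (the real logic lives in data_util.py and may differ):

```python
# Hypothetical sketch of parsing a line like "x1 x2 x3 x4 x5 label 323434";
# the actual loading code is in data_util.py and may differ.
def parse_line(line):
    tokens = line.strip().split()
    sep = tokens.index("label")      # everything before "label" is the input words
    words = tokens[:sep]
    label_id = int(tokens[sep + 1])  # everything after it is the label id
    return words, label_id

words, label_id = parse_line("x1 x2 x3 x4 x5 label 323434")
print(words)      # ['x1', 'x2', 'x3', 'x4', 'x5']
print(label_id)   # 323434
```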


Implementation of the paper “Bag of Tricks for Efficient Text Classification” (https://arxiv.org/abs/1607.01759)
1. Use bi-gram or tri-gram.
2. Use NCE loss to accelerate the softmax computation (instead of the hierarchical softmax of the original paper). Result: performance is as good as in the original paper, and training is also very fast (a sketch follows below).
See: p5_fastTextB_model.py
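As a rough illustration of the NCE-loss trick, here is a TensorFlow 1.x sketch with assumed (not the repo’s) hyperparameters; see p5_fastTextB_model.py for the real code:

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode sketch

batch_size, embed_size, num_classes, num_sampled = 32, 100, 1999, 10  # illustrative sizes

# fastText-style sentence vector: average of word / bi-gram embeddings
sentence_vec = tf.placeholder(tf.float32, [batch_size, embed_size])
labels = tf.placeholder(tf.int64, [batch_size, 1])

# output layer parameters shared by the NCE loss and by inference
w_out = tf.get_variable("w_out", [num_classes, embed_size])
b_out = tf.get_variable("b_out", [num_classes])

# sampled NCE loss instead of a full (or hierarchical) softmax over all labels
loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=w_out, biases=b_out, labels=labels, inputs=sentence_vec,
    num_sampled=num_sampled, num_classes=num_classes))

# at inference time, use the full softmax: softmax(sentence_vec * w_out^T + b_out)
logits = tf.matmul(sentence_vec, w_out, transpose_b=True) + b_out
```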


Implementation of the paper “Convolutional Neural Networks for Sentence Classification”. Structure: Embedding --> conv --> max pooling --> fully connected layer --> softmax
See: p7_TextCNN_model.py
To achieve very good results with TextCNN, you also need to carefully read the paper “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”, which offers insights into the settings that affect performance. Of course, you also need to adapt certain settings to your specific task.
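A minimal single-filter-size sketch of that pipeline in TensorFlow 1.x (sizes are illustrative, not the repo’s settings; see p7_TextCNN_model.py for the full model):

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode sketch

seq_len, vocab_size, embed_size = 50, 10000, 128        # illustrative sizes
filter_size, num_filters, num_classes = 3, 64, 10

x = tf.placeholder(tf.int32, [None, seq_len])
embedding = tf.get_variable("embedding", [vocab_size, embed_size])
emb = tf.expand_dims(tf.nn.embedding_lookup(embedding, x), -1)   # [batch, seq, embed, 1]

# convolution over windows of filter_size words, then max-pooling over time
filt = tf.get_variable("filter", [filter_size, embed_size, 1, num_filters])
conv = tf.nn.relu(tf.nn.conv2d(emb, filt, strides=[1, 1, 1, 1], padding="VALID"))
feature = tf.reshape(tf.reduce_max(conv, axis=1), [-1, num_filters])

# fully connected layer producing class logits (softmax is applied in the loss)
logits = tf.layers.dense(feature, num_classes)
```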


Structure: Embedding --> Bidirectional LSTM --> Concat Output --> Average --> softmax
See: p8_TextRNN_model.py
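A rough TensorFlow 1.x sketch of this structure (illustrative sizes; see p8_TextRNN_model.py for the actual model):

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode sketch

seq_len, vocab_size, embed_size, hidden_size, num_classes = 50, 10000, 128, 100, 10

x = tf.placeholder(tf.int32, [None, seq_len])
embedding = tf.get_variable("embedding", [vocab_size, embed_size])
emb = tf.nn.embedding_lookup(embedding, x)                       # [batch, seq, embed]

# bidirectional LSTM over the word embeddings
cell_fw = tf.contrib.rnn.BasicLSTMCell(hidden_size)
cell_bw = tf.contrib.rnn.BasicLSTMCell(hidden_size)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, emb, dtype=tf.float32)

# concat forward/backward outputs, average over time, project to the labels
concat = tf.concat([out_fw, out_bw], axis=2)                     # [batch, seq, 2*hidden]
avg = tf.reduce_mean(concat, axis=1)                             # [batch, 2*hidden]
logits = tf.layers.dense(avg, num_classes)
```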


The structure is the same as TextRNN, but the input is specially designed. For example, the input is “How much is the computer? EOS What is the stock price of the laptop?”, where “EOS” is a special marker separating question 1 and question 2.
See: p9_BiLstmTextRelation_model.py


Structure: First, use two different convolutions to extract features from two sentences, then concatenate the two features, use a linear transformation layer to project outputs to the target labels, and finally use softmax.
See: p9_twoCNNTextRelation_model.py


Structure: One bidirectional LSTM for the first sentence (output1), another bidirectional LSTM for the second sentence (output2). Then: softmax(output1 · M · output2), where M is a learned interaction matrix.
See: p9_BiLstmTextRelationTwoRNN_model.py
For more details, you can visit: “Deep Learning for Chatbots”, Part 2 – Implementing a Retrieval-Based Model in Tensorflow
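The scoring step can be sketched as follows (assumed sizes; the sigmoid follows the dual-encoder article above, while a multi-class variant would use per-class logits plus softmax):

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode sketch

hidden_size = 100   # per-direction LSTM size; illustrative only

# final BiLSTM representations of the two sentences (the encoders are omitted here)
output1 = tf.placeholder(tf.float32, [None, 2 * hidden_size])
output2 = tf.placeholder(tf.float32, [None, 2 * hidden_size])

# learned interaction matrix M; score_i = output1_i^T * M * output2_i
M = tf.get_variable("M", [2 * hidden_size, 2 * hidden_size])
score = tf.reduce_sum(tf.matmul(output1, M) * output2, axis=1)
prob = tf.sigmoid(score)   # binary matching as in the dual-encoder article;
                           # for several relation labels, use per-class logits + softmax
```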


Implementation of the paper “Recurrent Convolutional Neural Network for Text Classification”.
Structure: 1) Recurrent structure (convolution layer) 2) Max pooling 3) Fully connected layer + softmax
It learns the representation of each word in the sentence or document using left-side text and right-side text:
Representation of the current word = [left_side_context_vector, current_word_embedding, right_side_context_vector].
For the left-side context it uses a recurrent structure: a nonlinear transformation of the previous word’s embedding and the previous left-side context vector; the right-side context is computed symmetrically.
See: p71_TextRCNN_model.py
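A small numpy sketch of the left/right context recurrence described above (toy sizes and hypothetical weight names; see p71_TextRCNN_model.py for the real model):

```python
import numpy as np

embed_dim, ctx_dim = 4, 3                 # toy sizes
rng = np.random.RandomState(0)
E = rng.randn(5, embed_dim)               # embeddings of a 5-word sentence
W_l, W_sl = rng.randn(ctx_dim, ctx_dim), rng.randn(ctx_dim, embed_dim)
W_r, W_sr = rng.randn(ctx_dim, ctx_dim), rng.randn(ctx_dim, embed_dim)

# left context: nonlinear transform of the previous left context and the previous word
c_l = [np.zeros(ctx_dim)]
for i in range(1, len(E)):
    c_l.append(np.tanh(np.dot(W_l, c_l[i - 1]) + np.dot(W_sl, E[i - 1])))

# right context: the same recurrence, scanning from the right
c_r = [np.zeros(ctx_dim) for _ in range(len(E))]
for i in range(len(E) - 2, -1, -1):
    c_r[i] = np.tanh(np.dot(W_r, c_r[i + 1]) + np.dot(W_sr, E[i + 1]))

# representation of word i = [left context, word embedding, right context]
x = [np.concatenate([c_l[i], E[i], c_r[i]]) for i in range(len(E))]
print(x[2].shape)   # (ctx_dim + embed_dim + ctx_dim,) = (10,)
```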


Implementation of the paper “Hierarchical Attention Networks for Document Classification”.
1. Structure:
1. Embedding
2. Word Encoder: Word-level bidirectional GRU for rich word representation
3. Word Attention: Word-level attention to capture important information in each sentence
4. Sentence Encoder: Sentence-level bidirectional GRU for rich sentence representation
5. Sentence Attention: Sentence-level attention to capture key sentences in the document
6. FC + Softmax
2. Data Input:
In general, the input to this model should be several sentences, not just one sentence. The format is [None, sentence_length], where None indicates the batch_size.
In my training data, each sample has four parts of the same length. I concatenate the four parts into a single sequence. The model splits this sequence back into four parts, forming a tensor of shape [None, num_sentence, sentence_length], where num_sentence is the number of sentences (4 in my setup).
See: p1_HierarchicalAttention_model.py
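For illustration, a TensorFlow 1.x sketch of the word-level attention step; the sentence-level attention repeats the same pattern over sentence vectors. Sizes and variable names here are assumptions; see p1_HierarchicalAttention_model.py for the real code:

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode sketch

sentence_length, hidden_size = 25, 100   # illustrative sizes

# word-level encoder outputs of one sentence: [batch, sentence_length, 2*hidden]
h = tf.placeholder(tf.float32, [None, sentence_length, 2 * hidden_size])

# word attention: u_it = tanh(W h_it + b), alpha_it = softmax(u_it . u_w)
W = tf.get_variable("W_attn", [2 * hidden_size, 2 * hidden_size])
b = tf.get_variable("b_attn", [2 * hidden_size])
u_w = tf.get_variable("u_w", [2 * hidden_size, 1])      # word-level context vector

u = tf.tanh(tf.matmul(tf.reshape(h, [-1, 2 * hidden_size]), W) + b)
alpha = tf.nn.softmax(tf.reshape(tf.matmul(u, u_w), [-1, sentence_length]))

# sentence vector = attention-weighted sum of the word-level hidden states
sentence_vec = tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)   # [batch, 2*hidden]
```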


The implementation of the Seq2seq model with attention is based on the paper “Neural Machine Translation by Jointly Learning to Align and Translate”.
1. Structure:
1) Embedding
2) Bi-GRU obtains a rich representation of the source sentence (forward and backward).
3) Attention-based decoder.
2. Data Input:
The model uses the following inputs:
1) Encoder input, which is a sentence;
2) Decoder input, which is a fixed-length list of labels;
3) Target labels, which are also a list of labels.
For example, if the labels are “L1 L2 L3 L4”, then the decoder input will be [_GO, L1, L2, L3, L4, _PAD] and the target labels will be [L1, L2, L3, L4, _END, _PAD]. The length is fixed at 6; any excess labels are truncated, and if there are not enough labels to fill the sequence, it is padded.
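A tiny, hypothetical sketch of how such decoder inputs and target labels can be built (the repo’s preprocessing may differ):

```python
GO, END, PAD = "_GO", "_END", "_PAD"
MAX_LEN = 6   # fixed decoder length used in the example above

def build_decoder_io(labels, max_len=MAX_LEN):
    labels = labels[:max_len - 1]                      # truncate excess labels
    dec_input = [GO] + labels
    target = labels + [END]
    dec_input += [PAD] * (max_len - len(dec_input))    # pad if not enough labels
    target += [PAD] * (max_len - len(target))
    return dec_input, target

dec_input, target = build_decoder_io(["L1", "L2", "L3", "L4"])
print(dec_input)   # ['_GO', 'L1', 'L2', 'L3', 'L4', '_PAD']
print(target)      # ['L1', 'L2', 'L3', 'L4', '_END', '_PAD']
```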
3. Attention Mechanism:
1. Take the list of encoder hidden states and the current hidden state of the decoder.
2. Calculate the similarity between the decoder hidden state and each encoder hidden state to obtain a probability distribution over encoder positions.
3. Compute a weighted sum of the encoder hidden states based on this probability distribution.
Feed this weighted sum together with the decoder input through the RNN cell to obtain the new hidden state.
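The three steps above in a minimal numpy sketch (dot-product similarity is used here for simplicity; the paper uses a small feed-forward alignment model):

```python
import numpy as np

rng = np.random.RandomState(1)
enc_states = rng.randn(7, 16)       # 7 encoder hidden states of dimension 16 (toy sizes)
dec_state = rng.randn(16)           # current decoder hidden state

# 1-2: similarity of the decoder state with each encoder state -> probability distribution
scores = enc_states.dot(dec_state)
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax over encoder positions

# 3: weighted sum of encoder states = context vector fed into the RNN cell
context = (weights[:, None] * enc_states).sum(axis=0)
print(context.shape)                # (16,)
```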
4. How the Vanilla Encoder-Decoder Works:
In the encoder, the source sentence is encoded by an RNN into a fixed-size vector (the “thought vector”):
During training, another RNN tries to predict a word using this “thought vector” as its initial state, taking the decoder input at each time step. The decoder starts with the special token “_GO”. After one step we obtain a new hidden state and, together with the new input, we can continue this process until we reach the special token “_END”. We can calculate the loss by computing the cross-entropy against the target labels; the logits are obtained through a projection layer on the hidden state (for the output of each decoder step we can simply use its hidden state).
During testing, there are no labels, so we feed back the output obtained at the previous time step and continue until we reach the “_END” token.
5. Notes:
Here I use two vocabularies. One is the words used by the encoder; the other is the labels used for the decoder.
For the label vocabulary, three special tokens are inserted: “_GO”, “_END”, “_PAD”; “_UNK” is not used, since all labels are predefined.


Status: the main part is complete; the model is able to generate sequences in reverse order in the test task (you can check this by running the test function in the model), but I have not yet obtained useful results on real tasks. We also use a parallel (multi-head) style of attention, layer normalization, residual connections, and masking in the model.
For each building block, we have included test functions in each of the files below, and we have successfully tested each block.
Seq2seq with attention is a typical model for solving sequence generation problems, such as translation and dialogue systems. Most of the time it uses RNNs to accomplish these tasks. More recently, convolutional neural networks have also been applied to sequence-to-sequence problems. The Transformer, however, relies solely on attention mechanisms to perform these tasks; it is fast and achieves new state-of-the-art results.
It also has two main parts: an encoder and a decoder.
Encoder:
Composed of 6 layers, each layer has two sub-layers. The first is multi-head self-attention; the second is a position-wise fully connected feed-forward network. LayerNorm (x + Sublayer(x)) is used for each sub-layer, dimension = 512.
Decoder:
1. The decoder consists of a stack of N = 6 identical layers.
2. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack.
3. Similar to the encoder, we adopt residual connections around each sub-layer, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking combined with the fact that the output embeddings are offset by one position ensures that the prediction at position i can only depend on the known outputs at positions less than i.
Key Takeaways from This Model:
1. Multi-head self-attention: apply linear transformations several times to obtain projections of the queries, keys, and values, run the attention mechanism on each projection in parallel, and concatenate the results.
2. Some performance-enhancing tricks (residual connections, positional encoding, feed-forward, label smoothing, masking to ignore what we want to ignore).
For detailed information on the model, see: a2_transformer.py
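As a pointer to the core computation, a small numpy sketch of scaled dot-product attention; multi-head attention applies this h times over different learned projections and concatenates the results (this is a generic illustration, not the code in a2_transformer.py):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, the core of (multi-head) attention."""
    d_k = Q.shape[-1]
    scores = Q.dot(K.T) / np.sqrt(d_k)                    # [len_q, len_k]
    if mask is not None:                                  # mask out disallowed positions
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights.dot(V)                                 # [len_q, d_v]

rng = np.random.RandomState(0)
Q, K, V = rng.randn(4, 8), rng.randn(6, 8), rng.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```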
Recurrent Entity Network
Input:
1. Story: It consists of multiple sentences as context.
2. Question: A single sentence that poses a question.
3. Answer: A single label.
Model Structure:
1. Input Encoding:
Using a bag-of-words encoding for the story (context) and the query (question); positions are taken into account using position masks.
By using bidirectional RNNs to encode the story and query, performance improves from 0.392 to 0.398, a 1.5% increase.
2. Dynamic Memory:
a. Calculate a gate using the “similarity” of the input story sentence with each key and each value.
b. Obtain candidate hidden states by transforming each key, value, and input.
c. Combine gates and candidate hidden states to update the current hidden state.
3. Output (using attention mechanism):
a. Obtain probability distribution by calculating the “similarity” between the query and hidden state.
b. Use the probability distribution to obtain a weighted sum of hidden states.
c. Nonlinear transformation of the query and hidden state to obtain predicted labels.
Key Points of This Model:
1. Using independent key and value blocks that can run in parallel.
2. Modeling context and question together. Using memory to track the state of the world, and using the hidden state and question (query) nonlinear transformation for prediction.
3. Simple models can also achieve very good performance; a simple encoding, such as a bag of words, is used for the input.
For detailed information on the model, see: a3_entity_network.py
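A toy numpy sketch of the dynamic-memory update (steps a-c above); weight names and sizes are assumptions, see a3_entity_network.py for the actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

dim, n_blocks = 8, 4                       # toy hidden size and number of memory blocks
rng = np.random.RandomState(0)
keys = rng.randn(n_blocks, dim)            # w_j: one key per block
values = rng.randn(n_blocks, dim)          # h_j: one value (hidden state) per block
U, V, W = rng.randn(dim, dim), rng.randn(dim, dim), rng.randn(dim, dim)

s = rng.randn(dim)                         # encoding of the current story sentence

# a. gate from the "similarity" of the input with each key and each value
gate = sigmoid(values.dot(s) + keys.dot(s))                         # [n_blocks]

# b. candidate hidden state from key, value, and input
candidate = np.tanh(values.dot(U.T) + keys.dot(V.T) + s.dot(W.T))   # [n_blocks, dim]

# c. gated update of every block (all blocks can run in parallel), then re-normalize
values = values + gate[:, None] * candidate
values /= np.linalg.norm(values, axis=1, keepdims=True)
```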
This model includes a test function in which the model has to count numbers from both the story (context) and the query (question), where the story is given a smaller weight than the query.


Overview of the modules:
1. Input Module: Encodes raw text into vector representation.
2. Question Module: Encodes the question into vector representation.
3. Episodic Memory Module: Selects which parts of the input to attend to through an attention mechanism, taking into account the question and the previous memory ==> it produces a “memory” vector.
4. Answer Module: Generates answers from the final memory vector.
Details:
1. Input Module:
a. A single sentence: use a GRU to obtain the hidden state. b. A list of sentences: use a GRU to obtain the hidden state for each sentence, e.g. [hidden_state1, hidden_state2, ..., hidden_staten].
2. Question Module: Uses GRU to obtain the hidden state.
3. Memory Module:
Uses attention mechanisms and recurrent networks to update its memory.
a. Requires multiple passes (episodes) ==> transitive reasoning.
For example, when asked “Where is the football?”, the first pass attends to the sentence “John put down the football”; the second pass then needs to attend to the location of John.
b. Attention Mechanism:
A two-layer feed-forward neural network. The input is the candidate fact c, the previous memory m, and the question q. Features are built from element-wise products, matmul (bilinear) terms, and absolute differences of c with q and of c with m.
c. Memory Update Mechanism: h = f(c, h_previous, g). The last hidden state is the input to the answer module.
4. Answer Module
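To make steps b and c concrete, a toy numpy sketch of the attention gate and the gated update (the feature set and the update function are simplified stand-ins for the paper’s, with assumed weight names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

dim = 8                                     # toy hidden size
rng = np.random.RandomState(0)
c, q, m = rng.randn(dim), rng.randn(dim), rng.randn(dim)   # fact, question, previous memory

# b. attention features: element-wise products, bilinear (matmul) terms, absolute differences
Wb = rng.randn(dim, dim)
z = np.concatenate([c * q, c * m, np.abs(c - q), np.abs(c - m),
                    [c.dot(Wb).dot(q)], [c.dot(Wb).dot(m)]])

# two-layer feed-forward network producing a scalar gate g for this fact
W1, b1 = rng.randn(dim, z.shape[0]), rng.randn(dim)
W2, b2 = rng.randn(1, dim), rng.randn(1)
g = sigmoid(W2.dot(np.tanh(W1.dot(z) + b1)) + b2)[0]

# c. gated memory update h = f(c, h_previous, g); np.tanh stands in for the GRU cell
h_prev = rng.randn(dim)
h = g * np.tanh(c + h_prev) + (1.0 - g) * h_prev
```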
Tasks to be done:
1. Character-level convolutional networks for text classification
2. Convolutional neural networks for text classification: shallow word-level and deep character-level
3. Deep convolutional networks for text classification
4. Adversarial training methods for semi-supervised text classification
1. “Bag of Tricks for Efficient Text Classification”
2. “Convolutional Neural Networks for Sentence Classification”
3. “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”
4. “Deep Learning for Chatbots”, Part 2 – Implementing a Retrieval-Based Model in Tensorflow
5. “Recurrent Convolutional Neural Network for Text Classification”
6. “Hierarchical Attention Networks for Document Classification”
7. “Neural Machine Translation by Jointly Learning to Align and Translate”
8. “Attention Is All You Need”
9. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing”
10. “Tracking the State of the World with Recurrent Entity Networks”