This article is reprinted with authorization from the WeChat public account “Robot Circle” (WeChat ID: ROBO_AI)
The length of this article is 4473 words, and it is recommended to read it in 10 minutes
This article introduces a library of NLP text classification deep learning methods and its 12 models.
The purpose of this library is to explore methods for NLP text classification using deep learning.
It includes various benchmark models for text classification, and supports multi-label classification, where multiple labels are associated with sentences or documents.
Although many of these models are quite simple and may not get you to the top of this text classification task, some of them are very classic and well suited as baseline models.
Each model has a test function under its model type.
We also explored text classification using two seq2seq models (seq2seq model with attention and transformer: attention is all you need). Both models can also be used for sequence generation and other tasks. If your task is multi-label classification, you can transform the problem into sequence generation.
We implemented a memory network, the Recurrent Entity Network ("tracking the state of the world"). It uses blocks of key-value pairs as memory and updates them in parallel to obtain new states. It can be used for question answering with context (or history): for example, you can let the model read some sentences (as the text), ask a question (as the query), and then ask the model to predict the answer; if you instead provide a story as the query, it can perform a classification task.
If you want to know more about text classification, or about the datasets these models can be applied to, you can follow the link below; we have selected one:
https://biendata.com/competition/zhihu/
Models:
1.fastText
2.TextCNN
3.TextRNN
4.RCNN
5.Hierarchical Attention Network
6.Seq2seq with attention
7.Transformer (“Attention Is All You Need”)
8.Dynamic Memory Network
9.Entity Network: tracking the state of the world
Other models:
1.BiLstm Text Relation;
2.Two CNN Text Relation;
3.BiLstm Text Relation Two RNN
Performance:
(Multi-label prediction task: predict the top 5 labels; 3 million training examples; full score: 0.5)
Note: “HierAtteNetwork” refers to Hierarchical Attention Network
Usage:
- The model is in xxx_model.py
- Run python xxx_train.py to train the model
- Run python xxx_predict.py for inference (testing).
Each model has a test method under the model. You can run the test method first to check if the model works properly.
Environment:
python 2.7 + tensorflow 1.1
(tensorflow 1.2 is also applicable; most models should also work in other tensorflow versions, since we use very few version-specific features; if you use python 3.5, it will also work well as long as you change the print statements and try/except blocks.)
Note: some utility functions are in data_util.py; a typical input line looks like "x1 x2 x3 x4 x5 label 323434", where "x1, x2, ..." are words and "323434" is the label; it also has a function that loads pretrained word embeddings and assigns them to the model (the word embeddings are pretrained with word2vec or fastText).
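As a rough sketch of how such a line might be parsed (the " label " delimiter is taken from the example above; the exact helpers in data_util.py may differ):

```python
# Illustrative parser for the "x1 x2 x3 x4 x5 label 323434" format described above.
def parse_line(line):
    text, label = line.rsplit(" label ", 1)   # split on the last " label " marker
    words = text.split()                      # ["x1", "x2", "x3", "x4", "x5"]
    return words, label.strip()               # (word list, "323434")

words, label = parse_line("x1 x2 x3 x4 x5 label 323434")
```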
Model Details:
1.Fast Text
Implementation of the paper “Bag of Tricks for Efficient Text Classification” (https://arxiv.org/abs/1607.01759)
- Uses bi-gram and/or tri-gram features.
- Uses NCE loss to accelerate the softmax computation (instead of the hierarchical softmax in the original paper). Result: performance is as good as in the original paper, and it is very fast.
See: p5_fastTextB_model.py
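A framework-agnostic NumPy sketch of the fastText idea described above (average the embeddings of the words and their n-gram features, then apply a single linear layer with softmax). All names and sizes are illustrative, and the NCE training loss is not shown:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fasttext_forward(token_ids, embedding, W, b):
    """token_ids: ids of the words plus bi-gram/tri-gram features of one document."""
    doc_vec = embedding[token_ids].mean(axis=0)   # average all feature embeddings
    return softmax(doc_vec @ W + b)               # single linear layer + softmax

# toy sizes: 1000 word/n-gram features, 100-dim embeddings, 5 classes
rng = np.random.default_rng(0)
embedding = rng.normal(size=(1000, 100))
W, b = rng.normal(size=(100, 5)), np.zeros(5)
probs = fasttext_forward(np.array([3, 17, 256, 981]), embedding, W, b)
```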
2.Text CNN
Implementation of the paper “Convolutional Neural Networks for Sentence Classification” (http://www.aclweb.org/anthology/D14-1181)
Structure: embedding —> conv —> max pooling —> fully connected layer —> softmax
See: p7_TextCNN_model.py
To obtain very good results using TextCNN, you also need to carefully read the paper “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification” (https://arxiv.org/abs/1510.03820), which gives insights into what affects performance. Of course, you also need to change certain settings based on the specific task.
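A NumPy sketch of the pipeline above (embedding lookup, 1-D convolutions with several filter sizes, max-over-time pooling, then a fully connected softmax layer); dimensions and names are toy values, not the repo's actual configuration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def text_cnn_forward(word_ids, embedding, filters, W_fc, b_fc):
    """filters: dict mapping filter_size -> array [filter_size, embed_dim, num_filters]."""
    x = embedding[word_ids]                                   # [seq_len, embed_dim]
    pooled = []
    for size, F in filters.items():
        # 1-D convolution over time: one feature vector per window position
        conv = np.stack([
            np.tensordot(x[i:i + size], F, axes=([0, 1], [0, 1]))
            for i in range(len(word_ids) - size + 1)
        ])                                                    # [seq_len - size + 1, num_filters]
        pooled.append(np.maximum(conv, 0).max(axis=0))        # ReLU + max-over-time pooling
    features = np.concatenate(pooled)                         # concat all filter sizes
    return softmax(features @ W_fc + b_fc)                    # fully connected + softmax

rng = np.random.default_rng(0)
embedding = rng.normal(size=(1000, 100))
filters = {s: rng.normal(size=(s, 100, 64)) for s in (3, 4, 5)}
W_fc, b_fc = rng.normal(size=(3 * 64, 5)), np.zeros(5)
probs = text_cnn_forward(rng.integers(0, 1000, size=20), embedding, filters, W_fc, b_fc)
```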
3.Text RNN
Structure: embedding —> bi-directional LSTM —> concat outputs —> average —> softmax
See: p8_TextRNN_model.py
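A small NumPy sketch of the pooling and classification part of this structure (the bi-directional LSTM itself is omitted; `h_fw` and `h_bw` stand for its forward and backward output sequences and are random placeholders here):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, hidden = 20, 128
h_fw = rng.normal(size=(seq_len, hidden))       # forward LSTM outputs (placeholder)
h_bw = rng.normal(size=(seq_len, hidden))       # backward LSTM outputs (placeholder)

h = np.concatenate([h_fw, h_bw], axis=1)        # concat outputs: [seq_len, 2*hidden]
sentence_vec = h.mean(axis=0)                   # average over time steps
W, b = rng.normal(size=(2 * hidden, 5)), np.zeros(5)
probs = softmax(sentence_vec @ W + b)           # softmax over 5 toy classes
```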
4.BiLstm Text Relation
Structure: Same as Text RNN. But the input is specially designed. For example: input “How much is the computer? EOS price of laptop”. “EOS” is a special marker that separates question 1 and question 2.
See: p9_BiLstmTextRelation_model.py
5.Two CNN Text Relation
Structure: First, use two different convolutions to extract features from the two sentences, then connect the two features, use a linear transformation layer to project the output to the target label, and then use softmax.
See: p9_twoCNNTextRelation_model.py
6.BiLstm Text Relation Two RNN
Structure: one bi-directional LSTM for one sentence (producing output1), another bi-directional LSTM for the other sentence (producing output2). Then: softmax(output1 · M · output2), where M is a learned interaction matrix.
See: p9_BiLstmTextRelationTwoRNN_model.py
For more details, you can refer to Part 2 of “Deep Learning for Chatbots” – Implementing a Retrieval-Based Model in Tensorflow
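A toy NumPy illustration of the softmax(output1 · M · output2) scoring step above. M is assumed to be a learned interaction matrix, and the sigmoid at the end follows the binary matching setup of the retrieval-based model referenced above; treat this as a sketch rather than the repo's exact multi-class formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
output1 = rng.normal(size=d)                # representation of sentence 1 (from BiLSTM 1)
output2 = rng.normal(size=d)                # representation of sentence 2 (from BiLSTM 2)
M = rng.normal(size=(d, d))                 # learned interaction matrix

score = output1 @ M @ output2               # bilinear matching score between the two sentences
prob_match = 1.0 / (1.0 + np.exp(-score))   # squash the score into a match probability
```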
7.Recurrent Convolutional Neural Network (RCNN)
Recurrent Convolutional Neural Network for text classification.
Implementation of the paper “Recurrent Convolutional Neural Network for Text Classification” (https://scholar.google.com.hk/scholar?q=Recurrent+Convolutional+Neural+Networks+for+Text+Classification&hl=zh-CN&as_sdt=0&as_vis=1&oi=scholart&sa=X&ved=0ahUKEwjpx82cvqTUAhWHspQKHUbDBDYQgQMIITAA)
Structure:
- Recurrent structure (convolutional layer)
- Max pooling
- Fully connected layer + softmax
It learns the representation of each word in the sentence or document using left context and right context:
Current word representation = [left_side_context_vector, current_word_embedding, right_side_context_vector].
For the left context it uses a recurrent structure: a non-linear transformation of the previous word and the previous left context; the right context is computed similarly.
See: p71_TextRCNN_model.py
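A NumPy sketch of the word-representation step described above, following the recurrences in the RCNN paper (left context c_l(w_i) = f(W_l · c_l(w_{i-1}) + W_sl · e(w_{i-1})), and symmetrically for the right context); all sizes and the tanh activation are toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, e_dim, c_dim = 10, 50, 30
E = rng.normal(size=(seq_len, e_dim))                  # word embeddings e(w_1..w_n)
W_l, W_sl = rng.normal(size=(c_dim, c_dim)), rng.normal(size=(c_dim, e_dim))
W_r, W_sr = rng.normal(size=(c_dim, c_dim)), rng.normal(size=(c_dim, e_dim))

c_l = np.zeros((seq_len, c_dim))                       # left-side context vectors
c_r = np.zeros((seq_len, c_dim))                       # right-side context vectors
for i in range(1, seq_len):                            # left context: recurrence left-to-right
    c_l[i] = np.tanh(W_l @ c_l[i - 1] + W_sl @ E[i - 1])
for i in range(seq_len - 2, -1, -1):                   # right context: recurrence right-to-left
    c_r[i] = np.tanh(W_r @ c_r[i + 1] + W_sr @ E[i + 1])

X = np.concatenate([c_l, E, c_r], axis=1)              # [left_context, embedding, right_context]
Y = np.tanh(X @ rng.normal(size=(2 * c_dim + e_dim, 64)))  # per-word latent representation
doc_vec = Y.max(axis=0)                                # max pooling over all words
```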
8.Hierarchical Attention Network
Implementation of the paper “Hierarchical Attention Networks for Document Classification” (https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf)
Structure:
- Embedding
- Word encoder: word-level bi-directional GRU to obtain a rich word representation
- Word attention: word-level attention to extract the important information in a sentence
- Sentence encoder: sentence-level bi-directional GRU to obtain a rich sentence representation
- Sentence attention: sentence-level attention to extract the key sentences in the document
- FC + softmax
Data input:
Generally, the input to this model should be several sentences, not just one sentence. The input shape is [None, sentence_length], where None is the batch size.
In my training data each sample has four parts, each with the same length. I concatenate the four parts to form a single sequence. The model then splits this sequence back into four parts, forming a tensor of shape [None, num_sentence, sentence_length], where num_sentence is the number of sentences (4 in my setup).
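A minimal illustration of the reshaping described above, assuming a batch of concatenated inputs with four equal-length parts per sample (all names and sizes are illustrative):

```python
import numpy as np

batch_size, num_sentence, sentence_length = 8, 4, 25
# each row is one training sample: four parts concatenated into a single sequence
x = np.random.randint(0, 1000, size=(batch_size, num_sentence * sentence_length))

# split each sample back into its four sentences for the hierarchical model
x_hier = x.reshape(batch_size, num_sentence, sentence_length)  # [None, num_sentence, sentence_length]
```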
See: p1_HierarchicalAttention_model.py
9.Seq2seq Model with Attention
Implementation of the seq2seq model with attention from the paper “Neural Machine Translation by Jointly Learning to Align and Translate” (https://arxiv.org/abs/1409.0473)
Structure:
- Embedding
- Bi-directional GRU to capture a rich representation of the source sentence (forward and backward)
- Decoder with attention
Data input:
Use two of the three inputs:
- Encoder input: a sentence;
- Decoder input: a fixed-length list of labels;
- Target labels: also a list of labels.
For example, if the labels are “L1 L2 L3 L4”, the decoder input will be [_GO, L1, L2, L3, L4, _PAD] and the target labels will be [L1, L2, L3, L4, _END, _PAD]. The length is fixed at 6: any excess labels are truncated, and if there are not enough labels, the sequences are padded.
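A sketch of how such fixed-length decoder inputs and targets might be built from a label list (the helper below is illustrative and not the repo's exact preprocessing code):

```python
# Illustrative helper: build fixed-length decoder input and target from a label list.
_GO, _END, _PAD = "_GO", "_END", "_PAD"

def build_decoder_io(labels, max_len=6):
    labels = labels[:max_len - 1]                       # truncate if there are too many labels
    dec_input = [_GO] + labels
    target = labels + [_END]
    dec_input += [_PAD] * (max_len - len(dec_input))    # pad both lists up to max_len
    target += [_PAD] * (max_len - len(target))
    return dec_input, target

dec_input, target = build_decoder_io(["L1", "L2", "L3", "L4"])
# dec_input -> ['_GO', 'L1', 'L2', 'L3', 'L4', '_PAD']
# target    -> ['L1', 'L2', 'L3', 'L4', '_END', '_PAD']
```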
Attention mechanism:
- Take the list of encoder outputs and the hidden state of the decoder.
- Compute the similarity of the decoder hidden state with each encoder output to obtain a probability distribution over the encoder positions.
- Compute a weighted sum of the encoder outputs based on this probability distribution.
The RNN cell then takes this weighted sum together with the decoder input to produce the new hidden state (sketched below).
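These steps can be written in a few lines of NumPy. Dot-product similarity is used here for simplicity; the referenced paper scores each position with a small feed-forward network instead, so this is only a sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_len, hidden = 12, 64
encoder_outputs = rng.normal(size=(src_len, hidden))   # one hidden state per source position
decoder_hidden = rng.normal(size=hidden)               # current decoder hidden state

scores = encoder_outputs @ decoder_hidden              # similarity with each encoder state
attn_probs = softmax(scores)                           # probability distribution over source
context = attn_probs @ encoder_outputs                 # weighted sum of encoder states
# `context`, together with the current decoder input, is fed to the RNN cell
# to produce the new decoder hidden state.
```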
How the vanilla encoder-decoder works:
In the encoder, the source sentence is encoded by an RNN into a fixed-size vector (the “thought vector”).
During training, another RNN is used to produce a word at each time step, using this “thought vector” as the initial state and taking the decoder input at each step. The decoder starts with the special token “_GO”. After one step, the new hidden state is obtained together with the new input, and we continue this process until we reach the special token “_END”. We can compute the loss as the cross-entropy between the logits and the target labels; the logits are obtained by projecting the hidden state (for the output of each decoder step; with a GRU we can simply use the decoder hidden state as the output).
During testing, there are no labels, so we feed back the output obtained at the previous time step and continue the process until we reach the “_END” token.
Notes:
Here I use two vocabularies: one for the encoder input words, the other for the decoder labels.
Three special tokens are inserted into the label vocabulary: “_GO”, “_END”, “_PAD”; “_UNK” is not used, because all labels are predefined.
10. Transformer (“Attention Is All You Need”)
Status: the main parts are complete. On a toy task the model can generate a sequence in reverse order; you can check this by running the test function in the model. However, I have not yet obtained useful results on the real task. We also use a parallel style, layer normalization, residual connections, and masking in the model.
For each building block, we include a test function in each file below, and we have successfully tested each block.
Seq2seq with attention is a typical model for solving sequence generation problems, such as translation and dialogue systems. Most of the time it uses an RNN to accomplish these tasks. More recently, convolutional neural networks have also been applied to sequence problems. However, the Transformer, which relies solely on the attention mechanism to perform these tasks, is fast and achieves new state-of-the-art results.
It also has two main parts: the encoder and the decoder. See below:
- Encoder:
Composed of 6 layers, each with two sub-layers. The first is a multi-head self-attention mechanism; the second is a position-wise fully connected feed-forward network. For each sub-layer we use LayerNorm(x + Sublayer(x)); dimension = 512.
- Decoder:
  - The decoder consists of a stack of N = 6 identical layers.
  - In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack.
  - Similar to the encoder, we adopt residual connections around each sub-layer, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the prediction for position i can depend only on the known outputs at positions less than i.
Key points that stand out from this model:
- Multi-head self-attention: apply several linear transformations to obtain projections of the queries, keys, and values, then run the attention mechanism on each projection in parallel (see the sketch below).
- Several performance tricks: residual connections, positional encodings, feed-forward layers, label smoothing, and masking to ignore what we want to ignore.
For more details about the model, see: a2_transformer.py
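As a rough illustration of the multi-head self-attention and the LayerNorm(x + Sublayer(x)) wrapper mentioned above, here is a framework-agnostic NumPy sketch. The dimensions, the 8-head split, the absence of masking, and the layer normalization without learned scale/shift are simplifications:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv                        # linear projections
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)                  # [num_heads, seq_len, d_head]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)     # scaled dot-product attention
    out = softmax(scores) @ v                               # [num_heads, seq_len, d_head]
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return out @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 10, 512, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
# residual connection + layer normalization around the sub-layer: LayerNorm(x + Sublayer(x))
y = layer_norm(x + multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads))
```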
11.Recurrent Entity Network
Inputs:
-
Story: it consists of multiple sentences, serving as context.
-
Question: a sentence, which is a question.
-
Answer: a single label.
Model structure:
- Input encoding:
Encode the story (context) and query (question) using a bag-of-words representation of each sentence; positions are taken into account by using a position mask.
By using a bidirectional RNN to encode the story and query, performance increased from 0.392 to 0.398, a gain of 1.5%.
- Dynamic memory (see the sketch after this list):
  - Use the “similarity” between the keys and the input story to compute gates over the values.
  - Transform each key, value, and input to obtain candidate hidden states.
  - Combine the gates and candidate hidden states to update the current hidden states.
- Output (using an attention mechanism):
  - Obtain a probability distribution by computing the “similarity” between the query and the hidden states.
  - Use the probability distribution to compute a weighted sum of the hidden states.
  - Apply a non-linear transformation of the query and the weighted hidden states to obtain the predicted label.
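A slightly simplified NumPy sketch of the dynamic-memory update and the attention-based output described above, following my reading of the Recurrent Entity Network paper (gates from key/hidden similarity, candidate states, gated update, then a query-weighted sum of memories). The tanh activation, the toy shapes, and the reduced output transform are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
num_blocks, d = 5, 64
keys = rng.normal(size=(num_blocks, d))                # one key per memory block
hidden = rng.normal(size=(num_blocks, d))              # one hidden state per block
U, V, W = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))

def update_memory(s, keys, hidden):
    """One dynamic-memory step for an encoded input sentence s (all blocks in parallel)."""
    gate = sigmoid(hidden @ s + keys @ s)                      # "similarity" gate per block
    candidate = np.tanh(hidden @ U.T + keys @ V.T + s @ W.T)   # candidate hidden states
    hidden = hidden + gate[:, None] * candidate                # gated update
    return hidden / np.linalg.norm(hidden, axis=1, keepdims=True)

def answer(query, hidden, R):
    probs = softmax(hidden @ query)         # similarity between query and each memory
    u = probs @ hidden                      # weighted sum of hidden states
    return softmax(R @ np.tanh(query + u))  # non-linear transform -> label distribution

hidden = update_memory(rng.normal(size=d), keys, hidden)
label_probs = answer(rng.normal(size=d), hidden, rng.normal(size=(10, d)) * 0.05)
```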
Key points of this model:
- Using independent blocks of keys and values allows the memory to be updated in parallel.
- Context and questions are modeled together: memory is used to track the state of the world, and predictions are made from a non-linear transformation of the hidden states and the question (query).
- A simple model can also achieve very good performance; a simple encoding such as bag-of-words is sufficient.
For more details about the model, see: a3_entity_network.py
Under this model there is a test function: it asks the model to compute numbers from the story (context) and the query (question), where the story is given less weight than the query.
12.Dynamic Memory Network
Module overview:
- Input module: encodes raw text into vector representations.
- Question module: encodes the question into a vector representation.
- Episodic memory module: uses an attention mechanism to choose which parts of the input to focus on, taking into account the question and previous memories ===> it produces a “memory” vector.
- Answer module: generates an answer from the final memory vector.
Details:
- Input module:
a) A single sentence: use a GRU to obtain its hidden states. b) A list of sentences: use a GRU to obtain a hidden state for each sentence, e.g. [hidden state 1, hidden state 2, ..., hidden state n].
- Question module: use a GRU to obtain the hidden state.
- Memory module:
Uses an attention mechanism and a recurrent network to update its memory.
  - Requires multiple passes (episodes) ===> for transitive inference.
e.g. asked “Where is the football?”, the model will first attend to the sentence “John put down the football”; then, in the second pass, it needs to attend to the location of John.
  - Attention mechanism (see the sketch after this list): a two-layer feed-forward neural network. The input is the candidate fact c, the previous memory m, and the question q. Features are obtained from element-wise products, matrix products, and absolute differences of q with c and of q with m.
  - Memory update mechanism: h = f(c, h_previous, g). The last hidden state is the input to the answer module.
- Answer module: generates the answer from the final memory vector.
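A simplified NumPy sketch of the attention gate and memory update described above. The feature vector uses only the element-wise products and absolute differences named in the text, a plain tanh cell stands in for the GRU, and all weights and sizes are toy values:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, d_att = 64, 32
W1, b1 = rng.normal(size=(d_att, 4 * d)) * 0.05, np.zeros(d_att)
W2, b2 = rng.normal(size=(1, d_att)) * 0.05, np.zeros(1)
W_cell = rng.normal(size=(d, 2 * d)) * 0.05              # stand-in for the GRU weights

def attention_gate(c, m, q):
    """Two-layer feed-forward scoring of candidate fact c given memory m and question q."""
    z = np.concatenate([c * q, c * m, np.abs(c - q), np.abs(c - m)])
    return sigmoid(W2 @ np.tanh(W1 @ z + b1) + b2)[0]

def memory_step(c, h_prev, g):
    """h = f(c, h_previous, g): gated update of the episode hidden state."""
    h_candidate = np.tanh(W_cell @ np.concatenate([c, h_prev]))   # simplified recurrent cell
    return g * h_candidate + (1.0 - g) * h_prev

q = rng.normal(size=d)                                    # question vector
m = q.copy()                                              # initial memory = question
h = np.zeros(d)
for c in rng.normal(size=(6, d)):                         # loop over candidate facts
    h = memory_step(c, h, attention_gate(c, m, q))
# the final hidden state h feeds the answer module (and updates the memory on the next pass)
```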
Things to do:
- Character-level convolutional networks for text classification
- Convolutional neural networks for text classification: shallow word-level vs. deep character-level
- Deep convolutional networks for text classification
- Adversarial training methods for semi-supervised text classification
References:
1. “Bag of Tricks for Efficient Text Classification”
2. “Convolutional Neural Networks for Sentence Classification”
3. “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”
4. “Deep Learning for Chatbots, Part 2 – Implementing a Retrieval-Based Model in Tensorflow” (www.wildml.com)
5. “Recurrent Convolutional Neural Network for Text Classification”
6. “Hierarchical Attention Networks for Document Classification”
7. “Neural Machine Translation by Jointly Learning to Align and Translate”
8. “Attention Is All You Need”
9. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing”
10. “Tracking the State of the World with Recurrent Entity Networks”
Editor: Wang Xuan
Proofreader: Wang Hongyu