Chapter 8.1 TextCNN Network

Hello everyone, welcome to the Moon Inn, I am the shopkeeper Kong Character.

The content directory for this issue is as follows. If this content helps you, feel free to like and share to support the shopkeeper!

  • Chapter 8 Temporal and Model Fusion
  • 8.1 TextCNN Network
    • 8.1.1 TextCNN Structure
    • 8.1.2 Text Segmentation
    • 8.1.3 TextCNN Implementation
    • 8.1.4 Summary
  • References

Chapter 8 Temporal and Model Fusion

After the introductions in Chapters 4 and 7, we now have a clear understanding of two common network structures in deep learning: CNN and RNN. CNN is mainly used to process feature data with dependencies across spatial positions, while RNN is mainly used to handle feature data with temporal dependencies. Even so, features can also be extracted from temporal data using CNN alone or a combination of CNN and RNN. In this chapter, we will introduce several variant models based on CNN and RNN, along with corresponding application cases.

8.1 TextCNN Network

Although CNN is primarily used for feature encoding of information at adjacent spatial positions in the input matrix, it can also characterize the features of sequential data by using convolution windows of multiple scales, as in the classic TextCNN[1] text classification model. The core idea of TextCNN is to represent each word with a fixed-length vector (i.e., a word vector, which will be introduced in section 9.2); the word vectors of all words in a sentence are then stacked vertically to form an n×k matrix, where n is the number of words in the sentence and k is the dimension of the word vector; finally, multi-scale convolution kernels of fixed width k are used for feature extraction and the subsequent task.

8.1.1 TextCNN Structure

TextCNN uses convolution operations to extract local features from the text, capturing local features of different lengths with convolution kernels of varying sizes, thereby identifying key information in the text. In TextCNN, the entire network model is generally divided into three layers: convolutional layer, pooling layer, and fully connected classification layer, as shown in Figure 8-1.

Figure 8-1 TextCNN Network Structure

In Figure 8-1, the leftmost part is the feature matrix, in which each row is a 4-dimensional vector and each vector corresponds to one word in the vocabulary. With this representation, any text can be expressed as a feature matrix of fixed width, which can be viewed as a single-channel feature map. TextCNN then applies three convolution kernels with different window lengths (3, 4, and 5). Since the width of each convolution kernel equals the width of the feature map, the output of each convolution is a one-dimensional vector. In essence, this convolution operation can be viewed as extracting features from local sequential spans of the text, where the span length is determined by the window length of the convolution kernel.

After completing the convolution operation, a max pooling operation is performed on each feature map to extract features, resulting in each channel becoming a scalar value. Finally, all results are concatenated to obtain a vector as the feature representation of the text, followed by a classification layer to complete subsequent classification tasks. Although the TextCNN network structure seems relatively simple, it often achieves good results in practical applications, with fast training speeds and low risks of overfitting.
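These shape changes can be made concrete with a minimal sketch; the batch size, sentence length, and embedding dimension below are illustrative assumptions rather than the values depicted in Figure 8-1:

import torch
import torch.nn as nn

batch_size, src_len, embedding_size = 2, 7, 5             # assumed toy sizes
x = torch.randn(batch_size, 1, src_len, embedding_size)   # single-channel "text image"

conv = nn.Conv2d(1, 2, kernel_size=(3, embedding_size))   # window length 3, width = embedding dim
pool = nn.AdaptiveMaxPool2d((1, 1))                       # max over each whole feature map

feature = conv(x)                       # [2, 2, 5, 1]: one value per window position, width collapses to 1
feature = pool(feature)                 # [2, 2, 1, 1]: each channel is reduced to a single scalar
feature = feature.squeeze(-1).squeeze(-1)  # [2, 2]: ready to be concatenated with the other branches
print(feature.shape)                    # torch.Size([2, 2])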

8.1.2 Text Segmentation

In section 7.6, we represented text at single-character granularity together with a word embedding layer. Next, we will see how to represent text at word granularity with a word embedding layer. Of course, for an English corpus, words are already separated by spaces, so no extra segmentation step is needed to move from character to word granularity.

Text segmentation refers to splitting a sentence into smaller word units according to Chinese semantics. For example, consider the following text:

《活着》是余华的小说,描绘了中国农民的苦难与坚韧。通过主人公福贵的生活经历,展现了战乱、饥饿和政治运动对人民的摧残,同时传递了关于家庭、希望和人性的深刻思考。

(“To Live” is a novel by Yu Hua, depicting the hardships and resilience of Chinese farmers. Through the protagonist Fugui’s life experiences, it showcases the devastation caused by war, hunger, and political movements on the people, while conveying profound reflections on family, hope, and humanity.)

The result after segmentation is:

《/活着/》/是/余华/的/小说/,/描绘/了/中国/农民/的/苦难/与/坚韧/。/通过/主人公/福贵/的/生活/经历/,/展现/了/战乱/、/饥饿/和/政治/运动/对/人民/的/摧残/,/同时/传递/了/关于/家庭/、/希望/和/人性/的/深刻/思考/。

Then, word frequency statistics are conducted at the word level to construct a vocabulary and vectorize the text.
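As a rough illustration of this step (the variable names and the tiny word list below are assumptions for demonstration, not the book's actual preprocessing code), a frequency-ordered vocabulary and the corresponding index sequence could be built like this:

from collections import Counter

# a few words taken from the segmented example above
words = ["活着", "是", "余华", "的", "小说", "描绘", "了", "中国", "农民", "的", "苦难"]
counter = Counter(words)                     # word frequency statistics

# reserve index 0 for unknown (out-of-vocabulary) words, a common convention
itos = ["[UNK]"] + [w for w, _ in counter.most_common()]
stoi = {w: i for i, w in enumerate(itos)}

ids = [stoi.get(w, 0) for w in words]        # vectorize the text as vocabulary indices
print(ids)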

In text processing, jieba is a commonly used open-source segmentation tool[3], which can be installed via the command pip install jieba. Meanwhile, the jieba library provides two segmentation modes to handle Chinese segmentation in different scenarios, which will be introduced below.

1. Standard Segmentation Mode

The standard segmentation mode refers to the conventional method of segmenting a sentence into multiple words, with the following example code:

1 import jieba
2 if __name__ == '__main__':
3     sen = "今天天气晴朗,阳光明媚,微风轻拂着脸庞,我独自漫步在河边的小径上。"
4     segs = jieba.cut(sen)
5     result = "/".join(segs)
6     print(result)

In the above code, line 4 performs segmentation on the original text and returns an iterator. Line 5 formats the processed result and line 6 prints it, as shown below.

今天/天气晴朗/,/阳光明媚/,/微风/轻拂/着/脸庞/,/我/独自/漫步/在/河边/的/小径/上/。

2. Full Segmentation Mode

Although the above method can segment a sentence at the word level, some spans of text can be segmented in more than one way. In this case, you can enable the full segmentation mode by calling jieba.cut(sen, cut_all=True) to obtain all possible segmentation results. For example, with the full segmentation mode enabled, the segmentation result is:

今天/今天天气/天天/天气/天气晴朗/晴朗/,/阳光/阳光明媚/光明/明媚/,/微风/轻拂/着/脸庞/,/我/独自/漫步/在/河边/的/小径/上/。

In practice, you can choose different modes according to the situation. Of course, jieba not only performs segmentation but also provides functions for keyword extraction, part-of-speech tagging, and new word discovery, which interested readers can refer to for further study.
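For a quick taste of two of these interfaces, keyword extraction and part-of-speech tagging, here is a minimal sketch (see the jieba documentation for the full parameter list):

import jieba.analyse
import jieba.posseg as pseg

sen = "今天天气晴朗,阳光明媚,微风轻拂着脸庞,我独自漫步在河边的小径上。"

print(jieba.analyse.extract_tags(sen, topK=5))   # TF-IDF based keyword extraction, top-5 keywords
for word, flag in pseg.cut(sen):                 # part-of-speech tagging: each token with its POS flag
    print(word, flag)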

8.1.3 TextCNN Implementation

After understanding the principles related to the TextCNN model, let’s see how to quickly implement this model using PyTorch. The complete example code can be found in the Code/Chapter08/C01_TextCNN/TextCNN.py file.

1. Forward Propagation

First, we need to implement the entire forward propagation process of the model. As shown in Figure 8-1, the entire model is divided into four parts: word embedding layer, convolution layer, pooling layer, and fully connected layer. The implementation code is as follows:

 1 class TextCNN(nn.Module):
 2     def __init__(self, vocab_size=2000, embedding_size=512,
 3                  window_size=None, out_channels=2, fc_hidden_size=128, num_classes=10):
 4         super(TextCNN, self).__init__()
 5         if window_size is None:
 6             window_size = [3, 4, 5]
 7         self.vocab_size = vocab_size
 8         self.embedding_size = embedding_size
 9         self.window_size = window_size
10         self.out_channels = out_channels
11         self.fc_hidden_size = fc_hidden_size
12         self.num_classes = num_classes
13         self.token_embedding = nn.Embedding(self.vocab_size, self.embedding_size)
14         self.convs = nn.ModuleList([nn.Conv2d(1, out_channels,
15                       kernel_size=(k, embedding_size)) for k in window_size])
16         self.max_pool = nn.AdaptiveMaxPool2d((1, 1))
17         self.classifier = nn.Sequential(
18             nn.Linear(len(self.window_size) * self.out_channels, self.num_classes))

In the above code, lines 5-12 initialize the relevant model hyperparameters. Line 13 instantiates the word embedding layer, a two-dimensional weight matrix in which each row is the vector representation of one word in the vocabulary. Lines 14-15 instantiate one convolutional layer for each convolution window length; they are wrapped in nn.ModuleList so that their parameters are registered as part of the model (a plain Python list would not be tracked by model.parameters()). Line 16 instantiates an adaptive max pooling layer whose output spatial size is 1×1; since the pooling layer has no parameters, all convolution branches can share the same pooling layer. Lines 17-18 instantiate the classification layer, whose input dimension equals the number of convolution branches multiplied by the number of output channels of each branch.
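As a quick sanity check (a sketch assuming the class definition above), one can instantiate the model and confirm that the embedding layer, the convolution branches, the pooling layer, and the classifier all appear as registered submodules:

model = TextCNN(vocab_size=5, embedding_size=3)
for name, module in model.named_children():
    print(name, "->", module.__class__.__name__)
# expected names: token_embedding, convs, max_pool, classifier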

Furthermore, the example code for the entire forward propagation calculation process is as follows:

 1     def forward(self, x, labels=None):
 2         x = self.token_embedding(x)
 3         x = torch.unsqueeze(x, dim=1)
 4         features = []
 5         for conv in self.convs:
 6             feature = self.max_pool(conv(x))
 7             features.append(feature.squeeze(-1).squeeze(-1))
 8         features = torch.cat(features, dim=1)
 9         logits = self.classifier(features)
10         if labels is not None:
11             loss_fct = nn.CrossEntropyLoss(reduction='mean')
12             loss = loss_fct(logits, labels)
13             return loss, logits
14         else:
15             return logits

In the above code, x in line 1 is the index representation of each segmented sentence in the vocabulary, with shape [batch_size, src_len]. Line 2 passes it through the word embedding layer, producing an output of shape [batch_size, src_len, embedding_size]. Line 3 inserts a channel dimension of size 1 at position 1, giving a shape of [batch_size, 1, src_len, embedding_size]. Lines 4-7 perform the multi-scale convolution operations: after the convolution and pooling in line 6 the shape is [batch_size, out_channels, 1, 1], and line 7 squeezes the last two dimensions to [batch_size, out_channels] and appends the result to a list. Line 8 concatenates all the features, giving a shape of [batch_size, out_channels*len(window_size)]. Line 9 is the final classification layer. Lines 10-15 return either the loss and logits or just the logits, depending on whether labels are provided.

Finally, it can be used as follows:

1 if __name__ == '__main__':
2     x = torch.tensor([[1, 2, 3, 2, 0, 1],
3                       [2, 2, 2, 1, 3, 1]])
4     labels = torch.tensor([0, 3])
5     model = TextCNN(vocab_size=5, embedding_size=3, fc_hidden_size=6)
6     loss, logits = model(x, labels)
7     print(logits.shape)

The output is:

1 torch.Size([2, 10])
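Since the labels argument is optional, calling the model without it (as at inference time) returns only the logits. Continuing from the snippet above:

with torch.no_grad():
    logits = model(x)             # labels omitted, so only the logits are returned
    preds = logits.argmax(dim=1)  # predicted class index for each sample
print(preds.shape)                # torch.Size([2])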

2. Constructing the Dataset

Here we continue to use the 15-class Toutiao news dataset introduced in section 7.2.4; we only need to change the granularity from character level to word level. Specifically, we add a new segmentation branch to the tokenize function of the TouTiaoNews module introduced in section 7.2.4, with the example code as follows:

1 def tokenize(text, cut_words=False):
2     if cut_words:
3         text = jieba.cut(text)
4     words = " ".join(text).split()
5     return words

In the above code, lines 2-3 are the newly added segmentation processing logic.
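A quick comparison of the two granularities, assuming the tokenize function above (with jieba imported in the module); the example sentence is arbitrary, and the word-level result depends on jieba's dictionary:

text = "去云南旅行"
print(tokenize(text))                  # character level: ['去', '云', '南', '旅', '行']
print(tokenize(text, cut_words=True))  # word level, e.g. ['去', '云南', '旅行']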

Furthermore, to construct the dataset at word level, we only need to pass the cut_words parameter wherever the tokenize function is used inside TouTiaoNews; the specific code can be found in the source file. The vectorized samples look similar to the following:

1 ## Original input sample: 去云南旅行会不会出现高原反应,应如何预防? (Will traveling to Yunnan cause altitude sickness, and how should it be prevented?)
2 ## Segmented sample: ['去', '云南', '旅行', '会', '不会', '出现', '高原', '反应', ',', '应', '如何', '预防', '?']
3 ## Vectorized sample: [60, 1220, 391, 29, 196, 317, 0, 2368, 2, 1343, 15, 0, 3]

Finally, when instantiating TouTiaoNews, we just need to pass cut_words=True:

1 if __name__ == '__main__':
2     toutiao_news = TouTiaoNews(top_k=4000, batch_size=12, cut_words=True)
3     test_iter = toutiao_news.load_train_val_test_data(is_train=False)
4     for x, y in test_iter:
5         print(x, y)

3. Model Training

As this part of the code has been introduced several times in earlier chapters, it will not be elaborated on here; readers can refer directly to the source code (a minimal sketch of the training loop is also given after the log below). During training, output similar to the following will be obtained:

 1 Epochs[1/50]--batch[0/2093]--Acc: 0.0469--loss: 2.775
 2 Epochs[1/50]--batch[50/2093]--Acc: 0.2109--loss: 2.4728
 3 Epochs[1/50]--batch[100/2093]--Acc: 0.3203--loss: 2.229
 4 Epochs[1/50]--batch[150/2093]--Acc: 0.4453--loss: 1.7122
 5 Epochs[1/50]--batch[200/2093]--Acc: 0.5156--loss: 1.5143
 6 Epochs[1/50]--batch[250/2093]--Acc: 0.5547--loss: 1.2475
 7 Epochs[1/50]--batch[300/2093]--Acc: 0.5859--loss: 1.5477
 8 Epochs[1/50]--batch[350/2093]--Acc: 0.6172--loss: 1.2619
 9 Epochs[1/50]--batch[400/2093]--Acc: 0.6953--loss: 1.146
10 Epochs[1/50]--Acc on val 0.7311
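For reference, the overall structure of such a training loop might look like the following sketch. The optimizer, learning rate, embedding size, and the synthetic stand-in for the data iterator are all assumptions made for illustration; the actual implementation, which loads batches through TouTiaoNews, is in the source file.

import torch
from torch.optim import Adam

# a couple of synthetic batches stand in for the real word-level Toutiao iterator
train_iter = [(torch.randint(0, 4000, (12, 20)), torch.randint(0, 15, (12,)))
              for _ in range(4)]

model = TextCNN(vocab_size=4000, embedding_size=128, num_classes=15)  # 15-class setting
optimizer = Adam(model.parameters(), lr=1e-3)                          # assumed optimizer and learning rate

for epoch in range(50):
    for step, (x, y) in enumerate(train_iter):
        loss, logits = model(x, labels=y)   # forward pass returns the loss when labels are given
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            acc = (logits.argmax(dim=1) == y).float().mean()
            print(f"Epochs[{epoch + 1}/50]--batch[{step}/{len(train_iter)}]"
                  f"--Acc: {acc:.4f}--loss: {loss.item():.4f}")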

8.1.4 Summary

In this section, we first introduced the principles of TextCNN in detail, which can essentially be viewed as a method for local feature extraction from sequential data using convolution operations; then we briefly introduced the usage of the segmentation tool jieba; finally, we showed how to implement the TextCNN model step by step and tested it on the Toutiao dataset. In the next section, we will introduce the RNN-based TextRNN model for text classification.

References

[1] Kim Y. Convolutional Neural Networks for Sentence Classification[C]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1746-1751.

[2] Zhang Y, Wallace B. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification[J]. arXiv preprint arXiv:1510.03820, 2015.

[3] jieba: Chinese text segmentation. https://github.com/fxsjy/jieba
