The Magical Recursive Neural Network That Mimics Han Han’s Writing


A Big Data Digest original; authorization is required for reposting.

Author| Han Xiaoyang && Long Xincheng

Thanks| Owen for translating and providing part of the content from Recurrent Neural Networks Tutorial part 1

Source|http://blog.csdn.net/han_xiaoyang/article/details/51253274

Big Data Digest’s “Machine Learning” column is established!

We welcome everyone to leave valuable comments and submit articles to us. How to join us? There is a description at the end of the article 🙂

Introduction

As we get closer to artificial intelligence, the research and industrial communities’ interest in neural networks and deep learning is growing stronger, with expectations rising.

We have seen in the deep learning and computer vision column how computers learn to recognize the content of images through convolutional neural networks—mimicking human vision. The industrial applications have also proven that neural networks can teach computers to listen (such as Baidu’s speech recognition). Thus, a lot of energy is now being directed towards the NLP field, and teaching computers to write is certainly an interesting endeavor. Imagine if computers could read novels by Han Han and Xiao Si and write text with a similar tone; that would be quite exciting!

Indeed, there is a type of neural network that can play a significant role in NLP, handling tasks ranging from language modeling to bilingual translation, text generation, and even code style imitation. This is what I will introduce today: RNN (Recurrent Neural Network).

The Magical RNN and Its Unbelievable Performance

RNN is a type of neural network related to time series. Think about it: for a single image, the pixel information is static, while for a sentence, the composition of words has an order, and typically, subsequent words are sequentially related to previous ones. In this case, convolutional neural networks with independent hierarchical structures often struggle to handle such temporal associations, while RNNs can effectively process them.

How effective is RNN in handling time series content? Let’s look at some examples processed by RNN.

We know that convolutional neural networks have achieved human-level accuracy in recognizing digits from 0 to 9, but for a string of numbers, special handling is still required. For example, below is an illustration of RNN learning to read house numbers from left to right.

[Figure: an RNN reading house-number digits from left to right]

We can also let RNN learn to draw house number images from right to left:

[Figure: an RNN generating house-number images]

RNN can imitate various file formats. For instance, here are samples of RNN-generated XML and LaTeX (algebraic geometry) files:

[Figure: sample of an RNN-generated XML file]

[Figure: sample of an RNN-generated LaTeX algebraic geometry document]

We won’t worry about the correctness of the content for now; the formatting ability alone is jaw-dropping.

Programmers who love to tinker will surely seize every learning resource. They pulled the Linux source code from GitHub and fed it to the recurrent neural network. After two or three days… we found it could generate code blocks like the following by itself:

[Figure: sample of RNN-generated C code in the style of the Linux kernel source]

Never mind whether the code is meaningful; the well-formed C coding style alone is enough to scare you.

Oh, right, we just mentioned RNN’s ability to process text. Let’s look at an experiment by Andrej Karpathy, who compiled all of Shakespeare’s works into a single 4.4MB file. After training a 3-layer RNN with 512 hidden nodes on it for a few hours, he obtained samples like the following:

[Figure: sample of RNN-generated Shakespeare-style text]

As a coder with a weak literary foundation, I can’t grasp the writing style of a great literary figure like Shakespeare. But we have Han Han! We have the widely loved Xiao Si! So we pulled a Chinese-capable version of char-rnn from GitHub, gathered Xiao Si’s works from “Phantom City” and “Left Hand’s Reflection, Right Hand’s Years” to “Sadness Flowing Backwards” and “Tiny Times”, processed them (cough cough, they were used only for experimental learning, not for distribution… for copyright issues please contact the blogger…), and fed them to the RNN. Then let’s see what it learned:

[Figure: sample of RNN-generated text imitating Xiao Si’s style]

I can’t really appreciate Xiao Si’s writing, but the text generated by the RNN gives me a vague sense of that “looking up at the sky at a 45-degree angle so the tears won’t fall” atmosphere. Of course, these paragraphs are among the better outputs; some are only moderately fluent, but they are enough to show that recurrent neural networks have an impressive capacity to learn in NLP.

Understanding Recurrent Neural Networks

For those amazed by recurrent neural networks, let’s turn our attention back from Xiao Si and look at what this magical network actually is and what principles lie underneath.

In traditional neural networks (including CNNs), we assume that all inputs and outputs are independent of each other. Unfortunately, this assumption does not hold for many tasks: to predict the next word in a sentence, you need to know the words that came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with each output depending on the previous computations. Another way to think about it: an RNN has a “memory” that stores information about what has been computed so far. In theory, RNNs can remember and use information from sequences of arbitrary length, but in practice we usually only look back a few steps (which will be discussed in detail later). Let’s take a look at a typical RNN illustration:

[Figure: a recurrent neural network and the unfolding in time of its computation]

The above image shows how to unfold an RNN into a complete network. We use the term “unfold” because the transformation process (weights W) from one layer (or time point t) to the next layer (next time point) is consistent. For example, if we are focusing on a sequence containing five words, the RNN will unfold into five neural networks based on the order, with each word representing a layer. RNNs iterate and compute according to the following steps:

  • Assume $x_t$ is the input at time $t$, for example, $x_1$ could be the one-hot encoded word vector of the second word in the sentence.

  • Assume $s_t$ is the hidden state at time $t$; this is the “memory brain” of the neural network. $s_t$ is computed from the current input and the previous hidden state: $s_t = f(Ux_t + Ws_{t-1})$. The function $f$ is usually non-linear, such as the commonly used tanh or ReLU. $s_{-1}$, which is needed to compute the first hidden state, is typically initialized to all zeros.

  • $o_t$ is the output at step $t$. For example, if we want to predict the next word based on the previous word sequence, we might get the probabilities of each word in the entire vocabulary (obtained through softmax), i.e., $o_t = \mathrm{softmax}(Vs_t)$.

Regarding the above steps, let’s elaborate on a few points:

  • We can think of the hidden state $s_t$ as the “memory” of the neural network; it captures and retains information from previous time steps. The output $o_t$ is computed from this memory at the current time step $t$. In practical engineering we often need some additional handling, because $s_t$ cannot actually capture and retain information from all previous time steps.

  • Unlike ordinary deep neural networks, where each layer has different weight parameters, an RNN shares the same set of parameters $(U, V, W)$ (as shown in the figure) at every step. This actually indicates that we perform the same operation from one step to the next, just with different inputs. The advantage of this is that it significantly reduces the number of parameters that need to be trained and estimated.

  • The above diagram shows that there is output at every step, but depending on the task, it may not be necessary to have output at every step. For example, when we are determining the sentiment of a sentence, we are actually only concerned with the final sentiment judgment, not the results of each word in between. The main feature of RNNs is the hidden state because it is the “memory” that retains most of the previous information.
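To make these steps concrete, here is a toy numpy sketch of a single time step with made-up dimensions (the sizes and the softmax helper are illustrative only, not from the original post):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

vocab_size, hidden_dim = 10, 4  # toy sizes, just for illustration
U = np.random.uniform(-0.1, 0.1, (hidden_dim, vocab_size))
W = np.random.uniform(-0.1, 0.1, (hidden_dim, hidden_dim))
V = np.random.uniform(-0.1, 0.1, (vocab_size, hidden_dim))

x_t = np.zeros(vocab_size)
x_t[3] = 1                                  # one-hot vector for the current word
s_prev = np.zeros(hidden_dim)               # s_{t-1}, all zeros at the start
s_t = np.tanh(U.dot(x_t) + W.dot(s_prev))   # s_t = f(U x_t + W s_{t-1})
o_t = softmax(V.dot(s_t))                   # o_t = softmax(V s_t)
print(o_t.shape, o_t.sum())                 # (10,) and ~1.0
```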

Different Types of RNNs

The RNN we just mentioned is the original version, but in recent years, researchers have proposed a series of new types of RNN networks. For example:

4.1 Bidirectional RNN

The idea of Bidirectional RNN is somewhat different from the original RNN; it considers that the current output is related not only to previous sequence elements but also to subsequent sequence elements. For example, if we want to fill in a missing word in a sentence, we intuitively feel that the word to fill in is related to the content before and after it, right? Bidirectional RNNs can be intuitively understood as two RNNs stacked together. The output result is then based on the hidden states of the two RNNs.

[Figure: a bidirectional RNN]

4.2 Deep Bidirectional RNN

Deep Bidirectional RNN is similar to Bidirectional RNN, but the difference is that we set up a multi-layer structure at each step/time point. In practical applications, this approach allows our neural network to have greater capacity (but this also means we need more training data).

[Figure: a deep bidirectional RNN]

4.3 LSTM Neural Network

LSTM (Long Short Term Memory) might be the most popular RNN neural network you see mentioned in various papers. LSTM does not have a significantly different structure from baseline RNN, but it uses different functions to compute the hidden state. We call LSTM’s “memory” cells; you can think of them as black boxes where the input consists of the previous state $h_{t-1}$ and the current input $x_t$. These “cells” decide which previous information and states need to be retained/remembered and which should be discarded. In practical applications, this approach has been found to effectively preserve associations from a long time ago. We will discuss the details of LSTM in future blog posts.

Implementing a Simple RNN

Next, we will implement a simple RNN using Python and the Theano library to solve a common yet useful NLP problem: the language model.

5.1 Introduction to Language Models

Language models are a common and general concept in NLP. They help us determine the probability of a phrase or sentence appearing (given a corpus). This is extremely useful in various scoring mechanisms, such as selecting translation results in translation systems or determining the string of characters in speech recognition. For specific content about language models, you can refer to our previous blog post on naive Bayes to N-gram language models (http://blog.csdn.net/han_xiaoyang/article/details/50646667).

Our goal here is to use RNN to construct a language model, i.e., to output the probability of a phrase or sentence

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$

In the above formula, $w_i$ depends on the previous $i-1$ words. Theoretically, RNNs can capture and remember all preceding information and dependencies, but in practice, the situation is a bit more complicated; the longer the time/sequence, the more information needs to be remembered, but the more information may also be lost.
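For example, for the three-token sentence “He left !” (tokenized as in the next section), the factorization reads:

$$P(\text{He}, \text{left}, \text{!}) = P(\text{He}) \cdot P(\text{left} \mid \text{He}) \cdot P(\text{!} \mid \text{He}, \text{left})$$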

5.2 Data and Preprocessing

We need some text corpus for the RNN to learn. Fortunately, unlike supervised learning where each sample needs to be labeled with a result, here we only need sentences or phrases. Therefore, we scraped 15,000 long comments from Reddit (similar to Baidu Tieba). We hope the RNN can learn the commenting style of these commenters (just like the magical Emperor Bar) and generate some sentences mimicking them. However, we all know that in machine learning problems, the effectiveness is limited by the quality of your data, so we need to preprocess the data.

5.2.1 Tokenization and Processing

The original text corpus needs some simple processing before being fed to the machine learning algorithm. For example, the most common processing for sentences is tokenization. For English, there are natural spaces separating words, while for Chinese, we need to segment sentences and words ourselves. For example, the English sentence “He left!” can be tokenized into three parts: “He”, “left”, “!”. Here we use Python’s NLTK package for tokenization.
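As a quick sketch of what that looks like with NLTK (the download call is only needed once):

```python
import nltk
# nltk.download("punkt")  # one-time download of the tokenizer models

print(nltk.word_tokenize("He left!"))
# ['He', 'left', '!']
```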

5.2.2 Removing Low-Frequency/Stop Words

In fact, the vast majority of words in our text will only appear once or twice, and these extremely infrequent words can be removed as they may hinder the final results. Too many words slow down the model’s training and, because they only appear once or twice, we cannot learn much related to them. Therefore, we have a simple method to handle them.

For example, we limit our vocabulary to the 8,000 most common words, and any word outside this vocabulary is replaced by the special token UNKNOWN_TOKEN. For instance, if the word “nonlinearities” is not among our 8,000 words, the sentence “nonlinearities are important in neural networks” becomes “UNKNOWN_TOKEN are important in neural networks”. We can treat UNKNOWN_TOKEN as just another word, and predicting it works like predicting any other word.

5.2.3 Start and End

After tokenization, a sentence becomes a string of words/phrases, and we perform some special processing, such as adding SENTENCE_START and SENTENCE_END to the beginning and end of the word/phrase string as special start and end markers. This way, predicting the first word of the sentence can be viewed as the probability of the first word appearing after SENTENCE_START.

5.2.4 Word Vector Mapping

For word encoding, we plan to use the simplest scheme: each word is replaced by its index in the vocabulary. For example, the input x for a simple sentence like “I left home” might become [0, 179, 341, 416], where 0 corresponds to the sentence start symbol SENTENCE_START and “I” has index 179 in the vocabulary, while the corresponding output y would be [179, 341, 416, 1]. This makes sense, since what we are doing is predicting each following word from the preceding ones.

The code is roughly as follows:
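A minimal sketch of these preprocessing steps, assuming NLTK and numpy; the corpus file name here is hypothetical, while the vocabulary size and special tokens follow the text above:

```python
import itertools
import numpy as np
import nltk

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the corpus, one comment per line (hypothetical file name), and
# wrap every sentence with start/end markers.
with open("reddit-comments.txt", encoding="utf-8") as f:
    sentences = itertools.chain(*(nltk.sent_tokenize(line.lower()) for line in f))
    sentences = ["%s %s %s" % (sentence_start_token, s, sentence_end_token)
                 for s in sentences]

# Tokenize, count word frequencies, and keep the (vocabulary_size - 1) most common words.
tokenized = [nltk.word_tokenize(s) for s in sentences]
word_freq = nltk.FreqDist(itertools.chain(*tokenized))
vocab = word_freq.most_common(vocabulary_size - 1)
index_to_word = [w for w, _ in vocab] + [unknown_token]
word_to_index = {w: i for i, w in enumerate(index_to_word)}

# Replace words outside the vocabulary with UNKNOWN_TOKEN.
tokenized = [[w if w in word_to_index else unknown_token for w in sent]
             for sent in tokenized]

# x is each sentence without its last token, y is the same sentence shifted by one.
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized], dtype=object)
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized], dtype=object)
```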

Below is an actual sentence mapped into x and y with this 8,000-word vocabulary:

x: SENTENCE_START what are n’t you understanding about this ? ! [0, 51, 27, 16, 10, 856, 53, 25, 34, 69]

y: what are n’t you understanding about this ? ! SENTENCE_END [51, 27, 16, 10, 856, 53, 25, 34, 69, 1]

5.3 Step-by-Step Construction of the RNN

We remember that the structure of RNN is as follows:

[Figure: RNN structure, unfolded in time (repeated from above)]

Let’s set up some data and build the model. Assume the input x is a sequence of words (as described above), where each $x_t$ is a single word. One thing worth noting: since we will be doing matrix operations, and word indices carry no numerical meaning (although “understanding” is encoded as 856 and “what” as 51, this implies no ordering between them), we need one simple extra step: one-hot encoding over the vocabulary size (8,000). That is, each word becomes an 8000×1 vector with a 1 in the position corresponding to that word and 0 everywhere else. The output $o$ has a similar representation: each $o_t$ is a vector of vocabulary length, where each element is the predicted probability of the word at that position.

Then we review the previously mentioned RNN forward computation formula:

$$\begin{aligned} s_t &= \tanh(Ux_t + Ws_{t-1}) \\ o_t &= \mathrm{softmax}(Vs_t) \end{aligned}$$

Let’s confirm the dimensions of various inputs and outputs. We assume the vocabulary size C = 8000 and the hidden state vector dimension H = 100. It is worth mentioning that we previously noted that this hidden state is a bit like the “memory” of humans, where previous data information is stored. In fact, setting the dimension higher can store more content, but it comes with a larger computational load, so this is a trade-off issue. Therefore, we have the following settings:

$$\begin{aligned} x_t & \in \mathbb{R}^{8000} \\ o_t & \in \mathbb{R}^{8000} \\ s_t & \in \mathbb{R}^{100} \\ U & \in \mathbb{R}^{100 \times 8000} \\ V & \in \mathbb{R}^{8000 \times 100} \\ W & \in \mathbb{R}^{100 \times 100} \end{aligned}$$

By the way, U, V, and W are the parameters of the RNN that we need to learn from the corpus. In total we need $2HC + H^2$ parameters; with C = 8000 and H = 100, that is approximately 1,610,000 parameters to estimate. The parameter count also hints at how much computation might be needed, so it’s good to have a rough idea in advance. Most of the computation here is manageable, because multiplying by a one-hot vector just picks out a single column (row) of a matrix; the largest cost is the $Vs_t$ product. This means that if computational resources are limited, we should keep the vocabulary size from getting too large.

Initializing Parameters

We define a class called RNNNumpy. Initializing U, V, and W requires some care; we cannot simply initialize them to 0, as that would cause a “symmetry” problem during training. Parameter initialization affects the final training result, and many papers have studied this question. In fact, the best initial values depend on the activation function we choose; for $\tanh$, the relevant papers recommend random values in $\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where n is the number of incoming connections from the previous layer. That may sound complicated, but don’t worry: generally speaking, as long as the parameters are initialized to small random numbers, the neural network will train normally.

Below is the corresponding Python code, where word_dim is the vocabulary size, hidden_dim is the hidden layer size, and bptt_truncate can be ignored for now.
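A sketch of such a class and its initialization, following the description above (treat the details as illustrative):

```python
import numpy as np

class RNNNumpy:
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        self.word_dim = word_dim          # vocabulary size C
        self.hidden_dim = hidden_dim      # hidden state size H
        self.bptt_truncate = bptt_truncate
        # Initialize U, V, W with small random values in [-1/sqrt(n), 1/sqrt(n)],
        # where n is the number of incoming connections for each matrix.
        self.U = np.random.uniform(-np.sqrt(1. / word_dim), np.sqrt(1. / word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (hidden_dim, hidden_dim))
```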

Forward Computation

This works much like the forward pass of a convolutional network: based on the weights, we compute the output and hidden state of the next layer/time step from the previous one. A simple implementation is as follows:

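A minimal sketch of that forward pass, attached to the RNNNumpy class sketched above (the softmax helper is an assumption, not from the original code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def forward_propagation(self, x):
    # x is a list of word indices for one sentence.
    T = len(x)
    # Keep all hidden states and outputs; the extra row s[-1] stays at zero
    # and acts as the initial hidden state.
    s = np.zeros((T + 1, self.hidden_dim))
    o = np.zeros((T, self.word_dim))
    for t in range(T):
        # Multiplying U by a one-hot vector just selects column x[t] of U.
        s[t] = np.tanh(self.U[:, x[t]] + self.W.dot(s[t - 1]))
        o[t] = softmax(self.V.dot(s[t]))
    return o, s

RNNNumpy.forward_propagation = forward_propagation
```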

From the above code, we can see that we not only calculated the output o but also the hidden state s. We will later use them to calculate gradients. Let’s repeat: each $o_t$ is a probability vector that indicates the probability of each word output. In our scenario, we only want to know what the next word is, and we are not too concerned about its probability; we can simply take the word with the maximum probability. Thus, we define a function predict to achieve this functionality:

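A sketch of that function, again attached to the class above; it simply takes the argmax of each output distribution:

```python
def predict(self, x):
    # Run the forward pass and return, at each time step, the index of the
    # highest-probability word.
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNNNumpy.predict = predict
```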

Let’s try running it:

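Running it on one preprocessed training sentence might look like this (X_train comes from the preprocessing sketch above; the length of 45 matches the example discussed next):

```python
np.random.seed(10)
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print(o.shape)   # e.g. (45, 8000): one 8000-way distribution per input word
print(o[:2])     # the first two rows of (untrained, essentially random) probabilities
```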

For each word in the sentence (here of length 45), our model obtained 8,000 probabilities (corresponding to the words in the vocabulary). We are just verifying whether the forward computation can proceed normally, but since the parameters U, V, and W are randomly chosen and untrained, the results are actually random.

Then let’s verify the predict function:
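A quick check on the same sentence (the outputs are meaningless until the model is trained):

```python
predictions = model.predict(X_train[10])
print(predictions.shape)   # e.g. (45,)
print(predictions)         # 45 word indices, essentially random for now
```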

Calculating the Loss Function

Like other machine learning algorithms, we need a loss function to represent how different the current prediction result is from the actual result, which is also convenient for our subsequent iteration and training. A commonly used loss function is the cross-entropy loss. If you are interested in understanding it in detail, you can check out our previous blog post on linear SVM and SoftMax classifier. If we have N training samples (which is obviously the number of words in the sentence here) and C categories (which is obviously the vocabulary size here), we can define the loss function to describe the difference between the predicted result o and the actual result y as follows:

$$L(y, o) = -\frac{1}{N} \sum_{n \in N} y_{n} \log o_{n}$$

The formula looks a bit daunting, but what it does is sum up the losses of each sample and output the result. We define a function called calculate_loss:
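A sketch of that loss computation over the whole training set, following the formula above (again attached to the RNNNumpy class):

```python
def calculate_total_loss(self, x, y):
    L = 0
    for i in range(len(y)):
        o, s = self.forward_propagation(x[i])
        # At each time step, pick out the predicted probability of the correct word.
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        L += -np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by N, the number of words in the training data.
    N = sum(len(y_i) for y_i in y)
    return self.calculate_total_loss(x, y) / N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss
```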

Training RNN with Stochastic Gradient Descent and Backpropagation Through Time

Now that we have the loss function, what we need to do is minimize this loss function to bring our prediction results closer to the actual results. The most commonly used optimization algorithm is SGD (Stochastic Gradient Descent). You can think of it as trying to go down a mountain; at each position, we look around to see which direction is steepest (the fastest way down) and then walk that way. The “stochastic” part means we do not need to use all samples to compute this direction; sometimes we can use just one sample.

Calculating this direction is the process of calculating gradients: mathematically, we take the partial derivatives of the loss function L with respect to the parameters. Here we have three parameters, U, V, and W, so we actually need to compute $\frac{\partial L}{\partial U}$, $\frac{\partial L}{\partial V}$, and $\frac{\partial L}{\partial W}$.

The detailed content will be discussed later; for now, just know that we need a backpropagation (along the time axis) to compute the gradients using the chain rule. The code is as follows:
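A sketch of backpropagation through time for this model; bptt_truncate (set in the constructor above) limits how many steps back the error is propagated:

```python
def bptt(self, x, y):
    T = len(y)
    o, s = self.forward_propagation(x)
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(T), y] -= 1.          # gradient of softmax + cross-entropy
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t])
        # Backpropagate through the tanh nonlinearity into the hidden state.
        delta_t = self.V.T.dot(delta_o[t]) * (1 - s[t] ** 2)
        # Propagate at most bptt_truncate steps back through time.
        for bptt_step in np.arange(max(0, t - self.bptt_truncate), t + 1)[::-1]:
            dLdW += np.outer(delta_t, s[bptt_step - 1])
            dLdU[:, x[bptt_step]] += delta_t
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step - 1] ** 2)
    return dLdU, dLdV, dLdW

RNNNumpy.bptt = bptt
```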

Then we need to update the parameters based on the gradients, which means we need to walk down the slope:
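One SGD step just nudges each parameter against its gradient, scaled by a learning rate (a sketch, attached to the class above):

```python
def sgd_step(self, x, y, learning_rate):
    # Compute the gradients with BPTT and take one small step downhill.
    dLdU, dLdV, dLdW = self.bptt(x, y)
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNNNumpy.sgd_step = sgd_step
```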

We need to check if this method can indeed help us go down the mountain. In other words, we need to see if our altitude decreases with each step. Let’s check this here:
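A minimal outer loop that trains for a number of epochs and prints the loss every few epochs, so we can watch whether it really decreases (all hyperparameters here are illustrative):

```python
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    losses = []
    for epoch in range(nepoch):
        if epoch % evaluate_loss_after == 0:
            loss = model.calculate_loss(X_train, y_train)
            losses.append((epoch, loss))
            print("Epoch %d: loss = %f" % (epoch, loss))
            # If the loss went up, try halving the learning rate.
            if len(losses) > 1 and losses[-1][1] > losses[-2][1]:
                learning_rate *= 0.5
                print("Loss increased, halving learning rate to %f" % learning_rate)
        for i in range(len(y_train)):
            model.sgd_step(X_train[i], y_train[i], learning_rate)
    return losses

# Quick sanity check on a small slice of the data:
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=10)
```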

It looks okay; it seems to be leading us toward success, O(∩_∩)O~

That’s a complete simple RNN implementation; at least, it is usable.

However, it’s awkward… have you noticed that on the CPU each iteration takes about 160ms? At this painfully slow speed, with the huge number of iterations needed to learn Xiao Si’s works, cough cough, training would likely take as long as it takes him to publish a few more books. So we had to bring out the heavy artillery: a GPU, together with the Python deep learning library Theano, for acceleration.

This is the advantage of using open-source libraries; with just a few function calls, we can get it done… Thank you, open-source spirit…

Generating Text

With the model in place, we can try to have it generate some text.
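A sketch of text generation: start with SENTENCE_START, repeatedly run the forward pass, sample the next word from the predicted distribution, and stop at SENTENCE_END (word_to_index, index_to_word, and the token names come from the preprocessing sketch above):

```python
def generate_sentence(model):
    # Every generated sentence starts with the start token.
    new_sentence = [word_to_index[sentence_start_token]]
    while new_sentence[-1] != word_to_index[sentence_end_token]:
        next_word_probs, _ = model.forward_propagation(new_sentence)
        # Sample the next word from the last output distribution, skipping UNKNOWN_TOKEN.
        sampled_word = word_to_index[unknown_token]
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs[-1])
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    # Drop the start/end markers and return the words themselves.
    return [index_to_word[w] for w in new_sentence[1:-1]]

print(" ".join(generate_sentence(model)))
```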

Then you will find that the model can imitate comments from Reddit and generate some sentences, such as:

  • Anyway, to the city scene you’re an idiot teenager.

  • What ? ! ! ! ! ignore!

  • Screw fitness, you’re saying: https

  • Thanks for the advice to keep my thoughts around girls :)

  • Yep, please disappear with the terrible generation.

It’s amazing that it even mimics punctuation and some expressions built from punctuation. Of course, this baseline RNN also produced many very awkward sentences, which is related to the size of our corpus and the dimension of the hidden state. Another point is that the plain RNN’s form of “memory” is not the most suitable; the recently popular LSTM and GRU, for instance, handle the remembering and forgetting of historical information much better. We will cover them in future blog posts.

Conclusion

This concludes all content about RNN introduction. RNN is indeed a very magical neural network, and it has unique advantages in processing time series information (such as NLP). More knowledge about RNN and NLP processing (such as LSTM) will be discussed in future blog posts. I believe that one day, computers will be able to write beautiful poetry, delightful prose, and long novels with intricate plots.

References:

1. Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs
2. Recurrent Neural Networks Tutorial, Part 2 – Implementing a RNN with Python, Numpy and Theano
3. The Unreasonable Effectiveness of Recurrent Neural Networks
4. char-rnn-chinese

Recommended Previous Articles

Overview of Machine Learning Algorithms (with Python and R Code)


How AlphaGo Works? A PhD from Carnegie Mellon University Explains It in 54 Slides


Reading Nature Papers: How AlphaGo was Trained

Machine Vision and Deep Neural Networks: Washing Away the Glamour, a Glimpse of the Pearls

Masterpiece: A Deep Dive into Fourier Transform


Introduction to the Machine Learning Column

“Machine Learning” is a column jointly launched by Big Data Digest and machine learning volunteers, focusing on the field of machine learning. The content will cover basic concepts, algorithms, domestic and international dynamics, in-depth insights, etc., and will be updated at least once a month.

Joining Requirements:

1. A strong interest in machine learning, understanding the basic concepts and algorithms of machine learning, and having relevant project experience or having participated in related data competitions is preferred.

2. Keeping up with domestic and international developments in machine learning, being able to integrate background knowledge, and having certain personal insights is preferred.

3. Having some translation experience and being able to provide accurate translations on time.

Column Chief Editors: Han Xiaoyang and Long Xincheng, with many years of experience in machine learning/data mining/deep learning projects, focused on the application and optimization of machine learning algorithms on massive data. They are committed to sharing their insights and experiences in machine learning through technical blogs and other means.

Volunteer Team Introduction

Reply “Volunteer” to the Big Data Digest public account to learn how to join us.
