
Author: Victor Zhou
Translator: Wang Yutong
Proofreader: Wu Jindi
This article introduces the basics of recurrent neural networks (specifically, vanilla RNNs), how they work, and how to implement one in Python.
https://victorzhou.com/blog/intro-to-neural-networks/
Red represents input, green represents the RNN itself, and blue represents output. Source: Andrej Karpathy
- Machine translation (e.g., Google Translate) uses a “many-to-many” RNN: the original text sequence is fed into the RNN, and the RNN outputs the translated text.
- Sentiment analysis (e.g., is this a positive or negative review?) is typically done with a “many-to-one” RNN: the text to be analyzed is fed into the RNN, and the RNN produces a single output classification (e.g., this is a positive review).
- From the previous hidden state and the next input, we can compute the next hidden state (see the update equations sketched below).
- From that new hidden state, we can then compute the next output.
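In the notation used by the code later in this article (Wxh, Whh, Why, bh, by), these two steps amount to the following update equations, sketched here in LaTeX as a quick reference:

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)

y_t = W_{hy} h_t + b_y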
Many-to-many RNN
https://github.com/vzhou842/rnn-from-scratch/blob/master/data.py
Many-to-one RNN
data.py
train_data = {
  'good': True,
  'bad': False,
  # ... more data
}

test_data = {
  'this is happy': True,
  'i am good': True,
  # ... more data
}
True=positive, False=negative
main.py
from data import train_data, test_data

# Create the vocabulary.
vocab = list(set([w for text in train_data.keys() for w in text.split(' ')]))
vocab_size = len(vocab)
print('%d unique words found' % vocab_size) # 18 unique words found
main.py
# Assign indices to each word.
word_to_idx = { w: i for i, w in enumerate(vocab) }
idx_to_word = { i: w for i, w in enumerate(vocab) }
print(word_to_idx['good']) # 16 (this may change)
print(idx_to_word[0]) # sad (this may change)
main.py
import numpy as np

def createInputs(text):
  '''
  Returns an array of one-hot vectors representing the words
  in the input text string.
  - text is a string
  - Each one-hot vector has shape (vocab_size, 1)
  '''
  inputs = []
  for w in text.split(' '):
    v = np.zeros((vocab_size, 1))
    v[word_to_idx[w]] = 1
    inputs.append(v)
  return inputs
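As a quick, hypothetical usage check (not part of the original main.py), assuming each word of the phrase appears in the training vocabulary:

example_inputs = createInputs('i am good')
print(len(example_inputs))      # 3, one one-hot vector per word
print(example_inputs[0].shape)  # (vocab_size, 1), i.e. (18, 1) here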
rnn.py
import numpy as np
from numpy.random import randn

class RNN:
  # A Vanilla Recurrent Neural Network.

  def __init__(self, input_size, output_size, hidden_size=64):
    # Weights
    self.Whh = randn(hidden_size, hidden_size) / 1000
    self.Wxh = randn(hidden_size, input_size) / 1000
    self.Why = randn(output_size, hidden_size) / 1000

    # Biases
    self.bh = np.zeros((hidden_size, 1))
    self.by = np.zeros((output_size, 1))
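As a quick sanity check of the weight shapes (this snippet is not part of the original rnn.py; the numbers assume the 18-word vocabulary above and 2 output classes):

rnn = RNN(input_size=18, output_size=2)  # hidden_size defaults to 64
print(rnn.Wxh.shape)  # (64, 18), input to hidden
print(rnn.Whh.shape)  # (64, 64), hidden to hidden
print(rnn.Why.shape)  # (2, 64), hidden to output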
rnn.py
class RNN:
  # ...

  def forward(self, inputs):
    '''
    Perform a forward pass of the RNN using the given inputs.
    Returns the final output and hidden state.
    - inputs is an array of one-hot vectors with shape (input_size, 1).
    '''
    h = np.zeros((self.Whh.shape[0], 1))

    # Cache the inputs and all hidden states; the backward pass needs them.
    self.last_inputs = inputs
    self.last_hs = { 0: h }

    # Perform each step of the RNN
    for i, x in enumerate(inputs):
      h = np.tanh(self.Wxh @ x + self.Whh @ h + self.bh)
      self.last_hs[i + 1] = h

    # Compute the output
    y = self.Why @ h + self.by

    return y, h
main.py
# ...

def softmax(xs):
  # Applies the Softmax Function to the input array.
  return np.exp(xs) / sum(np.exp(xs))

# Initialize our RNN!
rnn = RNN(vocab_size, 2)

inputs = createInputs('i am very good')
out, h = rnn.forward(inputs)
probs = softmax(out)
print(probs) # [[0.50000095], [0.49999905]]
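For reference, the Softmax function used above simply normalizes the raw outputs into a probability distribution:

\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}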
Links:
https://victorzhou.com/blog/softmax/
https://github.com/vzhou842/rnn-from-scratch
main.py
# Loop over each training example
for x, y in train_data.items():
  inputs = createInputs(x)
  target = int(y)

  # Forward
  out, _ = rnn.forward(inputs)
  probs = softmax(out)

  # Build dL/dy
  d_L_d_y = probs
  d_L_d_y[target] -= 1

  # Backward
  rnn.backprop(d_L_d_y)
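The two lines that build dL/dy come from differentiating the cross-entropy loss L = -\ln(p_c) (where c is the correct class) through the Softmax. The standard result, which is exactly what d_L_d_y = probs followed by d_L_d_y[target] -= 1 computes, is:

\frac{\partial L}{\partial y_i} = \begin{cases} p_i - 1 & \text{if } i = c \\ p_i & \text{otherwise} \end{cases}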
rnn.py
class RNN:
  # ...

  def backprop(self, d_y, learn_rate=2e-2):
    '''
    Perform a backward pass of the RNN.
    - d_y (dL/dy) has shape (output_size, 1).
    - learn_rate is a float.
    '''
    n = len(self.last_inputs)

    # Calculate dL/dWhy and dL/dby.
    d_Why = d_y @ self.last_hs[n].T
    d_by = d_y
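Both gradients follow from the output equation y = W_{hy} h_n + b_y via the chain rule:

\frac{\partial L}{\partial W_{hy}} = \frac{\partial L}{\partial y} \, h_n^T, \qquad \frac{\partial L}{\partial b_y} = \frac{\partial L}{\partial y}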
rnn.py
class RNN:
  # ...

  def backprop(self, d_y, learn_rate=2e-2):
    '''
    Perform a backward pass of the RNN.
    - d_y (dL/dy) has shape (output_size, 1).
    - learn_rate is a float.
    '''
    n = len(self.last_inputs)

    # Calculate dL/dWhy and dL/dby.
    d_Why = d_y @ self.last_hs[n].T
    d_by = d_y

    # Initialize dL/dWhh, dL/dWxh, and dL/dbh to zero.
    d_Whh = np.zeros(self.Whh.shape)
    d_Wxh = np.zeros(self.Wxh.shape)
    d_bh = np.zeros(self.bh.shape)

    # Calculate dL/dh for the last h.
    d_h = self.Why.T @ d_y

    # Backpropagate through time.
    for t in reversed(range(n)):
      # An intermediate value: dL/dh * (1 - h^2)
      temp = ((1 - self.last_hs[t + 1] ** 2) * d_h)

      # dL/db = dL/dh * (1 - h^2)
      d_bh += temp

      # dL/dWhh = dL/dh * (1 - h^2) * h_{t-1}
      d_Whh += temp @ self.last_hs[t].T

      # dL/dWxh = dL/dh * (1 - h^2) * x
      d_Wxh += temp @ self.last_inputs[t].T

      # Next dL/dh = dL/dh * (1 - h^2) * Whh
      d_h = self.Whh @ temp

    # Clip to prevent exploding gradients.
    for d in [d_Wxh, d_Whh, d_Why, d_bh, d_by]:
      np.clip(d, -1, 1, out=d)

    # Update weights and biases using gradient descent.
    self.Whh -= learn_rate * d_Whh
    self.Wxh -= learn_rate * d_Wxh
    self.Why -= learn_rate * d_Why
    self.bh -= learn_rate * d_bh
    self.by -= learn_rate * d_by
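The (1 - h^2) factor used throughout the loop is the derivative of tanh evaluated at the cached hidden state, since each hidden state is computed as h_t = tanh(...):

\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)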
main.py
import random

def processData(data, backprop=True):
  '''
  Returns the RNN's loss and accuracy for the given data.
  - data is a dictionary mapping text to True or False.
  - backprop determines if the backward phase should be run.
  '''
  items = list(data.items())
  random.shuffle(items)

  loss = 0
  num_correct = 0

  for x, y in items:
    inputs = createInputs(x)
    target = int(y)

    # Forward
    out, _ = rnn.forward(inputs)
    probs = softmax(out)

    # Calculate loss / accuracy
    loss -= np.log(probs[target])
    num_correct += int(np.argmax(probs) == target)

    if backprop:
      # Build dL/dy
      d_L_d_y = probs
      d_L_d_y[target] -= 1

      # Backward
      rnn.backprop(d_L_d_y)

  return loss / len(data), num_correct / len(data)
main.py
# Training loop
for epoch in range(1000):
  train_loss, train_acc = processData(train_data)

  if epoch % 100 == 99:
    print('--- Epoch %d' % (epoch + 1))
    print('Train: Loss %.3f | Accuracy: %.3f' % (train_loss, train_acc))

    test_loss, test_acc = processData(test_data, backprop=False)
    print('Test: Loss %.3f | Accuracy: %.3f' % (test_loss, test_acc))
--- Epoch 100
Train: Loss 0.688 | Accuracy: 0.517
Test: Loss 0.700 | Accuracy: 0.500
--- Epoch 200
Train: Loss 0.680 | Accuracy: 0.552
Test: Loss 0.717 | Accuracy: 0.450
--- Epoch 300
Train: Loss 0.593 | Accuracy: 0.655
Test: Loss 0.657 | Accuracy: 0.650
--- Epoch 400
Train: Loss 0.401 | Accuracy: 0.810
Test: Loss 0.689 | Accuracy: 0.650
--- Epoch 500
Train: Loss 0.312 | Accuracy: 0.862
Test: Loss 0.693 | Accuracy: 0.550
--- Epoch 600
Train: Loss 0.148 | Accuracy: 0.914
Test: Loss 0.404 | Accuracy: 0.800
--- Epoch 700
Train: Loss 0.008 | Accuracy: 1.000
Test: Loss 0.016 | Accuracy: 1.000
--- Epoch 800
Train: Loss 0.004 | Accuracy: 1.000
Test: Loss 0.007 | Accuracy: 1.000
--- Epoch 900
Train: Loss 0.002 | Accuracy: 1.000
Test: Loss 0.004 | Accuracy: 1.000
--- Epoch 1000
Train: Loss 0.002 | Accuracy: 1.000
Test: Loss 0.003 | Accuracy: 1.000
https://github.com/vzhou842/rnn-from-scratch
- Learn about Long Short-Term Memory networks (LSTMs), a more powerful and popular RNN architecture, or their well-known variant, Gated Recurrent Units (GRUs).
- Experiment with larger and better RNNs using proper ML libraries like TensorFlow, Keras, or PyTorch.
- Learn about Bidirectional RNNs, which process sequences in both the forward and backward directions so the output layer gets more information.
- Experiment with word embeddings like GloVe or Word2Vec, which convert words into more useful vector representations.
- Check out the Natural Language Toolkit (NLTK), a Python library for processing human language data.
Editor: Yu Tengkai
Proofreader: Yang Xuejun
Translator's Profile
Wang Yutong, a master's student in statistics at UIUC with an undergraduate major in statistics, currently focused on improving my coding skills. In the transition from theory to application, I respect data and continue to evolve.
Translation Team Recruitment Information
Job description: carefully translate selected foreign articles into fluent Chinese. If you are an international student in data science, statistics, or computer science, are working overseas in a related field, or are simply confident in your language skills, you are welcome to join our translation team.
What you get: regular translation training to improve volunteers' translation skills and awareness of cutting-edge data science; volunteers overseas can stay connected with technology developments back home; and THU Data Team's industry-university-research background provides good development opportunities for volunteers.
Other benefits: you will have the opportunity to work with data scientists from well-known companies and with students from prestigious universities such as Peking University and Tsinghua University, as well as overseas institutions.
Click “Read Original” at the end of the article to join the Data Team~