Introduction

The purpose of this article is to provide an intuitive understanding of the functions and structures of recurrent neural networks.
A neural network typically takes independent variables (or a set of independent variables) and dependent variables, then learns the mapping between them (a process we call training). Once training is complete, given a new independent variable, it can predict the corresponding dependent variable.
But what if the order of the data matters? What if the order of all the independent variables is crucial?
Let me explain this intuitively.

Assume each ant is an independent variable. If one ant moves in a different direction, it doesn’t matter to the other ants, right? But what if the order of the ants is important?

In that case, if one ant misses a step or leaves the line, it affects all the ants behind it.
So, in machine learning, what kinds of data have an order that matters?
- Natural language data, where word order matters
- Speech data
- Time series data
- Video/music sequence data
- Stock market data
- And so on
So how does an RNN handle data where the overall order is important? Let’s take natural language text as an example to explain the RNN.
Suppose I am performing sentiment analysis on user reviews of a movie, ranging from “This movie is good” (positive) to “This movie is bad” (negative).
We could classify them using a simple bag-of-words (BOW) model and predict positive or negative. But wait: what if the review is “This movie is not good”?
A BOW model might read this as a positive signal because of the word “good”, but in reality it is negative. An RNN takes the word order into account and can learn to predict it as negative.
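To see why order-blind counting fails, here is a quick check (the two example phrases are mine, not the article’s): two reviews with opposite sentiment can have exactly the same bag of words.

```python
from collections import Counter

positive = "not bad , rather good".split()     # opposite sentiments...
negative = "not good , rather bad".split()
print(Counter(positive) == Counter(negative))  # True: BOW cannot tell them apart
```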
How Does an RNN Achieve This?
Types of RNN Models
1. One-to-Many
RNN takes an input, such as an image, and generates a sequence of words.

2. Many-to-One
RNN takes a sequence of words as input and generates a single output.

3. Many-to-Many
RNN takes a sequence as input and generates a sequence as output (for example, machine translation).
Next, we focus on the second mode, Many-to-One. The input to the RNN is split into time steps.
Example: Input(X) = ["this", "movie", "is", "good"]
The input at time step 0 is x(0) = "this", at time step 1 it is x(1) = "movie", at time step 2 it is x(2) = "is", and at time step 3 it is x(3) = "good".
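For instance, the four time-step inputs can be built as one-hot vectors like this (the six-word toy vocabulary is my own choice for illustration):

```python
import numpy as np

vocab = ["this", "movie", "is", "good", "bad", "not"]  # hypothetical toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot column vector for a word in the toy vocabulary."""
    v = np.zeros((len(vocab), 1))
    v[word_to_index[word]] = 1.0
    return v

# Input(X) = ["this", "movie", "is", "good"], one vector per time step:
X = [one_hot(w) for w in ["this", "movie", "is", "good"]]
# X[0] is x(0) for "this", X[1] is x(1) for "movie", and so on.
```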
Network Architecture and Mathematical Formulas
Now let’s delve into the mathematical world of RNNs.
First, let’s understand what an RNN cell contains! I hope and assume everyone knows about feedforward neural networks; here is a quick summary of FFNNs:

In a feedforward neural network, we have X (input), H (hidden), and Y (output). We can have any number of hidden layers, but the weights W of each hidden layer and the input weights of each neuron are different.
Above, Wy10 and Wy11 are the weights of two different layers with respect to the output Y, while Wh00, Wh01, etc. are the weights of different neurons with respect to the input.
Because of the time steps, an RNN cell contains a set of feedforward neural networks, with sequential inputs, sequential outputs, multiple time steps, and multiple hidden layers.
Unlike an FFNN, here we compute the hidden-layer values not only from the current input but also from the values of the previous time step, and the hidden-layer weights W are the same across all time steps. Below is a complete picture of the RNN and the mathematical formulas involved.

In the image, we are calculating the value of the hidden layer at time step t:

h_t = f(U · x_t + W · h_(t-1))

Here x_t is the input at time step t, h_(t-1) is the hidden state of the previous time step, U is the input-to-hidden weight matrix, and W is the hidden-to-hidden weight matrix. As mentioned, W is the same for all time steps. The activation function f can be tanh, ReLU, sigmoid, etc.

Above, we have only calculated h_t; the hidden states at all other time steps are calculated in the same way.
Steps:
1. Calculate h_0 from x_0 and the initial hidden state (usually a zero vector).
2. Calculate h_1 from x_1 and h_0.
3. Calculate h_2 from x_2 and h_1.
4. Calculate h_3 from x_3 and h_2, and so on; finally, calculate the output y from the last hidden state.
Note that:
1. U and W are weight matrices, shared across all time steps.
2. We can even calculate the hidden states for all time steps first, and then calculate the output values.
3. The weight matrices are initialized randomly.
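To make the steps concrete, here is a minimal NumPy sketch of this forward pass (the toy sizes, the random initialization scale, and the sigmoid output for a binary positive/negative label are my own illustrative choices, not from the article):

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 6, 4                       # toy sizes for illustration

U = np.random.randn(hidden_size, vocab_size) * 0.1   # input -> hidden weights
W = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden weights (shared)
V = np.random.randn(1, hidden_size) * 0.1            # hidden -> output weights

def forward(xs):
    """Run the RNN over a list of one-hot column vectors, one per time step."""
    h = np.zeros((hidden_size, 1))                   # initial hidden state
    hs = []
    for x in xs:                                     # the same U and W at every step
        h = np.tanh(U @ x + W @ h)                   # h_t = f(U·x_t + W·h_(t-1))
        hs.append(h)
    y = 1 / (1 + np.exp(-(V @ hs[-1])))              # many-to-one: output from last h
    return hs, y

# x(0..3) for ["this", "movie", "is", "good"] with the toy vocabulary above
xs = [np.eye(vocab_size)[:, [i]] for i in (0, 1, 2, 3)]
hs, y = forward(xs)
print(y)                                             # untrained prediction near 0.5
```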
Once the feedforward pass is complete, we need to calculate the error and use backpropagation to propagate it backward, using cross-entropy as the cost function.
BPTT (Backpropagation Through Time)
If you know how a normal neural network works, the rest is quite simple. If you’re unsure, you can refer to previous articles on artificial neural networks from this account.
We need to calculate the following:
1. How does the total error change with respect to the outputs (hidden and output units)?
2. How do the outputs change with respect to the weights (U, V, W)?
Since W is the same for all time steps, we need to propagate all the way back through the time steps to update it.
Remember that backpropagation in an RNN is similar to that in an ordinary neural network, except that here the current time step is calculated from the previous time steps, so we must traverse backwards from the last time step to the first.
If we apply the chain rule to the error at time step t, we get:

∂E_t/∂W = Σ_{k=0..t} (∂E_t/∂ŷ_t) · (∂ŷ_t/∂h_t) · (∂h_t/∂h_k) · (∂h_k/∂W)

Since W is the same for all time steps, the terms expand more and more according to the chain rule: ∂h_t/∂h_k is itself a product of the factors ∂h_j/∂h_(j-1) for j = k+1, …, t.
In Richard Socher’s RNN lecture slides [1], a similar but slightly different derivation of these formulas can be seen.
A similar but more concise RNN formulation:

h_t = σ(W^(hh) · h_(t-1) + W^(hx) · x_t)
ŷ_t = softmax(W^(S) · h_t)

The total error is the sum of the errors at each time step t:

∂E/∂W = Σ_t ∂E_t/∂W

Application of the chain rule:

∂E_t/∂W = Σ_{k=1..t} (∂E_t/∂ŷ_t) · (∂ŷ_t/∂h_t) · (∂h_t/∂h_k) · (∂h_k/∂W)

So here, Socher’s W^(hh) is the same as our W, and W^(hx) is the same as our U. U, V, and W can be updated using any optimization algorithm, such as gradient descent.
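Here is a matching NumPy sketch of BPTT for the many-to-one case above (the sigmoid output with binary cross-entropy is my assumption; note how the gradients of the shared U and W accumulate as we walk back through the time steps):

```python
import numpy as np

def bptt(xs, target, U, W, V):
    """Gradients of a binary cross-entropy loss w.r.t. U, W, V for a many-to-one RNN."""
    # Forward pass, keeping every hidden state for the backward pass.
    h = np.zeros((W.shape[0], 1))
    hs = [h]                                  # hs[0] is the initial hidden state
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        hs.append(h)
    y = 1 / (1 + np.exp(-(V @ hs[-1])))       # prediction in (0, 1)

    # Backward pass.
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    dz = y - target                           # dE/dz for sigmoid + cross-entropy
    dV = dz @ hs[-1].T
    dh = V.T @ dz                             # gradient flowing into the last hidden state
    for t in reversed(range(len(xs))):        # traverse the time steps backwards
        da = dh * (1 - hs[t + 1] ** 2)        # through tanh: dtanh(a)/da = 1 - h_t^2
        dU += da @ xs[t].T                    # shared U: contributions sum over steps
        dW += da @ hs[t].T                    # shared W: contributions sum over steps
        dh = W.T @ da                         # chain rule back to h_(t-1)
    return dU, dW, dV

# Example usage with toy shapes (matching the forward-pass sketch above):
rng = np.random.default_rng(0)
U0 = rng.normal(size=(4, 6)) * 0.1
W0 = rng.normal(size=(4, 4)) * 0.1
V0 = rng.normal(size=(1, 4)) * 0.1
xs = [np.eye(6)[:, [i]] for i in (0, 1, 2, 3)]
dU, dW, dV = bptt(xs, target=1.0, U=U0, W=W0, V=V0)
```

With these gradients, a plain gradient-descent update is simply W -= learning_rate * dW, and likewise for U and V.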
Back to the Example
Now let’s return to our sentiment analysis problem. Here is the RNN:

We provide a word vector or a one-hot encoded vector as the input for each word and perform the feedforward pass and BPTT. Once training is complete, we can give the model new text for prediction. It will learn something like not + positive word = negative.
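The article does not name a framework, but as an illustration, here is a minimal many-to-one sentiment model in Keras; the six-word vocabulary, the two toy training sentences, and all layer sizes are my own assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy vocabulary: this=0, movie=1, is=2, good=3, bad=4, not=5
X = np.array([[0, 1, 2, 3],    # "this movie is good"  -> positive
              [1, 2, 5, 3]])   # "movie is not good"   -> negative
y = np.array([1, 0])

model = tf.keras.Sequential([
    layers.Embedding(input_dim=6, output_dim=8),  # word index -> word vector
    layers.SimpleRNN(16),                         # many-to-one: last hidden state only
    layers.Dense(1, activation="sigmoid"),        # positive/negative probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=200, verbose=0)            # toy run, far from a real setup
print(model.predict(X, verbose=0))                # higher score for the first review
```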
The RNN’s problem → the vanishing/exploding gradient problem
Since W is the same for all time steps, during backpropagation, as we go back through the time steps to adjust the weights, the gradient signal can become either too weak or too strong, leading to the vanishing or exploding gradient problem.
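A quick NumPy illustration of the effect (the matrix size and the two scales are arbitrary choices of mine): the gradient that flows back k time steps is multiplied by roughly the same W at each step, so its norm shrinks or grows geometrically.

```python
import numpy as np

np.random.seed(0)
H = 16                                     # arbitrary hidden size

for scale in (0.5, 1.5):                   # small vs. large recurrent weights
    W = np.random.randn(H, H) * scale / np.sqrt(H)
    g = np.ones((H, 1))                    # stand-in for a gradient vector
    for _ in range(30):                    # 30 time steps back
        g = W.T @ g                        # one step of backpropagation through W
    print(scale, np.linalg.norm(g))        # vanishes for 0.5, explodes for 1.5
```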
References
[1] Richard Socher’s RNN lecture slides: http://cs224d.stanford.edu/lectures/CS224d-Lecture7.pdf
[2] English original: https://medium.com/towards-artificial-intelligence/a-brief-summary-of-maths-behind-rnn-recurrent-neural-networks-b71bbc183ff
– EOF –