Author: Debarko De Source: Hackernoon, Machine Heart
What is a Recurrent Neural Network (RNN)? How do they work? Where can they be used? This article attempts to answer these questions and also showcases an RNN implementation demo that you can expand upon as needed.
RNN Architecture
Prerequisites: familiarity with Python and with CNNs is essential. Understanding CNNs is needed in order to compare them with RNNs: why and where an RNN does better than a CNN.
Let’s start with the term “recurrent.” Why is it called recurrent? Recurrent means:
Occurring frequently or repeatedly
These types of neural networks are called recurrent neural networks because they perform the same operation repeatedly on a set of sequential inputs. The later sections of this article will discuss the significance of this operation.
Why do we need RNNs?
You might be wondering, with networks like convolutional networks performing exceptionally well, why do we need other types of networks? There is a specific case where RNNs are needed. To explain RNNs, you first need to understand the concept of sequences, so let’s talk about sequences.
Sequences are streams of data that are interdependent (finite or infinite), such as time series data, informative strings, conversations, etc. In a conversation, a sentence might have one meaning, but the overall conversation could have a completely different meaning. Time series data like stock market data also works this way; a single data point represents the current price, but the data throughout the day will change differently, prompting us to make buy or sell decisions.
When the input data is interdependent and sequential, CNNs generally do not perform well. A CNN has no notion of correlation between the previous input and the next one, so all of its outputs are independent: it takes an input and produces an output based on the trained model, and if you run 100 different inputs through it, none of the outputs is influenced by the earlier ones. But think about text generation or translation: every generated word depends on the words generated before it (and in some cases on the words that follow, which we will not discuss here). So you need the output to carry some bias from previous outputs, and this is where RNNs come in. RNNs keep a certain memory of what has happened earlier in the data sequence, which gives the system context. In theory, RNNs have unlimited memory and can recall arbitrarily far back; in practice, they can only recall the last few steps.
The comparison with humans here is only an analogy to make the idea relatable; I am not claiming this is how the human brain actually makes decisions (I do not understand even 0.1% of the human brain).
When to use RNNs?
RNNs can be used in many different areas. Here are the fields where RNNs are most commonly applied.
1. Language Modeling and Text Generation
Given a sequence of words, try to predict the likelihood of the next word. This is very useful in translation tasks, as the most likely sentence will be composed of the most probable words.
2. Machine Translation
Translating text content from one language to another uses one or more forms of RNN. All practical systems used daily employ some advanced version of RNN.
3. Speech Recognition
Predicting speech segments based on input sound waves to determine words.
4. Generating Image Descriptions
A very broad application of RNN is to understand what is happening in an image and make reasonable descriptions. This is the combined effect of CNN and RNN. CNN performs image segmentation, and RNN reconstructs descriptions based on segmented data. Although this application is basic, the possibilities are endless.
5. Video Tagging
Videos can be tagged frame by frame for video search.
Diving Deeper
This article proceeds according to the following topics. Each part builds on the previous one, so do not skip around.
- Feedforward Networks
- Recurrent Networks
- Recurrent Neurons
- Backpropagation Through Time (BPTT)
- RNN Implementation
Introduction to Feedforward Networks
Feedforward networks channel information through a series of operations performed at each node of the network. The information passes straight through, each layer exactly once, and is never fed back; this is what distinguishes them from recurrent networks. In general, a feedforward network accepts an input and produces an output from it, which is also how most supervised learning proceeds; the output could be a classification result. It behaves much like a CNN: the output can be labels such as cat or dog.
Feedforward networks are trained on a set of pre-labeled data. The purpose of the training phase is to minimize the error when the feedforward network guesses the category. Once training is complete, we can classify new batches of data using the trained weights.
A typical feedforward network architecture
One more thing to note: in a feedforward network, whatever image is presented to the classifier during the testing phase does not change the weights, so it does not affect the next decision. This is a significant difference between feedforward and recurrent networks.
Unlike recurrent networks, feedforward networks remember nothing about previous inputs at test time; the only thing they consider is the example in front of them at that moment. What they retain about past data comes solely from the training phase.
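To make the contrast concrete, here is a minimal sketch of a feedforward classifier in Keras (the library used for the implementation later in this article). The layer sizes and the ten-class setup are my own illustrative assumptions, not anything from the original demo; the point is only that such a network is stateless at test time.

```python
# A minimal sketch of a feedforward classifier (illustrative only; the layer
# sizes and the 10-class setup are assumptions, not from the article).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),              # e.g. a flattened 28x28 image
    layers.Dense(128, activation="relu"),   # hidden layer
    layers.Dense(10, activation="softmax"), # one probability per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Stateless at test time: the same input always yields the same output,
# no matter what the network was shown before.
x = np.random.rand(1, 784).astype("float32")
print(np.allclose(model.predict(x, verbose=0), model.predict(x, verbose=0)))  # True
```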
Recurrent Networks
Recurrent networks, by contrast, take as input not only the current example but also what they have perceived previously.
Let us start from a multilayer perceptron. In its simplest form, it has an input layer, a hidden layer with a specific activation function, and finally an output layer that produces the output.
Example of a multilayer perceptron architecture
If the number of hidden layers in the example above increases, the input layer still only receives the input; the first hidden layer then passes its activations to the next hidden layer, and so on, until the output layer is reached. Each hidden layer has its own weights and biases. The question now becomes: can we feed input into the hidden layers as well?
Each layer has its own weights (W), biases (B), and activation function (F). These layers behave differently, and merging them directly is technically tricky. To merge them, we make the weights and biases of all the layers the same, as shown in the figure below:
Now we can combine all layers together. All hidden layers can be combined into one recurrent layer. So it looks like this:
We provide input to the hidden layer at every step. A recurrent neuron now stores all of the previous inputs and combines that information with the current input, so it captures some correlation between the current step of the data and the steps before it. The decision at time step t-1 affects the decision made at time step t. This is very similar to how humans make decisions in life: we combine the current data with recent data to help solve the specific problem at hand. The example is simple, but in principle it matches our decision-making ability, and it makes me wonder whether we as humans are truly intelligent or simply carry very advanced neural network models: the decisions we make are just training on the data collected over a lifetime. So, once we have advanced models and systems that can store and compute that data in a reasonable amount of time, can we digitize a brain? And when we have models that are better and faster than the brain (trained on data from millions of people), what happens then?
An interesting point from another article (https://deeplearning4j.org/lstm.html): Humans are always troubled by their own behavior.
Let’s illustrate the above explanation with an example of predicting the next letter after a series of letters. Imagine a word with 8 letters: namaskar.
namaskar (Namaste): A traditional greeting or gesture of respect in India, where palms are brought together in front of or at the chest.
If we try to find the 8th letter after inputting the first 7 letters into the network, what will happen? The hidden layer will go through 8 iterations. If we unfold the network, it becomes an 8-layer network, with each layer corresponding to a letter. So you can imagine a regular neural network being repeated multiple times. The number of unrollings is directly related to how far back it remembers the data.
How Recurrent Neural Networks Work
Recurrent Neurons
Here we will delve deeper into the actual neurons responsible for decision-making. Taking the previously mentioned namaskar as an example, after providing the first 7 letters, we try to find the 8th letter. The complete vocabulary of input data is {n,a,m,s,k,r}. In the real world, words or sentences are much more complex. To simplify the problem, we use the following simple vocabulary.
In the image above, the hidden layer (the RNN block) applies the formula to the current input and the previous state. In this case there is nothing before the letter n of namaskar, so we work from the current input alone and move on to the next letter, a. While processing a, the hidden layer applies the formula to combine the information from the current input a with the state inferred from n just before. Each point at which an input passes through the network is a time step: the input at time step t is a, and the input at time step t-1 was n. Applying the formula to both n and a gives us a new state.
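Before writing the formula down, here is a small sketch of how this toy vocabulary and the letters of namaskar could be turned into one-hot vectors for the network. NumPy and this particular encoding are my own choices for illustration; the original demo does not necessarily prepare its input this way.

```python
# A sketch (using NumPy, an assumption on my part) of one-hot encoding the
# letters of "namaskar" over the vocabulary {n, a, m, s, k, r}.
import numpy as np

word = "namaskar"
vocab = sorted(set(word))                 # ['a', 'k', 'm', 'n', 'r', 's']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    v = np.zeros(len(vocab))
    v[char_to_idx[char]] = 1.0
    return v

# The first 7 letters are the inputs; the 8th letter is what we want to predict.
inputs = [one_hot(c) for c in word[:7]]   # n, a, m, a, s, k, a
target = one_hot(word[7])                 # r
print(len(inputs), "time steps, vocabulary size", len(vocab))
```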
The formula for the current state is as follows: h_t = f(h_t-1, x_t)
h_t is the new state, h_t-1 is the previous state. x_t is the input at time t. After applying the same formula to previous time steps, we can now perceive the previous inputs. We will check 7 such inputs, all with the same weights and functions at each step.
Now let’s define f() in a simple way, using the tanh activation function. The recurrent weights are held in the matrix W_hh and the input weights in the matrix W_xh. The formula becomes: h_t = tanh(W_hh · h_t-1 + W_xh · x_t)
The example above only takes the last step as memory, so it only combines with the data from the most recent step. To strengthen the network's memory and keep longer sequences in it, we would have to add more states to the equation, such as h_t-2, h_t-3, and so on. Once the final state has been computed, the output can be produced as: y_t = W_hy · h_t
Where y_t is the output. We compare the output with the actual output and then calculate the error value. The network learns by updating weights through backpropagation of the error. The later parts of this article will discuss backpropagation.
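Putting the two formulas together, here is a minimal NumPy sketch of the forward pass over the first 7 letters of namaskar. The hidden size of 4 and the random weight initialization are assumptions made purely for illustration, so the untrained prediction is of course arbitrary.

```python
# A minimal sketch of the forward pass described above:
# h_t = tanh(W_hh . h_t-1 + W_xh . x_t) and y_t = W_hy . h_t.
# Hidden size and random initialisation are illustrative assumptions.
import numpy as np

word = "namaskar"
vocab = sorted(set(word))                          # 6 characters
char_to_idx = {c: i for i, c in enumerate(vocab)}
vocab_size, hidden_size = len(vocab), 4

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

h = np.zeros(hidden_size)                          # nothing comes before 'n'
for c in word[:7]:                                 # the first 7 letters
    x = np.zeros(vocab_size)
    x[char_to_idx[c]] = 1.0                        # one-hot input x_t
    h = np.tanh(W_hh @ h + W_xh @ x)               # new state h_t, same weights every step

y = W_hy @ h                                       # output scores y_t
probs = np.exp(y) / np.exp(y).sum()                # softmax over the vocabulary
# With untrained weights this prediction is arbitrary; training fixes that.
print("predicted letter:", vocab[int(np.argmax(probs))])
```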
Backpropagation Through Time (BPTT)
This section assumes you already understand the concept of backpropagation. If you need to delve deeper into backpropagation, please refer to the link: http://cs231n.github.io/optimization-2/.
Now we understand how RNNs actually work, but how do we train RNNs in practice? How do we decide the weights of each connection? How do we initialize the weights of these hidden units? The goal of recurrent networks is to accurately classify sequential inputs. This is achieved through backpropagation of the error and gradient descent. However, the standard backpropagation used in feedforward networks cannot be applied here.
Unlike feedforward networks, which are directed acyclic graphs, RNNs contain cycles, and that is where the problem lies. In a feedforward network we can calculate the error derivatives layer by layer from the layer above; an RNN is not arranged in layers in the same way.
The answer lies in the content discussed earlier. We need to unfold the network. Unfolding the network makes it look like a feedforward network.
Unfolding RNN
At each time step we take the hidden units of the RNN and make a copy. Each copy across the time steps acts like a layer in a feedforward network, with each layer at time step t+1 connected to all the relevant layers at time step t. We therefore randomly initialize the weights, unfold the network, and optimize the weights of the hidden layers through backpropagation. The lowest layer is initialized by passing parameters in, and these parameters are also optimized as part of backpropagation.
The result of unfolding is that each layer now carries its own weights and would be optimized to a different extent; there is no guarantee that the errors computed against those weights would be equal, so at the end of a run the weights of each layer would differ. This is something we absolutely do not want. The simplest solution is to merge the errors of all the layers in some way, by averaging or summing them. That way we can keep a single set of weights, in effect a single layer, shared across all time steps.
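The following NumPy sketch shows the idea under the same toy setup as before: the network is unrolled over the 7 input letters, the error is propagated backwards through each copy, the per-step gradients are summed into one shared set of weight matrices, and those shared weights are updated once. The hidden size, learning rate, number of iterations, and the cross-entropy loss at the final step are all my own illustrative assumptions.

```python
# A rough sketch of backpropagation through time on the namaskar toy problem.
# Hidden size, learning rate, and the loss choice are illustrative assumptions.
import numpy as np

word = "namaskar"
vocab = sorted(set(word))
char_to_idx = {c: i for i, c in enumerate(vocab)}
V, H, lr = len(vocab), 4, 0.1

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(V, H))

def one_hot(c):
    v = np.zeros(V); v[char_to_idx[c]] = 1.0; return v

xs = [one_hot(c) for c in word[:7]]            # inputs n, a, m, a, s, k, a
target = one_hot(word[7])                      # the letter we want: r

for step in range(200):
    # Forward: unroll the recurrence, keeping every state for the backward pass.
    hs = [np.zeros(H)]
    for x in xs:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
    y = W_hy @ hs[-1]
    probs = np.exp(y - y.max()); probs /= probs.sum()

    # Backward: walk the unrolled copies from the last time step to the first,
    # summing each copy's gradient into the single shared weight matrices.
    dy = probs - target                        # gradient of softmax + cross-entropy
    dW_hy = np.outer(dy, hs[-1])
    dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
    dh = W_hy.T @ dy
    for t in reversed(range(len(xs))):
        da = dh * (1.0 - hs[t + 1] ** 2)       # backprop through tanh
        dW_xh += np.outer(da, xs[t])
        dW_hh += np.outer(da, hs[t])
        dh = W_hh.T @ da                       # pass the error to the previous time step

    # One update to the shared weights, applied to every time step at once.
    W_xh -= lr * dW_xh; W_hh -= lr * dW_hh; W_hy -= lr * dW_hy

# After training on this single example, the prediction converges to 'r'.
print("prediction after training:", vocab[int(np.argmax(probs))])
```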
RNN Implementation
This article implements an RNN using a Keras model. Given a piece of text, we try to predict the next part of the sequence.
Code link:
https://gist.github.com/09aefc5231972618d2c13ccedb0e22cc.git
This model was built by Yash Katariya. I made some minor adjustments to fit the requirements of this article.
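The linked gist is not reproduced here. As a rough idea of what a Keras character-level model of this kind can look like, here is a minimal, self-contained sketch; the stand-in corpus, window length, layer sizes, and the use of SimpleRNN are my own choices for illustration and are not taken from Yash Katariya's code.

```python
# A minimal, illustrative Keras character-level text model (not the linked gist;
# corpus, window length, and layer sizes below are assumed for the sketch).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

text = "namaskar " * 200                       # tiny stand-in corpus
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
seq_len = 8

# Build (input window, next character) training pairs, one-hot encoded.
X = np.zeros((len(text) - seq_len, seq_len, len(chars)), dtype="float32")
y = np.zeros((len(text) - seq_len, len(chars)), dtype="float32")
for i in range(len(text) - seq_len):
    for t, c in enumerate(text[i:i + seq_len]):
        X[i, t, char_to_idx[c]] = 1.0
    y[i, char_to_idx[text[i + seq_len]]] = 1.0

model = keras.Sequential([
    keras.Input(shape=(seq_len, len(chars))),
    layers.SimpleRNN(32),                      # the recurrent layer
    layers.Dense(len(chars), activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X, y, epochs=5, verbose=0)

# Predict the character that follows a seed window; in this corpus the
# character after "amaskar " is 'n'.
seed = text[1:1 + seq_len]
x = np.zeros((1, seq_len, len(chars)), dtype="float32")
for t, c in enumerate(seed):
    x[0, t, char_to_idx[c]] = 1.0
print("next char:", chars[int(model.predict(x, verbose=0).argmax())])
```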
Source: OFweek Artificial Intelligence