Solving the Vanishing Gradient Problem in RNNs


Author: Yulin Ling

CS224N (1.29) Vanishing Gradients, Fancy RNNs

Vanishing Gradient

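What goes wrong is easiest to see from the chain rule. As a sketch, in generic RNN notation (the symbols here are my own): for a vanilla RNN with hidden state $h^{(t)} = \sigma\big(W_h h^{(t-1)} + W_x x^{(t)} + b\big)$, the gradient of a loss $J^{(t)}$ at step $t$ with respect to an earlier hidden state $h^{(k)}$ is

\[
\frac{\partial J^{(t)}}{\partial h^{(k)}}
= \frac{\partial J^{(t)}}{\partial h^{(t)}}\,
  \prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}}
= \frac{\partial J^{(t)}}{\partial h^{(t)}}\,
  \prod_{j=k+1}^{t} \operatorname{diag}\!\big(\sigma'(\cdot)\big)\, W_h .
\]

The same matrix $W_h$ appears $t-k$ times in this product. If its largest singular value (together with the bound on $\sigma'$) is below 1, the norm of the product shrinks exponentially in $t-k$ and the gradient vanishes; if it is above 1, the product can grow exponentially and the gradient explodes.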

The figure below gives a more vivid example. Suppose we need to predict the word that follows “The writer of the books”. Because of the vanishing gradient, the nearby word “books” has more influence on the prediction than the true subject “writer”, so the model incorrectly predicts “are” instead of “is”.

[Figure: predicting the next word after “The writer of the books ___”]

Similarly, if the gradient explodes, the gradient descent update takes a very large step in a single iteration, causing the network to oscillate or diverge, and the parameters can even become Inf or NaN.
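For reference, the gradient descent update referred to here is

\[
\theta^{\,\text{new}} = \theta^{\,\text{old}} - \alpha \,\nabla_{\theta} J(\theta),
\]

so a huge $\nabla_{\theta} J(\theta)$ moves the parameters a huge distance in a single step.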


A better solution for exploding gradients is gradient clipping: if the norm of the gradient exceeds a threshold, the gradient is scaled down to that threshold without changing its direction. As shown in the figure below, the left subplot shows what happens without clipping: because of the exploding gradient, as the parameters approach a local minimum the gradient near the steep “cliff” becomes huge, so one update hurls the parameters up the cliff and off to a random region on the right, overshooting the minimum entirely. The right subplot shows the same surface with gradient clipping: the update steps are smaller, and the parameters settle near the local minimum.

[Figure: loss surface with a steep cliff, without gradient clipping (left) and with gradient clipping (right)]
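As a concrete illustration, here is a minimal sketch of where clipping fits into a training step, assuming PyTorch (the model, sizes, and threshold are made up for the example): `clip_grad_norm_` rescales the gradients in place if their total norm exceeds `max_norm`, leaving the direction unchanged.

```python
import torch
from torch import nn

# Toy RNN, output head, and data, purely to show where clipping goes (hypothetical sizes).
rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(16, 50, 32)   # (batch, seq_len, features)
y = torch.randn(16, 1)

optimizer.zero_grad()
output, _ = rnn(x)                                     # output: (batch, seq_len, hidden)
loss = nn.functional.mse_loss(head(output[:, -1]), y)  # predict from the last hidden state
loss.backward()

# If the total gradient norm exceeds 5.0, scale all gradients down by the same
# factor so the norm becomes 5.0; the update direction is preserved.
torch.nn.utils.clip_grad_norm_(params, max_norm=5.0)
optimizer.step()
```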

In summary, gradient explosion is relatively easy to handle, but the vanishing gradient problem is not so simple. In a vanilla RNN, the hidden state is overwritten at every time step t, and because of the vanishing gradient, states from the distant past have almost no influence on the current step, so long-distance dependencies cannot be modeled. So, if we keep a separate memory of the state at each time step, can we solve the long-distance dependency problem?

LSTM

LSTM is exactly this idea. It is easiest to understand by reading the gate formulas together with the schematic diagram of the cell, shown in the two figures below.

[Figure: LSTM gate equations]

[Figure: LSTM cell schematic]
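For completeness, one standard way of writing the LSTM equations (the notation here is my own and may differ from the figures): at each step $t$, the forget, input, and output gates and the new cell candidate are

\[
\begin{aligned}
f^{(t)} &= \sigma\!\big(W_f h^{(t-1)} + U_f x^{(t)} + b_f\big), \\
i^{(t)} &= \sigma\!\big(W_i h^{(t-1)} + U_i x^{(t)} + b_i\big), \\
o^{(t)} &= \sigma\!\big(W_o h^{(t-1)} + U_o x^{(t)} + b_o\big), \\
\tilde{c}^{(t)} &= \tanh\!\big(W_c h^{(t-1)} + U_c x^{(t)} + b_c\big), \\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}, \\
h^{(t)} &= o^{(t)} \odot \tanh\!\big(c^{(t)}\big).
\end{aligned}
\]

The key part is the cell state $c^{(t)}$: it is a separate memory that is updated additively, a forget-gated copy of $c^{(t-1)}$ plus gated new content, rather than being completely rewritten. Information, and gradients, can therefore flow across many time steps, which is exactly the idea above of saving a separate state at each time step.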

GRU

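For reference, one standard formulation of the GRU (again the notation is my own): an update gate $z^{(t)}$ and a reset gate $r^{(t)}$ control how much of the old hidden state is kept and how much of it is used when proposing new content,

\[
\begin{aligned}
z^{(t)} &= \sigma\!\big(W_z h^{(t-1)} + U_z x^{(t)} + b_z\big), \\
r^{(t)} &= \sigma\!\big(W_r h^{(t-1)} + U_r x^{(t)} + b_r\big), \\
\tilde{h}^{(t)} &= \tanh\!\big(W_h\big(r^{(t)} \odot h^{(t-1)}\big) + U_h x^{(t)} + b_h\big), \\
h^{(t)} &= \big(1 - z^{(t)}\big) \odot h^{(t-1)} + z^{(t)} \odot \tilde{h}^{(t)}.
\end{aligned}
\]

Like the LSTM, the GRU updates its state additively through gates, which preserves long-distance information; unlike the LSTM, it has no separate cell state and one fewer gate, so it has fewer parameters and is usually faster to train.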

Even more radical is DenseNet, which adds direct connections from each layer to every later layer, giving gradients many short paths to flow along and further alleviating the vanishing gradient problem.

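Here is a minimal sketch of this kind of dense connectivity, using fully connected layers in PyTorch (the sizes and layer types are invented for illustration; real DenseNets use convolutions). Each layer takes the concatenation of the block input and all earlier layers' outputs, so the gradient has a short, direct path back to every layer.

```python
import torch
from torch import nn

class DenseBlockSketch(nn.Module):
    """Each layer sees the concatenation of the block input and all
    previous layers' outputs (DenseNet-style connectivity)."""
    def __init__(self, dim: int = 16, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(dim * (i + 1), dim) for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            out = torch.relu(layer(torch.cat(features, dim=-1)))
            features.append(out)   # every later layer connects directly to this output
        return torch.cat(features, dim=-1)

y = DenseBlockSketch()(torch.randn(8, 16))
print(y.shape)   # torch.Size([8, 64]): input plus all three layer outputs
```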

Although all deep neural networks suffer from vanishing gradients, the problem is more severe in RNNs: backpropagation through time multiplies by the same weight matrix W over and over, and because RNNs are applied to sequences, the unrolled network is effectively very deep.
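A tiny numerical sketch of the “same W over and over” point (NumPy, with made-up sizes and scales): backpropagating through k time steps multiplies the gradient by the transpose of the same recurrent matrix k times, so its norm changes geometrically. The nonlinearity's derivative is ignored here for simplicity; for sigmoid or tanh it only makes the shrinking worse.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim = 64
W = rng.normal(scale=0.05, size=(h_dim, h_dim))   # small recurrent weights
# try scale=0.5 instead to see the norm explode rather than vanish

grad = rng.normal(size=h_dim)                     # gradient at the last time step
for steps in (1, 10, 50, 100):
    g = grad.copy()
    for _ in range(steps):                        # backprop through `steps` time steps
        g = W.T @ g                               # the SAME W is reused every step
    print(f"{steps:4d} steps back: ||grad|| = {np.linalg.norm(g):.3e}")
```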

Bidirectional RNN

Suppose we are doing sentiment classification on a sentence, as shown in the figure below. For the word “terribly”, a conventional RNN sees only the context on its left, because it processes the sentence from left to right. Reading “terribly” alone, or with only its left context, one might take it for a negative word; combined with “exciting” on its right, it instead acts as a strong intensifier of a positive sentiment. We therefore need to take the context on both sides into account at the same time.

[Figure: sentiment classification of a sentence containing “terribly exciting”]

A bidirectional RNN consists of two RNNs: one processes the sequence from left to right, the other from right to left, and their parameters are independent. The outputs of the two RNNs are concatenated to form the overall output, so the representation of “terribly” incorporates information from both its left and its right context.

[Figure: bidirectional RNN architecture]
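A minimal sketch of this in PyTorch, with made-up sizes: setting `bidirectional=True` runs the two RNNs and concatenates their hidden states at every position, so each word's representation covers both its left and right context.

```python
import torch
from torch import nn

emb_dim, hidden = 32, 64
bi_lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden,
                  batch_first=True, bidirectional=True)

x = torch.randn(8, 20, emb_dim)      # (batch, seq_len, emb_dim)
out, _ = bi_lstm(x)
print(out.shape)                     # torch.Size([8, 20, 128])
# out[:, t, :64] is the left-to-right state at position t,
# out[:, t, 64:] is the right-to-left state: "terribly" gets to see "exciting".
```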

RNNs can also be stacked into multiple layers, but the stacks are usually shallow, rarely more than 8 layers. One reason is the severe vanishing gradient problem in RNNs; another is that RNNs run serially over time, so a very deep stack is slow to train.
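For completeness, a minimal sketch of a stacked RNN, again assuming PyTorch with hypothetical sizes: the number of layers is just a constructor argument, and the returned hidden state keeps one slice per layer.

```python
import torch
from torch import nn

# Three stacked LSTM layers with dropout between them (hypothetical sizes).
stacked = nn.LSTM(input_size=32, hidden_size=64, num_layers=3,
                  dropout=0.2, batch_first=True)
out, (h_n, c_n) = stacked(torch.randn(8, 20, 32))
print(out.shape, h_n.shape)   # out: (8, 20, 64); h_n: (3, 8, 64), one state per layer
```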

In summary: clip gradients to deal with explosion; use the gating and additive state updates of LSTM or GRU to deal with vanishing gradients and long-distance dependencies; use a bidirectional RNN when you need context from both sides; and keep stacked RNNs shallow.
