Overview of Dropout Application in RNNs

[Introduction] This article provides background on and an overview of Dropout, together with an analysis of the parameters involved in applying it to language modeling with LSTM/GRU recurrent neural networks.

Author|Adrian G

Compiled by|Zhuanzhi

Organized by|Yingying

Dropout

Inspired by the role of sex in evolution, Hinton et al. first proposed Dropout, which temporarily removes units from a neural network during training. Srivastava et al. applied Dropout to feedforward neural networks and restricted Boltzmann machines, noting that a dropout rate of 0.5 for hidden layers and 0.2 for the input layer works well for a wide range of tasks.

Figure 1. (a) Standard neural network without dropout. (b) Neural network with dropout applied.

The core concept of Srivastava et al.’s work is that “each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes.” In a standard neural network, each parameter receives gradient-descent updates given what all the other units are doing, so hidden units may change in ways that merely fix up the errors of other units. This leads to “co-adaptation,” which in turn can cause overfitting. By making the presence of other hidden units unreliable, dropout prevents the co-adaptation of each hidden unit.

For each training example, the network is re-sampled, so a different set of neurons is dropped. At test time no units are dropped; instead, the outgoing weights of each unit are multiplied by the probability with which that unit was retained during training.

Figure 2. (a) Test error as the retention probability varies, with the number of hidden units n held fixed. (b) Test error with the expected number of retained units pn held fixed.

In Figure 2a, we can see that the test error is stable when the probability of retaining a neuron, p = 1 − dropout rate, is between 0.4 and 0.8. When the dropout rate falls below 0.2 (p > 0.8), the test error increases; when the dropout rate is too high (p < 0.3), the network underfits.

Srivastava et al. further found that “as the dataset grows, the gains from dropout increase to a point and then decrease. This indicates that there is an ‘optimal point’ for any given network structure and dropout rate.”

Srivastava et al. represent whether a hidden unit is retained with a Bernoulli random variable that takes the value 1 with probability p and 0 otherwise.

Example code for dropout is as follows:

import numpy as np

class Dropout():
    def __init__(self, prob=0.5):
        # prob is the probability of *keeping* a unit (the retention probability)
        self.prob = prob
        self.params = []

    def forward(self, X):
        # Inverted dropout: surviving activations are scaled by 1/prob at
        # training time, so no rescaling is needed at test time.
        self.mask = np.random.binomial(1, self.prob, size=X.shape) / self.prob
        out = X * self.mask
        return out.reshape(X.shape)

    def backward(self, dout):
        # Gradients flow only through the units that were kept in the forward pass.
        dX = dout * self.mask
        return dX, []
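Because the class above uses inverted dropout (the mask is already divided by the keep probability in forward), no weight rescaling is needed at test time; in this minimal class there is no train/test switch, so at test time the layer would simply be skipped. A minimal usage sketch, with arbitrary shapes and a hypothetical keep probability of 0.8:

np.random.seed(0)                              # make the random mask reproducible
dropout = Dropout(prob=0.8)                    # keep each unit with probability 0.8
X = np.random.randn(4, 10)                     # a small batch of activations

out = dropout.forward(X)                       # ~20% of entries zeroed, the rest scaled by 1/0.8
dX, _ = dropout.backward(np.ones_like(out))    # gradient is zero exactly where units were dropped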

DropConnect

Following Dropout, Wan et al. proposed DropConnect, which “generalizes Dropout by randomly dropping weights rather than activations.” “With DropConnect, what is dropped is each connection, rather than each output unit.” Like Dropout, the technique is suitable only for fully connected layers.

Figure 3. (a) Example model layout of a single DropConnect layer. After running the feature extractor g() on input x, a mask M is randomly instantiated (for example, as in (b)). The masked weights are multiplied by the feature vector to produce u, the input to the activation function a and the softmax layer s. For comparison, (c) shows the effective weight mask used when Dropout is applied to the output of the previous layer (columns) and the output of this layer (rows).

The figure illustrates the difference between Dropout and DropConnect. By applying the dropping operation to the input weights rather than the activations, DropConnect generalizes Dropout to the entire connection structure of a fully connected layer.
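To make the contrast concrete, below is a minimal numpy sketch (not the authors’ code) of a single fully connected layer with DropConnect: the Bernoulli mask is drawn over the weight matrix itself rather than over the layer’s outputs. Rescaling by the keep probability at training time is a simplification; Wan et al.’s paper describes a more involved sampling-based inference procedure.

import numpy as np

def dropconnect_layer(x, W, b, keep_prob=0.5, training=True):
    """Fully connected layer with DropConnect: mask individual weights, not activations."""
    if training:
        # Each connection (weight) is kept independently with probability keep_prob.
        M = np.random.binomial(1, keep_prob, size=W.shape)
        W = (M * W) / keep_prob        # simple rescaling so the expected pre-activation is unchanged
    u = W @ x + b                      # masked weights applied to the feature vector
    return np.maximum(u, 0.0)          # a nonlinearity a(u), e.g. ReLU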

Figure 4. A neuron takes a set of weighted inputs, applies a nonlinear activation function, and produces an output.

The two dropout methods described above were applied to feedforward and convolutional neural networks. Unlike purely feedforward networks, RNNs feed previous states back into the network, which allows them to retain information over time. Applying standard dropout to these recurrent connections therefore limits the network’s ability to retain previous states and hurts performance. Bayer et al. pointed out this issue for recurrent neural networks (RNNs): if a complete output weight vector is set to zero, it “significantly alters the RNN dynamics during each forward pass.”

Figure 5. Examples of a conventional feedforward neural network and a (feedforward) convolutional neural network (ConvNet).

Figure 6. Applying dropout in RNNs

Figure 7. An RNN unrolled (expanded) over time.

Example code for a vanilla RNN step:

class RNN:
    # ...
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN()
y = rnn.step(x)  # x is an input vector

Dropout for RNN

To overcome the performance problems of applying dropout to RNNs, Zaremba et al. and Pham et al. apply dropout only to the non-recurrent connections (dropout is not applied to the hidden state). “By not using dropout on the recurrent connections, the LSTM can benefit from dropout regularization without sacrificing its valuable memorization ability.”

Figure 8. Regularizing a multi-layer RNN. Dropout is applied only to the non-recurrent connections (i.e., only to the feedforward connections, drawn as dashed arrows). The thick line shows a typical path of information flow in the LSTM. The information is affected by dropout L + 1 times, where L is the depth of the network.
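A minimal sketch of this idea for a vanilla RNN step (illustrative code, not Zaremba et al.’s implementation): dropout is applied only to the input-to-hidden (non-recurrent) path, while the hidden-to-hidden connection is left untouched.

import numpy as np

def rnn_step_nonrecurrent_dropout(x, h, W_xh, W_hh, keep_prob=0.7, training=True):
    """One vanilla RNN step with dropout on the non-recurrent (input) connection only."""
    if training:
        mask = np.random.binomial(1, keep_prob, size=x.shape) / keep_prob
        x = x * mask                         # drop part of the feed-forward input
    # The recurrent term W_hh @ h is never masked, so the memory path is preserved.
    return np.tanh(np.dot(W_xh, x) + np.dot(W_hh, h))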

Variational Dropout

Gal and Ghahramani (2015) analyzed applying dropout only to the feedforward connections of RNNs and found that this still leads to overfitting. They proposed “Variational Dropout,” which reuses the same dropout mask at each time step for the inputs, outputs, and recurrent layers (the same network units are dropped at every time step). Building on a Bayesian interpretation, they reported improvements over ‘naive dropout’ on language modeling and sentiment analysis tasks.

Figure 9. Naive dropout (a) uses different masks at different time steps, with no dropout on recurrent layers. Variational dropout (b) uses the same dropout mask at each time step, including recurrent layers (color indicates the dropout mask, solid lines indicate dropout, and dashed lines indicate standard connections without dropout).
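A hedged sketch of the key mechanic (illustrative only, for a vanilla RNN rather than an LSTM): one input mask and one recurrent mask are sampled per sequence and then reused at every time step.

import numpy as np

def variational_rnn_forward(xs, h0, W_xh, W_hh, keep_prob=0.7):
    """Run a vanilla RNN over a sequence, reusing the same dropout masks at every step."""
    # Sample one mask for the inputs and one for the recurrent state, once per sequence.
    mask_x = np.random.binomial(1, keep_prob, size=xs[0].shape) / keep_prob
    mask_h = np.random.binomial(1, keep_prob, size=h0.shape) / keep_prob
    h, hs = h0, []
    for x in xs:
        # The same units are dropped at every time step, including on the recurrent path.
        h = np.tanh(np.dot(W_xh, x * mask_x) + np.dot(W_hh, h * mask_h))
        hs.append(h)
    return hs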

Recurrent Dropout

Similar to Moon et al. and to Gal and Ghahramani, Semeniuta et al. proposed applying dropout to the recurrent connections of RNNs so that the recurrent weights are also regularized. Gal and Ghahramani use the network’s hidden state as input to the subnetworks that compute the gate values and cell updates, and apply dropout to regularize those subnetworks (Figure 9b). Semeniuta et al., by contrast, view “the architecture as a whole, with the hidden state as its key part, and regularize the whole network.” This is similar in spirit to the approach of Moon et al., but Semeniuta et al. found that directly dropping the previous state, as Moon et al. do, produced mixed results, whereas applying dropout to the hidden state update vector is a more principled approach.

“Our technique allows for the addition of a strong regularizer on the model weights responsible for learning short-term and long-term dependencies without compromising the ability to capture long-term relationships, which is especially important for models dealing with natural language.”

Figure 10. Semeniuta et al.: “Illustration of the three types of dropout in the recurrent connections of LSTM networks. Dashed arrows indicate dropped connections. Input connections are omitted for clarity.” Note how Semeniuta et al. (2016) apply recurrent dropout to the updates of the LSTM memory cells.

“We demonstrate that (i) recurrent dropout is most effective when applied to the hidden state update vector rather than to the hidden state itself; (ii) when our recurrent dropout is combined with standard forward dropout, we observe improvements in network performance, although the degree of improvement depends on the dropout rate values; (iii) contrary to our expectations, networks trained with our recurrent dropout method yield similar results with per-step and per-sequence mask sampling, and both outperform the dropout scheme proposed by Moon et al.”
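As an illustration of applying dropout to the hidden state update vector, here is a simplified LSTM step (a sketch, not Semeniuta et al.’s code): only the candidate update g is dropped before it is added to the memory cell, so the f * c memory path itself is never zeroed.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_recurrent_dropout(x, h, c, W, b, keep_prob=0.75, training=True):
    """One LSTM step with recurrent dropout applied only to the cell update vector g."""
    z = np.dot(W, np.concatenate([x, h])) + b      # all four gates in one matrix multiply
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    if training:
        mask = np.random.binomial(1, keep_prob, size=g.shape) / keep_prob
        g = g * mask                               # drop the update, never the previous cell c
    c = f * c + i * g                              # the memory path f * c stays intact
    h = o * np.tanh(c)
    return h, c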

Zoneout

As a variant of dropout, Krueger et al. proposed Zoneout, which “instead of setting some units’ activations to 0, randomly replaces some units’ activations with their activations from the previous time step.” This “makes it easier for the network to preserve information from previous time steps going forward, and facilitates, rather than hinders, the flow of gradient information going backward.”

Figure 11. Zoneout as a special case of dropout; ~ht is the hidden activation of unit h at the next time step (if it is not zoned out). Zoneout can be viewed as applying dropout to the hidden state update, ~ht − ht−1. When this update is dropped (indicated by the dashed line), ht becomes ht−1.
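In code, this amounts to mixing the old and new hidden states with a Bernoulli mask. A minimal numpy sketch (illustrative; z is a hypothetical per-unit zoneout probability), whose semantics match the TensorFlow core shown further below:

import numpy as np

def zoneout_step(h_prev, h_new, z=0.15, training=True):
    """Zoneout: each unit keeps its previous activation with probability z."""
    if training:
        keep_prev = np.random.binomial(1, z, size=h_prev.shape)   # 1 means the unit is "zoned out"
        return keep_prev * h_prev + (1 - keep_prev) * h_new
    # At test time, use the expected value of the stochastic update.
    return z * h_prev + (1 - z) * h_new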

While both recurrent dropout and Zoneout prevent the loss of long-term memory built up in the states/cells of GRUs/LSTMs, “zoneout does so by preserving units’ activations exactly. This difference is most pronounced when zoning out the hidden state (not the memory cells) of an LSTM, for which there is no analogue in recurrent dropout. Whereas saturated output gates or output nonlinearities can cause recurrent dropout to suffer from vanishing gradients, zoned-out units still propagate gradients effectively. Furthermore, while the recurrent dropout method is specific to LSTMs and GRUs, zoneout generalizes to any model that builds its distributed representation of the input sequentially, including vanilla RNNs.”

Figure 12. (a) Zoneout, compared with (b) the recurrent dropout strategy in an LSTM. Dashed lines are zero-masked; in zoneout, the corresponding dotted lines are masked with the complementary (opposite) zero mask. Rectangular nodes are embedding layers.

The core idea of implementing Zoneout in TensorFlow:

if self.is_training:
    # During training, apply dropout to the state update (new_state_part - state_part),
    # rescale it by the keep probability, and add the previous state back, so a unit
    # that is "zoned out" simply keeps its previous value.
    new_state = (1 - state_part_zoneout_prob) * tf.python.nn_ops.dropout(
        new_state_part - state_part, (1 - state_part_zoneout_prob),
        seed=self._seed) + state_part
else:
    # At test time, use the expected value of the stochastic update.
    new_state = (state_part_zoneout_prob * state_part
                 + (1 - state_part_zoneout_prob) * new_state_part)

AWD-LSTM

In a pioneering work on RNN regularization for language modeling, Merity et al. proposed what they call AWD-LSTM (ASGD Weight-Dropped LSTM). In this approach, Merity et al. use DropConnect on the hidden-to-hidden weight matrices, while all other dropout operations use variational dropout, together with several other regularization strategies, including variable-length backpropagation through time (BPTT), activation regularization (AR), and temporal activation regularization (TAR).
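For reference, AR and TAR are simple L2 penalties added to the language modeling loss. The sketch below loosely follows the publicly released AWD-LSTM code; the alpha and beta values are placeholders, and the tensors are assumed to have shape (seq_len, batch, hidden):

import torch

def ar_tar_penalty(dropped_output, raw_output, alpha=2.0, beta=1.0):
    """Activation Regularization (AR) and Temporal Activation Regularization (TAR).

    dropped_output: final RNN layer output after dropout has been applied
    raw_output:     the same output before dropout
    """
    ar = alpha * dropped_output.pow(2).mean()                      # penalize large activations
    tar = beta * (raw_output[1:] - raw_output[:-1]).pow(2).mean()  # penalize large step-to-step changes
    return ar + tar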

Regarding DropConnect, Merity et al. note: “As the same weights are reused over multiple time steps, the same individual dropped weights remain dropped for the entirety of the forward and backward pass. The result is similar to variational dropout, which applies the same dropout mask to the recurrent connections within the LSTM by performing dropout on ht−1, except that the dropout is applied to the recurrent weights.”

About the use of variational dropout, Merity et al. note: “Each example within the mini-batch uses a unique dropout mask, rather than a single dropout mask being used over all examples, ensuring diversity in the elements dropped out.”

Utilizing Gal & Ghahramani’s embedded dropout, Merity et al. also note that this “is equivalent to performing dropout on the embedding matrix at a word level, where the dropout is broadcast across all of the word vector’s embedding.” “As the dropout occurs on the embedding matrix that is used for a full forward and backward pass, this means that all occurrences of a specific word will disappear within that pass, equivalent to performing variational dropout on the connection between the one-hot embedding and the embedding lookup.”
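A hedged sketch of word-level embedding dropout (illustrative, and simpler than the embedded_dropout function referenced in the code below): one Bernoulli draw is made per vocabulary word and broadcast across that word’s embedding vector, so every occurrence of a dropped word vanishes for the whole forward/backward pass.

import torch
import torch.nn as nn

def embedding_word_dropout(embed: nn.Embedding, words, dropout=0.1, training=True):
    """Drop entire rows (words) of the embedding matrix for one forward/backward pass."""
    weight = embed.weight
    if training and dropout:
        # One Bernoulli draw per vocabulary word, broadcast across its embedding dimension.
        keep = weight.new_empty((weight.size(0), 1)).bernoulli_(1 - dropout) / (1 - dropout)
        weight = weight * keep
    return nn.functional.embedding(words, weight, embed.padding_idx)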

Code used by Merity et al.:

class LockedDropout(nn.Module):
    def __init__(self):
        super().__init__()
    def forward(self, x, dropout=0.5):
        if not self.training or not dropout:
            return x
        # x has shape (seq_len, batch, features); the mask is sampled once per
        # batch element and feature, then expanded (reused) across all time steps.
        m = x.data.new(1, x.size(1), x.size(2)).bernoulli_(1 - dropout)
        mask = Variable(m, requires_grad=False) / (1 - dropout)
        mask = mask.expand_as(x)
        return mask * x

The locked dropout is then applied at several points in the forward method of RNNModel (self.lockdrop below is an instance of LockedDropout):

def forward(self, input, hidden, return_h=False):
    emb = embedded_dropout(self.encoder, input,
                           dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)
    raw_output = emb
    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        current_input = raw_output
        raw_output, new_h = rnn(raw_output, hidden[l])
        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden
    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)
    result = output.view(output.size(0)*output.size(1), output.size(2))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden

DropConnect is applied in the same RNNModel’s __init__ method:

if rnn_type == 'LSTM':
    self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid,
                               nhid if l != nlayers - 1 else (ninp if tie_weights else nhid),
                               1, dropout=0) for l in range(nlayers)]
    if wdrop:
        self.rnns = [WeightDrop(rnn, ['weight_hh_l0'], dropout=wdrop)
                     for rnn in self.rnns]

The key part of the WeightDrop class is the following method:

def _setweights(self):
    for name_w in self.weights:
        raw_w = getattr(self.module, name_w + '_raw')
        w = None
        if self.variational:
            mask = torch.autograd.Variable(torch.ones(raw_w.size(0), 1))
            if raw_w.is_cuda: mask = mask.cuda()
            mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
            w = mask.expand_as(raw_w) * raw_w
        else:
            w = torch.nn.functional.dropout(raw_w, p=self.dropout, training=self.training)
        setattr(self.module, name_w, w)

Original link:

https://medium.com/@bingobee01/a-review-of-dropout-as-applied-to-rnns-72e79ecd5b7b

References:

J. Bayer, C. Osendorfer, D. Korhammer, N. Chen, S. Urban, P. van der Smagt. 2013. On Fast Dropout and its Applicability to Recurrent Networks.

Y. Bengio, P. Simard, P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult.

cs231n. https://cs231n.github.io/convolutional-networks/

deepnotes.io. https://deepnotes.io/dropout

Y. Gal and Z. Ghahramani. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors.

karpathy.github.io. https://karpathy.github.io/2015/05/21/rnn-effectiveness/

D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. Rosemary Ke, A. Goyal, Y. Bengio, A. Courville, C. Pal. 2016. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.

S. Merity, N. Shirish Keskar, R. Socher. 2017. Regularizing and Optimizing LSTM Language Models.

ml-cheatsheet.readthedocs.io. https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html

T. Moon, H. Choi, H. Lee, I. Song. 2015. Rnndrop: A novel dropout for rnns.

P. Morerio, J. Cavazza, R. Volpi, R. Vidal, V. Murino. 2017. Curriculum Dropout.

A. Narwekar, A. Pampari. 2016. Recurrent Neural Network Architectures. http://slazebni.cs.illinois.edu/spring17/lec20_rnn.pdf

V. Pham, T. Bluche, C. Kermorvant, J. Louradour. 2013. Dropout improves Recurrent Neural Networks for Handwriting Recognition.

S. Semeniuta, A. Severyn, E. Barth. 2016. Recurrent Dropout without Memory Loss.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

L. Wan, M. Zeiler, S. Zhang, Y. LeCun, R. Fergus. 2013. Regularization of neural networks using DropConnect.

W. Zaremba, I. Sutskever, O. Vinyals. 2014. Recurrent Neural Network Regularization.

K. Zolna, D. Arpit, D. Suhubdy, Y. Bengio. 2017. Fraternal Dropout
