Amazing! LSTM With Only Forget Gate Outperforms Standard LSTM

Selected from arXiv

Authors: Jos van der Westhuizen, Joan Lasenby

Compiled by Machine Heart

Contributors: Pedro, Lu

This paper studies what happens when an LSTM has only a forget gate and proposes JANET; experiments show that this model outperforms the standard LSTM.

1. Introduction

Excellent engineers ensure their designs are practical. We now know that long short-term memory (LSTM) recurrent neural networks are the best tool for solving sequence analysis problems; the next step is to design implementations that meet the constraints of resource-limited, real-world applications. Given the success of the gated recurrent unit (GRU) with its two gates (Cho et al., 2014), a first step toward a more hardware-efficient LSTM is to eliminate a redundant gate. And since we are after a model even leaner than the GRU, a single-gate LSTM is worth investigating. To see why that single gate should be the forget gate, let's start with the origins of the LSTM.

Back when training recurrent neural networks (RNNs) was very challenging, Hochreiter and Schmidhuber (1997) argued that using a single weight (edge) to control whether a memory cell accepts input or emits output leads to conflicting weight updates (gradients). Essentially, long-range and short-range errors act on the same weights at every step, and with sigmoid activations the gradients vanish faster than the weights can grow. They therefore proposed the long short-term memory (LSTM) recurrent unit, which has multiplicative input and output gates. These gates alleviate the conflicting-update problem by 'protecting' the memory cell from irrelevant information (inputs or outputs from other units).

That first version of the LSTM had only these two gates. Gers et al. (2000) then discovered that, without a mechanism for the memory cell to forget, its state can grow without bound and eventually cause the network to break down. To solve this problem, they added a third multiplicative gate, the forget gate, completing the LSTM architecture we see today.

Given the significance of these findings about the forget gate, it is natural to imagine an LSTM that uses only a forget gate: are the input and output gates necessary at all? This study explores the advantages of using only the forget gate. Across five tasks, the model with only a forget gate found better solutions than the model with all three LSTM gates.

3. Just Another Network

We propose a simple LSTM variant that has only a forget gate. It is Just Another NETwork, hence the name JANET. We start from the standard LSTM (Lipton et al., 2015), in which the symbols have their usual meanings, defined as follows:

i_t = σ(U_i h_{t-1} + W_i x_t + b_i)
f_t = σ(U_f h_{t-1} + W_f x_t + b_f)
o_t = σ(U_o h_{t-1} + W_o x_t + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(U_c h_{t-1} + W_c x_t + b_c)
h_t = o_t ⊙ tanh(c_t)

where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication.
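As an illustration of the equations above, here is a minimal NumPy sketch of one step of a standard LSTM cell. The function name, the parameter dictionary, and the key names (U_i, W_i, b_i, and so on) simply mirror the symbols in the equations; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a standard LSTM cell.

    p is a dict of recurrent weights U_*, input weights W_* and biases b_*
    for the input (i), forget (f) and output (o) gates and the cell update (c).
    """
    i_t = sigmoid(p["U_i"] @ h_prev + p["W_i"] @ x_t + p["b_i"])   # input gate
    f_t = sigmoid(p["U_f"] @ h_prev + p["W_f"] @ x_t + p["b_f"])   # forget gate
    o_t = sigmoid(p["U_o"] @ h_prev + p["W_o"] @ x_t + p["b_o"])   # output gate
    c_tilde = np.tanh(p["U_c"] @ h_prev + p["W_c"] @ x_t + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde       # forget old content, admit new content
    h_t = o_t * np.tanh(c_t)                 # expose a gated, squashed view of the cell
    return h_t, c_t
```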

To convert the standard LSTM into the JANET architecture, we remove the input and output gates. It seems sensible to couple the accumulation of information to its deletion, so we tie the input modulation to the forget gate, as Greff et al. (2015) did, which is akin to the leaky-unit formulation (Jaeger, 2002, §8.1). Furthermore, the tanh activation applied to h_t shrinks gradients during backpropagation, potentially exacerbating the vanishing-gradient problem. Because the weights U∗ can accommodate values outside the range [-1, 1], we can remove this unnecessary and potentially problematic tanh nonlinearity. The resulting JANET formulation is as follows:

f_t = σ(U_f h_{t-1} + W_f x_t + b_f)
c_t = f_t ⊙ c_{t-1} + (1 - f_t) ⊙ tanh(U_c h_{t-1} + W_c x_t + b_c)
h_t = c_t
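For comparison, here is a corresponding sketch of one JANET step under the same assumptions (illustrative parameter naming, not the authors' implementation): the input and output gates are gone, accumulation is coupled to forgetting through (1 - f_t), and the cell state is passed out directly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def janet_step(x_t, h_prev, c_prev, p):
    """One step of a JANET cell: only the forget gate remains."""
    f_t = sigmoid(p["U_f"] @ h_prev + p["W_f"] @ x_t + p["b_f"])      # forget gate
    c_tilde = np.tanh(p["U_c"] @ h_prev + p["W_c"] @ x_t + p["b_c"])  # candidate update
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde   # coupled forget/input modulation
    h_t = c_t                                    # no output gate, no extra tanh on h_t
    return h_t, c_t
```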

4. Experiments and Results


Table 1: Accuracy [%] of different recurrent neural network architectures. The table reports the mean and standard deviation over 10 independent runs. The best accuracies from our experiments and the best results from the cited papers are highlighted in bold.

Surprisingly, the results show that JANET achieves higher accuracy than the standard LSTM. Moreover, JANET is among the best-performing models on every dataset analyzed. Thus, by simplifying the LSTM, we not only save computation but also improve accuracy on the test set!


Figure 1: Test accuracy of LSTMs trained on MNIST and pMNIST.


Figure 2: Comparison of test set accuracy for JANET and LSTM during training on MNIST.


Figure 3: Accuracy (%) of JANET and LSTM on the pMNIST dataset with different layer sizes.

Paper: The Unreasonable Effectiveness of the Forget Gate


Paper link: https://arxiv.org/abs/1804.04849

Abstract: Given the success of gated recurrent units (GRUs), a natural question arises: are all gates in Long Short-Term Memory (LSTM) networks necessary? Previous research has shown that the forget gate is one of the most important gates in LSTM. Here we find that an LSTM version with only a forget gate and a chrono-initialized bias not only saves computational costs but also outperforms standard LSTM on multiple benchmark datasets, competing with some of the best models currently available. Our proposed network, JANET, achieved accuracies of 99% and 92.5% on the MNIST and pMNIST datasets, respectively, surpassing the standard LSTM’s accuracies of 98.5% and 91%.
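The "chrono-initialized bias" mentioned in the abstract refers to chrono initialization (Tallec & Ollivier, 2018), which sets the forget-gate bias so that the cell initially retains information over timescales up to the expected sequence length T_max. Below is a hedged sketch of what such an initialization could look like; the function name and the specific choices of hidden size and T_max are illustrative, not taken from the paper's code.

```python
import numpy as np

def chrono_forget_bias(hidden_size, t_max, rng=None):
    """Chrono-style initialization of the forget-gate bias b_f.

    Draws b_f = log(u) with u ~ Uniform(1, t_max - 1), so the forget gate
    starts out remembering over timescales comparable to the expected
    sequence length t_max.
    """
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(1.0, t_max - 1.0, size=hidden_size)
    return np.log(u)

# Example: pixel-by-pixel (p)MNIST sequences are 28 * 28 = 784 steps long.
b_f = chrono_forget_bias(hidden_size=128, t_max=784)
```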

This article was compiled by Machine Heart; please contact this official account for authorization to reprint.

