Do RNN and LSTM Have Long-Term Memory?

This article introduces the ICML 2020 paper “Do RNN and LSTM have Long Memory?”. The authors of the paper are from Huawei Noah’s Ark Lab and the University of Hong Kong.
Author | Noah’s Ark Lab
Editor | Cong Mo


To overcome the difficulties of Recurrent Neural Networks (RNNs) in learning long-term dependencies, Long Short-Term Memory (LSTM) networks were proposed in 1997 and have since made significant progress in applications. Many papers have confirmed the practicality of LSTMs and attempted to analyze their properties. However, the question “Do RNNs and LSTMs have long-term memory?” remains unanswered. This paper answers the question from a statistical perspective, proving that RNNs and LSTMs do not possess long-term memory in the statistical sense when making time series predictions. The existing definitions of long-term memory in statistics do not apply to neural networks, so we propose a new definition applicable to neural networks and use this new definition to analyze the theoretical properties of RNNs and LSTMs again. To validate our theory, we made minimal modifications to RNNs and LSTMs, converting them into long-term memory neural networks, and verified their superiority on datasets with long-term memory properties.

Related Background

Although the term long-term memory is often used when LSTMs are applied in deep learning, it has no strict definition there. In statistics, a strict definition of long-term memory has existed for a long time. For a second-order stationary one-dimensional time series $\{X_t\}$, let its autocovariance function be $\gamma(k) = \mathrm{Cov}(X_t, X_{t+k})$. If $\gamma(k)$ is not summable, that is, $\sum_{k} |\gamma(k)| = \infty$, then $\{X_t\}$ has long-term memory; if $\gamma(k)$ is summable, then $\{X_t\}$ has short-term memory. Besides the autocovariance function, long-term memory can be defined equivalently through the spectral density function. The rigorous statement is given in Definition 1, shown in the figure below.
[Figure: Definition 1 (long-term memory of a stationary time series)]
A common time series model that meets the definition of long-term memory is the fractionally integrated process. In time series analysis, models are usually written compactly with the backshift operator $B$. When $B$ is applied to a random variable in the time series, it returns the random variable at the previous moment, that is, $B X_t = X_{t-1}$. The backshift operator can be manipulated much like an algebraic variable; for example, $B^2 X_t = B(B X_t) = X_{t-2}$ and $(1 - B) X_t = X_t - X_{t-1}$. Using the backshift operator, a fractionally integrated process $\{Y_t\}$ can be written as
$$(1 - B)^d Y_t = X_t, \qquad (2)$$
$$(1 - B)^d = \sum_{j=0}^{\infty} \frac{\Gamma(j - d)}{\Gamma(-d)\,\Gamma(j + 1)} B^j, \qquad (3)$$
where $\Gamma(\cdot)$ is the Gamma function and $d$ is the memory parameter of the fractionally integrated model. Generally, $\{X_t\}$ is chosen to be an Autoregressive Moving Average (ARMA) process, in which case $\{Y_t\}$ follows the Autoregressive Fractionally-Integrated Moving-Average (ARFIMA) model. The ARMA part of an ARFIMA model captures short-term memory patterns, while the fractional integration parameter $d$ captures long-term memory patterns.
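The expansion coefficients of the fractional difference operator can be computed without evaluating Gamma functions for every lag. The following is a minimal sketch, assuming the standard expansion $(1 - B)^d = \sum_j w_j B^j$ with $w_j = \Gamma(j - d) / (\Gamma(-d)\,\Gamma(j + 1))$; the function name `frac_diff_weights` and the value d = 0.4 are illustrative choices, not taken from the paper.

```python
import math

def frac_diff_weights(d, n):
    """Return w_0 .. w_{n-1} of (1 - B)^d = sum_j w_j B^j.

    Uses the recursion w_0 = 1, w_j = w_{j-1} * (j - 1 - d) / j,
    which follows from Gamma(x + 1) = x * Gamma(x).
    """
    weights = [1.0]
    for j in range(1, n):
        weights.append(weights[-1] * (j - 1 - d) / j)
    return weights

w = frac_diff_weights(d=0.4, n=2001)

# Polynomial decay: |w_j| * j^(d + 1) approaches the constant 1 / |Gamma(-d)|.
limit = 1 / abs(math.gamma(-0.4))
for j in (10, 100, 1000, 2000):
    print(j, abs(w[j]) * j ** 1.4, limit)
```

The rescaled coefficients settle near a constant, which is what decay at the polynomial rate $j^{-d-1}$ looks like numerically.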
There is no unique definition of long-term memory for multidimensional time series. We choose a simple and direct one: check whether each dimension of the time series has long-term memory on its own, ignoring long-range correlations between different dimensions. Each dimension $i$ has its own memory parameter $d_i$ to model that dimension. The compact form of the multidimensional model is given in equation (4).
[Equation (4): multidimensional fractionally integrated model]
The ARFIMA model is one of the important inspirations for this paper. Some of its important properties include:
  • When the memory parameter $d$ is between -0.5 and 0.5, the model is stationary, and when $d$ is greater than 0.5, the model is non-stationary;
  • For stationary models, if $d$ is less than 0, the model has short-term memory, while if $d$ is greater than 0, the model has long-term memory, and the larger $d$ is, the longer the memory effect;
  • The coefficients in equation (3) decay at a polynomial rate as the lag $j$ increases, that is, the coefficient of $B^j$ is of order $j^{-d-1}$;
  • The autocovariance function decays at a polynomial rate.
Coefficients or autocovariance functions that decay at a polynomial rate are the key feature distinguishing the ARFIMA model from short-term memory models. Models with short-term memory have coefficients or autocovariance functions that decay at an exponential rate, so past information is lost rapidly. By Definition 1, a sequence that decays polynomially is non-summable when its exponent is at least -1 (for example, $k^{-1/2}$), while a sequence that decays exponentially is always summable. Therefore, if the autocovariance function belongs to the former class, the model has long-term memory; if it belongs to the latter, the model does not. This paper also uses the decay rate to prove the memory properties of RNNs and LSTMs.
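The difference between the two kinds of decay shows up directly in partial sums. The sketch below uses two hypothetical sequences chosen only for illustration, $k^{-0.5}$ (polynomial decay with exponent above -1) and $0.9^k$ (exponential decay); neither comes from the paper.

```python
def partial_sum(term, n):
    """Sum term(k) for k = 1 .. n."""
    return sum(term(k) for k in range(1, n + 1))

def poly(k):
    # Polynomial decay with exponent -0.5, which is greater than -1.
    return k ** -0.5

def expo(k):
    # Exponential decay.
    return 0.9 ** k

for n in (10**2, 10**4, 10**6):
    print(n, round(partial_sum(poly, n), 2), round(partial_sum(expo, n), 6))
```

The polynomial partial sums keep growing roughly like $2\sqrt{n}$, while the exponential ones converge to $0.9 / (1 - 0.9) = 9$.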
To prove the decay rate of a model's autocovariance function, we use the property of geometric ergodicity, whose definition is shown in the figure below. A geometrically ergodic Markov chain converges to its stationary distribution at a rate that is exponential in the number of steps, meaning that information about the chain's current state is lost at an exponential rate. Geometric ergodicity implies that the autocovariance function converges to 0 at an exponential rate, so the stochastic process does not have long-term memory.
[Figure: definition of geometric ergodicity]

Memory Properties of Recurrent Networks

Assume that the input of a recurrent network is $\{x^{(t)}\}$, the output is $\{z^{(t)}\}$, and the target sequence is $\{y^{(t)}\}$, where $y^{(t)}$ comes from the model
$$y^{(t)} = z^{(t)} + \varepsilon^{(t)},$$
and $\{\varepsilon^{(t)}\}$ is independently and identically distributed white noise. The theoretical results of this section are established for time series prediction without exogenous variables, that is, $x^{(t)} = y^{(t-1)}$. With a generic hidden state $s^{(t)}$, a recurrent network can then be expressed in the form of a Markov chain; see equation (7).
[Equation (7): Markov chain form of the recurrent network]
If the transition function in equation (7) is linear, the chain becomes the linear Markov chain of equation (8).
[Equation (8): linear Markov chain]
The Markov chain in equation (7) covers both RNNs and LSTMs.
For example, the feedforward computation of the most basic RNN, trained with the 2-norm loss function, is shown in equation (9).
[Equation (9): feedforward computation of the basic RNN]
Equation (9) involves an output function and an activation function. This RNN can be expressed in the form of the Markov chain in equation (7).
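To make the Markov chain view concrete, here is a minimal sketch of a scalar RNN without exogenous input, where the input at each step is the previous target (output plus white noise). The weights, the noise level, and the use of tanh for both the activation and the output function are assumptions made for illustration, not the paper's settings.

```python
import math
import random

def rnn_step(h_prev, y_prev, rng, w_hh=0.5, w_hy=0.8, w_zh=1.2, noise_sd=0.1):
    """One step of a scalar RNN fed with its own previous target.

    Hypothetical weights; tanh serves as both activation and output
    function, so both are continuous and bounded.
    """
    h = math.tanh(w_hh * h_prev + w_hy * y_prev)  # hidden state update
    z = math.tanh(w_zh * h)                       # network output
    y = z + rng.gauss(0.0, noise_sd)              # target = output + white noise
    return h, y

rng = random.Random(0)
state = (0.0, 0.0)  # the pair (h, y) plays the role of the generic hidden state
path = []
for _ in range(1000):
    # The next state depends only on the current state and fresh noise.
    state = rnn_step(*state, rng)
    path.append(state)

print(path[-1])
```

Because each new pair is a function of the previous pair and an independent noise draw, the sequence of pairs is a Markov chain.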
Similarly, for the basic LSTM network, the feedforward process is
[Equation: feedforward computation of the basic LSTM]
The hidden layer unit is computed through the following gate operations:
[Equation: LSTM gate operations]
These involve an output function, the sigmoid activation function, and the hyperbolic tangent activation function tanh. This LSTM process can also be written in the form of equation (7). In the LSTM, however, the hidden layer unit $h^{(t)}$ and the cell state $c^{(t)}$ together correspond to the generic hidden state in equation (7). The form of the transition function is more complex and is not shown here.
After adding two technical assumptions (see Assumption 1 in the figure below), we derived the two main results of this paper. Theorem 1 gives a sufficient condition for the recurrent network (7) to have short-term memory, while Theorem 2 gives a necessary and sufficient condition for the linear recurrent network (8) to have short-term memory. These conditions are imposed on the transition function or transition matrix, so they hold for every recurrent network model that satisfies equation (7), including RNNs and LSTMs.
[Figure: Assumption 1 and the main theorems]
Theorem 1 is relatively abstract, because its conditions on the transition function cannot be translated intuitively into conditions on the weights and activation functions of a network. We therefore further derived Corollary 1 and Corollary 2. Corollary 1 proves that an RNN has short-term memory as long as its output and activation functions are continuous and bounded, as shown in the figure below.
[Figure: Corollary 1]
Corollary 2 gives sufficient conditions for an LSTM to have short-term memory. The first is a condition on the output function, which the commonly used linear, ReLU, sigmoid, and tanh output functions all satisfy; the second requires the output of the forget gate to be strictly less than 1. Corollary 2 shows that the forget gate is key to the memory properties of LSTMs.
[Figure: Corollary 2]
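One way to see why the forget gate condition matters is to unroll the standard LSTM cell update $c_t = f_t c_{t-1} + i_t g_t$: the weight on the initial cell state after $k$ steps is the product of the $k$ forget gate values, which is at most $f_{\max}^k$ when every gate is bounded by $f_{\max} < 1$. The sketch below illustrates this for scalar gates; the gate values are random numbers chosen for illustration, not outputs of a trained network.

```python
import random

def initial_state_weight(forget_gates):
    """Weight on c_0 in c_k after unrolling c_t = f_t * c_{t-1} + i_t * g_t."""
    weight = 1.0
    for f in forget_gates:
        weight *= f
    return weight

rng = random.Random(1)
f_max = 0.95  # assumed upper bound on the forget gate, strictly below 1
gates = [rng.uniform(0.5, f_max) for _ in range(200)]

for k in (10, 50, 200):
    # The actual weight never exceeds the geometric bound f_max ** k.
    print(k, initial_state_weight(gates[:k]), f_max ** k)
```

The bound shrinks exponentially in $k$, which matches the exponential loss of past information that characterizes short-term memory.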
The theoretical results above assume that there are no exogenous variables, whereas neural networks in practice often take exogenous inputs. If the exogenous variables themselves have long-term memory, they interfere with the analysis of the network's own memory properties, so the existing statistical definitions of long-term memory cannot be used when exogenous variables are present. To fill this gap, we propose a new definition of long-term memory that applies to neural networks. Assume that the neural network can be written (or approximated) in the following form:
[Equation: network written as a weighted sum of past inputs with coefficient matrices]
If the coefficient matrices decay at a polynomial rate in at least one dimension, the network is considered to have long-term memory. The precise statement is given in Definition 3.
[Figure: Definition 3 (long-term memory for neural networks)]

Long-Term Memory Recurrent Networks and Their Properties

Based on the theoretical results above, we aim to make minimal modifications to RNNs and LSTMs to enable them to model long-range dependencies. Similar to the structure in the ARFIMA model, we added a long-term memory filter at different positions in RNNs and LSTMs, resulting in Memory-augmented RNN (MRNN) and Memory-augmented LSTM (MLSTM) models.
The specific form of the long-term memory filter is
[Equation: long-term memory filter]
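As a rough illustration of what such a filter could look like, the sketch below assumes that it truncates the fractional differencing expansion of $(1 - B)^d$ after $K$ lags, in line with the ARFIMA analogy. The function name, the zero padding before the start of the series, and the indexing convention are assumptions; the paper's exact filter may differ.

```python
def truncated_fractional_filter(x, d, K):
    """Apply sum_{j=0}^{K-1} w_j(d) * x[t - j] at every time step t.

    Assumed form: the weights w_j(d) are those of (1 - B)^d, truncated
    at K lags; values before the start of the series are treated as 0.
    """
    w = [1.0]
    for j in range(1, K):
        w.append(w[-1] * (j - 1 - d) / j)
    out = []
    for t in range(len(x)):
        lags = min(K, t + 1)
        out.append(sum(w[j] * x[t - j] for j in range(lags)))
    return out

# Sanity check: d = 1 reduces to ordinary first differencing (weights 1, -1, 0, ...).
print(truncated_fractional_filter([1.0, 3.0, 6.0, 10.0], d=1, K=4))
# -> [1.0, 2.0, 3.0, 4.0]
```

For fractional $d$ between 0 and 0.5 the weights never vanish and decay only polynomially, so every one of the last $K$ inputs contributes to the filtered value.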
The MRNN network structure is shown in the diagram below.
[Figure: MRNN network structure]
Here a long-term memory hidden layer unit operates alongside the ordinary hidden layer unit: the former is responsible for capturing long-term memory information, while the latter is responsible for modeling short-term information. The specific calculations are as follows.
[Equation: long-term memory hidden layer unit of MRNN]
The overall forward propagation process of MRNN can be expressed as
[Equation: forward propagation of MRNN]
To analyze the theoretical properties, we need to simplify the model. We refer to the simplified model, in which the memory parameter $d$ is constant over time, as MRNNF. For the MRNNF model, our analysis shows that MRNNF satisfies Definition 3 and possesses long-term memory, while RNNs still do not possess long-term memory.
[Figure: theoretical properties of MRNNF]
For the LSTM, we added the long-term memory filter to the cell state, because the original LSTM cell state follows an AR(1) model with dynamic coefficients:
$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)},$$
where $f^{(t)}$ is the forget gate, $i^{(t)}$ is the input gate, and $\tilde{c}^{(t)}$ is the candidate cell state. It is therefore natural to add the long-term memory filter to this state update, resulting in
[Equation: MLSTM cell state update with the long-term memory filter]
The MLSTM network structure is shown in the diagram below.
[Figure: MLSTM network structure]
The forward propagation process is
[Equations: forward propagation of MLSTM]
Directly analyzing whether MLSTM satisfies Definition 3 is still challenging, so we applied a simplification similar to the one used for MRNNF and assumed that none of the gates change over time. Under this assumption, LSTMs with time-invariant gates do not possess long-term memory, while MLSTMs with time-invariant gates (referred to as MLSTMF) do. Note that this does not imply that general LSTMs lack long-term memory; it only suggests indirectly that the gate operations in LSTMs play an important role in their memory properties.

Experimental Results

We conducted three experiments: first, to validate the advantages of the newly proposed models on datasets with long-term memory properties; second, to verify that the newly proposed models do not degrade performance on datasets with only short-term memory properties; and third, to explore the impact of the long-term memory filter length $K$ on model performance.
We first selected four datasets with long-term memory properties for time series prediction tasks: sequences generated by the ARFIMA model, returns from the Dow Jones index, subway passenger flow data from Minneapolis, and tree ring width data. The long-term memory of these datasets can be judged by plotting the sample autocorrelation function. For example, the autocorrelation function in the figure below decays slowly over many lags, which is markedly different from that of a white noise sequence and intuitively indicates that the data has long-term memory.
[Figure: sample autocorrelation function of a dataset with long-term memory]
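The sample autocorrelation function used for this kind of visual check is straightforward to compute. The sketch below uses the usual biased estimator and applies it to simulated white noise as a baseline; the function name `sample_acf` and the simulated data are illustrative and are not the datasets from the paper.

```python
import random

def sample_acf(x, max_lag):
    """Sample autocorrelation r(k) for k = 0 .. max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / var
        for k in range(max_lag + 1)
    ]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
acf = sample_acf(noise, max_lag=20)

# For white noise, r(k) at k >= 1 should mostly lie within about 2 / sqrt(n).
print([round(r, 3) for r in acf[:5]])
```

A long-memory series would instead show autocorrelations that stay well outside that band for many lags.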
We compared eight models:
1. Original RNN, lookback = 1;
2. Dual-track RNN (similar to MRNN), but the filter part is unrestricted, with $K$ free weights;
3. Recurrent weighted average network (RWA);
4. MRNNF, where the memory parameter $d$ is constant over time;
5. MRNN, where the memory parameter $d$ changes over time;
6. Original LSTM;
7. MLSTMF, where the memory parameter $d$ is constant over time;
8. MLSTM, where the memory parameter $d$ changes over time.
Since the training of these networks is a non-convex problem, initializing the models with different seeds will lead to different learned models. Therefore, we used 100 different seeds and reported the mean, standard deviation, and minimum of the error metrics for the time series prediction tasks. The mean and standard deviation of the errors are shown in Table 2, and the minimum values are shown in Table 3.
[Table 2: mean and standard deviation of the prediction errors]
[Table 3: minimum prediction errors]
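The reporting protocol described above (one run per seed, then mean, standard deviation, and minimum of the error) can be sketched as follows. The `fake_experiment` function is a placeholder that returns synthetic numbers; it stands in for training and evaluating a network and does not reproduce any result from the paper.

```python
import random
import statistics

def summarize_over_seeds(train_and_evaluate, seeds):
    """Run one experiment per seed and report mean, std, and min of the error."""
    errors = [train_and_evaluate(seed) for seed in seeds]
    return {
        "mean": statistics.mean(errors),
        "std": statistics.stdev(errors),
        "min": min(errors),
    }

def fake_experiment(seed):
    """Placeholder for training a network; returns a synthetic error value."""
    return 1.0 + random.Random(seed).gauss(0.0, 0.05)

summary = summarize_over_seeds(fake_experiment, range(100))
print(summary)
```

Reporting the minimum alongside the mean separates what a model can reach with a lucky initialization from what it reaches on average.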
We plotted the performance of the 100 models learned from the 100 random seeds as boxplots, shown in Figure 4. A two-sample t-test comparing MRNN with RNN and LSTM shows that the advantage of MRNN is statistically significant.
[Figure 4: boxplots of model performance over 100 random seeds]
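A two-sample comparison of this kind can be sketched with the standard library alone. The text does not say which variant of the t-test was used, so the example below assumes Welch's unequal-variance version and computes only the statistic and its approximate degrees of freedom; the error values are synthetic and serve purely as an illustration.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Synthetic error values, for illustration only (not results from the paper).
errors_a = [0.90, 0.92, 0.88, 0.91, 0.89]
errors_b = [1.00, 1.03, 0.98, 1.01, 0.99]
t_stat, dof = welch_t(errors_a, errors_b)
print(t_stat, dof)
```

A large negative statistic here means the first group's errors are lower on average than the second's relative to their spread; converting it to a p-value requires the t distribution with `dof` degrees of freedom.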
Additionally, we tested our proposed long-term memory module in multilayer neural networks on the Spanish paper review dataset. To improve efficiency, we fixed the memory parameter $d$ to be constant over time, set the filter length $K$ to 50, and used the filter structure only in the first layer. The second layer of every network was an LSTM layer, and the comparison results are as follows. Although MLSTM and MLSTMF did not show significant advantages on the time series prediction datasets, their advantages were very clear on this natural language processing classification task. Two-sample t-tests comparing the accuracy of MLSTMF with that of RNN and LSTM all gave p-values below 0.05.
[Table: comparison results on the paper review dataset]
In the second experiment, we generated data with an RNN and used the eight models for prediction, with the results shown in the figure below. Apart from a slight increase in variance, the new models showed no significant disadvantage. This indicates that our new models are also applicable to datasets with mixed long-term and short-term memory.
[Figure: model performance on data generated by an RNN]
In the third experiment, we explored the impact of the hyperparameter $K$, the filter length, on model performance. The experiment compared $K$ = … and 100 in four scenarios. The conclusion is that MRNN performs best when $K$ = …, while MLSTM performs best when $K$ = …. We speculate that this may be due to the larger size of the MLSTM model, which makes it more difficult to train.


This paper first proves, from the perspective of time series, that RNNs and LSTMs do not have long-term memory. Using the filter structure of the fractionally integrated process, we then modified RNNs and LSTMs so that they can handle data with long-range dependencies. In time series prediction tasks, MRNN and MRNNF demonstrated advantages across various datasets, while MLSTM and MLSTMF performed comparably to the original LSTM and were sensitive to the filter length $K$.
In future work, we can continue to explore whether similar filters can bring similar advantages to other recurrent or feedforward networks. Additionally, compared to other models, the computational overhead of dynamic MRNN and MLSTM is relatively large, and future research can focus on further simplifying the models and exploring faster optimization methods. Finally, according to Definition 3, we can also attempt many other information decay patterns to model long-term memory sequences, for example, by directly allowing the weights of the filters…

