From | Zhihu
Author | Tian Yu Su
https://www.zhihu.com/question/278825804/answer/402634502
Editor | "Deep Learning This Small Matter" public account

I have done some similar work, so let me share my understanding.
The key to LSTM's effectiveness on sequence problems lies in its gates.
Take a simple sentiment classification problem as an example:

For example, take this sentence: we remove stop words, apply word embedding, and feed the result into a DNN. The sentence contains the positive words "good" (twice) and "like" (once), plus the negative words "not" and "no". Since positive words outnumber negative ones, the DNN tends to classify the sentence as positive; in reality, however, it expresses a negative sentiment. Both occurrences of "good" are negated by "not", and "like" is negated by "no", but a DNN has no sequential connections between hidden-layer nodes across time steps, so it fails to capture this information.
Whereas if we use an LSTM:

Because the LSTM propagates a cell state, shown as the chain of arrows in the LSTM diagram, it can capture such negation relationships and therefore output the correct sentiment score.
From the LSTM formulas (not considering peepholes), the forget gate is a sigmoid-activated unit with values between 0 and 1:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
The cell state update formula is:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
where $i_t$ is the input gate and $\tilde{c}_t$ is the candidate cell state. When the forget gate approaches 0, the historical information (from time $t-1$) is lost; when it approaches 1, more of the historical information is retained.
As the questioner mentioned, most positions in the forget gate are near 0 and only a few are near 1. The positions that are 1 correspond to the information the network needs to retain. I also agree with the questioner that this gate is somewhat like attention: information I need gets high attention (close to 1), while useless information gets no attention (close to 0). Likewise, if the information at some time step is important, its corresponding forget-gate positions stay close to 1, so that information keeps being passed down without being lost, which is one reason LSTM can handle long sequences.
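To make the gate equations concrete, here is a minimal sketch of a single LSTM step in PyTorch (no peephole), written to follow the formulas above; the weight matrices `W_f`, `W_i`, `W_c`, `W_o`, the biases, and the toy sizes are illustrative placeholders, not part of the original answer:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step (no peephole), following the equations above."""
    z = torch.cat([h_prev, x_t], dim=-1)      # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f + b_f)        # forget gate, values in (0, 1)
    i_t = torch.sigmoid(z @ W_i + b_i)        # input gate
    c_hat = torch.tanh(z @ W_c + b_c)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat          # cell state update: c_t = f_t * c_{t-1} + i_t * c~_t
    o_t = torch.sigmoid(z @ W_o + b_o)        # output gate
    h_t = o_t * torch.tanh(c_t)               # activation (hidden) state
    return h_t, c_t

# Toy usage: hidden size 4, input size 3, batch of 1
H, X = 4, 3
x_t, h_prev, c_prev = torch.randn(1, X), torch.zeros(1, H), torch.zeros(1, H)
Ws = [torch.randn(H + X, H) * 0.1 for _ in range(4)]
bs = [torch.zeros(H) for _ in range(4)]
h_t, c_t = lstm_step(x_t, h_prev, c_prev, *Ws, *bs)
```

When a position of `f_t` stays close to 1 across time steps, the corresponding component of `c_t` is carried forward almost unchanged, which is exactly the "retained information" described above.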
Next, let's talk about Bi-RNN. I think Bi-RNN is intuitively easier to understand. For instance, consider the sentence: "Although this person is very hardworking, he really has no achievements." An LSTM reads it from front to back; after the first half, the extracted information is "this person is hardworking", which looks like a positive message, and only by reading on do we realize that the key point is that he has achieved nothing. A Bi-RNN simulates human reading behavior well: it effectively reads the complete sentence first, because it also propagates information in the reverse direction. So by the time the Bi-RNN finishes reading "Although this person is hardworking", its reverse pass has already captured the information from the second half of the sentence, allowing it to make a more accurate judgment.
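As a small illustration (the sizes below are arbitrary placeholders), PyTorch's `nn.LSTM` with `bidirectional=True` runs a forward and a reverse pass and concatenates their hidden states at every time step, so the representation of each position already contains information from the rest of the sentence:

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, seq_len, batch = 50, 64, 12, 2          # arbitrary toy sizes
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, emb_dim)                     # embedded sentence
out, (h_n, c_n) = bilstm(x)

print(out.shape)   # (batch, seq_len, 2 * hidden_dim): forward and backward states concatenated
print(h_n.shape)   # (2, batch, hidden_dim): final state of each direction
```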
Supplement:
In response to the question from @Herry in the comments, let's discuss GRU. Simply put, GRU has a simpler structure than LSTM: it has only 2 gates while LSTM has 3, so it has fewer parameters to train and runs a bit faster (a quick parameter count is sketched below); in addition, GRU keeps only one state, merging LSTM's cell state and activation state into one.
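As a quick sanity check of the "fewer parameters" point (the layer sizes here are arbitrary), one can compare PyTorch's built-in `nn.LSTM` and `nn.GRU`: LSTM stacks 4 weight blocks (three gates plus the candidate) while GRU stacks 3 (two gates plus the candidate):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=128)   # 4 blocks: input, forget, output gates + candidate
gru  = nn.GRU(input_size=128, hidden_size=128)    # 3 blocks: reset, update gates + candidate

print(n_params(lstm))   # 4 * (128*128 + 128*128 + 128 + 128) = 132096
print(n_params(gru))    # 3 * (128*128 + 128*128 + 128 + 128) = 99072
```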
From the formulas, GRU has two gates, the reset gate and the update gate, both sigmoid-activated:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]), \qquad z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$
The reset gate resets the state at time $t-1$; like LSTM's gates, it is a sigmoid-activated output.
This reset gate is then used to calculate the candidate state:
$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$
From this formula we can see that the candidate state does not use $h_{t-1}$ directly for learning, but resets it first. In LSTM, by contrast, the candidate $\tilde{c}_t$ is calculated directly from $h_{t-1}$.
Another point is that GRU uses the update gate for both updating and forgetting:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
where $(1 - z_t)$ is equivalent to the forget gate in LSTM.
In LSTM, the forget gate and the input gate are separate and not directly tied to each other. GRU's simplification here is reasonable: it reduces computation without sacrificing much performance.
Furthermore, as mentioned above, GRU only has to control one state (its cell state and activation state are the same), while LSTM controls two: the cell state must pass through the output gate to produce the activation state.
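For comparison with the LSTM step sketched earlier, here is a minimal GRU step matching the formulas above (biases omitted for brevity; the weight names are illustrative):

```python
import torch

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step: a single state h_t, with (1 - z_t) playing the forget-gate role."""
    z_in = torch.cat([h_prev, x_t], dim=-1)                            # [h_{t-1}, x_t]
    r_t = torch.sigmoid(z_in @ W_r)                                    # reset gate
    z_t = torch.sigmoid(z_in @ W_z)                                    # update gate
    h_hat = torch.tanh(torch.cat([r_t * h_prev, x_t], dim=-1) @ W_h)   # candidate, built on the reset h_{t-1}
    h_t = (1 - z_t) * h_prev + z_t * h_hat                             # update and forget share the same gate
    return h_t                                                         # cell state and activation state coincide
```

Here `W_r`, `W_z`, and `W_h` would each have shape `(hidden + input, hidden)`.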
Finally, on their actual performance: in the models I have worked on, I hardly see any significant difference; both perform reasonably well. I generally prefer LSTM, perhaps because it was the first one I encountered, while GRU is somewhat faster and simpler in structure, which makes it suitable for quickly building model prototypes.
For a comparison of their performance, I recommend the paper "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling" (arXiv:1412.3555).