Source | Zhihu
Address | https://www.zhihu.com/question/62399257/answer/241969722
Author | Scofield
Editor | Machine Learning Algorithms and Natural Language Processing Public Account
I'll keep this brief; I'll write a more detailed article when I have time.
1. At a high level
As we all know, an LSTM can already handle sequence labeling on its own, predicting a label for each token (an LSTM followed by a classifier); a CRF does the same thing, predicting a label for each token.
However, their prediction mechanisms are different. A CRF is a globally normalized conditional model built around a state-transition (label-transition) probability matrix, and it predicts the labels of a given sample's tokens as one whole sequence. An LSTM (or RNNs in general; I won't distinguish them here) relies on the powerful nonlinear fitting ability of neural networks: during training it passes samples through a complex high-dimensional nonlinear transformation to learn a model, and then predicts a label for each token of a given sample.
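To make the "LSTM followed by a classifier" mechanism concrete, here is a minimal sketch; PyTorch and all the names and sizes below are my own illustrative choices, not from the original answer. Each time step's hidden state is mapped to label scores independently, so every token's label is predicted on its own.

```python
# Minimal sketch (assumptions: PyTorch, toy hyperparameters) of
# "LSTM followed by a classifier": each token gets its own label score,
# with no modelling of dependencies between neighbouring labels.
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)  # per-token classifier

    def forward(self, token_ids):                # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
        return self.classifier(h)                # (batch, seq_len, num_tags)

# Per-token prediction: an independent argmax at every position.
model = LSTMTagger(vocab_size=100, embed_dim=16, hidden_dim=32, num_tags=4)
scores = model(torch.randint(0, 100, (1, 7)))
pred = scores.argmax(dim=-1)                     # labels chosen token by token
```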
2. LSTM + CRF
Since an LSTM alone can do the job, why do researchers bother with a hybrid LSTM + CRF model?
Haha, because the labels predicted by a lone LSTM have problems! Take segmentation as an example (character-level BES tagging): a plain LSTM will produce results like:
Input: “学习出一个模型，然后再预测……” (learn a model, and then predict ...)
Expected output: 学/B 习/E 出/S 一/B 个/E 模/B 型/E ,/S 然/B 后/E 再/E 预/B 测/E ……
Real output: 学/B 习/E 出/S 一/B 个/B 模/B 型/E ,/S 然/B 后/B 再/E 预/B 测/E ……
As you can see, with the LSTM the overall prediction accuracy is indeed good, but it can produce errors such as one B directly following another B. This kind of error does not occur with a CRF, because the CRF's feature functions exist precisely to learn all sorts of features (n-grams, windows) from the given observation sequence, and these features capture relationships within a limited window. In general, it learns constraining rules (features) such as: B is followed by E, so a B directly followed by another B does not appear. These constraining features ensure that the CRF's predictions avoid the error above. Of course, a CRF can learn many more such constraining features, and the more, the better!
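To see how such a constraint plays out, here is a toy sketch (my own illustration, not from the answer) of a CRF-style transition score matrix that makes B → B so costly that Viterbi decoding never picks it; the tag set and scores are made up for the example.

```python
# Toy sketch (assumption: hand-picked scores, NumPy) of the CRF transition idea:
# a transition score matrix makes illegal moves such as B -> B so unlikely
# that decoding never produces them.
import numpy as np

tags = ["B", "E", "S"]
# Hypothetical transition scores trans[i, j] = score of moving from tag i to tag j.
# In a trained CRF these would come from learned feature weights.
trans = np.array([
    # to:   B      E      S
    [    -1e4,   2.0,  -1e4],   # from B: B must be followed by E
    [     1.0,  -1e4,   1.0],   # from E: a word just ended, so B or S may follow
    [     1.0,  -1e4,   1.0],   # from S: same as E
])

def viterbi(emissions, trans):
    """emissions: (seq_len, num_tags) per-token scores, e.g. from an LSTM."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + trans + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return [tags[i] for i in reversed(best)]

# Even though the per-token scores slightly prefer B at position 1,
# the huge B -> B penalty forces the legal sequence B E.
emissions = np.array([[2.0, 0.0, 0.0],
                      [1.1, 1.0, 0.0]])
print(viterbi(emissions, trans))   # ['B', 'E']
```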
So, let's hook a CRF up behind the LSTM: at each time step, the hidden-state tensor produced by the LSTM is fed into the CRF layer, so that under the constraints of the CRF's features, and driven by the new sequence-level loss function, the LSTM learns a new nonlinear transformation space.
Finally, needless to say, the results are indeed much better.
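For reference, a minimal BiLSTM-CRF might look like the sketch below. This is my own illustration rather than the code linked in the next line, and it assumes the third-party pytorch-crf package (`pip install pytorch-crf`) for the CRF layer.

```python
# Minimal BiLSTM-CRF sketch (assumptions: PyTorch + the pytorch-crf package,
# toy sizes). The CRF layer learns the label-transition matrix and provides a
# sequence-level loss, so the LSTM is trained under the CRF's constraints.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)  # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)       # learns transition scores

    def _emissions(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.emit(h)                              # (batch, seq_len, num_tags)

    def loss(self, token_ids, tags, mask=None):
        # Negative log-likelihood of the whole tag sequence (globally normalized).
        return -self.crf(self._emissions(token_ids), tags,
                         mask=mask, reduction="mean")

    def decode(self, token_ids, mask=None):
        # Viterbi decoding over emission + transition scores.
        return self.crf.decode(self._emissions(token_ids), mask=mask)

# Usage sketch
model = BiLSTMCRF(vocab_size=100, embed_dim=16, hidden_dim=32, num_tags=4)
x = torch.randint(0, 100, (2, 7))
y = torch.randint(0, 4, (2, 7))
model.loss(x, y).backward()
print(model.decode(x))   # best tag sequence for each sentence
```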
LSTM + CRF code: here. Feel free to grab it.
Hope this helps.