Editor: Yi Zhen
https://www.zhihu.com/question/62399257
This article is for academic sharing only. If there is infringement, it will be deleted.
Understanding LSTM Followed by CRF
Author: Scofield (https://www.zhihu.com/question/62399257/answer/241969722)
A quick answer for now; I will write a detailed article when I have time.
1. Perspective
Everyone knows that LSTM can handle sequence labeling problems, predicting a label for each token (LSTM followed by a classifier); CRF does the same, predicting a label for each token.
However, their prediction mechanisms are different.
A CRF predicts the label for each token in a given sample with a globally normalized conditional probability built on a label transition matrix; an LSTM (or RNNs in general, no distinction made here) relies on the nonlinear fitting power of neural networks: during training it maps the samples through a complex, high-dimensional nonlinear space to learn a model, and that model then predicts the label for each token in a given sample.
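To make the contrast concrete, here is a rough formulation in my own notation (a sketch of the usual objectives, not taken from the answers below). An LSTM with a per-token classifier normalizes each position independently, while a linear-chain CRF normalizes once over entire label sequences:

```latex
% Per-token softmax (LSTM + classifier): each position is normalized on its own
p(y_t \mid x) = \mathrm{softmax}(W h_t + b)_{y_t},
\qquad
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid x)

% Linear-chain CRF: one global normalization over all label sequences,
% with emission scores E_{t, y_t}, transition scores A_{y_{t-1}, y_t},
% and y_0 a conventional start tag
p(y \mid x) =
  \frac{\exp\Big( \sum_{t=1}^{T} \big( E_{t, y_t} + A_{y_{t-1}, y_t} \big) \Big)}
       {\sum_{y'} \exp\Big( \sum_{t=1}^{T} \big( E_{t, y'_t} + A_{y'_{t-1}, y'_t} \big) \Big)}
```

The transition term A is what lets the CRF score label pairs directly, which is exactly what a plain per-token classifier lacks.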
2. LSTM + CRF
Since an LSTM works fine on its own, why do researchers build the hybrid LSTM + CRF model?
Haha, because the labels predicted by a single LSTM have issues! For example, in Chinese word segmentation with character-level B/E/S tags, a plain LSTM may produce results like this:
input: “学习出一个模型,然后再预测……” (roughly: “learn a model, then make predictions …”)
expected output: 学/B 习/E 出/S 一/B 个/E 模/B 型/E ,/S 然/B 后/E 再/E 预/B 测/E ……
real output: 学/B 习/E 出/S 一/B 个/B 模/B 型/E ,/S 然/B 后/B 再/E 预/B 测/E ……
As you can see, with an LSTM alone the overall prediction accuracy is indeed good, but the error above still occurs: a B followed by another B. This kind of error does not occur with a CRF, because the CRF’s feature functions exist precisely to learn features of the given sequence (n-grams, windows), that is, constraints on the relationships between tokens within a limited window.
Typically it learns a rule (feature) like: B is followed by E, not by another B. Such constraint features keep the CRF from making the error in the example above. And of course, the CRF can learn many more constraint features like this, and the more the better!
Alright, then hook a CRF onto the LSTM: the hidden-state tensor the LSTM emits at each time step is fed into the CRF, so that, driven by the new (joint) loss function, the LSTM learns a new nonlinear transformation space under the constraints of the CRF’s features.
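To make this wiring concrete, here is a minimal PyTorch-style sketch of that joint objective (class and variable names are my own, not any particular paper’s or library’s implementation; explicit start/stop transitions are omitted for brevity): the LSTM supplies per-step emission scores, a learned transition matrix supplies the CRF part, and the loss is the CRF negative log-likelihood computed with the forward algorithm.

```python
import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    """Minimal BiLSTM-CRF sketch: LSTM emission scores + CRF transition scores."""

    def __init__(self, vocab_size, num_tags, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden_dim, num_tags)   # per-step emission scores
        # trans[i, j]: score of moving from tag i to tag j (learned jointly with the LSTM)
        self.trans = nn.Parameter(0.01 * torch.randn(num_tags, num_tags))

    def emissions(self, tokens):
        h, _ = self.lstm(self.embed(tokens))          # (batch, seq, hidden_dim)
        return self.emit(h)                           # (batch, seq, num_tags)

    def score_sentence(self, emissions, tags):
        """Emission + transition score of one gold tag sequence (single sentence)."""
        score = emissions[0, tags[0]]
        for t in range(1, emissions.size(0)):
            score = score + self.trans[tags[t - 1], tags[t]] + emissions[t, tags[t]]
        return score

    def log_partition(self, emissions):
        """Forward algorithm: log-sum-exp of the scores of all possible tag sequences."""
        alpha = emissions[0]                          # (num_tags,)
        for t in range(1, emissions.size(0)):
            # alpha[j] <- logsumexp_i(alpha[i] + trans[i, j]) + emissions[t, j]
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.trans, dim=0) + emissions[t]
        return torch.logsumexp(alpha, dim=0)

    def neg_log_likelihood(self, tokens, tags):
        """CRF loss for one unbatched sentence; backprop trains the LSTM through it."""
        emissions = self.emissions(tokens.unsqueeze(0)).squeeze(0)   # (seq, num_tags)
        return self.log_partition(emissions) - self.score_sentence(emissions, tags)
```

Minimizing `neg_log_likelihood` is what pushes the CRF’s sequence-level preferences back into the LSTM’s representation; at prediction time, Viterbi decoding over the same emission and transition scores replaces per-token argmax.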
In the end, it goes without saying that the results are indeed much better.
Author: Huo Hua (https://www.zhihu.com/question/62399257/answer/241969722)
In simple terms, a conditional random field can learn the context of the labels themselves; an LSTM with a softmax classifier can only learn contextual relationships among the features, not among the labels.
Author: Water (https://www.zhihu.com/question/62399257/answer/199998004)
One of the answers mentions an ACL 2016 paper that uses the classic CNN + LSTM + CRF architecture, which is a very mature system.
Currently, in the field of entity recognition, LSTM + CRF is a standard configuration. I believe that unless there are significant breakthroughs in attention mechanisms, this framework will not change in the short term.
To understand why LSTM needs a CRF layer afterwards, one must first understand the function of CRF.
The questioner presumably already understands the LSTM’s output, so let’s skip the principles. A sequence labeling task such as the NER case the questioner mentions is a sequence-to-sequence problem: in English, the model predicts one of four labels, for example B/I/E/O, for each input word. With an input of 100 words, the output is a 100 × 4 matrix of label scores. This is probably the source of the confusion: why not just use a classifier at each position to pick one of the four?
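Concretely, that “just use a classifier” baseline is per-token argmax over the 100 × 4 score matrix, roughly like the toy sketch below (shapes and tag names are only illustrative):

```python
import numpy as np

tags = ["B", "I", "E", "O"]
emissions = np.random.randn(100, 4)   # toy stand-in for the LSTM's 100 x 4 per-token scores

# Independent per-token classification: each position picks its best tag in isolation,
# so nothing stops illegal neighbours such as "B B" or "O E" from appearing.
pred = [tags[i] for i in emissions.argmax(axis=1)]
```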
First, consider the original motivation for using an LSTM: to decide the current token’s label with the context taken into account. A CRF actually works on a similar principle. A CRF can be described as a probabilistic graph; when using a CRF on its own, you need to extract as many features as possible for each object and learn the “linkage” relationships between them. Adding a CRF after the LSTM amounts to training the CRF on the abstracted linguistic relationships the LSTM has learned; for this you can use the likelihood function from that paper, or of course a label-wise objective, which is part of the tuning process.
In summary, as I understand it, the CRF reuses the information from the LSTM more effectively than a simple classifier does, which matches what is seen in practice. The questioner might as well find an example to experiment with, and it will become clear.
Author: uuisafresh (https://www.zhihu.com/question/62399257/answer/206903718)
My understanding of the B-LSTM + CRF model is that saying the CRF is layered on top of the LSTM is not a precise statement; put that way, there would actually be two separate layers of sequence models.
I believe it is more accurate to say that the LSTM and the CRF are fused together. The LSTM’s output provides only emission probabilities; although those already take context into account, since the gating mechanism can remember or forget earlier content and the network is bidirectional, they still lack transition probabilities. A CRF or an HMM, by contrast, combines emission and transition probabilities.
For instance, in sequence labeling tasks, the simplest BIO scheme has an obvious rule: B-X cannot be followed by I-Y. Hence B-LSTM + CRF was created, combining emission and transition probabilities. Strictly speaking, the CRF added at the back is not a true CRF: it has no feature templates, for instance, and does not take discrete features; it is merely a Viterbi-style derivation.
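Since at prediction time that back end reduces to Viterbi decoding over emission plus transition scores, here is a small self-contained sketch (names are mine; the hard -1e9 on a forbidden transition only illustrates how an illegal pair such as O → I, or B-X → I-Y in the typed case, gets ruled out, while a trained model would simply learn very low scores for it):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best tag sequence under per-step emission scores plus pairwise transition scores."""
    T, K = emissions.shape
    score = emissions[0].copy()                    # best score of a path ending in each tag at step 0
    back = np.zeros((T, K), dtype=int)             # backpointers
    for t in range(1, T):
        # cand[i, j]: best path ending in tag i at step t-1, then moving to tag j at step t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                  # follow backpointers from the end
        path.append(int(back[t, path[-1]]))
    return path[::-1]

tags = ["B", "I", "O"]
transitions = np.zeros((3, 3))
transitions[tags.index("O"), tags.index("I")] = -1e9   # forbid O -> I: an I must follow a B or an I
emissions = np.random.randn(6, 3)                      # toy emission scores for a 6-token sentence

print([tags[i] for i in viterbi(emissions, transitions)])
```

With only independent argmax over `emissions`, nothing prevents an O → I output; the transition term in the Viterbi recursion is what enforces (or, when learned, strongly prefers) valid label sequences.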
Recommended Reading:
A step-by-step guide to building word vector models with PaddlePaddle: SkipGram in practice
How to evaluate the fastText algorithm proposed by the author of Word2Vec? Does deep learning have no advantages in simple tasks like text classification?
From Word2Vec to Bert, discussing the past and present of word vectors (Part 1)