Advantages and Disadvantages of CRF and LSTM Models in Sequence Labeling


Editor: Yizhen

https://www.zhihu.com/question/46688107


Author: Xie Zhining, https://www.zhihu.com/question/46688107/answer/117448674

Both have their pros and cons:

LSTM: Models such as RNN, LSTM, and BILSTM are strong sequence models: they can capture long-range contextual information, and they can fit nonlinear relationships, which a CRF cannot match. At time t, the output y_t depends on the hidden state h_t (which carries the contextual information) and the current input x_t. However, the outputs y_t at different time steps are predicted independently of one another, which makes this a pointwise approach: decoding simply picks the highest-probability y_t at each step, and the y_t' at other steps have no influence on it. When there are strong dependencies among the labels (for example, an adjective is generally followed by a noun, which constrains adjacent tags), LSTM cannot model these output constraints, and that limits its performance; the toy example below shows the failure mode.
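A minimal sketch of that failure mode (pure NumPy; the tag set and probabilities are invented for the example):

```python
import numpy as np

# Made-up per-timestep probabilities from a hypothetical BILSTM tagger.
# Invented tag set: 0 = O, 1 = B-PER, 2 = I-PER.
probs = np.array([
    [0.50, 0.30, 0.20],   # t=0: "O" wins
    [0.35, 0.25, 0.40],   # t=1: "I-PER" wins
])

# Pointwise decoding: each y_t is chosen independently of the others, so
# nothing forbids the invalid transition O -> I-PER that a CRF's transition
# scores would penalize.
print([int(p.argmax()) for p in probs])   # [0, 2] -> "O I-PER", an illegal sequence
```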

CRF: Unlike LSTM-style models, it does not capture long-range contextual information; instead it scores a linear weighted combination of local features over the whole sentence (scanning the sentence with feature templates). The key point is that the CRF models p(y | x, w), where both y and x are sequences. It is listwise in spirit: it optimizes the whole sequence y = (y1, y2, …, yn) rather than one y_t at a particular moment, seeking the sequence that maximizes p(y1, y2, …, yn | x, w). It computes a joint conditional probability and optimizes the entire sequence, which is the actual goal, rather than stitching together per-step optima. In this respect, CRF beats LSTM. The standard linear-chain form is sketched below.
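For reference, the linear-chain CRF objective being described, with feature functions f_k, weights w_k, and partition function Z(x) summing over all candidate tag sequences:

```latex
p(y \mid x, w) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{n} \sum_{k} w_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{n} \sum_{k} w_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Decoding then finds the argmax over whole sequences with the Viterbi algorithm rather than picking each y_t separately.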

HMM: CRF is superior to HMM both in practice and in theory. The parameters of an HMM are the initial state distribution, the state-to-state transition probability matrix, and the emission probability matrix from states to observations. All of this information can be encoded in a CRF, for example by putting features such as h(y_1), f(y_{i-1}, y_i), and g(y_i, x_i) into the feature templates.
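Concretely, taking the log of the HMM's joint probability recovers exactly those three feature types, which shows the HMM is expressible as a special case of the CRF:

```latex
\log p(y, x) = \underbrace{\log p(y_1)}_{h(y_1)}
             + \sum_{i=2}^{n} \underbrace{\log p(y_i \mid y_{i-1})}_{f(y_{i-1},\, y_i)}
             + \sum_{i=1}^{n} \underbrace{\log p(x_i \mid y_i)}_{g(y_i,\, x_i)}
```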

CRF vs. LSTM: In terms of data scale, CRF tends to perform slightly better than BILSTM when the data are small; when the data are large, BILSTM should surpass CRF. In terms of scenarios, if the task does not depend heavily on long-range information, RNN-style models only add extra complexity; in that case one could consider a model like iFLYTEK's FSMN (a feedforward network that takes contextual information into account through windows).

CNN + BILSTM + CRF: currently a popular method in academia. BILSTM + CRF combines the advantages of the two models, while the CNN mainly targets English, where words are built from finer-grained letters that hide useful features (for example, prefix and suffix features). The CNN's convolution operation extracts these character-level features, which may not apply to Chinese (Chinese characters cannot be decomposed further, unless one works on top of word segmentation). For example, in part-of-speech tagging, words like "football" and "basketball" are more likely to be tagged as nouns, and the suffix "ball" is exactly such a feature; the toy sketch below shows the idea.
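A toy sketch of the idea in pure NumPy (random embeddings and filters stand in for learned ones): a width-3 convolution over character embeddings with max-over-time pooling, so the windows covering the shared suffix "ball" contribute identical activations for both words.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, WIDTH, FILTERS = 4, 3, 8                        # sizes invented for the toy
char_emb = {c: rng.normal(size=DIM) for c in "abcdefghijklmnopqrstuvwxyz"}
W = rng.normal(size=(FILTERS, WIDTH * DIM))          # filters shared across words

def char_cnn(word):
    """Slide width-3 filters over the character embeddings, then max-pool over time."""
    X = np.stack([char_emb[c] for c in word])                      # (len, DIM)
    windows = [X[i:i + WIDTH].ravel() for i in range(len(word) - WIDTH + 1)]
    return (np.stack(windows) @ W.T).max(axis=0)                   # (FILTERS,)

# Both words contain the windows "bal" and "all"; whenever one of those windows
# maximizes a filter, the two feature vectors agree in that dimension.
print(np.round(char_cnn("football"), 2))
print(np.round(char_cnn("basketball"), 2))
```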

BILSTM + CRF, TensorFlow version: https://github.com/chilynn/sequence-labeling. It mainly follows the implementation of GitHub – glample/tagger (a named entity recognition tool), which is based on Theano and updates parameters with single-sample SGD, so training is slow. The sequence-labeling implementation is based on TensorFlow and replaces plain SGD with mini-batch SGD. Because the samples in a batch have different lengths, they must be padded before training, and the final loss is computed through masking (based on each sample's actual length), roughly as in the sketch below.
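One common way to realize that masking, as a TF 1.x-style sketch (not the repo's actual code; the shapes and the tag count are placeholders):

```python
import tensorflow as tf  # TF 1.x API, matching the era of the repo

# Placeholders for a padded mini-batch (sizes are illustrative).
logits = tf.placeholder(tf.float32, [None, None, 9])   # (batch, max_len, n_tags)
labels = tf.placeholder(tf.int32, [None, None])        # (batch, max_len), padded
lengths = tf.placeholder(tf.int32, [None])             # each sample's true length

# Per-token cross-entropy, then zero out positions that are only padding.
token_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)                      # (batch, max_len)
mask = tf.sequence_mask(lengths, maxlen=tf.shape(labels)[1], dtype=tf.float32)
loss = tf.reduce_sum(token_loss * mask) / tf.reduce_sum(mask)
```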

Author: Stupid Captain, https://www.zhihu.com/question/46688107/answer/120114053

1. As a probabilistic graphical model, CRF is the more theoretically complete of the two, with a solid foundation behind every step. But its assumptions are equally explicit, and real problems do not always satisfy them. LSTM can in theory fit any function, which relaxes those assumptions considerably; on the other hand, the theory and interpretability of deep learning models are generally weak.

2. CRF is relatively hard to extend: adding edges or cycles to the graphical model means re-deriving the formulas and rewriting the code. With LSTM, stacking layers, switching to bidirectional, or swapping activation functions is a few minutes of hands-on work.

3. CRF does not cope well with large data. LSTM benefits from GPU acceleration and the standard big-data training routines such as multi-machine asynchronous SGD. The flip side is the usual one: with insufficient training data it can overfit badly, producing results that are hard to look at.

4. LSTM can model a sequence into an "intermediate state" representation, and that representation can be fed as features into other models.

Author: Wan Guanglu, https://www.zhihu.com/question/46688107/answer/136928113

LSTM and CRF are concepts at two different levels.

The core idea of CRF is to compute the global likelihood of a whole sequence, so it is better thought of as a choice of loss function; its pointwise counterpart is cross-entropy. CRF treats the sequence as a whole when computing the likelihood, rather than computing likelihoods of individual points, which is why it performs well on sequence labeling problems.
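Side by side, the two losses being contrasted: per-token cross-entropy versus the sequence-level CRF negative log-likelihood:

```latex
\mathcal{L}_{\text{pointwise}} = -\sum_{t=1}^{n} \log p(y_t \mid x),
\qquad
\mathcal{L}_{\text{CRF}} = -\log p(y_1, y_2, \dots, y_n \mid x)
```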

Even people who use LSTM models today put a CRF at the loss layer, which has been fairly well validated to work better. What LSTM corresponds to is the feature side of the original CRF model: a traditional CRF requires hand-picked features of many kinds, whereas the current mainstream solution favors an embedding layer plus BILSTM layers that learn the features directly, an end-to-end approach, roughly as in the sketch below.
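A minimal TF 1.x-style sketch of that stack, assuming tf.contrib.crf is available; all sizes (VOCAB, EMB, HIDDEN, TAGS) are invented placeholders, not anyone's published configuration:

```python
import tensorflow as tf  # TF 1.x; tf.contrib.crf provides the CRF ops used below

VOCAB, EMB, HIDDEN, TAGS = 10000, 100, 128, 9     # hypothetical sizes

words = tf.placeholder(tf.int32, [None, None])    # (batch, max_len) padded word ids
tags = tf.placeholder(tf.int32, [None, None])     # (batch, max_len) padded tag ids
lengths = tf.placeholder(tf.int32, [None])        # true length of each sample

# Embedding + BILSTM learn the features a traditional CRF took from templates.
emb = tf.nn.embedding_lookup(tf.get_variable("E", [VOCAB, EMB]), words)
fw, bw = tf.nn.rnn_cell.LSTMCell(HIDDEN), tf.nn.rnn_cell.LSTMCell(HIDDEN)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw, bw, emb, sequence_length=lengths, dtype=tf.float32)
h = tf.concat([out_fw, out_bw], axis=-1)          # (batch, max_len, 2*HIDDEN)

# Per-token unary scores; the CRF layer adds tag-transition scores on top.
scores = tf.layers.dense(h, TAGS)                 # (batch, max_len, TAGS)

# Sequence-level loss: negative log-likelihood of the whole tag sequence,
# evaluated only over each sample's true length, so padding is ignored.
log_lik, transitions = tf.contrib.crf.crf_log_likelihood(scores, tags, lengths)
loss = tf.reduce_mean(-log_lik)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```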
