Source: DeepHub IMBA
This article is approximately 1,000 words long; estimated reading time is 5 minutes.
This article will provide a complete summary of word embedding models.
TF-IDF, Word2Vec, GloVe, FastText, ELMo, CoVe, BERT, XLM, RoBERTa, ALBERT
The role of word embeddings in deep models is to provide input features for downstream tasks (such as sequence labeling and text classification). Over the past decade, many word embedding methods have been proposed; this article summarizes them, grouping the models into context-independent and context-dependent representations.
Context-Independent
Models in this category learn a single, fixed representation for each word: every word maps to one unique vector, regardless of the context in which it appears.
No Learning Required
Bag-of-words: A text (such as a sentence or document) is represented as the bag (multiset) of its words, ignoring grammar and word order but keeping word counts.
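As a quick illustration, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer (scikit-learn >= 1.0 assumed; the two example sentences are made up for the demo):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Each document becomes a vector of raw word counts; word order is discarded.
bow = CountVectorizer()
X = bow.fit_transform(docs)

print(bow.get_feature_names_out())  # learned vocabulary
print(X.toarray())                  # document-term count matrix
```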
TF-IDF: The score for a word in a document is obtained by multiplying its term frequency in that document (TF) by its inverse document frequency across the corpus (IDF), so words that are frequent in a given document but rare in the corpus receive the highest weights.
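A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer (same assumptions as above); note that scikit-learn uses a smoothed IDF and L2-normalizes each row by default, so the exact numbers depend on those settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# score(t, d) = tf(t, d) * idf(t); with the default settings,
# idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1 and rows are L2-normalized.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(3))))
```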
Learning Required
Word2Vec: A shallow (two-layer) neural network trained to reconstruct the linguistic contexts of words. Word2Vec can use either of two model architectures: Continuous Bag-of-Words (CBOW) or Continuous Skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words; in the Continuous Skip-gram architecture, the model uses the current word to predict the surrounding context window.
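A minimal Word2Vec sketch with gensim (gensim >= 4.0 assumed; the two-sentence corpus is only illustrative):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# sg=0 selects CBOW (predict the current word from its context window);
# sg=1 would select Continuous Skip-gram (predict the context from the word).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

vec = model.wv["cat"]                # 100-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours in embedding space
```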
GloVe (Global Vectors for Word Representation): Trained on aggregated global word-word co-occurrence statistics from a corpus, yielding representations that exhibit linear substructures in the word vector space (analogies such as king - man + woman ≈ queen).
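The linear substructure can be checked directly on pre-trained vectors. Below is a minimal loader for the standard whitespace-separated GloVe text format; "glove.6B.100d.txt" is an assumed local copy of one of the files distributed by the GloVe project:

```python
import numpy as np

def load_glove(path):
    """Read the whitespace-separated GloVe text format into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

glove = load_glove("glove.6B.100d.txt")  # assumed local file path

# Linear substructure: king - man + woman lands close to queen.
query = glove["king"] - glove["man"] + glove["woman"]
print(cosine(query, glove["queen"]))
```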
FastText: Unlike GloVe, it embeds words by treating each word as a bag of character n-grams rather than as a single atomic unit. This lets it build useful vectors not only for rare words but also for out-of-vocabulary words.
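A minimal FastText sketch with gensim (gensim >= 4.0 assumed) showing that a word never seen during training still receives a vector, assembled from its character n-grams:

```python
from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# min_n/max_n set the character n-gram range used to compose word vectors.
model = FastText(sentences, vector_size=100, window=5, min_count=1,
                 min_n=3, max_n=6)

print(model.wv["cat"].shape)      # in-vocabulary word
print(model.wv["catlike"].shape)  # OOV word, built from its n-grams
```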
Context-Dependent
Unlike context-independent word embeddings, context-dependent methods learn different embedding representations for the same word based on its context.
Based on RNN
ELMo (Embeddings from Language Models): Uses a character-based encoding layer followed by two BiLSTM layers in a neural language model; the internal states of these layers are combined to produce contextualized word representations.
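A minimal PyTorch sketch of the ELMo idea (a toy under stated assumptions, not the official implementation): a plain embedding table stands in for the character-based encoder, two stacked BiLSTMs add context, and the final representation is a learned weighted sum of all layer outputs:

```python
import torch
import torch.nn as nn

class TinyELMo(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        # Real ELMo builds token representations with a character CNN;
        # an embedding table stands in for it in this sketch.
        self.embed = nn.Embedding(vocab_size, dim)
        self.bilstm1 = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.bilstm2 = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.layer_weights = nn.Parameter(torch.zeros(3))  # one scalar per layer

    def forward(self, token_ids):
        e = self.embed(token_ids)   # layer 0: context-independent tokens
        h1, _ = self.bilstm1(e)     # layer 1: first BiLSTM
        h2, _ = self.bilstm2(h1)    # layer 2: second BiLSTM
        w = torch.softmax(self.layer_weights, dim=0)
        # Contextualized embedding: weighted sum over all layers.
        return w[0] * e + w[1] * h1 + w[2] * h2

model = TinyELMo(vocab_size=1000)
out = model(torch.randint(0, 1000, (2, 6)))  # batch of 2 sequences, length 6
print(out.shape)                             # torch.Size([2, 6, 128])
```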
CoVe (Contextualized Word Vectors): Uses the deep LSTM encoder of an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors.
Based on Transformers
BERT (Bidirectional Encoder Representations from Transformers): A transformer-based language representation model trained on large cross-domain corpora. It uses a masked language model to predict randomly masked words in a sequence and employs a next sentence prediction task to learn the relationship between sentences.
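The masked language modeling objective can be probed directly with the Hugging Face transformers library; a minimal sketch (downloading "bert-base-uncased" on first run requires an internet connection):

```python
from transformers import pipeline

# The fill-mask pipeline exposes BERT's masked-language-model head:
# the model predicts the token hidden behind [MASK] using context
# from both directions.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```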
XLM (Cross-lingual Language Model): A cross-lingual pretraining approach that learns cross-lingual representations both with an unsupervised objective relying only on monolingual corpora and with a supervised objective that combines parallel sentences from different languages under a new translation language modeling task, allowing the model to capture more cross-lingual information.
RoBERTa (Robustly Optimized BERT Pretraining Approach): Builds on BERT by modifying key hyperparameters, removing the next-sentence prediction pretraining objective, and training with much larger mini-batches and learning rates.
ALBERT (A Lite BERT for Self-supervised Learning of Language Representations): Proposes parameter-reduction techniques to reduce memory consumption and improve the training speed of BERT.