In-Depth Understanding of Word2Vec

Deep Learning

Author: louwill

From: Deep Learning Notes

Language models are one of the core concepts in natural language processing. Word2Vec is a neural-network-based language model and a method for learning word representations. It comes in two architectures, the Skip-gram model and the Continuous Bag of Words (CBOW) model, both of which essentially perform dimensionality reduction on the vocabulary.
Word2Vec
We can view an NLP language model as a supervised learning problem: given the context words, predict the target word, or given the target word, predict the context words. The mapping between input and output is the language model. The purpose of such a language model is to check whether a sequence of words conforms to the rules of natural language, or, more colloquially, whether it reads like something a human would say.
Thus, from this supervised-learning viewpoint, the protagonist of this article, Word2Vec, is a natural language model trained with neural networks. Word2Vec, proposed by Google in 2013, is an NLP analysis tool characterized by vectorizing vocabulary, allowing us to quantitatively analyze and mine the relationships between words. Word2Vec is therefore also a type of word embedding representation discussed in the previous lecture, but one whose vectors must be learned by training a neural network.
Word2Vec trains a neural network to obtain a language model about the relationship between input and output. Our focus is not on how well we can train this model, but on obtaining the trained neural network weights, which we will use for the vector representation of input vocabulary. Once we have the word vectors for all vocabulary in the training corpus, subsequent NLP research work becomes relatively easier.
Word2Vec includes two models. One is the Continuous Bag of Words (CBOW) model, which predicts the target word from its context words; the other is the Skip-gram model, which predicts the context words from a given target word and is also known as the "skip-word" model.
The application scenario of the CBOW model is predicting a target word from its context, so the input consists of the context words. Of course, raw words cannot be fed to the network directly; the input is the one-hot vector of each context word, and the output is the probability of each word in the vocabulary being the target word. The structure of the CBOW model is shown in Figure 1.
Figure 1 CBOW Model
The application scenario of the Skip-gram model is predicting context words from a target word, so the input is a single word, and the output is the probability of each word in the vocabulary being one of its context words. The structure of the Skip-gram model is shown in Figure 2.
Figure 2 Skip-gram Model
From the structural diagrams of the CBOW and Skip-gram models, we can see that apart from the difference in input and output, the two are very similar. Swapping the input and output layers of CBOW essentially yields the Skip-gram model, so the two can be understood as inverses of each other.
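To make this inverse relationship concrete, the short sketch below (a minimal illustration added here, not from the original text; the sentence and window size are arbitrary) generates training pairs from a sentence with a sliding window: CBOW pairs map a list of context words to the center word, while Skip-gram pairs map the center word to each context word.

```python
# Minimal sketch: build CBOW and Skip-gram training pairs with a sliding window.
# The sentence and window size here are arbitrary illustrative choices.
sentence = "we view the language model of nlp as a supervised learning problem".split()
window = 2  # number of context words taken on each side of the center word

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    # Context = words within `window` positions of the center word (excluding itself).
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))                  # CBOW: context -> target
    skipgram_pairs.extend((center, c) for c in context)   # Skip-gram: target -> each context word

print(cbow_pairs[2])       # (['we', 'view', 'language', 'model'], 'the')
print(skipgram_pairs[:4])  # [('we', 'view'), ('we', 'the'), ('view', 'we'), ('view', 'the')]
```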
From the perspective of supervised learning, Word2Vec is essentially a multi-class classification problem solved with a neural network. When the output vocabulary is very large, techniques such as Hierarchical Softmax and Negative Sampling are needed to speed up training. From the perspective of natural language processing, however, Word2Vec is not concerned with the neural network model itself but with the vector representations of words obtained after training. Because the final word-vector dimension is far smaller than the vocabulary size, Word2Vec is essentially a dimensionality reduction operation: it maps tens of thousands of words from a high-dimensional one-hot space to a low-dimensional dense space, which greatly benefits downstream NLP tasks.
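As a practical aside (not from the original text), libraries such as gensim expose these choices directly. The sketch below assumes gensim's Word2Vec API roughly as in version 4.x, where sg switches between CBOW and Skip-gram and hs/negative select Hierarchical Softmax or Negative Sampling; parameter names may differ in other versions.

```python
# Sketch under the assumption of gensim 4.x's Word2Vec API (parameter names may vary by version).
from gensim.models import Word2Vec

sentences = [["i", "learn", "nlp", "everyday"],
             ["word2vec", "maps", "words", "to", "vectors"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the learned word vectors (much smaller than the vocabulary size)
    window=2,         # context window size
    min_count=1,      # keep every word in this toy corpus
    sg=0,             # 0 = CBOW, 1 = Skip-gram
    hs=0,             # 1 = Hierarchical Softmax
    negative=5,       # number of negative samples (0 disables Negative Sampling)
)

print(model.wv["nlp"].shape)             # (100,)
print(model.wv.most_similar("nlp", topn=3))
```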
The Training Process of Word2Vec: Taking CBOW as an Example
Due to the similarity between Skip-gram and CBOW, this section will only illustrate how Word2Vec trains to obtain word vectors using the CBOW model as an example. Figure 3 highlights the parameters to be trained in the CBOW model. It is clear that we need to train the weights from the input layer to the hidden layer and from the hidden layer to the output layer.
Figure 3 Training Weights of CBOW
The basic steps for training the CBOW model are as follows (a short code sketch of these steps appears after the list):
  1. Represent the context words as one-hot vectors and use them as the model input; the dimension of each one-hot vector equals the vocabulary size V, and there is one such vector for each of the C context words;
  2. Then multiply the one-hot vectors of all context words by the shared input weight matrix;
  3. Average the vectors obtained in the previous step to form the hidden layer vector;
  4. Multiply the hidden layer vector by the shared output weight matrix;
  5. Apply the softmax activation to the resulting vector to obtain a V-dimensional probability distribution, and take the index with the highest probability as the predicted target word.
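The five steps above can be written out directly in NumPy. The sketch below is a minimal illustration with made-up dimensions and randomly initialized weights; the names W_in, W_out, V, and N are chosen here for clarity and are not from the original text.

```python
import numpy as np

V, N = 8, 4                      # assumed vocabulary size and hidden (embedding) dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))   # input weight matrix (shared by all context words)
W_out = rng.normal(size=(N, V))  # output weight matrix

def cbow_forward(context_ids):
    # Step 1: one-hot vectors of the context words, each of dimension V.
    one_hots = np.eye(V)[context_ids]          # shape (C, V)
    # Step 2: multiply each one-hot by the shared input weight matrix
    # (equivalent to looking up the corresponding rows of W_in).
    projected = one_hots @ W_in                # shape (C, N)
    # Step 3: average to form the hidden layer vector.
    hidden = projected.mean(axis=0)            # shape (N,)
    # Step 4: multiply by the shared output weight matrix.
    scores = hidden @ W_out                    # shape (V,)
    # Step 5: softmax gives a V-dimensional probability distribution.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return hidden, probs

hidden, probs = cbow_forward([1, 2, 5])        # three context word indices
print(probs.shape, int(probs.argmax()))        # predicted target word index
```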
Below is a specific example to illustrate. Assume the corpus is I learn NLP everyday, using I learn everyday as context words and NLP as the target word. Represent both context and target words as one-hot vectors as shown in Figure 4.
Figure 4 CBOW Training Process 1: Input One-hot Representation
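In code, this step amounts to indexing into an identity matrix. The tiny sketch below reproduces the one-hot representation for this four-word corpus; the word-to-index order is an arbitrary choice made here for illustration.

```python
import numpy as np

# Vocabulary of the toy corpus "I learn NLP everyday"; the index order is an arbitrary choice.
vocab = ["I", "learn", "NLP", "everyday"]
word2id = {w: i for i, w in enumerate(vocab)}

one_hot = np.eye(len(vocab))                 # row i is the one-hot vector of word i
context = [one_hot[word2id[w]] for w in ["I", "learn", "everyday"]]  # model inputs
target = one_hot[word2id["NLP"]]             # true label

print(context[0])  # [1. 0. 0. 0.]
print(target)      # [0. 0. 1. 0.]
```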
Next, multiply each one-hot representation by the input-layer weight matrix, which is also called the embedding matrix and can be randomly initialized, as shown in Figure 5.
Figure 5 CBOW Training Process 2: One-hot Input Multiplied by Embedding Matrix
Then, average the resulting vectors to form the hidden layer vector, as shown in Figure 6.
Figure 6 CBOW Training Process 3: Averaging
Then multiply the hidden layer vector by the output-layer weight matrix, which is likewise a randomly initialized embedding matrix, to obtain the output vector, as shown in Figure 7.
Figure 7 CBOW Training Process 4: Hidden Layer Vector Multiplied by Embedding Matrix
Finally, apply the softmax activation to the output vector to obtain the predicted probability distribution, compare it with the true one-hot label, and train both weight matrices by gradient-based optimization of the loss function, as shown in Figure 8.
Figure 8 CBOW Training Process 5: Softmax Activation Output
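For completeness (this part is summarized rather than spelled out in the original walkthrough), the comparison with the true label is typically done with a cross-entropy loss, and for this one-hidden-layer network the gradients of both weight matrices have a simple closed form. The sketch below shows a single plain-softmax SGD step; the dimensions, learning rate, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, lr = 4, 3, 0.1             # assumed vocabulary size, embedding size, learning rate
W_in = rng.normal(size=(V, N))
W_out = rng.normal(size=(N, V))

context_ids, target_id = [0, 1, 3], 2   # "I", "learn", "everyday" -> "NLP" in the toy corpus

# Forward pass (the same five steps as above).
hidden = W_in[context_ids].mean(axis=0)          # (N,)
scores = hidden @ W_out                          # (V,)
probs = np.exp(scores - scores.max()); probs /= probs.sum()
loss = -np.log(probs[target_id])                 # cross-entropy against the true one-hot label

# Backward pass: gradient of the cross-entropy loss through the softmax.
d_scores = probs.copy(); d_scores[target_id] -= 1.0   # (V,)
d_W_out = np.outer(hidden, d_scores)                  # (N, V)
d_hidden = W_out @ d_scores                           # (N,)

# One SGD step; each context row of W_in shares the hidden-layer gradient equally.
W_out -= lr * d_W_out
W_in[context_ids] -= lr * d_hidden / len(context_ids)

print(f"loss = {loss:.4f}")
```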
The above is the complete calculation process of the CBOW model, and it is also one of the basic ways in which Word2Vec trains words into word vectors. Whether the Skip-gram model or the CBOW model is used, Word2Vec generally produces high-quality word vector representations. Figure 9 visualizes 128-dimensional Skip-gram word vectors trained on a 50,000-word vocabulary and compressed into a 2-dimensional space.
Figure 9 Visualization of Word2Vec
It can be seen that semantically similar words cluster together, which further confirms that Word2Vec is a reliable method for word vector representation.
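As a closing usage note (an assumption about the typical workflow rather than anything stated in the original text), a 2-D picture like Figure 9 is usually produced by projecting the trained embedding matrix with t-SNE. The sketch below uses scikit-learn and matplotlib, with random stand-in vectors in place of real trained embeddings.

```python
# Sketch of the usual visualization workflow, assuming scikit-learn and matplotlib are available.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
words = [f"word_{i}" for i in range(200)]   # placeholder vocabulary
embeddings = rng.normal(size=(200, 128))    # stand-in for trained 128-d Skip-gram vectors

# Compress the 128-d vectors into 2-d for plotting.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)

plt.figure(figsize=(8, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in list(zip(words, coords))[:30]:   # label only a subset to keep the plot readable
    plt.annotate(word, (x, y), fontsize=7)
plt.title("2-D t-SNE projection of word vectors")
plt.savefig("word2vec_tsne.png", dpi=150)
```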
