In-Depth Understanding of Word2Vec Principles

Author: louwill
From: Deep Learning Notes
The language model is one of the core concepts in natural language processing. Word2Vec is a neural-network-based language model, and it is also a method of word representation. Word2Vec includes two structures, skip-gram and CBOW (Continuous Bag of Words), but essentially both perform a dimensionality reduction on the vocabulary.
Word2Vec
We can view an NLP language model as a supervised learning problem: given the context words, output the target word, or given a target word, output its context words. The mapping between input and output is the language model. The purpose of such a language model is to check whether a combination of words conforms to natural language; in simpler terms, to check whether it sounds like human language.
Therefore, based on the idea of supervised learning, the main character of this article—Word2Vec—is a natural language model trained based on neural networks. Word2Vec is an NLP analysis tool proposed by Google in 2013, characterized by vectorizing vocabulary, allowing us to quantitatively analyze and mine the connections between words. Thus, Word2Vec is also a type of word embedding representation we discussed in the previous lecture, except that this vectorized representation needs to be obtained through neural network training.
Word2Vec trains a neural network to obtain a language model regarding the relationship between input and output. Our focus is not on how well to train this model, but rather on obtaining the trained neural network weights, which are what we use for the vectorized representation of input vocabulary. Once we obtain the word vectors for all vocabulary in the training corpus, subsequent NLP research work becomes relatively easier.
Word2Vec includes two models. One model predicts the target word given context words, called the Continuous Bag of Words model (CBOW); the other predicts the context words based on a given word, called the skip-gram model, which is also referred to as the “jumping word” model.
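The difference between the two models shows up directly in how training samples are built from a sentence. Below is a minimal sketch of that (the function names are illustrative, not from any library): CBOW pairs a whole context window with one target word, while skip-gram pairs the target word with each context word individually.

```python
def cbow_pairs(tokens, window=2):
    """Each sample: (list of context words, target word)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Each sample: (target word, one context word)."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        for c in context:
            pairs.append((target, c))
    return pairs

tokens = "I learn NLP every day".split()
print(cbow_pairs(tokens)[2])   # (['I', 'learn', 'every', 'day'], 'NLP')
print(skipgram_pairs(tokens)[:2])
```

Reversing each skip-gram pair recovers (one word of) a CBOW pair, which is the "reversible relationship" between the two architectures.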
The application scenario of the CBOW model is to predict the target word based on context, so our input is context words. Of course, the original words cannot serve as input; here the input is still the one-hot vectors for each vocabulary, and the output is the probability of each word in the given vocabulary being the target word. The structure of the CBOW model is shown in Figure 1.
Figure 1 CBOW Model
The application scenario of the skip-gram model is to predict context words based on a target word, so our input is any word, and the output is the probability of each word in the given vocabulary being a context word. The structure of the skip-gram model is shown in Figure 2.
Figure 2 Skip-Gram Model
From the structure diagrams of the CBOW and skip-gram models, we can see that apart from the swapped input and output, there is essentially no significant difference between the two. Exchanging the input and output layers of CBOW essentially yields the skip-gram model; the two can be understood as mirror images of each other.
From the perspective of supervised learning, Word2Vec is essentially a multi-class classification problem based on neural networks. When the output vocabulary is very large, we need techniques like Hierarchical Softmax and Negative Sampling to accelerate training. However, from the perspective of natural language processing, Word2Vec is not focused on the neural network model itself but on the vectorized representation of vocabulary obtained after training. This representation allows the final word vector dimension to be much smaller than the vocabulary size, so Word2Vec is essentially a dimensionality reduction operation. We reduce tens of thousands of vocabularies from high-dimensional space to low-dimensional space, which greatly benefits downstream NLP tasks.
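To make the Negative Sampling idea concrete, here is a minimal NumPy sketch (an assumed toy setup, not the full algorithm) of the per-sample objective: instead of a softmax over the whole vocabulary, we score the true output word against a handful of randomly sampled "negative" words with a sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_in, u_pos, u_negs):
    """Negative-sampling loss for one training pair:
    -log sigmoid(u_pos . v_in) - sum_k log sigmoid(-u_neg_k . v_in)."""
    loss = -np.log(sigmoid(u_pos @ v_in))
    for u in u_negs:
        loss -= np.log(sigmoid(-(u @ v_in)))
    return loss

rng = np.random.default_rng(0)
d = 8                             # embedding dimension (toy value)
v_in = rng.normal(size=d)         # input vector of the current word
u_pos = rng.normal(size=d)        # output vector of the true word
u_negs = rng.normal(size=(5, d))  # output vectors of 5 sampled negatives
print(neg_sampling_loss(v_in, u_pos, u_negs))
```

The cost per sample is proportional to the number of negatives (here 5) rather than to the vocabulary size, which is what makes training on large vocabularies feasible.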
The Training Process of Word2Vec: Taking CBOW as an Example
Due to the similarity between skip-gram and CBOW, this section will only illustrate how Word2Vec trains to obtain word vectors using the CBOW model as an example. Figure 3 highlights the parameters that the CBOW model needs to train. It is clear that we need to train the weights from the input layer to the hidden layer and from the hidden layer to the output layer.
Figure 3 Training Weights of CBOW
The basic steps of training the CBOW model include:
  1. Represent the context words as one-hot vectors to serve as the model’s input, where V is the size of the vocabulary and C is the number of context words;
  2. Then multiply the one-hot vectors of all context words by the shared input weight matrix;
  3. Add and average the vectors obtained in the previous step to form the hidden layer vector;
  4. Multiply the hidden layer vector by the shared output weight matrix;
  5. Apply softmax activation to the resulting vector to obtain a probability distribution of dimension V, and take the index with the highest probability as the predicted target word.
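The five steps above can be sketched in a few lines of NumPy. This is a toy forward pass only (the vocabulary size, embedding dimension, and word indices are assumed for illustration); real implementations skip the explicit one-hot multiplication and simply look up rows of the embedding matrix.

```python
import numpy as np

V, N, C = 5, 3, 4   # vocab size, embedding dim, number of context words

rng = np.random.default_rng(42)
W_in = rng.normal(scale=0.1, size=(V, N))   # input (embedding) matrix
W_out = rng.normal(scale=0.1, size=(N, V))  # output weight matrix

# Step 1: one-hot context vectors (rows of the identity matrix)
context_ids = [0, 1, 3, 4]           # e.g. "I", "learn", "every", "day"
onehots = np.eye(V)[context_ids]     # shape (C, V)

# Steps 2-3: multiply by W_in and average -> hidden-layer vector
h = (onehots @ W_in).mean(axis=0)    # shape (N,)

# Step 4: multiply by W_out -> one score per vocabulary word
scores = h @ W_out                   # shape (V,)

# Step 5: softmax -> probability of each word being the target
probs = np.exp(scores - scores.max())
probs /= probs.sum()
predicted = int(np.argmax(probs))
print(probs, predicted)
```

Note that multiplying a one-hot vector by W_in just selects one row of W_in; those rows are exactly the word vectors we ultimately keep.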
Let’s illustrate this with a specific example. Suppose the corpus is “I learn NLP every day,” with “I learn every day” as context words and “NLP” as the target word. Both context and target words are represented as one-hot vectors as shown in Figure 4.
Figure 4 CBOW Training Process 1: Input One-Hot Representation
Next, multiply the one-hot representation by the input layer weight matrix, which is also called the embedding matrix and can be randomly initialized. As shown in Figure 5.
Figure 5 CBOW Training Process 2: One-Hot Input Multiplied by Embedding Matrix
Then, average the resulting vectors to form the hidden layer vector as shown in Figure 6.
Figure 6 CBOW Training Process 3: Averaging
Then multiply the hidden layer vector by the output layer weight matrix, which is also an embedding matrix that can be randomly initialized. The output vector is shown in Figure 7.
Figure 7 CBOW Training Process 4: Hidden Layer Vector Multiplied by Embedding Matrix
Finally, apply softmax activation to the output vector to obtain the actual output, compare it with the true label, and then optimize the weights by gradient descent on the loss function.
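This last step can be illustrated with one SGD update on a single toy CBOW sample (all sizes and indices below are assumed for illustration). The gradient of the cross-entropy loss with respect to the scores is simply the softmax output minus the one-hot label, and the input-matrix gradient is shared across the context words.

```python
import numpy as np

V, N, C, lr = 5, 3, 4, 0.1   # vocab size, embedding dim, contexts, step size
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))
W_out = rng.normal(scale=0.1, size=(N, V))

context_ids, target_id = [0, 1, 3, 4], 2

# Forward pass (row lookup replaces the one-hot multiplication)
h = W_in[context_ids].mean(axis=0)
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
loss = -np.log(probs[target_id])             # cross-entropy loss

# Backward pass
dscores = probs.copy()
dscores[target_id] -= 1.0                    # dL/dscores = probs - onehot
dW_out = np.outer(h, dscores)                # dL/dW_out
dh = W_out @ dscores                         # dL/dh

# SGD update; the context rows of W_in share the gradient dh / C
W_out -= lr * dW_out
W_in[context_ids] -= lr * dh / C

# Loss on the same sample after the update
h2 = W_in[context_ids].mean(axis=0)
s2 = h2 @ W_out
p2 = np.exp(s2 - s2.max())
p2 /= p2.sum()
print(loss, -np.log(p2[target_id]))
```

With a small learning rate, repeating this update over all (context, target) pairs in the corpus is what gradually shapes the rows of W_in into useful word vectors.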
Figure 8 CBOW Training Process 5: Softmax Activation Output
The above is the complete computation process of the CBOW model, and it is one of the basic ways Word2Vec trains vocabulary into word vectors. Whether with the skip-gram model or the CBOW model, Word2Vec generally provides high-quality word vector representations. Figure 9 shows a visualization of 128-dimensional skip-gram word vectors, trained on a 50,000-word vocabulary and compressed into 2-dimensional space.
Figure 9 Visualization Effect of Word2Vec
As can be seen, words with similar meanings are generally clustered together, proving that Word2Vec is a reliable word vector representation method.
