
Technical Column
Author: Yang Hangfeng
Editor: Zhang Nimei
1.Word2Vec Overview
Word2Vec is, in essence, a method that learns from text to represent the semantic information of words as word vectors: the original word space is mapped into a new space through an Embedding, so that semantically similar words end up close to each other in that space.
The neural probabilistic language model built on a traditional neural network suffers mainly from excessive computational cost, concentrated in the matrix operations between the hidden layer and the output layer and in the Softmax normalization over the output layer.
Word2Vec is therefore designed to optimize the neural probabilistic language model with respect to these two issues. The two important models in Word2Vec are the CBOW model and the Skip-gram model. For each of these two models, Word2Vec provides two training frameworks, one based on Hierarchical Softmax and one based on Negative Sampling; this article focuses on the former.
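Before diving into the math, here is a minimal usage sketch with the gensim library (gensim is my own choice for illustration and is not mentioned in this article), showing how the CBOW / Skip-gram and Hierarchical Softmax / Negative Sampling choices map onto training options:

```python
# A minimal sketch assuming gensim >= 4.x is installed;
# the toy corpus and parameter values are purely illustrative.
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "sleeps", "while", "the", "fox", "runs"],
]

model = Word2Vec(
    sentences,
    vector_size=50,    # dimension n of the word vectors
    window=2,          # context radius c, i.e. 2c surrounding words
    sg=0,              # 0 = CBOW, 1 = Skip-gram
    hs=1, negative=0,  # Hierarchical Softmax instead of Negative Sampling
    min_count=1,
)

print(model.wv["fox"].shape)          # (50,) word vector
print(model.wv.most_similar("fox"))   # semantically close words in the new space
```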
2.CBOW Model
2.1 Network Structure Based on Hierarchical Softmax
CBOW stands for Continuous Bag-of-Words. The CBOW model includes three layers: an input layer, a projection layer, and an output layer.
1. Input Layer: contains the word vectors of the $2c$ words in $\text{Context}(w)$, i.e. $v(\text{Context}(w)_1), v(\text{Context}(w)_2), \dots, v(\text{Context}(w)_{2c}) \in \mathbb{R}^n$, where $n$ denotes the length of a word vector.
2. Projection Layer: sums the $2c$ vectors from the input layer, i.e. $\mathbf{x}_w = \sum_{i=1}^{2c} v(\text{Context}(w)_i) \in \mathbb{R}^n$.
3. Output Layer: corresponds to a Huffman tree, which is constructed with the words appearing in the corpus as leaf nodes and their occurrence counts in the corpus as weights. In this Huffman tree, there are $N\,(=|\mathcal{D}|)$ leaf nodes corresponding to the words in the dictionary $\mathcal{D}$, and $N-1$ non-leaf nodes (the yellow nodes in the figure).
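As an illustration of how such a Huffman tree can be built from word counts, here is a small Python sketch (the function name `build_huffman` and the data layout are my own, not from the original article). It returns, for each word, its Huffman code and the internal nodes on its root-to-leaf path, which correspond to the codes $d^w_j$ and the non-leaf node vectors $\theta^w_j$ used in the next subsection:

```python
import heapq
import itertools
from collections import Counter

def build_huffman(word_counts):
    """Build a Huffman tree from {word: count}; return each word's binary code
    and the list of internal-node ids on its root-to-leaf path."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(cnt, next(tie), {"word": w}) for w, cnt in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                       # repeatedly merge the two lightest nodes
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), {"left": left, "right": right}))
    root = heap[0][2]

    codes, paths = {}, {}
    node_id = itertools.count()                # number the N-1 non-leaf nodes
    def walk(node, code, path):
        if "word" in node:                     # leaf: store code d_2..d_l and path
            codes[node["word"]] = code
            paths[node["word"]] = path
            return
        nid = next(node_id)
        walk(node["left"], code + [0], path + [nid])
        walk(node["right"], code + [1], path + [nid])
    walk(root, [], [])
    return codes, paths

counts = Counter("the quick brown fox jumps over the lazy dog the fox".split())
codes, paths = build_huffman(counts)
print(codes["the"], paths["the"])   # frequent words get shorter codes / paths
```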
2.2 Gradient Calculation
To facilitate the description of the problem later, we first provide a unified explanation of the symbols used in the model:
- $p^w$: the path from the root node to the leaf node corresponding to the word $w$;
- $l^w$: the number of nodes contained in the path $p^w$;
- $p^w_1, p^w_2, \dots, p^w_{l^w}$: the $l^w$ nodes contained in the path $p^w$, where $p^w_1$ represents the root node and $p^w_{l^w}$ represents the node corresponding to the word $w$;
- $d^w_2, d^w_3, \dots, d^w_{l^w} \in \{0, 1\}$: the Huffman code of the word $w$, which consists of $l^w - 1$ bits, where $d^w_j$ represents the code of the $j$-th node in the path $p^w$ (the root node does not correspond to a code);
- $\theta^w_1, \theta^w_2, \dots, \theta^w_{l^w-1} \in \mathbb{R}^n$: the vectors of the non-leaf nodes in the path $p^w$, where $\theta^w_j$ represents the vector corresponding to the $j$-th non-leaf node.
Thus, the idea of Hierarchical Softmax is as follows: for any word $w$ in the dictionary $\mathcal{D}$, there is a unique path $p^w$ in the Huffman tree from the root node to the leaf node corresponding to $w$. This path contains $l^w - 1$ branches, and each branch can be seen as a binary classification, so each classification corresponds to a probability. Multiplying these probabilities together gives $p(w \mid \text{Context}(w))$:
$$p(w \mid \text{Context}(w)) = \prod_{j=2}^{l^w} p\left(d^w_j \mid \mathbf{x}_w, \theta^w_{j-1}\right),$$
where
$$p\left(d^w_j \mid \mathbf{x}_w, \theta^w_{j-1}\right) = \left[\sigma(\mathbf{x}_w^{\top}\theta^w_{j-1})\right]^{1-d^w_j} \cdot \left[1-\sigma(\mathbf{x}_w^{\top}\theta^w_{j-1})\right]^{d^w_j},$$
and $\sigma(\cdot)$ denotes the sigmoid function. Through log-likelihood maximization, the objective function of the CBOW model is ($\mathcal{C}$ denotes the corpus):
$$\mathcal{L} = \sum_{w \in \mathcal{C}} \sum_{j=2}^{l^w} \left\{ (1-d^w_j)\cdot\log\sigma(\mathbf{x}_w^{\top}\theta^w_{j-1}) + d^w_j\cdot\log\left[1-\sigma(\mathbf{x}_w^{\top}\theta^w_{j-1})\right] \right\}.$$
For convenience, denote the term inside the braces by $\mathcal{L}(w, j)$.
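To make the two formulas above concrete, here is a small NumPy sketch that evaluates $\log p(w \mid \text{Context}(w))$ by walking the path $p^w$; the array `theta` (non-leaf node vectors) and the `code`/`path` lists are assumed to come from a Huffman tree such as the one sketched in Section 2.1, and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x_w, code, path, theta):
    """log p(w | Context(w)) under Hierarchical Softmax.

    x_w   : projection vector (sum of the context word vectors), shape (n,)
    code  : Huffman code d_2..d_l of the target word (list of 0/1)
    path  : indices of the non-leaf nodes on the root-to-leaf path of the word
    theta : matrix of non-leaf node vectors, shape (N - 1, n)
    """
    logp = 0.0
    for d_j, node in zip(code, path):
        q = sigmoid(x_w @ theta[node])          # probability of the "0" branch
        logp += (1 - d_j) * np.log(q) + d_j * np.log(1 - q)
    return logp
```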
The algorithm used to maximize the objective function is stochastic gradient ascent. First, consider the gradient of $\mathcal{L}(w, j)$ with respect to $\theta^w_{j-1}$:
$$\frac{\partial \mathcal{L}(w,j)}{\partial \theta^w_{j-1}} = \left[1-d^w_j-\sigma(\mathbf{x}_w^{\top}\theta^w_{j-1})\right]\mathbf{x}_w.$$
Therefore, the update formula for $\theta^w_{j-1}$ is ($\eta$ denotes the learning rate):
$$\theta^w_{j-1} := \theta^w_{j-1} + \eta\left[1-d^w_j-\sigma(\mathbf{x}_w^{\top}\theta^w_{j-1})\right]\mathbf{x}_w.$$
Next, consider the gradient of $\mathcal{L}(w, j)$ with respect to $\mathbf{x}_w$:
$$\frac{\partial \mathcal{L}(w,j)}{\partial \mathbf{x}_w} = \left[1-d^w_j-\sigma(\mathbf{x}_w^{\top}\theta^w_{j-1})\right]\theta^w_{j-1}.$$
Observing that $\mathbf{x}_w$ and $\theta^w_{j-1}$ appear symmetrically in $\mathcal{L}(w, j)$ makes this gradient easy to obtain: it is the previous gradient with $\mathbf{x}_w$ and $\theta^w_{j-1}$ swapped. Since $\mathbf{x}_w$ is the sum of all the word vectors in $\text{Context}(w)$, how should each individual word vector $v(\tilde{w})$, $\tilde{w} \in \text{Context}(w)$, be updated? The approach is very straightforward: simply apply the accumulated gradient to every context word vector, i.e.
$$v(\tilde{w}) := v(\tilde{w}) + \eta\sum_{j=2}^{l^w}\frac{\partial \mathcal{L}(w,j)}{\partial \mathbf{x}_w}, \qquad \tilde{w} \in \text{Context}(w).$$
2.3 CBOW Model Update Pseudocode
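Below is a minimal NumPy sketch of one stochastic gradient ascent step of the CBOW model, following the update formulas in Section 2.2. The names (`v`, `theta`, `codes`, `paths`) and the learning rate value are illustrative, and the Huffman `codes`/`paths` are assumed to come from a builder like the one sketched in Section 2.1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_step(w, context, v, theta, codes, paths, eta=0.025):
    """One CBOW + Hierarchical Softmax update for a single sample (Context(w), w).

    w       : index of the target word
    context : indices of the 2c context words
    v       : word vector matrix, shape (N, n), updated in place
    theta   : non-leaf node vector matrix, shape (N - 1, n), updated in place
    codes   : codes[w] = Huffman code d_2..d_l of word w (list of 0/1)
    paths   : paths[w] = non-leaf node indices on the root-to-leaf path of w
    """
    x_w = v[context].sum(axis=0)        # projection layer: sum of context vectors
    e = np.zeros_like(x_w)              # accumulates the gradient w.r.t. x_w
    for d_j, node in zip(codes[w], paths[w]):
        g = eta * (1 - d_j - sigmoid(x_w @ theta[node]))
        e += g * theta[node]            # d L(w,j) / d x_w, scaled by eta
        theta[node] += g * x_w          # update the non-leaf node vector
    for u in context:                   # every context word receives the same correction
        v[u] += e
```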
3.Skip-gram Model
3.1 Network Structure Based on Hierarchical Softmax
Similar to the CBOW model, the Skip-gram model also includes three layers: input layer, projection layer, and output layer:
- Input Layer: contains only the word vector $v(w) \in \mathbb{R}^n$ of the center word $w$ of the current sample $(w, \text{Context}(w))$.
- Projection Layer: an identity projection, which could in fact be omitted; it is kept only for convenience of comparison with the CBOW model's network structure.
- Output Layer: like the CBOW model, the output layer is also a Huffman tree.
3.2 Gradient Calculation
For the Skip-gram model, the current word $w$ is used to predict the words in its context $\text{Context}(w)$. Therefore, the key is to construct the conditional probability function $p(\text{Context}(w) \mid w)$. In the Skip-gram model, it is defined as:
$$p(\text{Context}(w) \mid w) = \prod_{u \in \text{Context}(w)} p(u \mid w).$$
In the above formula, each factor $p(u \mid w)$ can be handled with the Hierarchical Softmax idea introduced in the previous section, i.e. decomposed along the path $p^u$ in the Huffman tree. Therefore, we have:
$$p(u \mid w) = \prod_{j=2}^{l^u} p\left(d^u_j \mid v(w), \theta^u_{j-1}\right), \qquad p\left(d^u_j \mid v(w), \theta^u_{j-1}\right) = \left[\sigma(v(w)^{\top}\theta^u_{j-1})\right]^{1-d^u_j}\cdot\left[1-\sigma(v(w)^{\top}\theta^u_{j-1})\right]^{d^u_j}.$$
Through log-likelihood maximization, the objective function of the Skip-gram model is:
$$\mathcal{L} = \sum_{w \in \mathcal{C}} \sum_{u \in \text{Context}(w)} \sum_{j=2}^{l^u} \left\{ (1-d^u_j)\cdot\log\sigma(v(w)^{\top}\theta^u_{j-1}) + d^u_j\cdot\log\left[1-\sigma(v(w)^{\top}\theta^u_{j-1})\right] \right\}.$$
For convenience, denote the term inside the braces by $\mathcal{L}(w, u, j)$.
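For a single training sample $(w, \text{Context}(w))$, the triple sum above can be evaluated by reusing the `log_prob` helper sketched in Section 2.2 (again an illustrative helper of mine, not code from the original article):

```python
def skipgram_log_prob(w, context, v, theta, codes, paths):
    """log p(Context(w) | w): sum of path log-probabilities over the context words."""
    return sum(log_prob(v[w], codes[u], paths[u], theta) for u in context)
```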
First, consider the gradient of $\mathcal{L}(w, u, j)$ with respect to $\theta^u_{j-1}$:
$$\frac{\partial \mathcal{L}(w,u,j)}{\partial \theta^u_{j-1}} = \left[1-d^u_j-\sigma(v(w)^{\top}\theta^u_{j-1})\right]v(w),$$
so the update formula for $\theta^u_{j-1}$ is:
$$\theta^u_{j-1} := \theta^u_{j-1} + \eta\left[1-d^u_j-\sigma(v(w)^{\top}\theta^u_{j-1})\right]v(w).$$
Similarly, by the symmetry between $v(w)$ and $\theta^u_{j-1}$, the gradient of $\mathcal{L}(w, u, j)$ with respect to $v(w)$ is:
$$\frac{\partial \mathcal{L}(w,u,j)}{\partial v(w)} = \left[1-d^u_j-\sigma(v(w)^{\top}\theta^u_{j-1})\right]\theta^u_{j-1},$$
and the update formula for $v(w)$ is:
$$v(w) := v(w) + \eta\sum_{u \in \text{Context}(w)}\sum_{j=2}^{l^u}\frac{\partial \mathcal{L}(w,u,j)}{\partial v(w)}.$$
3.3 Skip-gram Model Update Pseudocode
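Analogously to the CBOW sketch in Section 2.3, here is a minimal NumPy sketch of one Skip-gram + Hierarchical Softmax update step, following the formulas in Section 3.2 (same illustrative naming conventions as before, not the original article's pseudocode):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_step(w, context, v, theta, codes, paths, eta=0.025):
    """One Skip-gram + Hierarchical Softmax update for a single sample (w, Context(w)).

    The center word vector v[w] predicts each context word u by walking u's
    Huffman path; v and theta are updated in place.
    """
    e = np.zeros_like(v[w])                 # accumulates the gradient w.r.t. v(w)
    for u in context:
        for d_j, node in zip(codes[u], paths[u]):
            g = eta * (1 - d_j - sigmoid(v[w] @ theta[node]))
            e += g * theta[node]            # d L(w,u,j) / d v(w), scaled by eta
            theta[node] += g * v[w]         # update the non-leaf node vector
    v[w] += e                               # finally update the center word vector
```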
4. Summary
Word2Vec, in essence, turns every word in natural language into a word vector with a unified semantic representation and a unified dimensionality. Only after natural language has been converted into vector form can related algorithms be built on top of it. As for the specific meaning of each dimension of the vector, it is unknown, and there is no need to know it. As the saying goes, it is mysterious and profound!
If you found this helpful, feel free to give it a like ❤️!

