Technical Column

Overview of Word2Vec Algorithm

Author: Yang Hangfeng

Editor: Zhang Nimei

1. Word2Vec Overview


Simply put, Word2Vec is a method that learns word vectors from text to capture the semantic information of words: it maps the original word space into a new embedding space, so that semantically similar words lie close to each other in that space.
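As a quick concrete illustration, the sketch below trains word vectors with the gensim library; the toy corpus and hyperparameter values are made up for demonstration and are not from this article.

```python
# Minimal sketch (assumes gensim 4.x): learn embeddings from a toy corpus
# and inspect which words land close together in the vector space.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # n: length of each word vector
    window=2,        # c: context words taken on each side
    min_count=1,
    sg=0,            # 0 = CBOW, 1 = Skip-gram
    hs=1,            # use the Hierarchical Softmax framework discussed here
)

# With a real corpus, semantically similar words end up near each other.
print(model.wv.most_similar("cat", topn=3))
```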

The neural probabilistic language model, built on a traditional neural network, suffers mainly from excessive computational load, particularly in the matrix multiplication between the hidden layer and the output layer and in the Softmax normalization over the output layer.
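To make the bottleneck concrete, here is a rough NumPy sketch of a full-softmax output layer; the vocabulary and vector sizes below are arbitrary illustrative assumptions.

```python
# Every training step of a full softmax touches all |V| output scores,
# so the cost is O(n * |V|) per step -- prohibitive for large vocabularies.
import numpy as np

n, V = 300, 50_000              # word-vector length, vocabulary size (made up)
h = np.random.randn(n)          # projection/hidden-layer output for one sample
W = np.random.randn(V, n)       # output weights: one row per vocabulary word

scores = W @ h                  # the expensive O(n * |V|) matrix-vector product
probs = np.exp(scores - scores.max())
probs /= probs.sum()            # normalization also sweeps all |V| entries
```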

Word2Vec is therefore designed to optimize the neural probabilistic language model with respect to these two issues. Word2Vec contains two important models, the CBOW model and the Skip-gram model, and for each of them it provides two training frameworks, one based on Hierarchical Softmax and one based on Negative Sampling. This article focuses on the first.

2. CBOW Model

2.1 Network Structure Based on Hierarchical Softmax

CBOW stands for Continuous Bag-of-Words; the model consists of three layers: an input layer, a projection layer, and an output layer.

1. Input Layer: contains the word vectors of the 2c words in $\mathrm{Context}(w)$:

$$\mathbf{v}(\mathrm{Context}(w)_1),\ \mathbf{v}(\mathrm{Context}(w)_2),\ \dots,\ \mathbf{v}(\mathrm{Context}(w)_{2c}) \in \mathbb{R}^n$$

where $n$ denotes the length of a word vector.

2. Projection Layer: sums the 2c vectors from the input layer:

$$\mathbf{x}_w = \sum_{i=1}^{2c} \mathbf{v}(\mathrm{Context}(w)_i) \in \mathbb{R}^n$$

[Figure: network structure of the CBOW model based on Hierarchical Softmax]

3. Output Layer: corresponds to a Huffman tree, which is constructed with the words appearing in the corpus as leaf nodes and their occurrence counts in the corpus as weights. In this Huffman tree there are $N\,(=|D|)$ leaf nodes, corresponding to the words in the dictionary $D$, and $N-1$ non-leaf nodes (the yellow nodes in the figure).
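To illustrate how such a tree can be built, here is a minimal Python sketch using heapq; the function and variable names are mine, and this is one standard textbook construction rather than the actual word2vec source (which uses an array-based build).

```python
# Build a Huffman tree from corpus counts and read off each word's code
# (d_2 ... d_{l_w}).  Convention here: lighter subtree -> bit 0.
import heapq
import itertools

def huffman_codes(word_counts):
    """Return {word: '010...'} Huffman codes for a {word: count} dict."""
    tie = itertools.count()  # tie-breaker so heapq never compares node dicts
    heap = [(c, next(tie), {"word": w}) for w, c in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:     # each merge adds one of the N-1 non-leaf nodes
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), {"left": left, "right": right}))
    codes = {}
    def walk(node, code):
        if "word" in node:
            codes[node["word"]] = code
        else:
            walk(node["left"], code + "0")
            walk(node["right"], code + "1")
    walk(heap[0][2], "")
    return codes

# Frequent words get short codes, i.e. short paths from the root.
print(huffman_codes({"the": 50, "cat": 10, "sat": 8, "mat": 3}))
```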

2.2 Gradient Calculation

To facilitate the description of the problem later, we first provide a unified explanation of the notation. Consider a leaf node of the Huffman tree and assume it corresponds to the word $w$ in the dictionary $D$:

  • $p^w$: the path from the root node to the leaf node corresponding to $w$;

  • $l^w$: the number of nodes contained in the path $p^w$;

  • $p_1^w, p_2^w, \dots, p_{l^w}^w$: the nodes of the path $p^w$, where $p_1^w$ represents the root node and $p_{l^w}^w$ represents the node corresponding to the word $w$;

  • $d_2^w, d_3^w, \dots, d_{l^w}^w \in \{0, 1\}$: the Huffman code of the word $w$, which consists of $l^w - 1$ bits; $d_j^w$ represents the code of the $j$-th node on the path $p^w$ (the root node does not correspond to a code);

  • $\theta_1^w, \theta_2^w, \dots, \theta_{l^w-1}^w \in \mathbb{R}^n$: the vectors of the non-leaf nodes on the path $p^w$; $\theta_j^w$ represents the vector of the $j$-th non-leaf node.

Thus, the idea of Hierarchical Softmax is as follows: for any word $w$ in the dictionary $D$, there is a unique path $p^w$ in the Huffman tree from the root node to the leaf node corresponding to $w$. This path contains $l^w - 1$ branches, and each branch can be regarded as a binary classification; each classification yields a probability, and multiplying these probabilities together gives $p(w \mid \mathrm{Context}(w))$.

$$p(w \mid \mathrm{Context}(w)) = \prod_{j=2}^{l^w} p\!\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right)$$

where $p(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w) = \left[\sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right]^{1-d_j^w} \cdot \left[1 - \sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right]^{d_j^w}$. Through log-likelihood maximization, the objective function of the CBOW model is:

$$\mathcal{L} = \sum_{w \in \mathcal{C}} \log \prod_{j=2}^{l^w} \left\{ \left[\sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right]^{1-d_j^w} \cdot \left[1 - \sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right]^{d_j^w} \right\} = \sum_{w \in \mathcal{C}} \sum_{j=2}^{l^w} \left\{ (1 - d_j^w) \log \sigma(\mathbf{x}_w^\top \theta_{j-1}^w) + d_j^w \log\!\left[1 - \sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right] \right\}$$
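As a sanity check on the notation, the NumPy sketch below evaluates one word's contribution to this objective, i.e. the sum over its path of the terms in braces (argument names are mine, for illustration only):

```python
# Per-word log-likelihood under Hierarchical Softmax: sum over the l_w - 1
# binary decisions along the word's path.  `code` holds d_2..d_{l_w} as 0/1
# ints and `thetas` holds theta_1..theta_{l_w - 1} (illustrative names).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(x_w, code, thetas):
    total = 0.0
    for d_j, theta in zip(code, thetas):
        q = sigmoid(x_w @ theta)  # sigma(x_w^T theta_{j-1}^w)
        total += (1 - d_j) * np.log(q) + d_j * np.log(1.0 - q)
    return total

rng = np.random.default_rng(0)
x_w = rng.normal(size=50)                         # projection-layer output
thetas = [rng.normal(size=50) for _ in range(3)]  # non-leaf vectors on the path
print(log_likelihood(x_w, code=[1, 0, 1], thetas=thetas))
```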

For convenience, denote the term in braces by $\mathcal{L}(w, j)$. The algorithm used to maximize the objective function is stochastic gradient ascent. First, consider the gradient of $\mathcal{L}(w, j)$ with respect to $\theta_{j-1}^w$:

$$\frac{\partial \mathcal{L}(w, j)}{\partial \theta_{j-1}^w} = \left[1 - d_j^w - \sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right] \mathbf{x}_w$$

Therefore, the update formula for $\theta_{j-1}^w$ is:

$$\theta_{j-1}^w := \theta_{j-1}^w + \eta \left[1 - d_j^w - \sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right] \mathbf{x}_w$$

where $\eta$ is the learning rate.

Next, consider the gradient of $\mathcal{L}(w, j)$ with respect to $\mathbf{x}_w$:

$$\frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w} = \left[1 - d_j^w - \sigma(\mathbf{x}_w^\top \theta_{j-1}^w)\right] \theta_{j-1}^w$$

Observe that $\mathbf{x}_w$ and $\theta_{j-1}^w$ play symmetric roles in $\mathcal{L}(w, j)$, so this gradient follows immediately from the previous one. Since $\mathbf{x}_w$ is the sum of all the word vectors in $\mathrm{Context}(w)$, how should each component $\mathbf{v}(\tilde{w})$, $\tilde{w} \in \mathrm{Context}(w)$, be updated? The approach is very straightforward: simply take

$$\mathbf{v}(\tilde{w}) := \mathbf{v}(\tilde{w}) + \eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w}, \qquad \tilde{w} \in \mathrm{Context}(w)$$

2.3 CBOW Model Update Pseudocode

For a training sample $(\mathrm{Context}(w), w)$, the update step implied by the formulas above can be written as:

1. $\mathbf{e} = 0$, $\quad \mathbf{x}_w = \sum_{u \in \mathrm{Context}(w)} \mathbf{v}(u)$
2. For $j = 2, \dots, l^w$:
   (a) $q = \sigma(\mathbf{x}_w^\top \theta_{j-1}^w)$
   (b) $g = \eta\,(1 - d_j^w - q)$
   (c) $\mathbf{e} := \mathbf{e} + g\,\theta_{j-1}^w$
   (d) $\theta_{j-1}^w := \theta_{j-1}^w + g\,\mathbf{x}_w$
3. For every $u \in \mathrm{Context}(w)$: $\mathbf{v}(u) := \mathbf{v}(u) + \mathbf{e}$
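The same step as a NumPy sketch, assuming the Huffman codes and non-leaf vectors have already been built (for instance with the huffman_codes sketch from section 2.1); the container and function names are mine, not from the word2vec source:

```python
# One CBOW / Hierarchical Softmax update.  `V` maps each word to its vector
# v(u); `code` and `thetas` are d_2..d_{l_w} and theta_1..theta_{l_w - 1}
# for the target word w (all names illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_step(V, context, code, thetas, eta=0.025):
    x_w = sum(V[u] for u in context)   # projection layer: sum of 2c vectors
    e = np.zeros_like(x_w)             # accumulates eta * dL/dx_w
    for d_j, theta in zip(code, thetas):
        q = sigmoid(x_w @ theta)       # sigma(x_w^T theta_{j-1}^w)
        g = eta * (1 - d_j - q)
        e += g * theta                 # gradient contribution for x_w
        theta += g * x_w               # update theta_{j-1}^w in place
    for u in context:                  # same correction for every context word
        V[u] += e
```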

3. Skip-gram Model

3.1 Network Structure Based on Hierarchical Softmax

Similar to the CBOW model, the Skip-gram model also includes three layers: an input layer, a projection layer, and an output layer:

  1. Input Layer: contains only the word vector $\mathbf{v}(w)$ of the center word $w$ of the current sample $(w, \mathrm{Context}(w))$.

  2. Projection Layer: an identity projection, which is actually optional; it is kept merely for convenience of comparison with the network structure of the CBOW model.

[Figure: network structure of the Skip-gram model based on Hierarchical Softmax]

  3. Output Layer: like the CBOW model, the output layer is also a Huffman tree.

3.2 Gradient Calculation

For the Skip-gram model, the current word $w$ is known and is used to predict the words in its context $\mathrm{Context}(w)$. The key is therefore to construct the conditional probability function $p(\mathrm{Context}(w) \mid w)$, which the Skip-gram model defines as:

$$p(\mathrm{Context}(w) \mid w) = \prod_{u \in \mathrm{Context}(w)} p(u \mid w)$$

In the above formula, $p(u \mid w)$ can be written, following the Hierarchical Softmax idea introduced in the previous section, as:

$$p(u \mid w) = \prod_{j=2}^{l^u} \left[\sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)\right]^{1-d_j^u} \cdot \left[1 - \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)\right]^{d_j^u}$$
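A small NumPy sketch of this product of binary-classification probabilities (helper names are my own):

```python
# p(u | w) under Hierarchical Softmax: descend u's path, multiplying
# sigma(...) for a 0-bit and 1 - sigma(...) for a 1-bit.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_u_given_w(v_w, code_u, thetas_u):
    p = 1.0
    for d_j, theta in zip(code_u, thetas_u):
        q = sigmoid(v_w @ theta)             # sigma(v(w)^T theta_{j-1}^u)
        p *= q if d_j == 0 else (1.0 - q)    # [q]^(1-d_j) * [1-q]^(d_j)
    return p
```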

Through log-likelihood maximization, the objective function of the Skip-gram model is:

$$\mathcal{L} = \sum_{w \in \mathcal{C}} \sum_{u \in \mathrm{Context}(w)} \sum_{j=2}^{l^u} \left\{ (1 - d_j^u) \log \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u) + d_j^u \log\!\left[1 - \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)\right] \right\}$$

where the term in braces is denoted $\mathcal{L}(w, u, j)$.

First, consider the gradient of $\mathcal{L}(w, u, j)$ with respect to $\theta_{j-1}^u$:

$$\frac{\partial \mathcal{L}(w, u, j)}{\partial \theta_{j-1}^u} = \left[1 - d_j^u - \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)\right] \mathbf{v}(w)$$

Thus, the update formula for $\theta_{j-1}^u$ is:

$$\theta_{j-1}^u := \theta_{j-1}^u + \eta \left[1 - d_j^u - \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)\right] \mathbf{v}(w)$$

Next, consider the gradient of $\mathcal{L}(w, u, j)$ with respect to $\mathbf{v}(w)$ (which can also be obtained directly from the symmetry):

$$\frac{\partial \mathcal{L}(w, u, j)}{\partial \mathbf{v}(w)} = \left[1 - d_j^u - \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)\right] \theta_{j-1}^u$$
Thus, the update formula for $\mathbf{v}(w)$ is:

$$\mathbf{v}(w) := \mathbf{v}(w) + \eta \sum_{u \in \mathrm{Context}(w)} \sum_{j=2}^{l^u} \left[1 - d_j^u - \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)\right] \theta_{j-1}^u$$

3.3 Skip-gram Model Update Pseudocode

For a training sample $(w, \mathrm{Context}(w))$, the update step implied by the formulas above can be written as follows (note that $\mathbf{v}(w)$ is updated once per context word, as in the original word2vec implementation, rather than once at the very end):

1. For every $u \in \mathrm{Context}(w)$:
   (a) $\mathbf{e} = 0$
   (b) For $j = 2, \dots, l^u$:
       i. $q = \sigma(\mathbf{v}(w)^\top \theta_{j-1}^u)$
       ii. $g = \eta\,(1 - d_j^u - q)$
       iii. $\mathbf{e} := \mathbf{e} + g\,\theta_{j-1}^u$
       iv. $\theta_{j-1}^u := \theta_{j-1}^u + g\,\mathbf{v}(w)$
   (c) $\mathbf{v}(w) := \mathbf{v}(w) + \mathbf{e}$
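The same step as a NumPy sketch, assuming a container `paths[u]` that yields the code bits and non-leaf vectors for context word u (names are mine, not from the word2vec source):

```python
# One Skip-gram / Hierarchical Softmax update for a sample (w, Context(w)).
# `V` maps words to vectors; `paths[u]` returns (code, thetas) for word u.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_step(V, w, context, paths, eta=0.025):
    for u in context:                  # one tree descent per context word
        code, thetas = paths[u]
        e = np.zeros_like(V[w])        # accumulates eta * dL/dv(w) for this u
        for d_j, theta in zip(code, thetas):
            q = sigmoid(V[w] @ theta)  # sigma(v(w)^T theta_{j-1}^u)
            g = eta * (1 - d_j - q)
            e += g * theta
            theta += g * V[w]          # update theta_{j-1}^u in place
        V[w] += e                      # update the center word's vector
```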

4. Summary

In essence, Word2Vec turns every word in natural language into a word vector of uniform, fixed dimensionality. Only after natural language has been converted into vector form can we build further algorithms on top of it. As for what each individual dimension of a vector specifically means, that is unknown, and there is no need to know; as the saying goes, it is mysterious and profound!

If you found this helpful, feel free to give it a like ❤️!
