Embedding is one of the most fascinating ideas in machine learning. If you’ve ever used Siri, Google Assistant, Alexa, Google Translate, or even your smartphone keyboard for next-word prediction, you have likely benefited from this concept, which has become central to natural language processing models.
Over the past few decades, embedding techniques for neural network models have developed considerably. Most recently, this includes contextual embeddings, which power cutting-edge models like BERT and GPT-2.
BERT:
https://jalammar.github.io/illustrated-bert/
Word2vec is an effective method for creating word embeddings that has been around since 2013. But beyond being a method for word embeddings, some of its concepts have proven effective in creating recommendation engines and understanding sequential data in commercial, non-linguistic tasks. Companies like Airbnb, Alibaba, and Spotify have drawn inspiration from the NLP field and applied it to their products, powering new types of recommendation engines.
In this article, we will discuss the concept of embeddings and the mechanism for generating embeddings using word2vec. Let’s start with an example to familiarize ourselves with using vectors to represent things. Did you know your personality can be represented by just a list of five numbers (vectors)?
Personality Embedding: What Kind of Person Are You?
How can we represent how introverted/extroverted you are on a scale from 0 to 100 (where 0 is the most introverted and 100 is the most extroverted)? Have you ever taken personality tests like MBTI or the Big Five personality traits test? If you haven’t, these tests will ask you a series of questions and then score you on many dimensions, one of which is introversion/extroversion.
Example results of the Big Five personality traits test. It can really tell you a lot about yourself and has predictive power for academic, personal, and professional success.
Suppose my introversion/extroversion score is 38/100. We can visualize it this way:
Let’s narrow the range to -1 to 1:
When you only know this one piece of information, how well do you think you understand this person? Not very well. People are complex, so let’s add another test score as a new dimension.
We can represent the two dimensions as a point on a graph, or as a vector from the origin to that point. We have great tools to handle the upcoming vectors.
I have hidden the personality traits we are plotting so that you gradually get used to extracting valuable information from a personality vector representation without knowing what each dimension represents.
We can now say this vector partially represents my personality. This representation comes in handy when you want to compare two other people with me. Suppose I got hit by a bus and need to be replaced by someone with a similar personality; which of the two people in the following diagram is more like me?
When dealing with vectors, a common method for calculating similarity scores is cosine similarity:
Person 1 is more similar to me in personality. Vectors pointing in the same direction (length matters too) have a higher cosine similarity.
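As a quick illustration (a minimal sketch of my own, with hypothetical values, not part of the original article), cosine similarity takes only a few lines of numpy, and it works the same way for the two-dimensional vectors here as for the higher-dimensional ones later:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction, -1 = opposite)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

me       = [-0.4,  0.8]   # hypothetical two-dimensional personality vectors
person_1 = [-0.3,  0.2]
person_2 = [-0.5, -0.4]

print(cosine_similarity(me, person_1))  # higher score -> more similar direction
print(cosine_similarity(me, person_2))
```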
Still, two dimensions are not enough to capture sufficient information about how different people vary. Psychology has identified five major personality traits (and a large number of subtraits), so let’s compare using all five dimensions:
The problem with using five dimensions is that we can no longer neatly plot small arrows on a two-dimensional plane. This is a common issue in machine learning, where we often need to think in higher-dimensional spaces. The good news is that cosine similarity still works; it applies to any number of dimensions:
Cosine similarity applies to any number of dimensions. These scores are better than last time because they are computed from higher-dimensional representations of the things being compared.
At the end of this section, I would like to present two central ideas:
1. We can represent people and things as algebraic vectors (which is great for machines!).
2. We can easily calculate the relationships between similar vectors.
Word Embeddings
With the understanding from the above, let’s continue to look at examples of trained word vectors (also known as word embeddings) and explore some of their interesting properties.
This is a word embedding for the word “king” (trained GloVe vectors from Wikipedia):
[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ]
This is a list of 50 numbers. We can’t see much by observing the values, but let’s visualize it a bit to compare with other word vectors. We will place all these numbers in a row:
Let’s color-code the cells based on their values (red for those close to 2, white for those close to 0, and blue for those close to -2):
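If you want to reproduce this kind of color-coding yourself, here is a minimal matplotlib sketch (my own illustration; only the first few of the 50 GloVe values are included for brevity):

```python
import numpy as np
import matplotlib.pyplot as plt

# The first few of the 50 GloVe values for "king" listed above (truncated for brevity).
king = np.array([0.50451, 0.68607, -0.59517, -0.022801, 0.60046, -0.13498])

plt.figure(figsize=(10, 1))
# One row of colored cells: values near +2 show up red, near 0 white, near -2 blue.
plt.imshow([king], cmap="RdBu_r", vmin=-2, vmax=2, aspect="auto")
plt.yticks([])
plt.show()
```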
We will ignore the numbers and only look at the colors to indicate the values of the cells. Now let’s compare “king” with other words:
Notice how “Man” and “Woman” are more similar to each other than either is to “King”? This suggests something. These vector illustrations nicely showcase the information/meaning/associations of these words.
This is another example list (look for columns with similar colors by vertically scanning):
There are several points to note:
1. A straight red column runs through all these different words. They are similar along that dimension (though we don’t know what each dimension represents).
2. You can see that “woman” and “girl” are similar in many respects, and so are “man” and “boy”.
3. “boy” and “girl” also have similarities to each other, but these differ from those with “woman” or “man”. Can these be summarized into a vague concept of “youth”? Perhaps.
4. Except for the last word, all words represent people. I added the object “water” to show the differences across categories. You can see that the blue column continues down and stops before the word embedding for “water”.
5. “king” and “queen” are similar to each other, but they are different from other words. Can these be summarized into a vague concept of “royalty”?
Analogy
A famous example that showcases the wonderful properties of embeddings is analogy. We can add and subtract word embeddings to get interesting results. A famous example is the formula: “king” – “man” + “woman”:
Using the Gensim library in Python, we can add and subtract word vectors, and it will find the words most similar to the resulting vector. The image shows a list of the most similar words, each with cosine similarity.
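If you want to try this yourself, here is a small sketch using Gensim’s downloader and the pretrained 50-dimensional GloVe vectors (downloaded on first use); the exact list you get may differ slightly from the figure:

```python
import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman": find the words closest to the resulting vector
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
# Expect "queen" near the top of the list, each entry paired with a cosine similarity.
```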
We can visualize this analogy as before:
The vector generated by “king – man + woman” is not exactly equal to “queen”, but “queen” is the closest word to it among the 400,000 word embeddings included in this set.
Now that we’ve seen trained word embeddings, let’s learn more about the training process. But before we start using word2vec, we need to look at the parent concept of word embeddings: neural language models.
Language Models
If we were to cite the most typical example of natural language processing, it would be the next-word prediction feature in smartphone input methods. This is a feature used hundreds of times daily by billions of people.
Next-word prediction is a task that can be achieved through language models. A language model tries to predict the next word that might follow a list of words (say, two words).
In the above mobile screenshot, we can think of the model receiving two green words (thou shalt) and recommending a set of words (“not” is one of the most likely to be chosen):
We can think of this model as a black box:
But in reality, the model does not output just one word. Instead, it scores all the words it knows (the model’s vocabulary, which can range from thousands to millions of words) by likelihood, and the input method program selects the highest-scoring ones to recommend to the user.
The output of a neural language model is a probability score for every word the model knows. We usually talk about these scores as percentages, but a score like 40% is actually represented as 0.4 in the output vector.
Neural language models (see Bengio 2003) make their predictions in three steps after training, as shown below:
The first step is the most relevant to us, since we are discussing embeddings. After training, the model produces a matrix containing an embedding for every word in its vocabulary. At prediction time, we look up the input words in this embedding matrix and use them to compute the prediction:
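As a rough sketch of these three steps (a simplified toy model of my own, with made-up dimensions and untrained weights, not the exact Bengio architecture):

```python
import numpy as np

vocab_size, embedding_size, hidden_size = 10_000, 50, 128
rng = np.random.default_rng(0)

embedding_matrix = rng.normal(size=(vocab_size, embedding_size))   # one row per word
W_hidden = rng.normal(size=(2 * embedding_size, hidden_size))      # two input words
W_output = rng.normal(size=(hidden_size, vocab_size))

def predict_next_word(word_id_1, word_id_2):
    # Step 1: look up the embeddings of the input words
    x = np.concatenate([embedding_matrix[word_id_1], embedding_matrix[word_id_2]])
    # Step 2: compute a hidden representation from them
    h = np.tanh(x @ W_hidden)
    # Step 3: project to the vocabulary and turn the scores into probabilities (softmax)
    scores = h @ W_output
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()          # one probability per word in the vocabulary
```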
Now let’s focus on model training to learn how this embedding matrix is built.
Language Model Training
Compared to most other machine learning models, language models have a significant advantage: we have abundant text to train them. All our books, articles, Wikipedia, and various types of text content are available. In contrast, many other machine learning model developments require manually designed or specially collected data.
Words get their embeddings by looking at which other words tend to appear near them. The mechanism is as follows:
1. First, acquire a large amount of text data (e.g., all Wikipedia content).
2. Then, we establish a sliding window that can move along the text (e.g., a window containing three words).
3. Using such a sliding window can generate a large sample dataset for training the model.
As this window slides along the text, we (virtually) generate a dataset used to train the model. To understand this process clearly, let’s see how the sliding window handles this phrase:
At first, the window locks onto the first three words of the sentence:
We make the first two words features and the third word the label:
At this point, we have produced the first sample in the dataset, which will be used in our subsequent language model training.
Next, we slide the window to the next position and produce the second sample:
At this point, the second sample has also been generated.
Before long, we can obtain a large dataset, from which we can observe the words that will appear after different word groups:
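Here is a minimal sketch of this data generation step (my own illustration; the tokenization is simplified to splitting on spaces):

```python
def make_language_model_samples(tokens, window_size=3):
    """Slide a window over the tokens; the first two words are features, the third is the label."""
    samples = []
    for i in range(len(tokens) - window_size + 1):
        window = tokens[i:i + window_size]
        features, label = window[:-1], window[-1]
        samples.append((features, label))
    return samples

text = "thou shalt not make a machine in the likeness of a human mind"
print(make_language_model_samples(text.split()))
# [(['thou', 'shalt'], 'not'), (['shalt', 'not'], 'make'), ...]
```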
In practical applications, the model is often trained while we slide the window. However, I think separating the data generation and model training into two phases makes it clearer and easier to understand. Besides using neural networks for modeling, people also commonly use a technique called N-grams for model training.
If you want to understand the shift from N-gram models to neural models in real products, you can check out a 2015 blog post by Swiftkey (my favorite Android input method) that introduces their neural language model and compares it with the earlier N-gram model. I like this example because it shows how the algorithmic properties of embeddings can be explained in marketing language.
Considering Both Ends
Fill in the blanks based on the information provided above:
The context before the blank is five words (plus the earlier mention of ‘bus’), and most people would confidently fill in ‘bus’. But if I give you one more piece of information – a word after the blank – would your answer change?
The word that belongs in the blank has completely changed: ‘red’ is now the most likely fit. From this example we learn that the words both before and after a word carry informational value. It turns out we need to consider words in both directions (to the left and right of the target word). So how should we adjust the training method to meet this requirement? Keep reading.
Skipgram Model
We need to consider not only the two words before the target word but also the two words after it.
If we do this, the model we are actually building and training looks as follows:
This architecture is called Continuous Bag of Words (CBOW), as discussed in a paper on word2vec.
There is another architecture that does not guess the target word from its context (the preceding and following words) but instead predicts the likely preceding and following words from the current word. Let’s look at how the sliding window works as it generates training data:
The words in the green box are input words, and the pink box contains the possible output results.
The pink boxes are shown in different shades because this sliding window actually generates four separate samples for the training set:
This method is called the Skipgram architecture. We can illustrate the contents of the sliding window as follows.
This provides four samples for the dataset:
Then we move the sliding window to the next position:
In this way, we generate another four samples:
After moving several sets of positions, we can obtain a batch of samples:
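Here is a minimal sketch of skipgram sample generation (my own illustration, with a window of two words on each side of the input word):

```python
def make_skipgram_samples(tokens, window=2):
    """For each word, pair it with every word up to `window` positions to its left and right."""
    samples = []
    for i, input_word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                samples.append((input_word, tokens[j]))   # (input word, output/context word)
    return samples

tokens = "thou shalt not make a machine".split()
print(make_skipgram_samples(tokens)[:4])
# [('thou', 'shalt'), ('thou', 'not'), ('shalt', 'thou'), ('shalt', 'not')]
```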
Revisiting the Training Process
Now that we have obtained the training dataset for the Skipgram model from existing text, let’s see how to use it to train a natural language model that predicts adjacent words.
Let’s start with the first sample in the dataset. We feed the feature into an untrained model and ask it to predict a likely adjacent word.
The model performs its three steps and outputs a prediction vector (a probability for each word in its vocabulary). Because the model is untrained, its predictions at this stage will surely be wrong. But that’s okay; we know which word it should have guessed – that word is the output label in our training data:
The target word’s probability is 1, while all other words’ probabilities are 0, so the vector composed of these values is the “target vector”.
How far off is the model? Subtracting the two vectors gives us an error vector:
Now this error vector can be used to update the model, so in the next round of predictions, if we use “not” as input, we are more likely to get “thou” as output.
This is actually the first step of training. Next, we continue to perform the same operation on the next sample in the dataset until we traverse all samples. This is one epoch. We repeat this for several epochs to obtain the trained model, and then we can extract the embedding matrix for use in other applications.
While this does help us understand the entire process, it still does not represent the true training method of word2vec. We have missed some key ideas.
Negative Sampling
Recall the three steps the neural language model takes to compute the predicted value:
From a computational standpoint, the third step is very expensive – especially when we need to do it for each training sample in the dataset (which can easily amount to tens of millions). We need to find ways to improve performance.
One approach is to divide the target into two steps:
1. Generate high-quality word embeddings (don’t worry about the next word prediction).
2. Use these high-quality embeddings to train the language model (to perform next-word prediction).
In this article, we will focus on the first step (because this article focuses on embeddings). To generate high-quality embeddings using high-performance models, we can change the task of predicting adjacent words:
Switch to a model that takes an input word and an output word and produces a score indicating whether or not they are neighbors (0 means “not neighbors”, 1 means “neighbors”).
This simple transformation changes the model we need from a neural network to a logistic regression model – making it simpler and faster to compute.
This switch requires us to change the structure of the dataset – the label is now a new column with a value of 0 or 1. For now every label is 1, because all the word pairs we added are neighbors.
The current computation speed is incredibly fast – processing millions of examples in just a few minutes. But we still need to address a loophole. If all examples are neighbors (target: 1), our “genius model” might be trained to always return 1 – achieving 100% accuracy but learning nothing, resulting in garbage embedding results.
To solve this problem, we need to introduce negative samples into the dataset – samples of words that are not neighbors. Our model needs to return 0 for these samples. The model must work hard to solve this challenge – and still must maintain high speed.
For each sample in our dataset, we add negative examples. They have the same input word, with labels of 0.
But what do we fill in as output words? We randomly select words from the vocabulary.
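A minimal sketch of how such negative examples might be drawn (my own illustration; it samples uniformly at random, whereas the real word2vec samples from a smoothed unigram distribution):

```python
import random

vocabulary = ["thou", "shalt", "not", "make", "a", "machine", "aaron", "taco"]

def add_negative_samples(input_word, true_context_word, k=2):
    """One positive pair (label 1) plus k randomly drawn non-neighbor pairs (label 0)."""
    samples = [(input_word, true_context_word, 1)]
    while len(samples) < k + 1:
        candidate = random.choice(vocabulary)
        if candidate != true_context_word:
            samples.append((input_word, candidate, 0))
    return samples

print(add_negative_samples("not", "thou", k=2))
# e.g. [('not', 'thou', 1), ('not', 'taco', 0), ('not', 'aaron', 0)]
```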
This idea is inspired by noise-contrastive estimation. We contrast the actual signal (positive examples of adjacent words) with noise (randomly selected words that are not neighbors). This gives us an excellent trade-off between computational and statistical efficiency.
Noise-Contrastive Estimation
http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf
Skipgram with Negative Sampling (SGNS)
We have now covered the two central ideas in word2vec; as a pair, they are called skipgram with negative sampling.
Word2vec Training Process
Now that we understand the two core ideas of skipgram and negative sampling, we can continue to carefully study the actual training process of word2vec.
Before starting the training process, we preprocess the text that we are training the model on. At this step, we determine the size of the vocabulary (we call it vocab_size, for example, 10,000) and which words it contains.
At the start of the training phase, we create two matrices – the Embedding matrix and the Context matrix. These two matrices embed each word in our vocabulary (so vocab_size is one of their dimensions). The second dimension is the length we want each embedding to have (embedding_size – 300 is a common value, but we have also seen examples with 50).
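As a rough sketch (hypothetical sizes, numpy assumed), the two matrices could be created and randomly initialized like this:

```python
import numpy as np

vocab_size = 10_000     # number of words in the vocabulary (example value from the text)
embedding_size = 300    # length of each embedding vector

rng = np.random.default_rng(seed=0)

# Both matrices hold one row (one embedding) per vocabulary word.
embedding_matrix = rng.normal(scale=0.1, size=(vocab_size, embedding_size))
context_matrix = rng.normal(scale=0.1, size=(vocab_size, embedding_size))
```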
We initialize these matrices with random values and then begin training. In each training step, we take one adjacent (positive) example and its associated non-adjacent (negative) examples. Let’s look at our first group:
Now we have four words: the input word “not” and the output/context words “thou” (the actual neighbor), “aaron”, and “taco” (the negative examples). We proceed to look up their embeddings – for the input word, we look in the Embedding matrix; for the context words, we look in the Context matrix (even though both matrices have an embedding for every word in our vocabulary).
Then, we calculate the dot product of the input embedding with each context embedding. In each case, the result will be a number representing the similarity between the input and context embeddings.
Now we need a way to turn these scores into something that looks like probabilities – we need them to be positive values and between 0 and 1. The sigmoid function is well-suited for this purpose.
Now we can consider the output of the sigmoid operation as the model’s output for these examples. You can see that “taco” scores the highest, while “aaron” scores the lowest, both before and after the sigmoid operation.
Since the untrained model has made predictions and we do have actual target labels for comparison, let’s calculate the error in the model’s predictions. To do this, we just subtract the sigmoid scores from the target labels.
error = target – sigmoid_scores
This is the “learning” part of “machine learning”. Now we can use this error score to adjust the embeddings of “not”, “thou”, “aaron”, and “taco” so that the next time we do this calculation, the results will be closer to the target scores.
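Putting these pieces together, here is a minimal numpy sketch of one such training step (my own simplification; the word ids and learning rate are hypothetical, and it reuses the two matrices created earlier):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(embedding_matrix, context_matrix, input_id, context_ids, labels, lr=0.025):
    """One skipgram-with-negative-sampling update for a single input word.

    context_ids: the true neighbor plus the negative samples.
    labels:      1 for the true neighbor, 0 for each negative sample.
    """
    v_in = embedding_matrix[input_id]        # look up the input word's embedding
    v_ctx = context_matrix[context_ids]      # look up the context words' embeddings

    scores = v_ctx @ v_in                    # dot products: similarity scores
    preds = sigmoid(scores)                  # squash into (0, 1)
    errors = np.asarray(labels) - preds      # error = target - sigmoid_scores

    # Nudge the embeddings so the next prediction lands closer to the targets.
    embedding_matrix[input_id] += lr * (errors @ v_ctx)
    context_matrix[context_ids] += lr * np.outer(errors, v_in)

# Hypothetical usage with the four words from the text (ids are made up):
# sgns_step(embedding_matrix, context_matrix,
#           input_id=not_id, context_ids=[thou_id, aaron_id, taco_id], labels=[1, 0, 0])
```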
The training step ends here. We have obtained better embeddings for the words used in this step (not, thou, aaron, and taco). We now proceed to the next step (the next adjacent sample and its related non-adjacent samples) and repeat the same process.
As we loop through the entire dataset multiple times, the embeddings keep improving. We can then stop the training process, discard the Context matrix, and use the Embedding matrix as our trained embeddings for the next task.
Window Size and Number of Negative Samples
Two key hyperparameters in the word2vec training process are window size and the number of negative samples.
Different tasks suit different window sizes. One heuristic is that smaller window sizes (2-15) yield embeddings where a high similarity score between two embeddings indicates that the words are interchangeable (note that antonyms are often interchangeable if we only look at the words immediately around them – for example, good and bad often appear in similar contexts). Larger window sizes (15-50 or even more) yield embeddings where similarity better reflects how related the words are. In practice, you will often have to guide the embedding process so that it produces the kind of similarity that is useful for your task. Gensim’s default window size is 5 (up to five words before and five words after the input word).
The number of negative samples is another factor influencing the training process. The original paper suggests that 5-20 negative samples is an ideal number. It also notes that when you have a sufficiently large dataset, 2-5 seems to be enough. Gensim defaults to 5 negative samples.
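In Gensim (the 4.x API is assumed here), both hyperparameters are exposed directly; a small sketch on a toy corpus:

```python
from gensim.models import Word2Vec

# In practice this would be a large corpus, tokenized into lists of words.
sentences = [
    ["thou", "shalt", "not", "make", "a", "machine"],
    ["in", "the", "likeness", "of", "a", "human", "mind"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding_size
    window=5,         # window size
    negative=5,       # number of negative samples per positive example
    sg=1,             # 1 = skipgram (0 = CBOW)
    min_count=1,      # keep every word in this tiny toy corpus
)

print(model.wv["machine"][:5])   # the first few values of the trained embedding for "machine"
```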
Conclusion
I hope you now have a better understanding of word embeddings and the word2vec algorithm. I also hope that now when you read a paper mentioning “Skipgram with Negative Sampling” (SGNS), you will have a better grasp of these concepts.
Related reading:
https://jalammar.github.io/illustrated-word2vec/