Illustrated Word2Vec: Understanding Word Embeddings

Word embeddings represent a word with a numerical vector. Unlike the integer IDs used in tokenization, word embedding vectors carry semantic information. This article illustrates Word2Vec, a method for learning word embeddings. This series also includes illustrated guides to Tokenization, the Transformer, GPT2, and BERT.
If you want to learn about Tokenization, please see <Illustrated Tokenization>.

This article is translated from the blog below.

https://jalammar.github.io/illustrated-word2vec/

I think the concept of embeddings is one of the most fascinating ideas in machine learning. If you have ever used Siri, Google Assistant, Alexa, Google Translate, or even a smartphone keyboard with next-word prediction features, you have likely benefited from this idea, which has become central to natural language processing models. Over the past few decades, the application of embeddings in neural models has seen significant development (recent advancements include context-based word embeddings, leading to cutting-edge models like BERT and GPT2).

Word2Vec is an efficient method for creating word embeddings, and it has been in use since 2013. However, beyond being a method for word embeddings, some of its concepts have proven to be very effective in creating recommendation engines and understanding continuous data in non-linguistic business tasks. Companies like Airbnb, Alibaba, Spotify, and Anghami have leveraged the power of NLP to empower a new type of recommendation engine.

In this article, we will introduce the concept of embeddings and the mechanism of generating embeddings using Word2Vec. But first, let’s start with an example to understand how to represent things using vectors. Did you know that a list of five numbers (a vector) can represent a lot of information about your personality?

Personality Embeddings: What Kind of Person Are You?

On a scale from 0 to 100, how introverted/extroverted are you (where 0 is the most introverted and 100 is the most extroverted)? Have you taken personality tests like MBTI (Myers-Briggs Type Indicator) or the better-known Big Five personality traits test? If you haven’t taken one, these tests will ask you a series of questions and then score you across several dimensions, of which introversion/extroversion is one.

[Image]

Let’s say I scored 38/100 on the introversion/extroversion scale. I could plot it like this:

[Image]

Now, let’s narrow the range to -1 to 1:

[Image]

How much do you think you understand a person based solely on this one piece of information? Not much. People are complex. So let’s add another dimension – another trait from the personality test.

[Image]

We can represent the two dimensions as a point on a graph, or as a vector from the origin to that point. Vectors are very useful because we have great tools for working with them.

I’ve hidden the personality traits we’re plotting so you won’t know what each dimension represents, but you can still gain a lot of information from the vector representation of personality.

Now we can say this vector partially represents my personality. The usefulness of this representation comes when you want to compare two other people to me. Let’s say I get hit by a bus and need to find someone with a similar personality to replace me. In the graph below, which person is more similar to me?

[Image]

A common method for calculating similarity scores when dealing with vectors is cosine similarity:

[Image]

Person 1 is more similar in personality to me. Vectors pointing in the same direction have a higher cosine similarity (the score depends on the angle between the vectors rather than on their lengths).

Once again, two dimensions cannot capture enough information about the differences between people. Decades of psychological research have identified five main traits (as well as many secondary traits). So let’s use all five dimensions in our comparison:

[Image]

The problem with using five dimensions is that we lose the ability to neatly plot little arrows in two-dimensional space. This is a common challenge in machine learning, where we often need to think in higher-dimensional spaces. The good news is that cosine similarity still applies. It works for any number of dimensions:

[Image]

Cosine similarity works for any number of dimensions. These scores are better than the last ones because they are calculated based on higher-dimensional comparisons.
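
Here is a minimal sketch of how cosine similarity can be computed for vectors of any length. The personality vectors below are made-up numbers for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

me       = [-0.4,  0.8,  0.5, -0.2,  0.3]   # five personality scores scaled to [-1, 1]
person_1 = [-0.3,  0.2,  0.3, -0.4,  0.9]
person_2 = [-0.5, -0.4, -0.2,  0.7, -0.1]

print(cosine_similarity(me, person_1))  # higher score: more similar
print(cosine_similarity(me, person_2))  # lower score: less similar
```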

At the end of this section, I would like to present two central ideas:

1. We can represent people and things as algebraic vectors (which is great for machines!).

2. We can easily calculate how similar vectors are to each other.

[Image]

Word Embeddings

With this understanding, we can continue to look at examples of trained word vectors (also known as word embeddings) and start exploring some of their interesting properties.

Here is the word embedding for the word “king” (trained GloVe vectors from Wikipedia):

[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ]

You can try this out using the online tool below.

WebVectors: An interactive word vector playground for NLP
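
If you prefer to explore the vectors in code, here is a small sketch using Gensim's downloader to fetch a comparable set of 50-dimensional GloVe vectors (the first call downloads the data):

```python
import gensim.downloader as api

# Download (once) and load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword.
model = api.load("glove-wiki-gigaword-50")

print(model["king"])  # a vector of 50 numbers, like the one shown below
```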

This is a list of 50 numbers. By looking at these values, we can’t derive much information. But let’s visualize it a bit to compare it with other word vectors. Let’s put all these numbers in a row:

[Image]

Let’s color the cells based on the size of the values (red if close to 2; white if close to 0; blue if close to -2):

[Image]

From here on, we will ignore the numbers and use only the colors to represent the cell values. Now let’s compare “King” with other words:

[Image]

You can see that “Man” and “Woman” are more similar to each other than either is to “King.” This tells you something. These vector representations capture a lot of information, meanings, and associations about these words.

Here is another list of examples (compare by scanning the columns vertically, looking for columns with similar colors):

[Image]

Notice a few things:

1. Among all these different words, there is one column that is red. These words are similar in this dimension (and we don’t know what each dimension represents).

2. You can see that “woman” and “girl” are similar in many places. The same is true for “man” and “boy.”

3. “boy” and “girl” also have some places where they are similar to each other but differ from “woman” or “man.” Could this represent a vague concept of “youth”? Possibly.

4. Except for the last word, all the words represent people. I added an object (“water”) to show the differences between categories. For example, you can see that the blue column extends all the way down and then stops before the embedding for “water.”

5. There are clear places where “king” and “queen” are similar to each other and different from all other words. Could these represent a vague concept of “royalty”?

Analogies

A famous example showcasing the wonderful properties of embeddings is analogies. We can add and subtract word embeddings and get interesting results. One famous example is the formula: “king” – “man” + “woman”:

[Image]

Using the Gensim library in Python, we can add and subtract word vectors, and it will find the words most similar to the resulting vector. This image shows the list of most similar words, each with its cosine similarity.
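
As a sketch of what such a Gensim query might look like (reusing the 50-dimensional GloVe vectors loaded earlier):

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# king - man + woman: positive words are added, negative words are subtracted.
for word, score in model.most_similar(positive=["king", "woman"], negative=["man"], topn=5):
    print(f"{word}\t{score:.3f}")
# "queen" is typically the word closest to the resulting vector
```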

We can visualize this analogy as before:

[Image]

The vector generated from “king – man + woman” isn’t exactly equal to “queen,” but “queen” is the closest word to it among the 400,000 word embeddings we have here.

Now that we have seen trained word embeddings, let’s learn more about the training process. But before we start with Word2Vec, we need to look at the parent concept of word embeddings: language models.

Language Models

If I had to give the most typical example of natural language processing, it would be the next-word prediction feature of smartphone keyboards, a feature that billions of people use hundreds of times a day.

[Image]

Next-word prediction is a task that can be accomplished through language models. A language model tries to predict the next word based on a list of words (say, two words) that come before it.

In the screenshot above, we can think of the model receiving two green words (thou shalt) and recommending a set of words (“not” is one of the most likely to be selected):

[Image]

We can imagine this model as a black box:

[Image]

But in fact, the model doesn’t output just one word. It scores all the words it knows (its vocabulary, which can range from a few thousand to millions of words) by probability, and the keyboard app then suggests the highest-scoring words to the user.

[Image]

The output of a neural language model is a probability score for every word the model knows. We usually talk about these scores as percentages, but in the output vector a score of 40% is actually represented as 0.4.

After training, a neural language model (see Bengio 2003) computes a prediction in three steps:

[Image]

The first step is the most relevant to us because we are discussing embeddings. One of the results of training is a matrix that contains an embedding for every word in the model’s vocabulary. During prediction, we simply look up the embeddings of the input words in this matrix and use them to compute the prediction.
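
Concretely, a minimal sketch of those three steps might look like the following. The toy vocabulary, the random weights, and the two-word context are all illustrative assumptions, not the actual Bengio architecture.

```python
import numpy as np

vocab = ["thou", "shalt", "not", "make", "a", "machine"]
vocab_size, embedding_size = len(vocab), 8

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))    # one embedding per word
output_weights = rng.normal(size=(2 * embedding_size, vocab_size))  # projection to vocabulary scores

def predict_next(word_a, word_b):
    # Step 1: look up the embeddings of the two input words.
    x = np.concatenate([embedding_matrix[vocab.index(word_a)],
                        embedding_matrix[vocab.index(word_b)]])
    # Step 2: compute a score (logit) for every word in the vocabulary.
    logits = x @ output_weights
    # Step 3: turn the scores into probabilities and pick the most likely word.
    probs = np.exp(logits) / np.exp(logits).sum()
    return vocab[int(probs.argmax())], probs

word, probs = predict_next("thou", "shalt")
print(word, probs.round(3))
```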

[Image]

Now let’s turn to the training process to learn how this embedding matrix is built.

Language Model Training

Language models have a huge advantage over most other machine learning models. This advantage lies in the fact that we can train them using a massive amount of running text. Imagine all the books, articles, Wikipedia content, and other forms of text data we have. This is a big contrast compared to many other machine learning models that require handcrafted features and special data collection.

Words get their embeddings by looking at which other words they tend to appear next to. The mechanism is as follows:

1. First, obtain a large amount of text data (e.g., all Wikipedia content)

2. Then we establish a sliding window that can glide along the text (for example, a window containing three words)

3. Using this sliding window, we can generate a large amount of sample data for training the model.

[Image]

As this sliding window glides over the text, we (virtually) generate a dataset for training the model. To understand how this is accomplished, let’s see how the sliding window handles this phrase:

At the beginning, the window locks on the first three words of the sentence:

[Image]

We take the first two words as features, and the third word as the label:

[Image]

This produces the first sample in the dataset, which will be used in our subsequent language model training.

Next, we slide the window to the next position and produce the second sample:

[Image]

Now the second sample is generated.

Before long, we can obtain a large dataset from which we can see which words tend to follow different pairs of words.
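
A small sketch of this dataset-generation step, using the example phrase from the original post (the window size and variable names are illustrative):

```python
# Slide a three-word window over the text; the first two words become the
# features and the third word becomes the label.
text = "thou shalt not make a machine in the likeness of a human mind".split()

samples = []
for i in range(len(text) - 2):
    samples.append((text[i:i + 2], text[i + 2]))

for features, label in samples[:3]:
    print(features, "->", label)
# ['thou', 'shalt'] -> not
# ['shalt', 'not'] -> make
# ['not', 'make'] -> a
```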

[Image]

In practice, training usually happens while the sliding window moves over the text, but it is conceptually clearer to separate the “dataset generation” phase from the training phase. Besides neural-network approaches to language modeling, a technique called N-grams has long been used to train language models (see Chapter 3 of “Speech and Language Processing”). To see how the switch from N-grams to neural models plays out in real products, here is a 2015 blog post from SwiftKey, my favorite Android keyboard, introducing their neural language model and comparing it with their previous N-gram model. I love this example because it shows how the algorithmic properties of embeddings can be described in marketing language.

Considering Both Ends

Fill in the blanks based on the information above:

[Image]

The context here is the five words before the blank (and the previously mentioned “bus”). I believe most people would guess that the blank should be filled with the word “bus.” But if I give you one more piece of information – a word after the blank – will that change your answer?

[Image]

This completely changes what should fill in the blank. Now, the most likely word to fill in the blank is “red.” We can conclude that the specific words before and after a word carry information value. It turns out that considering both the left and right sides (the words on the left and right of the word we are guessing) leads to better word embeddings. Let’s see how to adjust the way we train our model to address this issue.

Skipgram Model

Instead of just looking at the two words before the target word, we can also look at the two words after it.

[Image]

If we do this, the model we actually build and train looks like this:

[Image]

This is called the Continuous Bag of Words structure, and it is described in a paper on Word2Vec. Another architecture has also shown good results, but it operates a bit differently.
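
Before moving on to that second architecture, here is a small sketch of how CBOW-style training samples could be generated, with two words on each side of the target as the features (the window size and names are illustrative):

```python
text = "thou shalt not make a machine in the likeness of a human mind".split()

cbow_samples = []
for i in range(2, len(text) - 2):
    context = text[i - 2:i] + text[i + 1:i + 3]  # two words before and two words after
    cbow_samples.append((context, text[i]))      # the context is the feature, the middle word the label

print(cbow_samples[0])
# (['thou', 'shalt', 'make', 'a'], 'not')
```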

This architecture tries to guess the neighboring words using the current word instead of using the context (the words before and after) to guess the word. We can see the sliding window as follows:

[Image]

The words in the green box are input words, while the pink box shows possible output results.

The depth of color in the pink box indicates the different possibilities because the sliding window generates four independent samples for the training set:

[Image]

This method is called the Skipgram architecture. We can display the contents of the sliding window as follows.

[Image]

This provides four samples for the dataset:

[Image]

Then we move the sliding window to the next position:

[Image]

Now we generate the next four samples:

[Image]

After the window has slid over a few more positions, we accumulate a sizable batch of samples.
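
A sketch of skipgram sample generation over the same example phrase (the window size and names are illustrative):

```python
text = "thou shalt not make a machine in the likeness of a human mind".split()
window = 2  # number of neighbors considered on each side

skipgram_samples = []
for i, center in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            skipgram_samples.append((center, text[j]))  # (input word, neighboring word)

print(skipgram_samples[:4])
# [('thou', 'shalt'), ('thou', 'not'), ('shalt', 'thou'), ('shalt', 'not')]
```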

[Image]

Revisiting the Training Process

Now that we have the skipgram training dataset extracted from existing running text, let’s briefly look at how to use it to train a basic neural language model that can predict neighboring words.

[Image]

We take the first sample from the dataset. We extract features and input them into the untrained model, asking it to predict a suitable neighboring word.

[Image]

The model executes these three steps and outputs a prediction vector (assigning a probability to each word in its vocabulary). Since the model is untrained, the predictions at this time are certainly incorrect. But that’s okay; we know what it should guess – the label/output unit in the row we’re currently using to train the model:

[Image]

The target word probability is 1, while the probabilities for all other words are 0, so the vector composed of these values is the “target vector.”

How far off is the model’s prediction from the actual label? We subtract the two vectors to get an error vector:

[Image]

Now we can use this error vector to update the model, so that next time when the model sees “not” as input, it’s more likely to guess “thou.”

[Image]

This completes the first step of training. We continue the same process for the next sample in the dataset and then the next sample, until we have covered all samples in the dataset. This completes one training cycle. We can train through multiple cycles, and then we will have a trained model from which we can extract the embedding matrix for any further applications.

While this expands our understanding of the process, we still lack some key ideas about the actual training of Word2Vec.

Negative Sampling

Recall the three steps the neural language model takes to compute prediction values:

[Image]

The third step is computationally expensive, especially considering that we need to perform it once for each training sample in the dataset (which can reach tens of millions). We need to take some measures to improve performance.

One approach is to divide the target into two steps:

1. Generate high-quality word embeddings (without worrying about predicting the next word).

2. Use these high-quality embeddings to train a language model (for next-word prediction).

In this article, we will focus on the first step because we are concerned with embeddings. To generate high-quality embeddings using a high-performance model, we can switch the model’s task from predicting neighboring words to:

[Image]

Then switch it to a model that takes input and output words and outputs a score indicating whether they are neighbors (0 means “not neighbors,” 1 means “are neighbors”).

[Image]

This simple switch transforms the model we need from a neural network to a logistic regression model, making it simpler and faster to compute.

This switch requires us to change the structure of the dataset – the labels are now a new column with values of 0 or 1. Since all the words we added are neighbors, they will all be 1.

[Image]

Now we can compute at an amazing speed, processing millions of examples in minutes. However, we still need to address a loophole. If all our examples are positive cases (target is 1), we run the risk of facing a clever model that always returns 1 – achieving 100% accuracy without learning anything and generating garbage embedding vectors.

[Image]

To solve this problem, we need to introduce negative samples into our dataset – pairs of words that are not neighbors. Our model needs to return 0 for these samples. Now the model has to work hard to solve the problem, but the computation is still very fast.

[Image]

For each sample in our dataset, we add negative examples. They have the same input word, with labels of 0.

But what do we fill in as output words? We randomly draw words from the vocabulary.
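
A sketch of that step; the word list, the number of negatives per positive sample, and the helper name are all illustrative:

```python
import random

vocab = ["thou", "shalt", "not", "make", "a", "machine", "aaron", "taco"]
k = 2  # negative samples added per positive sample

def with_negatives(input_word, output_word):
    rows = [(input_word, output_word, 1)]                  # the true neighbor, labeled 1
    candidates = [w for w in vocab if w not in (input_word, output_word)]
    for negative in random.sample(candidates, k):          # draw k random words from the vocabulary
        rows.append((input_word, negative, 0))             # treated as non-neighbors, labeled 0
    return rows

print(with_negatives("not", "thou"))
# e.g. [('not', 'thou', 1), ('not', 'taco', 0), ('not', 'aaron', 0)]
```

In the actual Word2Vec implementation, negatives are not drawn uniformly but from a smoothed unigram distribution (word frequency raised to the 3/4 power).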

[Image]

The inspiration for this idea comes from noise-contrastive estimation: we compare the actual signal (positive examples of neighboring words) with noise (randomly selected words that are not neighbors). This gives us a great trade-off between computational and statistical efficiency.

Noise Contrastive Estimation

http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf

Skipgram with Negative Sampling (SGNS)

We have now covered the two central ideas in Word2Vec: skipgram and negative sampling. Used together, they are called skipgram with negative sampling (SGNS).

[Image]

Word2Vec Training Process

Now that we understand the two core ideas of skipgram and negative sampling, we can delve deeper into the actual training process of Word2Vec.

Before the training process begins, we preprocess the text to be trained. In this step, we determine the size of the vocabulary (which we call vocab_size and can think of as, for example, 10,000) and the words that belong to that vocabulary.

At the start of the training phase, we create two matrices – the embedding matrix and the context matrix. Both matrices have embeddings for each word in the vocabulary (thus vocab_size is one of their dimensions). The second dimension is how long we want each embedding to be (embedding_size – 300 is a common value, but we saw embeddings of length 50 in earlier examples).
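
A small sketch of creating these two matrices, with sizes matching the example above; the particular random initialization shown here is illustrative and corresponds to the step described next:

```python
import numpy as np

vocab_size = 10_000     # number of words in the vocabulary
embedding_size = 300    # length of each embedding vector

rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1, 1, size=(vocab_size, embedding_size))  # embeddings for input words
context_matrix = rng.uniform(-1, 1, size=(vocab_size, embedding_size))    # embeddings for context words
```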

[Image]

At the start of the training process, we initialize these matrices with random values. Then we begin the training process. In each training step, we choose a positive example and its related negative examples. Let’s look at our first set of examples:

[Image]

Now we have four words: the input word is “not,” and the output/context word is “thou” (the actual neighbor), along with “aaron” and “taco” (the negative examples). We continue to look up their embedding vectors – for the input word, we look in the embedding matrix; for the context word, we look in the context matrix (although both matrices provide embedding vectors for each word in the vocabulary).

[Image]

Then, we perform a dot product operation between the input embedding vector and each context embedding vector. In each case, this produces a number that indicates the similarity between the input embedding vector and the context embedding vector.

[Image]

Now we need a way to turn these scores into something that looks like probabilities – they should all be positive and fall between 0 and 1. This is exactly the task the sigmoid (logistic) function is suited for.

[Image]

Now we can treat the output of the sigmoid operation as the model’s output for these examples. We can see that before and after the sigmoid operation, “taco” has the highest score, while “aaron” has the lowest.

Since the untrained model has made predictions and we have an actual target label for comparison, let’s calculate how far off the model’s predictions are from the actual labels. To do this, we simply subtract the sigmoid scores from the target labels.

[Image]

error = target - sigmoid_scores

This is the “learning” part of “machine learning.” We can now use this error score to adjust the embedding vectors of “not,” “thou,” “aaron,” and “taco,” so that the next time we make this calculation, the result is closer to the target scores.
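
Putting the whole step together, here is a minimal sketch of one SGNS update; the tiny vocabulary, the random initialization, and the learning rate are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

word2id = {"not": 0, "thou": 1, "aaron": 2, "taco": 3}
vocab_size, embedding_size = len(word2id), 300
learning_rate = 0.025

rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1, 1, (vocab_size, embedding_size))
context_matrix = rng.uniform(-1, 1, (vocab_size, embedding_size))

input_id = word2id["not"]                                        # input word
context_ids = np.array([word2id["thou"], word2id["aaron"], word2id["taco"]])
targets = np.array([1.0, 0.0, 0.0])                              # 1 = true neighbor, 0 = negatives

input_vec = embedding_matrix[input_id]                           # look up in the embedding matrix
context_vecs = context_matrix[context_ids]                       # look up in the context matrix

scores = context_vecs @ input_vec                                # dot products
predictions = sigmoid(scores)                                    # squash into (0, 1)
error = targets - predictions                                    # error = target - sigmoid_scores

# Use the error to nudge the vectors so the next prediction is closer to the targets.
embedding_matrix[input_id] += learning_rate * (error @ context_vecs)
context_matrix[context_ids] += learning_rate * np.outer(error, input_vec)
```

In practice, this update is repeated for every sample in the dataset over several training epochs; afterwards, the embedding matrix is kept as the word embeddings and the context matrix is typically discarded.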
