Word2Vec Python Source Code Analysis

Now that we're comfortable using Word2Vec, today we'll lift the lid and see what's really going on inside. Word2Vec can turn words into vectors, which sounds rather magical, right? But once you understand the principle, you'll find it's just a little trick of assigning mathematical labels to words.

Core Idea: Prediction Is Learning

Word2Vec plays a prediction game – it sees a word and guesses what words might appear around it. For example, in the sentence “I like to eat apples”, when the model sees “eat”, it should guess that “apples” is probably nearby. During this process, each word quietly learns its own numerical features.

def train_pair(center_word, context_word, weights):
    # Forward pass: look up the center word's embedding (the hidden layer)
    hidden = weights['w1'][center_word]
    # Score every vocabulary word, then turn the scores into probabilities
    score = np.dot(hidden, weights['w2'])
    prob = softmax(score)
    # Cross-entropy loss for the true context word
    loss = -np.log(prob[context_word])
    # Gradient of the loss with respect to the scores
    grad = prob.copy()
    grad[context_word] -= 1
    return loss, grad
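
By the way, the snippet above assumes that numpy is imported and that a softmax helper already exists; neither appears in the excerpt, so here is a minimal, numerically stable version you could drop in:

import numpy as np

def softmax(x):
    # Subtract the max before exponentiating to avoid overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()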

Sliding Window: Context Is Not To Be Underestimated

Word2Vec uses a sliding window for sampling, just like kids playing on a slide, with words sliding past one after another. The window size determines how far the context extends on either side of the center word; both too large and too small cause problems.

def create_contexts_target(corpus, window_size=1):
    # Only words with a full window on both sides become prediction targets
    target = corpus[window_size:-window_size]
    contexts = []
    for idx in range(window_size, len(corpus) - window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:  # skip the center word itself
                continue
            cs.append(corpus[idx + t])
        contexts.append(cs)
    return np.array(contexts), np.array(target)

Tip: The window size should be set moderately; too large will create irrelevant connections, and too small will fail to capture relationships between words.
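
To make the windowing concrete, here is a small usage sketch on a made-up five-word corpus of word IDs (not from the original source):

# Hypothetical toy corpus: word IDs for a five-word sentence
corpus = np.array([0, 1, 2, 3, 4])
contexts, target = create_contexts_target(corpus, window_size=1)
print(contexts)  # [[0 2] [1 3] [2 4]]
print(target)    # [1 2 3]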

Negative Sampling: True Knowledge Comes From Challenges

Just looking at positive examples is not enough; we need some negative examples to challenge the model. Negative sampling randomly picks some unrelated words so the model learns to tell related words from unrelated ones.

def get_negative_samples(vocab_size, sample_size):
    # Simplest version: draw word IDs uniformly at random from the vocabulary
    negative_samples = np.random.randint(0, vocab_size, size=sample_size)
    return negative_samples
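
The version above samples uniformly, which is the simplest thing that works. The original Word2Vec paper instead draws negatives in proportion to word frequency raised to the 0.75 power; a rough sketch of that idea (word_counts is an assumed per-word frequency array, not something from this code) might look like:

def get_negative_samples_weighted(word_counts, sample_size, power=0.75):
    # Sample word IDs with probability proportional to frequency ** 0.75,
    # so very frequent words don't completely dominate the negatives
    p = np.asarray(word_counts, dtype=np.float64) ** power
    p /= p.sum()
    return np.random.choice(len(word_counts), size=sample_size, p=p)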

Gradient Update: Model Improvement Relies on It

Gradient updates are like climbing stairs, taking one step at a time in the right direction. A learning rate that is too high can lead to falling down, while one that is too low will take forever to reach the top.

def update_weights(weights, grads, learning_rate=0.01):
    # Plain gradient descent: step each weight matrix against its gradient
    for key in weights:
        weights[key] -= learning_rate * grads[key]
    return weights

Tip: Be patient when adjusting the learning rate; different datasets may require different values.
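
For reference, the original C implementation decays the learning rate linearly over training and keeps it above a small floor; a minimal sketch of that idea (the names here are mine, not from this code) could be:

def decayed_lr(initial_lr, step, total_steps, floor=1e-4):
    # Linear decay with a small floor so updates never stop entirely
    return max(initial_lr * (1 - step / total_steps), floor)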

Skip-gram: Predicting Surrounding Words

The Skip-gram model is like playing a matching game, where you give a center word and connect it to the surrounding words that should appear.

class SkipGram:
    def __init__(self, vocab_size, embedding_size):
        # Small random initialization keeps the early gradients well behaved
        self.w1 = np.random.randn(vocab_size, embedding_size) * 0.01  # input embeddings
        self.w2 = np.random.randn(embedding_size, vocab_size) * 0.01  # output weights

    def forward(self, x):
        # Look up the center word's embedding, then score every vocabulary word
        self.h = self.w1[x]
        score = np.dot(self.h, self.w2)
        return softmax(score)
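
Assuming numpy and the softmax helper from earlier, a quick smoke test of the forward pass with made-up sizes might look like this:

model = SkipGram(vocab_size=5, embedding_size=3)
probs = model.forward(2)           # predicted context distribution for word ID 2
print(probs.shape, probs.sum())    # (5,) and approximately 1.0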

As I write this code, I'm reminded that many people overcomplicate Word2Vec. In fact, it's just a labeling process: as words keep running into each other, they naturally learn which ones are similar.

When writing code, remember to handle numerical stability properly, otherwise exploding gradients can be troublesome. Initialize the vectors on the small side and adjust the learning rate gradually, so the model can learn steadily.
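
One small safeguard along those lines (illustrative only, not part of the original snippets) is to clamp probabilities before taking the log, in addition to the max-subtraction inside softmax shown earlier:

def safe_log(p, eps=1e-10):
    # Clamp probabilities away from zero so the loss never becomes infinite
    return np.log(np.maximum(p, eps))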

After reading the source code, don’t you find that Word2Vec isn’t so scary? In short, it’s just a mathematical way to describe the relationships between words. Next time you use it, you won’t feel like it’s a black box!
