The Misconceptions About Word2Vec: A Programmer’s Insight

Li Zi from Ao Fei Si Quantum Bit | WeChat Official Account QbitAI

Word2Vec is a language tool open-sourced by Google in 2013.

With a shallow two-layer network it turns words into vectors, which is crucial in NLP and the foundation of many downstream applications.

However, a programmer who goes by bollu (hereafter "Pineapple") has now loudly told the world:

“Everything you know about Word2Vec is wrong.”


In his view, the algorithm as explained in the papers and the algorithm as implemented in the code are fundamentally different things.

As long as the code is open-sourced, does it even matter whether the paper explains it clearly?

His detailed write-up sparked a lot of interest and resonance; within half a day the Hacker News thread had nearly 300 points.


So, how did Pineapple’s worldview collapse, and what does the real Word2Vec look like in his eyes?

A Different Sky

Word2Vec has a classic explanation (the Skip-Gram with negative sampling one), the way the paper and countless blog posts describe it:

[Image: the classic formulation of Skip-Gram with negative sampling]

It only shows two vectors.

But the programmer says that after reading the original C implementation of Word2Vec, he found it looks completely different.

(Most people who use Word2Vec for word embeddings either call the C implementation directly or use the gensim implementation. Gensim was translated from the C code, down to keeping the variable names unchanged.)

The C Implementation Looks Like This

Each word has two vectors, each serving different roles:

One represents the word as a focus word (Focus Word).

The other represents it as the context word (Context Word) for another focus word.

Pineapple says this sounds familiar: GloVe borrowed the idea, but never stated it explicitly.

In the C source code, this setup is already fully worked out, and the two kinds of vectors are kept in two arrays (Array):

The syn0 array holds each word's vector as a focus word. It is randomly initialized:

// https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L369
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
  next_random = next_random * (unsigned long long)25214903917 + 11;
  syn0[a * layer1_size + b] =
     (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
}

The syn1neg array holds each word's vector as a context word. It is zero-initialized:

// https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L365
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
  syn1neg[a * layer1_size + b] = 0;
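
To make the two-array setup concrete, here is a minimal, self-contained sketch of the same initialization scheme. It is not the original program: vocab_size and layer1_size below are made-up example values, and real is assumed to be float, as in word2vec.c.

#include <stdio.h>
#include <stdlib.h>

typedef float real;

int main(void) {
  long long vocab_size = 1000;    /* made-up example value */
  long long layer1_size = 100;    /* made-up example value */
  unsigned long long next_random = 1;

  real *syn0    = malloc(vocab_size * layer1_size * sizeof(real));
  real *syn1neg = malloc(vocab_size * layer1_size * sizeof(real));

  for (long long a = 0; a < vocab_size; a++)
    for (long long b = 0; b < layer1_size; b++) {
      /* same linear congruential generator as word2vec.c */
      next_random = next_random * (unsigned long long)25214903917 + 11;
      /* focus vectors: small random values around zero */
      syn0[a * layer1_size + b] =
          (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
      /* context vectors: all zeros */
      syn1neg[a * layer1_size + b] = 0;
    }

  printf("syn0[0] = %f, syn1neg[0] = %f\n", syn0[0], syn1neg[0]);
  free(syn0);
  free(syn1neg);
  return 0;
}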

During training, a focus word is selected first, and it stays constant (Constant) throughout the positive- and negative-sample updates.

The gradients (Gradients) for the focus-word vector accumulate in a buffer; only after the positive and negative samples have all been processed are they applied to the focus word:

if (negative > 0) for (d = 0; d < negative + 1; d++) {
  // if we are performing negative sampling, in the 1st iteration,
  // pick a word from the context and set the dot product target to 1
  if (d == 0) {
    target = word;
    label = 1;
  } else {
    // for all other iterations, pick a word randomly and set the dot
    // product target to 0
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
    if (target == word) continue;
    label = 0;
  }
  l2 = target * layer1_size;
  f = 0;

  // find dot product of original vector with negative sample vector
  // store in f
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];

  // set g = sigmoid(f) (roughly, the actual formula is slightly more complex)
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;

  // 1. update the vector syn1neg,
  // 2. DO NOT UPDATE syn0
  // 3. STORE THE syn0 gradient in a temporary buffer neu1e
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
// Finally, after all samples, update syn0 from neu1e
// https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L541
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
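
To see how the pieces fit together, here is a minimal, self-contained sketch of one (focus, context) update following the structure above. It is not the original code: the sigmoid is computed directly instead of through the precomputed expTable, negative samples are drawn uniformly at random instead of from the sampling table the original builds, and VOCAB, DIM, NEG, ALPHA and the toy initialization are all made-up example values.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef float real;

#define VOCAB 1000      /* made-up vocabulary size */
#define DIM   100       /* made-up vector dimension */
#define NEG   5         /* made-up number of negative samples */
#define ALPHA 0.025f    /* made-up learning rate */

static real sigmoid(real x) { return 1.0f / (1.0f + expf(-x)); }

/* one training step for a single (focus, context) pair */
static void train_pair(real *syn0, real *syn1neg,
                       long long focus, long long ctx,
                       unsigned long long *rng) {
  real neu1e[DIM] = {0};        /* gradient buffer for the focus vector */
  long long l1 = focus * DIM;

  for (int d = 0; d < NEG + 1; d++) {
    long long target;
    int label;
    if (d == 0) { target = ctx; label = 1; }   /* positive sample */
    else {                                     /* negative sample */
      *rng = *rng * (unsigned long long)25214903917 + 11;
      target = (long long)((*rng >> 16) % VOCAB);
      if (target == ctx) continue;
      label = 0;
    }
    long long l2 = target * DIM;

    /* dot product between focus vector and the target's context vector */
    real f = 0;
    for (int c = 0; c < DIM; c++) f += syn0[c + l1] * syn1neg[c + l2];
    real g = ((real)label - sigmoid(f)) * ALPHA;   /* scaled error */

    /* buffer the focus gradient, update the context vector right away */
    for (int c = 0; c < DIM; c++) neu1e[c] += g * syn1neg[c + l2];
    for (int c = 0; c < DIM; c++) syn1neg[c + l2] += g * syn0[c + l1];
  }

  /* only after all samples does the focus vector move */
  for (int c = 0; c < DIM; c++) syn0[c + l1] += neu1e[c];
}

int main(void) {
  unsigned long long rng = 1;
  real *syn0    = calloc((size_t)VOCAB * DIM, sizeof(real));
  real *syn1neg = calloc((size_t)VOCAB * DIM, sizeof(real));  /* zeros, as in the original */

  /* toy stand-in for the random initialization of syn0 shown earlier */
  for (long long i = 0; i < (long long)VOCAB * DIM; i++)
    syn0[i] = 0.01f * (real)((i % 11) - 5);

  train_pair(syn0, syn1neg, 3, 7, &rng);   /* toy focus/context word indices */
  printf("syn0[3*DIM] after one step: %f\n", syn0[3 * DIM]);
  free(syn0);
  free(syn1neg);
  return 0;
}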

So the question arises: why is syn0 randomly initialized, while syn1neg is zero-initialized?

About Initialization

These things were never mentioned in the papers and blogs, so Pineapple could only speculate:

Because negative samples (Negative Sample) are drawn from the whole text and are not weighted by word frequency, any word can be picked, and more often than not its vector has barely been trained yet.

If such a vector already carried a value, it would yank the focus word around at random (Move Randomly).

The fix is to start all the context vectors at zero, so that only vectors of words that have shown up reasonably often get to influence the representation of another vector.
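
A quick back-of-the-envelope check of this reading (an illustration, not something from the original article): suppose a freshly picked negative sample still has an all-zero context vector. Writing syn0_focus for the focus word's row of syn0 and syn1neg_target for the negative's row of syn1neg, the update above gives:

  f = syn0_focus · syn1neg_target = 0
  g = (label - sigmoid(0)) * alpha = (0 - 0.5) * alpha = -0.5 * alpha
  neu1e += g * syn1neg_target = 0          (the focus word's gradient buffer is untouched)
  syn1neg_target += g * syn0_focus ≠ 0     (the negative's context vector does start to learn)

So an untrained negative cannot drag the focus word anywhere, while its own context vector still receives its first update.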

The programmer says that, if this really is the reason, it is very clever indeed. He had never imagined that the initialization strategy could matter so much, and it is not something you could tell from reading the papers.

Directly Looking at the Code, No Longer Trusting Papers


Before all this, Pineapple had already spent two months trying to reproduce Word2Vec, reading countless articles along the way, without success.

No matter how many times he tried, he could not reach the scores reported in the papers, yet he could not bring himself to believe the authors had simply made them up.

In the end he decided to read the source code carefully. At first he thought he must be looking at the wrong thing, because it was nothing like the material he had seen before:

I don’t understand why the original paper and online blogs do not explain how Word2Vec actually works. So I wanted to write it out myself.

It was also during this process that he made the discovery mentioned earlier: giving the context (Context) its own separate vector is something GloVe took from Word2Vec.

And the authors of GloVe never mentioned this point.

Thinking of this, the programmer had new doubts:

Isn’t this academic dishonesty (Academic Dishonesty)? I don’t know if it counts, but I think it’s at least a serious issue.

Saddened, Pineapple made a shrewd decision: from now on he would skip the algorithm explanations in papers and go straight to the source code.

Is This a Common Habit?

On the topic of papers diverging from their implementations, a veteran programmer (DannyBee) who has been reading papers for 40 years holds the top comment on Hacker News.

He recounted the changes in authors’ habits over the years:

In the early days, many algorithm implementations matched their stated principles, and the performance matched what was described. Papers gave pseudocode (Pseudocode), and wherever the implementation differed from the pseudocode, the differences were explained in detail.

Later, things started to drift. Some papers' algorithms either did not work as described or were too inefficient to use, and a look at the source code would reveal that it was not what the paper said.

SSAPRE is a typical case. Even today people find its paper hard to understand, and its implementation in the Open64 compiler turns out to be wildly different (Wildly Different) from the paper.

Later on, with communities like GitHub, things seemed to move back toward a healthier direction.

In such an environment, Word2Vec is a counterexample: perhaps the authors felt that, with the code open-sourced, it did not matter whether the paper was clear.

Right below, another commenter (nullwasamistake) pointed out that Word2Vec is not the only counterexample:

When I was implementing a hash table sorting algorithm, I found a recent paper with similar issues.

The paper never mentioned that the table size must be a power of 2.

And the whole point of that research was supposedly a higher memory efficiency than existing algorithms.

I had done two thirds of the work before finding that it was not more efficient than existing methods at all, but worse, unless I adjusted the table size to 2^n.

While it is not outright deception, this oversight is quite creative.

However, when someone suggested that he should post that paper, this honest commenter replied:

There is a risk in criticizing tech giants now; I might want to work there in the future.

From this, it can be seen that Pineapple is a brave young man.

Portal

Pineapple’s complete opinion on Word2Vec is published on GitHub, and interested readers can check it out:

https://github.com/bollu/bollu.github.io

Additionally, there is the Hacker News comment section, which is convenient for finding more like-minded individuals:

https://news.ycombinator.com/item?id=20089515
