Paper Title: Neural Word Embedding as Implicit Matrix Factorization
This is a NIPS 2014 paper, so it is fairly old. The title caught my eye and I downloaded it a while ago, but I never got around to reading it. Recently, after going through a series of articles on word vector representations (in particular, one I saw a couple of days ago that used SVD to factorize a PMI matrix as a replacement for word2vec), I remembered this paper, dug it out, and read it carefully; I got a lot out of it.
The core of this article discusses several things:
Derivation of the Equivalence of SGNS and PMI Matrix:
- Through a transformation of the loss function, the paper proves that the word vectors obtained by optimizing SGNS (skip-gram with negative sampling) are approximately the same as the row vectors obtained by factorizing a PMI matrix, differing only by a constant shift of log(k), where k is the number of negative samples drawn per positive sample. In other words, each element of the matrix being factorized is PMI(w,c) - log(k), where w is the center word and c is the context word. A brief sketch of the proof follows (see the paper for the full details):
- First, it can be shown that if w and c are allowed to be arbitrarily long (i.e. of unbounded dimensionality), then at the minimum of the SGNS loss the dot product w·c equals PMI(w,c) - log(k). The reason is that with unbounded dimensionality there are effectively unlimited parameters available, so each pairwise dot product w·c can independently fit any value.
- Then, the SGNS loss is reorganized. The key step is to treat each dot product w·c as a single variable x and optimize each cell of the objective independently; this independence rests on the condition above that w and c can be arbitrarily long, so every cell can reach whatever value it needs. Setting the derivative of each cell's local loss to zero gives the optimal value of w·c, which turns out to be exactly PMI(w,c) - log(k) (a condensed version of this derivation is given right after this list).
- The optimization of the NCE loss is similar to SGNS, and its result is equivalent to decomposing a conditional probability matrix whose elements are log P(w|c) - log(k), with the variable meanings the same as before.
- In reality, however, w and c cannot be arbitrarily long, so some cells will deviate from the optimal values above. The problem therefore becomes a weighted matrix factorization problem, where the weights are determined by the occurrence counts of the (w,c) pairs and the constraint is the limited dimensionality (length) of the w and c vectors.
Optimization Methods for PMI Matrix Decomposition:
- The PMI matrix is dense and has many hard-to-handle missing entries: for word-context pairs that never co-occur, PMI is log(0) = -inf, and simply writing those missing values as 0 also causes problems. It is therefore more reasonable to factorize the Positive PMI (PPMI) matrix instead, i.e. to set all non-positive values to 0.
- The author's intuition for ignoring the negative values of the PMI matrix is that positively associated word pairs are more informative, such as Canada and snow, while negatively associated pairs, such as Canada and desert, are less so. The similarity between two words is therefore driven more by the positive contexts they share than by the negative ones [1].
- Furthermore, by analogy with PPMI, the author proposes Shifted PPMI (SPPMI): take the shifted matrix with elements PMI(w,c) - log(k) and keep only its positive part, i.e. SPPMI(w,c) = max(PMI(w,c) - log(k), 0). Factorizing this SPPMI matrix corresponds to optimizing SGNS.
- SPPMI can be factorized directly with SVD, which recovers PMI-like matrices very well without any hyperparameters to tune, although its effectiveness on specific NLP tasks varies (a minimal sketch of this pipeline is given immediately below).
Comparison between SVD and SGNS:
- However, the weighted-decomposition point above puzzles me: the cells in the PMI matrix are already weighted, so wouldn't direct optimization be sufficient? Why is it still necessary to consider weighting separately?
- On word similarity, the vectors from SVD show no difference from SGNS, but on other tasks such as word analogy, SVD does not perform as well as SGNS.
- SVD can be run directly on simple co-occurrence counts, without constructing the massive training samples SGNS needs. This is an advantage when the data volume is large: the computational load of SVD stays essentially unchanged, while that of SGNS grows with the number of samples.
- The SVD method does not improve as the vector dimensionality increases, whereas SGNS does. One reason is that with higher dimensionality SVD spends its extra capacity fitting the large number of zero cells in the matrix, yielding near-zero components.
- SVD cannot handle the weighting issue, but SGNS, being based on SGD, can: high-frequency samples are seen more often, so SGNS pays more attention to them and optimizes them better, while SVD treats all cells alike and cannot give special treatment to high-frequency data.
- SVD does not make use of unobserved pairs, but SGNS does: they are exactly its negative samples.
- SGNS does not require the matrix to be sparse, but SVD prefers a sparse matrix; otherwise the computational load is significant.
- Gradient-based matrix factorization (SMF) sits between SVD and SGNS in its advantages and disadvantages: it is computationally more capable than SVD and can handle weights, but it cannot fully recover the original matrix, and so on (a rough sketch of such a weighted, SGD-based factorization is given right after this list).
In summary, the greatest contribution of this article is that it gives the somewhat black-box word vectors a more understandable, reasonable interpretation. However, because of the issues discussed above, the SVD approach is not necessarily suitable for practical, large-scale scenarios.
Postscript: I later saw another article [2] arguing that this paper's conclusion does not hold, because the equivalence between SGNS and matrix factorization rests on the premise that w and c can be arbitrarily long, i.e. that unlimited parameters are available, which is not the case in practice. However, I think the unrestricted case of w and c can be viewed as the ideal situation, i.e. the target values, while the restricted case is our learning model trying to fit those target values, which turns the problem into a regression; the results obtained this way should still be reasonable. Moreover, in terms of practical effectiveness, factorizing the SPPMI matrix with SVD does indeed yield good results.
That being said, even if the derivation in this paper has flaws, it is still the first public article (to my knowledge) to connect SGNS with matrix factorization (MF), and its contribution is significant.
[1] This approach feels quite similar to the idea of ReLU: both discard negative values and keep only the positive ones.
[2] http://building-babylon.net/2016/05/12/skipgram-isnt-matrix-factorisation/