Getting Started with Word2Vec: A Practical Guide

Author: Liu Jianping (Pinard). Blog: https://www.cnblogs.com/pinard. Original post: https://www.cnblogs.com/pinard/p/7278324.html

In the Word2Vec principle articles, we summarized Word2Vec's two models, CBOW and Skip-Gram, as well as its two training approaches, Hierarchical Softmax and Negative Sampling.

  • Word2Vec Principle Article | Basics of CBOW and Skip-Gram Models

  • Word2Vec Principle Article | Models Based on Hierarchical Softmax

  • Word2Vec Principle Article | Models Based on Negative Sampling

1. Installation and Overview of Gensim

Gensim is a very useful Python NLP package. It is not limited to Word2Vec; it offers many other APIs as well. It wraps Google's C implementation of Word2Vec. We could of course use the C version directly, but I personally find the Python interface provided by Gensim more convenient.

Installing Gensim is easy: just run "pip install gensim". Note, however, that Gensim has requirements on the NumPy version, so the installation may quietly upgrade your NumPy. On Windows, a NumPy installed or upgraded this way can be problematic. In that case, uninstall NumPy and download an MKL-enabled NumPy build that satisfies Gensim's version requirement from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy. The installation method is the same as in step 4 of the article on setting up a standalone Windows machine learning environment for scikit-learn and pandas.

The successful installation is indicated by the ability to import the following in your code without errors:

from gensim.models import word2vec
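
As a quick sanity check, you can also print the installed version; the snippet below is optional and simply reads Gensim's standard __version__ attribute. Keep in mind that the parameter names used in this article (size, iter) correspond to the pre-4.0 API; in Gensim 4.0 and later they were renamed to vector_size and epochs.

import gensim

# Print the installed Gensim version to know which parameter names apply
print(gensim.__version__)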

2. Overview of Gensim Word2Vec API

In Gensim, the APIs related to Word2Vec are all in the package gensim.models.word2vec. The parameters related to the algorithm are in the class gensim.models.word2vec.Word2Vec. The parameters to note for the algorithm are:

1) sentences: The corpus we want to analyze, which can be a list or read from a file. Later, we will have examples of reading from a file.

2) size: The dimension of the word vectors, with a default value of 100. This dimension is generally related to the size of our corpus; if it is a small corpus, such as text data less than 100M, then using the default value is generally sufficient. If it is a very large corpus, it is recommended to increase the dimension.

3) window: The maximum context distance for the word vectors. This parameter is denoted c in our algorithm principle article. The larger the window, the more distant the words that can serve as context for a given word. The default value is 5. In practice you can adjust this window size to your needs: for small corpora it can be set smaller, while for ordinary corpora it is recommended to keep it in the range [5,10].

4) sg: This is the choice between the two models of Word2Vec. If it is 0, it is the CBOW model; if it is 1, it is the Skip-Gram model. The default is 0, which is the CBOW model.

5) hs: This selects between Word2Vec's two training approaches. If it is 0 and negative is greater than 0, Negative Sampling is used; if it is 1, Hierarchical Softmax is used. The default is 0, i.e., Negative Sampling.

6) negative: This is the number of negative samples when using Negative Sampling, with a default of 5. It is recommended to keep this value between [3,10]. This parameter is marked as neg in our algorithm principle article.

7) cbow_mean: Only used in the CBOW projection step. If it is 0, the context word vectors are summed; if it is 1, they are averaged. The principle article describes CBOW in terms of the average of the context word vectors, and I personally prefer the average as well. The default value is 1, and it is not recommended to change it.

8) min_count: The minimum word frequency required to calculate the word vectors. This value can remove some very rare low-frequency words, with a default of 5. For small corpora, this value can be lowered.

9) iter: The maximum number of iterations in the stochastic gradient descent method, with a default of 5. For large corpora, this value can be increased.

10) alpha: The initial step size in the stochastic gradient descent method during iterations. It is marked as alpha in the algorithm principle article, with a default of 0.025.

11) min_alpha: Since the algorithm supports gradually reducing the step size during iterations, min_alpha gives the minimum iteration step size. The iteration step size in stochastic gradient descent can be derived from iter, alpha, and min_alpha together. This part is not the core content of the Word2Vec algorithm, so we did not mention it in the principle article. For large corpora, you need to tune alpha, min_alpha, and iter together to select suitable values for the three.
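
To make the parameter list concrete, here is a minimal, hypothetical instantiation that sets each of the parameters above explicitly. The variable sentences is assumed to be an iterable of tokenized sentences (for example, built with word2vec.LineSentence as in Section 3), and the keyword names follow the pre-4.0 Gensim API used throughout this article; comments note the names used in Gensim 4.0 and later.

from gensim.models import word2vec

# Hypothetical call for illustration only; "sentences" must already be defined
model = word2vec.Word2Vec(sentences,
                          size=100,         # word vector dimension (vector_size in Gensim >= 4.0)
                          window=5,         # maximum context distance c
                          sg=0,             # 0 = CBOW, 1 = Skip-Gram
                          hs=0,             # 0 = Negative Sampling (with negative > 0), 1 = Hierarchical Softmax
                          negative=5,       # number of negative samples neg
                          cbow_mean=1,      # 1 = average the context vectors, 0 = sum them
                          min_count=5,      # ignore words with frequency below this value
                          iter=5,           # number of training iterations (epochs in Gensim >= 4.0)
                          alpha=0.025,      # initial learning rate
                          min_alpha=0.0001) # minimum learning rate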

These are the main parameters of Gensim Word2Vec. Next, we will learn Word2Vec with a practical example.

3. Practical Application of Gensim Word2Vec

I chose the original text of the novel “In the Name of the People” as the corpus, and the original corpus can be found here (http://files.cnblogs.com/files/pinard/in_the_name_of_people.zip).

Complete code can be found on my GitHub: https://github.com/ljpzzz/machinelearning/blob/master/natural-language-processing/word2vec.ipynb

After obtaining the original text, we first need to segment it, which is done with Jieba. The principles and practice of Chinese word segmentation were already covered in the summary of Chinese text mining preprocessing, so here we go straight to the segmentation code; the segmented result is written to a separate file. The string of character names passed to suggest_freq below is there so that Jieba segments these names correctly.

# -*- coding: utf-8 -*-

import jieba
import jieba.analyse

jieba.suggest_freq('沙瑞金', True)
jieba.suggest_freq('田国富', True)
jieba.suggest_freq('高育良', True)
jieba.suggest_freq('侯亮平', True)
jieba.suggest_freq('钟小艾', True)
jieba.suggest_freq('陈岩石', True)
jieba.suggest_freq('欧阳菁', True)
jieba.suggest_freq('易学习', True)
jieba.suggest_freq('王大路', True)
jieba.suggest_freq('蔡成功', True)
jieba.suggest_freq('孙连城', True)
jieba.suggest_freq('季昌明', True)
jieba.suggest_freq('丁义珍', True)
jieba.suggest_freq('郑西坡', True)
jieba.suggest_freq('赵东来', True)
jieba.suggest_freq('高小琴', True)
jieba.suggest_freq('赵瑞龙', True)
jieba.suggest_freq('林华华', True)
jieba.suggest_freq('陆亦可', True)
jieba.suggest_freq('刘新建', True)
jieba.suggest_freq('刘庆祝', True)

with open('./in_the_name_of_people.txt', encoding='utf-8') as f:
    document = f.read()
    # If the source file is GBK-encoded, use encoding='gbk' above instead

    # jieba.cut returns a generator; printing it first would exhaust it,
    # so join it once and write the joined result
    document_cut = jieba.cut(document)
    result = ' '.join(document_cut)

    with open('./in_the_name_of_people_segment.txt', 'w', encoding='utf-8') as f2:
        f2.write(result)

After obtaining the segmented file, in general NLP processing, it is necessary to remove stop words. Since the Word2Vec algorithm relies on context, and stop words can potentially be part of the context, we can choose not to remove stop words for Word2Vec.

Now we can read the segmented file into memory directly. Here we use the LineSentence class from the word2vec module to read the file, and then train the Word2Vec model on it. This is just an example, so we skip parameter tuning; in real use you may need to tune some of the parameters discussed above.

# import modules & set up logging
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Each line of the segmented file becomes one sentence (a list of tokens)
sentences = word2vec.LineSentence('./in_the_name_of_people_segment.txt')

# hs=1 selects Hierarchical Softmax; note that size is called vector_size in Gensim >= 4.0
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)
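
If you want to reuse the trained model later without retraining, it can be persisted and reloaded. This is a minimal sketch; the file name is just a placeholder:

model.save('./in_the_name_of_people.model')                      # persist the trained model to disk
model = word2vec.Word2Vec.load('./in_the_name_of_people.model')  # reload it later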

The model is ready; what can we do with it? Here are three common applications.

The first is the most common: finding the words most similar to a given word. The code is as follows:

req_count = 5
# Look at the 100 most similar words and keep only three-character ones (mostly names)
for key in model.wv.similar_by_word('沙瑞金', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break

Let’s see the three-character words most similar to Secretary Sha (mainly names) as follows:

Gao Yuliang 0.967257142067
Li Dakang 0.959131598473
Tian Guofu 0.953414440155
Yi Xuexi 0.943500876427
Qi Tongwei 0.942932963371

The second application is computing the similarity between two words; here are the similarities for two pairs of characters from the book:

print(model.wv.similarity('沙瑞金', '高育良'))
print(model.wv.similarity('李达康', '王大路'))

The output is as follows:

0.961137455325
0.935589365706

The third application is finding the word that does not belong with the others; here is an example with characters from the book:

print(model.wv.doesnt_match("沙瑞金 高育良 李达康 刘庆祝".split()))

Word2Vec also performs well here, with the output being “Liu Qingzhu”.

That’s all for learning Word2Vec with Gensim; I hope it helps everyone.
