Practical Application of Word2vec in NLP

  • Introduction

  • References

  • Main Content

    • Dataset

    • Model Training

    • Model Evaluation

    • Model Tuning

  • Extensions

  • Bonus

Introduction

Hello everyone. I am a dropout from Royal Bruster University of Data Mining; I drink the strongest orange juice and dig into the deepest corners of every problem, as persistent as ever.

Last week, on a whim, I dug a big Word2vec pit, leaving the practical part still to be filled in.

This article will use the Word2vec model to mine similar words in the biological field.

The content of this article includes:

* Interpreting the parameters of the Word2vec model in Gensim
* Training the Word2vec model based on the corresponding corpus and evaluating the results
* Tuning the model results

References

  1. Word2vec Tutorial

  2. Gensim Word2vec

  3. My previous work [Understanding the Essence of Word Vectors in Word2vec]

Main Content

It is recommended to use Jupyter Notebook for easy execution of code snippets. Also, install the gensim and nltk Python libraries in advance.

Additionally, I can’t stand the stupid code format of WeChat public accounts anymore!! Every time I upload it online, it gets messed up. I decided to upload the source code to Baidu Cloud. Just reply W2vAction in the public account backend to download it.

Dataset

At the end of reference [3.], I mentioned a dataset in the medical field (reply Sherlocked to obtain it). It is a non-public dataset built by crawling the abstracts of a number of medical papers and processing them so that each line is one sentence.

We can run the following shell script in Jupyter Notebook to see the general situation of the dataset (it may run incorrectly on Windows, but that’s not important).

!echo 'Number of lines in the dataset: '
!wc -l 'bioCorpus_5000.txt'
!echo '======'
!echo 'First 10 lines of the dataset'
!head -10 'bioCorpus_5000.txt'

We can see the output as follows:

Number of lines in the dataset: 5000 bioCorpus_5000.txt
======
First 10 lines of the dataset
formate assay in body fluids application in methanol poisoning.
delineation of the intimate details of the backbone conformation of pyridinenucleotide coenzymes in aqueous solution.
metal substitutions in carbonic anhydrase a halide ion probe study.
effect of chloroquine on cultured fibroblasts release of lysosomal hydrolases and inhibition of their uptake.
atomic models for the polypeptide backbones of myohemerythrin and hemerythrin.
studies of oxygen binding energy to hemoglobin molecule.
maturation of the adrenal medulla--IV
effects of morphine.
comparison between procaine and isocarboxazid metabolism in vitro by a liver microsomal amidase-esterase.
radiochemical assay of glutathione S-epoxide transferase and its enhancement by phenobarbital in rat liver in vivo.

As we can see, this dataset is relatively clean because I did some preprocessing, including removing abnormal characters and changing the first letter to lowercase, but I did not perform stemming or similar operations. Personally, I think different tenses of words should be treated differently.

Model Training

First, import our main character today—word2vec.

from gensim.models import word2vec # Word2vec model

If you have carefully read reference [2.], you will find that the input format accepted by word2vec is similar to:

[['I', 'am', 'handsome'], ['Mu', 'wen', 'looks', 'cool'], ...]

That is, a list of lists: each inner list represents one sentence, with that sentence's words split out as its elements.

However, when we have a lot of sentences and limited memory, storing them all in a list and loading them into memory is not appropriate, so we define a generator.

# Use a generator to read sentences from the file.
# Suitable for large files: the corpus is never loaded into memory all at once.
class MySentences(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        for line in open(self.fname, 'r'):
            yield line.split()

For those who are not familiar with generators and the yield keyword, you can supplement related knowledge on your own. I won’t expand on it here because it doesn’t affect the subsequent content.
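To make the behavior concrete, here is a minimal usage sketch (assuming bioCorpus_5000.txt sits in the current working directory): each pass over MySentences re-reads the file and yields one tokenized sentence at a time, so the full corpus never has to sit in memory.

sentences = MySentences('bioCorpus_5000.txt')
for i, sentence in enumerate(sentences):
    print(sentence)  # e.g. ['formate', 'assay', 'in', 'body', 'fluids', ...]
    if i == 2:       # only peek at the first few sentences
        break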

We can also define a training function, specifying the input file path and the output model path as follows:

# Model training function
def w2vTrain(f_input, model_output):
    sentences = MySentences(DataDir+f_input)
    w2v_model = word2vec.Word2Vec(sentences,
                                  min_count = MIN_COUNT,
                                  workers = CPU_NUM,
                                  size = VEC_SIZE,
                                  window = CONTEXT_WINDOW
                                 )
    w2v_model.save(ModelDir+model_output)

Note that word2vec.Word2Vec() is the implementation of word2vec in gensim, and the explanations of the parameters are as follows:

  • min_count: Words with a frequency < min_count will be discarded. (A better approach is to replace them with an UNK symbol standing in for 'out-of-vocabulary words'; a hedged sketch of that idea follows this list. Here we simplify, treat such low-frequency words as unimportant, and drop them directly.)

  • workers: The number of worker threads used to train in parallel; Cython must be installed for it to take effect (installing Cython is simple: pip install cython).

  • size: The dimension of the word vector, which is the number of hidden layer nodes in the neural network mentioned in reference [3.].

  • window: The maximum distance between the target word and a context word. This is easy to understand: the CBOW model, for example, predicts a word from its context, so that context has to be bounded. If the window is too large, words far from the target carry little useful meaning; if it is too small, there is not enough information. window is therefore the maximum distance at which a word still counts as context.
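
As promised in the first item above, here is a hedged sketch (not part of this article's pipeline; the threshold and the UNK token name are my own illustrative choices) of how low-frequency words could be mapped to a single UNK token instead of being dropped:

from collections import Counter

# Hedged sketch: replace rare words with 'UNK' before handing sentences to word2vec.
# min_count and unk_token are illustrative assumptions, not values from the article.
def replace_rare_with_unk(sentences, min_count=4, unk_token='UNK'):
    counts = Counter(w for s in sentences for w in s)
    return [[w if counts[w] >= min_count else unk_token for w in s]
            for s in sentences]

Nothing below depends on this; the article keeps the simpler min_count approach.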

After understanding the meanings of these parameters, we can manually set them based on the real situation of our dataset. We are using a small dataset here, so the parameters generally won’t be set too large. The training process is as follows:

# Training
DataDir = "./"
ModelDir = "./ipynb_garbage_files/"
MIN_COUNT = 4
CPU_NUM = 2  # Cython must be installed for parallel training to take effect
VEC_SIZE = 20
CONTEXT_WINDOW = 5  # maximum distance between a context word and the target word

f_input = "bioCorpus_5000.txt"
model_output = "test_w2v_model"
w2vTrain(f_input, model_output)
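One small practical note (mine, not from the original write-up): w2v_model.save() writes into ModelDir, so if ./ipynb_garbage_files/ does not exist yet on your machine, create it before running the training cell above, otherwise the save call will fail:

import os
if not os.path.exists(ModelDir):
    os.makedirs(ModelDir)  # make sure the output directory exists before saving the model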


Model Evaluation

After training the model, let’s first load the local model.

w2v_model = word2vec.Word2Vec.load(ModelDir+model_output)

Our evaluation method is to look for similar words for some existing words to see if they make sense. For example, we will look for synonyms of the word body.

w2v_model.most_similar('body')

The results are:

[('blood', 0.9992936253547668),
 ('an', 0.9992882609367371),
 ('plasma', 0.9992483258247375),
 ('with', 0.9992115497589111),
 ('cardiac', 0.9991875290870667),
 ('and', 0.9991493225097656),
 ('human', 0.999146580696106),
 ('concentration', 0.9991347789764404),
 ('as', 0.9991294741630554),
 ('between', 0.9990973472595215)]

The left column shows similar words, and the right column shows the similarity. We can see that besides blood, cardiac, and plasma, which are relatively meaningful, some obviously nonsensical words like an, and, and with are mixed in.
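If you want to probe individual pairs rather than scan the whole top-10 list, gensim also exposes a pairwise similarity() method; a quick hedged check, consistent with the numbers above, looks like this:

print(w2v_model.similarity('body', 'blood'))  # a biologically plausible pair
print(w2v_model.similarity('body', 'and'))    # a stop-word pair that should not be similar
# On this small corpus both scores come out around 0.999, which is exactly the problem.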

Now let’s look at the synonyms of heart.

w2v_model.most_similar('heart')

The results are:

[('liver', 0.9995023012161255),
 ('as', 0.9994983673095703),
 ('by', 0.9994901418685913),
 ('for', 0.9994851350784302),
 ('and', 0.9994800090789795),
 ('isolated', 0.9994774460792542),
 ('from', 0.9994565844535828),
 ('cells', 0.9994398355484009),
 ('with', 0.9994127154350281),
 ('patients', 0.9994051456451416)]

Some meaningful words, such as liver and cells, are present, but some strange ones are mixed in.

Model Tuning

The strange words mixed in are what we call ‘stop words’ in NLP, such as common pronouns and prepositions. I believe there are two reasons for this result:

  1. Poor parameter settings. For example, if VEC_SIZE is set too small, 20 dimensions may not be enough to capture the differences between words, so the hyperparameters need further adjustment.

  2. The dataset is small, so stop words occupy too much information.

I won’t go into parameter tuning in detail; it is something of a dark art (well, honestly, I find it too boring). If you have time, try Bayesian optimization. For the second issue, we can remove stop words from the dataset in advance. The approach is as follows: first, import the stop word list from nltk (if you have never downloaded it, run nltk.download('stopwords') once beforehand).

# Stop words
from nltk.corpus import stopwords
StopWords = stopwords.words('english')

We can take a look at the stop words.

StopWords[:20]
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers']

Next, we need to process each sentence after reading the dataset, then throw it into word2vec. Therefore, we need to define a new training function that adds the operation of removing stop words and retrain the model (for convenience of comparison, the parameter settings remain the same as above).

# Retraining
# Model training function with stop-word removal
def w2vTrain_removeStopWords(f_input, model_output):
    sentences = list(MySentences(DataDir + f_input))
    for idx, sentence in enumerate(sentences):
        sentence = [w for w in sentence if w not in StopWords]
        sentences[idx] = sentence
    w2v_model = word2vec.Word2Vec(sentences, min_count = MIN_COUNT,
                                  workers = CPU_NUM, size = VEC_SIZE,
                                  window = CONTEXT_WINDOW)
    w2v_model.save(ModelDir + model_output)

w2vTrain_removeStopWords(f_input, model_output)
w2v_model = word2vec.Word2Vec.load(ModelDir + model_output)

Next, let’s take a look at the effect.

w2v_model.most_similar('body') # The result is generally good

[('relationship', 0.9543654918670654),
 ('plasma', 0.9490970373153687),
 ('two', 0.9482829570770264),
 ('blood', 0.9451138973236084),
 ('structure', 0.9415417909622192),
 ('properties', 0.9410394430160522),
 ('human', 0.9409817457199097),
 ('cardiac', 0.9402023553848267),
 ('effect', 0.9401187896728516),
 ('response', 0.9397702217102051)]

Well, except for two, the rest are quite meaningful (but the similarity is still 0.9+, which is obviously too high, so hyperparameters still need to be adjusted).

Now let’s take a look at heart.

w2v_model.most_similar('heart') # The result is generally good

[('method', 0.9774184823036194),
 ('liver', 0.9749837517738342),
 ('determination', 0.9709398150444031),
 ('isolated', 0.9707441926002502),
 ('experimental', 0.9699070453643799),
 ('patients', 0.9696168899536133),
 ('chronic', 0.9682500958442688),
 ('regulation', 0.9668369293212891),
 ('study', 0.9661740660667419),
 ('cells', 0.9653100967407227)]

It’s getting interesting, but there are still some words like method and study that are not directly related; my guess is that they simply co-occur with heart in research contexts.
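If you want to chase the first tuning direction (a larger vector size) as well, here is a hedged sketch; the value below and the model name test_w2v_model_tuned are illustrative choices, not tuned results:

# Hedged experiment: retrain with more dimensions and re-check the neighbours.
VEC_SIZE = 50  # illustrative value; the training function above reads this module-level setting
w2vTrain_removeStopWords(f_input, "test_w2v_model_tuned")
w2v_model_tuned = word2vec.Word2Vec.load(ModelDir + "test_w2v_model_tuned")
w2v_model_tuned.most_similar('body')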

Extensions

Careful friends may have noticed that after training word2vec in gensim, the resulting model (our w2v_model above) carries attributes such as model.syn0 and model.syn1neg, which look like two-dimensional arrays, along with settings such as model.negative. What are these?

These are actually intermediate variables of the Word2vec training process: the parameters of the hierarchical softmax (or negative sampling) output layer, the increments used to update the word vectors, and the model's input word vectors (for CBOW, the sum of the context words; for skip-gram, the word itself). Since I did not dig into the training details of hierarchical softmax and negative sampling in reference [3.], this may be hard to follow. I will write more about it later if I find the time; if not, I'll abandon that pit. Either way, as long as you know how to use it, just call the gensim package; so easy!
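To poke at them yourself, here is a hedged peek (the attribute names below follow the older gensim API used throughout this article; newer gensim versions move the vectors under model.wv and eventually drop these names):

print(w2v_model.syn0.shape)     # (vocabulary size, VEC_SIZE): the input word vectors
print(w2v_model.syn1neg.shape)  # output-layer weights, present when negative sampling is enabled
print(w2v_model.negative)       # how many negative samples are drawn per positive example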

Bonus

Feel free to reward and support 24K pure original!


Additionally, I have created a discussion group that is only for technical questions. You can add my personal WeChat account MuWenHappyEverday to be invited into the group. Please be sure to include the note 'public account reader', otherwise the request will not be approved.
