Using Word2Vec Word Vector Model in R

The gensim library in Python can train and use Word2Vec models, and R has a corresponding word2vec package. Word2Vec is one of the most commonly used word embedding techniques. If you are not familiar with word embeddings, you can read these previous articles:

  • Reprint | Expanding Research Methods in Social Sciences in the Era of Big Data — Application of Text Analysis Based on Word Embedding Technology
  • Reprint | From Symbols to Embeddings: Two Text Representations in Computational Social Science

The R packages needed for this article

install.packages(c("word2vec", "jiebaR", "tidyverse", "readtext"))

Common Functions of the Word2Vec Package

  • word2vec Train a Word2Vec model using text data
  • as.matrix Get word vectors
  • doc2vec Get document vectors
  • predict Get synonyms
  • write.word2vec Save the Word2Vec model to a file
  • read.word2vec Read a Word2Vec model file
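
Before walking through each step, here is a minimal sketch of how these six functions fit together (the file paths are placeholders; every call is covered in detail below):

library(word2vec)

model <- word2vec(x = "data/santi.txt")                    # train a model on a tokenized txt file
emb   <- as.matrix(model)                                  # extract the word-vector matrix
nns   <- predict(model, "汪淼", type = "nearest")          # nearest words (synonyms)
docv  <- doc2vec(model, newdata = "汪 教授 您 好")         # embed a document
write.word2vec(model, file = "output/santi_word2vec.bin")  # save the model
model <- read.word2vec(file = "output/santi_word2vec.bin") # reload it later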

Preparing Data

The original data, 三体.txt (the novel The Three-Body Problem), was downloaded from the web and has not been tokenized. We now need to:

  1. Read the Chinese txt data
  2. Preserve punctuation and perform tokenization
  3. Reorganize the tokenization results into a string format similar to English (words separated by spaces)
  4. Save the results to a new txt file

library(jiebaR)
library(tidyverse)
library(word2vec)

# Import data
tri_body <- readtext::readtext('data/三体.txt')$text 

# Tokenization (preserve punctuation)
tokenizer <- worker(symbol = TRUE)
tri_words <- segment(tri_body, tokenizer)

# Organize into English format (add spaces between words)
segmented_text <- stringr::str_c(tri_words, collapse = " ")

# Write to txt
readr::write_file(segmented_text, file='data/santi.txt')
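
As a quick sanity check (purely illustrative), peek at the first characters of the space-separated string:

stringr::str_sub(segmented_text, 1, 30)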

Training the Word2Vec Model

word2vec(
  x,
  type = c("cbow", "skip-gram"),
  dim = 50,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 5L,
  lr = 0.05,
  min_count = 5L,
  split = c(" \n,.-!?:;/\"#$%&'<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
  stopwords = character(),
  threads = 1L,
  ...
)
  • x Path to a txt file of training text; for Chinese, use the pre-tokenized file with words separated by spaces
  • type Training architecture, "cbow" or "skip-gram"; default is CBOW
  • dim Word vector dimension; default is 50
  • window Context window size; default is 5 for CBOW and 10 for skip-gram
  • iter Number of training iterations; default is 5
  • split A two-element character vector: delimiters for splitting words and for splitting sentences
  • lr Learning rate; default is 0.05
  • min_count Minimum word frequency; words appearing fewer than 5 times in the corpus are excluded from the trained model
  • stopwords Stop-word list; default is an empty character vector
  • threads Number of CPU cores to use for parallel training; default is 1. To speed up training, use parallel::detectCores() to find the number of cores on your machine

# Train a 10-dimensional word vector model
model <- word2vec(x = 'data/santi.txt', 
                  dim = 10,  
                  iter = 20, 
                  split = c(" ",  "。?!;"),
                  threads = parallel::detectCores()) # Parallel, use multiple CPU cores to accelerate

emb <- as.matrix(model)

# Display 6 words
head(emb)
##             [,1]       [,2]        [,3]        [,4]      [,5]        [,6]
## 煮   -1.02566934 -0.9271542 -0.42417252 -0.54280633 1.8847700  0.41640753
## 报   -0.83992052  1.9440031  0.09093992  0.83522910 1.7909089  0.72149992
## 悬空 -0.06369513 -1.3519955 -2.13137460 -0.06198586 0.6096401  1.32933748
## 略    1.74687469 -0.4278547 -0.33822438  1.08505321 2.0168977 -0.07693915
## 伏   -0.68947995 -1.4147453 -1.95522511 -0.39963767 0.5269030  0.30352208
## 石柱 -0.40561640 -1.3643234  0.30329546 -0.94012892 2.1579018  0.79654717
##            [,7]       [,8]       [,9]      [,10]
## 煮   -1.1708908 -0.7624418 -0.6275516  1.2417521
## 报    0.5235919  0.8448864 -0.2960095 -0.0773837
## 悬空  0.1527163 -0.1337370 -0.1646384  1.1892601
## 略   -0.3246748 -0.9813624  0.5045205  0.2771466
## 伏    0.3166684 -1.4238008 -1.0167172 -0.0976937
## 石柱  0.2237919  0.6933151  0.7412233 -0.7918702

View the Vector of a Word

View the vector of the word 汪淼

emb["汪淼",]

##  [1] -0.77559733 -0.90021265  0.66555792 -0.10277803  1.89924443 -0.88817298
##  [7] -1.32665634 -0.75938725 -0.09628224  1.18008399

View the vector of the word 地球

emb["地球",]

##  [1]  0.29645494 -0.61688840  0.91209215 -0.64530188  0.62816381 -0.72807491
##  [7]  0.50655973  2.38137436  1.19238114 -0.0961034
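
The raw numbers are hard to interpret on their own; a common way to compare two word vectors is cosine similarity. Below is a minimal base-R sketch (the word2vec package also provides a word2vec_similarity() helper for this):

# Cosine similarity between two word vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_sim(emb["汪淼", ], emb["地球", ])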

predict()

Find the 20 most similar words to the word 罗辑 in the corpus

predict(model, '罗辑', type='nearest', top_n = 20)

## $罗辑
##    term1    term2 similarity rank
## 1   罗辑     胡文  0.9744400    1
## 2   罗辑   申玉菲  0.9678891    2
## 3   罗辑   瓦季姆  0.9550550    3
## 4   罗辑 狄奥伦娜  0.9518393    4
## 5   罗辑     蓝西  0.9472395    5
## 6   罗辑     护士  0.9471439    6
## 7   罗辑   法扎兰  0.9458703    7
## 8   罗辑   白艾思  0.9451101    8
## 9   罗辑     坎特  0.9396626    9
## 10  罗辑     白蓉  0.9387447   10
## 11  罗辑   参谋长  0.9377206   11
## 12  罗辑   弗雷斯  0.9369408   12
## 13  罗辑   第一眼  0.9357565   13
## 14  罗辑     父亲  0.9350463   14
## 15  罗辑   多少次  0.9314436   15
## 16  罗辑     门去  0.9291503   16
## 17  罗辑     维德  0.9267251   17
## 18  罗辑     褐蚁  0.9203902   18
## 19  罗辑       刚  0.9200501   19
## 20  罗辑     吴岳  0.9191605   20

Find the 10 words nearest to the mean vector (the centroid) of several word vectors

vectors <- emb[c("汪淼", "罗辑", "叶文洁"), ]
centroid_vector <- colMeans(vectors)

predict(model, centroid_vector, type = "nearest", top_n = 10)

##        term similarity rank
## 1      罗辑  0.9185568    1
## 2  狄奥伦娜  0.9104245    2
## 3      文洁  0.9088279    3
## 4      汪淼  0.9054156    4
## 5    白艾思  0.9046930    5
## 6      张翔  0.9026827    6
## 7      尴尬  0.8952187    7
## 8      庄颜  0.8952166    8
## 9      皇帝  0.8949283    9
## 10     父亲  0.8915347   10
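
Since predict() accepts any hand-built vector, the same mechanism supports word-vector arithmetic; a sketch (with a 10-dimensional model trained on a single novel, the results will be noisy):

# Offset vector in the style of the classic word2vec analogies
offset <- emb["罗辑", ] - emb["汪淼", ] + emb["叶文洁", ]
predict(model, offset, type = "nearest", top_n = 5)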

doc2vec()

  • doc2vec(object, newdata, split = " ")
    • object A Word2Vec model object
    • newdata A character vector of documents, each a string of space-separated words
    • split Delimiter between words; default is a space

Convert 2 documents to vectors

docs <- c("哦 , 对不起 , 汪 教授 。这是 我们 史强 队长 。", 
          " 丁仪 博士 , 您 能否 把 杨冬 的 遗书 给 汪 教授 看 一下 ?")

doc2vec(object=model, newdata = docs, split=' ')

##            [,1]       [,2]       [,3]     [,4]      [,5]       [,6]       [,7]
## [1,] -1.1769752 -0.1065619  0.1983950 1.734068 0.5478012 -0.8320528 -0.2387014
## [2,] -0.4827189  0.0664595 -0.2119484 1.895074 0.6729840 -0.3008853 -0.6857539
##            [,8]      [,9]      [,10]
## [1,] -0.5519856 -2.007002  0.4182127
## [2,] -0.5976922 -2.130454 -0.4653725
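
Reusing the cosine_sim() helper defined earlier, one illustrative use of these document vectors is measuring how similar the two sentences are:

doc_emb <- doc2vec(object = model, newdata = docs, split = ' ')
cosine_sim(doc_emb[1, ], doc_emb[2, ])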

Saving the Word2Vec Model

There are two main reasons to save the model:

  • To share the Word2Vec model
  • To avoid retraining the model repeatedly, saving data analysis time

# Create the output folder if it does not exist, then save the model in it
dir.create("output", showWarnings = FALSE)
word2vec::write.word2vec(x = model, 
                         file = "output/santi_word2vec.bin")

Importing a Pre-trained Model

Import the pre-trained Word2Vec model from output/santi_word2vec.bin

pre_trained_model <- word2vec::read.word2vec(file = "output/santi_word2vec.bin")
pre_trained_emb <- as.matrix(pre_trained_model)
head(pre_trained_emb)

##             [,1]       [,2]        [,3]        [,4]      [,5]        [,6]
## 煮   -1.02566934 -0.9271542 -0.42417252 -0.54280633 1.8847700  0.41640753
## 报   -0.83992052  1.9440031  0.09093992  0.83522910 1.7909089  0.72149992
## 悬空 -0.06369513 -1.3519955 -2.13137460 -0.06198586 0.6096401  1.32933748
## 略    1.74687469 -0.4278547 -0.33822438  1.08505321 2.0168977 -0.07693915
## 伏   -0.68947995 -1.4147453 -1.95522511 -0.39963767 0.5269030  0.30352208
## 石柱 -0.40561640 -1.3643234  0.30329546 -0.94012892 2.1579018  0.79654717
##            [,7]       [,8]       [,9]      [,10]
## 煮   -1.1708908 -0.7624418 -0.6275516  1.2417521
## 报    0.5235919  0.8448864 -0.2960095 -0.0773837
## 悬空  0.1527163 -0.1337370 -0.1646384  1.1892601
## 略   -0.3246748 -0.9813624  0.5045205  0.2771466
## 伏    0.3166684 -1.4238008 -1.0167172 -0.0976937
## 石柱  0.2237919  0.6933151  0.7412233 -0.7918702
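
The reloaded model supports the same queries as the freshly trained one; for example, a quick sanity check with a nearest-neighbor lookup:

predict(pre_trained_model, '罗辑', type = 'nearest', top_n = 5)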
