In Python, the gensim library can train and use Word2Vec models; R has a corresponding word2vec package. Word2Vec is one of the most commonly used word embedding techniques. If you are not familiar with word embeddings, you can read these earlier articles:
- Reprint | Expanding Research Methods in Social Sciences in the Era of Big Data — Application of Text Analysis Based on Word Embedding Technology
- Reprint | From Symbols to Embeddings: Two Text Representations in Computational Social Science
The R packages needed for this article
install.packages(c("word2vec", "jiebaR", "tidyverse", "readtext"))
Common Functions of the Word2Vec Package
- word2vec: train a Word2Vec model on text data
- as.matrix: get the word vectors
- doc2vec: get document vectors
- predict: get the most similar words
- write.word2vec: save a Word2Vec model to a file
- read.word2vec: read a Word2Vec model from a file
Preparing Data
The original data, 三体.txt, was downloaded from the web and has not been tokenized. We now need to:
- Read the Chinese txt data
- Tokenize the text while preserving punctuation
- Reassemble the tokens into an English-like string (words separated by spaces)
- Save the result to a new txt file
library(jiebaR)
library(tidyverse)
library(word2vec)
# Import data
tri_body <- readtext::readtext('data/三体.txt')$text
# Tokenization (preserve punctuation)
tokenizer <- worker(symbol = TRUE)
tri_words <- segment(tri_body, tokenizer)
# Join the tokens into one space-separated string (English-like format)
segmented_text <- stringr::str_c(tri_words, collapse = " ")
# Write to txt
readr::write_file(segmented_text, file='data/santi.txt')
Training the Word2Vec Model
word2vec(
x,
type = c("cbow", "skip-gram"),
dim = 50,
window = ifelse(type == "cbow", 5L, 10L),
iter = 5L,
lr = 0.05,
min_count = 5L,
split = c(" \n,.-!?:;/\"#$%&'<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
stopwords = character(),
threads = 1L,
...
)
- x: a txt file of text data; for Chinese data, use the tokenized txt file with words separated by spaces
- type: training architecture, "cbow" (default) or "skip-gram"; a skip-gram sketch follows this list
- dim: dimension of the word vectors, default 50
- window: context window size; defaults to 5 for CBOW and 10 for skip-gram
- iter: number of training iterations, default 5
- split: delimiters used to split the text into words and into sentences
- lr: learning rate, default 0.05
- min_count: a word must occur at least this many times in the corpus (default 5); rarer words are dropped from the trained model
- stopwords: stop word list, default is an empty character vector
- threads: number of CPU cores for parallel training, default 1; to speed up training, parallel::detectCores() returns the number of cores on your machine
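The type entry above mentions skip-gram; as a minimal sketch, training one looks like this (the hyperparameter values here are illustrative assumptions, not tuned recommendations):
# Illustrative sketch: train a skip-gram model on the same tokenized file
model_sg <- word2vec(x = 'data/santi.txt',
                     type = 'skip-gram',  # instead of the default "cbow"
                     dim = 10,
                     window = 10,         # the skip-gram default window
                     iter = 20)
The rest of this article uses the CBOW model trained below.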
# Train a 10-dimensional word vector model
model <- word2vec(x = 'data/santi.txt',
dim = 10,
iter = 20,
split = c(" ", "。?!;"),
threads = parallel::detectCores()) # Parallel, use multiple CPU cores to accelerate
emb <- as.matrix(model)
# Show the vectors of the first 6 words
head(emb)
## [,1] [,2] [,3] [,4] [,5] [,6]
## 煮 -1.02566934 -0.9271542 -0.42417252 -0.54280633 1.8847700 0.41640753
## 报 -0.83992052 1.9440031 0.09093992 0.83522910 1.7909089 0.72149992
## 悬空 -0.06369513 -1.3519955 -2.13137460 -0.06198586 0.6096401 1.32933748
## 略 1.74687469 -0.4278547 -0.33822438 1.08505321 2.0168977 -0.07693915
## 伏 -0.68947995 -1.4147453 -1.95522511 -0.39963767 0.5269030 0.30352208
## 石柱 -0.40561640 -1.3643234 0.30329546 -0.94012892 2.1579018 0.79654717
## [,7] [,8] [,9] [,10]
## 煮 -1.1708908 -0.7624418 -0.6275516 1.2417521
## 报 0.5235919 0.8448864 -0.2960095 -0.0773837
## 悬空 0.1527163 -0.1337370 -0.1646384 1.1892601
## 略 -0.3246748 -0.9813624 0.5045205 0.2771466
## 伏 0.3166684 -1.4238008 -1.0167172 -0.0976937
## 石柱 0.2237919 0.6933151 0.7412233 -0.7918702
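Because of the min_count filter, not every word in the corpus makes it into the model. A quick check of how many words were kept (the exact count depends on your corpus and settings):
dim(emb)  # rows: vocabulary size after the min_count filter; columns: dim (10 here)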
View the Vector of a Word
View the vector of the word 汪淼
emb["汪淼",]
## [1] -0.77559733 -0.90021265 0.66555792 -0.10277803 1.89924443 -0.88817298
## [7] -1.32665634 -0.75938725 -0.09628224 1.18008399
View the vector of the word 地球
emb["地球",]
## [1] 0.29645494 -0.61688840 0.91209215 -0.64530188 0.62816381 -0.72807491
## [7] 0.50655973 2.38137436 1.19238114 -0.0961034
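With these two vectors you can also compare words directly. The word2vec package provides word2vec_similarity(); a minimal sketch computing the cosine similarity between 汪淼 and 地球:
# Cosine similarity between the two word vectors shown above
word2vec::word2vec_similarity(emb["汪淼", ], emb["地球", ], type = "cosine")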
predict()
Find the 20 words most similar to 罗辑 in the corpus
predict(model, '罗辑', type='nearest', top_n = 20)
## $罗辑
## term1 term2 similarity rank
## 1 罗辑 胡文 0.9744400 1
## 2 罗辑 申玉菲 0.9678891 2
## 3 罗辑 瓦季姆 0.9550550 3
## 4 罗辑 狄奥伦娜 0.9518393 4
## 5 罗辑 蓝西 0.9472395 5
## 6 罗辑 护士 0.9471439 6
## 7 罗辑 法扎兰 0.9458703 7
## 8 罗辑 白艾思 0.9451101 8
## 9 罗辑 坎特 0.9396626 9
## 10 罗辑 白蓉 0.9387447 10
## 11 罗辑 参谋长 0.9377206 11
## 12 罗辑 弗雷斯 0.9369408 12
## 13 罗辑 第一眼 0.9357565 13
## 14 罗辑 父亲 0.9350463 14
## 15 罗辑 多少次 0.9314436 15
## 16 罗辑 门去 0.9291503 16
## 17 罗辑 维德 0.9267251 17
## 18 罗辑 褐蚁 0.9203902 18
## 19 罗辑 刚 0.9200501 19
## 20 罗辑 吴岳 0.9191605 20
Compute the mean vector (the centroid of several word vectors) of three character names, then find the 10 words nearest to it
vectors <- emb[c("汪淼", "罗辑", "叶文洁"), ]
centroid_vector <- colMeans(vectors)
predict(model, centroid_vector, type = "nearest", top_n = 10)
## term similarity rank
## 1 罗辑 0.9185568 1
## 2 狄奥伦娜 0.9104245 2
## 3 文洁 0.9088279 3
## 4 汪淼 0.9054156 4
## 5 白艾思 0.9046930 5
## 6 张翔 0.9026827 6
## 7 尴尬 0.8952187 7
## 8 庄颜 0.8952166 8
## 9 皇帝 0.8949283 9
## 10 父亲 0.8915347 10
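Because predict() accepts an arbitrary numeric vector, the same mechanism supports the classic word-vector arithmetic. A hedged sketch (this word triple is chosen purely for illustration; results from a single novel are noisy):
# Vector arithmetic: 罗辑 - 汪淼 + 叶文洁, then look up the nearest words
composed <- emb["罗辑", ] - emb["汪淼", ] + emb["叶文洁", ]
predict(model, composed, type = "nearest", top_n = 5)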
doc2vec()
- doc2vec(object, newdata, split = " ")
- object: the Word2Vec model object
- newdata: the documents, as a character vector of space-separated strings
- split: delimiter, default is a space
Convert 2 documents to vectors
docs <- c("哦 , 对不起 , 汪 教授 。这是 我们 史强 队长 。",
" 丁仪 博士 , 您 能否 把 杨冬 的 遗书 给 汪 教授 看 一下 ?")
doc2vec(object=model, newdata = docs, split=' ')
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] -1.1769752 -0.1065619 0.1983950 1.734068 0.5478012 -0.8320528 -0.2387014
## [2,] -0.4827189 0.0664595 -0.2119484 1.895074 0.6729840 -0.3008853 -0.6857539
## [,8] [,9] [,10]
## [1,] -0.5519856 -2.007002 0.4182127
## [2,] -0.5976922 -2.130454 -0.4653725
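Once documents are vectors, comparing them is straightforward. A small sketch that measures how close the two documents above are, reusing the word2vec_similarity() helper mentioned earlier:
doc_emb <- doc2vec(object = model, newdata = docs, split = ' ')
# Cosine similarity between document 1 and document 2
word2vec::word2vec_similarity(doc_emb[1, ], doc_emb[2, ], type = "cosine")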
Saving the Word2Vec Model
Saving the model serves two main purposes:
- To share the Word2Vec model
- To avoid retraining the model repeatedly, saving analysis time
# Create the output folder first if it does not already exist
if (!dir.exists("output")) dir.create("output")
word2vec::write.word2vec(x = model,
                         file = "output/santi_word2vec.bin")
Importing a Pre-trained Model
Import the pre-trained Word2Vec model from output/santi_word2vec.bin
pre_trained_model <- word2vec::read.word2vec(file = "output/santi_word2vec.bin")
pre_trained_emb <- as.matrix(pre_trained_model)
head(pre_trained_emb)
## [,1] [,2] [,3] [,4] [,5] [,6]
## 煮 -1.02566934 -0.9271542 -0.42417252 -0.54280633 1.8847700 0.41640753
## 报 -0.83992052 1.9440031 0.09093992 0.83522910 1.7909089 0.72149992
## 悬空 -0.06369513 -1.3519955 -2.13137460 -0.06198586 0.6096401 1.32933748
## 略 1.74687469 -0.4278547 -0.33822438 1.08505321 2.0168977 -0.07693915
## 伏 -0.68947995 -1.4147453 -1.95522511 -0.39963767 0.5269030 0.30352208
## 石柱 -0.40561640 -1.3643234 0.30329546 -0.94012892 2.1579018 0.79654717
## [,7] [,8] [,9] [,10]
## 煮 -1.1708908 -0.7624418 -0.6275516 1.2417521
## 报 0.5235919 0.8448864 -0.2960095 -0.0773837
## 悬空 0.1527163 -0.1337370 -0.1646384 1.1892601
## 略 -0.3246748 -0.9813624 0.5045205 0.2771466
## 伏 0.3166684 -1.4238008 -1.0167172 -0.0976937
## 石柱 0.2237919 0.6933151 0.7412233 -0.7918702
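As a sanity check, you can query the reloaded model directly; its answers should match those of the in-memory model trained above:
# The reloaded model should return the same nearest neighbours as before
predict(pre_trained_model, '罗辑', type = 'nearest', top_n = 5)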
Selected Articles
From Symbols to Embeddings: Two Text Representations in Computational Social Science
Data | Quantitative History and Economics Research
Long-term Call for Papers | Welcome everyone to submit
17G Dataset | Shenzhen Stock Exchange Corporate Social Responsibility Reports
Baidu Index | Using Qdata to Collect Baidu Index
Recommended | Quick Guide to Text Analysis in Social Sciences (Economics and Management)
Video Sharing | Application of Text Analysis in Management Research
Maigret Library | Check the usage of a username on various platforms
MS | Using Network Algorithms to Identify Innovation Disruption
Training GloVe Word Embedding Model with CNtext
Cognitive Measurement | Vector Distance vs Semantic Projection
Wordify | Tool to Discover and Differentiate Consumer Vocabulary
Display PDF Content in Jupyter
NLP Roadmap | Mind Map of Text Analysis Knowledge Points
EmoBank | Chinese Dimensional Emotion Dictionary
Asent Library | English Text Data Sentiment Analysis
Video Column Course | Python Web Crawlers and Text Analysis
PNAS | Text Network Analysis & Cultural Bridge Python Code Implementation
BERTopic Library | Topic Modeling with Pre-trained Models
Tomotopy | Fastest LDA Topic Model
Management World | Building and Measuring Short-sightedness with Text Analysis
Wow~70G Regular Report Dataset of Listed Companies
100min Video | Python Text Analysis and Accounting
Run R Code in Jupyter
Error when Installing Python Packages: Microsoft Visual 14.0 or Greater is Required. What to Do?
Blogdown Package | Using R to Maintain Hugo Static Websites
R Language | Creating Academic Conference Posters with Posterdown Package
R Language | Creating PPT in Rmarkdown with Officedown Package
R Language | Drawing SCI Style Charts with Ggsci Package
R Language | Combining Multiple TXT Files into One CSV File