1. Introduction
Hello everyone! After the Lantern Festival and the New Year, let’s embrace the new year with vigor!
A few days ago, while browsing Xiaohongshu, I came across a post titled: Useful! Treasure WeChat Public Accounts in the Field of Information Science. I took a closer look and saw our Information Science Charging Station.

Thinking about not updating for three months, I truly apologize for the recommendation from the blogger. Coincidentally, yesterday a friend rewarded me, saying that the earlier shared C Journal Intensive Reading helped him greatly, which made me feel that the work I’ve done is valuable. No more delays, let’s get started!
We will continue to update the Read C Journals and Publish C Journals series, taking a specific technology as a starting point to list recent relevant C journals and provide case applications for that technology. I hope this can help everyone with their paper submissions.
In today’s first issue, we will study Word Vector Technology – Word2Vec.
2. What is Word2Vec
Word2Vec is a model used for generating word embeddings, which can convert words into real-valued vectors that capture the semantic and syntactic properties of words. Word2Vec was proposed by Mikolov et al. in 2013, and it has two main training algorithms: CBOW and Skip-Gram.
2.1 CBOW (Continuous Bag of Words)
The CBOW model predicts a word given its surrounding context words. It assumes that the meaning of a word is the average representation of the surrounding words. The objective function of CBOW is to minimize the error in predicting the word.
2.2 Skip-Gram
The Skip-Gram model predicts the context given a word. It considers only one center word in each training step and tries to predict the words around it. Skip-Gram works well for handling both rare and frequent words.
2.3 Key Concepts of Word2Vec
-
Word Embeddings: Vector representations generated by Word2Vec that capture the similarity and associations between words. -
Context Window: During training, the model considers a certain number of words around the target word as context. -
Negative Sampling: A technique to optimize the training process by randomly selecting some negative samples to reduce computational load.
For those specifically interested in the technical details of Word2Vec, please refer to the papers or related technical documents; we will not elaborate further. The key takeaway is to understand what it can do.
3. Related Papers on Word2Vec (2023 to Present)
We retrieved relevant papers from CNKI from 2023 to the present for your reference; you can select those of interest for intensive reading. Among them, **papers [2] and [3]** have been elaborated and published on Bilibili; feel free to check them out.
[1] Yu Hui, Wei Zimeng, Xia Wenlei, Huang Wei, Chen Xiaofang. Distribution of Technical Demand Hotspots Across Fields and Time and Its Trend Prediction[J]. Information Theory and Practice: 1-13.
Abstract: [Purpose/Significance] Predicting technical demand hotspots is a key factor that leads the process of technological innovation, and accurately identifying the practical market value of foresight technology innovation is crucial to reducing innovation risks and upgrading industrial modernization. [Method/Process] By using the Word2Vec model and TF-IDF to extract keywords, combined with LDA modeling and topic clustering for cross-field comprehensive analysis, generating themes of technical hotspots and displaying existing technical hotspot distribution trends through a word cloud. Additionally, by selecting high-frequency and low-frequency words and combining time series and regression analysis, cross-temporal identification of focused areas of technical demand is conducted, and the relative growth rates of different technologies are calculated to obtain the development potential of technologies. [Results/Conclusion] The technical demand hotspot themes in the central region mainly focus on “intelligent systems and equipment upgrades,” “environmental protection and resource utilization,” and “production processes and quality management.” There are technical demand gaps in high-performance, intelligent process technologies and equipment, and product testing, selenium-rich compounds, and intelligent synthesis and production processes have high development potential.
[2] Hu Zewen, Han Yalu, Wang Mengya. Evolution of Research Themes and Identification of Hot Topics in Machine Learning in the Field of Library and Information Science Based on LDA-Word2vec[J]. Modern Information: 1-15.
Abstract: [Purpose/Significance] In the context of the rapid development and profound changes in artificial intelligence technologies and applications, new research themes and methods are constantly emerging in the field of machine learning, with deep learning and reinforcement learning technologies continuously evolving. Therefore, it is necessary to explore the evolution of research themes in machine learning across different fields and identify hot and emerging themes. [Method/Process] This paper takes machine learning research papers from the Web of Science database in the field of library and information science from 2011 to 2022 as an example, integrating LDA and Word2vec methods for topic modeling and evolution analysis, introducing indicators of theme strength, influence, attention, and novelty to identify hot and emerging themes. [Results/Conclusion] The research results indicate that (1) the combination of Word2vec’s semantic processing capability and LDA’s theme evolution capability can more accurately identify research themes and intuitively display the phased evolution patterns of research themes; (2) the research themes in the field of library and information science mainly fall into three categories: natural language processing and text analysis, data mining and analysis, and information and knowledge services. The correlation between various themes is strong, and they exhibit thematic association evolution characteristics; (3) the designed indicators of theme strength, influence, and attention, along with the comprehensive indicators, can effectively identify hot themes in the three different periods of 2011-2014, 2015-2018, and 2019-2022.
—[2] Intensive Reading Video Link: https://www.bilibili.com/video/BV1ZN4y1v71c/
[3] Zhou Aixia, Yan Yalan, Zha Xianjin. Comparative Study of Big Data Attention Hotspots and Word Embedding Profiles Based on Neural Network Word Embedding[J]. Modern Information, 2024, 44(01): 37-47.
Abstract: [Purpose/Significance] Big data has a significant impact on social and economic development. This study combines China’s academic platforms and social Q&A platforms to compare big data attention hotspots and word embedding profiles, aiming to promote big data research and practice in China. [Method/Process] Word2vec is an emerging neural network word embedding algorithm, characterized by low computational cost and high accuracy, which can effectively measure the similarity of words both semantically and grammatically. First, data was collected from CNKI and Zhihu platforms to construct the corpora of academic and social Q&A platforms, and Word2vec models were trained based on these two corpora; second, the attention hotspots of big data on academic and social Q&A platforms were compared using the analysis of the most similar words; finally, dimensionality reduction techniques and data visualization methods were employed to compare the word embedding profiles of the two platforms. [Results/Conclusion] The research results show the differences between China’s academic platforms and social Q&A platforms in the field of big data. This study innovatively utilizes the Word2vec neural network word embedding algorithm to conduct comparative analysis of big data, providing a new perspective for big data research.
—[3] Intensive Reading Video Link: https://www.bilibili.com/video/BV1uH4y1Y7nY/
[4] Qi Ying, Zhang Tao. Comparative Study of Interdisciplinary Research Themes in Humanities and Social Sciences at Home and Abroad: Literature Theme Comparison and China’s Path Selection[J]. Information Science: 1-10.
Abstract: [Purpose/Significance] Cross-disciplinary integration is the inherent logic of scientific development and also the driving force of scientific development. Analyzing interdisciplinary research in the humanities and social sciences at home and abroad not only expands the boundaries of knowledge and enhances cognitive and communication abilities but also maintains national security and enhances academic discourse power. [Method/Process] This paper utilizes the combination of LDA and Word2vec to sort and analyze the themes of interdisciplinary research literature in the humanities and social sciences at home and abroad, and conducts a comprehensive comparison of literature themes from both synchronic and diachronic perspectives to reveal the commonalities and differences in interdisciplinary research at home and abroad. [Conclusion/Findings] The research themes that scholars continue to focus on at home and abroad include interdisciplinary education, while recent hot themes of interest include big data-based interdisciplinary research and interdisciplinary evaluation studies; the research field with significant differences between domestic and foreign scholars is the medical and health field. Overall, interdisciplinary research at home and abroad is developing towards diversification, systematization, and deepening.
[5] Li Xiuxia, Shao Zuoyun. Research on Discovering Academic Innovation Opportunities Based on Outlier Theme Word Interdisciplinary Combinations[J]. Information Theory and Practice, 2023, 46(12): 122-130.
Abstract: [Purpose/Significance] Outlier theme words in academic fields can provide novel and scarce information for discovering innovation opportunities, and interdisciplinary combinations of outlier theme words can generate new knowledge and produce groundbreaking academic innovation opportunities. [Method/Process] Taking information science and political science as examples, this study uses LDA to extract themes from literature in different disciplines, using theme words with low probability distributions as data objects, and applies Word2Vec and PCA techniques to represent theme words containing textual semantics in low-dimensional dense vectors. The semantic similarity between outlier theme words in different disciplines is calculated based on cosine similarity, and combinations of outlier theme words from different disciplines with high similarity are considered as combinations with innovative potential. Further screening of outlier theme word combinations based on designed demand indicators ultimately identifies future academic innovation opportunities with research potential. [Results/Conclusion] By combining theme extraction and semantic analysis, the value and semantic context of outlier theme words are fully considered; the semantic similarity of interdisciplinary combinations of outlier theme words and demand indicators can balance the novelty and utility characteristics of academic innovation. The study shows that this research method can effectively discover academic innovation opportunities and provide reliable references for scientific research guidance and knowledge services.
[6] Lü Kun, Xiang Minhao, Jing Jipeng. Research on the Identification of Disruptive Technology Themes Based on LDA2Vec and DTM Models—Taking the Energy Technology Field as an Example[J]. Library and Information Work, 2023, 67(12): 89-102.
Abstract: [Purpose/Significance] Disruptive technologies are related to national competitiveness and international status. Accurately identifying disruptive technology themes can address issues such as unclear themes and unclear development paths during technological development, effectively grasping technological development dynamics, adjusting national science and technology strategic layouts, and better seizing the high ground of international competition. [Method/Process] Taking patent text data in the energy technology field as the research object, this study constructs a fusion feature vector based on Word2Vec word vectors and LDA (Latent Dirichlet Allocation) topic vectors, and introduces the K-means algorithm to optimize theme clustering effects. Finally, the characteristics of disruptive technology are combined to identify disruptive technology themes, utilizing the DTM (Dynamic Topic Model) to reveal the development status of disruptive technology themes in this field. [Results/Conclusion] Through manual verification and comparison of model results, it can be found that the empirical results are reasonable, and the precision, recall, and F1 values of the model are all higher than those of similar topic models, proving that this method performs well in identifying disruptive technology themes.
[7] Zhang Yuling, Peng Lihui, Zhang Yanfeng, Ou Zhimei. Research on the Identification and Development Trends of Smart Emergency Related Technologies Based on Patent Data Mining[J]. Information Science, 2023, 41(08): 139-146.
Abstract: [Purpose/Significance] Based on the current status of smart emergency technology development, this study reveals the core technologies in the field of smart emergencies and identifies the development trends of smart emergency technologies based on the elements of technology association analysis. [Method/Process] Using the Innojoy retrieval platform as a sample source for smart emergency-related patent data, this study comprehensively applies LDA models, Word2Vec models, and TextRank algorithms to determine relevant themes and explore the distribution of theme words, utilizing wordcloud and ROSTCM6 to achieve technical hotspot analysis and technology association analysis of patent abstracts. [Results/Conclusion] Smart emergency technology is currently in the early stages of maturity, and patent themes can be categorized into nine types, including navigation and positioning technology, emergency lighting technology, and automatic alarm technology. In the future, through promoting multi-technology integration, multi-module development, multi-department linkage, and multi-terminal collaboration, smart emergency technology will further enhance its ability to prevent safety accidents.
[8] Zhang Junrui, Qiu Meng, Zhang Zhichao. Institutional Investor Cohesion and Company Forward-Looking Information Disclosure[J]. Statistics and Information Forum, 2023, 38(05): 53-66.
Abstract: Improving the level of information disclosure is key to alleviating the information asymmetry between a company’s internal management and external investors. As long-term fund providers, institutional investors pay more attention to the prospects and growth potential of enterprises, thus paying special attention to the company’s forward-looking information disclosure. Existing studies at home and abroad have lacked attention to the phenomenon of institutional investors holding shares together in the context of China’s practical situation, especially not exploring the relationship between institutional investors holding shares together and the forward-looking information disclosure of listed companies. Based on data from Chinese A-share listed companies from 2007 to 2019, this study constructs a network of institutional investors’ heavy shareholdings through complex network analysis methods, extracting institutional investor network groups using modularity community algorithms (Louvain algorithm), and constructs forward-looking information disclosure indicators for listed companies using the Word2Vec neural network model algorithm, further examining the impact of institutional investor network cohesion on company forward-looking information disclosure. The study finds that the proportion of institutional investors holding shares together is significantly positively correlated with the company’s forward-looking information disclosure, meaning that after institutional investors hold shares together, the frequency of future tense words in the company’s annual reports significantly increases. Mechanism tests reveal that institutional investors enhance their motivation and ability to supervise management through holding shares together, and that this cohesion prompts companies to increase the forward-looking content and tone in disclosures, with both mechanisms contributing to the increase in future outlook descriptions in annual reports. Further analysis shows that the positive effect of institutional investor cohesion on forward-looking information exists in private enterprises and those with high analyst attention. The research conclusions remain significant through a series of robustness tests, deepening the understanding of the governance role of institutional investors in corporate governance, and promoting the long-term investment role of institutional investors holding shares together, which has implications for the sustainable and stable development of capital markets.
[9] Dong Ke, Chen Xiaoping, Wu Jiachun. Research on the Correlation Between Innovation and Citation Impact of Scientific Papers—Based on Semantic Perspective Measurement[J]. Information Theory and Practice, 2023, 46(10): 24-31.
Abstract: [Purpose/Significance] This paper designs a method for measuring the innovation of papers based on semantic content, revealing the relationship and mechanism between paper innovation and academic impact. [Method/Process] The innovation of a single paper is divided into two characteristics: novelty and conventionality; by training keyword vectors using the Word2Vec model, the novelty and conventionality of papers are calculated by integrating keyword semantics; further regression analysis is conducted to examine the impact of paper innovation on citation counts. The analysis targets natural language processing research papers indexed by WoS, and the results indicate that: (1) the proposed method for measuring novelty and conventionality of papers fully considers textual semantics, allowing for the evaluation of innovation at the publication stage, and the innovation scores are relatively reasonable; (2) regression analysis shows that the more novel the research paper, the fewer citations it receives; (3) under the same novelty, the higher the conventionality of a research paper, the more citations it receives, with conventionality having a stronger impact on citation counts than novelty; (4) citation impact is more suitable for evaluating issues in highly conventionalized research fields.
[10] Zhang Mengyun, Ding Jingda. Research on Semantic Enhancement for Short Text Classification[J]. Library and Information Work, 2023, 67(09): 4-11+3.
Abstract: [Purpose/Significance] The rapid development of information technology has led to a surge in short text data such as user reviews and patient symptoms, making it a research hotspot in text classification to extract valuable information from short texts. [Method/Process] Taking the symptom data of patients from various departments of a domestic hospital as a corpus, this study addresses the issue of insufficient semantic information in short symptom texts by focusing on the importance and relevance of symptom words in each department. Texts with fewer than the set number of symptom words are treated as semantic enhancement objects, and typical symptom keywords from each department are extracted using Word2Vec and probability-based TF-IDF algorithms, which are then supplemented to the semantic enhancement objects to form a new corpus. Finally, machine learning algorithms are employed to classify the symptom texts. [Results/Conclusion] The new corpus constructed using the semantic enhancement method shows significant improvements in classification performance compared to the original corpus, with accuracy rates increasing by approximately 10%, 9%, and 10% on Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), and Random Forest (RF) respectively.
[11] Tang Xiaobo, Wu Haiting, Wu Jialin. Research on Patent Semantic Citation Recognition Method Based on Feature Knowledge Elements—Taking the Field of Quantum Computing as an Example[J]. Information Theory and Practice, 2023, 46(10): 86-95.
Abstract: [Purpose/Significance] Patent citation analysis is an important part of patent analysis research. Traditional patent citation analysis only analyzes patent data that are explicitly marked in the patent literature, which cannot accurately reflect the citation relationships between patents and is difficult to reveal the technical similarities between patents accurately. Semantic citation recognition of patents helps to accurately reveal the potential semantic connections between patents, providing references for the evaluation of patent inheritance and innovation, and assisting in patent review before and after patent grant. [Method/Process] First, feature knowledge elements of patents are extracted based on rules and syntactic analysis; second, Sentence-BERT and Word2Vec are used to vectorize the patent feature knowledge elements and patent title abstract texts; third, the feature similarity and overall similarity of patents are calculated based on cosine similarity, and the semantic citation patent set is obtained by combining the chronological order of patent applications; finally, experiments are conducted using patent data in the field of quantum computing. [Results/Conclusion] This patent semantic citation recognition method can effectively identify semantic citation patents, aiding in the evaluation of the technical novelty, creativity, and practicality of patents, and providing support for patent review and patent value assessment work.
4. Practical Case of Word2Vec
We take [3] Zhou Aixia, Yan Yalan, Zha Xianjin. Comparative Study of Big Data Attention Hotspots and Word Embedding Profiles Based on Neural Network Word Embedding[J]. Modern Information, 2024, 44(01): 37-47. as an example to replicate the implementation process of the entire paper. The implementation ideas are as follows:
-
(1) Read in data
import pandas as pd
engLines = pd.read_excel("Eng4WordCloud.xlsx",names = ['Abstract'])
-
(2) Data preprocessing
def remove_stopwords(lines, sw = sw):
res = []
for line in lines:
original = line
line = [w for w in line if w not in sw]
if len(line) < 1:
line = original
res.append(line)
return res
filtered_lines = remove_stopwords(lines = lines, sw = sw)
-
(3) Train word vectors
w = w2v(
filtered_lines,
min_count=3,
sg = 1,
window=7
)
print(w.wv.most_similar('techniques'))
[OUT]:
[('output', 0.9978842735290527),
('using', 0.9978089928627014),
('determining', 0.9977988600730896),
('receive', 0.9977968335151672),
('second', 0.9977602362632751),
('pertaining', 0.997706413269043),
('herein', 0.9977037310600281),
('value', 0.9976972937583923),
('process', 0.9976484775543213),
('security', 0.9976397156715393)]
-
(4) Dimensionality reduction visualization
pca = PCA(n_components=2, random_state=7)
pca_mdl = pca.fit_transform(emb_df)
emb_df_PCA = (
pd.DataFrame(
pca_mdl,
columns=['x','y'],
index = emb_df.index
)
)
plt.clf()
fig = plt.figure(figsize=(6,4))
plt.scatter(
x = emb_df_PCA['x'],
y = emb_df_PCA['y'],
s = 0.8,
color = 'maroon',
alpha = 0.8
)
plt.xlabel('PCA-1')
plt.ylabel('PCA-2')
plt.title('PCA Visualization')
plt.plot()

5. Conclusion
From the above papers, we can see that Word2Vec has a wide range of applications in various scenarios, especially in conjunction with LDA. Recently, I opened a course on Bilibili titled Beginner’s Guide to Text Mining: Intensive Reading of LDA Model Papers and Practical Cases. If you are interested in LDA combined with Word2Vec, feel free to check it out.

-
Course Link: https://www.bilibili.com/cheese/play/ss11747