Editor: Yi Zhen
https://www.zhihu.com/question/329708785
This article is for academic exchange and sharing only; if there is infringement, the article will be deleted.
The editor found an interesting question on Zhihu: what should you do when a word is missing from Word2Vec's vocabulary? Below are some answers shared by experts, which we hope will help your research.
High-Quality Answers from Zhihu:
Author: Howard
https://www.zhihu.com/question/329708785/answer/739525740
1. UNK Technique
Before training Word2Vec, reserve a special UNK (unknown) token in the vocabulary and replace low-frequency words in the training corpus with it. The model then learns a vector for UNK, and any out-of-vocabulary word encountered later can be mapped to that vector.
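A minimal sketch of this preprocessing step (the token name `<UNK>` and threshold `min_count` are illustrative choices): words rarer than the threshold are replaced by a reserved token before training, so the model ends up with a vector for unknown words.

```python
from collections import Counter

def replace_rare_with_unk(sentences, min_count=2, unk_token="<UNK>"):
    """Replace words rarer than min_count with a reserved UNK token,
    so the trained model contains a vector usable for OOV words."""
    counts = Counter(w for s in sentences for w in s)
    return [[w if counts[w] >= min_count else unk_token for w in s]
            for s in sentences]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "rare", "word"]]
processed = replace_rare_with_unk(corpus, min_count=2)
# words appearing only once ("cat", "dog", "a", "rare", "word") become <UNK>
```

At inference time, any word absent from the vocabulary is looked up as `<UNK>` instead.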
2. Subword Technique
This technique comes from FastText. In short, an OOV (out-of-vocabulary) word is split into subwords, and each subword is looked up in the vocabulary; subwords that match are kept, and the rest are split further, down to the character level if necessary, where corresponding character vectors can always be found.
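A sketch of the FastText-style idea: decompose a word into character n-grams (with `<` and `>` boundary markers, as FastText does) and average the vectors of whichever n-grams are known. The `ngram_vectors` dict here is a hypothetical stand-in for a trained subword-embedding table.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim=100):
    """Average the vectors of the word's known n-grams.
    `ngram_vectors` maps n-gram strings to numpy arrays (assumed trained)."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return np.zeros(dim)
    return np.mean([ngram_vectors[g] for g in grams], axis=0)
```

Because every n-gram bottoms out in short character sequences, an unseen word almost always shares some subwords with the training vocabulary.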
3. BPE Technique
BPE (Byte Pair Encoding), also known as digram coding, first splits every word in the corpus into individual characters. The most frequent pair of adjacent symbols is then repeatedly merged and added to the vocabulary until the target vocabulary size is reached. The same learned merges are applied to segment test sentences. The advantage of BPE segmentation is that it balances vocabulary size against the number of tokens needed to encode a sentence. The downside is that it cannot provide probabilities over multiple alternative segmentations.
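The merge-learning loop described above can be sketched as follows (a toy implementation for illustration; `words` is an assumed word-frequency dict, and real tokenizers add word-boundary handling on top of this):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: start from characters and repeatedly merge
    the most frequent adjacent symbol pair. `words` maps word -> count."""
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # re-segment every word with the new merge applied
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
# the frequent pair ("l", "o") is merged first, then ("lo", "w")
```

Applying the learned merges in order to a new word reproduces the same segmentation at test time.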
Additionally, there are many other techniques, such as incremental learning for Word2Vec, which will not be elaborated on here.
Author: Hu Guangyue
https://www.zhihu.com/question/329708785/answer/739525740
You can preprocess the vocabulary before use, for example by adding unrecognized words to it or by learning new words from the corpus.
Author: Ma Dongshen
https://www.zhihu.com/question/329708785/answer/739525740
You might have forgotten to apply the same stopword processing to the new corpus as to the old one. As I recall, Gensim replaces out-of-vocabulary words with zero or random vectors.
Author: Scottish Fold Cat
https://www.zhihu.com/question/329708785/answer/739525740
You can try performing incremental learning on the Word2Vec model.
Recommended Reading:
Interesting and Accessible Free Books on AI and Machine Learning
From Word2Vec to Bert: Discussing the Past and Present of Word Vectors (Part 1)
Tsinghua Yao Class Graduate, 95-Postdoc Chen Lijie Wins Best Student Paper at Theoretical Computer Science Top Conference