What To Do When Word2Vec Lacks Words?


Editor: Yi Zhen

https://www.zhihu.com/question/329708785

This article is for academic exchange and sharing only; if there is infringement, the article will be deleted.

The author found an interesting question on Zhihu: what to do when Word2Vec lacks words? Below are some insights shared by experts that I hope will help your research.

High-Quality Answers from Zhihu:

Author: Howard

https://www.zhihu.com/question/329708785/answer/739525740

1. UNK Technique

Before training Word2Vec, reserve a special symbol and replace all stopwords or low-frequency words in the training corpus with UNK. At lookup time, keep the vocabulary list and first replace any word not in the Word2Vec vocabulary with UNK.
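
As a rough sketch (the vocabulary and the `<unk>` token name here are illustrative, not tied to any particular library):

```python
# Minimal sketch of the UNK technique: reserve a special token when
# preprocessing the training corpus, then map any word missing from the
# trained vocabulary onto that token at lookup time.
UNK = "<unk>"  # illustrative reserved symbol

def replace_oov(tokens, vocab):
    """Replace tokens absent from `vocab` with the reserved UNK symbol."""
    return [tok if tok in vocab else UNK for tok in tokens]

vocab = {"the", "cat", "sat", UNK}  # assume UNK was included at training time
print(replace_oov(["the", "dog", "sat"], vocab))
# ['the', '<unk>', 'sat']
```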

2. Subword Technique

This technique comes from FastText. In short, an OOV (out-of-vocabulary) word is split into subwords; subwords that match entries in the vocabulary are kept, and those that do not are split further, down to the character level if necessary, where corresponding character vectors can always be found.
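
A toy sketch of this back-off idea (the lookup table is hypothetical; real FastText learns character n-gram vectors jointly during training rather than backing off at query time):

```python
# Back off from the whole word to shorter and shorter character n-grams,
# averaging the vectors of whatever fragments are found in the table.
def lookup(word, vectors):
    """Return a vector for `word`, falling back to character n-grams."""
    if word in vectors:
        return vectors[word]
    # try n-grams from longest to shortest, stopping at single characters
    for n in range(len(word) - 1, 0, -1):
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        hits = [vectors[g] for g in grams if g in vectors]
        if hits:
            dim = len(hits[0])
            return [sum(v[i] for v in hits) / len(hits) for i in range(dim)]
    raise KeyError(word)

vectors = {"cat": [1.0, 0.0], "c": [0.5, 0.5], "a": [0.0, 1.0], "t": [0.5, 0.0]}
print(lookup("cat", vectors))  # exact match: [1.0, 0.0]
print(lookup("cab", vectors))  # backs off to "c" and "a": [0.25, 0.75]
```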

3. BPE Technique

BPE (Byte Pair Encoding), also known as digram coding, first splits a sentence into individual characters. The most frequent pair of adjacent symbols is then merged into a new vocabulary entry, and this is repeated until the target vocabulary size is reached. The same subword segmentation is applied to test sentences. The advantage of BPE is that it balances vocabulary size against the number of tokens needed to encode a sentence; the downside is that it cannot assign probabilities to multiple alternative segmentations.
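
The merge loop can be sketched as follows (a minimal version on a toy word list; production BPE implementations also handle word boundaries and frequency-weighted corpora):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    corpus = [list(w) for w in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair across the corpus
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # apply the merge everywhere it occurs
        new_corpus = []
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

print(learn_bpe(["low", "low", "lower", "lowest"], 2))
# [('l', 'o'), ('lo', 'w')]
```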

Additionally, there are many other techniques, such as incremental learning for Word2Vec, which will not be elaborated on here.

Author: Hu Guangyue

https://www.zhihu.com/question/329708785/answer/739525740

You can preprocess the vocabulary before use, such as adding unrecognized words to the vocabulary or learning new words based on the corpus.

Author: Ma Dongshen

https://www.zhihu.com/question/329708785/answer/739525740

Yes, you might have forgotten to apply the same stopword processing to both the new and the old corpora. As I recall, Gensim replaces out-of-vocabulary words with zero or random vectors.
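
That fallback pattern is common when building an embedding matrix from pretrained vectors; a sketch (the pretrained table and fill strategies here are illustrative):

```python
import random

def build_matrix(words, pretrained, dim, fill="zero", seed=0):
    """Build an embedding matrix; OOV rows get zeros or small random values."""
    rng = random.Random(seed)
    matrix = []
    for w in words:
        if w in pretrained:
            matrix.append(pretrained[w])
        elif fill == "zero":
            matrix.append([0.0] * dim)
        else:  # small uniform noise, a common alternative to zeros
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix

pretrained = {"cat": [1.0, 2.0]}
print(build_matrix(["cat", "dog"], pretrained, dim=2))
# [[1.0, 2.0], [0.0, 0.0]]
```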

Author: Scottish Fold Cat

https://www.zhihu.com/question/329708785/answer/739525740

You can try performing incremental learning on the Word2Vec model.

