
Source: Deephub Imba
This article mainly introduces Word2Vec and Doc2Vec.
Doc2Vec is an unsupervised algorithm that learns embeddings from variable-length text segments (such as sentences, paragraphs, and documents). It first appeared in the paper Distributed Representations of Sentences and Documents.
Word2Vec
Let’s first review Word2Vec, as it inspired the Doc2Vec algorithm.
The continuous bag of words architecture of Word2Vec. Image from the paper Distributed Representations of Sentences and Documents.
Word2Vec learns word vectors by predicting a word in a sentence from the other words in its context. In this framework, each word is mapped to a unique vector, represented by a column in a matrix W. The concatenation or sum of the context word vectors is then used as features to predict the next word in the sentence.
Word vectors are trained using stochastic gradient descent. After training converges, words with similar meanings are mapped to nearby positions in the vector space.
The architecture presented is called continuous bag of words (CBOW) Word2Vec. There is also an architecture called Skip-gram Word2Vec, where word vectors are learned by predicting the context from a single word.
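As a concrete illustration, here is a minimal sketch of training both architectures with the gensim library (the toy corpus and all hyperparameter values are assumptions chosen for illustration; gensim 4.x parameter names are used):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# CBOW (sg=0): predict the current word from its surrounding context words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

# Skip-gram (sg=1): predict the surrounding context words from the current word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# After training, semantically related words should end up near each other.
print(cbow.wv.most_similar("cat", topn=3))
```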
Doc2Vec
The distributed memory model of Doc2Vec. Image from the paper Distributed Representations of Sentences and Documents.
We will now see how to learn embeddings for paragraphs, but the same method can be used to learn embeddings for entire documents.
In Doc2Vec, each paragraph in the training set is mapped to a unique vector, represented by a column in matrix D, while each word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in the context.
Paragraph vectors are shared across all contexts generated from the same paragraph but are not shared across paragraphs. The word vector matrix W is shared across paragraphs.
The paragraph token can be thought of as another word: it acts as a memory that remembers what is missing from the current context, i.e., the topic of the paragraph. This model is therefore called distributed memory (DM) Doc2Vec. There is also a second architecture called distributed bag of words (DBOW) Doc2Vec, inspired by Skip-gram Word2Vec, in which the paragraph vector alone is used to predict words sampled from the paragraph.
Paragraph vectors and word vectors are trained using stochastic gradient descent.
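To make this concrete, here is a minimal sketch of training the two Doc2Vec architectures with gensim (the toy documents, tags, and hyperparameters are illustrative assumptions; dm=1 selects distributed memory, dm=0 selects distributed bag of words):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets a unique tag, which plays the role of
# a paragraph vector (a column of matrix D in the paper's notation).
docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc_0"]),
    TaggedDocument(words=["the", "dog", "sat", "on", "the", "rug"], tags=["doc_1"]),
    TaggedDocument(words=["cats", "and", "dogs", "are", "animals"], tags=["doc_2"]),
]

# Distributed memory (DM): paragraph vector + context word vectors predict the next word.
dm_model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=100)

# Distributed bag of words (DBOW): the paragraph vector alone predicts words from the paragraph.
dbow_model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=0, epochs=100)

# The learned paragraph vector for a training document:
print(dm_model.dv["doc_0"])
```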
At inference time, the vector for a new paragraph is obtained by gradient descent, while the word vectors and the remaining parameters of the trained model are kept fixed.
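In gensim, this inference step is exposed as infer_vector, which runs a few gradient-descent passes for the new text while the trained word vectors and output weights stay frozen (a sketch continuing the hypothetical dm_model from the previous example):

```python
# Infer a vector for an unseen paragraph; only the new paragraph vector is updated,
# the rest of the trained model's parameters remain fixed.
new_doc = ["a", "cat", "and", "a", "dog", "on", "a", "mat"]
new_vector = dm_model.infer_vector(new_doc, epochs=50)
print(new_vector.shape)  # (50,) with the settings above
```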