Prompt-Based Contrastive Learning for Sentence Representation

This article proposes using prompts to capture sentence representations.

  • Although language models like BERT have achieved significant results, they still perform poorly on sentence embeddings because of embedding bias and anisotropy;

  • We found that using prompts with different templates can generate positive pairs from different aspects and avoid embedding bias.

Related Work

Contrastive learning can help BERT learn better sentence representations; the key question is how to construct positive and negative samples. For example, positive pairs can be built by passing the same sentence through the encoder twice with dropout (as in SimCSE).

Existing research shows that BERT’s sentence vectors exhibit a collapse phenomenon: dominated by high-frequency words, they collapse into a narrow convex cone and are therefore anisotropic. This distorts sentence-similarity measurements and is known as the anisotropy problem.

Findings

(1) Original BERT layers fail to improve performance.

Comparing two different sentence embedding methods:

  • Average BERT’s input (static token) embeddings;

  • Average the outputs (last layer) of BERT.
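
As a rough sketch of these two baselines (the model choice, mean pooling over non-padding positions, and use of `output_hidden_states` are assumptions, not the paper's code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed setup: bert-base-uncased, mean pooling over non-padding positions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def baseline_embeddings(sentence: str):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    # (a) average of the embedding-layer output (one reading of "input embeddings")
    avg_input = (outputs.hidden_states[0] * mask).sum(1) / mask.sum(1)
    # (b) average of the last-layer hidden states
    avg_last = (outputs.hidden_states[-1] * mask).sum(1) / mask.sum(1)
    return avg_input.squeeze(0), avg_last.squeeze(0)
```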

To evaluate the two sentence-embedding strategies, a sentence-level anisotropy metric is used: the cosine similarity between sentence embeddings in the corpus is computed pairwise and averaged,

$$\text{anisotropy} = \frac{1}{n^2 - n} \sum_{i=1}^{n} \sum_{j \ne i} \cos(\mathbf{s}_i, \mathbf{s}_j),$$

where $\mathbf{s}_i$ is the embedding of the $i$-th sentence and $n$ is the number of sentences.
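
A minimal sketch of this metric, assuming `embeddings` is an `(n, d)` tensor of sentence vectors (e.g., stacked outputs of the hypothetical `baseline_embeddings` above):

```python
import torch
import torch.nn.functional as F

def sentence_anisotropy(embeddings: torch.Tensor) -> float:
    """Average cosine similarity over all ordered pairs of distinct sentences."""
    normed = F.normalize(embeddings, dim=-1)   # unit-length vectors
    sims = normed @ normed.T                   # (n, n) cosine-similarity matrix
    n = sims.size(0)
    off_diag = sims.sum() - sims.diagonal().sum()
    return (off_diag / (n * n - n)).item()
```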

We compared different language models, and the preliminary experiments are shown below:

[Table: sentence-level anisotropy and STS Spearman correlation for different pre-trained language models and sentence-embedding methods]

  • From the table above, anisotropy does not appear to track the Spearman correlation closely. For bert-base-uncased, for example, averaged static token embeddings are highly anisotropic, yet their final Spearman score is similar to that of the far less anisotropic last-layer average.

(2) Embedding biases harm the performance of sentence embeddings.

Token embeddings are simultaneously influenced by token frequency and by subword segmentation (WordPiece).

[Figure: 2D visualization of static token embeddings for different language models, showing frequency and subword biases]

  • Token embeddings of different language models are highly affected by word frequency and subword segmentation;

  • Through 2D visualization, high-frequency words tend to cluster together while low-frequency words are dispersed.

Regarding frequency bias, high-frequency tokens are clustered while low-frequency tokens are sparsely dispersed in all models (Yan et al., 2021). In BERT, beginning-of-word tokens are more vulnerable to frequency bias than subword tokens, whereas in RoBERTa the subword tokens are the more vulnerable ones.
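
As a hedged illustration (not the paper's code), the subword side of this bias can be eyeballed by projecting BERT's static token embeddings to 2D with PCA and marking WordPiece subwords; frequency coloring would additionally require token counts from a corpus, which are omitted here:

```python
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer
import matplotlib.pyplot as plt

# Assumed setup: visualize bert-base-uncased static token embeddings in 2D.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

emb = model.get_input_embeddings().weight.detach().numpy()   # (vocab_size, hidden)
coords = PCA(n_components=2).fit_transform(emb)
tokens = tokenizer.convert_ids_to_tokens(list(range(emb.shape[0])))
is_subword = np.array([t.startswith("##") for t in tokens])  # WordPiece subwords

plt.scatter(coords[~is_subword, 0], coords[~is_subword, 1], s=1, label="begin-of-word")
plt.scatter(coords[is_subword, 0], coords[is_subword, 1], s=1, label="subword (##)")
plt.legend()
plt.title("BERT static token embeddings (PCA to 2D)")
plt.show()
```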

Method

To avoid the aforementioned issues when using BERT to represent sentences, this article proposes using prompts to capture sentence representations. Unlike previous applications of prompts (for classification or generation), the goal is not to obtain a label for a sentence but its vector. Prompt-based sentence embedding therefore raises two questions:

  • How to use prompts to represent a sentence;

  • How to find appropriate prompts;

This article proposes a sentence representation learning model based on prompts and contrastive learning.

How to Use Prompts to Represent a Sentence

This article designs a template, for example “[X] means [MASK]”, where [X] is a placeholder for the input sentence and [MASK] is the token to be predicted. The sentence is inserted into the template and the resulting prompt is fed into BERT. There are two ways to obtain the sentence embedding:

  • Method 1: Directly use the hidden state vector corresponding to [MASK];

  • Method 2: Use MLM to predict the top K words at the [MASK] position, and represent the sentence by weighted summation of the word embeddings of each predicted word based on their probabilities;

The two options can be written as: Method 1 takes $\mathbf{h} = \mathbf{h}_{\texttt{[MASK]}}$, while Method 2 takes $\mathbf{h} = \sum_{v \in \mathcal{V}_{\text{top-}k}} P(\texttt{[MASK]} = v \mid \mathbf{h}_{\texttt{[MASK]}}) \, \mathbf{W}_v$, where $\mathbf{W}_v$ denotes the static token embedding of word $v$.

Method 2 represents the sentence with several tokens generated by the MLM head, so it still inherits the embedding biases discussed above; this article therefore adopts only the first method.
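
A minimal sketch of Method 1, assuming bert-base-uncased and the example template “[X] means [MASK].” quoted above (both the template wording and the model are assumptions):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def prompt_embedding(sentence: str) -> torch.Tensor:
    """Method 1: the last-layer hidden state at [MASK] is the sentence embedding."""
    text = f"{sentence} means {tokenizer.mask_token}."   # the article's example template
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    return outputs.hidden_states[-1][0, mask_pos]
```

Calling `prompt_embedding("A man is playing guitar")` then returns a 768-dimensional vector for bert-base-uncased.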

How to Find Appropriate Prompts

Regarding prompt design, the following three methods can be adopted:

  • Manual design: explicitly design discrete templates;

  • Use the T5 model to generate templates;

  • OptiPrompt: convert discrete templates into continuous templates (a minimal sketch follows this list);
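
For the third option, a highly simplified sketch of a continuous (OptiPrompt-style) template: the template tokens are replaced by trainable vectors, initialized from the manual template and spliced into the input via `inputs_embeds` (the initialization word and input layout here are assumptions):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
word_emb = model.get_input_embeddings()

# Trainable continuous template, initialized from the manual word "means" (an assumption).
init_ids = tokenizer(" means", add_special_tokens=False, return_tensors="pt")["input_ids"][0]
soft_template = nn.Parameter(word_emb(init_ids).detach().clone())

def continuous_prompt_embedding(sentence: str) -> torch.Tensor:
    """Sketch: [CLS] sentence <soft template> [MASK] [SEP], fed through inputs_embeds."""
    sent_ids = tokenizer(sentence, add_special_tokens=False, return_tensors="pt")["input_ids"]
    cls_id = torch.tensor([[tokenizer.cls_token_id]])
    mask_id = torch.tensor([[tokenizer.mask_token_id]])
    sep_id = torch.tensor([[tokenizer.sep_token_id]])
    pieces = [word_emb(cls_id), word_emb(sent_ids),
              soft_template.unsqueeze(0),
              word_emb(mask_id), word_emb(sep_id)]
    inputs_embeds = torch.cat(pieces, dim=1)
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    mask_pos = 1 + sent_ids.size(1) + soft_template.size(0)   # index of [MASK]
    return out.hidden_states[-1][0, mask_pos]
```

During template search, one would typically optimize only `soft_template` (e.g. `torch.optim.Adam([soft_template])`) while keeping BERT's weights fixed or jointly fine-tuned.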

Training

The model is trained with contrastive learning, where the selection of positive pairs is crucial. One option is dropout; this article instead applies multiple different templates to the same sentence, thus obtaining multiple different positive embeddings.

The idea is to use different templates to represent the same sentence from different points of view, which helps the model produce more reasonable positive pairs.

To avoid semantic bias from the templates themselves, the author employs a trick:

  • Feed in the sentence wrapped in the template and take the embedding corresponding to [MASK];

  • Feed in only the template itself, while keeping the position IDs the template tokens had in the full input, and again take the embedding corresponding to [MASK]; the difference between the two [MASK] embeddings is used as the debiased sentence representation.
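
A hedged sketch of this trick (template wording, position-ID layout, and model are assumptions, reusing the same bert-base-uncased setup as above):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mask_hidden(input_ids, attention_mask, position_ids=None):
    """Last-layer hidden state at the [MASK] position."""
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    position_ids=position_ids, output_hidden_states=True)
    pos = (input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    return out.hidden_states[-1][0, pos]

def denoised_embedding(sentence: str) -> torch.Tensor:
    suffix = f" means {tokenizer.mask_token}."   # assumed template "[X] means [MASK]."
    # Pass 1: the sentence wrapped in the template.
    full = tokenizer(sentence + suffix, return_tensors="pt")
    h_full = mask_hidden(full["input_ids"], full["attention_mask"])

    # Pass 2: template tokens only, keeping the position IDs they had in the full input.
    sent_len = len(tokenizer(sentence, add_special_tokens=False)["input_ids"])
    tmpl = tokenizer(suffix, return_tensors="pt")
    n_tmpl = tmpl["input_ids"].size(1)
    position_ids = torch.cat([torch.tensor([0]),                              # [CLS] stays at 0
                              torch.arange(sent_len + 1, sent_len + n_tmpl)]).unsqueeze(0)
    h_tmpl = mask_hidden(tmpl["input_ids"], tmpl["attention_mask"], position_ids)

    return h_full - h_tmpl                       # debiased sentence embedding
```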

Finally, the debiased embeddings produced by different templates are fed into the contrastive learning loss for training:

$$\ell_i = -\log \frac{e^{\cos(\mathbf{h}_i, \mathbf{h}_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\cos(\mathbf{h}_i, \mathbf{h}_j^{+})/\tau}},$$

where $\mathbf{h}_i$ and $\mathbf{h}_i^{+}$ are the debiased [MASK] embeddings of sentence $i$ under two different templates, $\tau$ is a temperature, and $N$ is the batch size.
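
A minimal sketch of this objective as an in-batch InfoNCE loss (the temperature value and batch layout are assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_a: torch.Tensor, h_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """h_a[i] and h_b[i] are the (debiased) embeddings of sentence i under two templates;
    the other sentences in the batch serve as in-batch negatives."""
    h_a = F.normalize(h_a, dim=-1)
    h_b = F.normalize(h_b, dim=-1)
    sims = h_a @ h_b.T / temperature               # (N, N) cosine similarities
    labels = torch.arange(sims.size(0))            # positives lie on the diagonal
    return F.cross_entropy(sims, labels)
```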

Experiments

The author conducted tests on multiple semantic textual similarity (STS) tasks; the results are shown below:

[Tables: Spearman correlation of PromptBERT and baseline methods on the STS benchmarks]

Interestingly, PromptBERT sometimes outperforms SimCSE; the author suggests that the gains from contrastive learning here may essentially amount to SimCSE-style fine-tuning with prompt-derived positives.

Editor: Yu Tengkai