XLNet Pre-training Model: Everything You Need to Know

Author | mantch
Reprinted from WeChat Official Account | AI Technology Review

1. What is XLNet

XLNet is a model similar to BERT, rather than a completely different model. In short, XLNet is a general autoregressive pre-training method. It was released by the CMU and Google Brain teams in June 2019, and ultimately, XLNet outperformed BERT on 20 tasks, achieving state-of-the-art results on 18 tasks, including machine question answering, natural language inference, sentiment analysis, and document ranking.
The authors state that pre-training models based on denoising autoencoders, like BERT, can effectively model bidirectional contextual information and outperform pre-training methods based on autoregressive language models. However, because part of the input must be masked, BERT ignores the dependencies between masked positions and suffers from a discrepancy between pre-training and fine-tuning.
Based on these pros and cons, the study proposes a generalized autoregressive pre-training model, XLNet. XLNet can:
  1. Learn bidirectional contextual information by maximizing the expected log-likelihood over all possible factorization orders (a sketch of this objective follows the list below);

  2. Overcome the shortcomings of BERT using the characteristics of autoregression;

  3. Additionally, XLNet integrates ideas from the current optimal autoregressive model, Transformer-XL.
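
As a sketch of point 1 above (following the objective given in the XLNet paper), let $\mathcal{Z}_T$ be the set of all permutations of the index sequence $[1, \ldots, T]$, and let $z_t$ and $\mathbf{z}_{<t}$ denote the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z}$; the model then maximizes the expected log-likelihood over sampled factorization orders:

$$\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \Big[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \Big]$$

Because the parameters $\theta$ are shared across all factorization orders, each position in expectation learns to use context from every other position, which is how bidirectional information enters an otherwise autoregressive objective.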

2. Autoregressive Language Models

Before ELMO/BERT, the language models people usually talked about predicted the next word from the preceding context, the classic left-to-right language modeling task, or, conversely, predicted a word from the following context. This type of LM is called an autoregressive language model. GPT is a typical autoregressive language model. Although ELMO appears to use both preceding and following contexts, it is essentially still an autoregressive LM; this comes down to how the model is implemented. ELMO trains two directions (a left-to-right and a right-to-left language model), but each direction is a separate autoregressive LM, and the hidden states of the two LSTMs are then concatenated to reflect a bidirectional language model. So it is essentially a concatenation of two autoregressive language models.
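Formally, a left-to-right autoregressive LM factorizes the likelihood of a sequence $x = (x_1, \ldots, x_T)$ into a product of conditionals, and the right-to-left variant simply conditions on the following context instead:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) \qquad \text{or} \qquad p(x) = \prod_{t=1}^{T} p(x_t \mid x_{>t})$$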
Autoregressive language models have their pros and cons:
Cons include only being able to utilize information from either preceding or following context, but not both simultaneously. Although ELMO seems to address this issue by combining both directions, the fusion method is too simplistic, resulting in suboptimal performance.
Pros are actually related to downstream NLP tasks, especially generative tasks such as text summarization and machine translation. In actual generation, content is produced left to right, so autoregressive language models naturally fit this process. In contrast, BERT’s DAE mode faces an inconsistency between how it is trained and how it would be used for generation, which is why it has performed poorly on generative NLP tasks so far.

3. Autoencoding Language Models

Autoregressive language models can only predict the next word based on preceding context or the preceding word based on following context. In contrast, BERT randomly masks a portion of words in the input X, and one of the main tasks during the pre-training process is to predict these masked words based on the surrounding context. If you are familiar with Denoising Autoencoders, you will see that this is indeed a typical DAE approach. The masked words are the so-called noise introduced on the input side. Pre-training methods similar to BERT are referred to as DAE LMs.
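Roughly speaking, BERT reconstructs the masked tokens from the corrupted input $\hat{x}$, where $m_t = 1$ marks that $x_t$ was masked; note that each masked token is predicted independently given $\hat{x}$, which is the independence assumption discussed later:

$$\max_{\theta} \; \sum_{t=1}^{T} m_t \, \log p_{\theta}\big(x_t \mid \hat{x}\big)$$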
The advantages and disadvantages of DAE LMs are essentially the reverse of those of autoregressive LMs. They can naturally incorporate bidirectional language models, seeing both the preceding and following context of the predicted word, which is beneficial. The downside is mainly the introduction of the [Mask] token on the input side, leading to inconsistencies between the pre-training and fine-tuning phases, as the fine-tuning phase does not see the [Mask] token. DAE requires introducing noise, and the [Mask] token serves as a means of introducing noise, which is normal.
XLNet’s starting point is: can we integrate the advantages of autoregressive LMs and DAE LMs? That is, from the perspective of autoregressive LMs, how to introduce effects equivalent to bidirectional language models; from the perspective of DAE LMs, how to incorporate bidirectional language models while discarding the superficial [Mask] token, ensuring consistency between pre-training and fine-tuning. Of course, XLNet also addresses the issue of mutual independence of masked words in BERT.

4. The XLNet Model

4.1 Permutation Language Modeling

BERT’s autoencoding language model also has corresponding shortcomings, which the XLNet paper points out:
  1. In the first phase (pre-training), the [Mask] token is introduced to hide certain words, but this artificially added token never appears in the fine-tuning phase, creating an inconsistency in usage between the two phases that may cause a performance loss;

  2. Another issue is that in the first (pre-training) phase, if multiple words in a sentence are masked, these masked words are assumed to be conditionally independent of one another when they are predicted, whereas in reality there are sometimes dependencies between them.

The above two points are the problems XLNet aims to solve in the first pre-training phase compared to BERT.
The thought process is relatively straightforward: XLNet still follows a two-phase process, where the first phase is language-model pre-training and the second phase is fine-tuning on task data. What it mainly wants to change is the first phase: instead of BERT’s masked denoising-autoencoder mode, it adopts the autoregressive LM approach. That is, the input sentence X still looks like a left-to-right input, and the word Ti is predicted from its preceding context. However, XLNet also wants the Context_before of Ti to contain not only the words that actually precede it but also words from the following Context_after, so that the mask symbol introduced in BERT’s pre-training phase is no longer needed. The pre-training phase then looks like a standard left-to-right process, and the fine-tuning phase is the same process, unifying the two stages. That is the goal; the remaining question is how to achieve it.
[Figure: the permutation language modeling idea: the words of sentence X are rearranged into different factorization orders, and orders are randomly sampled as training examples]
First, it is important to emphasize: although the discussion above talks about rearranging the words of sentence X into different orders and randomly sampling one of these factorization orders for each training example, in practice you cannot actually feed the model reordered text, because during fine-tuning you cannot rearrange the original input. Therefore, the input during pre-training must still appear in the original order x1, x2, x3, x4, and the reordering has to be simulated inside the Transformer to achieve the desired effect.
Specifically, XLNet adopts an attention-mask mechanism. Think of it this way: for the current input sentence X, suppose the word to be predicted, Ti, is the i-th word. On the input side, the first i-1 words remain exactly where they are. Inside the Transformer, however, the attention mask randomly selects i-1 words from Ti’s preceding and following context in X and, in effect, places them in Ti’s preceding-context positions, while masking out the inputs of the remaining words. This achieves the expected goal (of course, "placing the selected words in the preceding context" is only a figurative description; internally it is realized by masking out the unselected words so that they cannot influence the prediction of Ti). The effect looks as if the selected words had been moved into the Context_before positions.
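To make this concrete, here is a minimal, hypothetical sketch (not taken from the actual XLNet code) of how a sampled factorization order can be turned into an attention mask over the original, unshuffled input, so that each position only attends to the tokens that precede it in the sampled order:

```python
import numpy as np

def permutation_attention_mask(seq_len: int, rng: np.random.Generator) -> np.ndarray:
    """Build a [seq_len, seq_len] attention mask from a randomly sampled factorization order.

    mask[i, j] == 1.0 means position i is allowed to attend to position j.
    The input itself stays in its original order (x1, x2, ..., xT);
    only the mask encodes the sampled permutation.
    """
    order = rng.permutation(seq_len)          # a sampled factorization order z
    rank = np.empty(seq_len, dtype=np.int64)  # rank[i] = where token i appears in z
    rank[order] = np.arange(seq_len)

    # Token i may "see" token j only if j comes earlier than i in the sampled order.
    mask = (rank[None, :] < rank[:, None]).astype(np.float32)
    return mask

rng = np.random.default_rng(0)
print(permutation_attention_mask(4, rng))  # each row shows which tokens that position may attend to
```

In XLNet’s dual-stream formulation the query stream would additionally exclude position i itself while the content stream includes it; the point of the sketch is only that the randomness lives in the mask, not in the input order.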
In practical implementation, XLNet is realized using a “dual-stream self-attention model”. The details can be referenced in the paper, but the basic idea is as described above. The dual-stream self-attention mechanism is just a specific way to realize this idea; theoretically, you could come up with other specific implementation methods to achieve this basic idea and still reach the goal of allowing Ti to see following words.
Here, let me briefly describe the “dual-stream self-attention mechanism”. One stream is the content stream self-attention, which is essentially the standard Transformer computation; the other is the query stream self-attention, whose role is to replace BERT’s [Mask] token, since XLNet wants to discard it. For example, given the preceding words x1 and x2, when predicting x3, the top Transformer layer at the position of x3 should predict this word, but the input side must not see x3 itself. BERT handles this by covering x3 with the [Mask] token, effectively using [Mask] as a placeholder. XLNet, on the other hand, drops the explicit [Mask] token: since the content of x3 cannot be seen, the query stream simply ignores the input of x3 and keeps only its positional information, using a learnable parameter w in place of the word content. So essentially, XLNet only discards the surface-level [Mask] placeholder and instead introduces the query stream to ignore the word being predicted; compared to BERT, it is just a different implementation of the same idea.
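As a very rough, hypothetical illustration of the two streams (single head, absolute positions, no training loop; none of these names come from the XLNet codebase): the content stream at position i can use the token's own embedding, while the query stream replaces that embedding with a shared learnable vector w and keeps only the position, so a prediction can be made at position i without the input ever revealing x_i:

```python
import numpy as np

def attend(query, keys, values, visible):
    """Single-head attention where `visible` (bool array) marks positions that may be attended to."""
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = np.where(visible, scores, -1e9)   # hide invisible positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

T, d = 4, 8
rng = np.random.default_rng(0)
emb = rng.standard_normal((T, d))   # token (content) embeddings for x1..x4
pos = rng.standard_normal((T, d))   # positional embeddings
w = rng.standard_normal(d)          # shared learnable vector standing in for the [Mask] token

i = 2                                # position of the word being predicted (x3)
context = np.array([True, True, False, False])  # context visible to x3 under the sampled order

# Content stream: may also use the content of x3 itself (needed by later positions).
h_i = attend(emb[i] + pos[i], emb + pos, emb + pos, context | (np.arange(T) == i))

# Query stream: knows only *where* x3 is (w + pos[i]), never *what* it is.
g_i = attend(w + pos[i], emb + pos, emb + pos, context)
```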
[Figure: XLNet's dual-stream (content stream and query stream) self-attention mechanism]
The Permutation Language Model discussed above is XLNet’s main theoretical innovation, which is why it is introduced in such detail. From a modeling perspective, this innovation is quite interesting: it opens up a way for autoregressive language models to incorporate the following context as well, and it will likely inspire subsequent work. Of course, XLNet does not stop there; it also incorporates other effective techniques, making it feel like a synthesis of BERT, GPT 2.0, and Transformer-XL:
  1. First, it absorbs BERT’s bidirectional language model through the PLM (Permutation Language Model) pre-training objective;

  2. Then, the core of GPT 2.0 is really about more and higher-quality pre-training data, a lesson that XLNet clearly absorbs;

  3. Furthermore, the main ideas of Transformer-XL are also absorbed, in order to address the Transformer's weakness on long-document NLP tasks.

4.2 Transformer XL

Currently, there are two most advanced architectures for handling language modeling problems in NLP: RNN and Transformer. RNN learns the relationships between input words or characters sequentially, while Transformer accepts an entire segment and uses the self-attention mechanism to learn their dependencies. Both architectures have achieved remarkable success, but they are limited in capturing long-term dependencies.
To address this issue, CMU and Google Brain published a paper titled “Transformer-XL: Attentive Language Models beyond a Fixed-Length Context” in January 2019, combining the advantages of RNN sequence modeling and the Transformer's self-attention mechanism. It applies the Transformer's attention modules to each segment of the input and uses a recurrence mechanism to learn dependencies between consecutive segments.
4.2.1 Vanilla Transformer
Why mention this model? Because Transformer-XL is an improvement based on this model.
Al-Rfou et al. proposed a method for training character-level language models with a Transformer, predicting the next character in a segment from the previous characters. For example, it uses $x_1, x_2, \ldots, x_{n-1}$ to predict character $x_n$, while masking out the sequence after $x_n$. The paper uses a 64-layer model limited to relatively short inputs of 512 characters, so the input is split into segments and each segment is learned from separately, as shown in the figure below. During the testing phase, to process longer inputs, the model shifts the input one character to the right at each step, producing one prediction per step.
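As a hypothetical sketch of that evaluation scheme (the `model.predict_next` call is a placeholder, not a real library API): the entire fixed-length context is re-encoded from scratch for every single predicted character, which is exactly the slow-inference problem listed below:

```python
def sliding_window_eval(model, text: str, segment_len: int = 512):
    """Vanilla-Transformer-style evaluation: shift the window one character per step.

    `model.predict_next(context)` is assumed to return a prediction for the character
    following `context`; the name is a placeholder for illustration only.
    """
    predictions = []
    for i in range(segment_len, len(text)):
        context = text[i - segment_len:i]  # rebuild the full context from scratch every step
        predictions.append(model.predict_next(context))
    return predictions
```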
[Figure: the vanilla Transformer character-level LM: training on separate fixed-length segments and evaluation with a one-character sliding window]
This model performs better than RNN models on commonly used datasets like enwik8 and text8, but it still has the following shortcomings:
  • Limited Context Length: The maximum dependency distance between characters is limited by the input length, meaning the model cannot see words that appeared several sentences earlier.

  • Fragmented Context: For texts longer than 512 characters, each segment is trained separately from scratch. The segments share no contextual information, which makes training less efficient and hurts model performance.

  • Slow Inference Speed: During the testing phase, for each prediction of the next character, the entire context has to be rebuilt and re-encoded from scratch, resulting in very slow inference.

4.2.2 Transformer XL

The Transformer-XL architecture introduces two innovations based on the vanilla Transformer: a recurrence mechanism and relative positional encoding, to overcome the shortcomings of the vanilla Transformer. Compared to the vanilla Transformer, another advantage of Transformer-XL is that it can be used for both word-level and character-level language modeling.
1. Introduction of Recurrence Mechanism
Similar to the basic idea of the vanilla Transformer, Transformer-XL still models using a segmented approach, but the essential difference from the vanilla Transformer is the introduction of a recurrence mechanism between segments, allowing the current segment to utilize information from previous segments to achieve long-term dependencies. As shown in the figure below:
[Figure: Transformer-XL's segment-level recurrence: the current segment attends to cached hidden states of the previous segment]
During the training phase, when processing subsequent segments, each hidden layer receives two inputs:
  • The output from the preceding hidden layer of the current segment, similar to the vanilla Transformer (the gray line in the figure above).

  • The output from the preceding segment’s hidden layer (the green line in the figure above), which allows the model to create long-term dependencies.

These two inputs are concatenated and used to compute the current segment’s Key and Value matrices.
This method allows for utilizing information from more preceding segments, enabling longer dependency during testing. In the testing phase, compared to the vanilla Transformer, its speed is also faster. In the vanilla Transformer, progress is made one step at a time, requiring reconstruction of segments and calculations from scratch; whereas in Transformer-XL, an entire segment can be advanced at once, using data from previous segments to predict the output of the current segment.
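A minimal sketch of the recurrence idea under assumed shapes and names (not the actual Transformer-XL code; relative positions, multiple heads, and the causal mask inside the segment are all omitted): the cached hidden states of the previous segment are concatenated in front of the current segment before computing keys and values, while queries come only from the current segment:

```python
import numpy as np

def recurrent_attention_layer(h_current, memory, Wq, Wk, Wv):
    """One attention layer with segment-level recurrence (single head, simplified).

    h_current: [cur_len, d]  hidden states of the current segment at this layer
    memory:    [mem_len, d]  cached (gradient-free) hidden states of the previous segment
    """
    h_cat = np.concatenate([memory, h_current], axis=0)  # previous + current segment

    q = h_current @ Wq          # queries: current segment only
    k = h_cat @ Wk              # keys and values: concatenation of both segments
    v = h_cat @ Wv

    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v          # [cur_len, d]: each position can see the previous segment

d, mem_len, cur_len = 8, 4, 4
rng = np.random.default_rng(0)
out = recurrent_attention_layer(rng.standard_normal((cur_len, d)),
                                rng.standard_normal((mem_len, d)),
                                *(rng.standard_normal((d, d)) for _ in range(3)))
```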
2. Relative Positional Encoding
In the Transformer, an important aspect is considering the positional information of the sequence. In the case of segmentation, if we directly use the position encoding from the Transformer for each segment, meaning each different segment uses the same position encoding for the same position, problems arise. For example, the first position of segment i-2 and segment i-1 will have the same position encoding, but their importance for modeling segment i is evidently different (for instance, the first position in segment i-2 may be less important). Therefore, it is necessary to distinguish between these positions.
The paper proposes a new way of encoding position based on the relative distance between words rather than the absolute position used in the Transformer. From another perspective, the computation of attention can be divided into the following four parts (sketched in the equation after this list):
  • Content-based “addressing”, which is the original score without adding the original position encoding.

  • Content-dependent positional bias, which captures a positional bias that depends on the content of the current query and the relative distance to each key.

  • Global content bias, which measures the importance of the key.

  • Global positional bias, which adjusts importance based on the distance between query and key.
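
As a reference sketch following the Transformer-XL paper, the attention score between query position $i$ and key position $j$ decomposes into exactly these four terms, where $E_{x}$ are content embeddings, $R_{i-j}$ is a relative positional encoding, and $u$, $v$ are learned global vectors:

$$A^{\mathrm{rel}}_{i,j} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)\ \text{content addressing}} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)\ \text{content-dependent positional bias}} + \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)\ \text{global content bias}} + \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)\ \text{global positional bias}}$$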

Detailed formulas can be found in: Interpretation of Transformer-XL (Paper + PyTorch Source Code, https://blog.csdn.net/magical_bubble/article/details/89060213)

5. Comparison of XLNet and BERT

Although XLNet’s introduction of the Permutation Language Model as a new pre-training objective may seem very different from BERT’s use of mask tokens, if you think about it carefully, you will find that the two are essentially similar.

The main differences are:

  • BERT directly introduces mask tokens on the input side, hiding a portion of the words to prevent them from contributing during prediction, requiring the use of other words in the context to predict a masked word;

  • In contrast, XLNet discards the mask token on the input side and instead uses an attention-mask mechanism to randomly mask out a portion of the words inside the Transformer (the proportion of masked context words depends on where the word being predicted falls in the sampled factorization order: the earlier it appears, the more of its context is masked, and vice versa), preventing these masked words from influencing the prediction of a specific word.

Therefore, essentially there is not much difference between the two; it is just a matter of where the masking happens. BERT does it visibly on the input side, while XLNet hides the process inside the Transformer. This allows XLNet to discard the explicit [Mask] token and thereby resolve the inconsistency between pre-training and fine-tuning caused by its presence. As for XLNet’s criticism that BERT treats masked words as mutually independent (i.e., when predicting one masked word, the other masked words contribute nothing), on closer inspection this issue is not that significant, because XLNet’s internal attention masking also masks out a certain proportion of context words; as long as some context words are masked, it essentially faces the same problem. Moreover, if the training data is large enough, other training examples will cover these word combinations, so the dependencies between masked words can still be learned elsewhere.

Of course, XLNet’s transformation maintains the superficial left-to-right pattern of autoregressive language models, which BERT cannot achieve. This has clear advantages for generative tasks, allowing the model to implicitly contain contextual information while maintaining a superficial left-to-right generation process. Thus, it appears that XLNet should have a significant advantage over BERT for generative NLP tasks. Additionally, because XLNet incorporates the Transformer XL mechanism, it should also have a significant advantage for long document input types in NLP tasks compared to BERT.

6. Code Implementation

Chinese XLNet Pre-training Model:
https://github.com/ymcui/Chinese-PreTrained-XLNet
Machine Learning Simplified Series Articles:
https://github.com/NLP-LOVE/ML-NLP
References:
Interpretation of XLNet Principles (https://blog.csdn.net/weixin_37947156/article/details/93035607)

XLNet: Operating Mechanism and Comparison with BERT (https://zhuanlan.zhihu.com/p/70257427)

Interpretation of Transformer-XL (Paper + PyTorch Source Code, https://blog.csdn.net/magical_bubble/article/details/89060213)

Cover image source: https://www.maxpixel.net/photo-3704026
