Recently, I have started using the Transformer for some tasks, so I am recording the related knowledge points here to build a complete knowledge structure around it.
The following is the list of articles planned for this series; this is the sixteenth article:
- Transformer: The Culmination of Attention
- GPT-1 & 2: Miracles Brought by Pre-training + Fine-tuning
- Bert: Bidirectional Pre-training + Fine-tuning
- Bert and Model Compression
- Bert and Model Distillation: PKD and DistilBert
- ALBert: Lightweight Bert
- TinyBert: Comprehensive Application of Model Distillation
- MobileBert: Only 40ms on Pixel4
- More to come
- Bert and AutoML (To Be Continued)
- Bert Variants
- Roberta: Fine-tuning Bert
- Electra: Discrimination or Generation, It's a Choice
- Bart: Seq2Seq Pre-training Model (This Article)
- Transformer Optimization with Adaptive Width Attention
- Transformer Optimization with Sparse Attention
- Reformer: Efficiency from Locality-Sensitive Hashing and Reversible Residuals
- Longformer: A Mix of Local and Global Attention
- Linformer: Linear-Complexity Attention
- XLM: Cross-lingual Bert
- T5 (To Be Continued)
- More to come
- GPT-3
- More to come
Overall
The Transformer was originally designed for machine translation, so the model is split into two parts: an Encoder and a Decoder. Bert uses only the Encoder and GPT uses only the Decoder, and both have produced excellent pre-trained models.
The Bart model discussed in this article goes back to the roots and brings back the full Encoder-Decoder structure, i.e., the seq2seq structure.
Bart Pre-training Model
Comparing Bert, GPT, and Bart: Bart feeds a corrupted sentence into the Encoder and trains the Decoder to restore the original sentence. The advantage of this setup is that, compared with Bert's single corruption method (masking), Bart's corruption scheme is far more flexible. A minimal sketch of the objective is given below.
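The following is a minimal sketch of the "corrupt, then reconstruct" objective, assuming the Hugging Face transformers library and its published "facebook/bart-base" checkpoint; the example sentence and the hand-made corruption are purely illustrative, not the paper's actual pre-training pipeline.

```python
# Sketch: the encoder sees a corrupted sentence, the decoder is trained
# to reproduce the original sentence (assumes transformers is installed).
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

original = "The quick brown fox jumps over the lazy dog."
corrupted = "The quick <mask> over the lazy dog."  # a span replaced by one <mask>

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# Passing labels makes the model compute the reconstruction (cross-entropy) loss.
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)

# At inference time, the same seq2seq model can restore the text by generation.
generated = model.generate(inputs.input_ids, max_length=20, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```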
Bart’s Data Processing Method
In Bert, 15% of the tokens are sampled for prediction; of those, 80% are replaced with the special token [MASK], 10% with a random token, and 10% are left unchanged (the 8:1:1 split). I won't go into the details here; interested readers can refer to the earlier Bert article. A sketch of the procedure is shown below.
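Here is a minimal plain-Python sketch of that masking scheme, written only for illustration; the function name, the stand-in vocabulary, and the token-level loop are assumptions for this example, not Bert's actual implementation.

```python
import random

def bert_mask(tokens, mask_token="[MASK]", vocab=None, mask_prob=0.15):
    """Select ~15% of tokens; of those, 80% -> [MASK], 10% -> random token,
    10% left unchanged. Returns the corrupted tokens and the prediction targets."""
    vocab = vocab or tokens  # stand-in vocabulary, just for the sketch
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                    # model must predict the original
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_token)       # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)              # 10%: keep the original token
        else:
            targets.append(None)                   # not a prediction target
            corrupted.append(tok)
    return corrupted, targets
```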
In Bart, however, the corruption is much more varied: token masking similar to Bert, token deletion, shuffling the order of sentences (sentence permutation), and text infilling, in which a span of several words is replaced by a single [MASK].
Among these, token deletion and text infilling change the sequence length, which Bert's masking cannot do; a sketch of these transforms follows.
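Below is a minimal plain-Python sketch of three of these corruption transforms. The function names are made up for this example, and the fixed span length is an illustrative assumption (the paper samples span lengths from a Poisson distribution).

```python
import random

def token_deletion(tokens, p=0.15):
    """Randomly drop tokens; the model must also infer where content is missing."""
    return [t for t in tokens if random.random() > p]

def text_infilling(tokens, span_len=3, mask_token="[MASK]"):
    """Replace a contiguous span of several tokens with a single [MASK],
    so the corrupted sequence becomes shorter than the original."""
    if len(tokens) <= span_len:
        return [mask_token]
    start = random.randrange(len(tokens) - span_len)
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

def sentence_permutation(sentences):
    """Shuffle the order of the sentences in a document."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled
```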
Learning to restore the original text from such corruptions gives Bart a significant advantage on generation tasks, as the experiments show; in addition, Bart achieves results comparable to Roberta on discrimination tasks.
Experiments
On the GLUE benchmark, Bart achieves results comparable to Roberta. On the summarization datasets and on abstractive QA, Bart performs strongly, showing that its approach is very effective for generation tasks.
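As an illustration of the generation side, here is a hedged usage example that loads a Bart checkpoint fine-tuned for summarization ("facebook/bart-large-cnn", trained on CNN/DailyMail) through the transformers pipeline API; the sample article text is made up.

```python
from transformers import pipeline

# Load a Bart model fine-tuned for summarization.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Bart is pre-trained by corrupting text with a noising function and "
    "learning to reconstruct the original text. It is particularly effective "
    "when fine-tuned for generation tasks such as summarization."
)
print(summarizer(article, max_length=40, min_length=10, do_sample=False))
```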
References
- [1] Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." arXiv preprint arXiv:1910.13461 (2019).