NLP Development Trends from Classic Models Like ULMFiT, Transformer, and BERT

Natural Language Processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence that focuses on the interaction between computers and human language and explores how to process and make use of natural language. NLP research can be traced back to the Turing test. It has evolved from rule-based methods to the statistical models and methods that dominate today, and has shifted from early machine learning approaches built on high-dimensional sparse features to deep learning methods that rely on low-dimensional dense vectors learned by neural networks.
Looking back over the achievements of countless predecessors during the past twenty years, three representative works stand out as milestones:
1) In 2003, Bengio proposed the Neural Network Language Model (NNLM), which unified feature representation in NLP around the embedding;
2) In 2013, Mikolov introduced Word2Vec, which extended NNLM and brought in the idea of large-scale pre-training (a minimal embedding-training sketch follows this list);
3) In 2017, Vaswani proposed the Transformer model, enabling a single architecture to handle many NLP tasks. By the end of 2018, a large number of pre-trained language models based on the Transformer had emerged, setting new records across a range of NLP tasks.
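To make the embedding idea concrete, here is a minimal sketch of training word vectors in the spirit of Word2Vec, using the gensim library (version 4 or later assumed) on a toy corpus. Treat it as an illustration, not a reproduction of the original setup.

```python
# A minimal sketch of training Word2Vec-style embeddings with gensim
# (assumes gensim >= 4.0; the corpus here is a toy stand-in).
from gensim.models import Word2Vec

sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["transformers", "changed", "natural", "language", "processing"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=20)

# Each word is now a dense, low-dimensional vector (an "embedding").
vec = model.wv["language"]                        # 50-dimensional numpy array
print(model.wv.most_similar("language", topn=2))  # nearest neighbors in vector space
```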
Today, with the development of deep learning and related technologies, NLP research has achieved one breakthrough after another, and researchers design a wide variety of models and methods to address different NLP problems. NLP applications have become ubiquitous: we constantly encounter websites and applications that use natural language processing in one form or another. At recent top conferences in the field, deep learning papers occupy a significant share, making NLP to a large extent a contest of models and computing power. This article therefore introduces some of the top pre-trained models released since 2018, which readers can use to start their journey into natural language processing and to reproduce recent research results in the field.
1. Overview of NLP Models
1. ULMFiT
GitHub project link:
https://github.com/fastai/fastai/tree/master/courses/dl2/imdb_scripts
ULMFiT pre-trained model paper:
https://www.paperswithcode.com/paper/universal-language-model-fine-tuning-for-text
Other research papers:
https://arxiv.org/abs/1801.06146

ULMFiT was proposed and designed by Jeremy Howard of fast.ai and Sebastian Ruder, who later joined DeepMind. ULMFiT stands for Universal Language Model Fine-tuning. As the name suggests, its workflow falls into three phases: first, pre-training a language model on a general corpus; second, fine-tuning that language model on the target corpus; and finally, fine-tuning a classifier for the downstream task.
ULMFiT achieves state-of-the-art results through transfer learning rather than task-specific architectures. The method fine-tunes a language model pre-trained on the Wikitext-103 dataset, using techniques such as discriminative fine-tuning and gradual unfreezing so that the model adapts to the new dataset without forgetting what it has already learned. On text classification tasks, ULMFiT outperforms many state-of-the-art techniques. With such a pre-trained language model we can train classifiers with far less labeled data, which matters because, although unlabeled text is available online in nearly unlimited quantities, labeled data is expensive and time-consuming to produce. A minimal sketch of the workflow follows.
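The sketch below shows the three phases using the fastai library, following its standard text tutorial (fastai v2 and the bundled IMDB sample are assumptions on my part; treat this as an outline, not the authors' exact script):

```python
# ULMFiT's three phases with fastai v2 (assumed installed).
from fastai.text.all import *

path = untar_data(URLs.IMDB)  # any labeled text corpus works; IMDB is a stand-in

# Phase 1 ships with fastai: AWD_LSTM weights pre-trained on Wikitext-103.
# Phase 2: fine-tune the language model on the target-domain text.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fine_tune(1)
lm_learn.save_encoder("ft_encoder")  # keep the adapted encoder

# Phase 3: fine-tune a classifier on top of the adapted encoder.
dls_clas = TextDataLoaders.from_folder(path, valid="test", text_vocab=dls_lm.vocab)
clf_learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
clf_learn.load_encoder("ft_encoder")
clf_learn.fine_tune(2)
```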
2. Transformer
GitHub project link:
https://github.com/tensorflow/models/tree/master/official/transformer
Transformer pre-trained model paper “Attention Is All You Need”:
https://www.paperswithcode.com/paper/attention-is-all-you-need
Other research papers:
https://arxiv.org/abs/1706.03762
Before 2017, language models were built with RNNs and LSTMs. These models can learn contextual relationships, but they cannot be parallelized across time steps, which makes training and inference slow. Google researchers therefore proposed a model for language modeling based entirely on attention, called the Transformer. The Transformer removes the dependence of NLP tasks on RNNs and LSTMs, using self-attention to model context and thereby greatly improving training and inference speed. It also serves as the foundation for the more powerful NLP pre-trained models that followed.
Practical results have shown that as models grow larger and training data increases, self-attention significantly outperforms traditional RNNs and LSTMs, both in training speed thanks to parallelization and in modeling long-range dependencies. The Transformer has become the basis of the representative NLP pre-trained models: the BERT family uses the Transformer encoder, while the GPT family uses the Transformer decoder. In the recommendation field, the Transformer's multi-head attention is also widely applied. A minimal sketch of the core self-attention operation follows.
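As an illustration of what self-attention computes, here is a minimal single-head, unmasked version in PyTorch, following the paper's Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. The shapes and random weights are illustrative assumptions:

```python
# Minimal scaled dot-product self-attention (single head, no masking).
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)   # every token attends to every token
    return weights @ v                    # context-mixed representations

x = torch.randn(2, 5, 16)                          # toy batch
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)      # torch.Size([2, 5, 16])
```

Because every position attends to all positions in a single matrix operation, the whole sequence is processed in parallel, which is exactly what RNNs and LSTMs cannot do.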
3. BERT
GitHub project link:
https://github.com/google-research/bert
BERT pre-trained model paper:
https://www.paperswithcode.com/paper/bert-pre-training-of-deep-bidirectional#code
Other research papers:
https://arxiv.org/pdf/1810.04805.pdf
BERT stands for Bidirectional Encoder Representations from Transformers; it conditions on context from both sides of a word (left and right). Before BERT, applying pre-trained representations to downstream tasks generally followed one of two approaches: feature-based methods such as ELMo, which feed pre-trained embeddings into the downstream network as features, and fine-tuning methods such as GPT, which attach the downstream task to the pre-trained model and train them together. Both approaches, however, share the same limitation: they cannot directly learn deep bidirectional context. ELMo learns left-to-right and right-to-left representations separately and then concatenates them, while GPT attends only to the preceding context. The authors therefore proposed a pre-trained model based on the Transformer encoder that learns bidirectional context directly, called BERT. BERT-base uses 12 Transformer encoder blocks (BERT-large uses 24) and was pre-trained on roughly 16GB of text from BooksCorpus and English Wikipedia, making it a representative breakthrough in the NLP field.
BERT is the first unsupervised, deeply bidirectional pre-training system for natural language processing, trained using only a plain text corpus. Upon release, BERT achieved state-of-the-art results on 11 natural language processing tasks, an impressive achievement. You can train your own NLP model (for example, a question-answering system) on top of BERT in just a few hours on a single GPU. In short, BERT has genuinely influenced both academia and industry: whether on GLUE or SQuAD, the highest-scoring entries on the leaderboards are improvements built on BERT. BERT is not all-powerful, though. Its architecture makes it well suited to natural language understanding, but since it has no decoder, it is not directly applicable to natural language generation. How to adapt BERT for machine translation and text summarization therefore remains a worthwhile research direction. A small demo of its masked-language-model objective follows.
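To see the bidirectional masked-language-model objective in action, here is a minimal sketch using the Hugging Face transformers library (an assumption on my part; the official repository provides its own TensorFlow scripts):

```python
# BERT's masked-language-model objective via Hugging Face transformers.
# Downloads bert-base-uncased from the model hub on first run.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both the left and right context to predict the masked token.
for pred in fill_mask("The capital of France is [MASK]."):
    print(f'{pred["token_str"]!r}: {pred["score"]:.3f}')
```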
4. Transformer-XL
GitHub project link:
https://github.com/kimiyoung/transformer-xl
Research paper:
https://arxiv.org/abs/1901.02860
Transformer-XL was developed by the Google AI team as an improvement on the Transformer, primarily addressing long sequences; XL stands for "extra long", and the model helps machines understand context beyond a fixed-length limit. During evaluation on long sequences, Transformer-XL is up to 1,800 times faster than a vanilla Transformer. The recently popular XLNet uses Transformer-XL as its backbone. A conceptual sketch of its key mechanism follows.
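The key mechanism is segment-level recurrence: hidden states from the previous segment are cached (with gradients detached) and prepended as extra context when processing the current segment. Below is a heavily simplified PyTorch sketch of the idea; the real model additionally uses relative positional encodings, which are omitted here:

```python
# Segment-level recurrence, simplified: keys/values span the cached memory
# plus the current segment; queries come from the current segment only.
import torch

def attend_with_memory(h, mem, w_q, w_k, w_v):
    """h: current segment (batch, seg_len, d); mem: cached states (batch, mem_len, d)."""
    context = torch.cat([mem, h], dim=1)           # memory extends the context window
    q = h @ w_q
    k, v = context @ w_k, context @ w_v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

h, mem = torch.randn(1, 4, 8), torch.randn(1, 6, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(attend_with_memory(h, mem, w_q, w_k, w_v).shape)  # torch.Size([1, 4, 8])
# After each segment, the cache is updated with detached states: mem = h.detach()
```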
5. XLNet
GitHub project link:
https://github.com/topics/xlnet
At the end of 2018, Google released BERT, which quickly came to dominate the NLP field. CMU and Google Brain then jointly released XLNet, an improved successor to BERT. Before that, many teams had optimized BERT, including the knowledge-enhanced models from Baidu and Tsinghua and Microsoft's multi-task learning, but these optimizations did not address BERT's critical shortcomings. As an upgraded model, XLNet improves on BERT in three respects (a brief usage sketch follows the list):
  • Uses an autoregressive (AR) objective instead of an autoencoding (AE) one, avoiding the negative effects of [MASK] corruption

  • A two-stream self-attention mechanism

  • Integrates Transformer-XL (segment-level recurrence and relative positional encodings)
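A minimal usage sketch via Hugging Face transformers (my example, not the repository's own code; the xlnet-base-cased checkpoint and an installed sentencepiece package are assumed):

```python
# Extracting contextual representations from a pre-trained XLNet.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
model = AutoModel.from_pretrained("xlnet-base-cased")

inputs = tokenizer("XLNet pre-trains with a permutation language model.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
print(hidden.shape)
```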

6. GPT-2
GitHub project link:
https://github.com/openai/gpt-2
Research paper:
https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
GPT-2 is a large Transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages (about 40GB of internet text). It is the code release accompanying the paper “Language Models are Unsupervised Multitask Learners.”
GPT-2 is trained with a single objective: predict the next word given all preceding words. To let researchers and engineers experiment with it, the developers initially released a much smaller version: the full model has 1.5 billion parameters, while the first open-source model has only 117 million. A short generation demo follows.
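Here is a minimal text-generation sketch with the released small model via Hugging Face transformers (my example; the official repository ships its own TensorFlow sampling scripts):

```python
# Sampling text from the small released GPT-2 checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Natural language processing will",
                max_new_tokens=30, num_return_sequences=1)
print(out[0]["generated_text"])
```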
7. MPNet
Code and model link:
https://github.com/microsoft/MPNet
Paper link:
https://arxiv.org/pdf/2004.09297.pdf
In recent years, pre-trained language models have become a research hotspot in natural language processing: by designing effective pre-training objectives on large-scale corpora, they learn better language representations that help with understanding and generating natural language. The masked language model (MLM) used by BERT and the permuted language model (PLM) used by XLNet are two successful pre-training objectives, but each has its own advantages and disadvantages, leaving considerable room for improvement. Combining the ideas behind BERT and XLNet, Nanjing University of Science and Technology and Microsoft jointly proposed a new pre-trained language model in 2020: MPNet (Masked and Permuted Pre-training for Language Understanding). Built on both PLM and MLM, it outperforms pre-trained models such as BERT, XLNet, and RoBERTa on natural language understanding benchmarks like GLUE and SQuAD. A short loading sketch follows.
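The pre-trained encoder can be loaded through Hugging Face transformers (the microsoft/mpnet-base model id on the Hub is an assumption on my part; the official repository also ships its own code):

```python
# Loading the pre-trained MPNet encoder for feature extraction.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet combines masked and permuted pre-training.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # contextual token embeddings
print(hidden.shape)
```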
8. ALBERT
Paper link:
https://arxiv.org/pdf/1909.11942.pdf
Although GPT-2, XLNet, RoBERTa, and other pre-trained models have indeed improved on BERT, with innovations in model structure, training regimes, and other aspects, most of them share a common trait: they are heavy, and pre-training them is expensive. The authors of ALBERT proposed their model against this backdrop, aiming to address the high training cost and large parameter counts of most pre-trained models. ALBERT reduces model parameters mainly through two techniques (a parameter-count sketch follows the list):
1. Factorized embedding parameterization;
2. Cross-layer parameter sharing.
To enhance model performance, ALBERT also introduced a new training task: sentence order prediction (SOP).
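The first technique is easy to quantify. With vocabulary size V, hidden size H, and a small embedding size E, factorization replaces one V x H embedding table with a V x E table plus an E x H projection. The sizes below are illustrative BERT-like values, not figures from the paper:

```python
# Parameter counts for ALBERT's factorized embedding (illustrative sizes).
V, H, E = 30_000, 768, 128        # vocab size, hidden size, embedding size

untied = V * H                    # BERT-style: embed tokens directly into H dims
factorized = V * E + E * H        # ALBERT: embed into E dims, then project to H

print(f"untied:     {untied:,}")       # 23,040,000
print(f"factorized: {factorized:,}")   # 3,938,304 (~6x fewer embedding parameters)
```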

[Figure: ALBERT performance results]
The results show that, compared to BERT, ALBERT significantly reduces parameter count without sacrificing performance. There is also an albert_tiny model with only 4 hidden layers and roughly 1.8M parameters, making it very lightweight: its training and inference are about 10 times faster than BERT's while accuracy stays close, reaching 85.4% on the semantic similarity dataset LCQMC, only 1.5 points below bert_base. For relatively simple tasks or latency-sensitive ones, such as semantic similarity computation and classification, ALBERT is a very good fit.
9. ELECTRA
GitHub link:
https://github.com/google-research/electra
Paper link:
https://openreview.net/pdf?id=r1xMH1BtvB
ELECTRA comes from Google AI; it retains the advantages of BERT while being far more efficient. It introduces a new pre-training task called replaced token detection (RTD): a small generator network proposes plausible replacements for some input tokens, and ELECTRA, acting as a discriminator, learns to decide for every token whether it is the original or a replacement. In terms of efficiency, it needs only about a quarter of the computation of RoBERTa or XLNet to match their performance on GLUE, and it set new records on SQuAD. This shows that small-scale models can still have a large impact: a model trained on a single GPU for just 4 days exceeds the accuracy of OpenAI's GPT. ELECTRA has been released as an open-source TensorFlow model, including several ready-to-use pre-trained language representation models. A small RTD demo follows.
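The replaced-token-detection objective can be inspected directly with the released discriminator through Hugging Face transformers (the usage here is my sketch; the official repository uses its own TensorFlow code):

```python
# Scoring tokens as original vs. replaced with the ELECTRA discriminator.
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "flew" stands in for a generator's implausible replacement of "ate".
inputs = tokenizer("The cat flew the fish", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # one real/fake score per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, logits[0]):
    print(f"{tok:>8}  p(replaced)={torch.sigmoid(score).item():.2f}")
```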
10. ELMo
GitHub project link:
https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
Research paper:
https://arxiv.org/pdf/1802.05365.pdf
ELMo (Embeddings from Language Models) is a method for representing words as vectors that is very useful for building context-aware natural language processing models. ELMo was introduced in early 2018 and won the Best Paper award at NAACL 2018. In earlier work such as word2vec (2013) and GloVe (2014), each word maps to a single fixed vector, which handles polysemy poorly. ELMo offers a better solution: instead of one fixed vector per word, the pre-trained artifact is a full model. At use time, a sentence or paragraph is fed in, and the model infers the word vectors from the context. A major benefit of this approach is that polysemous words are disambiguated by their surroundings: "apple" can be understood as a company or as a fruit, depending on the context. A small usage sketch follows.
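The sketch below uses the older AllenNLP 0.x API from the linked tutorial; the options and weights paths are placeholders to fill in from that tutorial, not real URLs:

```python
# Contextual word vectors with ELMo via AllenNLP (0.x-era API assumed).
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "path/or/url/to/elmo_options.json"   # placeholder
weight_file = "path/or/url/to/elmo_weights.hdf5"    # placeholder

elmo = Elmo(options_file, weight_file, num_output_representations=1)

# Same surface form, two contexts: ELMo yields different vectors for "apple".
sentences = [["I", "ate", "an", "apple"],
             ["Apple", "released", "a", "new", "phone"]]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)["elmo_representations"][0]
print(embeddings.shape)   # (batch, max_seq_len, embedding_dim)
```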
2. NLP Development Trends
Currently, the approach of large-scale corpus pre-training + fine-tuning is likely to be the mainstream in NLP for the next few years. Various improvements based on language models continue to emerge. Although the methods vary, we can still see some groundbreaking directions.
1. Behemoth Series: T5, GPT-3, MegatronLM
In the early stages, the improvements from BERT to RoBERTa and from GPT to GPT-2 had already shown that more data yields more powerful and more general pre-trained models. Between 2019 and 2020, NVIDIA, Google, and OpenAI successively released behemoth models: MegatronLM (8.3 billion parameters), T5 (11 billion), and GPT-3 (175 billion), continuously setting astonishing records while showcasing the strength of these giants. Behemoth models will likely remain a research goal of large companies, while staying out of reach for ordinary researchers.
2. Small and Beautiful Series: DistilBERT, TinyBERT, FastBERT
Without the financial muscle of the leading giants, ordinary companies and research institutions have focused on the opposite track: model lightweighting. How to approach the accuracy of large models with as few parameters as possible, while making training and prediction several times faster, is a practical and valuable topic. Representative works include TinyBERT from Huawei Noah's Ark Lab and FastBERT from Peking University, both with remarkable results. FastBERT, for example, attaches a classifier to every layer of BERT and adapts the amount of computation to each sample: easy samples can be predicted after only a few layers, while hard samples go through the full stack (sketched below).
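A conceptual sketch of that sample-adaptive early exit, entirely my simplification of the idea rather than the released FastBERT code: run the encoder one layer at a time and stop as soon as the current layer's classifier is confident enough (low prediction entropy).

```python
# Early-exit inference in the spirit of FastBERT (conceptual simplification).
import torch

def adaptive_forward(layers, classifiers, h, threshold=0.3):
    """layers/classifiers: per-layer modules; h: (1, seq_len, d) hidden states."""
    for i, (layer, clf) in enumerate(zip(layers, classifiers)):
        h = layer(h)
        probs = torch.softmax(clf(h.mean(dim=1)), dim=-1)       # pooled prediction
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
        if entropy.item() < threshold:                          # easy sample
            return probs, i + 1
    return probs, len(layers)                                   # hard sample

d, n_layers, n_classes = 16, 4, 2
layers = [torch.nn.Linear(d, d) for _ in range(n_layers)]
classifiers = [torch.nn.Linear(d, n_classes) for _ in range(n_layers)]
probs, used = adaptive_forward(layers, classifiers, torch.randn(1, 5, d))
print(probs, f"-- exited after {used} layer(s)")
```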
3. Potential Stocks Series: Few-Shot Learning
In practical business scenarios, small and medium AI companies often encounter issues with insufficient data volume. In such cases, transfer learning and few-shot learning can be very helpful. Inspired by humans’ ability to learn quickly from few (single) samples, enabling models to learn from a small number of samples to gain strong generalization capabilities has become one of the research hotspots in recent years.
Summarizing the development of natural language processing in recent years, we can observe the following trend changes:
First, neural networks have penetrated every area of NLP, and neural methods for modeling, learning, and reasoning have made significant progress on typical NLP tasks such as those introduced above;
Second, a series of pre-trained models represented by BERT have been widely applied, reflecting the potential of the universal language patterns and knowledge contained in large-scale language data combined with specific application scenarios;
Third, low-resource NLP tasks have received widespread attention and have made good progress.
Beyond the technological advances, NLP progress in China has also attracted global attention. In paper publications at top conferences (ACL, EMNLP, COLING, and so on), China has ranked second in the world for the past five years, behind only the United States and far ahead of other countries. Machine translation centered on the Chinese language is now at a world-leading level, and China also ranks among the leaders in dialogue systems. From China to Asia to the world, the whole NLP field shows efforts being made at every level and scale. As Zhou Ming, vice president of Microsoft Research Asia, put it, NLP has entered a golden decade. The growth of the national economy and the enormous demand that artificial intelligence creates for NLP mean that large amounts of diverse data are available for model training; new methods, represented by neural network NLP, will steadily improve modeling capability; evaluations and open platforms will help NLP research spread; and the flourishing AI and NLP fields will foster the cultivation of specialized talent. The NLP field will surely witness more milestone work, and an increasing number of intelligent applications will follow.
