Report by Machine Heart
Editor: Zhang Qian
This nearly 100-page review traces the evolution of pre-trained foundation models, showing how ChatGPT gradually arrived at its success.
All successes have a traceable path, and ChatGPT is no exception.
Recently, Turing Award winner Yann LeCun was trending for his notably harsh assessment of ChatGPT.
In his view, “in terms of underlying technology, ChatGPT does not have any particularly innovative aspects,” nor is it “something revolutionary”: many research labs are using the same techniques and doing similar work. More importantly, ChatGPT and the underlying GPT-3 are assembled from technologies developed by many parties over many years, the result of decades of contributions from different people. LeCun therefore argues that ChatGPT is better described as a solid piece of engineering than as a scientific breakthrough.

“Is ChatGPT revolutionary?” is a contentious question. But it is undeniably built on many technologies accumulated before it, such as the Transformer at its core, which Google proposed a few years ago and which was itself inspired by work on attention from Bengio’s group. Trace back further and the lineage connects to research from decades ago.
Of course, the public rarely notices this gradual progression, since few people read the papers one by one. For practitioners, however, understanding how these technologies evolved is very helpful.
In a recent review article, researchers from Michigan State University, Beihang University, and Lehigh University carefully surveyed hundreds of papers in this area, focusing on pre-trained foundation models for text, image, and graph learning; the review is well worth reading. Its authors include Jian Pei, a professor at Duke University and Fellow of the Canadian Academy of Engineering, Philip S. Yu, a Distinguished Professor in the Department of Computer Science at the University of Illinois Chicago, and Caiming Xiong.
Paper link: https://arxiv.org/pdf/2302.09419.pdf
The paper’s table of contents is as follows:
On overseas social platforms, DAIR.AI co-founder Elvis S. recommended this review and received over a thousand likes.
Introduction
Pre-trained foundation models (PFMs) are an important component of artificial intelligence in the big data era. The name “foundation model” comes from the review “On the Opportunities and Risks of Foundation Models” by Percy Liang, Fei-Fei Li, and others, and is an umbrella term for a class of models and their capabilities. PFMs have been widely studied in NLP, CV, and graph learning. They show strong potential for feature representation learning across tasks such as text classification, text generation, image classification, object detection, and graph classification. Whether pre-trained on large datasets spanning multiple tasks or fine-tuned on small-scale tasks, PFMs deliver superior performance and enable rapid data processing.
PFM and Pre-training
PFMs build on pre-training techniques, which use large amounts of data and many tasks to train a general model that can then be easily fine-tuned for different downstream applications.
The idea of pre-training originated in transfer learning for CV tasks. After seeing how effective it was in CV, researchers began using it to improve model performance in other fields.
When pre-training is applied in NLP, a well-trained language model captures knowledge that benefits downstream tasks, such as long-term dependencies and hierarchical relationships. A further advantage of pre-training in NLP is that the training data can come from any unlabeled text corpus, so the amount of data available for pre-training is essentially unlimited. Early pre-training methods were static, such as NNLM and Word2vec, but static representations struggle to adapt to different semantic contexts, which motivated dynamic pre-training techniques such as BERT and XLNet. Figure 1 illustrates the history and evolution of PFMs in NLP, CV, and GL. PFMs built on these pre-training techniques use large corpora to learn general semantic representations. Following these pioneering works, a wide variety of PFMs have emerged and been applied to downstream tasks and applications.
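To make the static-versus-dynamic contrast concrete, here is a minimal sketch (not from the paper; it assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint) that extracts a contextual vector for the same word in two different sentences. A static method such as Word2vec assigns a word a single vector regardless of context, whereas a dynamic model produces context-dependent representations.

```python
# Minimal sketch: contextual ("dynamic") representations vs. static ones.
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-uncased` checkpoint; names here are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return BERT's contextual vector for `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index(tokenizer.tokenize(word)[0])    # first sub-token of `word`
    return hidden[idx]

v_river = word_vector("She sat on the bank of the river.", "bank")
v_money = word_vector("She deposited the cash at the bank.", "bank")

# Word2vec stores one vector per word, so "bank" would be identical in both
# sentences; BERT's two vectors differ because the surrounding context differs.
cos = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {cos.item():.3f}")
```

The lower the similarity between the two vectors, the more the representation has adapted to its context, which is exactly what a static embedding cannot do.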
The recently popular ChatGPT is a typical application of PFMs. It is fine-tuned from GPT-3.5, a generative pre-trained transformer trained on large amounts of text and code, and it additionally applies reinforcement learning from human feedback (RLHF), which has become a promising way to align large language models with human intent. ChatGPT’s remarkable performance may bring transformative changes to the training paradigms of all kinds of PFMs, such as instruction alignment, reinforcement learning, prompt tuning, and chain-of-thought techniques, moving the field toward artificial general intelligence.
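As a rough illustration of one ingredient of RLHF, the reward-modeling step, the sketch below trains a toy scorer with a pairwise preference loss so that human-preferred responses receive higher rewards. The class and variable names are hypothetical; real systems initialize the reward model from a pre-trained language model, and the subsequent policy-optimization stage (typically PPO) is omitted here.

```python
# Toy sketch of the reward-modeling step used in RLHF-style training.
# `RewardModel`, `chosen`, and `rejected` are illustrative names, not the
# paper's or OpenAI's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy scorer: embeds token ids and maps their mean to a scalar reward."""
    def __init__(self, vocab_size: int = 1000, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.score(self.embed(token_ids).mean(dim=1)).squeeze(-1)

reward_model = RewardModel()

# Stand-ins for tokenized response pairs ranked by human labelers.
chosen = torch.randint(0, 1000, (4, 16))    # responses labelers preferred
rejected = torch.randint(0, 1000, (4, 16))  # responses labelers rejected

# Pairwise (Bradley-Terry style) objective: preferred responses should score
# higher than rejected ones.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
print(f"pairwise preference loss: {loss.item():.4f}")
```

Once trained, such a reward model scores the language model’s outputs, and a reinforcement-learning algorithm then updates the language model to produce responses the reward model rates highly.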
This article focuses on PFMs for text, images, and graphs, a relatively mature way of organizing the research. For text, a PFM is a versatile language model that predicts the next word or character in a sequence; such models can be used for machine translation, question answering, topic modeling, sentiment analysis, and more. For images, the idea is similar: a model trained on massive datasets is adapted to many CV tasks. For graphs, comparable pre-training ideas are used to obtain PFMs that serve a range of downstream tasks. Beyond PFMs for specific data domains, the article also reviews several other advanced PFMs, such as those for speech, video, and cross-domain data, as well as multimodal PFMs. In addition, PFMs that integrate multiple modalities and can handle multimodal tasks are emerging; the authors call these unified PFMs, first defining the concept and then reviewing the state-of-the-art unified PFMs in recent research (such as OFA, UNIFIED-IO, FLAVA, and BEiT-3).
Based on the characteristics of existing PFMs in these three fields, the authors identify two major advantages. First, a model needs only minimal fine-tuning to improve performance on downstream tasks. Second, PFMs have already been vetted for quality: instead of building a model from scratch to solve a similar problem, we can apply a PFM to a task-related dataset. The broad prospects of PFMs have spurred a large body of work on model efficiency, safety, and compression.
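As an example of the first advantage, the fine-tune-instead-of-train-from-scratch workflow, here is a minimal sketch. It assumes the Hugging Face transformers and datasets libraries, the public distilbert-base-uncased checkpoint, and the IMDB sentiment dataset; none of these specifics come from the paper.

```python
# Minimal sketch of fine-tuning a pre-trained model on a downstream task
# instead of training from scratch. Assumes `transformers` and `datasets`;
# the checkpoint, dataset, and hyperparameters are illustrative choices.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # pre-trained body, new task head

dataset = load_dataset("imdb")  # binary sentiment classification

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pfm-finetune", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"].shuffle(seed=0).select(range(2000)),
    eval_dataset=encoded["test"].select(range(1000)),
)
trainer.train()          # a brief fine-tuning pass over a small labeled set
print(trainer.evaluate())
```

The point of the example is that only a small task head and a short training pass are added on top of the pre-trained weights, rather than learning a language representation from scratch.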
Contributions and Structure of the Paper
Before this article was published, several reviews had already covered some specific domains of pre-trained models, such as text generation, visual transformers, and object detection.
“On the Opportunities and Risks of Foundation Models” summarizes the opportunities and risks of foundation models, but existing works do not comprehensively review PFMs across different fields (such as CV, NLP, GL, speech, and video) along dimensions such as pre-training tasks, efficiency, effectiveness, and privacy. In this review, the authors detail the evolution of PFMs in NLP and how pre-training migrated to, and was adopted in, the CV and GL fields.
Unlike other reviews, this article provides a comprehensive introduction and analysis of existing PFMs across all three fields. In contrast to previous reviews of pre-trained models, the authors cover existing work from traditional models to PFMs, together with the latest results in the three fields: traditional models emphasize static feature learning, whereas dynamic PFMs, which introduce new structures, are now the mainstream of research.
The authors further introduce some other research on PFMs, including other advanced and unified PFMs, model efficiency and compression, safety, and privacy. Finally, the authors summarize future research challenges and open questions across different fields. They also comprehensively present relevant evaluation metrics and datasets in Appendices F and G.
In summary, the main contributions of this article are as follows:
- A detailed and up-to-date review of the development of PFMs in NLP, CV, and GL. In the review, the authors discuss and provide insights into the design and pre-training methods of general PFMs in these three main application areas;
- A summary of the development of PFMs in other multimedia fields, such as speech and video. The authors also discuss cutting-edge topics around PFMs, including unified PFMs, model efficiency and compression, and safety and privacy;
- By reviewing PFMs across different modalities and tasks, a discussion of the major challenges and opportunities for future research on large-scale models in the big data era, guiding the next generation of collaborative and interactive intelligence based on PFMs.
The main content of each chapter is as follows:
Chapter 2 of the paper introduces the general conceptual architecture of PFMs.
Chapters 3, 4, and 5 summarize existing PFMs in the fields of NLP, CV, and GL respectively.
Chapters 6 and 7 introduce other cutting-edge research on PFMs, including advanced and unified PFMs, model efficiency and compression, as well as safety and privacy.
Chapter 8 summarizes the main challenges of PFMs. Chapter 9 concludes the entire text.
Further Reading:
- “Hot Interpretations: The Emergent Capabilities of Large Models and the Paradigm Shift Triggered by ChatGPT”
- “Where Do ChatGPT’s Superpowers Come From? A Ten-Thousand-Word Analysis Tracing the Technical Roadmap!”
- “Learn about the Technology Behind ChatGPT from Li Mu: Understand the InstructGPT Paper in 67 Minutes”
- “Why Did All GPT-3 Replications Fail? These Are Things You Should Know When Using ChatGPT”
- “Comprehensive Learning of ChatGPT: Machine Heart Has Prepared a Collection of 89 Articles”