Editor’s Note
Since its launch on November 30th, 2022, ChatGPT, developed by the American startup OpenAI, has gained over a million users and sparked intense discussion. It can perform a range of common text tasks, including writing code, debugging, translating, writing fiction, producing business copy, generating recipes, doing homework, and grading assignments. Moreover, it remembers the context of its conversation with the user, making the interaction feel remarkably lifelike.
Although industry insiders believe that ChatGPT still has issues such as an outdated training dataset, we cannot help but ponder: Where will human-created AI ultimately lead? How will the relationship between humans and thinking machines evolve? These questions are impossible to ignore.
Written by | Sun Ruichen
Reviewed by | Zhang Zheng
Edited by | Chen Xiaoxue
Movie poster for Dune (Image source: IMDb.com)
The film Dune, released at the end of last year, is a science fiction story set in the year 10191 (8169 years from now). While watching, I had a lingering question: the lives of people in this story seem more primitive than today, and there are not many traces of artificial intelligence (AI) in the narrative. Later, after reading the original novel of Dune, I understood that this was a deliberate design by the author: there was a war at some point before 10191, where humanity’s adversaries were the thinking machines they had created. After a brutal war, humans managed to defeat these intelligent machines with great effort. Consequently, humanity decided to permanently ban the existence of such machines, which led to the primitive world of Dune in 10191.
Just over a week ago, OpenAI launched a new AI chat model, ChatGPT. Many people, myself included, have tried out this new chatbot over the past week, and you may already have guessed where this is going: after talking with it, images of the Dune world came to mind.
Over the past decade, it seems like a “Cambrian explosion” has occurred in the field of artificial intelligence technology, with a plethora of new terms rapidly emerging and becoming popular in a short time. Many of these new terms and their abbreviations lack standardized Chinese translations, leading industry insiders to commonly communicate using English abbreviations. This creates a cognitive barrier for outsiders wanting to fully understand these technologies.
To understand the ChatGPT chatbot, one must first understand InstructGPT, GPT-3, GPT-2, GPT, the transformer, and the RNN models that were the mainstay of natural language processing before them.
1. The Predecessors of ChatGPT
In 2017, the Google Brain team presented a paper titled “Attention is All You Need” at the Neural Information Processing Systems (NeurIPS) conference, a top academic conference in machine learning and artificial intelligence. In this paper, the authors introduced the transformer model based on self-attention mechanisms for the first time and applied it to understanding human language, i.e., natural language processing.[1] Before this paper was released, the mainstream model in natural language processing was the recurrent neural network (RNN). The advantage of RNN models is their ability to better handle sequential data, such as language. However, this also leads to issues when processing longer sequences, such as long articles or books, where the model may become unstable or stop effective training too early (due to gradient vanishing or exploding during training, which will not be elaborated here) and may take too long to train (since data must be processed sequentially, parallel training is not possible).
Initial architecture of the transformer model (Image source: Reference [1])
The transformer model proposed in 2017 can carry out its computations and training in parallel, which greatly shortens training time; moreover, its attention patterns can often be read in grammatical terms, giving the model a degree of interpretability.
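The parallelism comes from the self-attention operation itself: every token attends to every other token through a single matrix multiplication, with no step-by-step recurrence. Below is a minimal numpy sketch of scaled dot-product attention; the toy dimensions are chosen purely for illustration, and a real transformer adds learned projections, multiple attention heads, and stacked layers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017): every position
    attends to every other position in one matrix multiplication, so the
    whole sequence is processed in parallel rather than token by token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) similarity scores
    # Row-wise softmax: each row becomes a probability distribution
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # weighted mix of value vectors

# Toy example: a 4-token sequence with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8): one updated vector per token, computed all at once
```

The attention-weight matrix `w` is also the source of the interpretability mentioned above: each of its rows shows how strongly one token attends to every other token.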
This initial transformer model had 65 million trainable parameters. The Google Brain team trained it on several publicly available language datasets: the WMT 2014 English-German machine translation dataset (4.5 million sentence pairs), the WMT 2014 English-French dataset (36 million sentence pairs), and portions of the Penn Treebank (about 40,000 sentences from the Wall Street Journal, supplemented by roughly 17 million additional sentences in the semi-supervised setting). The team also published the model architecture in the paper, so anyone can build a similar model and train it on their own data.
After training, this initial transformer model achieved the industry's highest scores on translation accuracy, English constituency parsing, and other benchmarks, becoming the most advanced large language model (LLM) of its time.
Major milestones of large language models (LLM)
Since its inception, the transformer model has profoundly influenced the trajectory of AI development in the following years. In just a few years, its impact has spread across various fields of AI—from various natural language models to the AlphaFold2 model predicting protein structures, all relying on it.
2. Continuous Iteration: Seeking the Limits of Language Models
Among the many teams researching the transformer model, OpenAI has been one of the few that has consistently focused on finding its limits.
OpenAI was founded in San Francisco in December 2015; Tesla CEO Elon Musk was among its co-founders and provided early funding (he later stepped back from the company but remained a backer). In its early days, OpenAI was a non-profit organization whose mission was to develop AI technologies beneficial and friendly to humanity. In 2019, OpenAI restructured, creating a capped-profit entity, and that change was not unrelated to the transformer model.
In 2018, less than a year after the transformer was born, OpenAI published the paper "Improving Language Understanding by Generative Pre-Training" and released GPT-1 (Generative Pre-trained Transformer), a model with 117 million parameters. It was pre-trained on BookCorpus, a large dataset containing over 7,000 unpublished books in genres such as adventure, fantasy, and romance. After pre-training, the authors further trained the model on specific datasets for four different language scenarios (a step known as fine-tuning). The final model outperformed the basic transformer in question answering, text-similarity assessment, semantic-entailment judgment, and text classification, becoming the new industry leader.
In 2019, the company announced GPT-2, a model with 1.5 billion parameters. Its architecture was the same as GPT-1's; the main difference was scale, with GPT-2 more than ten times larger. The accompanying paper, "Language Models are Unsupervised Multitask Learners," used a dataset the team collected themselves, consisting mainly of text scraped from the web. Unsurprisingly, GPT-2 set new records for large language models across multiple language tasks. The paper also reports GPT-2's results on novel questions (questions and answers not present in the training data).
Results of the GPT-2 model answering new questions (Image source: [3])
In 2020, the startup surpassed itself again, publishing the paper "Language Models are Few-Shot Learners" and launching GPT-3, with 175 billion parameters. The architecture of GPT-3 is essentially the same as GPT-2's; it is simply two orders of magnitude larger. Its training set is also far larger than those of the previous two GPT models: a filtered web-crawl dataset (429 billion tokens), Wikipedia articles (3 billion tokens), and two different book datasets (67 billion tokens in total).
Because of the enormous parameter count and the scale of the training data required, training a GPT-3 model is conservatively estimated to cost five to twenty million dollars: using more GPUs raises the cost but shortens the training time, and vice versa. Research on language models of this magnitude is no longer something ordinary scholars or individuals can afford. Yet facing such a massive model, users can obtain high-quality answers with only a few example prompts, or even none at all. "Few-shot" prompting means the user supplies a handful of worked examples before posing the actual language task (translation, text creation, question answering, and so on).
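As a concrete illustration of what "a few examples before the task" looks like, here is a sketch of assembling a few-shot translation prompt. The format and the example pairs are my own illustration of the idea, not OpenAI's exact interface:

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: worked examples first, then the real task.
    The model is expected to continue the pattern it sees in the examples."""
    lines = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    lines.append(f"English: {query}\nFrench:")  # leave the answer blank
    return "\n\n".join(lines)

# Two demonstration pairs, then the actual request.
examples = [
    ("cheese", "fromage"),
    ("good morning", "bonjour"),
]
prompt = build_few_shot_prompt(examples, "thank you")
print(prompt)
```

A zero-shot prompt would simply omit the examples and state the task directly; GPT-3's headline result was that the few-shot version often works markedly better without any retraining.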
GPT-3 can better answer questions based on user prompts (Image source: [4])
When GPT-3 was released, it did not come with a broadly available user interface; users had to apply and be approved before registering, so relatively few people experienced GPT-3 directly. From the experiences shared online by those who did, we know that GPT-3 can generate complete, coherent long articles from simple prompts, so convincingly that it is hard to believe a machine wrote them. GPT-3 can also write code, create recipes, and handle almost any text-generation task. After this early testing period, OpenAI commercialized GPT-3: paying users can connect to it through an API to perform their language tasks. In September 2020, Microsoft obtained an exclusive license to GPT-3, meaning exclusive access to its source code; the license does not affect paying users' ability to keep using GPT-3 through the API.
In March 2022, OpenAI published another paper titled “Training Language Models to Follow Instructions with Human Feedback” and introduced the InstructGPT model, which is based on GPT-3 and further fine-tuned. InstructGPT’s training incorporated human evaluations and feedback data, not just pre-prepared datasets.
During the public testing of GPT-3, users provided a wealth of conversational and prompt data, and OpenAI’s internal data labeling team also generated a significant amount of manually labeled datasets. These labeled data can help the model learn not only from the data itself but also from the human annotations (for example, certain sentences or phrases should be used sparingly).
OpenAI first used this data to fine-tune GPT-3 through supervised learning.
Next, they collected samples of answers generated by the fine-tuned model. Generally speaking, for each prompt, the model can provide countless answers, but users typically want to see just one answer (which aligns with human communication habits). The model needs to rank these answers and select the best one. Therefore, the data labeling team manually scored and ranked all possible answers, selecting the one that best fits human communication habits. These manually scored results can further establish a reward model—a reward model can automatically provide feedback to the language model, encouraging it to give good answers and suppressing bad ones, helping the model find the optimal answer.
Finally, the team used the reward model and more labeled data to continue optimizing the fine-tuned language model and iterating it. The final model is called InstructGPT.
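The InstructGPT paper trains the reward model on these human rankings with a pairwise loss: whenever annotators preferred answer A over answer B, the model is nudged to score A higher. Here is a minimal numpy sketch of that loss, using hypothetical scalar reward scores:

```python
import numpy as np

def pairwise_ranking_loss(r_preferred, r_rejected):
    """Pairwise reward-model loss from the InstructGPT paper:
    -log(sigmoid(r_w - r_l)). The loss is small when the human-preferred
    answer already receives the higher reward score, and large otherwise."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_preferred - r_rejected))))

# Hypothetical reward scores for two answers to the same prompt.
good, bad = 2.0, -1.0
print(pairwise_ranking_loss(good, bad))  # small: ranking already correct
print(pairwise_ranking_loss(bad, good))  # large: the pair is scored wrongly
```

In the final reinforcement-learning step, the language model is then optimized to produce answers that this reward model scores highly.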
3. The Birth of ChatGPT
Our focus today is ChatGPT and its predecessors, so the story has inevitably been told mainly through OpenAI. But in following the thread from GPT-1 to InstructGPT, we should not overlook that other AI companies and teams were pursuing similar work at the same time. In the two years after GPT-3's release, several comparable large language models appeared, though it must be said that GPT-3 remains the best known.
Some competitors of GPT-3 (Image source: gpt3demo.com)
Returning to the present, during this year’s Neural Information Processing Systems conference, OpenAI announced their latest large language pre-training model: ChatGPT.
Like InstructGPT, ChatGPT is a chatbot that OpenAI built by fine-tuning the GPT-3 model (the fine-tuned series is also known as GPT-3.5). According to OpenAI's website, ChatGPT is a sibling model to InstructGPT. Since the largest InstructGPT model has 175 billion parameters (the same as GPT-3), it is reasonable to suppose that ChatGPT's parameter count is in the same range. However, the InstructGPT paper reports that the variant human evaluators preferred has only 1.3 billion parameters, so ChatGPT's parameter count may be similarly modest.[5]
Within days of its launch, ChatGPT garnered over a million users. Examples of conversations shared on social media show that ChatGPT, like GPT-3, can perform a range of common text tasks: writing code, debugging, translating, writing fiction, producing business copy, generating recipes, doing homework, and grading assignments. One advantage of ChatGPT over GPT-3 is that it interacts in a far more conversational way, whereas GPT-3 is better at producing long-form text but lacks a conversational tone. Some users have even had ChatGPT negotiate with customer service and recover an overcharge (which suggests that, in some sense, ChatGPT passes the Turing test), and it could become a good companion for socially anxious people.
4. Warning of Issues
OpenAI’s research team warned users about some issues with the ChatGPT model upon its release, and these issues have been confirmed by global netizens through repeated testing.
First, the training data behind ChatGPT's large language model only extends to the end of 2021, so it cannot answer accurately about events of the past year. Second, when users seek precise information from ChatGPT (for example, code or recipes), the accuracy of its answers is inconsistent, so users need the ability to judge the quality and correctness of what it produces. Because of these accuracy problems, the programming Q&A site Stack Overflow has banned users from posting ChatGPT-generated answers.
In response, Zhang Zheng, director of the Amazon AWS AI research institute in Shanghai, commented that ChatGPT's training method has a fatal flaw: the scoring of candidate answers is based on ranking, which is only a coarse signal, so erroneous reasoning can slip through (the fact that answer A is ranked above answer B does not mean A is free of common-sense or factual errors). Question answering is open-ended, and the soundness of each step of reasoning is a matter of degree that needs to be judged piece by piece. The problem is not unsolvable, but a great deal of foundational work remains to be done.
Finally, the way a question is phrased also affects the accuracy of ChatGPT's answers, and this sensitivity has had an unexpected side effect. Earlier this year, OpenAI launched its latest AI painting system, DALL·E 2 (around the same time as several similar products, such as Midjourney). Users need only provide a textual description, and DALL·E 2 generates an image to match. It is no exaggeration to say that the quality and style of these images can rival the work of professional artists.
An abstract painting generated by DALL·E 2 (Image source: openai.com)
Thus, while the art world reels, the business of prompt engineering has quietly taken off: good prompts can guide AI models toward more satisfying and aesthetically pleasing works, while poor prompts tend to yield mediocre results, or worse, something resembling a student exercise. How to write good prompts and hold high-quality dialogues with AI models has therefore become a new entrepreneurial hotspot. The San Francisco startup PromptBase has launched a service selling prompts at $1.99 each, mainly targeting content-generation models such as DALL·E 2 and GPT-3. Perhaps ChatGPT will soon join the list.
Based on the previously mentioned principles of few-shot learning and incorporating human feedback, we already know that if we provide the ChatGPT model with several examples before presenting a language task, or continuously give feedback to guide ChatGPT, its responses will align more closely with our requirements. Therefore, writing a good prompt can yield more surprises from ChatGPT.
5. The Evolution of AI: Where Will It End?
From the transformer model of 2017 to today’s ChatGPT, large language models have undergone numerous iterations, each more powerful than the last. In the future, OpenAI will continue to bring us GPT-4, GPT-5, and even GPT-100. Our current fascinating, bizarre, and mind-bending conversations with ChatGPT will all become training data for the next generation of models.
When OpenAI was founded in 2015, its stated intention was to develop AI technologies beneficial to humanity. In the seven years since, there has been no evidence that it has strayed from that intention; indeed, ChatGPT and the large language model behind it look like advanced productivity tools aimed at the future. We have reason to believe that AI technologies exemplified by large language models can help us learn and work better and lead to better lives; we also have reason to keep supporting, developing, and promoting AI for the public good. What we can no longer ignore, however, is that AI technology now evolves and iterates far faster than humans, or biology in general, ever could.
Elon Musk, co-founder of OpenAI, once discussed the founding intention of OpenAI when he realized the enormous potential of AI: “How can we ensure that the future brought by AI is friendly? There will always be a risk in trying to develop friendly AI technologies that we might create things that concern us. However, the best barrier may be to allow as many people as possible to access and own AI technology. If everyone can utilize AI technology, then there is no risk of a small group of people possessing overly powerful AI technology leading to dangerous consequences.”
However, what Musk did not mention is that even if everyone has the opportunity and ability to use AI technology, if AI technology develops to a point that humans cannot control, how do we build our fortress? How can we avoid a world war between humans and thinking machines, as alluded to in the story of Dune? The existence of ChatGPT is not yet a cause for concern, but where will the evolution of AI ultimately end?
In the process of creating AI, it is difficult for humanity to stop questioning—will the rapidly evolving AI technology one day force us to choose a primitive future like that in Dune?
ChatGPT does not know either.
Author’s Bio:
Sun Ruichen, Ph.D. in Neurobiology from the University of California, San Diego, currently a data scientist at a pharmaceutical company.
References:
1. https://arxiv.org/abs/1706.03762
2. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
3. https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
4. https://arxiv.org/abs/2005.14165v4
5. https://arxiv.org/abs/2203.02155
Layout Editor | Xiao Mao