The Evolution of Large Models: From Transformer to DeepSeek-R1

🕙 Release Date: February 14, 2025

At the beginning of 2025, the emergence of DeepSeek-R1 created a stir in the field of artificial intelligence. This article reviews the development history of large language models, starting with the revolutionary Transformer architecture in 2017, which redefined natural language processing (NLP) through the self-attention mechanism. It then follows the rise of models like BERT and GPT, which transformed context understanding and generation capabilities, ultimately leading to the birth of GPT-3 with 175 billion parameters. The article also explores how supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are used to address the “hallucination” problem in large language models, in which generated text contradicts facts and reads as confidently delivered nonsense. From 2023 onward, multimodal models such as GPT-4V and GPT-4o integrated text, images, and audio, while reasoning models like OpenAI-o1 and DeepSeek-R1 pushed the boundaries of complex problem-solving.

1. What is a Language Model?

A language model is an artificial intelligence (AI) system designed to process, understand, and generate human-like language. Language models learn patterns and structures from large datasets, which lets them generate coherent and contextually relevant text, with applications including translation, summarization, chatbots, and content generation.

1.1 Autoregressive Language Models

Most large language models operate autoregressively, meaning they predict the probability distribution of the next token based on the preceding token sequence. This autoregressive property allows the model to capture complex language patterns and dependencies. Mathematically, the probability of a token sequence $x_1, \dots, x_T$ factorizes as:

$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$$

In the text generation process, large language models determine the next output token through decoding algorithms. This process can adopt different strategies: selecting the token with the highest probability (i.e., greedy search) or sampling a token randomly from the predicted probability distribution. The latter approach allows for variability in generated text, mirroring the diversity and randomness of human language.
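The two decoding strategies described above can be sketched in a few lines of plain Python. The vocabulary, the scores, and the helper names (`softmax`, `greedy_pick`, `sample_pick`) are made up for illustration and do not come from any particular model or library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(probs):
    """Greedy search: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    """Random sampling: draw one index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and scores, for illustration only.
vocab = ["mat", "sofa", "roof", "moon"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
greedy_token = vocab[greedy_pick(probs)]
sampled_tokens = [vocab[sample_pick(probs, rng)] for _ in range(5)]
```

Greedy search always returns the same token for the same scores, while sampling can return any token with non-zero probability, which is where the variability in generated text comes from.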

1.2 Generative Capability

The autoregressive nature of large language models enables them to generate one token at a time, sequentially, by leveraging the context established by preceding words. Starting from an initial token or prompt, the model iteratively predicts the next token until a complete sequence is formed or a predefined stopping condition is met.

To generate a complete response to a prompt, a large language model is queried repeatedly, with each previously chosen token appended to the input for the next step.
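As a rough sketch of this loop, the example below replaces the model with a hypothetical lookup table (`NEXT_TOKEN`) that maps the last token to the next one. A real model would condition on the entire sequence, but the control flow of appending the chosen token and querying again is the same.

```python
# Hypothetical toy "model": a lookup table from the last token to the next one.
NEXT_TOKEN = {
    "<bos>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<eos>",
}

def predict_next(tokens):
    """Stand-in for a model call; a real model would use all preceding tokens."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "<eos>":   # predefined stopping condition
            break
        tokens.append(nxt)   # feed the chosen token back in as input
    return tokens

output = generate(["<bos>"])
```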

This generative capability supports various applications, including creative writing, conversational AI, and automated customer support systems.

2. The Transformer Revolution (2017)

In 2017, Vaswani et al. introduced the Transformer architecture in their groundbreaking paper, “Attention is All You Need,” marking a watershed moment in the field of natural language processing. It addressed key limitations of earlier models, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), which struggled with long-distance dependencies and had to process tokens one at a time. These limitations made it difficult to build effective language models with RNNs or LSTMs, which are computationally inefficient to train and prone to problems such as vanishing gradients. The Transformer overcame these obstacles, fundamentally changing the field and laying the groundwork for modern large language models.

2.1 Key Innovations of the Transformer Architecture

Self-Attention Mechanism: Unlike RNNs, which process tokens sequentially and struggle with long-distance dependencies, the Transformer uses a self-attention mechanism to measure the importance of each token relative to others. This allows the model to dynamically focus on relevant parts of the input. The mathematical formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Here, Q, K, and V are the query, key, and value matrices, and $d_k$ is the dimension of the keys. The self-attention mechanism enables parallel computation, speeding up training while enhancing understanding of global context.

Multi-Head Attention: Multiple attention heads operate in parallel, each focusing on different aspects of the input. Their outputs are concatenated and transformed to achieve richer contextual representations.

Feed-Forward Networks and Layer Normalization: Each Transformer layer includes a feed-forward network applied to each token, along with layer normalization and residual connections. These operations stabilize the training process and support deeper architectures.
Positional Encoding: Since the Transformer does not inherently encode the order of tokens, positional encodings (sine and cosine functions of position at different frequencies) are added to the token embeddings. This preserves word-order information without sacrificing parallelization.
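To make two of these components concrete, here is a minimal plain-Python sketch of scaled dot-product attention and sinusoidal positional encoding. It works on small lists of lists rather than tensors, uses a single head with no learned projections, and the toy inputs are illustrative. The constant 10000 follows the original Transformer paper.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for small list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Toy inputs (illustrative values only): two tokens with 2-dimensional vectors.
X = [[1.0, 0.0], [0.0, 1.0]]
result = attention(X, X, X)
pe0 = positional_encoding(0, 4)
```

Because every query row is handled independently of the others, the loop over `Q` could run in parallel, which is the property that makes Transformer training efficient on modern hardware.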

2.2 Impact on Language Modeling

Scalability: The Transformer enables fully parallelized computation, making it feasible to train large-scale models on large datasets.
Context Understanding: The self-attention mechanism captures both local and global dependencies, enhancing the coherence and contextual awareness of the text.

The introduction of the Transformer architecture laid the foundation for building large-scale, efficient language models capable of handling complex tasks with unprecedented accuracy and flexibility.

3. The Era of Pretrained Transformer Models (2018 – 2020)

The introduction of the Transformer architecture in 2017 ushered in a new era in natural language processing, characterized by the rise of pretrained models and unprecedented attention to model scale. During this period, two influential model families emerged: BERT and GPT, showcasing the powerful potential of large-scale pretraining and fine-tuning paradigms.

3.1 BERT: Bidirectional Context Understanding (2018)

In 2018, Google launched BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking model that utilizes the encoder of the Transformer, achieving state-of-the-art performance across a wide range of natural language processing tasks. Unlike previous models that processed text unidirectionally (left-to-right or right-to-left), BERT employs a bidirectional training approach, allowing it to capture context from both directions simultaneously. By generating deep, context-rich text representations, BERT excelled in language understanding tasks such as text classification, named entity recognition (NER), and sentiment analysis.

Key innovations of BERT include:

Masked Language Modeling (MLM): Instead of predicting the next word in the sequence, BERT is trained to predict masked tokens in a sentence. This forces the model to consider the entire context of the sentence—both preceding and succeeding words—when making predictions. For example, for the sentence “The cat sat on the [MASK] mat,” BERT learns to predict “soft” based on the surrounding context.
Next Sentence Prediction (NSP): In addition to MLM, BERT is trained on a secondary task called next sentence prediction, where the model learns to predict whether two sentences are consecutive in a document. This helps BERT perform well on tasks requiring an understanding of the relationship between sentences, such as question answering and natural language inference.
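The masking step behind MLM can be sketched as follows. This is a simplified illustration: it replaces every selected token with `[MASK]`, whereas BERT selects about 15% of tokens and sometimes substitutes a random token or leaves the original in place. The function name and the 0.5 masking rate in the example are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels

tokens = "the cat sat on the soft mat".split()
# A high masking rate is used here only so the short example shows some masks.
masked, labels = mask_tokens(tokens, mask_prob=0.5, rng=random.Random(1))
```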

The impact of BERT: BERT’s bidirectional training led to breakthrough performance on benchmark tests such as GLUE (General Language Understanding Evaluation) and SQuAD (Stanford Question Answering Dataset). Its success highlighted the importance of contextual embeddings—representations that dynamically change based on surrounding words—paving the way for a new generation of pretrained models.

3.2 GPT: Generative Pretraining and Autoregressive Text Generation (2018 – 2020)

While BERT focused on bidirectional context understanding, OpenAI’s GPT series adopted a different strategy, emphasizing generative capabilities through autoregressive pretraining. By utilizing the decoder of the Transformer, GPT models excelled as autoregressive language models for text generation.

GPT-1 (2018): The first version of GPT released in 2018 is a large-scale Transformer model trained to predict the next word in a sequence, similar to traditional language models.

Unidirectional Autoregressive Training: GPT is trained using a causal language modeling objective, where the model predicts the next token based only on preceding tokens. This makes it particularly suitable for generative tasks such as text completion, summarization, and dialogue generation.
Downstream Task Fine-Tuning: One of GPT’s key contributions is that it can be fine-tuned for specific downstream tasks without a task-specific architecture. By simply adding a classification head or modifying the input format, GPT can adapt to tasks like sentiment analysis, machine translation, and question answering.
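The causal language modeling objective described above comes down to two ideas: each position is trained to predict the token that follows it, and each position may only look at itself and earlier positions. A minimal sketch, with illustrative function names:

```python
def next_token_pairs(tokens):
    """Pair each token with its training target, which is the following token."""
    return list(zip(tokens[:-1], tokens[1:]))

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "sat", "down"]
pairs = next_token_pairs(tokens)
mask = causal_mask(len(tokens))
```

The lower-triangular mask is what distinguishes this setup from BERT’s bidirectional training, where every position can see the whole sentence.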

GPT-2 (2019): Building on the success of the original GPT, OpenAI released GPT-2, a larger model with 1.5 billion parameters. GPT-2 demonstrated impressive zero-shot capabilities, meaning it could perform tasks without any task-specific fine-tuning. For example, it could generate coherent articles, answer questions, and even translate between languages without being explicitly trained for those tasks.

GPT-3 (2020): The release of GPT-3 marked a turning point in the scaling of language models. With an astonishing 175 billion parameters, it pushed the boundaries of large-scale pretraining. It showcased exceptional few-shot and zero-shot learning capabilities, performing tasks with minimal or no examples during inference. GPT-3’s generative capabilities extended to creative writing, coding, and complex reasoning tasks, demonstrating the potential of ultra-large-scale models.

3.3 The Impact of GPT and the Role of Scale

The launch of GPT models, particularly GPT-3, marked a transformative era in the field of artificial intelligence, showcasing the powerful capabilities of autoregressive architectures and generative abilities. These models opened up new possibilities for applications such as content creation, conversational agents, and automated reasoning, achieving human-like performance across a wide range of tasks. GPT-3, with its 175 billion parameters, demonstrated the profound impact of scale, establishing new benchmarks in AI capabilities for larger models trained on extensive datasets.

Language modeling performance steadily improves with increases in model size, dataset size, and computational resources used in training.
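One common way to express this trend, following the scaling-law literature, is a power law in each resource when the other two are not the bottleneck. The form below is only a sketch: the constants $N_c$, $D_c$, $C_c$ and the exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ are fitted empirically and are not given in this article.

```latex
\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\]
```

Here $L$ is the test loss, $N$ the number of model parameters, $D$ the dataset size in tokens, and $C$ the training compute. Small positive exponents mean the loss keeps falling as each resource grows, but with diminishing returns.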

Between 2018 and 2020, the field was driven by an unrelenting pursuit of scale. Researchers found that as model sizes grew from millions to billions of parameters, they performed better at capturing complex patterns and generalizing to new tasks. This scale effect was supported by three key factors:

Dataset Size: Larger models require large-scale datasets for pretraining. For example, GPT-3 was trained on a massive corpus of internet text, allowing it to learn diverse language patterns and knowledge domains.
Computational Resources: The availability of powerful hardware (such as GPUs and TPUs) and distributed training techniques made efficient training of models with billions of parameters feasible.
Efficient Training Techniques: Innovations like mixed-precision training and gradient checkpointing reduced computational costs, making large-scale training more feasible within reasonable time and budget constraints.

This era of scaling not only improved the performance of language models but also laid the foundation for future breakthroughs in artificial intelligence, emphasizing the importance of scale, data, and computation in achieving state-of-the-art results.

4. Post-Training Alignment: Bridging the Gap Between AI and Human Values (2021 – 2022)

The large language model GPT-3, with 175 billion parameters, can generate text that is nearly indistinguishable from human writing, raising serious concerns about the authenticity and credibility of AI-generated content. While this achievement marks an important milestone in AI development, it also highlights the critical challenge of ensuring that these models align with human values, preferences, and expectations. A major issue is “hallucination,” where the content generated by a large language model is factually incorrect, nonsensical, or contradictory to the input prompt, yet is stated with complete confidence. To address these challenges, researchers in 2021 and 2022 focused on improving the alignment of large language models with human intent and reducing hallucination, leading to the development of techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

4.1 Supervised Fine-Tuning (SFT)

The first step in enhancing GPT-3’s alignment capabilities is supervised fine-tuning (SFT), a foundational component of the RLHF framework. SFT is akin to instruction tuning, involving training the model on high-quality input-output pairs (i.e., examples) to teach it how to follow instructions and generate desired outputs.

These examples are carefully curated to reflect the expected behaviors and outcomes, ensuring the model learns to generate accurate and contextually appropriate responses.
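A minimal sketch of how one such input-output pair might be turned into a training example is shown below. It assumes, as is common but not universal, that only the response tokens contribute to the loss while the prompt serves as context. The function names, the token lists, and the per-token loss values are hypothetical.

```python
def build_sft_example(prompt_tokens, response_tokens):
    """Concatenate prompt and response; mark which positions contribute to the loss."""
    input_tokens = prompt_tokens + response_tokens
    # Assumption: only response tokens are supervised; prompt tokens are context.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return input_tokens, loss_mask

def masked_average(per_token_losses, loss_mask):
    """Average the per-token losses over supervised positions only."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)

prompt = ["Translate", "to", "French", ":", "cat"]
response = ["chat", "<eos>"]
inputs, mask = build_sft_example(prompt, response)

# Hypothetical per-token losses, one per input position.
losses = [0.9, 0.8, 0.7, 0.6, 0.5, 0.2, 0.4]
avg = masked_average(losses, mask)
```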

However, SFT alone has limitations:

Scalability: Collecting human examples is both labor-intensive and time-consuming, especially for complex or niche tasks.
Performance: Simply mimicking human behavior does not guarantee that the model will surpass human performance or generalize well to unseen tasks.

To overcome these challenges, a more scalable and efficient approach was needed, paving the way for the next step—reinforcement learning from human feedback (RLHF).

4.2 Reinforcement Learning from Human Feedback (RLHF)

RLHF, which OpenAI applied to instruction-following language models in its 2022 InstructGPT work, addressed the scalability and performance limitations of SFT. Unlike SFT, which requires humans to write complete outputs, RLHF asks humans to rank multiple outputs generated by the model by quality. Ranking is faster than writing, so preference data can be collected and labeled more efficiently, significantly improving scalability.

The RLHF process consists of two key stages:

Training a Reward Model: Human annotators rank multiple outputs generated by the model, creating a preference dataset. This data is used to train a reward model that learns to assess the quality of outputs based on human feedback.
Fine-Tuning the Large Language Model Using Reinforcement Learning: The large language model is fine-tuned with Proximal Policy Optimization (PPO), a reinforcement learning algorithm, using the reward model’s scores as the reward signal. Through iterative updates, the model learns to generate outputs that better align with human preferences and expectations.
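The first stage can be illustrated with the pairwise preference loss commonly used to train reward models, in which the reward of the preferred output is pushed above the reward of the less preferred one. This is a sketch under that assumption; the function names and the reward values are illustrative, and the PPO stage is not shown.

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred output scores higher."""
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))

def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, less_preferred) pairs."""
    return [
        (ranked_outputs[i], ranked_outputs[j])
        for i in range(len(ranked_outputs))
        for j in range(i + 1, len(ranked_outputs))
    ]

pairs = ranking_to_pairs(["A", "B", "C"])
good = pairwise_loss(2.0, 0.0)   # reward model agrees with the annotator
bad = pairwise_loss(0.0, 2.0)    # reward model disagrees with the annotator
```

A single ranking of several outputs yields many training pairs, which is one reason ranking is a more efficient form of labeling than writing full demonstrations.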

This two-stage process combining SFT and RLHF enables the model not only to accurately follow instructions but also to adapt to new tasks and continuously improve. By integrating human feedback into the training loop, RLHF significantly enhances the model’s ability to generate reliable outputs that align with human expectations, setting new standards for AI alignment and performance.

4.3 ChatGPT: Advancing Conversational AI (2022)

In March 2022, OpenAI launched GPT-3.5, an upgraded version of GPT-3 with the same architecture but improved training and fine-tuning. Key enhancements include better adherence to instructions through optimized data, reduced hallucination (though not completely eliminated), and the use of more diverse and up-to-date datasets to generate more relevant and context-aware responses.

Building on GPT-3.5 and InstructGPT, OpenAI launched ChatGPT in November 2022, a groundbreaking conversational AI model fine-tuned for natural multi-turn dialogues. Key improvements of ChatGPT include:

Dialogue-Focused Fine-Tuning: ChatGPT is trained on a large dataset of dialogues, excelling at maintaining context and coherence in conversations, resulting in more engaging and human-like interactions.
RLHF: By integrating RLHF, ChatGPT learns to generate responses that are not only useful but also honest and harmless. Human trainers rank responses based on quality, allowing the model to iteratively improve its performance.

The launch of ChatGPT marked a pivotal moment in the field of artificial intelligence, often referred to as the “ChatGPT Moment,” as it demonstrated the potential of conversational AI to transform human-computer interaction.

5. Multimodal Models: Connecting Text, Images, and More (2023 – 2024)

Between 2023 and 2024, multimodal large language models (MLLMs) like GPT-4V and GPT-4o redefined artificial intelligence by integrating text, images, audio, and video into a unified system. These models expanded the capabilities of traditional language models, achieving richer interactions and more complex problem-solving.

5.1 GPT-4V: The Combination of Vision and Language

In 2023, OpenAI launched GPT-4V, which combines the language capabilities of GPT-4 with advanced computer vision. It can interpret images, generate image captions, answer visual questions, and infer contextual relationships within visual content. Its ability to process text and image inputs together makes it valuable in fields like healthcare (e.g., analyzing medical images) and education (e.g., interactive learning tools).

5.2 GPT-4o: The Frontiers of All Modalities

In May 2024, GPT-4o further expanded multimodality by integrating audio and video inputs. It operates in a unified representation space and can transcribe speech, describe video, or synthesize audio from text. Real-time interaction and enhanced creativity, such as generating multimedia content, make it a versatile tool in industries like entertainment and design.

Real-World Impact

Multimodal large language models have fundamentally transformed industries such as healthcare (diagnosis), education (interactive learning), and creative industries (multimedia production). Their ability to handle multiple modalities opens up new possibilities for innovation.

6. Open Source and Open Weight Models (2023 – 2024)

During the period from 2023 to 2024, there was a growing momentum for open-source and open-weight AI models, allowing advanced AI technologies to be more widely applied.

Open-weight large language models provide publicly accessible model weights with minimal restrictions. This allows users to fine-tune and adapt them, although the training data and training code often remain closed. They are suitable for rapid deployment. Examples include Meta AI’s LLaMA series and Mistral AI’s Mistral 7B and Mixtral 8x7B.

Open-source large language models make the underlying code and structure publicly available. This enables comprehensive understanding, modification, and customization of models, fostering innovation and adaptability. Examples include OPT and BERT.

Community-driven innovation: Platforms like Hugging Face facilitate collaboration, and tools like LoRA and PEFT enable efficient fine-tuning. The community has developed specialized models for fields such as healthcare, law, and creativity while emphasizing ethical AI practices.

With the emergence of cutting-edge alignment technologies, the open-source community is currently in an exciting phase. This progress has led to the release of more outstanding open-weight models, and the gap between closed-source and open-weight models is steadily narrowing. The Llama 3.1 405B model is a landmark in this respect, as it marks the first time an open-weight model came close to matching closed-source models.

7. Reasoning Models: Transition from System 1 Thinking to System 2 Thinking (2024)

In 2024, the development of artificial intelligence began to emphasize enhancing reasoning abilities, moving from simple pattern recognition to more logical and structured thinking processes. This transition is influenced by cognitive psychology’s dual-process theory, which divides thinking into System 1 (fast, intuitive) and System 2 (slow, analytical). While previous models such as GPT-3 and GPT-4 excelled at System 1 tasks like text generation, they fell short in deep reasoning and problem-solving.

7.1 OpenAI-o1: A Huge Leap in Reasoning Ability

OpenAI released o1-preview in September 2024 and the full o1 model in December 2024, aiming to enhance artificial intelligence’s reasoning abilities, with particular strength in complex tasks like code generation and debugging. A key feature of the o1 models is a Chain of Thought (CoT) process, which breaks complex problems into smaller, more manageable steps.

Chain of Thought (CoT): The o1 model takes more time to “think” by generating a chain of thought before providing an answer, enhancing its complex reasoning abilities in fields such as science and mathematics. The model’s accuracy increases roughly in proportion to the logarithm of the compute spent thinking before it answers.
Variants: The o1 model suite includes o1, o1-mini, and o1 pro. The o1-mini is faster and more cost-effective than the o1-preview, suitable for programming and STEM-related tasks, although it may lack the breadth of world knowledge compared to o1-preview.
Performance: The o1-preview achieved near-PhD-level performance on benchmarks in physics, chemistry, and biology. On the American Invitational Mathematics Examination (AIME), a qualifying exam for the International Mathematical Olympiad, it solved 83% of the problems, while GPT-4o solved only 13%. In Codeforces programming competitions, o1-preview ranked in the top 11%.

The release of OpenAI-o1 is a pivotal moment in the development of artificial intelligence, demonstrating the potential to combine generative capabilities and reasoning abilities to create models that think and behave more like humans. As the field continues to evolve, reasoning models are expected to open new frontiers in artificial intelligence, enabling machines to tackle some of the most challenging problems faced by humans.

8. Cost-Effective Reasoning Models: DeepSeek-R1 (2025)

Large language models typically require substantial computational resources during training and inference. Cutting-edge large language models like GPT-4o and OpenAI-o1, due to their closed-source nature, limit the widespread use of advanced AI technologies.

8.1 DeepSeek-V3

In late December 2024, DeepSeek-V3 emerged as a cost-effective open-weight large language model, setting a new standard for AI accessibility. DeepSeek-V3 can compete with top products like OpenAI’s ChatGPT at a significantly lower reported training cost, estimated at approximately $5.6 million, a fraction of what Western companies invest. The model has 671 billion parameters in total, of which about 37 billion are activated for each token. It employs a mixture of experts (MoE) architecture, which splits the feed-forward layers into many expert sub-networks and routes each token to only a few of them, reducing the computation needed per token. DeepSeek-V3 also incorporates engineering optimizations, such as more efficient management of the key-value cache, and further advances the mixture of experts approach. The model relies on three key architectural components:

Multi-Head Latent Attention (MLA): Reduces memory usage by compressing the keys and values of attention while maintaining performance, utilizing Rotary Positional Embedding (RoPE) to enhance positional information.

DeepSeek’s Mixture of Experts (DeepSeekMoE): Combines shared experts and routed experts in feed-forward networks (FFNs) to enhance efficiency and balance expert utilization.

Multi-Token Prediction: Trains the model to predict several future tokens at each position rather than only the next one, which enhances its ability to generate coherent and contextually relevant outputs, particularly for tasks requiring complex sequence generation.
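The idea of combining an always-active shared expert with a few routed experts can be sketched generically as follows. This is not DeepSeek’s implementation: the softmax gating, the renormalization over the selected experts, the scalar “experts”, and the router scores are all simplifying assumptions made for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, shared_expert, routed_experts, router_scores, top_k=2):
    """Combine one always-on shared expert with the top_k routed experts."""
    gates = softmax(router_scores)
    # Pick the top_k experts with the largest gate values for this token.
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)
    # Only the chosen experts are evaluated; the rest cost nothing for this token.
    routed_out = sum(gates[i] / norm * routed_experts[i](x) for i in chosen)
    return shared_expert(x) + routed_out, chosen

# Hypothetical scalar "experts"; real experts are feed-forward networks.
shared = lambda x: x
experts = [lambda x: 2 * x, lambda x: 3 * x, lambda x: -x, lambda x: 0.5 * x]
scores = [0.1, 2.0, -1.0, 1.5]   # router logits for one token (made up)

y, chosen = moe_layer(1.0, shared, experts, scores, top_k=2)
```

With four routed experts and `top_k=2`, only half of the routed parameters are used for this token, which mirrors on a small scale how a model can hold far more parameters than it activates per token.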

DeepSeek’s low-cost models went on to trigger a global sell-off in tech stocks in late January 2025, putting roughly $1 trillion of market capitalization at risk as Nvidia’s shares fell 13% in pre-market trading. DeepSeek’s API price of $2.19 per million output tokens for its reasoning model is about one-thirtieth the cost of similar models from OpenAI.

8.2 DeepSeek-R1-Zero and DeepSeek-R1

Just a month after DeepSeek-V3, in late January 2025, DeepSeek released DeepSeek-R1-Zero and DeepSeek-R1, which caused a sensation by showcasing exceptional reasoning capabilities at extremely low training costs. With advanced reinforcement learning techniques, these models showed that high-performance reasoning can be achieved without the exorbitant computational costs typically required for cutting-edge AI. This breakthrough solidified DeepSeek’s leading position in efficient and scalable AI innovation.

DeepSeek-R1-Zero: This is a reasoning model built on DeepSeek-V3 and trained with reinforcement learning (RL) to improve its reasoning abilities. It skips the supervised fine-tuning phase entirely, starting training directly from the pretrained DeepSeek-V3-Base model. It uses the Group Relative Policy Optimization (GRPO) algorithm, which scores each sampled answer relative to the other answers sampled for the same prompt, together with rewards computed from predefined rules, which keeps the training process simple and scalable.

DeepSeek-R1: To address the issues of low readability and language mixing in DeepSeek-R1-Zero, DeepSeek-R1 introduces a set of limited high-quality cold-start data and undergoes additional reinforcement learning training. The model is fine-tuned and reinforced through multiple stages, including rejection sampling and a second round of reinforcement learning training, to enhance its general capabilities and align it more closely with human preferences.

Distilled Version of DeepSeek Model: DeepSeek developed smaller distilled versions of the DeepSeek-R1 model with parameter sizes ranging from 1.5 billion to 70 billion, enabling advanced reasoning capabilities even on less powerful hardware. These models are fine-tuned using synthetic data generated by the original DeepSeek-R1 to ensure excellent performance on reasoning tasks while being lightweight enough for local deployment.
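The group-relative scoring at the heart of GRPO, used to train DeepSeek-R1-Zero as described above, can be sketched as follows. The exact-match reward rule, the small `eps` term, and the sample answers are illustrative assumptions, and the policy update that consumes these advantages is not shown.

```python
import statistics

def rule_based_reward(answer, reference):
    """Hypothetical rule: 1.0 for an exact match with the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every answer in the group scores the same.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt (illustrative).
answers = ["42", "41", "42", "7"]
rewards = [rule_based_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group’s own average, no separate value network is needed: answers that beat their siblings get a positive advantage, and the rest get a negative one.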

DeepSeek-R1 exhibited competitive performance across various benchmarks in mathematics, coding, common sense, and writing. Depending on usage patterns, it can save significant costs compared to competitors like OpenAI’s o1 model, with costs being 20 to 50 times lower.

8.3 Impact on the AI Industry

The launch of DeepSeek-R1 challenges the existing landscape of artificial intelligence, enabling wider use of advanced large language models and fostering a more competitive ecosystem. Its affordability and accessibility are expected to drive adoption and innovation across various industries. Recently, leading cloud service providers like AWS, Microsoft, and Google Cloud have offered DeepSeek-R1 on their platforms. Smaller cloud providers and DeepSeek’s parent company also provide the model at competitive prices.

Conclusion

From the birth of the Transformer architecture in 2017 to the development of DeepSeek-R1 in 2025, the evolution of large language models has written a revolutionary chapter in the field of artificial intelligence. Four milestones stand out in the rise of large language models:

Transformer (2017): The introduction of the Transformer architecture laid the foundation for building large-scale, efficient models capable of handling complex tasks with unprecedented precision and flexibility.
GPT-3 (2020): This model demonstrated the transformative power of scale in artificial intelligence, proving that large models trained on extensive datasets can achieve human-like performance across a wide range of applications, setting new benchmarks for AI development.
ChatGPT (2022): ChatGPT brought conversational AI into the mainstream, making advanced AI more accessible to ordinary users and enhancing interactivity. It also sparked important discussions about the ethical and social implications of widespread AI adoption.
DeepSeek-R1 (2025): DeepSeek-R1 achieved a significant leap in cost-effectiveness, building on a mixture of experts architecture and optimized training algorithms and reducing operational costs by up to 50 times compared to many U.S. models. Its open-weight release enables broader applications of cutting-edge AI technologies, empowering innovators across industries and highlighting the importance of scalability, alignment, and accessibility in shaping the future of AI.

This developmental journey highlights how foundational innovations and advancements in scale, availability, and cost-effectiveness propel artificial intelligence toward a more inclusive and impactful future.