The Path to AGI: 19 Landmark Papers in AI History

Source: AI Unconventional Frank

Some people say that AI seems to have become popular only in the past two years, suddenly turning into the products we are familiar with today.
In November 2022, ChatGPT emerged like a thunderbolt. But where did the thunder come from?
From the perspective of academic research, the explosion of AI over the past two years has deep roots. Years of foundational research and discoveries led to breakthroughs across infrastructure, data, models, and applications.
Some companies, such as OpenAI, were founded with research as their original purpose. Other research institutions, though relatively obscure, have inspired countless entrepreneurs through their papers.
Today’s article draws primarily on a piece from Lightspeed Venture Partners, and I highly recommend it!
15 years, 19 papers, 4 major research camps, a comprehensive look at the past and present of AI!
I have always believed that to judge the future, one must start from the past. As a directory and research framework, it is worth collecting!
As one of the leading VCs in the United States, Lightspeed has invested in familiar names like Zoom; in the AI field it has backed Scale AI, Pony.ai, and others, and it was also an early investor in Meituan and Pinduoduo.
Over the past decade, Lightspeed has closely monitored AI research, often actively participating in helping researchers turn their ideas into groundbreaking companies. They are early supporters of Mistral, SKILD, and Snorkel, all of which stem from foundational discoveries in AI technology.
This article lists the most influential AI research papers from the past 15 years and reviews the most significant academic factions and industry innovations from the perspective of the intersection between academia and industry.

I. The Four Waves of AI
Over the past 15 years, AI research results have inspired, inherited, and developed one another. Some research has created waves in the entrepreneurial world, some scientists have become founders, and some research institutions have transformed into great companies…
The academic community flaps its butterfly wings, and AI companies and ecosystems gradually move to the center of the world stage, impacting fields such as energy, brain-computer interfaces, and aerospace.
We observe that there are four main waves of research in AI academic exploration that interdependently drive AI to reach today’s heights—
1. Model Architecture Improvements
Since the 2010s, advancements in artificial intelligence model architectures have propelled significant breakthroughs and startup innovations.
This includes AlexNet’s 2012 work on deep convolutional neural networks and the highly acclaimed 2017 paper Attention Is All You Need, which fundamentally changed natural language processing.
2. Developer Productivity Enhancements
In the past decade, tools and frameworks have made significant progress, greatly improving developer productivity, which is crucial for the development of startups.
Milestones include the 2015 launch of TensorFlow (followed by other frameworks such as PyTorch), the 2018 release of the Hugging Face Transformers library, and Meta’s open-sourcing of the Llama models in 2023.
3. Task Performance Optimization
Several different papers published in the past 10 years have fundamentally changed the efficiency and diversity of AI in performing tasks:
Training deep neural networks to perform complex tasks; jointly learning to align and translate in neural machine translation, which reduced training complexity; breakthroughs in unsupervised learning that improved task performance without any fine-tuning; and the use of retrieval-augmented generation (RAG) with external data stores to handle knowledge-intensive tasks.
4. Computational Optimization
In the 2010s, new optimization techniques like dropout and batch normalization improved model performance and stability. In 2020, OpenAI’s landmark paper emphasized how model performance predictably scales with increased computational resources.
This was followed by DeepMind’s work in 2022, which demonstrated the importance of balancing model size and the amount of training data for optimal performance.
II. The Genealogy of AI Research
Let the charts below tell the story—
Chart: Google-related papers and researchers
Chart: OpenAI-related papers and researchers
Chart: Facebook-related papers and researchers, pointing to the familiar Guo Wenjing (Pika)
Chart: Stanford-related papers and researchers; slightly sparse, but with extremely significant results
III. Early Breakthroughs
Early papers laid the foundation for today’s AI ecosystem by introducing frameworks, models, and methods that became the basis for startups and subsequent research. The frameworks proposed in these papers, such as Transformers, GPT, TensorFlow, and BERT, introduced new architectures and methods for natural language processing, language model training, and model fine-tuning. Below, the 19 disruptive papers are presented chronologically.
2012
ImageNet Classification with Deep Convolutional Neural Networks
The paper “ImageNet Classification with Deep Convolutional Neural Networks” (2012) by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, often referred to as AlexNet (after first author Alex Krizhevsky), marked a milestone in the field of deep learning. It demonstrated that a deep convolutional neural network (CNN) with five convolutional layers achieved significantly better results on the ImageNet dataset than previous methods, dispelling doubts and proving the feasibility of deep learning architectures for complex tasks like image classification.
The paper also emphasizes the importance of utilizing GPUs (graphics processing units) for training deep CNNs. GPUs are faster at handling the parallel computations involved in training, making large-scale training possible.
Paper link:
https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
2015
Deep Residual Learning for Image Recognition
The paper “Deep Residual Learning for Image Recognition” (2015) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun addressed the problem of performance degradation when training deep CNNs. As networks became deeper, accuracy tended to stabilize or even decline. This paper introduced the concept of residual learning, reconfiguring the layers in CNNs to allow them to learn a modified residual function of the input rather than trying to learn the entire mapping from scratch, enabling networks to more easily learn identity mappings and achieve deeper architectures.
Residual connections allowed researchers to train CNNs that were deeper than ever before. Residual connections have now become fundamental building blocks of most modern model architectures, including very successful models like ResNet (the model from the original paper), Inception, and DenseNet.
Paper link: https://arxiv.org/abs/1512.03385
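To make the idea of a residual connection concrete, here is a minimal sketch of a residual block written in PyTorch. It is purely illustrative (layer sizes and names are arbitrary) and is not a reproduction of the original ResNet code.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # shortcut path
        out = self.relu(self.bn1(self.conv1(x)))  # first conv layer
        out = self.bn2(self.conv2(out))           # second conv layer
        out = out + identity                      # learn the residual F(x), not the full mapping
        return self.relu(out)
```

The `out + identity` addition is the whole trick: if the best thing a block can do is nothing, it only has to drive F(x) toward zero, which is far easier to learn than an explicit identity mapping.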
2015
Neural Machine Translation by Jointly Learning to Align and Translate
The paper “Neural Machine Translation by Jointly Learning to Align and Translate” (2015) by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio addressed the difficulty traditional neural machine translation (NMT) models had in accurately aligning elements between source and target sentences, which led to problems like information loss or incorrect word order in translated text. The paper introduced a new architecture in which the model learns to align and translate jointly, allowing it to better capture the relationships between words and phrases in the source and target languages. The joint learning approach helps the model generate more accurate and fluent translations and simplifies training compared to separate alignment and translation models.
Paper link: https://arxiv.org/abs/1409.0473
2016
TensorFlow: A system for large-scale machine learning
The paper “TensorFlow: A system for large-scale machine learning” (2016) by Martín Abadi et al. had a significant impact on the productivity of machine learning developers. It allows developers to define machine learning models without writing low-level code for numerical computation, thereby streamlining the development process and reducing the time required to build and experiment with models.
Additionally, TensorFlow can be deployed on various hardware platforms, including CPUs, GPUs, and TPUs (tensor processing units). This flexibility allows developers to choose the best hardware for their specific needs and efficiently train large models.
Paper link:
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=NMS69lQAAAAJ&citation_for_view=NMS69lQAAAAJ:JqN3CTdJtl0C
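As a rough illustration of the productivity gain, the sketch below defines and compiles a small classifier with the Keras API bundled in TensorFlow (a higher-level interface than the graph code shown in the original paper); the layer sizes and dataset shape are arbitrary, and the training call is commented out.

```python
import tensorflow as tf

# Declarative model definition: no hand-written numerical kernels are required.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                    # e.g. flattened 28x28 images
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# TensorFlow handles the underlying computation and can place it on CPU, GPU, or TPU.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training then reduces to a single call:
# model.fit(x_train, y_train, epochs=5)
```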
2017
Attention Is All You Need
The paper “Attention Is All You Need” (2017) by Ashish Vaswani et al. represents a major breakthrough in model architecture. Before this paper, most sequence-to-sequence models relied on recurrent neural networks (RNNs) or convolutional neural networks (CNNs) to capture relationships between elements in a sequence. Due to the sequential nature of RNNs, training can be particularly slow.
This paper proposed a new architecture, the Transformer, which relies entirely on an attention mechanism called “self-attention.” This allows the model to focus directly on relevant parts of the input sequence, leading to better understanding of long-distance dependencies.
The Transformer architecture accelerates training by eliminating RNNs and performs exceptionally well on machine translation tasks, widely applicable to tasks such as text summarization, question answering, and text generation.
Paper link: https://arxiv.org/abs/1706.03762
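The core operation is compact enough to state in a few lines. Below is a minimal NumPy sketch of single-head scaled dot-product attention, the building block the paper stacks into multi-head self-attention; it omits masking, multiple heads, and the learned Q/K/V projections, so it is a sketch of the mechanism rather than a full Transformer layer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (single head, no mask)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of value vectors

# Toy self-attention: a sequence of 4 tokens with 8-dimensional embeddings.
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)         # Q = K = V = x
print(out.shape)  # (4, 8)
```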
2019
Language Models are Unsupervised Multitask Learners
The paper “Language Models are Unsupervised Multitask Learners” (2019) by Alec Radford et al. discusses the potential of unsupervised learning, in which models learn from large amounts of unlabeled text data.
In the past, training large language models (LLMs) relied on supervised learning, which required large amounts of labeled data for each desired task. By training on vast amounts of unlabeled text, LLMs naturally learn to perform a variety of tasks (multitask learning) without explicit task-specific supervision.
Unsupervised learning can also improve efficiency—when fine-tuned for specific tasks, LLMs can learn from smaller amounts of labeled data.
Paper link:
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=dOad5HoAAAAJ&citation_for_view=dOad5HoAAAAJ:YsMSGLbcyi4C
2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
The paper “RoBERTa: A Robustly Optimized BERT Pretraining Approach” (2019) by Yinhan Liu et al. focuses on improving the pretraining process of BERT (Bidirectional Encoder Representations from Transformers). Compared with BERT, RoBERTa generally performs better on various NLP tasks and converges faster during training, allowing developers to iterate on models more quickly and spend less time tuning hyperparameters during the fine-tuning stage.
Although RoBERTa was not as revolutionary or well known as its predecessor, it is distinctive in that several of its co-authors went on to build out the AI ecosystem by founding or leading startups, including executives at Tome, Character.ai, and Birch.ai.
Paper link:
https://arxiv.org/abs/1907.11692
2019
Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences
The paper “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences” (2019) by Alexander Rives et al. explores the use of unsupervised learning on large protein sequence datasets (250 million) to learn the inherent properties of proteins.
Traditionally, analyzing protein structure and function relied on techniques that required labeled data (e.g., experimentally determined structures). By training deep learning models on a large amount of unlabeled sequence data, the model can learn representations that capture important biological information about proteins, including secondary structure, inter-residue contacts, and even potential biological activity.
Paper link:
https://www.pnas.org/doi/abs/10.1073/pnas.2016239118
IV. Recent Developments
After 2020, the pace of AI development and application has accelerated.
Recent AI research has made significant progress in learning and processing, making technology more efficient and scalable to a wider range of applications.
We have also seen AI solutions applied in the real world, with startups based on early models flourishing and new startups continually emerging based on new models.
2020
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
The paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (2020) by Patrick Lewis et al. discusses how LLMs trained on vast amounts of text data often struggle to complete tasks that require access to and reasoning about specific factual knowledge. This paper proposes a new model architecture called retrieval-augmented generation (RAG). RAG combines two key components—a retrieval module that retrieves relevant documents from an external knowledge base based on input prompts or questions, and a generation module that uses the retrieved documents and its own knowledge to generate responses.
This dual-memory architecture improves performance on knowledge-intensive tasks (such as question answering and summarizing factual topics) and makes the generated language more specific and factually grounded. RAG offers a solution to the knowledge-access limitation of LLMs, demonstrating that combining powerful language models with external knowledge sources yields better results on knowledge-intensive tasks.
Paper link: https://arxiv.org/abs/2005.11401
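A rough sketch of the retrieve-then-generate pattern is shown below. This is not the authors’ implementation: `embed`, `vector_index`, and `generator` are hypothetical stand-ins for an embedding model, a document index, and a sequence-to-sequence language model.

```python
def answer(question, vector_index, embed, generator, k=5):
    """Illustrative RAG-style pipeline: retrieve relevant passages, then generate."""
    query_vec = embed(question)                     # encode the question as a vector
    docs = vector_index.search(query_vec, top_k=k)  # retrieval module: fetch relevant documents
    context = "\n".join(doc.text for doc in docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generator(prompt)                        # generation module conditions on retrieved text
```

The design point is the split of responsibilities: the document index can be updated or swapped without retraining the language model, which is exactly what makes the approach attractive for knowledge-intensive tasks.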
2020
Transformers: State-of-the-Art Natural Language Processing
The paper “Transformers: State-of-the-Art Natural Language Processing” (2020) by Thomas Wolf et al. presents Hugging Face Transformers, a popular open-source library built around the Transformer architecture. It provides a wealth of pre-trained Transformer models for various NLP tasks and offers a user-friendly API, allowing developers to focus on fine-tuning models for their specific needs rather than training models from scratch, saving a great deal of time and resources.
Paper link: https://aclanthology.org/2020.emnlp-demos.6/
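For example, the library’s high-level pipeline API lets a developer apply a pre-trained model in a few lines. The snippet below is a minimal sketch: the default model is downloaded on first use, and the exact labels and scores will vary by model version.

```python
from transformers import pipeline

# Load a pre-trained sentiment-analysis model and run inference directly,
# with no training code and no manual tokenization.
classifier = pipeline("sentiment-analysis")
result = classifier("Residual connections made very deep networks trainable.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```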
2020
Language Models Are Few-Shot Learners
The paper “Language Models Are Few-Shot Learners” (2020) by Tom B. Brown et al. demonstrates that LLMs can learn new tasks from just a few examples (few-shot learning), making them suitable for the many tasks where obtaining a large amount of labeled data is costly or difficult.
This challenged the traditional view that LLMs always require large amounts of task-specific data to perform well and highlighted their few-shot learning ability: supplying just a few examples in the prompt can yield surprisingly good performance on new tasks, improving sample efficiency and speeding up deployment, since models can adapt quickly even when labeled data is scarce.
Paper link: https://arxiv.org/pdf/2005.14165
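To illustrate what “few-shot” means in practice, the sketch below builds a prompt in which the task is specified entirely through in-context examples (the English-to-French pairs echo the style used in the paper). The completion call is a hypothetical placeholder rather than a specific API.

```python
# Few-shot prompting: the task is defined by examples inside the prompt itself,
# with no gradient updates and no labeled training set.
few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# completion = some_llm.complete(few_shot_prompt)  # hypothetical client call
# A capable model is expected to continue with "fromage".
```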
2020
Scaling Laws for Neural Language Models
The paper “Scaling Laws for Neural Language Models” (2020) by Jared Kaplan et al. quantifies the relationship between model size, data size, compute, and performance, achieving significant breakthroughs in understanding how to optimize computational resources for training large language models (LLMs).
By understanding these scaling laws, researchers and developers can make informed decisions about how to allocate computational resources for training LLMs.
Paper link: https://arxiv.org/pdf/2001.08361
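In rough terms, the paper’s central finding is that test loss falls as a power law in each resource when the other resources are not the bottleneck. A schematic version of the fitted relationships (with the fitted constants omitted here) looks like this:

```latex
% Schematic scaling laws: N = parameters, D = training tokens, C = compute.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```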
2021
Efficiently Modeling Long Sequences with Structured State Spaces
The paper “Efficiently Modeling Long Sequences with Structured State Spaces” (2021) by Albert Gu et al. proposes a new method for handling long sequences using state space models (SSMs). RNNs and CNNs struggle to capture long-range dependencies in very long sequences (thousands of elements or more). The paper’s model, S4, addresses this by using SSMs, which are theoretically better suited to modeling long-range dependencies.
S4 also introduces a new parameterization technique called “structured state spaces,” which provides a way to leverage the advantages of SSMs to handle long-distance dependencies while maintaining computational efficiency. This opens the door to building models capable of effectively handling very long sequences while being faster to train and use compared to traditional methods.
Paper link: https://arxiv.org/abs/2111.00396
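At the heart of S4 is the classical linear state space model, which maps an input signal u(t) to an output y(t) through a hidden state x(t). A schematic continuous-time formulation is shown below; S4’s contribution is a structured parameterization of the A matrix, together with efficient algorithms, that makes this formulation practical for very long sequences.

```latex
% Continuous-time linear state space model (schematic form).
x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)
```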
2022
Flamingo: a Visual Language Model for Few-Shot Learning
The paper “Flamingo: A Visual Language Model for Few-Shot Learning” (2022) by Jean-Baptiste Alayrac et al. introduces Flamingo, a visual language model (VLM) designed for few-shot learning tasks in visual language processing (VLP). While previous research primarily focused on few-shot learning in either language or vision, Flamingo specifically addresses the challenges in the combined VLP domain. Flamingo utilizes pre-trained models for image understanding and language generation, reducing the amount of data needed for fine-tuning.
Paper link: https://arxiv.org/abs/2204.14198
2022
Training Compute-Optimal Large Language Models
The paper “Training Compute-Optimal Large Language Models” (2022) by Jordan Hoffmann et al. explores the idea of a compute-optimal budget for training LLMs, arguing that current large models are often undertrained because research has focused on scaling model size while keeping the amount of training data roughly constant; compute-optimal training requires scaling model size and training data in roughly equal proportion. The paper introduces Chinchilla, a large language model trained under this compute-optimal recipe.
Paper link: https://arxiv.org/abs/2203.15556
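A commonly cited rule of thumb distilled from the paper is roughly 20 training tokens per model parameter. As a worked example under that approximation (the exact optimum depends on the compute budget), the 70-billion-parameter Chinchilla model was trained on about 1.4 trillion tokens:

```latex
% Rough compute-optimal heuristic derived from the paper's results.
D_{\text{opt}} \approx 20\,N, \qquad N = 7 \times 10^{10} \;\Rightarrow\; D \approx 1.4 \times 10^{12}\ \text{tokens}
```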
2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
The paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022) by Jason Wei et al. observes that LLMs can produce seemingly correct answers without revealing the reasoning behind them. Chain-of-thought prompting significantly improves how LLMs handle reasoning tasks: by including examples of intermediate reasoning steps in the prompt, it guides the model to explicitly work through its reasoning step by step when solving a problem.
LLMs prompted with this technique show better performance on reasoning tasks such as mathematical word problems, commonsense question answering, and symbolic manipulation.
Paper link: https://arxiv.org/abs/2201.11903
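The sketch below contrasts a standard prompt with a chain-of-thought prompt; the arithmetic examples are adapted from the paper’s well-known running example, and the model-call line is a hypothetical placeholder.

```python
# Standard prompt: the model is asked for the answer directly.
standard_prompt = (
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\nA:"
)

# Chain-of-thought prompt: the in-context example demonstrates intermediate reasoning steps.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many apples do they have?
A:"""

# answer = some_llm.complete(cot_prompt)  # hypothetical client call
# Expected style of completion: "They used 20 of 23 apples, leaving 3. They bought 6 more,
# so 3 + 6 = 9. The answer is 9."
```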
2023
LLaMA: Open and Efficient Foundation Language Models
The paper “LLaMA: Open and Efficient Foundation Language Models” (2023) by Hugo Touvron et al. introduces the LLaMA family of foundation language models, with a focus on training efficiency. These models achieve strong performance across various NLP tasks while requiring less compute than earlier models of comparable capability, which means faster training and lower training costs.
Even with a smaller amount of fine-tuning data, LLaMA models can achieve good performance on NLP tasks, which is highly beneficial for those using limited datasets or needing to quickly adapt models for new tasks.
LLaMA models also allow developers to leverage pre-trained components to accomplish various tasks—reducing the need to build models from scratch and facilitating code reuse, thus saving development time and effort.
Paper link:
https://scholar.google.com/citations?view_op=view_citation&hl=fr&user=tZGS6dIAAAAJ&citation_for_view=tZGS6dIAAAAJ:roLk4NBRz8UC
2023
Legged locomotion in challenging terrains using egocentric vision
The paper “Legged Locomotion in Challenging Terrains Using Egocentric Vision” (2023) by Ananye Agarwal et al. addresses the key challenge of robot locomotion in rugged and complex terrains. Typically, legged robots rely on pre-built maps or complex depth sensors to navigate their surroundings, limiting their ability to adapt to unforeseen obstacles and requiring significant computational resources. This novel approach enables robots to use a single front-facing depth camera (egocentric vision) to perceive their surroundings and plan their movements in real-time, eliminating the need for pre-built maps and reducing reliance on bulky sensors.
By relying on egocentric vision, robots can react to unseen obstacles and navigate challenging terrains such as stairs, curbs, and uneven surfaces, making their movements more robust and adaptable to real-world environments.
Paper link:
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=AEsPCAUAAAAJ&pagesize=80&sortby=pubdate&citation_for_view=AEsPCAUAAAAJ:rO6llkc54NcC
2023
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
The paper “Multimodal Foundation Models: From Specialists to General-Purpose Assistants” (2023) by Chunyuan Li et al. discusses the development of multimodal foundation models that can handle a variety of tasks across different modalities (such as vision and language), marking a significant shift from traditional models focused on a single data type (e.g., image classification models that only handle images).
Multimodal foundation models achieve better performance on complex tasks, paving the way for developing AI systems that can interact with the world in a more natural and diverse way, similar to how humans use various senses to understand and respond to their surroundings.
Paper link: https://arxiv.org/abs/2309.10020
This concludes the 19 papers presented in chronological order.
In the next 5 to 10 years, emerging AI technologies will bring transformative leaps across various fields. Looking forward to the next era of cutting-edge AI technology.
References:
1. Lightspeed original address: https://lsvp.com/research-to-reality/
2. AI must go overseas, must go to Japan?
3. In 2024, the only way for small companies|a16z special position
4. After watching the pricing models of 40 AI products, I seem to have discovered the secret to $10 million in revenue
5. Comparing eight top AI companies in Japan, how many years is the technology gap between China and Japan?
6. Sequoia Capital|The latest five predictions about AI: from selling shovels to AI factories

(End)
