Andrew Ng Discusses 4 Agent Models: Workflow Efficiency of GPT-3.5 Over GPT-4

This year, the Sequoia AI Ascent 2024 conference explored many aspects of AI. In a previous article titled ‘Sequoia AI Conference: One of the Biggest Opportunities in AI is Replacing Services with Software,’ we shared insights from Sequoia partners Sonya Huang, Pat Grady, and Konstantine Buhler on the current state of AI.

In today’s article, I will primarily share Andrew Ng’s views on AI Agents from the AI Ascent 2024 conference.

One particularly interesting point Andrew Ng made in this presentation is that the value of Agent workflows is greatly underestimated: in practical applications, an Agent workflow built on GPT-3.5 can outperform GPT-4 used directly.

Ng explained that an Agent workflow does not have the LLM (Large Language Model) generate the final output in a single pass; instead, it prompts the LLM multiple times, gradually building up a higher-quality output.

Additionally, Ng discussed four modes of Agents: Reflection, Tool Use, Planning, and Multi-Agent Collaboration.

This resonates with the value of collaboration I often discuss in the context of SaaS: collaboration may remain very important in the AI era, except that it now increasingly involves AI Agents rather than only humans. I suspect that AI + Human collaboration may be the long-term state of things. Below is the full text of Andrew Ng’s presentation:

I am delighted to share my views on AI Agents today. I believe this is an exciting trend that everyone involved in AI development should pay attention to. Currently, most people use language models in a non-Agent workflow: you type a prompt and the model generates an answer in one pass, which is a bit like asking a person to write an article from start to finish without ever using the backspace key. Despite how hard that is, language models do it remarkably well.

In contrast, the Agent workflow has the AI draft an outline, decide whether research is needed, write a first draft, review that draft to decide what needs changing, and then revise it, repeating this process. The workflow is iterative: the language model does some thinking, revises the article, thinks some more, and goes around the loop several times.
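To make this concrete, here is a minimal sketch of such an iterative writing workflow. It assumes a hypothetical llm(prompt) helper standing in for any chat-model API call; the function and the prompt wording are illustrative, not part of Ng’s talk.

```python
# Minimal sketch of an iterative (agentic) writing workflow.
# llm() is a hypothetical helper standing in for any chat-model API call.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def write_essay(topic: str, rounds: int = 2) -> str:
    outline = llm(f"Write a short outline for an essay on: {topic}")
    draft = llm(f"Write a first draft following this outline:\n{outline}")
    for _ in range(rounds):
        critique = llm(f"Review this draft and list concrete improvements:\n{draft}")
        draft = llm(
            "Revise the draft to address the feedback.\n"
            f"Draft:\n{draft}\n\nFeedback:\n{critique}"
        )
    return draft
```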

Surprisingly, Agent workflows can yield very good results; I have been genuinely surprised by how well they perform. My team conducted a case study using a coding benchmark called HumanEval, which contains coding problems such as: given a list of integers, return the sum of all elements at even positions.
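For reference, a task of the kind described above might look like the following. This is an illustrative solution to “sum of elements at even positions,” not the verbatim HumanEval problem.

```python
def sum_even_positions(nums: list[int]) -> int:
    """Return the sum of the elements at even indices (0, 2, 4, ...)."""
    return sum(nums[i] for i in range(0, len(nums), 2))

# Positions 0, 2 and 4 hold 3, 4 and 5.
assert sum_even_positions([3, 1, 4, 1, 5, 9]) == 12
```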

If you use GPT-3.5 with zero-shot prompting, the accuracy is 48%, while GPT-4 achieves 67%. However, if you wrap GPT-3.5 in an Agent workflow, it actually performs better than zero-shot GPT-4, and if you apply the same kind of workflow to GPT-4, it also does very well. The fact that GPT-3.5 with an Agent workflow outperforms GPT-4 has, I believe, significant implications for how we build applications.

Agents are a frequently mentioned term, and many consulting reports discuss how Agents are the future of AI. Rather than speak in generalities, I want to share the broad design patterns I see in the Agent field. It is a chaotic and fast-moving area with a large volume of research and open-source projects, but I have tried to categorize what is happening a bit more concretely.

Four Modes of AI Agents

Reflection is a tool I believe we should all be using; it simply works, yet I think it is still under-appreciated. Planning and Multi-Agent Collaboration are at an earlier, more emerging stage: sometimes I am shocked by how well they work when I use them, but at least for now I cannot get them to work reliably every time.

Next, I will introduce these four design patterns through several slides. If any of you go back and have your engineers use these patterns, I believe you will quickly see an increase in productivity.

1. Regarding Reflection: suppose you ask the system to write code for a given task. We have a Coder Agent, which is essentially a language model, and you prompt it to write a function such as def do_task().

An example of self-reflection is to feed the language model the exact same code it just generated and say, ‘Please carefully check this code for correctness, robustness, efficiency, and good structure.’

It turns out that the same language model that you prompted to write the code may be able to identify issues, such as ‘There is a bug on line 5, which can be fixed by doing xxx.’ If you now provide its own feedback back to it and prompt it again, it may propose a better version of the code than the first version. This is not guaranteed, but in many applications, it often works and is worth trying.

If you let it run unit tests and a test fails, you can ask it why the test failed, let it work out in that dialogue what went wrong, and then have it modify the code to propose a third version. By the way, for those who want to learn more about these techniques — I am very excited about them — I have a recommended reading section at the bottom of the slide for each of these four parts, with more reference material.
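A rough sketch of this Reflection loop, reusing the hypothetical llm() helper from the earlier sketch and assuming the task comes with a pytest test file, might look like this; the file names and prompts are illustrative.

```python
import subprocess

def llm(prompt: str) -> str: ...  # hypothetical model call, as in the earlier sketch

def reflect_and_fix(task: str, test_file: str, max_rounds: int = 3) -> str:
    code = llm(f"Write a Python function for this task:\n{task}")
    for _ in range(max_rounds):
        with open("candidate.py", "w") as f:
            f.write(code)
        # Run the existing unit tests against the candidate implementation.
        result = subprocess.run(["pytest", "-q", test_file], capture_output=True, text=True)
        if result.returncode == 0:
            return code  # tests pass; keep this version
        critique = llm(
            "Check this code carefully for correctness, robustness, efficiency and structure. "
            f"The unit tests failed with:\n{result.stdout}\n\nCode:\n{code}"
        )
        code = llm(f"Rewrite the code to fix these issues:\n{critique}\n\nCode:\n{code}")
    return code
```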

Incidentally, this also foreshadows Multi-Agent Collaboration. So far I have described a single Coder Agent that you prompt to have a dialogue with itself; a natural evolution of the idea is to use two Agents instead of one, a Coder Agent and a Critic Agent.

They can be the same base language model, but you prompt them differently, such as telling one, ‘You are a professional programmer, please write code,’ and the other, ‘You are a professional code reviewer, please review this code.’ This type of workflow is actually easy to implement, and I think it is a very general technique applicable to many workflows, which will significantly improve the performance of language models.
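Here is one way such a coder/critic pair could be wired up, again using the hypothetical llm() helper; the role prompts are illustrative, not taken from the talk.

```python
def llm(prompt: str) -> str: ...  # hypothetical model call, as in the earlier sketches

CODER = "You are an expert programmer. Write clean, correct Python code."
CRITIC = "You are an expert code reviewer. Point out bugs, edge cases and structural problems."

def coder_critic_loop(task: str, turns: int = 2) -> str:
    code = llm(f"{CODER}\n\nTask: {task}")
    for _ in range(turns):
        review = llm(f"{CRITIC}\n\nReview this code:\n{code}")
        code = llm(
            f"{CODER}\n\nRevise the code to address this review.\n"
            f"Review:\n{review}\n\nCode:\n{code}"
        )
    return code
```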

2. The second design pattern is Tool Use. We have all seen examples of language-model-based systems using tools: on the left is a screenshot of GitHub Copilot, and on the right is content extracted from GPT-4. If you ask one of today’s language models, ‘What is the best coffee maker?’, it will perform a web search.

For other questions, it will generate code and run it. It turns out that many different tools are being used in this way for analysis, information gathering, taking action, and personal productivity.
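A toy sketch of the Tool Use pattern might look like the following: the model is told which tools exist and replies with a JSON tool call, which the surrounding code then executes. The tool implementations below are placeholders for illustration, not any particular product’s API.

```python
import json

def llm(prompt: str) -> str: ...  # hypothetical model call, as in the earlier sketches

TOOLS = {
    "web_search": lambda query: f"(search results for: {query})",  # placeholder
    "run_python": lambda code: str(eval(code)),  # toy evaluator; never do this with untrusted code
}

def answer_with_tools(question: str) -> str:
    plan = llm(
        'You may call exactly one tool. Reply with JSON like {"tool": ..., "argument": ...}. '
        f"Available tools: {list(TOOLS)}.\nQuestion: {question}"
    )
    call = json.loads(plan)
    observation = TOOLS[call["tool"]](call["argument"])
    return llm(f"Question: {question}\nTool result: {observation}\nNow answer the question.")
```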

Interestingly, much of the early work on tool use seems to stem from the computer vision field. Before the emergence of models like GPT-4, language models were effectively blind to images, so the only option was to have the language model generate a function call that could manipulate an image — generating an image, say, or running object detection. If you look at the literature, it is striking how much of the tool-use work originated in the vision domain for exactly that reason. This is Tool Use, and it extends the capabilities of language models.

3. Regarding Planning: many people have had their ChatGPT moment, where they thought, ‘Wow, I’ve never seen anything like this.’ If you have not played much with planning algorithms, I think many of you will have a similar AI Agent moment: ‘Wow, I can’t believe an AI Agent can do that.’ I have run live demonstrations where something fails and the AI Agent routes around the failure; I have had many moments of thinking, ‘Wow, I can’t believe my AI system just did that autonomously.’

For example, here is an example I adapted from the HuggingGPT paper. Suppose the request is: ‘Given an image of a girl reading a book, make the girl pose the same way as the boy in the example image, and then describe the new image with your voice.’

Given such a request, today we have AI Agents that can first decide they need to determine the boy’s pose, perhaps finding a suitable model on Hugging Face to extract that pose; then find a pose-to-image model to synthesize a picture of the girl as instructed; then run image-to-text to describe the new picture; and finally use text-to-speech to read the description aloud.
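The Planning pattern can be sketched as the model choosing an ordered sequence of capabilities, which are then executed one after another. The capability names and stub implementations below are hypothetical placeholders, not actual Hugging Face model names.

```python
def llm(prompt: str) -> str: ...  # hypothetical model call, as in the earlier sketches

# Stub capabilities; in a real system each would wrap a specific model.
CAPABILITIES = {
    "pose-detection": lambda x: "pose.json",
    "pose-to-image":  lambda x: "girl_reading.png",
    "image-to-text":  lambda x: "A girl reading a book, matching the requested pose.",
    "text-to-speech": lambda x: "description.wav",
}

def plan_and_execute(request: str):
    plan = llm(
        "Break this request into an ordered list of steps, one capability name per line, "
        f"chosen from {list(CAPABILITIES)}.\nRequest: {request}"
    )
    artifact = request
    for step in plan.splitlines():
        step = step.strip()
        if step in CAPABILITIES:
            artifact = CAPABILITIES[step](artifact)  # feed each output into the next step
    return artifact
```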

Today we really do have Agents that can do this. I won’t claim they work reliably — they don’t always work — but when they do, it is quite amazing, and with the Agent loop you can sometimes recover from earlier failures. I find myself using research Agents for some of my own work: for a research task where I don’t want to spend a long time Googling, I send it to the research Agent and come back a few minutes later to see what it has come up with. Sometimes it works, sometimes it doesn’t, but it has become part of my personal workflow.

4. The last design pattern is Multi-Agent Collaboration. It may sound like a strange idea, but it works far better than you might imagine.

On the left is a screenshot from a paper called ChatDev, which is completely open source. Many of you may have seen flashy social media announcements of the Devin AI coding assistant; ChatDev, by contrast, is open source and runs on my laptop.

What ChatDev does is an example of a multi-Agent system (see my earlier article titled ‘A Company Composed of 7 Agents Completed the Development of a Game in 7 Minutes’). You prompt a single language model to sometimes play the role of the CEO of a software company, sometimes the designer, sometimes the product manager, and sometimes the tester. By telling the model, ‘You are now the CEO,’ or ‘You are now the software engineer,’ you establish a group of Agents that collaborate and engage in extensive dialogue.

So if you tell them, ‘Please develop a game — a game with a GUI,’ the Agents will actually spend a few minutes writing code, testing it, iterating, and then produce a surprisingly complex program. It does not always work — I use it and sometimes it fails — but sometimes it is amazing, and the technology keeps improving.
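A very small sketch of this role-playing setup, in the spirit of ChatDev but not its actual implementation, could look like this; the roles and prompts are illustrative.

```python
def llm(prompt: str) -> str: ...  # hypothetical model call, as in the earlier sketches

ROLES = {
    "ceo":      "You are the CEO of a software company. Restate the request as a one-paragraph brief.",
    "engineer": "You are a software engineer. Implement the brief in Python.",
    "tester":   "You are a tester. List any bugs or missing behaviour you can find.",
}

def build_app(request: str) -> str:
    brief  = llm(f"{ROLES['ceo']}\n\nRequest: {request}")
    code   = llm(f"{ROLES['engineer']}\n\nBrief: {brief}")
    report = llm(f"{ROLES['tester']}\n\nBrief: {brief}\n\nCode:\n{code}")
    # Hand the test report back to the engineer role for a revised version.
    return llm(f"{ROLES['engineer']}\n\nFix the code based on this test report.\nReport:\n{report}\n\nCode:\n{code}")
```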

It also turns out that having different Agents debate — for example, having ChatGPT and Gemini argue with each other — can lead to better performance. So having multiple simulated AI Agents work together has become a powerful design pattern.

To summarize, these are the patterns I see, and if we use them in our work, many of us can quickly see gains in productivity. I believe agentic reasoning design patterns will be important, and I expect the set of things AI can do this year to expand dramatically because of Agent workflows.

One thing people find difficult to adjust to is that when we prompt a language model, we want an immediate response. About ten years ago, when I was involved in discussions at Google about ‘big box’ search — typing a long prompt into a large search box — one thing that came up was that when you search the web, you want a response within half a second. That is human nature: we like instant feedback.

But for many Agent workflows, I think we need to learn to delegate a task to an AI Agent and patiently wait minutes, or even hours, for a response. It is like the new managers I have seen who delegate a task to someone and then keep checking in on it, which is not productive; we need to learn to delegate to some of our AI Agents in the same way — hand over the task and give it time.

Additionally, fast token generation is also crucial because in these Agent workflows, we iterate repeatedly, and the language model generates tokens for itself to read. Being able to generate tokens at a speed much faster than anyone can read is fantastic.

I believe that quickly generating more tokens from a slightly lower-quality language model may yield better results than slowly generating tokens from a better model, because it lets you go around this loop many more times. This may be a little controversial, but it is consistent with the GPT-3.5 Agent-workflow results I showed on the first slide.

I am very much looking forward to Anthropic’s Claude, GPT-5, Gemini 2.0, and all the other exciting models that many of you are building. I have a feeling that if you expect to run your zero-shot tasks on GPT-5, using Agent reasoning on a weaker model may get you closer to that performance level than you might think; I see this as an important trend.

Honestly, the path to AGI feels more like a journey than a destination. But I believe that this type of Agent workflow can help us take a small step forward on this very long journey.

Related articles:
Sequoia AI Conference: One of the Biggest Opportunities in AI is Replacing Services with Software
Benchmark: What Common Traits Do Excellent AI Products Share, like Midjourney, HeyGen, and ElevenLabs
A Company Composed of 7 Agents Completed the Development of a Game in 7 Minutes
Tencent’s Investment in Audio Version of Netflix Rapidly Rises, with an ARR Valuation Exceeding 1 Billion USD
