Understanding AutoGPT and LLM Agents

In the past two weeks, projects like AutoGPT and BabyAGI have gained immense popularity. Over the weekend, I spent some time reviewing the code of these AI agent projects and decided to write an article summarizing my technical insights and thoughts on the current advancements in this field for everyone to discuss.

From Language Understanding to Task Execution

Most related projects and products previously leveraged the language understanding capabilities of GPT models, such as Jasper for generating copy, Notion AI for assisting with webpage and document summaries, Glarity, Bearly.ai for Q&A, New Bing, and ChatPDF. To expand the application scope of GPT, a natural direction is to enable GPT to learn to use various external tools for executing a wider range of tasks, achieving a state of “knowing and doing” 😊. Besides AutoGPT and BabyAGI, many interesting projects like Toolformer[1], HuggingGPT[2], and Visual ChatGPT[3] are also exploring this direction.

The principle of task execution is not complex; the basic approach is still to let GPT generate responses. However, we inform GPT in the prompt that if it needs to call external tools, it should generate specific instructions/code in a defined format. The program then receives this and calls the external tools based on the content generated by GPT to obtain the corresponding results, which can then be fed back into GPT for further understanding and generation, creating a loop. For example, in LangChain, the commonly used ReAct prompt is as follows:

...You can use the following tools to complete tasks:
1. Calculator, for performing various mathematical calculations to get precise results; input expressions like 1 + 1 to get results...
Question: What is the result of 123 multiplied by 456?...

The content generated by the model is as follows:

Thought: I need to use the calculator to calculate the result of 123 multiplied by 456.
Action: Call the calculator.
Action Input: 123 * 456.
Observation Result:

We can then process this returned data, call the calculator program, obtain the result of 123 * 456, and fill this result into the observation result before allowing the model to continue generating the next segment of content.
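The parse-and-dispatch step described above can be sketched in a few lines of Python. This is a toy illustration of the pattern, not LangChain's actual implementation; the regex and the `run_tool` helper are assumptions for demonstration:

```python
import re

# Toy tool registry; in the example above the only tool is a calculator.
def calculator(expression: str) -> str:
    # Demo only: eval is unsafe on untrusted model output.
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def run_tool(model_output: str) -> str:
    """Extract the 'Action Input' line and dispatch it to the named tool."""
    action_input = re.search(r"Action Input:\s*(.+)", model_output)
    expression = action_input.group(1).rstrip(".")  # the model may append a period
    return TOOLS["calculator"](expression)

output = (
    "Thought: I need to use the calculator to calculate 123 multiplied by 456.\n"
    "Action: Call the calculator.\n"
    "Action Input: 123 * 456."
)
print(run_tool(output))  # 56088
```

In a real agent loop, the tool's return value would be appended after "Observation:" in the prompt and the model would be called again to continue generating.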

This is the basic method of task execution. For more content, you can refer to my previous shares about LangChain: How Microsoft 365 Copilot is Achieved? Unveiling How LLM Generates Instructions[4].

Understanding AutoGPT and LLM Agents
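A minimal sketch of this retrieval pattern follows, with a toy bag-of-words `embed()` standing in for a real embedding model (in practice one would call an embedding API and use a vector store; all names here are illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Long-term memory": embed every stored text once, up front.
memory = ["the user prefers metric units", "the report is due on Friday"]
index = [(m, embed(m)) for m in memory]

def recall(query: str, k: int = 1) -> list[str]:
    """Return the k memories most similar to the query."""
    qv = embed(query)
    ranked = sorted(index, key=lambda mv: cosine(qv, mv[1]), reverse=True)
    return [m for m, _ in ranked[:k]]

print(recall("when is the report due?"))
```

The recalled texts are then concatenated into the prompt as the model's "long-term memory" context.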
Typical ReAct prompt

Model Memory

Another common pattern is enhancing model memory through external storage. One typical scenario is long-session chat processes. Due to the 4000 token input limit of the GPT API, users often find that ChatGPT has “forgotten” previously discussed content after prolonged conversations. Another typical scenario is providing LLM with new information, such as understanding and answering questions about an entire PDF or knowledge base, which cannot be directly fed into GPT in the prompt.

This is where external storage comes into play to help GPT expand its memory. The simplest method is to save these conversation records and external information as text in files or database systems. Later, during interactions with the model, relevant external information can be retrieved as needed. We can consider the content in the prompt as the model’s “short-term memory,” while this external storage becomes “long-term memory.” Besides the mentioned benefits, this memory system can also help reduce model hallucinations to some extent, avoiding pure reliance on “generation” to achieve task objectives.

The most common method for acquiring long-term memory is through “semantic search.” This means using an embedding model to convert all memory texts into vectors. Subsequent interactions with the model can also be transformed into vectors using the same embedding model, and then the most similar memory texts can be found through similarity calculations. Finally, these memory texts can be concatenated into the prompt as input for the model. Popular open-source projects for this method include OpenAI’s ChatGPT Retrieval Plugin[5] and Jerry Liu’s LlamaIndex[6].

Retrieval Pattern

This memory expansion approach feels somewhat "coarse" compared to how the human brain operates. The long-term and short-term memory mechanisms (including some more complex implementations in LangChain and LlamaIndex) still feel relatively "hard coded." If breakthrough research significantly expands model context sizes in the future, this pattern may no longer be necessary.

From an overall interaction process perspective, these model memory implementation patterns can also be viewed as a form of “task execution,” where the task is to “write/retrieve memory” rather than “execute an external tool.” We can unify both views, as they represent the most common application development patterns for current large language models. Later, we will see that various so-called intelligent agents are also expanding and implementing under this thought process.


Application patterns of LLM calling external tools

Interestingly, OpenAI's Jack Rae and Ilya Sutskever have both discussed the idea of compression as wisdom[7]. From the standpoint of a model's "compression rate," using external tools effectively can significantly improve the accuracy of next-token prediction on many specific tasks. I personally feel there is still much room for development in this direction. For example, from the perspective of "effective data," the data humans generate when using tools to perform tasks, and even their thought processes, could be very valuable. From the training perspective, how to incorporate a model's tool-use ability into the loss function may also be an interesting direction.

Means to enhance “compression rate”

AutoGPT

With the preceding information laid out, understanding the internal structure and core logic of AI agents like AutoGPT becomes relatively easier. Most of the innovations in these projects still lie in the prompt layer, using better prompts to stimulate the model’s capabilities, converting many processes that previously required hard-coded logic into dynamically generated logic by the model. For example, the core prompt of AutoGPT is as follows:

You are Guandata-GPT, 'an AI assistant designed to help data analysts do their daily work.' Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.
GOALS:
1. 'Process data sets'
2. 'Generate data reports and visualizations'
3. 'Analyze reports to gain business insights'
Constraints:
1. ~4000 word limit for short-term memory. Your short-term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"
Commands:
1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
5. List GPT Agents: "list_agents", args: 
6. Delete GPT Agent: "delete_agent", args: "key": "<key>"
7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>"
8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
9. Read file: "read_file", args: "file": "<file>"
10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
11. Delete file: "delete_file", args: "file": "<file>"
12. Search Files: "search_files", args: "directory": "<directory>"
13. Evaluate Code: "evaluate_code", args: "code": "<full_code_string>"
14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
16. Execute Python File: "execute_python_file", args: "file": "<file>"
17. Generate Image: "generate_image", args: "prompt": "<prompt>"
18. Send Tweet: "send_tweet", args: "text": "<text>"
19. Do Nothing: "do_nothing", args: 
20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"
Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.
Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.
You should only respond in JSON format as described below
Response Format:
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted
- list that conveys
- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
}
Ensure the response can be parsed by Python json.loads
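Since the prompt demands output parseable by `json.loads`, the host program's side of the loop can be very simple. The response below is a hypothetical example, not actual AutoGPT code:

```python
import json

# A hypothetical model response following the required format above.
response = """{
    "thoughts": {
        "text": "I should search first",
        "reasoning": "I lack current information",
        "plan": "- search\\n- summarize",
        "criticism": "avoid redundant searches",
        "speak": "Let me look that up."
    },
    "command": {"name": "google", "args": {"input": "LLM agents"}}
}"""

parsed = json.loads(response)
command = parsed["command"]
# The program would now dispatch on the command name and execute it.
print(command["name"], command["args"])  # google {'input': 'LLM agents'}
```

The executed command's result is then fed back to the model as the next observation, just as in the ReAct loop.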

From this lengthy prompt, it is evident that AutoGPT is indeed a current advanced “culmination” of prompt application patterns, with many aspects to learn from. Compared to the classic reason + act model, we can examine the further developments and improvements it has made.

Constraints & Resources

This section informs the model of its various limitations, which is quite amusing. For example, the model’s input context size is limited, so it needs to save important information to files. This action is crucial, especially in code generation scenarios, as it enables the generation and execution of long code. Additionally, AutoGPT provides long-term memory management, which is vital since the processes generated by complex prompts can often become lengthy, leading to incoherent outputs without such memory management.

Moreover, the base model itself is not connected to the internet, and its knowledge extends only to the training-data cutoff; the prompt therefore explicitly tells the model it can search the internet for more timely external information.

Commands

The commands, or tool options provided, are quite extensive. This is one reason many articles mention that AutoGPT can accomplish various tasks, showcasing high flexibility and versatility.

The specific commands can be divided into several categories, including search and web browsing, starting other GPT agents, file read/write operations, code generation and execution, etc. The idea of using other agents is somewhat similar to HuggingGPT, as currently, the GPT model performs better with more specific and detailed tasks, leading to more accurate and stable outputs. Thus, this “divide and conquer” approach is quite necessary.

Performance Evaluation

This section provides guiding principles for the model’s overall thinking process, divided into several specific dimensions, including reviewing the match between its abilities and behaviors, maintaining a big-picture view and self-reflection, optimizing decision-making actions in conjunction with long-term memory, and completing tasks efficiently with fewer actions. This thinking logic aligns closely with human thinking, decision-making, and feedback iteration processes.

Response

From the response format perspective, it also integrates several patterns, including needing to articulate its thoughts, reasoning to acquire relevant background knowledge, generating a detailed plan with specific steps, and self-critique of its thinking process. These format constraints also serve as practical operational guidelines for the aforementioned thinking principles.

The generation of specific commands is fundamentally consistent with the ReAct approach described earlier. Commands can also be nested; for example, one command can start another GPT agent and a later command can message that agent, enabling more complex tasks. In LangChain, by contrast, a child agent gets only a single call-and-return exchange with the main process, which is more limited.

It is worth noting that this entire response is generated by the model in one interaction, unlike some other frameworks where planning, reflection, and action generation are produced through multiple model interactions. I feel this is because the solution process generated by AutoGPT can often be very lengthy; if each action’s generation required multiple interactions with the LLM, the time and token consumption would be substantial. However, if a specific decision action incurs a high cost, such as needing to call a costly API for image generation, it may be more efficient to review and optimize this action multiple times before making a final decision.

Human Intervention

If you have ever run AutoGPT, you may find that the model can easily complicate issues or “go off track” during execution planning. Therefore, AutoGPT allows users to intervene during execution, providing additional input for each specific execution step to guide the model’s behavior. After providing human feedback, the model will regenerate the aforementioned response, iterating back and forth. You can access this interface-enabled AutoGPT product[8] to experience this process firsthand. Although the current task completion perspective is still in its early stages, the design of this prompt and interaction method is quite enlightening.

BabyAGI

In contrast to AutoGPT, BabyAGI focuses more on the “thinking process” aspect and does not add support for various external tools. Its core logic is very simple:

  1. Retrieve the first task from the task list.
  2. Obtain task-related “memory” information, which is executed by the task execution agent to get results. Currently, this execution is a simple LLM call, without involving external tools.
  3. Store the returned results in memory.
  4. Based on current information, such as overall objectives, the most recent execution results, task descriptions, and pending task lists, generate new tasks as needed.
  5. Add new tasks to the task list and then prioritize all tasks, reordering them.
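The five steps above can be sketched as a loop like the following, with `llm()` as a placeholder for real model calls (BabyAGI itself uses the OpenAI API plus a vector store for memory; the prioritization step here is just a stub):

```python
from collections import deque

def llm(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"[model output for: {prompt[:40]}...]"

objective = "write a market research report"
tasks = deque(["draft an outline"])
memory: list[tuple[str, str]] = []

for _ in range(3):  # bounded for the demo; BabyAGI runs until interrupted
    task = tasks.popleft()                      # 1. take the first task
    context = [r for _, r in memory]            # 2. recall prior results as context
    result = llm(f"Objective: {objective}\nContext: {context}\nTask: {task}")
    memory.append((task, result))               # 3. store the result in memory
    new_task = llm(f"Given result '{result}', create new tasks for {objective}")
    tasks.append(new_task)                      # 4. add newly generated tasks
    tasks = deque(sorted(tasks))                # 5. reprioritize (stub: sort)

print(len(memory))  # 3
```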

The author states that this process simulates their real work process for a day: checking tasks in the morning, completing tasks during the day, receiving feedback, and then reviewing if there are new tasks to add in the evening, followed by re-prioritization.

BabyAGI operation process

The entire project’s codebase is quite small, and the relevant prompts are straightforward and easy to understand. Interested readers can read it themselves.

Example of BabyAGI prompts

Subsequently, some evolutionary versions of this project have emerged, such as BabyASI[9], which borrowed from AutoGPT to add support for search, code execution, and other tools. Theoretically, if this ASI (Artificial Super Intelligence) is intelligent enough, it could even generate code to optimize its own prompts, transform processes, and even continue model training, allowing GPT to develop future versions of itself—imagine how exciting that would be 😆.

HuggingGPT

If BabyAGI explores the plan & execution application of LLMs, HuggingGPT, which emerged earlier, showcases the imaginative space in the realm of “external tools.” Its core operational logic also combines planning and execution, but at the execution tool level, it can utilize a rich array of “domain-specific models” to assist LLMs in completing complex tasks more effectively, as shown in the following diagram:

HuggingGPT process

Through various examples provided by the author, it can be seen that LLMs can effectively understand tasks and call relevant models to solve them. Although many examples might later be accomplished through the multimodal GPT series in an end-to-end manner, this idea is still quite interesting. External tools are not limited to searches and API calls; they can also invoke other complex models. In the future, it may be possible to not only call models but also trigger data collection, model training/fine-tuning, and other actions to complete even more complex task processes.

From another perspective, for specific, professional, high-frequency scenarios, rich data can be used to build smaller proprietary models that meet related demands at lower cost, while for more ambiguous and variable "long-tail" demands, the stronger understanding, reasoning, and generation capabilities of large models can be brought to bear, potentially replacing many heuristic rule-driven business processes. This may become a common pattern for combining large and small models in the future.

Camel / Generative Agents

In the earlier discussion of AutoGPT, we saw some methods for adding long-term memory to model agents and interacting with other agents. Additionally, in the previous prompt patterns, we found that allowing models to self-reflect or plan before executing often leads to significant improvements in effectiveness. If we further extrapolate along this line, could we assemble multiple agents into a team, each playing different roles, to better solve complex problems, or even allow this small “community” to evolve more complex behavior patterns and discover new knowledge? Recently, two notable works related to the direction of agent “communities” have gained popularity.

Camel

In the work Camel[10], the author proposes simulating users and AI assistants through LLMs, allowing two agents to role-play (for example, one as a business expert and the other as a programmer), and then have them communicate and collaborate autonomously to complete a specific task. This idea is quite straightforward, but the author also notes that prompt design is crucial; otherwise, issues like role-switching, repeated instructions, infinite message loops, flawed replies, and determining when to terminate conversations can easily arise. Interested readers can look at the prompt settings provided in the project code, which includes many explicit instructions to guide agents in their intended communication and collaboration.


AI User and AI Code Assistant prompt

In addition to optimizing agent prompts and operational modes, the author also designed prompts to automatically generate various roles, scene demands, and other content. These elements can automatically form various role-playing scenarios to collect interaction data between agents in different contexts, facilitating further exploration and analysis. Interested readers can explore the conversation records generated between various agent combinations on this website[11]. The project code is also open-source, making it a great starting point for researching AI agent communities.


Data generation prompt

Generative Agents

In the work Generative Agents[12], the author forms a virtual town community with 25 agents having defined identities, each equipped with a memory system, allowing them to plan, respond to actions, and self-reflect, enabling them to operate freely and truly simulate community dynamics. Throughout the simulation process, this community also “emerges” various phenomena seen in real society, which is quite fascinating.

From a technical perspective, there are several behavior settings of the agents in this article that are worth learning:

  • Each agent’s memory retrieval is more detailed, considering timeliness, importance, and relevance for memory recall. Compared to simple vector similarity searches, this approach yields significantly better results.
  • Memory storage also includes a reflection step, periodically summarizing memories to maintain the agents’ “sense of purpose.”
  • In plan generation, a multi-level recursive approach is adopted, generating action plans from coarse to fine, aligning more closely with our daily thought processes.
  • Using “character interviews” to evaluate the effectiveness of these behavior settings, ablation experiments reveal notable improvements.
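The first bullet's retrieval scoring can be sketched as a weighted combination of recency, importance, and relevance. The weights and the exponential decay constant below are illustrative assumptions, not the paper's exact values:

```python
def retrieval_score(hours_since_access: float, importance: float,
                    relevance: float, decay: float = 0.995) -> float:
    """Score a memory for retrieval; higher scores are recalled first.

    importance and relevance are assumed normalized to [0, 1];
    recency decays exponentially with time since last access.
    """
    recency = decay ** hours_since_access
    return 1.0 * recency + 1.0 * importance + 1.0 * relevance

# A recent but trivial memory vs. an old but crucial, highly relevant one.
fresh_trivial = retrieval_score(1, importance=0.1, relevance=0.2)
old_crucial = retrieval_score(100, importance=0.9, relevance=0.9)
print(old_crucial > fresh_trivial)  # True
```

Combining the three signals lets an important old memory outrank a fresh but irrelevant one, which plain vector similarity search cannot do.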


Agent structure

This overall loop of identity, plan, act/react, reflect, and memory stream appears reasonable and can complement AutoGPT's approach. Of course, there are still limitations: for example, the simulation only allows one-on-one conversations between agents, with no settings for meetings or broadcasts, and the simulation's runtime is limited, making it hard to assess how memory and behavior patterns would evolve, or how community goals would be explored and advanced, over longer periods.

From an application perspective, it seems that the focus is mainly on social activity simulation and game applications. Whether it can be expanded to broader fields like task handling and knowledge exploration remains to be seen.

Prompt Patterns

Finally, let’s summarize the prompt design patterns reflected in the aforementioned projects.

  1. CoT prompt, where instructions are provided alongside the breakdown or examples of the task execution process. Many people have likely used this, e.g., “let’s think step by step” 😊.
  2. “Self-reflection,” reminding the model to self-reflect before producing results to see if better solutions exist. This can also be applied after obtaining results, prompting the model to review its output. For instance, AutoGPT includes “Constructively self-criticize your big-picture behavior constantly.”
  3. Divide and conquer; when writing prompts, it becomes clear that the more specific the context and goals, the better the model performs. Thus, breaking tasks down and applying the model often yields better results than asking it to complete an entire task at once. Utilizing external tools and nested agents also stem from this perspective, serving as a natural extension of CoT.
  4. Plan first, execute later. BabyAGI, HuggingGPT, and Generative Agents all adopted this model. This pattern can also be expanded; for example, during the planning phase, the model could proactively ask questions, clarify goals, or propose possible solutions, with human review for confirmation or feedback to reduce the likelihood of goal deviation.
  5. Memory systems, including short-term memory scratchpads, long-term memory storage, processing, and retrieval. This pattern is also applied across almost all agent projects and currently reflects some models’ real-time learning capabilities.

These patterns show a significant similarity to human cognition and thought processes. Historically, there has been dedicated research on cognitive architecture[13], systematically considering the design of intelligent agents from dimensions like memory, world cognition, problem-solving (action), perception, attention, reward mechanisms, and learning. I feel that current LLM agents still have significant room for improvement in terms of reward mechanisms (whether there are good goal guides) and learning evolution (whether they can continuously improve their capabilities). Perhaps future applications of RL in model agents will hold great potential, beyond merely aligning values as is primarily done now.


Research on cognitive architecture

Common Questions

If you have practically engaged with these projects, you should have experienced some of the current issues and limitations of model agents, such as:

  1. Memory recall issues. If only simple embedding similarity recall is performed, it is easy to find that the results are not satisfactory. There should be plenty of room for improvement in this area, such as the more detailed handling of memory in Generative Agents and the many choices and tuning opportunities in index structures in LlamaIndex.
  2. Error accumulation issues. Many examples available online are cherry-picked; the overall model performance may not be as impressive, and errors can often occur early in the process, leading to increasingly significant deviations… A crucial issue here may still be high-quality training data related to task decomposition execution and external tool utilization, which is relatively scarce. This is likely one reason why OpenAI is developing its own plugin system.
  3. Exploration efficiency issues. For many simple scenarios, allowing model agents to explore and complete the entire solution process can still be cumbersome and time-consuming, and agents can easily complicate issues. Considering the costs associated with LLM calls, significant optimizations are needed before practical implementations can occur. One approach might be to introduce human judgment intervention and feedback inputs midway, as seen in AutoGPT.
  4. Task termination and result validation. In open-ended questions or situations where results cannot be evaluated through clear assessment methods, determining how to terminate model agent work presents a challenge. This circles back to the earlier point that data collection related to task execution, model training, and the application of reinforcement learning may help address this issue.

What tricky problems have you encountered while using these model agents, and what good solutions have you found? Or have you discovered any scenarios where existing agents can effectively meet needs? Feel free to share and discuss in the comments.

References

[1] Toolformer: https://arxiv.org/abs/2302.04761
[2] HuggingGPT: https://github.com/microsoft/JARVIS
[3] Visual ChatGPT: https://github.com/microsoft/visual-chatgpt
[4] How Microsoft 365 Copilot is Achieved? Unveiling How LLM Generates Instructions: https://www.bilibili.com/video/BV1DY4y1Q7Te/
[5] ChatGPT Retrieval Plugin: https://github.com/openai/chatgpt-retrieval-plugin
[6] LlamaIndex: https://github.com/jerryjliu/llama_index
[7] Compression as Wisdom: https://www.youtube.com/watch?v=dO4TPJkeaaU
[8] Interface-enabled AutoGPT Product: https://godmode.space/
[9] BabyASI: https://github.com/oliveirabruno01/babyagi-asi
[10] Camel: https://www.camel-ai.org/
[11] This website: http://data.camel-ai.org/
[12] Generative Agents: https://arxiv.org/abs/2304.03442
[13] Research on Cognitive Architecture: https://cogarch.ict.usc.edu/
