Understanding Intelligent Agents Through AutoGPT

1. Concept of Intelligent Agents

What is an intelligent agent? Ask an LLM and you get an answer along these lines: an agent, also known as an intelligent agent, is an important concept in artificial intelligence. It is an entity capable of autonomously understanding its environment, planning, making decisions, and executing complex tasks. Intelligent agents can perceive their surroundings, make decisions, and take actions. In practical implementations, agents often use a large language model as the core controller, enhancing their intelligence through memory retrieval, decision reasoning, and action-sequence selection.

So how do we construct an intelligent agent? The LLM's answer already hints at the recipe: use an LLM as the core controller and add planning, memory, and tool-invocation capabilities.

Without going into too much detail about large models themselves, let's briefly introduce three fundamental questions for agents:

  • How to achieve planning?
  • How to achieve memory?
  • How to achieve tool use?

1.1 Planning

A complex task typically consists of multiple sub-steps; the agent needs to decompose the task in advance and plan accordingly. The two most important mechanisms are task decomposition and self-reflection.

Task decomposition is commonly achieved with the Chain of Thought (CoT) technique, which improves model performance on complex tasks. By thinking step by step, the model can use more test-time computation to break the task into smaller, simpler sub-steps and explain its reasoning process. A related technique is Tree of Thoughts, which explores multiple reasoning possibilities at each step.

A distinctive approach to task decomposition is LLM+P, which relies on an external classical planner for long-horizon planning, using the Planning Domain Definition Language (PDDL) as an intermediate interface to describe planning problems. In this process, the LLM must accomplish the following steps:

  1. Transform the problem into a “PDDL problem”;
  2. Request the classical planner to generate a PDDL plan based on the existing “domain PDDL”;
  3. Translate the PDDL plan back into natural language.

Delegating the planning step to external tools like this is common in certain robotics settings, but is not often seen in other fields.
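The three LLM+P steps above can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: `llm` and `classical_planner` are hypothetical stubs standing in for a real model call and a real PDDL solver.

```python
# Minimal sketch of the LLM+P pipeline. `llm` and `classical_planner` are
# hypothetical stand-ins: the first would be a real LLM call, the second an
# external PDDL planner.

def llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns canned text for illustration.
    return f"[LLM output for: {prompt[:40]}...]"

def classical_planner(domain_pddl: str, problem_pddl: str) -> str:
    # Placeholder for an external planner that consumes PDDL and emits a plan.
    return "(move robot roomA roomB) (pickup robot box1)"

def llm_plus_p(task: str, domain_pddl: str) -> str:
    # Step 1: translate the natural-language task into a PDDL problem.
    problem_pddl = llm(f"Translate this task into a PDDL problem:\n{task}")
    # Step 2: hand the problem to the classical planner with the domain PDDL.
    pddl_plan = classical_planner(domain_pddl, problem_pddl)
    # Step 3: translate the PDDL plan back into natural language.
    return llm(f"Explain this PDDL plan in plain English:\n{pddl_plan}")

plan = llm_plus_p("Bring box1 from room A to room B", "(define (domain demo) ...)")
print(plan)
```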

Self-reflection allows autonomous agents to improve on past decisions and correct earlier mistakes iteratively, which is very useful in real tasks where trial and error is possible. ReAct, one of the foundational components of AutoGPT, was proposed by the Google Research Brain team in the paper "ReAct: Synergizing Reasoning and Acting in Language Models".

In simple terms, ReAct combines reasoning and actions to achieve results. The inspiration comes from an insight into human behavior: when humans perform a multi-step task, there is usually a reasoning process between steps. The authors propose letting the LLM "speak" its inner monologue and then take actions based on that monologue, mimicking the human reasoning process to improve the accuracy of responses. Concretely, ReAct expands the action space to the union of task-specific discrete actions and the language space: actions let the LLM interact with the environment (e.g., calling a Wikipedia search API), while the language space lets it generate reasoning trajectories in natural language.

The ReAct prompt template includes clear steps for LLM thinking, roughly formatted as:

Thought: ...
Action: ...
Observation: ...
... (Repeated many times)
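The Thought/Action/Observation loop above can be sketched in a few lines. This is a toy illustration, not AutoGPT's code: the `llm` and `wikipedia_search` functions are stubs standing in for a real model and a real search tool.

```python
# A minimal sketch of the ReAct Thought/Action/Observation loop.
# `llm` and `wikipedia_search` are hypothetical stubs.

def llm(transcript: str) -> str:
    # Stub: a real model would continue the transcript with a Thought and Action.
    if "Observation:" in transcript:
        return "Thought: I have the answer.\nAction: finish[Paris]"
    return "Thought: I should look this up.\nAction: search[capital of France]"

def wikipedia_search(query: str) -> str:
    return "Paris is the capital of France."  # stub observation

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # model emits Thought + Action
        transcript += step + "\n"
        action = step.split("Action:")[-1].strip()
        if action.startswith("finish["):
            return action[len("finish["):-1]   # final answer
        if action.startswith("search["):
            obs = wikipedia_search(action[len("search["):-1])
            transcript += f"Observation: {obs}\n"  # feed the result back
    return "no answer"

print(react("What is the capital of France?"))  # → Paris
```

The key point is that each tool result is appended to the transcript as an Observation, so the next "Thought" is conditioned on everything seen so far.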

1.2 Memory

Memory can be broadly categorized into the following three types:

  • Sensory memory: learned embedding representations of raw inputs (text, images, or other modalities);
  • Short-term memory: in-context learning — brief and of limited capacity, constrained by the Transformer's context window length;
  • Long-term memory: an external vector store the agent can consult at query time through fast retrieval. Long-term memory alleviates the limits of a finite attention span; the common implementation saves embedding representations of information into a vector database that supports fast maximum inner product search (MIPS).
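A toy illustration of MIPS-based long-term memory retrieval: the "embeddings" below are tiny hand-made vectors rather than real model output, and the scoring is a plain inner product over all stored memories.

```python
# Toy maximum inner product search (MIPS) over a long-term memory store.
# Embeddings here are hand-made 3-d vectors, not real model output.
import numpy as np

memory_texts = ["user likes tea", "meeting at 3pm", "capital of France is Paris"]
memory_vecs = np.array([[0.9, 0.1, 0.0],
                        [0.0, 0.8, 0.2],
                        [0.1, 0.0, 0.9]])

def retrieve(query_vec, k=1):
    scores = memory_vecs @ query_vec       # inner product with every memory
    top = np.argsort(scores)[::-1][:k]     # indices of the k largest scores
    return [memory_texts[i] for i in top]

print(retrieve(np.array([0.2, 0.1, 0.95])))  # → ['capital of France is Paris']
```

Real vector databases replace this brute-force scan with approximate nearest-neighbor indexes so retrieval stays fast at scale.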

1.3 Tool Use

MRKL (Modular Reasoning, Knowledge, and Language) is a neuro-symbolic architecture designed for autonomous agents. The MRKL architecture includes a collection of "expert" modules, with the LLM acting as a router that directs each query to the most suitable expert module. By integrating different types of modules, the MRKL system yields a more efficient, flexible, and scalable AI system.

TALM (Tool Augmented Language Models) and Toolformer are examples of language models fine-tuned to call external tool APIs. ChatGPT plugins and the OpenAI function-calling API are also examples of enhancing language models' tool-use capabilities, where the collection of tool APIs can be provided by other developers (plugins) or custom-defined (function calls).
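The MRKL-style routing idea can be sketched as follows. Everything here is illustrative: the module names are invented, and the routing rule is a crude stand-in for asking the LLM which expert to use.

```python
# Minimal MRKL-style routing sketch: a router picks an "expert" module and
# dispatches the query to it. Module names and the routing rule are
# illustrative, not from any real system.

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))   # toy arithmetic expert

def knowledge_base(query: str) -> str:
    facts = {"capital of France": "Paris"}
    return facts.get(query, "unknown")

EXPERTS = {"calculator": calculator, "knowledge": knowledge_base}

def llm_route(query: str) -> str:
    # Stub router: a real MRKL system would ask the LLM to choose the module.
    return "calculator" if any(c in query for c in "+-*/") else "knowledge"

def mrkl(query: str) -> str:
    expert = EXPERTS[llm_route(query)]
    return expert(query)

print(mrkl("2+2"))                 # → 4
print(mrkl("capital of France"))   # → Paris
```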

2. What Exactly Is AutoGPT?

Currently, several experimental projects such as AutoGPT, GPT-Engineer, and BabyAGI use an LLM as the thinking brain of an intelligent agent, supplemented by other key components to form a complete autonomous system.

Next, we will take the AutoGPT project as an example to understand the content and construction principles of agents from the following aspects:

  • Introduction to the AutoGPT Project
  • Core Principles of AutoGPT
  • How AutoGPT Generates Prompts
  • How AutoGPT Stores Memory
  • How AutoGPT Evaluates Subtask Completion

2.1 Introduction to the AutoGPT Project

AutoGPT is an AI agent and an open-source application that combines GPT-4 and GPT-3.5. Given a natural-language goal, it attempts to break the goal into subtasks and uses search engines and other tools in an automated loop to achieve it. It is powered by GPT-4 and autonomously develops and manages tasks. The official website describes AutoGPT's advantages as follows:

  • 🌐 Internet access for searches and information gathering
  • 💾 Long-term and short-term memory management
  • 🧠 GPT-4 instances for text generation
  • 🔗 Access to popular websites and platforms
  • 🗃️ File storage and summarization with GPT-3.5
  • 🔌 Extensibility with Plugins

A key distinction between AutoGPT and ChatGPT: ChatGPT relies on human instructions and multi-turn dialogue to complete a complex task step by step. AutoGPT, in contrast, needs only a single instruction and works out how to complete the task itself: it autonomously generates prompts through the LLM and uses tools like search engines and Python scripts to reach its goal.

2.2 Core Principles of AutoGPT

The language model behind AutoGPT can be either GPT-4 or GPT-3.5's text-davinci-003. The LLM itself cannot browse the web, execute code, or publish information, so AutoGPT turns these operations into commands for the model to choose from, then performs the chosen action and feeds the result back. In other words, AutoGPT designs a very clever prompt, encapsulates the available commands in that prompt template before sending it to GPT-4, and then executes actions based on the model's response.

According to the project's source code on GitHub, the prompt includes the following command list:

1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
5. List GPT Agents: "list_agents", args: ""
6. Delete GPT Agent: "delete_agent", args: "key": "<key>"
7. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
8. Read file: "read_file", args: "file": "<file>"
9. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
10. Delete file: "delete_file", args: "file": "<file>"
11. Search Files: "search_files", args: "directory": "<directory>"
12. Evaluate Code: "evaluate_code", args: "code": "<full_code_string>"
13. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
14. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
15. Execute Python File: "execute_python_file", args: "file": "<file>"
16. Execute Shell Command, non-interactive commands only: "execute_shell", args: "command_line": "<command_line>".
17. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"
18. Generate Image: "generate_image", args: "prompt": "<prompt>"
19. Do Nothing: "do_nothing", args: ""

For example, AutoGPT might send a question like "What sensational events happened this year?" to GPT-4 and ask it to choose the most suitable of the COMMANDs above to get the answer, supplying the parameters each COMMAND needs (URLs, code to execute, etc.). AutoGPT then executes the command GPT-4 chose and uses the result in the next step.
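The dispatch step can be sketched like this. The reply format and the two handlers are simplified illustrations; AutoGPT's actual response schema and command implementations differ.

```python
# Sketch of the command-dispatch step: the model's reply names one command
# plus its args, and AutoGPT executes it locally. The JSON shape and the
# handlers below are simplified illustrations, not AutoGPT's real schema.
import json

def google(input: str) -> str:
    return f"search results for: {input}"        # stub for a real search call

def write_to_file(file: str, text: str) -> str:
    return f"wrote {len(text)} chars to {file}"  # stub; no real file I/O here

COMMANDS = {"google": google, "write_to_file": write_to_file}

def execute(model_reply: str) -> str:
    parsed = json.loads(model_reply)
    name, args = parsed["command"], parsed["args"]
    return COMMANDS[name](**args)                # look up and run the command

reply = '{"command": "google", "args": {"input": "sensational events this year"}}'
print(execute(reply))  # → search results for: sensational events this year
```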

Of course, in addition to this prompt, AutoGPT also employs several techniques to ensure that tasks are completed more effectively. Here are a few technical aspects:

It keeps a list of the messages sent so far and, within the allowed token budget, sends as many historical messages as possible to GPT-4 with each request. The code shows that, to help GPT-4 complete tasks, AutoGPT tries to use the maximum number of available input tokens each time: it inserts the current command and then, as long as room remains, keeps pulling in previously stored messages. While this makes AutoGPT perform very well, it also consumes API quota heavily. Each request additionally tells GPT the current time and contextual information, making time-related content easier to handle. As noted above, AutoGPT sends the most recent history, along with the information most relevant to the goal, to raise the probability of completing the task; to support this, it retains all historical information.
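The history-packing strategy described above can be sketched as follows: walk backwards through stored messages and include as many recent ones as the token budget allows. Token counting here is a crude word count, not a real tokenizer.

```python
# Sketch of filling the context window with recent history under a token
# budget. `count_tokens` is a crude word count standing in for a tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def build_context(history, current_prompt, max_tokens=50):
    budget = max_tokens - count_tokens(current_prompt)
    selected = []
    for msg in reversed(history):      # newest messages first
        cost = count_tokens(msg)
        if cost > budget:
            break                      # stop once the budget is exhausted
        selected.insert(0, msg)        # keep chronological order
        budget -= cost
    return selected + [current_prompt]

history = [f"message {i} " + "word " * 10 for i in range(10)]
context = build_context(history, "current command", max_tokens=50)
print(len(context))  # → 5 (four most recent messages + the current command)
```

Because every request is padded to near the token limit this way, token consumption per request stays close to the maximum, which explains the heavy API usage.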

The following diagram illustrates the general process.

[Diagram: the general AutoGPT process]

2.3 How AutoGPT Generates Prompts

At the beginning of a conversation, AutoGPT uses the "system" role to configure constraints and set up performance self-evaluation. AutoGPT has a prompt generator in which some constraints are hard-coded:

[Figure: hard-coded constraints in AutoGPT's prompt generator]

Guided by these prompts, GPT becomes more autonomous and self-sufficient, and in many cases a continuous feedback loop helps GPT improve its output.
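A simplified sketch of such a prompt generator, in the spirit of AutoGPT's prompt-generation module; the class shape and the constraint wording below are illustrative, not copied from the project.

```python
# Simplified prompt generator with hard-coded constraints and a command list.
# Constraint wording and method names are illustrative, not AutoGPT's code.

class PromptGenerator:
    def __init__(self):
        self.constraints = []
        self.commands = []

    def add_constraint(self, text: str):
        self.constraints.append(text)

    def add_command(self, label: str, name: str, args: dict):
        arg_str = ", ".join(f'"{k}": "{v}"' for k, v in args.items())
        self.commands.append(f'{label}: "{name}", args: {arg_str}')

    def generate(self) -> str:
        parts = ["Constraints:"]
        parts += [f"{i + 1}. {c}" for i, c in enumerate(self.constraints)]
        parts.append("Commands:")
        parts += [f"{i + 1}. {c}" for i, c in enumerate(self.commands)]
        return "\n".join(parts)

gen = PromptGenerator()
gen.add_constraint("~4000 word limit for short term memory.")
gen.add_constraint("No user assistance.")
gen.add_command("Google Search", "google", {"input": "<search>"})
print(gen.generate())
```

The numbered command lines it emits match the format of the command list shown in section 2.2.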


In summary:

  • AutoGPT generates appropriate prompts for GPT-4 based on the set goals, tasks, and data stored in the database.
  • AutoGPT leverages GPT-4's learning capabilities to improve its own methods of learning and execution based on its performance and results.

2.4 How AutoGPT Stores Memory

AutoGPT can retain context and make more informed decisions by integrating with vector databases, akin to equipping a robot with long-term memory to remember past experiences. In reality, AutoGPT manages short-term and long-term memory through writing to and reading from databases and files.

AutoGPT uses OpenAI's embedding API to create embeddings of GPT's text output and supports several vector storage backends: local storage, Pinecone (a third-party service), Redis, and Milvus (open-source). Pinecone and Milvus optimize vector-search algorithms to find text embeddings relevant to the current context. AutoGPT stores embeddings in one of these backends and injects context into GPT by retrieving the vectors most relevant to the current task session.

2.5 How AutoGPT Evaluates Subtask Completion

AutoGPT evaluates and improves its performance based on generated prompts, results, and human feedback. For example, if a user asks AutoGPT to write marketing copy and gives feedback like "The copy is well written, but some parts are not engaging enough and do not reach the user's decision points; please elaborate more," AutoGPT revises the copy accordingly. At the same time, AutoGPT generates a new prompt, "What characteristics should marketing copy have?", and lets GPT-4 answer. Based on that answer (e.g., concise, emphasizes value, highlights features, evokes emotional resonance), AutoGPT assesses whether the subtask is complete and meets the goal. If the response is not comprehensive or accurate enough, it generates a new prompt and lets GPT-4 continue answering. Once a satisfactory answer is obtained, AutoGPT stores the result in the database and moves on to the next subtask. This gives AutoGPT greater autonomy and flexibility on complex, multi-step tasks.
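The evaluate-and-retry loop described above can be sketched as follows. The generator and evaluator are stubs; in AutoGPT both roles are played by GPT-4 calls, and the "criteria" below are invented for illustration.

```python
# Sketch of the subtask evaluate-and-retry loop. `generate_copy` and
# `evaluate` are stubs standing in for GPT-4 calls; the success criterion
# is invented for illustration.

def generate_copy(attempt: int) -> str:
    drafts = ["dry draft", "better draft with value and emotion"]
    return drafts[min(attempt, len(drafts) - 1)]

def evaluate(draft: str) -> bool:
    # Stub criterion standing in for GPT-4's judgment of the subtask.
    return "value" in draft and "emotion" in draft

def run_subtask(max_attempts: int = 3) -> str:
    draft = ""
    for attempt in range(max_attempts):
        draft = generate_copy(attempt)
        if evaluate(draft):        # subtask judged complete: store and move on
            return draft
    return draft                   # give up after max_attempts

print(run_subtask())  # → better draft with value and emotion
```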

3. Conclusion

AutoGPT has garnered significant attention online, with examples including business surveys, no-code app or webpage generation, automated office tasks, and text generation. With the addition of Stable Diffusion, it can even generate images.

In summary, if applications built directly on an LLM can be seen as assisted driving, then applications built on the intelligent-agent model, like AutoGPT, represent autonomous driving. The general capability of the core LLM remains crucial, however. Only with a sufficiently powerful model can the agent handle tasks from completely different domains as expected: recognizing human intent, organizing and planning over the available tools, and introducing "perceptors" that report execution results back to the large model so it can correct its own errors, thereby completing the loop of an intelligent application.

References:

  • “Is Autonomous AI Here? A Detailed Explanation of the Viral AutoGPT”
  • “The Rise of Next-Generation Language Model Paradigms: LAM! The AutoGPT Model Sweeps LLMs, A Comprehensive Overview of Three Components: Planning, Memory, and Tools”
  • “AI Agent Insights 03 – AutoGPT: Autonomous AI Agent Based on ChatGPT”
  • “AutoGPT Documentation” https://docs.agpt.co/
