Understanding AI Agents: A Comprehensive Guide

This is the 255th article in the series for job-seeking product managers.

🔥What Is an Agent?

The term "Agent" originates from the Latin agere, meaning "to do". In the context of LLMs, an Agent is an intelligent entity that can autonomously understand, plan, make decisions, and execute complex tasks.


An Agent is not merely an upgraded ChatGPT: it does not just tell you how to do something; it helps you do it. If Copilot is the co-pilot, the Agent is the driver.

Given a goal, an autonomous Agent is an AI-driven program that can create tasks, complete them, generate new tasks, re-prioritize its task list, and keep looping until the goal is achieved.

🔥The Most Intuitive Formula

Agent = LLM + Planning + Feedback + Tool use

🔥Agent Decision-Making Process

👉Perception → Planning → Action

  • Perception: the ability to collect information from the environment and extract relevant knowledge.

  • Planning: the decision-making process used to achieve a specific goal.

  • Action: the concrete steps taken based on the environment and the plan.

Through perception the Agent collects information from its environment and extracts relevant knowledge; through planning it decides how to achieve a goal; through action it carries out concrete steps based on the environment and the plan. The policy is the core of the Agent's decision-making, and each action supplies the premise and foundation for further perception, forming a self-sustaining closed learning loop.
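The closed loop above can be sketched in a few lines. Everything here (the `SimpleAgent` class, the toy set-based goal and environment) is illustrative, not a real framework:

```python
# Minimal sketch of the Perception -> Planning -> Action closed loop.
# The goal is modeled as a set of facts the Agent must gather.

class SimpleAgent:
    def __init__(self, goal):
        self.goal = goal
        self.knowledge = []          # what perception has extracted so far

    def perceive(self, environment):
        """Collect information from the environment and extract knowledge."""
        self.knowledge.append(environment["observation"])

    def plan(self):
        """Decide the next step based on the goal and accumulated knowledge."""
        remaining = self.goal - set(self.knowledge)
        return next(iter(remaining)) if remaining else None

    def act(self, step):
        """Carry out the planned step; its result feeds the next perception."""
        return {"observation": step}

def run_loop(agent, environment, max_steps=10):
    # Each action produces the environment for the next perception.
    for _ in range(max_steps):
        agent.perceive(environment)
        step = agent.plan()
        if step is None:             # goal reached -> stop
            break
        environment = agent.act(step)
    return agent.knowledge

agent = SimpleAgent(goal={"A", "B", "C"})
history = run_loop(agent, {"observation": "A"})
```

The point of the sketch is the loop shape, not the stub logic: action output becomes perception input until planning decides the goal is met.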

🔥The Explosion of Agents (2023)

  • On March 21, Camel was released.

  • On March 30, AutoGPT was released.

  • On April 3, BabyAGI was released.

  • On April 7, Westworld Town was released.

  • On May 27, NVIDIA released its AI agent Voyager. Powered by GPT-4, it clearly outperformed AutoGPT: by autonomously writing code, it mastered the game Minecraft, achieving lifelong learning in the game environment without human intervention.

  • Around the same time, SenseTime, Tsinghua University, and others proposed Ghost in the Minecraft (GITM), a generalist AI agent that also excels at solving tasks through autonomous learning. These high-performing agents give us a glimpse of an embryonic form of AGI + agents.

An Agent enables an LLM to set goals and accomplish them through a self-motivated loop.

It can operate in parallel (multiple prompts simultaneously attacking the same goal) and unidirectionally (without a human in the conversation).

After a goal or main task is created for the Agent, the loop consists of three main steps:

  1. Obtain the first incomplete task.

  2. Collect intermediate results and store them in a vector database.

  3. Create new tasks and re-prioritize the task list.
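The three steps above can be sketched as a BabyAGI-style loop. The in-memory "vector store" and the hard-coded `execute`, `create_new_tasks`, and `prioritize` functions are stand-ins for LLM calls and a real vector database:

```python
from collections import deque

# BabyAGI-style task loop sketch: take a task, store its result,
# create follow-up tasks, re-prioritize, repeat.

def execute(task):
    # Stand-in for the LLM executing a task.
    return f"result of {task}"

def create_new_tasks(task, result):
    # Stand-in for the LLM proposing follow-up tasks (one level deep here).
    return [f"{task}.follow-up"] if "." not in task else []

def prioritize(tasks, goal):
    # Stand-in for LLM-based re-prioritization: shorter tasks first.
    return deque(sorted(tasks, key=len))

def run(goal, initial_tasks, max_iterations=10):
    tasks = deque(initial_tasks)
    vector_store = []                        # intermediate results live here
    while tasks and max_iterations > 0:
        task = tasks.popleft()               # 1. take the first incomplete task
        result = execute(task)
        vector_store.append((task, result))  # 2. store intermediate results
        tasks.extend(create_new_tasks(task, result))
        tasks = prioritize(tasks, goal)      # 3. create tasks, re-prioritize
        max_iterations -= 1
    return vector_store

store = run("demo goal", ["plan", "research"])
```

In a real system the stored results would be embedded into a vector database so later tasks can retrieve them as context.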

🔥How Do Humans Get Things Done?

In our work we often use the PDCA (Plan-Do-Check-Act) thinking model. Under PDCA, completing a task breaks down into planning, implementation, and checking the results, then incorporating what worked into standards while carrying what failed into the next cycle. It is a well-proven summary of how people complete tasks efficiently.


🔥How Can an LLM Replace Humans at Work?

To let an LLM replace humans at work, we can plan, execute, evaluate, and reflect following the PDCA model.

👉Planning Ability (Plan) → Task decomposition: the Agent's brain breaks large tasks into smaller, manageable subtasks, which is essential for handling large, complex tasks effectively.

👉Execution Ability (Do) → Tool use: the Agent learns to call external APIs when its internal knowledge is insufficient (knowledge absent at pre-training time and unalterable afterwards), for example to retrieve real-time information, execute code, or access proprietary knowledge bases. This is a typical platform-plus-tools scenario and calls for an ecosystem mindset: build the platform and the essential tools, then attract other vendors to contribute more component tools, forming an ecosystem.

👉Evaluation Ability (Check) → Confirming results: after a task executes, the Agent must judge whether the output meets the goal. On anomalies, it should classify the anomaly (severity), locate its source (which subtask failed), and analyze its cause. General-purpose large models lack this ability; it requires training dedicated small models for each scenario.

👉Reflection Ability (Act) → Re-planning based on evaluation: the Agent must end the task promptly when the output meets the goal, which is the core of the whole process, and perform attribution to summarize the main factors behind the result. If anomalies occur or the output falls short, it should propose countermeasures and re-plan, restarting the cycle.
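The four abilities can be wired together as a PDCA loop. All four functions below are deterministic stand-ins for LLM calls; the names and the anomaly fields are illustrative:

```python
# Sketch of the Plan -> Do -> Check -> Act loop for an LLM Agent.

def plan(goal):
    # Plan: decompose the goal into subtasks.
    return [f"{goal}: step {i}" for i in range(1, 3)]

def execute(step):
    # Do: stand-in for tool use / API calls.
    return {"step": step, "output": step.upper()}

def check(result, goal):
    # Check: does the output meet the goal? Classify anomalies otherwise.
    ok = goal.upper() in result["output"]
    return {"ok": ok, "severity": None if ok else "high", "source": result["step"]}

def reflect(evaluations):
    # Act: end promptly if every check passed, otherwise re-plan failed steps.
    failed = [e["source"] for e in evaluations if not e["ok"]]
    return ("done", []) if not failed else ("re-plan", failed)

def pdca(goal, max_cycles=3):
    steps = plan(goal)
    for _ in range(max_cycles):
        evaluations = [check(execute(s), goal) for s in steps]
        status, steps = reflect(evaluations)
        if status == "done":
            return status
    return "gave up"

outcome = pdca("ship feature")
```

The structure matters more than the stubs: evaluation output (severity, source) is exactly what the reflection step consumes when it decides whether to stop or re-plan.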

As intelligent agents, LLMs have sparked thoughts about the relationship between artificial intelligence and human work, and about future development. They make us ponder how humans can collaborate with intelligent agents to work more efficiently, and in turn reflect on the distinctive value and strengths of humans themselves.

🔥Architecture


🔥Memory

  • Short-term Memory: in-context learning. It is temporary and limited by the Transformer's context-window length.

  • Long-term Memory: an external vector store that the Agent consults at query time via fast retrieval.

🔥Reflection

Reflection is higher-level, more abstract thinking generated by the Agent. Because reflections are themselves a form of memory, they are included alongside other observations during retrieval.

Reflection is generated periodically: it is triggered when the sum of the importance scores of the most recent events perceived by the Agent exceeds a certain threshold.

  • Let the Agent decide what to reflect on.

  • Generate questions and use them as queries for retrieval.
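The importance-threshold trigger can be sketched directly. The scores, the threshold value, and the `reflect` stand-in are all illustrative:

```python
# Sketch of the periodic reflection trigger: when the summed importance
# of events since the last reflection crosses a threshold, reflect.

REFLECTION_THRESHOLD = 10

class ReflectingAgent:
    def __init__(self):
        self.recent = []                 # (event, importance) since last reflection
        self.reflections = []

    def observe(self, event, importance):
        self.recent.append((event, importance))
        if sum(score for _, score in self.recent) >= REFLECTION_THRESHOLD:
            self.reflect()

    def reflect(self):
        # Stand-in for: choose what to reflect on, generate questions,
        # retrieve relevant memories, and summarize at a higher level.
        summary = "reflection on: " + ", ".join(e for e, _ in self.recent)
        self.reflections.append(summary)
        self.recent = []                 # reset the accumulator

agent = ReflectingAgent()
agent.observe("greeted user", 2)
agent.observe("user reported a bug", 7)      # 2 + 7 = 9, below threshold
agent.observe("bug crashed production", 9)   # 9 + 9 >= 10 -> reflect
```

Because reflections go back into memory, a real implementation would also call `memory.add(summary)` so later retrievals can surface them.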

🔥Planning

Planning produces longer-horizon plans. Like reflections, plans are stored in the memory stream (a third type of memory) and included in retrieval, so the Agent considers observations, reflections, and plans together when deciding how to act. If needed, the Agent can change its plans midway (i.e., react).

🔥Various Concepts in LangChain

  • Models: calls to the familiar large-model APIs.

  • Prompt Templates: prompts with variables, adapted to user input.

  • Chains: chained model calls, where the previous output forms part of the next input.

  • Agent: autonomously executes chained calls and can access external tools.

  • Multi-Agent: multiple Agents share part of their memory and collaborate autonomously.
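A framework-free sketch of how the first three concepts fit together (the `fake_llm`, class names, and template strings are illustrative, not actual LangChain code):

```python
# Sketch of Models / Prompt Templates / Chains: a template injects
# variables into a prompt, and a chain feeds each output into the next call.

def fake_llm(prompt):
    # Stand-in for a real large-model API call.
    return f"ANSWER({prompt})"

class PromptTemplate:
    def __init__(self, template):
        self.template = template            # e.g. "Summarize: {text}"

    def format(self, **variables):
        return self.template.format(**variables)

class Chain:
    def __init__(self, llm, templates):
        self.llm = llm
        self.templates = templates

    def run(self, text):
        # Each step's output becomes part of the next step's input.
        for template in self.templates:
            text = self.llm(template.format(text=text))
        return text

chain = Chain(fake_llm, [PromptTemplate("Summarize: {text}"),
                         PromptTemplate("Translate: {text}")])
result = chain.run("agents are useful")
```

An Agent, in these terms, is a chain whose next step is chosen by the model at runtime (possibly a tool call) rather than fixed in advance.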


🔥Bottlenecks in Implementing Agents

Agents rely on two capabilities. One is the LLM serving as the "brain" (its "IQ"); the other is an external controller built around the LLM that completes the various prompts, such as enhancing memory through retrieval, obtaining feedback from the environment, and performing reflection. An Agent needs both the brain and the external scaffolding.
  • Issues with the LLM itself: insufficient "IQ" can be addressed by upgrading the LLM (e.g., to GPT-5); incorrect prompting can be addressed by making questions unambiguous.

  • External tools: insufficient systematization; calling external tool systems remains a long-unresolved issue.

Currently, the implementation of Agents requires not only a sufficiently general LLM but also a universal external logic framework. It’s not just an “IQ” issue, but also how to leverage external tools.

From specialized to general: this is the more important question.

Solve specific problems in specific scenarios: use the LLM as a general brain and design prompts for different roles to accomplish specialized tasks, rather than building universal applications. The key issue, feedback, will become the major constraint on deploying Agents; the success rate of complex tool use will be very low.

🔥Path to Implementing Agents from Specialized to General

Assuming Agents will eventually land in 100 different environments, and given that even the simplest external applications are hard to get right, is it possible to abstract a framework that solves all external generality issues?
First make an Agent excellent in one specific scenario, sufficiently stable and robust, and then gradually generalize it into a framework. That may be one path to realizing a general Agent.

🔥Multimodality in the Development of Agents

  • Multimodality can only solve perception issues for Agents, but cannot address cognitive issues.

  • Multimodality is an inevitable trend; future large models will inevitably be multimodal, and future Agents will also be Agents in a multimodal world.

🔥New Consensus on Agents is Gradually Forming

  • Agents need to call external tools.

  • The way to call tools is to output code.

The LLM brain outputs executable code, acting like a semantic parser: it understands the meaning of each sentence and converts it into machine instructions that call external tools or generate answers. Although the current Function Call format still needs improvement, calling tools this way is essential and is the most thorough means of addressing hallucination.
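The pattern can be sketched with a structured (JSON) tool call. The `fake_llm`, the `get_weather` tool, and the call format are illustrative, not a specific vendor's Function Call API:

```python
import json

# Sketch of tool calling via structured output: the "LLM" emits a JSON
# function call, and a controller parses and executes it.

def get_weather(city):
    # Stand-in for a real external API call.
    return f"sunny in {city}"

TOOLS = {"get_weather": get_weather}

def fake_llm(user_message):
    # A real LLM would choose the tool and arguments; hard-coded here.
    return json.dumps({"tool": "get_weather",
                       "arguments": {"city": "Beijing"}})

def dispatch(user_message):
    call = json.loads(fake_llm(user_message))      # parse the emitted call
    tool = TOOLS[call["tool"]]                     # look up the tool
    return tool(**call["arguments"])               # execute with its arguments

answer = dispatch("What's the weather in Beijing?")
```

The anti-hallucination benefit comes from the boundary: the model only proposes a call, and the factual answer comes from the tool's actual return value.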

🔥Going Out to Ask: Hope to Create a General Agent

In the Chinese market, an Agent deeply integrated with an enterprise ultimately becomes "outsourcing": it requires private deployment and integration into enterprise workflows. Many companies will compete for large clients in the insurance, banking, and automotive sectors. This closely resembles the fate of the previous generation of AI companies: marginal costs are hard to reduce, and generality is lacking.

Current AIGC products such as Magic Sound Workshop and Wonderful Text target content creators and sit between deep and shallow applications, belonging fully to neither consumers nor enterprises. They also target enterprise users with a Copilot, seeking specific "scenarios" inside enterprises in which to build relatively general scenario applications.

🔥HF: Transformers Agents Released

Control over more than 100,000 HF models through natural language.

Transformers Agents was added in Transformers 4.29. It provides a natural-language API on top of Transformers to "make Transformers do anything".

There are two concepts here: the Agent and Tools. A set of default tools is defined, and the Agent is prompted to understand natural language and use these tools.

  • Agent: here, the large language model (LLM). You can use OpenAI's models (which require an API key) or open-source models such as StarCoder and OpenAssistant; the Agent is prompted with access to a specific set of tools.

  • Tools: individual functions. A set of tools is defined, and their descriptions are used to prompt the Agent, showing how it can use them to carry out what the query requests.

The Transformers toolkit includes document Q&A, text Q&A, image captioning, image Q&A, image segmentation, speech-to-text, text-to-speech, zero-shot text classification, text summarization, translation, and more. You can also extend it with tools unrelated to Transformers, such as reading text from the web, and develop custom tools.
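The tools-plus-descriptions idea can be sketched without the library: collect each tool's description into the prompt so the model knows what it may call. The tools, registry, and prompt format below are made up for illustration, not the actual Transformers Agents internals:

```python
# Sketch of the Tools concept: each tool is a function with a description
# (its docstring), and the descriptions are assembled into the prompt
# that instructs the Agent.

def summarize(text):
    """Summarizes a long text."""
    return text[:20] + "..."

def translate(text):
    """Translates text to English."""
    return f"[en] {text}"

TOOLS = [summarize, translate]

def build_agent_prompt(tools, query):
    # The Agent is prompted with the tool names and descriptions,
    # then asked to solve the query using them.
    lines = ["You can use the following tools:"]
    for tool in tools:
        lines.append(f"- {tool.__name__}: {tool.__doc__}")
    lines.append(f"Task: {query}")
    return "\n".join(lines)

prompt = build_agent_prompt(TOOLS, "Summarize this article")
```

Adding a custom tool then amounts to writing one documented function and appending it to the registry; the prompt updates itself.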

The future belongs to Agents. Yet today's Agent wave repeats yesterday's AI stories: private deployment will face the same challenges.

✅I am Boss Xue, continuously sharing valuable content related to AIGC product manager job seeking. More free valuable materials await you.

Essential for Career Transition / Job Seeking
If you want to enter the AIGC field as a product manager, we recommend our "AIGC Product Practical Training Camp".
Target audience: people with zero experience seeking to transition into AIGC product management.
Project advantages:
1) Small classes of about 10 people per session, with personalized attention and 1-on-1 background diagnosis and project-direction customization.
2) A strong course system: the course covers machine learning, reinforcement learning, deep learning, and large-model topics in depth, so those without an algorithm background need not worry.
3) Hands-on projects focus on areas with high recruitment volume, high success rates, and high salaries: a dialogue (chatbot) project and an image project (similar to the "wonderful duck" camera). Both are real projects, not simulated ones.
4) Senior AI interviewers from Baidu provide 1-on-1 resume editing and mock-interview services at no extra cost.
5) Questions can be freely asked in the exclusive 2-on-1 service group.
6) If you miss a live broadcast, recordings are available for playback after each session.
7) Additional benefit: free retakes. If you feel one session didn't sink in, we currently offer free retakes.
Detailed explanation:
We sincerely suggest diving into this emerging field with high salaries and promising prospects [AIGC Edition · 6th Iteration Update].
