Recently, AutoGPT and HuggingGPT have become extremely popular. Both use a large model such as ChatGPT to make decisions automatically and to call other models, achieving a high degree of automation and expanding the application scope of large models.
What matters most to us, however, is how they are actually implemented and how they work internally.
This article therefore centers on AutoGPT and HuggingGPT, walking through their working principles and code for reference.
1. Overview of HuggingGPT Working Principles
Solving complex artificial intelligence tasks across different fields and modalities is a key step toward advanced artificial intelligence. Although a rich set of AI models is available for various fields and modalities, none of them can handle complex AI tasks on its own.
Given the outstanding capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning, LLMs can act as a controller that manages existing AI models to tackle complex AI tasks, with language serving as a universal interface to empower this.
Based on this concept, HuggingGPT was proposed, which is a framework that utilizes LLMs (like ChatGPT) to connect various AI models within the machine learning community (like Hugging Face) to solve AI tasks.
Specifically, ChatGPT is used for task planning upon receiving user requests, selecting models based on functionality descriptions from Hugging Face, executing each sub-task with the chosen AI model, and summarizing the response.
Benefiting from such a design, HuggingGPT can utilize external models, thus integrating multimodal perception capabilities to handle multiple complex AI tasks.
Moreover, this pipeline allows HuggingGPT to continue harnessing the expertise of specific task specialists, achieving scalable and expandable AI capabilities.
So far, HuggingGPT has integrated hundreds of models around ChatGPT on Hugging Face, covering 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video.
Experimental results demonstrate HuggingGPT’s capabilities in processing multimodal information and complex AI tasks.
Paper Title: “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace”
Paper Link: https://arxiv.org/abs/2303.17580

1. Basic Steps
The entire process of HuggingGPT can be divided into four stages:


1) Task Planning
Utilizing ChatGPT to analyze user requests, understand their intentions, and decompose them into potentially solvable tasks through prompts.

In the first stage of HuggingGPT, the large language model receives requests from users and decomposes them into a series of structured tasks.
Complex requests often involve multiple tasks, and the large language model needs to determine the dependencies and execution order of these tasks.
To prompt the large language model toward effective task planning, HuggingGPT uses both specification-based instructions and demonstration-based parsing in its prompt design.
For the specification-based instructions:
the task specification provides a unified template for tasks and allows the large language model to parse tasks through slot filling.
HuggingGPT designs four slots for task parsing: task type, task ID, task dependencies, and task parameters:
- Task ID: a unique identifier for task planning, used to reference dependent tasks and the resources they generate.
- Task type: covers the different tasks in language, vision, video, audio, and so on.
- Task dependencies: the prerequisite tasks required before execution; a task starts only after all of its prerequisite tasks are completed.
- Task parameters: the list of parameters required to execute the task. It contains three sub-fields filled with text, image, and audio resources according to the task type, resolved either from the user request or from the resources generated by dependent tasks.

These task specifications, combined with in-context demonstrations, form the final task-planning prompt.

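The full prompt used in the paper is fairly long; the following is a simplified sketch of the specification part, reconstructed from the four slots above (the "<GENERATED>-dep_id" notation is the paper's way of referring to a resource produced by a dependency task):

The AI assistant parses the user input into several tasks:
[{"task": task, "id": task_id, "dep": dependency_task_ids,
  "args": {"text": text or "<GENERATED>-dep_id",
           "image": image_url or "<GENERATED>-dep_id",
           "audio": audio_url or "<GENERATED>-dep_id"}}]
The "dep" field lists the ids of prerequisite tasks whose outputs the current task relies on
("dep": [-1] means the task has no dependency), and "<GENERATED>-dep_id" refers to the text,
image, or audio generated by the dependency task with that id.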
HuggingGPT provides these task specifications as high-level instructions to the large language model for analyzing user requests and parsing tasks accordingly.
For example:
For text: “Look at /exp1.jpg, Can you tell me how many objects in the picture?”
The parsing result is:
[
    {
        "task": "image-to-text",
        "id": 0,
        "dep": [-1],
        "args": {"image": "/exp1.jpg"}
    },
    {
        "task": "object-detection",
        "id": 1,
        "dep": [-1],
        "args": {"image": "/exp1.jpg"}
    }
]
For text: “In /exp2.jpg, what’s the animal and what’s it doing?”
The parsing result is:
[
    {
        "task": "image-to-text",
        "id": 0,
        "dep": [-1],
        "args": {"image": "/exp2.jpg"}
    },
    {
        "task": "image-classification",
        "id": 1,
        "dep": [-1],
        "args": {"image": "/exp2.jpg"}
    },
    {
        "task": "object-detection",
        "id": 2,
        "dep": [-1],
        "args": {"image": "/exp2.jpg"}
    },
    {
        "task": "visual-question-answering",
        "id": 3,
        "dep": [-1],
        "args": {"text": "What's the animal doing?", "image": "/exp2.jpg"}
    }
]
Additionally, context management in dialogues is crucial for chatbots as it provides chat history to facilitate understanding of user requests.
2) Model Selection
To address the planned tasks, ChatGPT selects expert models hosted on Hugging Face based on model descriptions.

After parsing the task list, HuggingGPT next needs to match tasks and models, selecting appropriate models for each task in the task list.
To this end, it first obtains descriptions of expert models from the Hugging Face Hub and then dynamically selects models for tasks through a contextual task-model assignment mechanism.
This approach allows for progressive model access (only requiring descriptions of expert models) and is more open and flexible.

In the concrete implementation, models are first filtered by task type, keeping only those that match the current task; the remaining models are then ranked by their download counts on Hugging Face, and the top-K models are kept as the candidates from which HuggingGPT chooses, as shown below:
https://raw.githubusercontent.com/microsoft/JARVIS/2c19142b56663a54b2c85f8622b38f98c5b2580f/server/data/p0_models.jsonl
This file stores basic information about each model's interface, and its entries serve as the candidate models. For example:
{
"downloads": 1677372,
"id": "ProsusAI/finbert",
"likes": 186,
"pipeline_tag": "text-classification",
"task": "text-classification",
"meta": {"language": "en", "tags": ["financial-sentiment-analysis", "sentiment-analysis"], "widget": [{"text": "Stocks rallied and the British pound gained."}]},
"description": "\n\nFinBERT is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification. [Financial PhraseBank](https://www.researchgate.net/publication/251231107_Good_Debt_or_Bad_Debt_Detecting_Semantic_Orientations_in_Economic_Texts) by Malo et al. (2014) is used for fine-tuning. For more details, please see the paper [FinBERT: Financial Sentiment Analysis with Pre-trained Language Models](https://arxiv.org/abs/1908.10063) and our related [blog post](https://medium.com/prosus-ai-tech-blog/finbert-financial-sentiment-analysis-with-bert-b277a3607101) on Medium.\n\nThe model will give softmax outputs for three labels: positive, negative or neutral.\n\n"
}
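The selection logic can be approximated in a few lines of Python. This is a minimal sketch (not the repository's exact code) that filters the p0_models.jsonl entries above by task type, ranks them by downloads, and keeps the top K candidates:

import json

def select_candidates(jsonl_path, task_type, top_k=5):
    # Load model metadata entries (one JSON object per line).
    with open(jsonl_path, "r", encoding="utf-8") as f:
        models = [json.loads(line) for line in f if line.strip()]
    # Keep only models whose task matches the current sub-task.
    matched = [m for m in models if m.get("task") == task_type]
    # Rank by Hugging Face download count and keep the top K as candidates.
    matched.sort(key=lambda m: m.get("downloads", 0), reverse=True)
    return matched[:top_k]

# The candidates' id / description fields are then placed into the
# model-selection prompt, from which the LLM picks one model per task.
candidates = select_candidates("p0_models.jsonl", "text-classification", top_k=5)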
3) Task Execution
Invoke and execute each selected model, returning the results to ChatGPT.

Once a task has been assigned to a specific model, the next step is to execute it, i.e., to perform model inference.
To improve speed and computational stability, HuggingGPT runs these models on hybrid inference endpoints: commonly used models are deployed locally, while the rest are called through the Hugging Face Inference API. The model takes the task arguments as input, computes the inference result, and sends it back to the large language model.
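"Hybrid inference endpoints" therefore means a model may be served either by a locally deployed instance or by the Hugging Face Inference API. The repository's own routing differs in detail, but the idea can be sketched as follows (assuming a hypothetical local model server exposing POST /models/<model_id>, like the excerpt further below); the fallback function huggingface_model_inference is shown right after:

import requests

def model_inference(model_id, data, task, local_host="http://localhost:8005"):
    # Prefer the locally deployed model (faster, more stable) when it responds,
    # otherwise fall back to the Hugging Face Inference API.
    try:
        r = requests.post(f"{local_host}/models/{model_id}", json=data, timeout=300)
        if r.status_code == 200:
            return r.json()
    except requests.RequestException:
        pass
    return huggingface_model_inference(model_id, data, task)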
import base64
import requests
from huggingface_hub.inference_api import InferenceApi

# `config`, `HUGGINGFACE_HEADERS` and `image_to_bytes` are defined elsewhere in the repository.
def huggingface_model_inference(model_id, data, task):
    task_url = f"https://api-inference.huggingface.co/models/{model_id}"  # InferenceApi does not yet support some tasks
    inference = InferenceApi(repo_id=model_id, token=config["huggingface"]["token"])

    # NLP tasks
    if task == "question-answering":
        inputs = {"question": data["text"], "context": (data["context"] if "context" in data else "")}
        result = inference(inputs)
    if task == "sentence-similarity":
        inputs = {"source_sentence": data["text1"], "target_sentence": data["text2"]}
        result = inference(inputs)
    if task in ["text-classification", "token-classification", "text2text-generation", "summarization", "translation", "conversational", "text-generation"]:
        inputs = data["text"]
        result = inference(inputs)

    # CV tasks
    if task == "visual-question-answering" or task == "document-question-answering":
        img_url = data["image"]
        text = data["text"]
        img_data = image_to_bytes(img_url)
        img_base64 = base64.b64encode(img_data).decode("utf-8")
        json_data = {}
        json_data["inputs"] = {}
        json_data["inputs"]["question"] = text
        json_data["inputs"]["image"] = img_base64
        # These multimodal tasks are not yet supported by InferenceApi, so the raw HTTP endpoint is called directly.
        result = requests.post(task_url, headers=HUGGINGFACE_HEADERS, json=json_data).json()
        # result = inference(inputs) # not support
To further improve inference efficiency, models without resource dependencies can be parallelized, meaning multiple tasks that satisfy prerequisite dependencies can be initiated simultaneously.
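As a minimal sketch (assuming the parsed task-list format shown earlier and a hypothetical run_task(task, results) helper that performs inference for a single task), independent tasks can be dispatched together with a thread pool once their dependencies have finished:

from concurrent.futures import ThreadPoolExecutor

def execute_plan(tasks, run_task):
    results = {}                           # finished task id -> inference result
    pending = {t["id"]: t for t in tasks}
    with ThreadPoolExecutor() as pool:
        while pending:
            # A task is ready when every dependency is done ("dep" of [-1] means no dependency).
            ready = [t for t in pending.values()
                     if all(d == -1 or d in results for d in t["dep"])]
            if not ready:
                raise RuntimeError("unsatisfiable task dependencies")
            # Launch all ready tasks in parallel and wait for this batch to finish.
            futures = {t["id"]: pool.submit(run_task, t, results) for t in ready}
            for task_id, future in futures.items():
                results[task_id] = future.result()
                del pending[task_id]
    return results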
# Excerpt from the local model server's inference endpoint (Flask): `request` is the incoming
# HTTP request, and `pipes[model_id]` caches each locally deployed model's model, processor and device.
pipe = pipes[model_id]["model"]

# image to text
if model_id == "Salesforce/blip-image-captioning-large":
    raw_image = load_image(request.get_json()["img_url"]).convert('RGB')
    text = request.get_json()["text"]
    inputs = pipes[model_id]["processor"](raw_image, return_tensors="pt").to(pipes[model_id]["device"])
    out = pipe.generate(**inputs)
    caption = pipes[model_id]["processor"].decode(out[0], skip_special_tokens=True)
    result = {"generated text": caption}

# text to image
if model_id == "runwayml/stable-diffusion-v1-5":
    file_name = str(uuid.uuid4())[:4]
    text = request.get_json()["text"]
    out = pipe(prompt=text)
    out["images"][0].save(f"public/images/{file_name}.jpg")
    result = {"path": f"/images/{file_name}.jpg"}

# VQA (visual question answering)
if model_id == "dandelin/vilt-b32-finetuned-vqa":
    question = request.get_json()["text"]
    img_url = request.get_json()["img_url"]
    result = pipe(question=question, image=img_url)

# DQA (document question answering)
if model_id == "impira/layoutlm-document-qa":
    question = request.get_json()["text"]
    img_url = request.get_json()["img_url"]
    result = pipe(img_url, question)
4) Response Generation
Finally, use ChatGPT to integrate the predictions of all models and generate answers for the user.

The core idea is to collect the results of the preceding stages into a prompt and let the LLM generate the final summary.
Once all task executions are completed, HuggingGPT enters the response generation stage.
In this stage, HuggingGPT integrates all information from the previous three stages (task planning, model selection, and task execution) into a concise summary, including the planned task list, the models selected for the tasks, and the inference results from the models.
Among these, the inference results are the most important, as they support HuggingGPT’s final decision.
The inference results appear in a structured format, such as bounding boxes with detection probabilities in object detection models, answer distributions in question-answering models, etc.
HuggingGPT allows the LLM to receive these structured inference results as input and generate responses in a friendly human language form.
Moreover, the LLM does not merely summarize the results, but generates responses that positively address user requests, providing reliable decisions with confidence.
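A minimal sketch of this stage, assuming the intermediate results are held in plain Python dicts and llm is any callable that sends a prompt to the LLM (the wording of the summary prompt is illustrative, not the paper's exact prompt):

def generate_response(llm, user_request, tasks, selections, inference_results):
    # Gather everything produced by task planning, model selection and task execution.
    summary = {
        "user_request": user_request,            # the original request
        "task_planning": tasks,                  # the parsed task list
        "model_selection": selections,           # task id -> chosen model id
        "inference_results": inference_results,  # task id -> structured model output
    }
    prompt = (
        "Based on the planned tasks, the selected models and their inference results below, "
        "answer the user's request directly in friendly natural language:\n"
        f"{summary}"
    )
    return llm(prompt)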
2. Implementation Effects
In the experiments, the GPT model variants gpt-3.5-turbo and text-davinci-003 were used as the large language models; both are publicly accessible through the OpenAI API.
To make the LLM's output more stable, the decoding temperature was set to 0, and to steer the output toward the expected format, a logit_bias of 0.2 was applied to the format-constraint tokens.
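As an illustration of what these settings correspond to in an API call (not the repository's exact code; which tokens to bias is an assumption here, and the token ids are computed with a tokenizer such as tiktoken), using the openai 0.x client of that period:

import openai
import tiktoken

# openai.api_key = "..."  # set your API key
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Bias the tokens that make up the required output format (illustrative token choice).
format_bias = {str(tok): 0.2
               for piece in ['"task"', '"id"', '"dep"', '"args"']
               for tok in enc.encode(piece)}

# task_planning_prompt: the prompt assembled in the task-planning stage (assumed defined elsewhere).
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": task_planning_prompt}],
    temperature=0,           # deterministic decoding for stable outputs
    logit_bias=format_bias,  # nudge generation toward the expected format tokens
)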
However, a user request may contain multiple implicit tasks or require information from several aspects, in which case a single expert model is not sufficient.
To overcome this challenge, HuggingGPT organizes the collaboration of multiple models through task planning.
For example: In Figures 3, 4, and 5, several dialogue demonstrations are shown, where in each demonstration, the user inputs a request that may contain multiple tasks or multimodal resources. HuggingGPT then relies on the LLM to organize the collaboration of multiple expert models to produce responses to the user.

We can find that even with multiple resources, HuggingGPT can decompose the main task into several basic tasks and finally integrate the inference results from multiple models to arrive at the correct answer.
Figure 3 shows HuggingGPT’s workflow in situations where there are resource dependencies between tasks. In this case, HuggingGPT can parse specific tasks from the user’s abstract requests, including pose detection, image description, and pose-conditioned image generation tasks.
Additionally, HuggingGPT successfully identified the dependencies between task 3 and tasks 1 and 2, injecting the inference results of tasks 1 and 2 into the input parameters of task 3 after the dependent tasks were completed.
In two further cases, HuggingGPT completed the user's text-to-audio and text-to-video requests through expert models. In the former case, the two models were executed in parallel (generating the audio and the video simultaneously), while in the latter, the two models were executed serially (first generating text from the image, then generating audio based on that text).
This further validates that HuggingGPT can organize collaboration between models and the resource dependencies between tasks.
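The injection of dependent results can be pictured as replacing the "<GENERATED>-dep_id" placeholders in a task's arguments with the outputs of its prerequisite tasks; a minimal sketch (assuming results are keyed by task id and that each result exposes its resource under the same argument key; the real system additionally matches resource types):

def resolve_args(task, results):
    # Replace "<GENERATED>-<dep_id>" placeholders with resources produced by dependency tasks.
    resolved = {}
    for key, value in task["args"].items():
        if isinstance(value, str) and value.startswith("<GENERATED>-"):
            dep_id = int(value.split("-")[1])
            resolved[key] = results[dep_id][key]  # e.g. a generated image path or text
        else:
            resolved[key] = value
    return resolved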
Figures 6 and 7 showcase HuggingGPT tested on a wide range of multimodal tasks.

Figures 8, 9, and 10 evaluate HuggingGPT's effectiveness on complex tasks, where the user breaks a complex request into several steps and reaches the final goal through multiple rounds of requests.
It can be found that HuggingGPT can track the contextual state of user requests through dialogue context management in the task planning stage and can effectively address the resources and task planning mentioned by users.

3. A Simple Reproduction Project
ALL-IN-ONE: A Reproduction of HuggingGPT
Because ChatGPT cannot be accessed legally from within China, Hugging Face repositories are difficult to reach, and downloading and running these models requires considerable resources, this project replaces ChatGPT with ChatGLM and uses Baidu AI's APIs in place of the various Hugging Face models, reproducing HuggingGPT.
The advantage is low resource consumption and fast response; the disadvantage is fewer capabilities, since Baidu AI offers far fewer models than Hugging Face.
https://github.com/SolarWindRider/All-In-One
2. Overview of AutoGPT Working Principles
A blog (link: https://www.zhihu.com/question/595382995/answer/2985265957) has a vivid analogy for AutoGPT:
Suppose that using these models is like driving a car:
The first stage’s basic model is like a road atlas, containing a wealth of information, but you still have to manually flip through the index to check it;
The second stage’s dialogue model is akin to navigation software, which tells you how to go after you input your desired destination, but you still have to drive the car yourself;
The third stage’s application integration is equivalent to an autonomous vehicle; just input your destination, and it can take you directly there.
AutoGPT is this autonomous vehicle.
The core of AutoGPT lies in sending our request to GPT-4 and letting GPT-4 choose which of the specified COMMANDS to perform.
For example, AutoGPT sends a question such as "Find the hottest AI tweet on Twitter today" to GPT-4 and asks it to select, from these COMMANDS, the most suitable way to obtain the answer, filling in the parameters each COMMAND needs (such as a URL or code to execute). AutoGPT then executes the command suggested by GPT-4 and continues the loop based on the returned results.
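In terms of the messages actually sent, this amounts to something like the sketch below (simplified; in the repository, chat.chat_with_ai assembles the prompt and also injects relevant memory and recent history; full_prompt here stands for the agent's name, role, GOALS and the contents of ./data/prompt.txt):

import openai

messages = [
    {"role": "system", "content": full_prompt},
    # chat_with_ai also inserts relevant long-term memory and recent conversation here.
    {"role": "user", "content": "Determine which next command to use, "
                                "and respond using the format specified above:"},
]
reply = openai.ChatCompletion.create(model="gpt-4", messages=messages)
assistant_reply = reply["choices"][0]["message"]["content"]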
The diagram below shows the entire data flow after the user provides a goal and tasks.

In terms of implementation principles, AutoGPT consists of several steps:
1. Pose a question, i.e., set a goal;
2. AutoGPT decomposes the main task based on the set goal;
3. Execute each main task separately and obtain results;
4. After the main tasks are completed, execute additional added sub-tasks.

The diagram above reflects the overall flow of the source code; a core piece of it is the corresponding prompt, which can be found in the "./data/prompt.txt" file.
CONSTRAINTS:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"
COMMANDS:
1. Google Search: "google", args: "input": "<search>"
2. Memory Add: "memory_add", args: "string": "<string>"
3. Memory Delete: "memory_del", args: "key": "<key>"
4. Memory Overwrite: "memory_ovr", args: "key": "<key>", "string": "<string>"
5. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
6. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
7. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
8. List GPT Agents: "list_agents", args: ""
9. Delete GPT Agent: "delete_agent", args: "key": "<key>"
10. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
11. Read file: "read_file", args: "file": "<file>"
12. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
13. Delete file: "delete_file", args: "file": "<file>"
14. Search Files: "search_files", args: "directory": "<directory>"
15. Evaluate Code: "evaluate_code", args: "code": "<full_code_string>"
16. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
17. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
18. Execute Python File: "execute_python_file", args: "file": "<file>"
19. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"
20. Generate Image: "generate_image", args: "prompt": "<prompt>"
21. Do Nothing: "do_nothing", args: ""
RESOURCES:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.
PERFORMANCE EVALUATION:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.
You should only respond in JSON format as described below
RESPONSE FORMAT:
{
    "thoughts":
    {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
}
Ensure the response can be parsed by Python json.loads
The main workflow:
For example, given:
You are 我是GPT ("I am GPT"), an assistant for how to use LangChain. Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.
GOALS:
1. Tell me how to use LangChain.
2. Help me install LangChain.
3. Give me some examples of LangChain that can run directly on my computer.
4. Answer all questions in Chinese.
AutoGPT receives the following answer:
{
    "thoughts": {
        "text": "LangChain is a programming language that can be used for blockchain development. To use it, we need to first download and install it on our computer. After installation, we can run some code examples to test it out. ",
        "reasoning": "I reviewed the purpose and usage of LangChain and suggested a plan for installation and testing",
        "plan": "- Download and install LangChain.\n- Run code examples to test it out.",
        "criticism": "I should have provided more specific detailed steps on how to download and install LangChain",
        "speak": "We need to download and install LangChain before we can use it. Do you have any experience with this?"
    },
    "command": {
        "name": "browse_website",
        "args": {
            "url": "https://langchain.org/docs/getting-started/",
            "question": "How to download and install LangChain?"
        }
    }
}
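Given a reply in this format, extracting the next action is straightforward; the sketch below approximates what cmd.get_command does in the repository (simplified, without its handling of malformed JSON):

import json

def get_command(assistant_reply):
    # Parse the JSON reply and pull out the command name and its arguments.
    response = json.loads(assistant_reply)
    command = response["command"]
    return command["name"], command["args"]

# For the reply above this returns ("browse_website", {"url": ..., "question": ...}).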
The complete process, shown below, is essentially a loop:
# Make a constant:
user_input = "Determine which next command to use, and respond using the format specified above:"

# Initialize memory and make sure it is empty.
# this is particularly important for indexing and referencing pinecone memory
memory = get_memory(cfg, init=True)
print('Using memory of type: ' + memory.__class__.__name__)

# Interaction Loop
while True:
    # Send message to AI, get response
    with Spinner("Thinking... "):
        assistant_reply = chat.chat_with_ai(
            prompt,
            user_input,
            full_message_history,
            memory,
            cfg.fast_token_limit)  # TODO: This hardcodes the model to use GPT3.5. Make this an argument

    # Print Assistant thoughts
    print_assistant_thoughts(assistant_reply)

    # Get command name and arguments
    try:
        command_name, arguments = cmd.get_command(assistant_reply)
    except Exception as e:
        print_to_console("Error: \n", Fore.RED, str(e))

    if not cfg.continuous_mode and next_action_count == 0:
        ### GET USER AUTHORIZATION TO EXECUTE COMMAND ###
        # Get key press: Prompt the user to press enter to continue or escape
        # to exit
        user_input = ""
        print_to_console(
            "NEXT ACTION: ",
            Fore.CYAN,
            f"COMMAND = {Fore.CYAN}{command_name}{Style.RESET_ALL} ARGUMENTS = {Fore.CYAN}{arguments}{Style.RESET_ALL}")
        print(
            f"Enter 'y' to authorise command, 'y -N' to run N continuous commands, 'n' to exit program, or enter feedback for {ai_name}...",
            flush=True)
        while True:
            console_input = utils.clean_input(Fore.MAGENTA + "Input:" + Style.RESET_ALL)
            if console_input.lower() == "y":
                user_input = "GENERATE NEXT COMMAND JSON"
                break
            elif console_input.lower().startswith("y -"):
                try:
                    next_action_count = abs(int(console_input.split(" ")[1]))
                    user_input = "GENERATE NEXT COMMAND JSON"
                except ValueError:
                    print("Invalid input format. Please enter 'y -n' where n is the number of continuous tasks.")
                    continue
                break
            elif console_input.lower() == "n":
                user_input = "EXIT"
                break
            else:
                user_input = console_input
                command_name = "human_feedback"
                break

        if user_input == "GENERATE NEXT COMMAND JSON":
            print_to_console(
                "-=-=-=-=-=-=-= COMMAND AUTHORISED BY USER -=-=-=-=-=-=-=",
                Fore.MAGENTA,
                "")
        elif user_input == "EXIT":
            print("Exiting...", flush=True)
            break
    else:
        # Print command
        print_to_console(
            "NEXT ACTION: ",
            Fore.CYAN,
            f"COMMAND = {Fore.CYAN}{command_name}{Style.RESET_ALL} ARGUMENTS = {Fore.CYAN}{arguments}{Style.RESET_ALL}")

    # Execute command
    if command_name is not None and command_name.lower().startswith("error"):
        result = f"Command {command_name} threw the following error: " + arguments
    elif command_name == "human_feedback":
        result = f"Human feedback: {user_input}"
    else:
        result = f"Command {command_name} returned: {cmd.execute_command(command_name, arguments)}"
        if next_action_count > 0:
            next_action_count -= 1

    memory_to_add = f"Assistant Reply: {assistant_reply} " \
                    f"\nResult: {result} " \
                    f"\nHuman Feedback: {user_input} "

    memory.add(memory_to_add)

    # Check if there's a result from the command append it to the message
    # history
    if result is not None:
        full_message_history.append(chat.create_chat_message("system", result))
        print_to_console("SYSTEM: ", Fore.YELLOW, result)
    else:
        full_message_history.append(
            chat.create_chat_message(
                "system", "Unable to execute command"))
        print_to_console("SYSTEM: ", Fore.YELLOW, "Unable to execute command")
Summary
This article mainly introduces the two decision models, AutoGPT and HuggingGPT.
Finally, let’s look at the differences:
HuggingGPT aims to complete a specific complex task by calling on the interfaces of many AI models; it is more of a solution to a technical problem.
AutoGPT is more like a decision-making robot. The range of actions it can take is broader than that of AI models alone, since it integrates capabilities such as Google search, web browsing, and code execution.
However, the ideas behind these two models are quite flexible. I previously thought about incorporating knowledge graphs as strong rules, which seems feasible as well.
References
1. https://arxiv.org/pdf/2303.17580.pdf
2. https://zhuanlan.zhihu.com/p/621217077
About Us
Old Liu (Liu Huanyong) is an NLP open-source enthusiast and practitioner; homepage: https://liuhuanyong.github.io.
"Old Liu Says NLP" regularly publishes language resources, engineering practices, technical summaries, and more; you are welcome to follow.
To join higher-quality practice and sharing on knowledge graphs and event graphs, follow the public account and use the member community entry in its menu.