ChatGPT has become the manager of hundreds of models.
In recent months, the surge in popularity of ChatGPT and GPT-4 has showcased the extraordinary capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning. This has drawn significant attention from both academia and industry, revealing the potential of LLMs in constructing general artificial intelligence (AGI) systems.
To achieve AGI, LLMs face numerous challenges, including:
- Limited to text as their input and output form, current LLMs cannot handle complex information such as visual and auditory data;
- In real-world scenarios, complex tasks often consist of multiple subtasks, requiring the scheduling and collaboration of multiple models, which exceeds the capabilities of a language model alone;
- On some challenging tasks, LLMs deliver excellent results in zero-shot or few-shot settings, but they still fall short of specialized fine-tuned models.
Among these challenges, the most crucial is that achieving AGI requires solving complex AI tasks across different domains and modalities, while existing AI models are mostly designed for specific tasks within specific domains.
Based on this, researchers from Zhejiang University and Microsoft Research Asia recently proposed a new method that allows LLMs to act as controllers, managing existing AI models to solve complex AI tasks, using language as a universal interface. The proposed system, HuggingGPT, connects various AI models in the machine learning community (such as HuggingFace) to tackle complex AI tasks.
Paper link: https://arxiv.org/abs/2303.17580
Project link: https://github.com/microsoft/JARVIS
Specifically, when HuggingGPT receives a user request, it uses ChatGPT for task planning, selects models based on the available functionality descriptions in HuggingFace, executes each subtask with the selected AI model, and summarizes the response based on the execution results. With ChatGPT’s powerful language capabilities and HuggingFace’s rich AI models, HuggingGPT can accomplish complex AI tasks covering different modalities and domains, achieving impressive results in challenging tasks involving language, vision, and audio. HuggingGPT paves a new path toward general artificial intelligence.
Let’s first look at some examples of tasks completed by HuggingGPT, including document question answering, image transformation, video generation, and audio generation:
Additionally, generating complex and detailed text descriptions for images:

To handle complex AI tasks, LLMs need to coordinate with external models to leverage their capabilities. Therefore, the key issue is how to select the appropriate middleware to bridge the connection between LLMs and AI models.
The study noted that each AI model can be represented in a linguistic form by summarizing its model functionality. Therefore, the research proposed a concept: “Language is the universal interface connecting LLMs and AI models.” By incorporating the text descriptions of AI models into the prompts, LLMs can be seen as the “brain” managing (including planning, scheduling, and collaborating) AI models.
Another challenge is that solving a large number of AI tasks requires collecting a vast amount of high-quality model descriptions. In this regard, the study observed that some public ML communities often provide various models suitable for specific AI tasks, and these models have clearly defined descriptions. Therefore, the research decided to connect LLMs (such as ChatGPT) with public ML communities (such as GitHub, HuggingFace, Azure, etc.) to solve complex AI tasks through a language-based interface.
As of now, HuggingGPT has integrated hundreds of models around ChatGPT on HuggingFace, covering 24 tasks including text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. Experimental results demonstrate HuggingGPT’s powerful capabilities in handling multimodal information and complex AI tasks. Furthermore, HuggingGPT will continue to add AI models for specific tasks, realizing scalable and expandable AI functionalities.
Introduction to HuggingGPT
HuggingGPT is a collaborative system where large language models (LLMs) act as controllers and numerous expert models serve as collaborative executors. Its workflow consists of four stages: task planning, model selection, task execution, and response generation.
- Task Planning: an LLM such as ChatGPT first parses the user request, decomposes it into subtasks, and plans their execution order and dependencies based on its knowledge;
- Model Selection: the LLM assigns each parsed task to an expert model;
- Task Execution: the expert models execute the assigned tasks on inference endpoints and log the execution information and inference results back to the LLM;
- Response Generation: the LLM summarizes the execution logs and inference results, returning the consolidated results to the user.
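The four stages above can be sketched as a simple loop. This is a toy illustration only: all function names, model names, and the `<resource-N>` placeholder convention below are stand-ins, and in the real system stages 1, 2, and 4 are carried out by prompting an LLM such as ChatGPT rather than by hand-written rules.

```python
# Toy sketch of HuggingGPT's four-stage workflow (hypothetical names).

def plan_tasks(request):
    # Stage 1: decompose the request into structured tasks with dependencies.
    return [
        {"id": 0, "task": "image-captioning", "dep": [], "args": {"image": "cat.jpg"}},
        {"id": 1, "task": "text-to-speech", "dep": [0], "args": {"text": "<resource-0>"}},
    ]

def select_model(task):
    # Stage 2: match the task type to an expert model.
    registry = {"image-captioning": "blip-base", "text-to-speech": "speecht5-tts"}
    return registry[task["task"]]

def run_inference(model, task, results):
    # Stage 3: resolve "<resource-N>" placeholders with earlier results,
    # then call the expert model (simulated here as a string).
    args = {}
    for key, val in task["args"].items():
        if isinstance(val, str) and val.startswith("<resource-"):
            val = results[int(val[len("<resource-"):-1])]
        args[key] = val
    return f"{model}({args})"

def respond(request, tasks, results):
    # Stage 4: summarize planned tasks and inference results for the user.
    return f"Completed {len(tasks)} tasks for '{request}': {results}"

def hugging_gpt(request):
    tasks, results = plan_tasks(request), {}
    for task in tasks:
        results[task["id"]] = run_inference(select_model(task), task, results)
    return respond(request, tasks, results)
```

Note how the output of task 0 is injected into the arguments of task 1 before it runs; this dependency handling is what lets HuggingGPT chain models across modalities.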
Next, let’s look at the specific implementation process of these four steps.
Task Planning
In the first stage of HuggingGPT, the large language model receives the user request and decomposes it into a series of structured tasks. Complex requests often involve multiple tasks, and the large language model needs to determine the dependencies and execution order of these tasks. To facilitate effective task planning by the large language model, HuggingGPT employs both specification-based instructions and demonstration-based parsing in its prompt design.
By injecting several demonstrations into the prompt, HuggingGPT allows the large language model to better understand the intent and standards of task planning. Currently, the list of tasks supported by HuggingGPT is shown in Tables 1, 2, 3, and 4. It is evident that HuggingGPT covers tasks in NLP, CV, speech, video, and more.
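The planner's output is a JSON list of structured tasks. The field names below (`task`, `id`, `dep`, `args`) follow the format described in the HuggingGPT paper; the concrete task names and values are illustrative examples.

```python
import json

# Example of a parsed task plan: a dep of -1 means no prerequisite, and
# "<resource>-0" marks an argument to be filled with the output of task 0
# at execution time.
planned = json.loads("""
[
  {"task": "pose-detection", "id": 0, "dep": [-1],
   "args": {"image": "example.jpg"}},
  {"task": "pose-to-image", "id": 1, "dep": [0],
   "args": {"image": "<resource>-0", "text": "a girl reading a book"}}
]
""")

# A crude execution order: tasks whose dependencies come earlier run first
# (a real scheduler would do a full topological sort).
order = [t["id"] for t in sorted(planned, key=lambda t: max(t["dep"]))]
```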
Model Selection
After parsing the task list, HuggingGPT selects the appropriate model for each task in the list. To achieve this, the study first obtains descriptions of expert models from the HuggingFace Hub (model descriptions generally include model functionality, architecture, supported languages and domains, licensing, etc.). Then, it dynamically selects models for tasks through a task-model allocation mechanism based on context.
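Because the context window cannot hold every model card, the selection step can be thought of as a filter-then-rank pass before the LLM makes the final choice. The model entries and the ranking-by-downloads heuristic below are illustrative assumptions, not the system's exact logic.

```python
# Hypothetical slice of a model registry built from HuggingFace Hub
# model cards.
models = [
    {"id": "facebook/detr-resnet-101", "task": "object-detection",
     "downloads": 50_000, "description": "DETR with a ResNet-101 backbone"},
    {"id": "facebook/detr-resnet-50", "task": "object-detection",
     "downloads": 120_000, "description": "DETR with a ResNet-50 backbone"},
    {"id": "google/vit-base", "task": "image-classification",
     "downloads": 300_000, "description": "Vision Transformer classifier"},
]

def top_candidates(task_type, k=2):
    # Filter by task type, then rank by download count so only the top-k
    # descriptions need to fit into the LLM's selection prompt.
    pool = [m for m in models if m["task"] == task_type]
    return sorted(pool, key=lambda m: m["downloads"], reverse=True)[:k]
```

The LLM then sees only the descriptions of these top candidates in its prompt and picks the final model in context.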
Task Execution
Once tasks are assigned to specific models, the next step is to execute the tasks, i.e., perform model inference. To enhance speed and computational stability, HuggingGPT runs these models on hybrid inference endpoints. The task parameters are input to the model, which computes the inference results and feeds the information back to the large language model.
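"Hybrid inference endpoints" can be pictured as routing each model either to a local deployment or to a remote inference API. The endpoint URLs and the local model list here are hypothetical placeholders, not the project's actual configuration.

```python
# Toy router: some models run on a local endpoint for speed and stability,
# the rest fall back to a remote inference API (URLs are illustrative).
LOCAL_MODELS = {"runwayml/stable-diffusion-v1-5", "openai/whisper-base"}

def endpoint_for(model_id: str) -> str:
    if model_id in LOCAL_MODELS:
        return f"http://localhost:8005/models/{model_id}"
    return f"https://api-inference.huggingface.co/models/{model_id}"
```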
Response Generation
After all tasks are executed, HuggingGPT enters the response generation stage. In this stage, HuggingGPT integrates all information from the previous three stages (task planning, model selection, and task execution) into a concise summary, including the planned task list, model selections, and inference results. Most importantly, the inference results form the basis for HuggingGPT’s final decisions. These inference results appear in a structured format, such as bounding boxes with detection probabilities in object detection models, answer distributions in question-answering models, etc.
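One way to picture this stage is folding the structured inference results into a single prompt that the LLM turns into a user-facing answer. The field names and prompt wording below are illustrative assumptions.

```python
# Hypothetical structured results, e.g. bounding boxes with detection
# probabilities and answer distributions, as described in the text.
results = {
    0: {"task": "object-detection",
        "output": [{"label": "cat", "score": 0.98, "box": [12, 30, 200, 180]}]},
    1: {"task": "question-answering",
        "output": {"answer": "two", "score": 0.87}},
}

def build_summary_prompt(user_request, results):
    # Serialize every task's structured output so the LLM can ground its
    # final answer in the inference results.
    lines = [f"User request: {user_request}", "Execution results:"]
    for task_id, r in sorted(results.items()):
        lines.append(f"  task {task_id} ({r['task']}): {r['output']}")
    lines.append("Please summarize these results as a direct answer.")
    return "\n".join(lines)
```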
Experiments
This study utilized two variants of GPT models, gpt-3.5-turbo and text-davinci-003, as large language models, which can be accessed publicly through the OpenAI API. Table 5 provides detailed prompt designs for the task planning, model selection, and response generation stages.
HuggingGPT dialog demonstration example: In the demonstration, the user inputs a request that may involve multiple tasks or multimodal resources. Then, HuggingGPT relies on the LLM to organize the collaboration of multiple expert models, generating feedback for the user.
Figure 3 shows HuggingGPT’s workflow when there are resource dependencies between tasks. In this case, HuggingGPT can parse specific tasks from the user’s abstract request, including pose detection, image captioning, etc. Additionally, HuggingGPT successfully identifies the dependencies between Task 3 and Tasks 1 and 2, injecting the inference results of Tasks 1 and 2 into the input parameters of Task 3 once those dependent tasks are completed.
Figure 4 showcases HuggingGPT’s dialog capabilities in audio and video modalities.
Figure 5 illustrates HuggingGPT’s integration of multiple user input resources to perform simple inference.
The study also tested HuggingGPT on multimodal tasks, as shown in the following figure. With the cooperation of large language models and expert models, HuggingGPT can solve tasks across multiple modalities, including language, image, audio, and video, covering detection, generation, classification, and question answering.
In addition to the aforementioned simple tasks, HuggingGPT can also handle more complex tasks. Figure 8 demonstrates HuggingGPT’s ability to deal with complex tasks in multi-turn dialog scenarios.
Figure 9 shows that for a simple request to describe an image in as much detail as possible, HuggingGPT can expand it into five related tasks: image captioning, image classification, object detection, segmentation, and visual question answering. HuggingGPT assigns an expert model to each task, and these models provide the LLM with information about the image from different perspectives. Finally, the LLM integrates this information and produces a comprehensive and detailed description.
The release of this study has also led netizens to exclaim that AGI seems to be on the verge of emerging from the open-source community.
Some have likened it to a company manager, commenting that “HuggingGPT is somewhat like a real-world scenario where a company has a group of exceptionally skilled engineers, each with outstanding expertise, and now there is a manager overseeing them. When someone has a need, this manager analyzes the requirement, then assigns it to the appropriate engineer, and finally consolidates the results before returning them to the user.”
Others praise HuggingGPT as a revolutionary system that leverages the power of language to connect and manage existing AI models from different domains and modalities, paving the way for achieving AGI.
Reference link: https://twitter.com/search?q=HuggingGPT&src=typed_query&f=top
Source: Machine Heart