Source: Machine Heart
ChatGPT has become the manager of hundreds of models.
In recent months, the successive popularity of ChatGPT and GPT-4 has showcased the extraordinary capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning, attracting significant attention from both academia and industry, and revealing the potential of LLMs in constructing general artificial intelligence (AGI) systems.
To achieve AGI, LLMs face numerous challenges, including:
- Limited by the input and output forms of text generation, current LLMs lack the ability to process complex information such as visual and auditory data;
- In real-world scenarios, complex tasks often consist of multiple sub-tasks, requiring the scheduling and collaboration of multiple models, which exceeds the capabilities of a language model alone;
- For some challenging tasks, LLMs show excellent results in zero-shot or few-shot settings, but they still lag behind specialized fine-tuned models.
Among these, the most crucial point is that achieving AGI requires solving complex AI tasks across different domains and modalities, while most existing AI models are designed for specific domains and tasks.
Based on this, researchers from Zhejiang University and Microsoft Research Asia recently proposed a novel method that allows LLMs to act as controllers to manage existing AI models to solve complex AI tasks, using language as a universal interface. The research presents HuggingGPT, a system that utilizes LLMs to connect various AI models in the machine learning community (e.g., HuggingFace) to solve complex AI tasks.
Paper link: https://arxiv.org/abs/2303.17580
Project link: https://github.com/microsoft/JARVIS
Specifically, when HuggingGPT receives user requests, it uses ChatGPT for task planning, selects models based on available functionality descriptions in HuggingFace, executes each sub-task with the chosen AI model, and summarizes responses based on execution results. With ChatGPT’s powerful language capabilities and HuggingFace’s rich AI models, HuggingGPT can complete complex AI tasks across different modalities and domains, achieving impressive results in challenging tasks involving language, vision, and audio. HuggingGPT paves a new path toward general artificial intelligence.
Let’s first look at examples of tasks completed by HuggingGPT, including document question answering, image transformation, video generation, and audio generation:
HuggingGPT can also generate complex and detailed textual descriptions for images:

To handle complex AI tasks, LLMs need to coordinate with external models to leverage their capabilities. Therefore, the key issue is how to select appropriate middleware to bridge the connection between LLMs and AI models.
The study notes that each AI model can be represented in a linguistic form by summarizing its model capabilities. Thus, the research proposes a concept: “Language is the universal interface connecting LLMs and AI models.” By incorporating textual descriptions of AI models into prompts, LLMs can be seen as the “brain” managing (including planning, scheduling, and collaborating) AI models.
Another challenge is that solving a large number of AI tasks requires collecting a substantial amount of high-quality model descriptions. In this regard, the study notes that some public ML communities often provide various models suitable for specific AI tasks, and these models have clearly defined descriptions. Therefore, the study decided to connect LLMs (e.g., ChatGPT) with public ML communities (e.g., GitHub, HuggingFace, Azure, etc.) to solve complex AI tasks through language-based interfaces.
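As a toy illustration of the "language as interface" idea, the sketch below reduces a few real HuggingFace models to one-line textual descriptions and pastes them into a prompt. The model ids are real, but the description strings and prompt wording are paraphrased for illustration, not the paper's actual prompts.

```python
# Each expert model is summarized as plain text so an LLM can plan over it.
# Model ids are real HuggingFace models; the descriptions are paraphrased.
model_descriptions = {
    "facebook/detr-resnet-101":
        "object detection: returns labeled bounding boxes for an image",
    "Salesforce/blip-image-captioning-base":
        "image captioning: writes a natural-language caption for an image",
    "openai/whisper-base":
        "speech recognition: transcribes audio to text",
}

prompt = "You can delegate work to the following AI models:\n" + "\n".join(
    f"- {name}: {desc}" for name, desc in model_descriptions.items()
)
print(prompt)  # language becomes the interface between the LLM and the models
```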
As of now, HuggingGPT has integrated hundreds of models around ChatGPT on HuggingFace, covering 24 tasks including text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. Experimental results demonstrate HuggingGPT’s powerful capabilities in handling multimodal information and complex AI tasks. Furthermore, HuggingGPT will continue to add AI models for specific tasks, enabling scalable and expandable AI functionalities.
Introduction to HuggingGPT
HuggingGPT is a collaborative system where large language models (LLMs) act as controllers and numerous expert models serve as collaborative executors. Its workflow consists of four stages: task planning, model selection, task execution, and response generation.
- Task Planning: LLMs such as ChatGPT first parse user requests, decompose tasks, and plan the order and dependencies of tasks based on their knowledge;
- Model Selection: LLMs assign parsed tasks to expert models;
- Task Execution: Expert models execute the assigned tasks at inference endpoints and record execution information and inference results back to the LLM;
- Response Generation: LLMs summarize the logs of the execution process and inference results, returning the summarized results to the user.
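Taken together, the four stages form a simple controller loop. The sketch below is illustrative rather than HuggingGPT's actual code; the four helper functions are hypothetical placeholders that are fleshed out, stage by stage, in the sections that follow.

```python
def hugging_gpt(user_request: str) -> str:
    """Hypothetical controller loop mirroring HuggingGPT's four-stage workflow."""
    # Stage 1: the LLM decomposes the request into structured, ordered tasks.
    tasks = plan_tasks(user_request)

    results = {}
    for task in tasks:  # assumed ordered so dependencies are met
        # Stage 2: the LLM assigns an expert model to the task.
        model_id = select_model(task)
        # Stage 3: the expert model runs at an inference endpoint; earlier
        # results are passed along so dependent tasks can consume them.
        results[task["id"]] = execute_task(model_id, task, results)

    # Stage 4: the LLM summarizes the plan, model choices, and results.
    return generate_response(user_request, tasks, results)
```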
Next, let’s take a look at the specific implementation process of these four steps.
Task Planning
In the first stage of HuggingGPT, the large language model receives user requests and decomposes them into a series of structured tasks. Complex requests often involve multiple tasks, and the large language model needs to determine the dependencies and execution order of these tasks. To facilitate effective task planning by the large language model, HuggingGPT employs specification-based instructions and demonstration-based parsing in its prompt design.
By injecting several demonstrations into the prompt, HuggingGPT allows the large language model to better understand the intent and standards of task planning. Currently, the list of tasks supported by HuggingGPT is shown in Tables 1, 2, 3, and 4. It can be seen that HuggingGPT covers tasks in NLP, CV, speech, video, etc.
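As a concrete illustration, here is a sketch of the `plan_tasks` stage. The JSON task format (fields "task", "id", "dep", and "args", with "<resource>-<id>" referencing a prerequisite task's output) follows the structure described in the paper, but the prompt wording and the demonstration are paraphrased and abridged, and the 2023-era (pre-1.0) openai Python client is assumed.

```python
import json

import openai  # assumes the 2023-era (pre-1.0) openai Python client

# Paraphrased, abridged planning prompt: a specification of the task JSON
# format plus one demonstration, as in HuggingGPT's prompt design.
PLANNING_PROMPT = """\
Parse the user request into a JSON list of tasks. Each task has the fields
"task" (task type), "id" (unique integer), "dep" (ids of prerequisite tasks,
[-1] if none), and "args" (inputs; write "<resource>-<id>" to reference the
output of a prerequisite task).

User: generate a new image of the pose of the person in example.jpg
Tasks: [{"task": "pose-detection", "id": 0, "dep": [-1],
         "args": {"image": "example.jpg"}},
        {"task": "pose-to-image", "id": 1, "dep": [0],
         "args": {"image": "<resource>-0"}}]

User: """

def plan_tasks(user_request: str) -> list:
    """Stage 1: the LLM parses the request into structured tasks."""
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": PLANNING_PROMPT + user_request + "\nTasks:"}],
        temperature=0,
    )
    return json.loads(reply["choices"][0]["message"]["content"])
```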
Model Selection
After parsing the task list, HuggingGPT selects an appropriate model for each task in the list. To achieve this, the study first obtains descriptions of expert models from the HuggingFace Hub (model descriptions generally include model functions, architecture, supported languages and domains, licensing information, etc.). Then, it dynamically selects models for tasks through an in-context task-model assignment mechanism.
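A minimal sketch of this in-context assignment might look as follows. It assumes a 2023-era huggingface_hub client (`list_models` with filter/sort/direction/limit parameters and a `modelId` attribute on the results); the real system also injects each candidate's textual description into the prompt and truncates the candidate list to fit the context window.

```python
import openai
from huggingface_hub import HfApi

def select_model(task: dict) -> str:
    """Stage 2: in-context task-model assignment (abbreviated sketch)."""
    # Pull candidates for this task type from the HuggingFace Hub, ranked
    # by downloads, and keep only the top few to fit in the context window.
    candidates = HfApi().list_models(
        filter=task["task"], sort="downloads", direction=-1, limit=5
    )
    menu = "\n".join(f"- {m.modelId}" for m in candidates)
    # Ask the LLM to choose; the real system also supplies each model's
    # textual description, not just its id.
    choice = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Select the single best model for the task "
                              f"'{task['task']}' from the list below and "
                              f"reply with its id only.\n{menu}"}],
        temperature=0,
    )
    return choice["choices"][0]["message"]["content"].strip()
```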
Task Execution
Once tasks are assigned to specific models, the next step is to execute them, which means performing model inference. To enhance speed and computational stability, HuggingGPT runs these models on hybrid inference endpoints. The model takes the task parameters as input, computes the inference results, and feeds the information back to the large language model.
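The dependency handling can be sketched as follows: before inference, any "<resource>-<id>" placeholder in a task's arguments is replaced with the output of the finished task it points to. The hosted Inference API URL is real, but a production call also needs an API token, and this single-endpoint sketch stands in for the hybrid (hosted plus local) endpoints the paper describes.

```python
import re

import requests

HF_INFERENCE = "https://api-inference.huggingface.co/models/"

def execute_task(model_id: str, task: dict, results: dict):
    """Stage 3: resolve resource dependencies, then run the expert model."""
    args = dict(task["args"])
    for key, value in args.items():
        # "<resource>-<id>" marks an argument produced by an earlier task.
        match = re.fullmatch(r"<resource>-(\d+)", str(value))
        if match:
            args[key] = results[int(match.group(1))]
    # Simplified: one hosted endpoint; a real call also needs an
    # "Authorization: Bearer <token>" header, and HuggingGPT mixes hosted
    # and local endpoints for speed and stability.
    response = requests.post(HF_INFERENCE + model_id, json={"inputs": args})
    return response.json()
```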
Response Generation
After all task executions are completed, HuggingGPT enters the response generation phase. In this phase, HuggingGPT consolidates all information from the previous three stages (task planning, model selection, and task execution) into a concise summary, including the planned task list, model selections, and inference results. Most importantly, the inference results form the basis for HuggingGPT’s final decision. These inference results appear in a structured format, such as bounding boxes with detection probabilities in object detection models, or answer distributions in question-answering models.
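A final sketch: the structured results (bounding boxes, answer distributions, generated file paths, and so on) are serialized back into a prompt so the LLM can ground its answer in them. As before, the prompt wording is illustrative, not the paper's.

```python
import json

import openai

def generate_response(user_request: str, tasks: list, results: dict) -> str:
    """Stage 4: the LLM summarizes the plan and structured inference results."""
    summary_prompt = (
        "A user asked: " + user_request + "\n"
        "The planned tasks were: " + json.dumps(tasks) + "\n"
        "The expert models returned: " + json.dumps(results) + "\n"
        "Answer the user directly, citing these results where relevant."
    )
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": summary_prompt}],
        temperature=0,
    )
    return reply["choices"][0]["message"]["content"]
```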
Experiments
The study used variants of the GPT models, gpt-3.5-turbo and text-davinci-003, as large language models, which are publicly accessible via the OpenAI API. Table 5 provides detailed prompt designs for the task planning, model selection, and response generation phases.
HuggingGPT dialogue demonstration example: In the demonstration, the user inputs a request that may involve multiple tasks or multimodal resources. HuggingGPT then relies on the LLM to organize the collaboration of multiple expert models to generate feedback for the user.
Figure 3 shows HuggingGPT’s workflow when there are resource dependencies between tasks. In this case, HuggingGPT can parse specific tasks from the user’s abstract request, including pose detection, image description, etc. Additionally, HuggingGPT successfully identifies the dependencies between Task 3 and Tasks 1 and 2, injecting the inference results of Tasks 1 and 2 into the input parameters of Task 3 after the dependent tasks are completed.
Figure 4 demonstrates HuggingGPT’s dialogue capabilities in audio and video modalities.
Figure 5 shows HuggingGPT integrating multiple user input resources to perform simple reasoning.
The study also tested HuggingGPT on multimodal tasks, as shown in the following figure. With the collaboration of large language models and expert models, HuggingGPT can tackle various modalities such as language, images, audio, and video, encompassing various forms of tasks including detection, generation, classification, and question answering.
In addition to the simple tasks mentioned above, HuggingGPT can also complete more complex tasks. Figure 8 demonstrates HuggingGPT’s ability to handle complex tasks in multi-turn dialogue scenarios.
Figure 9 shows that, for a request to describe an image in as much detail as possible, HuggingGPT can expand it into five related tasks: image captioning, image classification, object detection, segmentation, and visual question answering. HuggingGPT assigns an expert model to each task, and these models provide information about the image from different perspectives to the LLM. Finally, the LLM integrates this information into a comprehensive and detailed description.
The release of this study has prompted netizens to exclaim that AGI seems to be emerging from the open-source community.
Some likened it to a company manager, commenting, “HuggingGPT is somewhat like a real-world scenario, where a company has a group of super engineers, each excelling in their respective specialties, and now there is a manager who organizes them. When someone has a demand, this manager analyzes the demand, assigns it to the appropriate engineers, and finally merges the results to return to the user.”
Others praised HuggingGPT as a revolutionary system that harnesses the power of language to connect and manage existing AI models from different domains and modalities, paving the way for AGI.
Reference link: https://twitter.com/search?q=HuggingGPT&src=typed_query&f=top