HuggingGPT: Managing AI Models with ChatGPT

In recent months, the successive releases of ChatGPT and GPT-4 have showcased the extraordinary capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning, drawing significant attention from both academia and industry and highlighting the potential of LLMs for building artificial general intelligence (AGI) systems.
To achieve AGI, LLMs face numerous challenges, including:
  • Constrained to text as their input and output modality, current LLMs cannot process complex information such as visual and auditory data;
  • In real-world scenarios, complex tasks often consist of multiple sub-tasks that require coordination and collaboration among several models, which exceeds what a single language model can do;
  • On some challenging tasks, LLMs perform well in zero-shot or few-shot settings but still fall short of specialized fine-tuned models.
Among these, the most critical point is that achieving AGI requires solving complex AI tasks across different domains and modalities, while existing AI models are mostly designed for specific tasks in particular domains.
Based on this, researchers from Zhejiang University and Microsoft Research Asia recently proposed a novel approach where LLMs act as controllers, managing existing AI models to solve complex AI tasks, using language as a universal interface. The research introduces HuggingGPT, a system that utilizes LLMs to connect various AI models in the machine learning community (e.g., HuggingFace) to tackle complex AI tasks.

Paper link: https://arxiv.org/abs/2303.17580
Project link: https://github.com/microsoft/JARVIS
Specifically, when HuggingGPT receives a user request, it uses ChatGPT for task planning, selects models based on available functionality descriptions in HuggingFace, executes each sub-task with the chosen AI models, and compiles responses based on the execution results. Leveraging ChatGPT’s powerful language capabilities and HuggingFace’s rich AI model offerings, HuggingGPT can accomplish complex AI tasks spanning different modalities and domains, achieving impressive results in challenging tasks such as language, vision, and speech. HuggingGPT paves a new path towards general artificial intelligence.
Let’s first look at examples of tasks completed by HuggingGPT, including document question answering, image transformation, video generation, and audio generation.

HuggingGPT can also generate complex and detailed text descriptions for images.
To handle complex AI tasks, LLMs need to coordinate with external models to leverage their capabilities. Therefore, the key issue is how to select the appropriate middleware to bridge the connection between LLMs and AI models.
The study notes that each AI model can be represented in a linguistic form by summarizing its model functionality. Hence, the research proposes a concept: “language is the universal interface for LLMs to connect AI models.” By integrating the textual descriptions of AI models into prompts, LLMs can be viewed as the “brains” managing (including planning, scheduling, and collaborating) AI models.
Another challenge is that solving a large number of AI tasks requires collecting high-quality model descriptions. In this regard, the study notes that some public ML communities often provide various models suitable for specific AI tasks, and these models come with well-defined descriptions. Therefore, the research decided to connect LLMs (e.g., ChatGPT) with public ML communities (e.g., GitHub, HuggingFace, Azure, etc.) to address complex AI tasks through a language-based interface.

As of now, HuggingGPT has integrated hundreds of HuggingFace models around ChatGPT, covering 24 tasks including text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. Experimental results demonstrate HuggingGPT’s strong capabilities in handling multimodal information and complex AI tasks. Furthermore, HuggingGPT will continue to integrate task-specific AI models, enabling scalable and extensible AI capabilities.
Introduction to HuggingGPT
HuggingGPT is a collaborative system in which a large language model (LLM) acts as the controller and numerous expert models serve as collaborative executors. Its workflow is divided into four stages: task planning, model selection, task execution, and response generation (a minimal code sketch of this loop follows the list below).
  • Task Planning: LLMs like ChatGPT first parse user requests, decompose tasks, and plan the sequence and dependencies of tasks based on their knowledge;
  • Model Selection: LLMs allocate parsed tasks to expert models;
  • Task Execution: Expert models execute the assigned tasks on inference endpoints and return execution information and inference results to the LLM;
  • Response Generation: LLMs summarize the logs of the execution process and inference results, returning the compiled results to the user.
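To make this four-stage loop concrete, here is a minimal, hypothetical sketch in Python. The function names and callable interfaces (plan_tasks, select_model, run_inference, generate_response) are illustrative assumptions rather than the authors' actual code; the task fields ("task", "id", "dep", "args") follow the structured task format described for HuggingGPT.

```python
from typing import Any, Callable, Dict, List


def hugginggpt_pipeline(
    user_request: str,
    plan_tasks: Callable[[str], List[Dict[str, Any]]],
    select_model: Callable[[Dict[str, Any]], str],
    run_inference: Callable[[str, Dict[str, Any]], Any],
    generate_response: Callable[..., str],
) -> str:
    """Orchestrate the four HuggingGPT stages for a single user request."""
    # 1. Task planning: the controller LLM parses the request into sub-tasks,
    #    each with an "id", a "task" type, "args", and dependency ids ("dep").
    tasks = plan_tasks(user_request)

    # 2. Model selection: choose one expert model id per planned task.
    chosen = {t["id"]: select_model(t) for t in tasks}

    # 3. Task execution: run a task only once all of its dependencies are done.
    results: Dict[int, Any] = {}
    pending = list(tasks)
    while pending:
        ready = [t for t in pending
                 if all(d in results for d in t.get("dep", []) if d != -1)]
        if not ready:  # guard against circular or unsatisfiable dependencies
            raise ValueError("unresolvable task dependencies")
        for t in ready:
            # (any "<resource>-<id>" placeholders in t["args"] would be
            # resolved against `results` before this call)
            results[t["id"]] = run_inference(chosen[t["id"]], t["args"])
            pending.remove(t)

    # 4. Response generation: the LLM compiles plans, model choices, and
    #    results into a single user-facing answer.
    return generate_response(user_request, tasks, chosen, results)
```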

Next, let’s take a look at the specific implementation of these four steps.
Task Planning
In the first stage of HuggingGPT, the large language model receives the user request and decomposes it into a series of structured tasks. Complex requests often involve multiple tasks, so the large language model needs to determine the dependencies and execution order of these tasks. To help the large language model plan effectively, HuggingGPT combines specification-based instructions with demonstration-based parsing in its prompt design.
By injecting several demonstrations into the prompts, HuggingGPT allows the large language model to better understand the intent and standards of task planning. Currently, the list of tasks supported by HuggingGPT is shown in Tables 1, 2, 3, and 4. It can be seen that HuggingGPT covers tasks in NLP, CV, speech, video, etc.
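For illustration, the sketch below builds a planning prompt from an instruction, one in-context demonstration, and the new user request, then parses the LLM's JSON answer into structured tasks. The field names ("task", "id", "dep", "args") and the "<resource>-<id>" convention follow the paper's task format; the prompt wording, the demonstration text, and the helper names are assumptions of this sketch, and it assumes the LLM replies with JSON only.

```python
import json

# One demonstration pair injected into the planning prompt. The JSON schema
# (task / id / dep / args) mirrors HuggingGPT's structured task list; the
# wording of the example request is only in the spirit of the paper's demos.
PLANNING_DEMO = """\
User request: Please generate an image where a girl is reading a book, and her
pose is the same as the boy in example.jpg, then describe the new image with
your voice.
Task list: [
  {"task": "pose-detection", "id": 0, "dep": [-1], "args": {"image": "example.jpg"}},
  {"task": "pose-to-image", "id": 1, "dep": [0], "args": {"text": "a girl reading a book", "image": "<resource>-0"}},
  {"task": "image-to-text", "id": 2, "dep": [1], "args": {"image": "<resource>-1"}},
  {"task": "text-to-speech", "id": 3, "dep": [2], "args": {"text": "<resource>-2"}}
]"""


def build_planning_prompt(user_request: str) -> str:
    """Compose instruction + demonstration + the new request for the LLM."""
    return (
        "Parse the user request into a JSON task list. Each task has fields "
        '"task", "id", "dep" (ids of prerequisite tasks, -1 if none) and "args". '
        'Use "<resource>-<id>" to refer to the output of a prerequisite task.\n\n'
        f"{PLANNING_DEMO}\n\nUser request: {user_request}\nTask list:"
    )


def plan_tasks(llm, user_request: str):
    """Ask the LLM to plan, then parse its JSON answer into Python objects."""
    return json.loads(llm(build_planning_prompt(user_request)))
```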

Model Selection
After parsing the task list, HuggingGPT selects an appropriate model for each task. To do this, it first obtains descriptions of expert models from the HuggingFace Hub (model descriptions typically cover functionality, architecture, supported languages and domains, licensing, etc.), then dynamically assigns models to tasks through an in-context task-model assignment mechanism.
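As a rough illustration of in-context task-model assignment, the sketch below filters candidate models by task type, ranks them by downloads so that only the top-k descriptions enter the prompt, and asks the LLM to pick one. The candidate metadata format and prompt wording are assumptions of this sketch; in the real system the descriptions come from the HuggingFace Hub.

```python
import json


def select_model(llm, task, candidate_models, top_k=5):
    """Pick an expert model for one task via in-context task-model assignment.

    candidate_models: metadata dicts such as
      {"id": "facebook/detr-resnet-101", "task": "object-detection",
       "downloads": 1000000, "description": "DETR end-to-end object detector"}
    In the real system these descriptions come from the HuggingFace Hub; here
    the list is assumed to be supplied by the caller.
    """
    # Keep only models matching this task type and rank them by popularity so
    # that just the top-k descriptions need to fit in the LLM's context window.
    candidates = sorted(
        (m for m in candidate_models if m["task"] == task["task"]),
        key=lambda m: m["downloads"],
        reverse=True,
    )[:top_k]

    prompt = (
        f"Task: {json.dumps(task)}\n"
        "Candidate models:\n"
        + "\n".join(f"- {m['id']}: {m['description']}" for m in candidates)
        + "\nAnswer with the id of the single model best suited to this task."
    )
    return llm(prompt).strip()
```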
Task Execution
Once tasks are assigned to specific models, the next step is to execute them, i.e., perform model inference. To improve speed and computational stability, HuggingGPT runs these models on hybrid inference endpoints. The task arguments are fed to the model, the model computes its inference results, and the results are returned to the large language model.
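A simplified stand-in for task execution might call the hosted HuggingFace Inference API, as sketched below. This is an assumption for illustration: the real system uses hybrid endpoints (hosted plus locally deployed models), and handling of non-text inputs such as images and audio is omitted here.

```python
import requests

HF_INFERENCE_URL = "https://api-inference.huggingface.co/models/{model_id}"


def run_inference(model_id, args, hf_token):
    """Call the hosted HuggingFace Inference API for one expert model."""
    response = requests.post(
        HF_INFERENCE_URL.format(model_id=model_id),
        headers={"Authorization": f"Bearer {hf_token}"},
        json=args,  # e.g. {"inputs": "a girl reading a book"}
        timeout=120,
    )
    response.raise_for_status()
    return response.json()  # structured results such as labels, boxes, scores
```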
Response Generation
After all task executions are completed, HuggingGPT enters the response generation phase. In this phase, HuggingGPT consolidates all information from the previous three stages (task planning, model selection, and task execution) into a concise summary, including the planned task list, model selections, and inference results. Most importantly, the inference results form the basis for HuggingGPT’s final decision. These results appear in a structured format, such as bounding boxes with detection probabilities in object detection models, answer distributions in question-answering models, etc.
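A minimal sketch of the response-generation step, assuming the controller LLM is exposed as a simple callable: the prompt wording is illustrative, but the key point matches the description above, namely that the structured inference results are handed back to the LLM verbatim so the final answer stays grounded in them.

```python
import json


def generate_response(llm, user_request, tasks, chosen_models, results):
    """Ask the controller LLM to turn the execution logs into a final answer."""
    prompt = (
        f"User request: {user_request}\n"
        f"Planned tasks: {json.dumps(tasks)}\n"
        f"Selected models: {json.dumps(chosen_models)}\n"
        f"Inference results: {json.dumps(results)}\n"
        "Answer the user's request using only the information above, and "
        "mention which models produced which results."
    )
    return llm(prompt)
```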
Experiments
The research utilized gpt-3.5-turbo and text-davinci-003 variants of GPT models as the large language models, which are publicly accessible through the OpenAI API. Table 5 provides detailed prompt designs for the task planning, model selection, and response generation stages.
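For reference, a controller-LLM wrapper along these lines could be written with the OpenAI Python client as it existed at the time of the paper (pre-1.0 interface); the wrapper name and parameter choices are illustrative assumptions.

```python
import openai  # pre-1.0 OpenAI Python client, matching the period of the paper

openai.api_key = "sk-..."  # placeholder; supply a real API key


def chat_llm(prompt: str) -> str:
    """Minimal controller-LLM call wrapping gpt-3.5-turbo."""
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep planning/parsing output deterministic
    )
    return completion["choices"][0]["message"]["content"]
```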

HuggingGPT dialogue demonstration example: In the demonstration, the user inputs a request that may involve multiple tasks or multimodal resources. HuggingGPT then relies on LLMs to organize the collaboration of multiple expert models, generating feedback for the user.
Figure 3 shows the workflow of HuggingGPT when there are resource dependencies between tasks. In this case, HuggingGPT can parse the user’s abstract request into specific tasks, including pose detection, image description, etc. Moreover, HuggingGPT successfully identifies the dependencies between Task 3 and Tasks 1 and 2, injecting the inference results of Tasks 1 and 2 into the input parameters of Task 3 after the dependent tasks are completed.
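Resolving such resource dependencies can be sketched as replacing "<resource>-<id>" placeholders in a task's arguments with the outputs of finished tasks. The placeholder convention follows the task format sketched earlier; how a prerequisite's raw output is mapped to a usable argument value (e.g. a generated image) is simplified here.

```python
def resolve_dependencies(args, results):
    """Replace "<resource>-<id>" placeholders with outputs of finished tasks."""
    resolved = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith("<resource>-"):
            dep_id = int(value.split("-", 1)[1])
            resolved[key] = results[dep_id]  # inject the prerequisite's output
        else:
            resolved[key] = value
    return resolved
```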

Figure 4 demonstrates HuggingGPT’s dialogue capabilities in audio and video modalities.

Figure 5 shows HuggingGPT integrating multiple user input resources to perform simple inference.

The research also tested HuggingGPT on multimodal tasks, as shown in the following figure. With the collaboration of large language models and expert models, HuggingGPT can address various modalities, including language, images, audio, and video, encompassing detection, generation, classification, and question answering tasks.

In addition to the simple tasks mentioned above, HuggingGPT can also handle more complex tasks. Figure 8 demonstrates HuggingGPT’s ability to manage complex tasks in multi-turn dialogue scenarios.

Figure 9 shows that for a simple request to describe an image in as much detail as possible, HuggingGPT can expand it into five related tasks: image captioning, image classification, object detection, segmentation, and visual question answering. HuggingGPT assigns expert models to each task, and these models provide the LLM with information about the image from different perspectives. Finally, the LLM integrates this information to produce a comprehensive and detailed description.

The release of this research has led netizens to exclaim that AGI seems to be on the verge of breaking out from the open-source community.

Some have likened it to a company manager, commenting, “HuggingGPT is somewhat like a real-world scenario where there is a group of exceptionally skilled engineers, each with outstanding expertise, and now a manager organizes them. When someone has a need, this manager analyzes the requirement, assigns it to the corresponding engineer, and finally merges the results and returns them to the user.”

Others have praised HuggingGPT as a revolutionary system that utilizes the power of language to connect and manage existing AI models from different domains and modalities, paving the way for AGI.

Reference link:

https://twitter.com/search?q=HuggingGPT&src=typed_query&f=top

Source: Machine Heart
