PanChuang AI Sharing
Source | GPT
Reprinted from | Machine Heart
【Introduction】ChatGPT has become the manager of hundreds of models this time.
In recent months, the back-to-back popularity of ChatGPT and GPT-4 has showcased the extraordinary capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning. These capabilities have drawn significant attention from both academia and industry and revealed the potential of LLMs for building artificial general intelligence (AGI) systems.
To achieve AGI, LLMs face numerous challenges, including:

- Constrained by the text-only form of their input and output, current LLMs lack the ability to process complex information such as visual and auditory data;
- In real-world scenarios, complex tasks often consist of multiple subtasks that require the scheduling and collaboration of multiple models, which exceeds the capabilities of a language model alone;
- On some challenging tasks, LLMs achieve excellent results in zero-shot or few-shot settings, but they still lag behind specialized fine-tuned models.
Among these, the most crucial point is that achieving AGI requires solving complex AI tasks across different domains and modalities, whereas existing AI models are mostly designed for specific domains and tasks.
Based on this, researchers from Zhejiang University and Microsoft Research Asia recently proposed a new method for LLMs to act as controllers, managing existing AI models to solve complex AI tasks, using language as a universal interface. The HuggingGPT system proposed in this research utilizes LLMs to connect various AI models in the machine learning community (such as HuggingFace) to tackle complex AI tasks.
Paper link: https://arxiv.org/abs/2303.17580
Project link: https://github.com/microsoft/JARVIS
Specifically, when HuggingGPT receives a user request, it uses ChatGPT for task planning, selects models based on the available function descriptions in HuggingFace, executes each subtask with the chosen AI model, and summarizes the response based on the execution results. With the powerful language capabilities of ChatGPT and the rich AI models from HuggingFace, HuggingGPT can accomplish complex AI tasks across different modalities and fields, achieving impressive results in challenging tasks involving language, vision, and speech. HuggingGPT opens a new path towards general artificial intelligence.
Let’s first look at examples of HuggingGPT completing tasks, including document question answering, image transformation, video generation, and audio generation:
HuggingGPT can also generate complex and detailed text descriptions for images.
To handle complex AI tasks, LLMs need to coordinate with external models to leverage their capabilities. Therefore, the key question is how to select the appropriate middleware to bridge the connection between LLMs and AI models.
The research notes that each AI model can be represented as a language form by summarizing its model functionality. Thus, the research proposes a concept: “Language is the universal interface for LLMs to connect AI models.” By incorporating text descriptions of AI models into prompts, LLMs can be viewed as the “brain” managing (including planning, scheduling, and collaborating) AI models.
Another challenge is that solving a large number of AI tasks requires collecting a significant amount of high-quality model descriptions. In this regard, the research observes that some public ML communities typically provide various models suitable for specific AI tasks, and these models have well-defined descriptions. Therefore, the research decided to connect LLMs (such as ChatGPT) with public ML communities (such as GitHub, HuggingFace, Azure, etc.) to solve complex AI tasks through a language-based interface.
As of now, HuggingGPT has integrated hundreds of models around ChatGPT on HuggingFace, covering 24 tasks including text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. Experimental results demonstrate HuggingGPT’s powerful capabilities in handling multimodal information and complex AI tasks. Furthermore, HuggingGPT will continue to add AI models for specific tasks, thereby achieving scalable and extensible AI functionalities.
Introduction to HuggingGPT
HuggingGPT is a collaborative system where large language models (LLMs) serve as controllers, and numerous expert models act as collaborative executors. Its workflow is divided into four stages: task planning, model selection, task execution, and response generation.
- Task Planning: An LLM such as ChatGPT first parses the user request, decomposes it into tasks, and plans the task order and dependencies based on its knowledge;
- Model Selection: The LLM assigns the parsed tasks to expert models;
- Task Execution: The expert models execute their assigned tasks on inference endpoints and return the execution information and inference results to the LLM;
- Response Generation: The LLM summarizes the execution logs and inference results and returns the summary to the user.
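The four-stage loop can be sketched in a few dozen lines. The following is a minimal illustration, not HuggingGPT's actual code: every helper here is a hand-written stand-in (in the real system, `plan_tasks` and `generate_response` would be LLM calls, and `execute` would invoke expert models on inference endpoints), and the `"<resource>-<id>"` placeholder convention for dependencies is borrowed from the paper's description.

```python
# Minimal sketch of HuggingGPT's four-stage workflow.
# All helpers are illustrative stand-ins, not the real implementation.

def plan_tasks(user_request):
    # Stage 1: an LLM would parse the request into structured tasks.
    # Here we return a hand-written plan for a pose-transfer request.
    return [
        {"id": 0, "task": "pose-detection", "args": {"image": "example.jpg"}},
        {"id": 1, "task": "pose-to-image", "args": {"pose": "<resource>-0"}},
    ]

def select_model(task):
    # Stage 2: pick an expert model for each task type (hypothetical names).
    registry = {"pose-detection": "openpose-model", "pose-to-image": "pose2img-model"}
    return registry[task["task"]]

def resolve(value, results):
    # Replace "<resource>-<task_id>" placeholders with earlier task outputs.
    if isinstance(value, str) and value.startswith("<resource>-"):
        return results[int(value.rsplit("-", 1)[1])]
    return value

def execute(task, model, results):
    # Stage 3: resolve dependencies, then run inference (mocked here).
    args = {k: resolve(v, results) for k, v in task["args"].items()}
    return f"{model} output for {args}"

def generate_response(results):
    # Stage 4: an LLM would summarize; here we simply join the results.
    return "; ".join(str(r) for r in results.values())

def hugginggpt(user_request):
    tasks = plan_tasks(user_request)
    results = {}
    for task in tasks:
        model = select_model(task)
        results[task["id"]] = execute(task, model, results)
    return generate_response(results)

print(hugginggpt("Generate an image of a person in the same pose as example.jpg"))
```

Note how Task 1's `"<resource>-0"` argument is replaced by Task 0's output before execution, which is exactly the dependency-injection behavior the paper describes for tasks with resource dependencies.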
Next, let’s take a look at the specific implementation process of these four steps.
Task Planning
In the first phase of HuggingGPT, the large language model receives the user request and breaks it down into a series of structured tasks. Complex requests often involve multiple tasks, so the large language model must determine the dependencies and execution order of these tasks. To facilitate effective task planning, HuggingGPT employs both specification-based instructions and demonstration-based parsing in its prompt design.
By injecting several demonstrations into the prompts, HuggingGPT allows the large language model to better understand the task planning intent and criteria. Currently, the list of tasks supported by HuggingGPT is shown in Tables 1, 2, 3, and 4. It can be seen that HuggingGPT covers tasks in NLP, CV, speech, video, and more.
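The planning prompt can be sketched as follows. This is a hedged illustration: the demonstration text and the exact schema (`task` / `id` / `dep` / `args`, with `dep` listing prerequisite task ids and `-1` meaning no dependency) loosely follow the paper's description, and the mocked LLM reply stands in for a real API call.

```python
import json

# Sketch of demonstration-based task planning (schema is illustrative).
DEMONSTRATIONS = """\
Example: "Look at example.jpg, what animal is in it?"
Plan: [{"task": "image-classification", "id": 0, "dep": [-1],
        "args": {"image": "example.jpg"}}]
"""

def build_planning_prompt(user_request):
    # Specification (the schema rules) plus demonstrations go into the prompt.
    return (
        "Parse the user request into a JSON list of tasks, each with "
        '"task", "id", "dep" (ids of prerequisite tasks, -1 if none) and "args".\n'
        f"{DEMONSTRATIONS}\nRequest: {user_request}\nPlan:"
    )

def parse_plan(llm_reply):
    tasks = json.loads(llm_reply)
    # Order tasks so that dependencies come before dependents
    # (works here because ids increase along the dependency chain).
    return sorted(tasks, key=lambda t: max(t["dep"]))

# A mocked LLM reply for "detect the pose in a.jpg, then draw a new image with it".
reply = (
    '[{"task": "pose-to-image", "id": 1, "dep": [0], "args": {"pose": "<resource>-0"}}, '
    '{"task": "pose-detection", "id": 0, "dep": [-1], "args": {"image": "a.jpg"}}]'
)
print(parse_plan(reply))
```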
Model Selection
After parsing the task list, HuggingGPT selects an appropriate model for each task in the list. To achieve this, the research first obtains descriptions of expert models from the HuggingFace Hub (model descriptions generally include model functionality, architecture, supported languages and domains, license, etc.). Then, it dynamically selects models for the tasks through an in-context task-model assignment mechanism.
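A minimal sketch of that mechanism: filter model cards by task type, rank them (ranking by download count is an assumption used here to fit a limited context window), and place the shortlist in a prompt for the LLM to choose from. The model ids are examples of real HuggingFace model names, but the card data is made up for illustration.

```python
# Sketch of in-context task-model assignment. Card contents are fabricated
# for illustration; only the model ids correspond to real HuggingFace names.
MODEL_CARDS = [
    {"id": "facebook/detr-resnet-101", "task": "object-detection",
     "downloads": 120_000, "description": "DETR object detector"},
    {"id": "hustvl/yolos-tiny", "task": "object-detection",
     "downloads": 45_000, "description": "Lightweight YOLOS detector"},
    {"id": "openai/whisper-base", "task": "speech-recognition",
     "downloads": 300_000, "description": "Whisper ASR model"},
]

def candidates_for(task, top_k=2):
    # Filter by task type, then rank to fit the LLM's context window.
    cards = [c for c in MODEL_CARDS if c["task"] == task]
    return sorted(cards, key=lambda c: c["downloads"], reverse=True)[:top_k]

def selection_prompt(task, cards):
    # The LLM reads these descriptions and picks one model id.
    listing = "\n".join(f"- {c['id']}: {c['description']}" for c in cards)
    return f"Select the best model for task '{task}' from:\n{listing}"

cards = candidates_for("object-detection")
print(selection_prompt("object-detection", cards))
```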
Task Execution
Once a task is assigned to a specific model, the next step is to execute it, i.e., perform model inference. To accelerate computation and ensure stability, HuggingGPT runs these models on hybrid inference endpoints. The model takes the task parameters as input, computes the inference result, and feeds the information back to the large language model.
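One way to read "hybrid inference endpoints" is dispatching between locally loaded models and hosted ones. The sketch below shows only that routing decision; the endpoint URL pattern and the `LOCAL_MODELS` set are illustrative assumptions, and the actual inference call (e.g. a `transformers` pipeline or an HTTP POST) is left as a comment.

```python
# Sketch of hybrid endpoint dispatch: run locally when a model is available,
# otherwise fall back to a hosted endpoint (URL pattern is illustrative).
LOCAL_MODELS = {"hustvl/yolos-tiny"}

def endpoint_for(model_id):
    if model_id in LOCAL_MODELS:
        return ("local", model_id)
    return ("remote", f"https://api-inference.huggingface.co/models/{model_id}")

def run_task(model_id, payload):
    kind, target = endpoint_for(model_id)
    # Real code would call transformers.pipeline(...) locally, or POST the
    # payload to the hosted endpoint; here we only record the routing.
    return {"endpoint": kind, "target": target, "input": payload}

print(run_task("hustvl/yolos-tiny", {"image": "a.jpg"}))
print(run_task("facebook/detr-resnet-101", {"image": "a.jpg"}))
```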
Response Generation
After all tasks are executed, HuggingGPT enters the response generation phase. In this phase, HuggingGPT integrates all information from the previous three stages (task planning, model selection, and task execution) into a concise summary, including the planned task list, model selections, and inference results. Most importantly, the inference results form the basis for HuggingGPT’s final decisions. These inference results appear in a structured format, such as bounding boxes with detection probabilities in object detection models or answer distributions in question-answering models.
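The response-generation prompt can be sketched as below: the structured results (here, a fabricated detection output with bounding boxes and probabilities, and a caption) are serialized into the prompt so the LLM grounds its final answer in them. The exact prompt wording and result schema are assumptions for illustration.

```python
import json

# Sketch of the response-generation prompt. The results dict is fabricated
# example data in the structured formats the article mentions.
results = {
    0: {"task": "object-detection",
        "output": [{"label": "cat", "score": 0.97, "box": [10, 20, 200, 180]}]},
    1: {"task": "image-captioning",
        "output": "a cat sitting on a sofa"},
}

def response_prompt(user_request, results):
    # Serialize each task's structured output into the summarization prompt.
    body = "\n".join(
        f"Task {i} ({r['task']}): {json.dumps(r['output'])}"
        for i, r in results.items()
    )
    return (f"User request: {user_request}\n"
            f"Execution results:\n{body}\n"
            "Answer the request using only the results above.")

print(response_prompt("What is in the picture?", results))
```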
Experiments
The research used two variants of GPT models, gpt-3.5-turbo and text-davinci-003, as large language models, which are publicly accessible via the OpenAI API. Table 5 provides detailed prompt designs for the task planning, model selection, and response generation stages.
HuggingGPT dialogue demonstration example: In the demonstration, the user inputs a request that may contain multiple tasks or multimodal resources. Then, HuggingGPT relies on LLMs to organize the collaboration of multiple expert models and generate feedback to the user.
Figure 3 shows HuggingGPT’s workflow when there are resource dependencies between tasks. In this case, HuggingGPT can parse specific tasks from the user’s abstract request, including pose detection, image description, etc. Moreover, HuggingGPT successfully identifies the dependencies between Task 3 and Tasks 1 and 2, injecting the inference results of Tasks 1 and 2 into the input parameters of Task 3 after the dependent tasks are completed.
Figure 4 showcases HuggingGPT’s dialogue capabilities in audio and video modalities.
Figure 5 demonstrates HuggingGPT integrating multiple user input resources to perform simple inference.
The research also tested HuggingGPT on multimodal tasks, as shown in the figure below. Through the collaboration of large language models and expert models, HuggingGPT can handle tasks across modalities such as language, image, audio, and video, covering forms including detection, generation, classification, and question answering.
In addition to the aforementioned simple tasks, HuggingGPT can also accomplish more complex tasks. Figure 8 demonstrates HuggingGPT’s ability to handle complex tasks in multi-turn dialogue scenarios.
Figure 9 shows that for a simple request to describe an image in as much detail as possible, HuggingGPT can expand it into five related tasks: image captioning, image classification, object detection, segmentation, and visual question answering. HuggingGPT assigns an expert model to each task, and these models report image-related information from different aspects back to the LLM. Finally, the LLM integrates this information and produces a comprehensive and detailed description.
The publication of this research has also led netizens to exclaim that AGI may be about to emerge from the open-source community.
Some have compared it to a company manager, commenting that “HuggingGPT is somewhat like a real-world scenario where a company has a group of super engineers, each excelling in their specialties, and now there is a manager to oversee them. When someone has a request, this manager analyzes the need, assigns it to the appropriate engineers, and ultimately combines the results to return to the user.”
Others praise HuggingGPT as a revolutionary system that utilizes the power of language to connect and manage existing AI models from different fields and modalities, paving the way for achieving AGI.
Reference link: https://twitter.com/search?q=HuggingGPT&src=typed_query&f=top