
Solving complex artificial intelligence tasks is a key step towards achieving Artificial General Intelligence (AGI). Despite the abundance of AI models targeting different domains and modalities, no single model can handle complex, multi-step AI tasks on its own.
Given the remarkable capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning, the authors argue that LLMs can serve as controllers that manage existing AI models to solve complex AI tasks, with language acting as a universal interface.
Based on this concept, the authors introduced the HuggingGPT system, which connects LLMs (like ChatGPT) with various AI models in the machine learning community (like HuggingFace) to solve AI tasks.
Specifically, upon receiving a user request, ChatGPT plans the tasks, selects models according to the functionality descriptions published on HuggingFace, executes each sub-task with the chosen AI model, and summarizes a response from the execution results.
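As a minimal sketch, the first stage (task planning) amounts to prompting the controller LLM to emit a structured task list. The prompt wording below is hypothetical; the "task"/"id"/"dep"/"args" fields mirror the task schema described in the paper, while the concrete task names in the example plan are illustrative stand-ins.

```python
import json

# Illustrative sketch of the task-planning stage: the controller LLM is
# prompted to decompose a user request into a structured task list.
PLANNING_TEMPLATE = (
    "Decompose the user request into a JSON list of tasks. Each task has "
    '"task" (type), "id", "dep" (ids of prerequisite tasks, -1 if none), '
    'and "args".\nUser request: {request}'
)

def build_planning_prompt(request: str) -> str:
    """Stage 1 (task planning): prompt the LLM for a structured plan."""
    return PLANNING_TEMPLATE.format(request=request)

# A plan the LLM might return; "<resource>-0" denotes the output of
# task 0, following the paper's resource-dependency notation.
example_plan = json.loads(
    '[{"task": "pose-detection", "id": 0, "dep": [-1], "args": {"image": "a.jpg"}},'
    ' {"task": "pose-text-to-image", "id": 1, "dep": [0],'
    '  "args": {"text": "a girl", "image": "<resource>-0"}}]'
)
print(build_planning_prompt("count the zebras in these images"))
print([t["task"] for t in example_plan])
```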
By combining ChatGPT's powerful language capabilities with the rich collection of AI models on HuggingFace, HuggingGPT can cover numerous complex AI tasks across different modalities and domains. It achieves strong results on challenging tasks in language, vision, and speech, paving a new path towards AGI.
Background
Large language models (LLMs), such as ChatGPT, have garnered significant attention from both academia and industry due to their outstanding performance in various natural language processing (NLP) tasks. The powerful capabilities of LLMs have given rise to many emerging research topics that further explore their potential and open up possibilities for constructing AGI systems.
Despite the tremendous success, current LLM technology still has shortcomings and faces several challenges on the path to building AGI systems.
The main challenges include:

- Limited to text as input and output, current LLMs, despite significant achievements in NLP tasks, cannot directly process complex information in other modalities such as images and audio.
- In real-world scenarios, complex tasks often consist of multiple sub-tasks that require the scheduling and collaboration of several models, which exceeds the capabilities of a single language model.
- On some challenging tasks, LLMs in zero-shot or few-shot settings still perform worse than specialized expert models.
Method
The authors used two GPT models, gpt-3.5-turbo and text-davinci-003, as representatives of large language models, which can be accessed publicly via the OpenAI API.
The authors employed the object detection model facebook/detr-resnet-101, known for high-precision detection with a ResNet-101 backbone, to identify and locate zebras in images. The model drew predicted bounding boxes around the detected zebras and saved the annotated images to the following locations: /images/9831.jpg, /images/be11.jpg.
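As a hedged illustration of this step, the sketch below filters detector output for high-confidence zebra boxes. The dict layout (label/score/box) mirrors the output format of the Hugging Face `transformers` object-detection pipeline; the detections themselves and the threshold are made-up examples, not results from the paper.

```python
# Sketch: keep only high-confidence detections of a target class.
# The label/score/box dict format mirrors the transformers
# object-detection pipeline output; the sample data is illustrative.
def filter_detections(detections, label="zebra", threshold=0.9):
    return [d for d in detections if d["label"] == label and d["score"] >= threshold]

detections = [
    {"label": "zebra", "score": 0.98, "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 180}},
    {"label": "grass", "score": 0.75, "box": {"xmin": 0, "ymin": 0, "xmax": 640, "ymax": 480}},
]
zebras = filter_detections(detections)
print(len(zebras))  # prints 1: one zebra box above the threshold
```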
The authors then used the text classification model cardiffnlp/twitter-xlm-roberta-base-sentiment, a multilingual XLM-RoBERTa-base model trained for sentiment analysis, to analyze the generated titles, and combined this with the predicted bounding boxes to confirm the presence of zebras in the images. The system confirmed a total of 4 zebras across images A, B, and C.

LLMs (e.g., ChatGPT) connect with various AI models (e.g., those in HuggingFace) using API interfaces to solve complex AI tasks. In this concept, LLMs serve as controllers, managing and organizing the collaboration of expert models. LLMs first plan a series of tasks based on user needs and then assign expert models for each task. After the experts perform their tasks, LLMs collect the results and respond to the user.
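The controller loop described above can be sketched in a few lines. Every function below is a stub standing in for a real LLM or expert-model call, and the registry entry is just an example model name; this is an assumption-laden outline of the plan-select-execute-respond flow, not the system's actual implementation.

```python
# Minimal sketch of the controller loop: plan -> select -> execute -> respond.
# All four stages are stubs; a real system would call an LLM and
# Hugging Face inference endpoints at each step.
def plan_tasks(request: str) -> list:
    # Stand-in for the LLM's task-planning output.
    return [{"task": "image-captioning", "id": 0, "args": {"image": request}}]

def select_model(task: dict) -> str:
    # Stand-in for selection over Hugging Face model descriptions.
    registry = {"image-captioning": "nlpconnect/vit-gpt2-image-captioning"}
    return registry[task["task"]]

def execute(model: str, task: dict) -> str:
    # Stand-in for running the expert model on the task.
    return f"[{model} output for {task['task']}]"

def respond(request: str, results: list) -> str:
    # Stand-in for the LLM summarizing execution results for the user.
    return f"Request: {request}. Results: {'; '.join(results)}"

def hugginggpt(request: str) -> str:
    tasks = plan_tasks(request)
    results = [execute(select_model(t), t) for t in tasks]
    return respond(request, results)

print(hugginggpt("photo.jpg"))
```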
Results
HuggingGPT relies on LLMs to organize the collaboration of multiple expert models, generating answers for users.
HuggingGPT can parse specific tasks from abstract user requests, including pose detection, image description, and pose-conditioned image generation tasks.
The results also demonstrate that HuggingGPT successfully completed text-to-audio and text-to-video tasks requested by users through expert models.
In one example the two expert models operate in parallel because their tasks are independent, while in another the two models execute sequentially because one task consumes the other's output. This further confirms that HuggingGPT can organize collaboration between models and manage resource dependencies between tasks.
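Such dependency management can be sketched as a topological ordering over the planned tasks. The "dep" field (-1 meaning no dependency) mirrors the paper's task schema; the Kahn-style scheduling code itself is my own simplified assumption about how execution order could be derived.

```python
from collections import deque

# Sketch: order tasks so each runs only after its prerequisites.
# Tasks whose dependency sets are empty could be dispatched in parallel;
# here we simply emit a valid sequential order (Kahn's algorithm).
def schedule(tasks):
    pending = {t["id"]: {d for d in t["dep"] if d != -1} for t in tasks}
    by_id = {t["id"]: t for t in tasks}
    ready = deque(tid for tid, deps in pending.items() if not deps)
    order = []
    while ready:
        tid = ready.popleft()
        order.append(by_id[tid])
        for other, deps in pending.items():
            if tid in deps:
                deps.discard(tid)
                if not deps:          # all prerequisites satisfied
                    ready.append(other)
    return order

tasks = [
    {"id": 1, "task": "pose-to-image", "dep": [0]},   # needs task 0's output
    {"id": 0, "task": "pose-detection", "dep": [-1]},  # no dependency
]
print([t["task"] for t in schedule(tasks)])
```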
The authors found that HuggingGPT can decompose the main task into multiple basic tasks and integrate multiple model inferences to arrive at the correct answer, even when involving various resources.
Limitations
Despite the strong capabilities demonstrated by HuggingGPT, the authors acknowledge some limitations that need to be addressed.
Firstly, efficiency is a concern. HuggingGPT must interact with the large language model several times per user request, across the task planning, model selection, and response generation phases. These repeated interactions increase response latency and degrade the user experience.
Secondly, because LLMs accept only a limited number of tokens, HuggingGPT also faces a maximum context length constraint. To alleviate this issue, the authors adopt a dialogue window and track the dialogue context only during the task planning phase.
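A dialogue window of this kind can be sketched as keeping only the most recent turns that fit a context budget. Character counting below is a crude stand-in for a real tokenizer, and the budget value is arbitrary; this is an illustration of the idea, not the authors' implementation.

```python
# Sketch of a sliding dialogue window: keep only the most recent turns
# whose combined length fits a context budget, dropping the oldest first.
def windowed_context(turns, max_chars):
    kept, used = [], 0
    for turn in reversed(turns):          # walk from newest to oldest
        if used + len(turn) > max_chars:
            break                          # older turns no longer fit
        kept.append(turn)
        used += len(turn)
    return list(reversed(kept))            # restore chronological order

history = ["user: draw a cat", "bot: [image]", "user: now count the cats"]
print(windowed_context(history, max_chars=40))  # oldest turn is dropped
```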
Thirdly, system stability is another factor to consider, and it has two aspects. The first is that large language models occasionally fail to follow instructions during inference, producing outputs in formats that do not meet expectations and causing exceptions in the program workflow. The second is the instability of the inference endpoints hosting expert models on HuggingFace: due to network latency or service status, an expert model may fail during task execution.
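One common defensive pattern for the first failure mode is to validate the LLM's output before acting on it. The sketch below is a generic guard of my own, not the paper's code: it parses a plan and falls back to an empty plan when the model strays from the expected JSON format, so the caller can re-prompt instead of crashing.

```python
import json

# Sketch: guard against an LLM returning a malformed task list.
# On parse failure, return an empty plan so the workflow can recover
# (e.g., by re-prompting) instead of raising mid-pipeline.
def parse_plan(llm_output: str) -> list:
    try:
        plan = json.loads(llm_output)
        if isinstance(plan, list):
            return plan
    except json.JSONDecodeError:
        pass
    return []

print(parse_plan('[{"task": "image-to-text", "id": 0}]'))  # valid plan
print(parse_plan("Sure! Here is the plan: ..."))            # prints []
```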
In summary, despite the limitations of HuggingGPT, the authors still see its immense potential in solving complex AI tasks. Continuous research and improvements will help the authors overcome these challenges and achieve more advanced artificial intelligence systems.
Conclusion
The authors proposed a system named HuggingGPT to solve AI tasks.
Essentially, it treats LLMs as controllers managing AI models and utilizes models from machine learning communities like HuggingFace to address various user requests. By leveraging the advantages of LLMs in understanding and reasoning, HuggingGPT can dissect user intentions and break down tasks into multiple sub-tasks. With the capabilities of numerous AI models in the machine learning community, HuggingGPT shows great potential in tackling challenging AI tasks.
References
- Data & Code: https://github.com/microsoft/JARVIS
- Original Paper: Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang (2023). HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace. arXiv:2303.17580v1