GLM-PC Base Model, CogAgent-9B Open Source


On November 29, Zhipu officially proposed the concept of GLM-OS and released two agent products: AutoGLM and GLM-PC. To promote the development of the large-model agent ecosystem, Zhipu has decided to open-source CogAgent-9B, the base model of GLM-PC, for further development by the community. CogAgent-9B is now available on the MoLe community and can be tried immediately!

🔗 Model Link: https://modelers.cn/models/zhipuai/cogagent-9b-20241220 (adapted for Ascend cards)

CogAgent-9B-20241220 is a dedicated agent-task model trained on the GLM-4V-9B base. It requires only a screenshot as input (no HTML or other textual representation) to predict the next GUI operation for any user-specified task, taking the history of previous actions into account. Thanks to the universality of screenshots and GUI operations, CogAgent can be applied to a wide range of GUI interaction scenarios, such as personal computers, mobile phones, and in-car devices.
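As a rough illustration of a single prediction step, the sketch below loads the open-source checkpoint through the Hugging Face transformers interface used by the GLM-4V-9B base (with trust_remote_code) and asks for one action. The prompt wording, the history encoding, and the requested output format are illustrative assumptions only; follow the model card and technical report for the exact template.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogagent-9b-20241220"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

screenshot = Image.open("screenshot.png").convert("RGB")

# Illustrative prompt only: the real template (task wording, history encoding,
# output-mode switches such as Status/Plan/Action/Operation) is defined in the
# model card and technical report.
query = (
    "Task: open the browser and check tomorrow's weather\n"
    "History: (none)\n"
    "Answer with Status, Plan, Action and Grounded Operation."
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": screenshot, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)

reply = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(reply)  # Status / Plan / Action / Grounded Operation text
```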

Compared with the first CogAgent model open-sourced in December 2023, CogAgent-9B-20241220 brings significant improvements in GUI perception, reasoning and prediction accuracy, action-space completeness, task universality, and generalization, and it supports bilingual (Chinese and English) interaction with both screenshots and language.


CogAgent-9B

Paper:

  • https://arxiv.org/abs/2312.08914

Code:

  • https://github.com/THUDM/CogAgent

Model:

  • Huggingface: https://huggingface.co/THUDM/cogagent-9b-20241220

  • MoLe Community: https://modelers.cn/models/zhipuai/cogagent-9b-20241220 (adapted for Ascend cards)

Technical Documentation:

  • https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report


Execution Process

CogAgent uses the GUI screenshot as its only environmental input and, combined with the history of completed actions, computes the most appropriate action for the current screen. The action is injected into the GUI through a CogAgent client application (such as GLM-PC or the CogAgent Demo App); the GUI responds and updates its screen content, and the action is appended to the action history. CogAgent then computes the next operation from the updated history and screenshot, and the process repeats until CogAgent determines that the instruction has been completed.
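In code, one turn of this loop can be sketched as follows; the callables (capture_screenshot, predict_step, execute) are placeholders for whatever the client application provides, and AgentStep is an illustrative container, not part of the released API.

```python
from typing import Callable, List, NamedTuple
from PIL import Image

class AgentStep(NamedTuple):
    action: str          # natural-language description of the next action
    operation: str       # structured, function-call-like description
    is_sensitive: bool   # whether the model flagged a sensitive operation
    finished: bool       # whether the model judged the instruction completed

def run_agent(
    task: str,
    capture_screenshot: Callable[[], Image.Image],
    predict_step: Callable[[str, List[str], Image.Image], AgentStep],
    execute: Callable[[str], None],
    max_steps: int = 20,
) -> List[str]:
    """Schematic CogAgent loop: only the screenshot and the action history feed the model."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()               # current GUI state as an image
        step = predict_step(task, history, screenshot)  # one model call (see sketch above)
        if step.finished:                               # model decides the task is done
            break
        if step.is_sensitive:
            pass  # a real client (e.g. GLM-PC) would ask the user to confirm here
        execute(step.operation)                         # client injects the action into the GUI
        history.append(step.action)                     # the GUI updates; the history grows
    return history
```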
The input to CogAgent consists of only three parts: the user’s natural-language instruction, the record of previously executed actions, and the GUI screenshot. No textual representation of the layout and no additional element tags (set-of-marks) are required.

Its output covers the following four aspects:
  • Thinking Process (Status & Plan): CogAgent explicitly outputs its understanding of the GUI screenshot and its reasoning for deciding the next action, covering the current status (Status) and the plan (Plan); whether this content is output can be controlled through parameters.

  • Natural Language Description of the Next Action (Action): This description is appended to the action history, helping the model keep track of the steps it has already executed.

  • Structured Description of the Next Action (Grounded Operation): CogAgent describes the next operation and its parameters in a structured, function-call-like form, making it easy for the client application to parse and execute the model output (a parsing sketch follows this list). Its action space covers GUI operations (basic actions such as left-clicking and text input) and anthropomorphic behaviors (advanced actions such as launching applications and calling language models).

  • Sensitivity Judgment of the Next Action: Actions are divided into “general operations” and “sensitive operations”; the latter are actions that may have irreversible consequences, such as clicking the “send” button in an email-sending task.
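To make the structured output concrete, here is a minimal parsing sketch; the field labels (Status, Plan, Action, Grounded Operation) and their single-line layout are assumptions for illustration, and the authoritative output grammar is given in the technical documentation linked above.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParsedReply:
    status: Optional[str]
    plan: Optional[str]
    action: str              # natural-language description (goes into the action history)
    grounded_operation: str  # function-call-like string, e.g. CLICK(box=[[...]])

def parse_reply(text: str) -> ParsedReply:
    """Parse a reply assumed to contain single-line 'Label: value' fields."""
    def field(label: str) -> Optional[str]:
        # Accept both ASCII and full-width colons; assumes each field fits on one line.
        m = re.search(rf"{label}\s*[:：]\s*(.+)", text)
        return m.group(1).strip() if m else None

    return ParsedReply(
        status=field("Status"),
        plan=field("Plan"),
        action=field("Action") or "",
        grounded_operation=field("Grounded Operation") or "",
    )
```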


Model Upgrade


Model Base and Structure Upgrade: We adopted the more powerful visual language model GLM-4V-9B as the base, significantly enhancing the model’s image understanding performance.

Visual Processing Module Optimization: We built a more efficient, unified visual processing module that supports 1120×1120 native high-resolution image input. A parameterized downsampling method improves efficiency with almost no loss of capability. Structurally, CogAgent supports images of any aspect ratio or size, but during training and inference input images are uniformly scaled to 1120×1120. Although the input size is fixed, tests show that the model maintains accurate understanding even on 2K or higher-resolution screens. For best results, users are advised to moderately increase the relative size of icons and text so that the content of the scaled screenshot remains clear and legible, as illustrated below.
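The sketch below scales an arbitrary screenshot to 1120×1120 with Pillow. The released model’s own preprocessing handles this internally, so this only illustrates the effect of the fixed input size; the file path is a placeholder.

```python
from PIL import Image

TARGET_SIZE = (1120, 1120)  # CogAgent's fixed input resolution

def preprocess_screenshot(path: str) -> Image.Image:
    """Scale a screenshot of any size or aspect ratio to the model's 1120x1120 input.

    On 2K or higher-resolution screens, enlarging icons and text in the OS helps
    keep details legible after this downscaling.
    """
    img = Image.open(path).convert("RGB")
    return img.resize(TARGET_SIZE)  # default resampling (bicubic) is sufficient here
```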
Dataset Enrichment and Improvement: We collected and integrated a wide variety of datasets, including unsupervised data and GUI instruction fine-tuning data. The unsupervised data comprises open-source GUI layout datasets and self-collected application and web-page datasets; the GUI instruction fine-tuning data covers longer action chains, more applications, and cross-application GUI agent tasks. In addition, data generated by CogAgent itself was used to further expand and improve the dataset.

Pre-training Strategy Optimization: VLM and GUI pre-training aim to strengthen the model’s basic understanding of visual inputs and GUI interfaces. We propose GUI Grounding pre-training for the first time, using screenshot-layout pairs to build correspondences between interface sub-regions and layout representations (such as DOM elements), which yields the GUI REG and REC tasks:
  • GUI Referring Expression Generation (REG): predicting the layout representation corresponding to a given region of the screenshot.

  • GUI Referring Expression Comprehension (REC): predicting the position of a given element in the screenshot.

This approach has since been applied in the construction of multiple GUI understanding datasets and in other GUI agent work. In the original paper, we used 400,000 web pages to build 140 million REC & REG training samples. Building on this, we further expanded and optimized the training data by adding layout data from desktop and mobile applications, making the model better suited to real application scenarios. A sketch of how such samples can be built from one screenshot and its layout records follows.
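The LayoutElement structure and the prompt and answer wording below are assumptions for illustration, not the templates actually used in training; the sketch only shows how one screenshot-layout pair can yield both REC and REG samples.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutElement:
    # Hypothetical layout record: a DOM-like text representation plus a
    # normalized (x1, y1, x2, y2) bounding box in [0, 1] screen coordinates.
    dom: str
    box: Tuple[float, float, float, float]

def build_grounding_samples(screenshot_path: str,
                            elements: List[LayoutElement]) -> List[dict]:
    """Turn one (screenshot, layout) pair into REC and REG training samples."""
    samples = []
    for el in elements:
        box_str = "[[{:.3f},{:.3f},{:.3f},{:.3f}]]".format(*el.box)
        # REC: given an element description, predict its position on the screenshot.
        samples.append({"image": screenshot_path,
                        "prompt": f"Where is the element {el.dom!r}?",
                        "answer": box_str})
        # REG: given a region, predict the layout representation it corresponds to.
        samples.append({"image": screenshot_path,
                        "prompt": f"What does the region {box_str} correspond to?",
                        "answer": el.dom})
    return samples
```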

Post-training Strategy Improvement: Post-training is crucial for strengthening the model’s GUI agent analysis, reasoning, and prediction capabilities. We adopted a more systematic post-training strategy with two progressively more difficult stages:
  • GUI instruction tuning: Integrating GUI-related multi-task data to deepen the model’s understanding of GUI content and functions and give it initial Q&A capabilities; a wide range of open-source and privately collected data was used.

  • GUI agent SFT: Equipping the model with full GUI agent reasoning capabilities; the training data includes open-source datasets (such as Mind2Web) and additionally collected cross-platform, multi-application data.

Model Reasoning and Thinking Chain Optimization: The chain of thought is decomposed into Status (current screen state), Plan (global plan), Action (natural-language description of the next step), and Operation (formal-language description of the next step). By randomly sampling and mixing training data in different modes (such as Action-Operation, Status-Action-Operation, etc.), the output produced at inference time can be flexibly adjusted to match the interaction scenario, the available compute, and the required accuracy, as in the sketch below.
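A tiny sketch of this mode mixing; the mode names come from this post, while the full set and the sampling weights used in training may differ.

```python
import random

# Output modes named in this post; the full set and mixing ratios may differ.
# At inference time the client requests whichever mode fits its latency and
# accuracy budget.
OUTPUT_MODES = ["Action-Operation", "Status-Action-Operation"]

def sample_training_mode() -> str:
    """Randomly pick a response mode when constructing one training example."""
    return random.choice(OUTPUT_MODES)
```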
Action Space Improvement: We clearly defined the basic action space and added advanced actions such as LLM, QUOTE_TEXT, and LAUNCH, strengthening the model’s tool-use and interaction capabilities; a dispatch sketch follows.
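The sketch below shows one way a client could dispatch a parsed operation, using pyautogui for the basic desktop actions; the operation and argument names are illustrative, and the advanced actions are left as stubs because their implementation is client-specific.

```python
import pyautogui  # one possible backend for executing basic desktop actions

def execute_operation(op_name: str, **kwargs) -> None:
    """Dispatch a parsed Grounded Operation to a concrete handler.

    The operation and argument names below are illustrative; the authoritative
    action space is defined in the CogAgent technical documentation.
    """
    if op_name == "CLICK":                 # basic action: left click at screen coordinates
        pyautogui.click(kwargs["x"], kwargs["y"])
    elif op_name == "TYPE":                # basic action: text input
        pyautogui.write(kwargs["text"])
    elif op_name in {"LAUNCH", "QUOTE_TEXT", "LLM"}:
        # Advanced (anthropomorphic) actions named in this post; how a client
        # implements them is application-specific (e.g. GLM-PC), so they are
        # deliberately left as stubs in this sketch.
        raise NotImplementedError(f"Advanced action {op_name} is client-specific")
    else:
        raise ValueError(f"Unknown operation: {op_name}")
```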

Evaluation Results

We tested the performance of CogAgent-9B-20241220 and comparable models on four datasets: Screenspot, OmniAct, CogAgentBench-basic-cn, and OSWorld.
We compared CogAgent against API-based commercial models (GPT-4o-20240806, Claude-3.5-Sonnet), commercial API + GUI grounding models (GPT-4o + UGround, GPT-4o + OS-ATLAS), and open-source GUI agent models (Qwen2-VL, ShowUI, SeeClick).

The results show that CogAgent achieved leading scores in GUI grounding (Screenspot), single-step operation (OmniAct), the Chinese step-wise leaderboard (CogAgentBench-basic-cn), and multi-step operation (OSWorld), trailing only Claude-3.5-Sonnet (specialized for Computer Use) and GPT-4o combined with an external GUI grounding model on OSWorld.


