
Introduction to GLM-PC
GLM-PC, built on Zhipu’s leading multimodal large model CogAgent, is the world’s first plug-and-play intelligent computer agent available to the public. It has human-like abilities to “observe” and “operate” a computer, helping users handle a wide range of computer tasks efficiently. Since GLM-PC v1.0 was released on November 29, 2024 and entered internal testing, we have continuously iterated on the technology and expanded its functionality. The latest version introduces an innovative “Deep Thinking” mode, adds features for logical reasoning and code generation, and extends full support to the Windows operating system.
GLM-PC Architecture
In recent years, discussion of Agents at the model-architecture level has deepened. The tool-calling capabilities of large language models (LLMs) revealed for the first time how LLMs can integrate closely with human workflows as Agents, demonstrating excellent generalization and few-shot learning. However, their scope of application is limited to text interaction and publicly accessible tools.
Driven by visual language models (VLMs) such as CogAgent, graphical user interface agents (GUI Agents) have opened a new technical path: interaction across the entire GUI space through multimodal perception. By visually perceiving interface elements and layouts as humans do, these GUI Agents can perform actions such as clicking and typing, greatly broadening the scope of Agents in virtual interactions.
At the same time, multi-Agent systems such as SWE-agent have demonstrated the potential of collaboration, combining the strengths of multiple models to explore planning, reflection, and self-iteration. We firmly believe that the key to the development of Agents lies in enhancing model capabilities and optimizing collaborative architectures.
An ideal Agent should possess the following characteristics: at the perception level, it can handle various signals such as text, images, video, and audio; at the cognitive level, it has logical thinking and task planning abilities (similar to the left brain), as well as efficient perception and flexible operation abilities (similar to the right brain); at the execution level, it can operate across the entire GUI space, receive environmental feedback, and perform self-correction.
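To make this three-level picture concrete, here is a minimal interface sketch in Python. It is our own illustration rather than GLM-PC’s actual code; every class and method name is hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class Observation:
    """A multimodal snapshot of the environment (perception level)."""
    screenshot: bytes      # raw pixels of the current GUI
    text: str = ""         # accompanying text, OCR output, etc.


@dataclass
class Action:
    """A primitive GUI operation (execution level)."""
    kind: str              # e.g. "click", "type", "scroll"
    args: dict[str, Any] | None = None


class Agent(ABC):
    @abstractmethod
    def perceive(self) -> Observation:
        """Perception: gather signals from the environment."""

    @abstractmethod
    def plan(self, goal: str, obs: Observation) -> list[str]:
        """Cognition ("left brain"): decompose the goal into subtasks."""

    @abstractmethod
    def act(self, subtask: str, obs: Observation) -> Action:
        """Cognition ("right brain") plus execution: map a subtask to a GUI action."""

    @abstractmethod
    def reflect(self, action: Action, obs: Observation) -> bool:
        """Self-correction: judge from environmental feedback whether the step succeeded."""
```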
Guided by this philosophy, we launched the open-source CogAgent model in 2023, filling the gap in multimodal perception for GUI Agents; in November 2024, GLM-PC v1.0 further enhanced perception, planning, and creative capabilities, achieving limited self-correction functionality.
The new version of GLM-PC draws on the division of labor between the human “left brain” and “right brain”: through code generation and graphical-interface understanding, it deeply integrates logical reasoning with perceptual cognition, balancing logic and creativity to help humans complete complex tasks. This is made possible by tightly coupling Zhipu’s self-developed multimodal model CogAgent with its code model CodeGeeX. The new GLM-PC drives workflows and tool calls in code form, strengthening its planning, reasoning, and reflection abilities under the Deep Thinking mode and responding stably and efficiently to complex scenarios and tasks. In actual operation, GLM-PC can perceive multiple layers of environmental feedback, reflect on it, and effectively self-correct and optimize.
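To give a flavor of what “workflows and tool calls in code form” can look like, here is a sketch of the kind of short script a planner might emit instead of isolated tool calls. The tool functions below are hypothetical stand-ins we define only for illustration, not GLM-PC’s actual API.

```python
from typing import Any


def extract_table_from_image(path: str) -> list[dict[str, Any]]:
    """Hypothetical vision tool that reads a table out of a screenshot."""
    return [{"name": "demo item", "price": 42.0}]


def write_to_excel(rows: list[dict[str, Any]], path: str) -> None:
    """Hypothetical spreadsheet-writing tool."""
    print(f"writing {len(rows)} rows to {path}")


# The "workflow as code" a planner might emit, chaining tools with plain logic:
rows = extract_table_from_image("screenshot.png")   # perception
cheap = [r for r in rows if r["price"] < 100]       # reasoning / filtering
write_to_excel(cheap, "products.xlsx")              # action
```

Expressing the workflow as code lets ordinary control flow (loops, conditionals, intermediate variables) carry the plan, rather than a rigid sequence of single tool invocations.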
Notably, to promote research on pre-trained GUI Agents, we open-sourced the fully upgraded model, CogAgent-9B-20241220, in December 2024.
Related materials for CogAgent-9B-20241220:
Paper: Hong et al. “CogAgent: A Visual Language Model for GUI Agents.” CVPR 2024 (Highlight 🏆)
Blog: https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report
Huggingface: https://huggingface.co/THUDM/cogagent-9b-20241220
GitHub: https://github.com/THUDM/CogAgent
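For readers who want to try the open model, a minimal inference sketch might look like the following. It assumes the trust_remote_code loading path and the chat-template conventions of THUDM’s model cards; the GitHub repository above contains the authoritative example.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "THUDM/cogagent-9b-20241220"

# trust_remote_code pulls in the model's custom multimodal code from the Hub.
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)

image = Image.open("screenshot.png").convert("RGB")  # a GUI screenshot
# Prompt format below is an assumption based on THUDM's GLM-4V-style model
# cards; consult the CogAgent repository for the exact template.
query = "Task: open the Settings menu. What action should be taken next?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True, tokenize=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```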
GLM-PC Features
In terms of functionality, GLM-PC continues to innovate. The newly launched “Deep Thinking” mode further strengthens its planning, reasoning, and reflection capabilities. With simple commands, users can instruct GLM-PC to complete the entire process from information extraction and data processing to task execution. Whether automatically extracting product data from images and storing it in Excel, or searching Xiaohongshu for data and writing code to store the results, GLM-PC handles it effortlessly. It also adds support for the Windows system, letting more users enjoy the convenience this intelligent agent brings.
Agent’s Left Brain and Right Brain Functions:
Left Brain of the Agent: Logical Reasoning and Task Execution
GLM-PC’s “left brain” focuses on precise logical reasoning and task execution, encompassing core functions such as:
1. Task Planning
Based on the user’s specific needs, GLM-PC can quickly construct detailed task plans. It comprehensively considers goal requirements and available resources, generating action roadmaps and automatically breaking down complex tasks into manageable subtasks, establishing a clear execution path.
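Such a plan is naturally represented as plain data. Below is a small illustrative sketch of a subtask list with dependencies; GLM-PC’s internal representation is not public, so every name here is our own.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    step_id: int
    description: str
    depends_on: list[int] = field(default_factory=list)  # prerequisite steps
    done: bool = False


# A plan the planner might produce for "collect product data into Excel":
plan = [
    Subtask(1, "Open the product page and take a screenshot"),
    Subtask(2, "Extract the product table from the screenshot", depends_on=[1]),
    Subtask(3, "Write the extracted rows to products.xlsx", depends_on=[2]),
]


def ready(plan: list[Subtask]) -> list[Subtask]:
    """Subtasks whose dependencies are all complete and can run next."""
    finished = {s.step_id for s in plan if s.done}
    return [s for s in plan if not s.done and set(s.depends_on) <= finished]
```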
2. Looping Execution
Once the planning phase is complete, GLM-PC activates the code generation module and enters a logical execution loop, steadily advancing the task progress. This looping execution mechanism ensures precise and automated task execution, forming a seamless closed loop from input to output, reducing the need for human intervention.
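A minimal sketch of this closed loop follows; generate_code and run are hypothetical stand-ins for GLM-PC’s code-generation module and its executor.

```python
def generate_code(subtask: str, context: dict) -> str:
    """Stand-in for the code-generation model: returns a Python snippet."""
    return f"result = 'completed: {subtask}'"


def run(snippet: str, context: dict) -> dict:
    """Stand-in executor: runs the snippet and returns its outputs."""
    scope: dict = dict(context)
    exec(snippet, scope)  # a real system would sandbox generated code
    return {"result": scope.get("result")}


def execute_plan(subtasks: list[str]) -> dict:
    """Looping execution: each subtask's output feeds the next one's context."""
    context: dict = {}
    for subtask in subtasks:
        snippet = generate_code(subtask, context)
        context.update(run(snippet, context))
    return context


print(execute_plan(["open page", "extract table", "save to Excel"]))
```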
3. Deep Thinking Ability: Dynamic Reflection, Error Correction, and Strategy Optimization
GLM-PC’s “left brain” is not limited to generating static plans; during execution it can respond in real time to environmental changes with dynamic reflection, error correction, and strategy optimization (a minimal loop sketch follows the list below). This capability shows up in several ways:
Agile Handling of Interruptions: In the face of external factors causing process interruptions, GLM-PC can quickly re-plan logical paths, ensuring task continuity and smoothness.
Proactive Information Enhancement: When encountering insufficient information, GLM-PC will actively interact with users to supplement the required information by asking questions, optimizing the task execution plan.
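As promised above, here is a minimal sketch of this reflect-and-retry behavior, with hypothetical helpers: the loop retries a failed step and pauses to ask the user when information is missing.

```python
class MissingInfo(Exception):
    """Raised when a step cannot proceed without more user input."""


def attempt(step: str, notes: dict) -> bool:
    """Stand-in for executing one step; False means a recoverable failure."""
    if "recipient" in step and "recipient" not in notes:
        raise MissingInfo("Who should receive the message?")
    return True


def run_with_reflection(steps: list[str], max_retries: int = 2) -> None:
    notes: dict = {}
    for step in steps:
        for _try in range(max_retries + 1):
            try:
                if attempt(step, notes):
                    break  # step succeeded, move on
            except MissingInfo as question:
                # Proactive information enhancement: ask, record, retry.
                notes["recipient"] = input(f"{question} ")
            # Agile interruption handling: loop around and retry the step.
        else:
            raise RuntimeError(f"gave up on step: {step}")


run_with_reflection(["open chat window", "send message to recipient"])
```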
Right Brain of the Agent: Image and GUI Cognition
GLM-PC’s “right brain” is dedicated to enhancing deep perception and interactive experiences. Its main functions include:
Graphical User Interface Understanding: Accurately identifying graphical user interface elements, including buttons, icons, layouts, etc., and mastering their functions and interaction logic.
User Behavior Recognition: Smartly recommending suitable operation options for the current interface by learning the structure of the user interface and historical operation data.
Image Semantic Analysis: Deeply analyzing complex image content to extract key information, such as text, symbols, and key trends and data indicators from data visualization charts.
Multimodal Information Integration: Combining image and text information into a comprehensive perceptual understanding, for example recognizing both a button’s position and its related text description in the interface, to help the “left brain” formulate precise operation strategies (see the sketch after this list).
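As referenced in the last item, here is a small illustrative sketch (all types and values hypothetical) of combining perception output with an instruction from the “left brain”: detected elements carry both position and text, and the match determines where to click.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    kind: str                         # "button", "icon", "textbox", ...
    text: str                         # label or nearby caption, from OCR
    bbox: tuple[int, int, int, int]   # (x1, y1, x2, y2) in screen pixels

    @property
    def center(self) -> tuple[int, int]:
        x1, y1, x2, y2 = self.bbox
        return (x1 + x2) // 2, (y1 + y2) // 2


def find_target(elements: list[UIElement], query: str) -> UIElement | None:
    """Pick the element whose label best matches the instruction text."""
    matches = [e for e in elements if query.lower() in e.text.lower()]
    return matches[0] if matches else None


# Perception output ("right brain") for one screenshot:
elements = [
    UIElement("button", "Save", (100, 40, 160, 70)),
    UIElement("button", "Send Message", (200, 40, 320, 70)),
]

target = find_target(elements, "send message")  # the "left brain's" intent
if target is not None:
    print(f"click at {target.center}")          # -> click at (260, 55)
```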
Practical Applications of GLM-PC
In practical applications, GLM-PC demonstrates strong adaptability and creativity. It can help users efficiently process information and engage in social interactions on Xiaohongshu, automatically extracting the required data and storing it in designated locations. It can also serve as a CET-6 English vocabulary learning assistant, automatically extracting vocabulary from designated websites, practicing example sentences, and saving the results in Word documents for the user to review. GLM-PC likewise excels at personalized WeChat greetings: it can compose customized New Year messages with matching images or videos and send them to many contacts in one click, completing holiday greetings efficiently.
Beyond these applications, GLM-PC can intelligently query flight information, filter tickets, and set calendar reminders, providing one-stop service from flight inquiry to schedule arrangement. For users who handle large numbers of PDF files, GLM-PC is also a powerful assistant: it can automatically open PDF files, extract the specified content, and organize the information into Word documents, significantly improving work efficiency.
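GLM-PC’s generated code is not public, but as a flavor of what such a PDF-to-Word automation could emit, here is a standalone sketch using the third-party pypdf and python-docx packages; the file names and keyword are placeholders.

```python
# pip install pypdf python-docx
from pypdf import PdfReader
from docx import Document

KEYWORD = "revenue"               # the "specified content" to extract

reader = PdfReader("report.pdf")  # placeholder input file
doc = Document()
doc.add_heading("Extracted passages", level=1)

for page_no, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""
    for line in text.splitlines():
        if KEYWORD.lower() in line.lower():
            # Keep the page number so passages stay traceable to the source.
            doc.add_paragraph(f"p.{page_no}: {line.strip()}")

doc.save("extracted.docx")
print("saved extracted.docx")
```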
Collaboration and Outlook
Looking ahead, GLM-PC is engaged in deep collaborations with well-known PC manufacturers such as Lenovo and ASUS to jointly advance the AIPC (AI Personal Computer). As the technology matures and applications expand, an AIPC will be not just a computer but a new vehicle for AI Agents in personal computing, bringing users a more efficient and intelligent work and life experience and becoming an important part of the future personal computing landscape.
In summary, GLM-PC, as an intelligent computer agent built on Zhipu’s multimodal large model, demonstrates powerful logical reasoning, perceptual cognition, planning and execution, and self-correction. Its multimodal interaction and left-right brain collaboration model give users an unprecedented computing experience, and we have reason to believe GLM-PC will play an increasingly important role in the personal computing domain.