Smart GLM-PC Open Experience: Upgraded Multimodal Agent

GLM-PC is built on CogAgent, the intelligent multimodal large model, and is the world’s first publicly available, ready-to-use computer agent. It can “observe” and “operate” a computer like a human, helping users complete a wide range of computer tasks efficiently.

Since GLM-PC v1.0 was released and opened for internal beta testing on November 29, 2024, we have continuously optimized and upgraded it. The recently launched “Deep Thinking” mode adds capabilities dedicated to logical reasoning and code generation, and GLM-PC now also supports Windows.

Download & Experience: https://cogagent.aminer.cn

GLM-PC Architecture

In recent years, discussions on agents at the model and architecture levels have become increasingly in-depth.

The tool-calling capability of large language models (LLMs) demonstrated for the first time how LLMs can act as agents organically embedded in human work, exhibiting good generalization and few-shot learning; however, their reach is limited to tools that interact in text form and are publicly accessible.

A series of graphical user interface agents (GUI agents) based on vision-language models (VLMs), represented by CogAgent, opened a new path: full interaction across the GUI space through multimodal perception. These GUI agents perceive interface elements and layouts visually and simulate human actions such as clicking and typing, greatly expanding the application boundaries of agents in virtual interaction spaces.

At the same time, systems like SWE-agent have demonstrated the potential of multi-agent collaboration, combining the strengths of different models to explore planning, reflection, and self-iteration.

We believe that the development of agents can be summarized as improvements in model capabilities and optimizations in collaborative architecture.

A complete agent must meet the following conditions:

  • On the perception level, it can receive diverse signals such as text, images, video, and audio;

  • On the thinking level, it possesses logical thinking and task planning abilities (similar to the left brain) and efficient perception and flexible operation capabilities (similar to the right brain);

  • On the execution level, it can perform operations across the full GUI space, receive environmental feedback, and self-correct.
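
These three layers can be pictured as a minimal interface. The Python sketch below is purely illustrative; the type and method names are our assumptions and do not come from GLM-PC’s actual implementation:

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Observation:
    """Perception layer: diverse input signals."""
    text: str = ""
    screenshot_png: bytes = b""   # the GUI state captured as an image
    audio_wav: bytes = b""


@dataclass
class Plan:
    """Thinking layer: ordered sub-tasks produced by logical planning."""
    steps: list[str] = field(default_factory=list)


class Agent(Protocol):
    def perceive(self) -> Observation: ...           # receive multimodal signals
    def think(self, obs: Observation) -> Plan: ...   # plan logically, perceive flexibly
    def act(self, plan: Plan) -> Observation: ...    # operate the GUI, collect feedback
```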

Based on this thinking, we launched the open-source CogAgent model in 2023, filling the multimodal-perception gap for GUI agents; in November 2024, GLM-PC v1.0 further strengthened perception, planning, and creative capabilities and achieved limited self-correction.

Now, the new version of GLM-PC draws on the human “left brain” and “right brain” division of labor, achieving a deep integration of logical reasoning and perceptual cognition through code generation and graphical interface understanding, granting it the ability to balance logic and creativity, thus assisting humans in completing complex tasks.

Behind this is the deep integration of two models developed in-house at Zhipu: the multimodal model CogAgent and the code model CodeGeeX. The new GLM-PC directs workflows and tool calls in code form, and its deep thinking mode strengthens planning, reasoning, and reflection, enabling it to handle complex scenarios and tasks stably and efficiently. During actual execution, GLM-PC perceives multiple layers of environmental feedback, which supports the reflection needed for effective self-correction and optimization.
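
As a rough illustration of that cycle, consider the sketch below. The `planner`, `perceiver`, and `executor` objects stand in for the CogAgent/CodeGeeX division of labor described above; all method names here are assumptions, not the models’ real APIs:

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    ok: bool
    error: str = ""   # environmental feedback, e.g. an exception message


def run_task(task, planner, perceiver, executor, max_attempts=3):
    """Plan, act, reflect: the 'left brain' plans and generates actions in
    code form; the 'right brain' reads the screen between attempts."""
    for step in planner.plan(task):                       # decompose the task
        for _ in range(max_attempts):
            ui_state = perceiver.describe_screen()        # multimodal perception
            action = planner.to_action(step, ui_state)    # code-form tool call
            feedback = executor.run(action)               # execute, get feedback
            if feedback.ok:
                break                                     # sub-task complete
            step = planner.reflect(step, feedback.error)  # self-correction
```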

It is worth mentioning that, to promote research on pre-trained GUI agents, we open-sourced the comprehensively enhanced model CogAgent-9B-20241220 in December 2024.

CogAgent-9B-20241220:

  • Paper: Hong et al., “CogAgent: A Visual Language Model for GUI Agents” (CVPR 2024 Highlight 🏆)

  • Blog: https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report

  • Huggingface: https://huggingface.co/THUDM/cogagent-9b-20241220

  • GitHub: https://github.com/THUDM/CogAgent

Agent Left Brain: Code Generation and Logical Execution

The “left brain” of GLM-PC is responsible for rigorous logical reasoning and task execution. Its main functions include:

1. Planning

GLM-PC can quickly formulate a detailed task plan based on the user’s requirements. It comprehensively analyzes the goal and the available resources, generates an execution roadmap, and automatically breaks large tasks into manageable sub-tasks to construct a clear execution path.
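
As an illustration, the plan for the shopping example further below might decompose like this. The structure and field names are assumptions made for the sketch; GLM-PC’s internal plan format is not public:

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:
    description: str
    app: str            # which application the step targets
    done: bool = False


@dataclass
class TaskPlan:
    goal: str
    subtasks: list[SubTask] = field(default_factory=list)


plan = TaskPlan(
    goal="Extract product info from an image and add the items to the Taobao cart",
    subtasks=[
        SubTask("Read product names and prices from the image", app="Vision"),
        SubTask("Write the extracted rows into a new Excel file", app="Excel"),
        SubTask("Search each product and add it to the cart", app="Taobao"),
    ],
)
```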

2. Looping Execution

After the planning phase, GLM-PC activates its code generation module to run logical loops that advance the task step by step. This looping mechanism ensures precise execution and a high degree of automation, forming a complete closed loop from input to output without manual intervention.
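
Continuing the planning sketch above, such a closed execution loop could look like the following; `execute_subtask` is a hypothetical stand-in for running the generated code for one step:

```python
def execute_plan(plan: TaskPlan, execute_subtask, max_rounds: int = 10) -> bool:
    """Advance through the plan until every sub-task succeeds or the retry
    budget is exhausted -- a closed loop with no manual intervention."""
    for _ in range(max_rounds):
        pending = [t for t in plan.subtasks if not t.done]
        if not pending:
            return True                                # all sub-tasks finished
        pending[0].done = execute_subtask(pending[0])  # may fail; retried next round
    return False                                       # gave up after max_rounds
```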

Example Demonstration: One-Stop Shopping Process

For example, to obtain product information, GLM-PC can automatically extract product data from images, store it in Excel, and automatically add the products to the Taobao shopping cart, thus achieving a one-stop shopping process.

Operation Command: Retrieve product information from the image, create a new Excel file on the desktop to store the information, and add the product information to the Taobao shopping cart.

3. Long Thinking Ability: Dynamic Reflection, Correction, and Optimization

The “left brain” of GLM-PC not only generates static plans but also makes real-time adjustments and reflective self-corrections based on new environmental information during execution, continuously optimizing its solutions. Specific manifestations include:

  • Flexibly Responding to Interruptions: When the process is interrupted by external factors, GLM-PC can quickly reconstruct the logical path to ensure the task proceeds smoothly.

  • Proactively Filling Information Gaps: When information is missing, GLM-PC proactively interacts with the user, asking questions to refine the task execution plan.
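
The “self-correct based on the error message” behavior shown in the next example can be pictured as the loop below. This is a hedged sketch: `code_model.fix` is a hypothetical call, not CodeGeeX’s actual API:

```python
import traceback


def run_with_self_correction(code: str, code_model, max_fixes: int = 3) -> bool:
    """Execute generated code; on failure, feed the traceback back to the
    code model and retry with the revised version."""
    for _ in range(max_fixes + 1):
        try:
            exec(code, {})                     # run the generated script
            return True
        except Exception:
            err = traceback.format_exc()       # the error message as feedback
            code = code_model.fix(code, err)   # reflective correction
    return False
```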

Example Demonstration: Efficient Information Processing and Social Interaction

For instance, when helping users process information about “Spring Festival Movies” on Xiaohongshu, GLM-PC can quickly search for and extract relevant data while writing code to store the information on the computer. If there is an error in the generated code, it can self-correct based on the error message.

Operation Command: Search for “Spring Festival Movies” on Xiaohongshu, reference the first image post, send the image to the WeChat group {GGG}, and ask them which movie they want to watch.

Agent Right Brain: Image and GUI Cognition

The “right brain” of GLM-PC focuses on deep perception and interactive experience. Its core functions include:

  • GUI Image Understanding: Accurately identifying graphical interface elements (such as buttons, icons, layouts, etc.) and understanding their functions and interaction logic.

  • User Behavior Cognition: Combining learned knowledge of the interface with the user’s operation history to recommend suitable actions for the current interface.

  • Image Semantic Analysis: Conducting in-depth semantic analysis of complex images, extracting key information such as text, symbols, and the trends and metrics shown in data-visualization charts.

  • Multimodal Information Fusion: Integrating image and text information to form comprehensive perception results. For example, simultaneously identifying button positions and text labels in the user interface, assisting the “left brain” in formulating precise operation plans.
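
A fused perception result might look like the sketch below: each GUI element carries both its visual position and its recognized text, so the “left brain” can target elements by label rather than raw pixels. The schema is illustrative, not CogAgent’s actual output format:

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    kind: str                         # "button", "icon", "input", ...
    label: str                        # recognized text, e.g. "Add to Cart"
    bbox: tuple[int, int, int, int]   # (x1, y1, x2, y2) in screen pixels


elements = [
    UIElement("input", "Search", (40, 20, 520, 56)),
    UIElement("button", "Add to Cart", (600, 480, 720, 520)),
]

# Planning can now refer to an element by its label:
target = next(e for e in elements if e.label == "Add to Cart")
click_x = (target.bbox[0] + target.bbox[2]) // 2   # click the element's center
click_y = (target.bbox[1] + target.bbox[3]) // 2
```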

Example Demonstration: Efficient Data Organization and Archiving

For example, GLM-PC can search Xiaohongshu for content related to “AI Rankings” and extract the relevant text and images. It then writes its own code to store the company information in a newly created Excel file on the desktop and saves the post’s text in a designated Word document, organizing and archiving the user’s data efficiently.

Operation Command: Search for the first image post on Xiaohongshu regarding “New Energy Vehicle Rankings”, reference the image content and text content of the first post, obtain the information list from the image, and store it in a newly created Excel file on the desktop, while placing the post’s text content into a new Word document named new-energy on the desktop.

Agent of Agents: Left and Right Brain Collaboration

This collaborative model, inspired by the division of labor between the left and right brain, allows GLM-PC not only to handle complex logical tasks but also to exhibit greater adaptability, creativity, and generalization on open-ended problems, particularly in loop-based task processing, multi-step reasoning, and long-chain task management. Through dynamic optimization and situational awareness, it can also help users explore more efficient solutions.
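
One way to picture this “agent of agents” is as a dispatcher that routes each step to the right specialist and lets them share intermediate results. Again, the interfaces here are assumptions for illustration only:

```python
def collaborate(task, left_brain, right_brain):
    """Route each planned step to the logic specialist or the perception
    specialist; a shared context carries results along long task chains."""
    context = {}
    for step in left_brain.plan(task):          # logical decomposition
        if step.needs_gui:                      # hypothetical attribute
            context[step.name] = right_brain.perceive_and_act(step, context)
        else:
            context[step.name] = left_brain.compute(step, context)
    return context
```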

Example Demonstration: CET-6 Vocabulary Learning Assistant

GLM-PC can automatically extract CET-6 vocabulary from specified websites, create sentences based on these words, and automatically save the vocabulary and their sentences into a newly created Word document named “CET-6 Vocabulary Learning”.

Operation Command: Find 3 words from the CET-6 vocabulary list at “https://www.dxsbb.com/news/277.html”, create a sentence for each word, and paste the words and their corresponding sentences into a new Word document named “CET-6 Vocabulary Learning”.

Example Demonstration: Personalized WeChat Greetings and Bulk Sending of Spring Festival Images

GLM-PC can automatically compose personalized New Year greetings and celebratory images/videos for friends in a WeChat group, and send them to everyone efficiently in one click.

Operation Command: Reference the member list of the “GGG” group on WeChat, send each person a New Year greeting for 2025 and a snake-themed image.

Example Demonstration: Intelligent Flight Inquiry and Schedule Arrangement

GLM-PC can quickly query flight information for users, filter the most economical tickets, and simultaneously set reminders in Feishu calendar, achieving one-stop service from flight inquiry and ticket selection to schedule arrangement.

Operation Command: Help me find the cheapest ticket from Shanghai to Beijing on January 21 on Ctrip; help me set a Feishu calendar reminder 6 hours before the flight, with the subject being “Departure to Airport”, lasting half an hour.

Example Demonstration: PDF Math Problem Extraction and Organization Process

GLM-PC can automatically open PDF files, extract specified content, and organize the information into a Word document.

Operation Command: Help me open the “Arrangements and Binomial Theorem Practice.pdf” file on the desktop, reference the first few math problems on the current interface, and place them into a newly created Word document on the desktop.

Collaboration

We are currently engaging in deep collaborative discussions with well-known PC manufacturers such as Lenovo and ASUS to jointly promote the innovation and development of AIPC (AI Personal Computer).

Logic drives execution; perception empowers decision-making. The AIPC is not just a computer: it is a new application of AI agents in personal computing, capable of giving users a more efficient and intelligent work and life experience.
