A Small Step for AutoGLM, A Giant Leap for Human-Machine Interaction

With just a voice command, AutoGLM can simulate human operations on a mobile phone to complete tasks. AI is evolving from a chatbot with only conversational capabilities into an autonomous agent with "hands, a brain, and eyes."

Written by | Edited by Shen Feifei
55 years ago, as he stepped onto the moon, Armstrong said a simple phrase: “That’s one small step for man, one giant leap for mankind.”
Over the past few decades, many have quoted this phrase to mark a historic moment. Today, we want to apply this phrase to Zhipu’s AutoGLM.
On November 29, at Zhipu Agent OpenDay, three products entered public or internal beta testing: the browser plugin AutoGLM Web, the computer agent model GLM-PC, and AutoGLM, which was teased over a month ago as Zhipu's first productized intelligent agent.
In just a month, AutoGLM's capabilities are no longer limited to ordering takeout or liking posts; it brings several new advancements:
AutoGLM can autonomously execute over 50 steps of long operations and can perform tasks across apps;
AutoGLM opens a new experience for fully automated internet browsing, supporting dozens of websites for autonomous operation;
The GLM-PC, which operates like a human on a computer, has started internal testing, exploring technology for a general agent based on visual multimodal models.
At the same time, AutoGLM has begun large-scale internal testing and will soon launch as a product for end users, announcing the start of a "10-billion-level app free Auto upgrade" plan.
01.
What Can AutoGLM Do?
At this point, many might wonder: What is AutoGLM?
Its name easily evokes autonomous driving: almost every car's control panel has an AUTO button, indicating that a feature or setting is in automatic mode.
As the name suggests, AutoGLM's scenario is controlling a mobile phone with AI: with just a voice command, AutoGLM can simulate human operations on the phone to complete tasks. AI is evolving from a chatbot with only conversational capabilities into an autonomous agent with "hands, a brain, and eyes."
To facilitate understanding, we conducted some preliminary tests:
For those who cannot watch the video, here are the four scenarios we tested: searching for weekend travel guides on Xiaohongshu, commenting on the latest content of a Xiaohongshu influencer, buying a box of Yungquan honey oranges on Pinduoduo, and booking a flight from Ningbo to Beijing.
To cut to the chase, AutoGLM accurately completed all tasks, and for the purchasing scenarios, we only needed to make the final payment. A slight shortcoming is that when faced with pop-ups or steps requiring manual confirmation, AutoGLM currently cannot handle them and requires human intervention to continue the process.
Due to time constraints, our testing did not delve too deeply; for upgrades to AutoGLM’s capabilities, refer to the information conveyed during Zhipu Agent OpenDay:
Long Tasks: Understanding long instructions and executing long tasks. For example, in a hot pot ingredient shopping case, AutoGLM autonomously executed 54 uninterrupted steps. Moreover, in these multi-step, repetitive tasks, AutoGLM was faster than manual operation.
Cross-App: AutoGLM supports task execution across apps. Users will get used to AI handling tasks automatically rather than switching back and forth between multiple apps. Currently, AutoGLM functions more like a scheduling layer between users and applications, which makes cross-app capability a crucial step.
Short Commands: AutoGLM supports custom phrases for long tasks. Today, you no longer need to tell AutoGLM, "Help me buy a cup of Luckin coffee, coconut latte, at the Wudaokou store, large, hot, with a little sugar"; you only need to say, "Order coffee" (see the sketch after this list).
Casual Mode: We all experience choice paralysis; AutoGLM can now help you make decisions. In casual mode, all steps are decided by AI, bringing a surprise akin to a blind box. Would you like to try the coffee flavor AI orders for you?
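To make the "short command" idea concrete, here is a minimal sketch under assumed, hypothetical names (this is not AutoGLM's real API): a user-defined alias table expands a short phrase into the full instruction, and a toy scheduling loop then turns that instruction into placeholder UI steps.

```python
# Hypothetical illustration only -- none of these names belong to AutoGLM.
# A "short command" is treated as a user-defined alias that expands into the
# full natural-language instruction before it reaches the agent.

SHORT_COMMANDS = {
    # alias -> full task instruction (configured once by the user)
    "Order coffee": (
        "Help me buy a cup of Luckin coffee, coconut latte, "
        "at the Wudaokou store, large, hot, with a little sugar"
    ),
}

def expand_command(utterance: str) -> str:
    """Replace a known alias with its full instruction; otherwise pass through."""
    return SHORT_COMMANDS.get(utterance.strip(), utterance)

def plan_steps(instruction: str) -> list[str]:
    """Placeholder for the model's planning; returns a fixed toy step list."""
    return [f"open the relevant app for: {instruction}",
            "search for the item", "add to cart", "wait for user to pay"]

def execute_step(step: str) -> None:
    """Placeholder for phone-control actions such as tap, type, or scroll."""
    print("executing:", step)

def run_task(utterance: str) -> None:
    """Toy scheduling loop: expand the alias, plan steps, execute them in order."""
    instruction = expand_command(utterance)
    for step in plan_steps(instruction):
        execute_step(step)

if __name__ == "__main__":
    run_task("Order coffee")
```

The point of the sketch is only the division of labor: the alias expansion is a cheap lookup, while planning and execution sit in a scheduling layer between the user and the apps, which is the role the article ascribes to AutoGLM.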
AutoGLM Web and GLM-PC offer similar capabilities for browser and desktop scenarios, respectively, including some functions that are not possible on smartphones.
For example, AutoGLM Web can understand user commands, automatically search within web pages, summarize multiple links, and even further generate arXiv daily reports, build GitHub repositories, and check in on Weibo topics, among other personalized functions.
Another example is issuing commands remotely from a phone: GLM-PC can autonomously complete computer operations, and it can be set to execute tasks at a future time as long as the computer is powered on.
Imagine: even if you’re goofing off, drinking coffee, or using the restroom, your computer is still working, without affecting work progress.
02.
Human-Machine Interaction Enters the AI Era
Of course, what impresses us is not just the capabilities AutoGLM has achieved, but its impact on human-machine interaction patterns: human-machine interaction based on natural language is already becoming reality.
In childhood, during “microcomputer classes,” teachers often reminded us: “You need to learn how to use a computer.”
The reason for the word “learn” is that operating a computer requires learning to use the keyboard and mouse, mastering input methods, and adapting to the complex interfaces of each application. To write programs, one must also start from scratch to learn a programming language. Although these tools are constantly advancing, collaboration between humans and machines remains a high-threshold endeavor, especially with professional software, where completing a task requires many steps filled with mechanical repetitive labor.
AutoGLM’s current functions are still basic, yet they mark the beginning of the evolution of human-machine interaction: leveraging the powerful capabilities of large models, with just a sentence, AI can automatically help us handle complex tasks, further lowering the threshold for human-machine collaboration.
It’s no longer about humans passively adapting to machines, but about machines understanding humans.
Attempts to break the deadlock in human-machine interaction are not limited to China's Zhipu: Apple's Apple Intelligence, Anthropic's Computer Use, Google's Jarvis, and OpenAI's upcoming Operator are all innovating in the same direction.
The question arises: how far are large models from reshaping the paradigm of human-machine interaction?
In the field of autonomous driving, there are L1 to L5 capability classifications; companies like OpenAI and Zhipu have proposed similar technical stages: L1 is language capability, L2 is logical (multimodal) capability, L3 is tool-usage capability, and L4 is self-learning capability, ultimately achieving human-like understanding of interfaces, task planning, tool use, and task completion.
“The bad news” is that the current capabilities of large models are still at an early stage. According to Zhipu CEO Zhang Peng, “Agents will greatly enhance L3 tool usage capability while opening exploration for L4 self-learning capability.”
“The good news” is that during Zhipu Agent OpenDay, companies like Honor, ASUS, Xiaopeng, Qualcomm, and Intel shared their practices and outlooks on smart terminals from different scenarios.
In other words, reshaping the paradigm of human-machine interaction with large models is not just a vision of large model companies, but a consensus across the industry chain, including terminal and chip manufacturers. As AutoGLM’s capabilities improve, it will be able to invoke more applications, adapt to more systems, and achieve increasingly complex coherent autonomous operations.
Another piece of information that should not be overlooked is that edge computing power is continuously improving. Zhipu has since launched models adapted for AI-native devices and an edge-cloud collaborative architecture built on the same model base, meaning that agents will not only transform the user experience at the application level but can also be rolled out across various smart devices: phone + AI, PC + AI, car + AI, and so on.
03.
Final Thoughts
When the concept of large models first became popular, some compared it to an “operating system.”
At least from the performance of AutoGLM, even if it only adds an intelligent scheduling layer between users and applications, it already resembles the prototype of GLM-OS (a general computing system centered around large models). If it can further achieve native human-machine interaction, it will fundamentally change the mode of human-machine interaction, allowing everyone to operate mobile phones, computers, cars, glasses, etc., using natural language.
What’s exciting is that the renowned research organization Gartner has already listed agentic AI as one of the top ten technology trends for 2025, predicting that by 2028, at least 15% of daily work decisions will be autonomously completed by agentic AI.
