A Large Model: 2300 Days with Engineers

When a phone’s sensors are combined with a large model, the assistant can help you manage your phone, improve your network connection, and make smart recommendations… Ask it, “Why is my phone so hot?” and instead of giving a text answer it clears memory and optimizes the battery.

In short, it will act like a real human assistant, skillfully using over 4.6 million applications on your phone. As one of its creators puts it, “It’s not about creating a large model with all capabilities, but teaching the large model how to use the hundreds or thousands of applications on your phone.”

However, the process of creating it is like climbing a mountain. They are still walking through dense forests, gazing at the snow-capped peaks, and looking for the path that sometimes appears and sometimes disappears.

What is it like to engage in such creation?

Author: Jin Zhong
Editor: Li Li

Finding “The Girl Laughing with a Trash Bag on a Rainy Day”

On the morning of November 1, 2023, at the Shenzhen International Convention and Exhibition Center, in a venue filled with thousands of people, everyone held their breath, waiting for an answer.

A gentleman on stage raised his phone. He wanted to find a photo: years ago, he and his wife had been caught in heavy rain while out, and neither had brought an umbrella. They had to ask a street cleaner by the roadside for two trash bags to cover themselves. His wife looked “messy” yet funny in the rain, and he snapped a photo. So much time had passed, though, that finding it by hand was too much trouble.

The gentleman typed in the chat box, “Find the photo of the girl laughing with a trash bag on a rainy day.” A second later, the photo appeared: his wife with a ponytail, wearing a black plastic bag, laughing on the bus. The photo preserved that moment of joy and love.

The gentleman continued to demonstrate. Another photo had been taken outdoors against a cluttered background. He typed in “Erase passersby,” and immediately all the bystanders disappeared. Zooming in showed that once the passersby vanished, the streets and walls behind them had been seamlessly restored.

The one executing these operations isn’t a person, but a smart assistant on the phone. Its name is “Lanxin Xiao V.”

He kept going, asking Lanxin Xiao V to read papers and summarize their main points, draft a social media post based on an image, draw a character relationship map for “Romance of the Three Kingdoms,” develop a marketing plan for Double Eleven, and create a meeting agenda from chat records… Within a second, the answers appeared.

This was the scene at the 2023 vivo Developer Conference.

While waiting for the answers from Xiao V, the engineers in the audience were both anxious and proud. Sitting in the front row was a programmer in his 40s, with a round face and a youthful appearance. His name is Zhou Wei, one of the creators of Lanxin Xiao V, who calls himself an “old coder.” His other identity is the Vice President of vivo and the head of the Global AI Research Institute.

Their pride is well-founded. What people see on the surface is Lanxin Xiao V searching for photos, organizing text, and conversing with people. Behind it stands a large model, along with a series of world-class challenges the technical team has overcome: how a machine can accurately understand complex human semantics, how it can reason, how it can respond precisely, and how it can converse as naturally as a human, taking in information and giving feedback. It is no longer just an app; you can think of it as a human-like assistant in your phone.

You input a command, and in a second you get an answer. Behind that second stand Zhou Wei and his more than 1,000 colleagues, who spent over 2,300 days building thousands of terabytes of data, continuously improving algorithms, publishing over 70 papers, and applying for more than 700 patents. All of it led to that brief moment.

In 2023, large models are the hottest topic in the technology field. Artificial intelligence has developed for many years, but large models represent a revolutionary change. It means humanity can finally abstract thousands of years of civilization into knowledge that can be accessed at any time, usable by everyone. Beyond knowledge, more importantly, it is gradually approaching humanity, possessing human logic, emotions, and values.

Humans are born thinkers, and after thinking comes decision-making. But how this process occurs is a “black box,” perhaps one of the most challenging problems in the world. For centuries, the brightest minds have been battling this question.

As early as the 17th century, Descartes pondered how humans make decisions and how the mind controls behavior. In 1958, “The Computer and the Brain” by John von Neumann, known as the “father of computers,” was published, attempting to provide an answer. His exploration of this topic spanned the latter half of his life.

Zhou Wei and his companions are also part of this endeavor. They have experienced great adventures and today present their answers.

Ambition

This is a long story, perhaps starting from 2018.

Over five years ago, in March 2018, in Wuzhen, Zhejiang, a writer for “People” interviewed Zhou Wei. His position then was head of artificial intelligence at vivo. He had worked at vivo for over a decade, developing mobile operating systems and smart devices. He had just taken up the post, and his first task was to build vivo’s artificial intelligence team. The company’s support was evident in one detail: he was given a hiring quota of 1,000 people.

Shortly before that, the world had already been shaken: in Wuzhen, Ke Jie, the world’s number one Go player, lost to the AI program AlphaGo. The young player had at one point hidden behind a billboard and cried during the match until the chief referee found him. The match was momentous; it changed humanity’s perception of technology, the future, and its own identity, and announced the arrival of the AI era.

In the same year as that match, Google’s research team published a paper titled “Attention Is All You Need,” introducing a new model called the Transformer. Previous models could only learn from small-scale data, while this model had powerful learning capabilities in language, “capable of encoding vast knowledge.”

Ordinary people could only vaguely see the changes occurring in the world, but as industry insiders, Zhou Wei and his colleagues knew that this change was fundamental, and they had to participate.

The founder and president of vivo, Shen Wei, called all executives together to watch Spike Jonze’s film “Her,” in which a lonely man falls in love with his operating system, Samantha. In the following year, Zhou Wei and his colleagues visited top universities at home and abroad, recruiting talent field by field, in areas such as machine vision and semantic understanding, and developed an AI assistant named Jovi. The name means “Enjoy vivo’s AI,” as he hoped vivo users would enjoy it.

The first meeting between “People” and Zhou Wei happened when Jovi was launched. It was also his first media interview in his career, filled with nervousness and youthful vigor. At that time, we discussed that Jovi was just getting started and was far from true general artificial intelligence, but Zhou Wei was confident, “In half a year, a year, it will have a brand new look.”

In Hong Kong, just a border away from Shenzhen, Yang Su submitted his resignation to the Hong Kong Polytechnic University, where he worked, deciding to join vivo. His research direction was intelligent perception of time and space. When we met in a café on the streets of Shenzhen, he was wearing a T-shirt, with short hair close to his scalp and a bit of stubble. He spoke quickly, with calmness and caution, but also with undeniable enthusiasm.

That was a pivotal moment in his career. Yang Su still remembers that when the full-screen trend emerged, everyone was looking forward to better solutions, while vivo took a different route, launching a full-screen phone with a pop-up camera. Curious, Yang Su made a special trip to the store, where he saw a tiny phone whose camera popped up with a mechanical sound. “You would feel, wow, very charming,” he said. “There’s such an ambitious and innovative company.”

Career choices cannot be made lightly based on just one camera; more importantly, he had an overall judgment about the industry. He knew that the era of artificial intelligence had arrived, but AI could only truly understand users if it was embodied in a phone—phones have a dozen sensors and follow users 24 hours a day; “AI must bring real value and be an assistant, and only phone manufacturers can do it well.”

In Beijing, Chen Jie’an, a recent PhD graduate from Tsinghua University, also began his work at vivo. He is an expert in the field of data, and when he chose his career direction around 2018, he also realized that working at a mobile company would face hundreds of millions of users, providing a larger space for “bringing different increments.” Many other search experts from major internet companies joined him at the same time.

With the personnel in place, everyone began to work. The field was wide open, but wide open can also mean empty; they had to start from scratch.

Xiao Fangxu, the general manager of vivo’s artificial intelligence department, remembers that when they first formed the AI team, they didn’t have a clear idea of what to do and were exploring continuously. They trained AI to play Honor of Kings, play Go, and play Gomoku, with some explorations based on the personal interests of engineers, wanting to see what sparks could fly.

The knowledge graph team was also one of the first teams to start work. Everyone knew that the three elements of artificial intelligence are data, algorithms, and computing power, and that data is fundamental. Building a database is hard, foundational work. Chen Jie’an and his colleagues collected massive amounts of Chinese internet data, cleaned it, and structured it into knowledge, ultimately forming what is called a knowledge graph. This took time and resources, testing people’s confidence and patience.

For this group of engineers, it was a hopeful time filled with ambition.

They dreamed of creating an AI phone to achieve “three full and three self”: full scene, full connection, full interaction, self-learning, self-indexing, self-suggestion. This is a beautiful ideal that remains relevant today.

Gaining Strength

But soon, these brilliant minds realized that the journey from dream to reality was longer than expected. As Zhou Wei put it, “Full of passion, but actually hitting a wall.”

What does hitting a wall mean? Simply put, the technology at that time could not let Jovi converse like a human. It could handle only very simple exchanges: it could not understand context or complex language, and it could not parse a sentence containing two commands. “Users expect the intelligence to at least roughly resemble a person, or a ten-year-old,” but at that time this was impossible to achieve.

This was also a common dilemma in the AI industry during those years. In 2018, a tech journalist tested several smart assistants on the market with a seemingly simple request: “Recommend a restaurant, but not Japanese food.” The result was that all the assistants recommended Japanese restaurants. The word “not” was universally ignored by them.

Many people instinctively think that since an AI can defeat world champions, creating an AI that handles human daily affairs should be a piece of cake. But the reality is just the opposite: “We can create AI that beats Ke Jie in Go, but we cannot create AI that can manage Ke Jie’s daily life.”

At its core, defeating Ke Jie only requires the AI to learn Go, which is knowledge accumulation and rule matching. However, conversing naturally with humans requires understanding semantics, context, and logic, involving deep communication understanding and complex inquiries and guidance, which is a “black box” that humanity has yet to unlock.

Yang Su faced similar challenges in his field. He originally planned to provide convenient services based on users’ spatial and temporal locations. For instance, when users commute by subway, where to get on and off, the phone could predict in advance and bring up the subway ticket code. However, at that time, the technology could only do this: capture when users were near the subway and immediately present the ticket code. So often, colleagues would ask him, “I went downstairs for lunch at noon and passed by the subway station; why did it still push the ticket code?” That was because the phone had not evolved to understand human life—they were still at work, and they couldn’t take the subway home at noon.
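The gap Yang Su describes can be made concrete with a minimal sketch. The first rule below is the location-only trigger from the anecdote; the second adds one piece of context, the time of day. The commute windows and function names are hypothetical, and a fixed time window is only a stand-in for whatever learned model of a user’s routine a real system would use.

```python
from datetime import time

# Assumed commute windows for illustration; a real system would learn
# these per user instead of hard-coding them.
COMMUTE_WINDOWS = [
    (time(7, 0), time(10, 0)),    # morning commute
    (time(17, 30), time(21, 0)),  # evening commute
]


def naive_should_show_code(near_station: bool) -> bool:
    """Location-only rule: show the ticket code whenever a station is nearby."""
    return near_station


def context_aware_should_show_code(near_station: bool, now: time) -> bool:
    """Show the ticket code only near a station AND inside a commute window."""
    in_window = any(start <= now <= end for start, end in COMMUTE_WINDOWS)
    return near_station and in_window


lunch = time(12, 15)
print(naive_should_show_code(True))                 # True: the unwanted push
print(context_aware_should_show_code(True, lunch))  # False: lunch walk ignored
```

Under these assumed windows, the lunchtime walk past the station no longer triggers the ticket code, while a morning or evening pass still does.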

Seeing the limitations, what should they do? This is a mobile company that has survived for nearly 30 years in fierce competition, and pragmatism is its unchanging essence.

Zhou Wei and his colleagues quickly decided that, at the very least, they could keep strengthening their artificial intelligence technology and apply it across the phone to give users the best possible experience. Vision technology could improve photography, for example, and voice recognition could assist the hearing-impaired.

Zhang Cheng, head of vivo’s AI algorithms, has done a lot for accessibility. Two or three years ago, he saw a news report about someone who repeatedly called the police emergency line but could only make indistinct sounds. The police later traced the caller and found a hearing-impaired person who was lost.

This is just one of countless troubles in the lives of hearing-impaired individuals. In families where both parents are hearing-impaired, when a child falls from bed and cries, the parents cannot hear. Hearing-impaired delivery workers find their jobs very inconvenient. During the team’s visits, a severely hearing-impaired girl showed them a sentence typed on her phone, “I am powerless to communicate with able-bodied people.”

These stories touched them, and they were the ones with the tools—through AI sound detection algorithms, they could recognize surrounding sounds, such as a child’s cry, the doorbell, and alarm sounds, turning those sounds into information and pushing it to those who cannot hear.

In this process, the two sides came to understand each other better. Zhou Wei mentioned that they later found that although algorithms could help hearing-impaired users “hear,” those users still preferred sign language because it is more natural and efficient. So the team wondered whether they could create a sign language solution. There was none globally, so they built one themselves.

The engineers learned sign language and taught machines to recognize gestures and the meaning of connected sequences of gestures. At this point, Zhou Wei quoted a famous saying by former South African President Nelson Mandela: “If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.”

Yang Su and his colleagues also began to solve user pain points related to time and space. The first is the network: a poor connection is often the result of the phone attaching to a bad base station, so they built a solution that lets the phone intelligently choose the optimal one. The second is the so-called “subway black hole,” a stretch of track with no network at all, which is painful. If a user travels the route often, the phone gradually recognizes it and eventually tells apps to load more content before the train enters the “black hole.” The third arises in flight, when the phone searches desperately for a network and drains the battery. They made the phone recognize this scenario, disabling the network during takeoff and quickly restoring it during descent. This, too, has been achieved today.

These improvements are invisible to users; “You do it, and no one will applaud, but if you don’t do it, users will suffer greatly.” Behind this, they did a lot of work, studying users’ habits, locations, preferences, scenarios, and usage states.

Far away in Beijing, Chen Jie’an and his colleagues in the vivo knowledge graph team have been collecting data for five years. A huge team of experts, countless roaring machines, and a crawling system that never stops have cleaned, screened, and organized information from across the Chinese internet, updating it monthly. To date, they have accumulated over 2,000 terabytes of data and cleaned out 15 terabytes for model training. At that scale, the numbers are hard to picture; 15 terabytes is equivalent to over 20 million copies of “Romance of the Three Kingdoms,” or 2.5 national libraries…

The work was exhausting at the time, but events would prove that none of the effort was wasted.

All In

Zhou Wei experienced the moment of true transformation in his study at home.

About a year ago, Zhou Wei was writing code at home. He is an old coder, managing thousands of people; writing code is no longer work but a weekend pastime. At this point, his tone became lively, “On Saturdays and Sundays, the happiest thing for me is, wow! Tonight I can code all night long.”

His colleagues did not know that he set up servers by himself and dived into “crazy algorithm updates” during long holidays. Kaggle is a top global machine learning competition site where programmers worldwide compete, completing the same tasks and competing for rankings. The better the algorithm, the higher the ranking, which they call “climbing the ladder.” Zhou Wei also participates. In his view, writing code is still the most enjoyable thing; programs can turn desired ideas into reality, and winning in competitions provides the most direct feedback and the strongest dopamine stimulation.

Last winter, when ChatGPT was launched, he began using ChatGPT and a large model from GitHub (the world’s largest code hosting platform) to write code. When he entered a command to write a framework, “it just popped out all at once,” and upon seeing that code, Zhou Wei was surprised, “At that moment, I felt it was like a super strong team.” Previously, there were some algorithms he lacked confidence in completing, or they would take months, but with the large model, his program has been running for months, “a hundredfold increase in productivity.”

The emergence of ChatGPT also caused a stir globally. In just two months after its launch, over 100 million people used it. More than four months later, GPT-4 was released, considered an early version of general artificial intelligence. Users can ask it questions across all fields, engaging in countless rounds of Q&A, experiencing the feel of conversing with a human.

Zhou Wei and his colleagues realized that if the Jovi of 2018 was their beautiful imagination, then this time, the moment had truly arrived.

At the beginning of 2023, everyone closed the door and held high-intensity discussions for two months, ultimately reaching a consensus internally: ChatGPT is like the steam engine, a revolutionary change, a tool that brings immense productivity improvements, and they must invest in it. Together, they visited top large model teams in Beijing and Hangzhou, gaining more confidence that their thinking and technology were not inferior to others.

Another question that needed answering was why, with so many large models already available, vivo should create its own. Some foreign large models, including OpenAI’s, are not open-source, some are not suitable for domestic conditions, and some do not fit vivo’s products. Domestic large models were not yet mature, and cost was also a significant challenge. After much thought, they concluded they had no choice but to create their own. Moreover, they decided to go “all in.”

I asked Zhou Wei what “all in” meant. He said, “The only and complete choice strategically.” Over 1,000 employees were directly shifted to the large model direction, and less important tasks were all halted.

The previous accumulation of strength erupted at this moment—the thousands of terabytes of data and knowledge graphs they had accumulated over five years form the foundation of the large model; the latest algorithms they have closely tracked over five years, dozens of top conference papers published, and over 700 patents provide technical support. If the large model is the brain, then their accumulations in images, sound, sensors, and so on are what give the phone its limbs.

The large model they created is essentially a “dictionary” compressing thousands of years of human knowledge, familiar with human history, culture, and civilization. No matter what question you ask, it will provide an answer. Its preciousness lies in its possession of human-like logic, emotions, and values; it can understand language, has logical reasoning capabilities, and can generate expressions. Compared to Jovi from 2018, the current Lanxin Xiao V is more like a truly “intelligent assistant.”

User demands are complex and changeable, and the models must change with them. Zhou Wei described their design in three tiers. The simplest needs, like asking about the weather or having Xiao V summarize a document, are handled by a billion-parameter large model that runs on the phone without going to the cloud, quickly and safely. For more complex scenarios, such as multi-turn conversations in which users book tickets, check the weather, and manage itineraries, they developed a 7 billion-parameter large model. For still more complex tasks, such as solving math and physics problems, writing code, or answering specialized questions in law and medicine, they went on to create large models of 70 billion, 130 billion, and 175 billion parameters.

This summer, they entered the early-stage large model in public rankings, and soon it achieved first place on the Chinese large model evaluation list C-Eval.
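The tiered design described above can be pictured as a simple dispatch rule. The sketch below is illustrative only: the task labels, the `pick_tier` function, and the fallback for unknown tasks are assumptions made for this example, not vivo’s actual routing logic. Only the grouping of example tasks and the parameter sizes come from Zhou Wei’s description.

```python
from enum import Enum


class Tier(Enum):
    ON_DEVICE_1B = "about 1 billion parameters, runs on the phone"
    MID_7B = "7 billion parameters"
    LARGE_70B_PLUS = "70, 130, or 175 billion parameters"


# Hypothetical task labels, grouped as in the article's examples.
SIMPLE = {"weather", "summarize_document"}
MULTI_STEP = {"book_tickets", "manage_itinerary"}
SPECIALIZED = {"math", "physics", "write_code", "law", "medicine"}


def pick_tier(task: str, turns: int = 1) -> Tier:
    """Choose a model tier from a task label and the number of dialogue turns."""
    if task in SPECIALIZED:
        return Tier.LARGE_70B_PLUS
    if task in MULTI_STEP or turns > 1:
        return Tier.MID_7B
    if task in SIMPLE:
        return Tier.ON_DEVICE_1B
    # Assumption: unknown tasks fall back to the middle tier.
    return Tier.MID_7B


print(pick_tier("weather").name)           # ON_DEVICE_1B
print(pick_tier("weather", turns=3).name)  # MID_7B
print(pick_tier("law").name)               # LARGE_70B_PLUS
```

Note that the same request, such as a weather query, lands on a different tier once it becomes part of a multi-turn conversation, which mirrors the article’s example of checking the weather while booking tickets.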

The C-Eval list is considered the most authoritative Chinese large model evaluation list, covering humanities, social sciences, and natural sciences, testing knowledge and reasoning abilities, with all questions processed and manually cleaned.

Seeing this ranking, everyone felt a weight lifted from their hearts.

The Road Ahead

Of course, this is not a completely idealistic article; it is not describing a smooth fairy tale. In fact, during the interview, Zhou Wei spent half the time discussing that the large model is still not perfect and has some minor flaws.

The vivo large model we see today can accomplish many functions in daily life, such as imparting knowledge and managing the phone. According to background data, the most common scenarios in which people now use the Xiao V assistant are processing photos, writing poetry, drawing, and casual chatting. These scenarios have become mature.

However, as a version 1.0 large model, it also faces a dilemma common to the industry, the so-called “hallucination of large models.” The term covers two issues: first, the model’s logical thinking ability is still not strong, and second, it sometimes “talks nonsense seriously,” and “it does not know what it does not know.”

Logical thinking ability can be illustrated with a simple example, the “chicken and rabbit in the same cage” problem. A middle school student can solve it, but a machine may not necessarily do well; the key lies in whether the chain of thought is complete. The problem extends to all kinds of situations in life. Suppose there is a potted plant by the office window with an automatic watering bucket next to it; if the bucket breaks, what will happen? Humans understand that a broken bucket cannot water the plant, and the plant will die of thirst. Or suppose there are three candies of different sizes in a pocket and the pocket has a hole; what will happen? Humans will first ask how big the hole is: if it is small, perhaps only a small candy will drop out; if it is big enough, all three will fall out.

Enabling machines to work through this kind of logical reasoning is where Zhou Wei feels there is “huge room for improvement”; internally, it is the “number one problem.”
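To show what a complete chain of reasoning looks like for the chicken-and-rabbit problem, here is a small worked sketch. The article gives no numbers, so the classic figures of 35 heads and 94 feet are assumed for illustration, and the function name is hypothetical.

```python
from typing import Tuple


def chickens_and_rabbits(heads: int, feet: int) -> Tuple[int, int]:
    """Return (chickens, rabbits) for the given head and foot counts."""
    # Step 1: pretend every animal is a chicken, which gives 2 * heads feet.
    # Step 2: each rabbit adds 2 feet beyond that, so the surplus counts rabbits.
    surplus = feet - 2 * heads
    if surplus < 0 or surplus % 2 != 0:
        raise ValueError("no whole-number solution")
    rabbits = surplus // 2
    # Step 3: the remaining heads belong to chickens.
    chickens = heads - rabbits
    if chickens < 0:
        raise ValueError("no whole-number solution")
    return chickens, rabbits


print(chickens_and_rabbits(35, 94))  # (23, 12)
```

Each step depends on the one before it; skipping any of them, which is what an incomplete chain of thought amounts to, produces a wrong count or none at all.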

The second issue is the “nonsense” problem. Human knowledge is so vast and rapidly changing; data is never new enough, complete enough, or deep enough. While updating data and enhancing search capabilities, they still handle certain topics with vivo’s pragmatic style: when users raise highly specialized questions, such as a specific medical condition and its symptoms, the large model will suggest seeking resources from professional medical institutions.

Zhou Wei discusses these issues candidly, but also with certainty and conviction: there is no doubt about the productivity gains brought by large models, and they are on the right path, cutting roads through the mountains and building bridges wherever they meet water.

His colleague Yang Su is responsible for a more advanced version 2.0, an intelligent agent based on the large model.

The core problem he is trying to solve is “context.” “Context” has many meanings; it can refer to the context of a conversation. For example, if a user makes a request, “Please continue the story from yesterday,” a series of questions lies behind it: what was yesterday’s story? Where did it leave off? Answering them requires storing and retrieving memories and understanding text. This is an ability humans are born with, and the team wants to give it to machines so that machines can understand humans; this process is crucial.

“Context” can also be understood as environmental perception. When we chat with ChatGPT, we input text, and it replies with text. But the advantage of a phone is that it has dozens of sensors. Many people are unaware that the phone’s front and rear cameras, GPS, Wi-Fi, gyroscope, accelerometer, and gravity sensors continuously perceive your state: whether you are walking, biking, or in a car, at work, on the subway, or outdoors. When you answer a call, the phone turns off the screen to prevent accidental touches.

When sensors are combined with a large model, the assistant will help you manage your phone, improve your network connection, and provide intelligent recommendations… Ask it, “Why is my phone so hot?” and instead of giving a text answer it clears memory and optimizes the battery.

In short, it will act like a real human assistant, skillfully using over 4.6 million applications on your phone. As Zhou Wei said, “It’s not about creating a large model with all capabilities, but teaching the large model how to use the hundreds or thousands of applications on your phone.”

However, the process of creating it is like climbing a mountain; they are still walking through dense forests, gazing at the snow-capped peaks, and looking for the path that sometimes appears and sometimes disappears.

What is it like to engage in such creation?

Yang Su’s feelings are quite complex. He said that at first, he was afraid the company would be unwilling to pursue it; he felt he had to participate, which was an instinctive sense of mission as a technical person. But once they truly began, there was fear of the unknown, a desire for success, and a freshness that he hadn’t experienced in a long time…

As for Zhou Wei, he used to look forward to weekends and long holidays when he could write code and play games, finding complete relaxation. But now, he does none of that—because the challenges, stimulation, and joy brought by large models exceed anything else.

In the work group chat, he once reflected on this long road: after 2017 they began forming the AI team; in March 2018 Jovi was launched; and in the same year the operating system team was established. Six years on, he wrote, “Calculating, it has been over 2,300 days; life is less than 30,000 days.”

Essentially, this is a story about realizing dreams—a group of people found what they loved and dedicated their time and life to it.

Five years ago, in an interview, Zhou Wei mentioned an idea: when a family member of his fainted due to illness, it made him think, if a phone could detect a person’s heartbeat, heart rate, cough sounds, and snoring through its camera and microphone sensors, “it could completely remind you to seek medical help in advance.” At that time, this was an exciting future he envisioned.

Today, they have realized that future with their own hands. Based on the large model, they developed a “family health manager” feature that not only detects a person’s health data and sends medication reminders but also transmits abnormal data to family members in real time. Even when an elderly person’s phone accidentally installs malicious applications, children far away can see it and uninstall it remotely.

This is the large model they want—it is an assistant, a family member, occasionally healing, often helping, and always comforting.

(The names Yang Su, Chen Jie’an, and Zhang Cheng are pseudonyms.)
