He Xiaodong: How AI Understands Humans Through NLP Technology

Author | Zhang Li

Source | Lei Feng Network

In recent years, the development of deep learning has profoundly propelled artificial intelligence. The next major breakthrough in artificial intelligence lies in understanding natural language.

On June 23, the China Computer Federation held a seminar themed “Industrial Applications and Technological Development of Human-Machine Dialogue.” Dr. He Xiaodong, Executive Vice President of JD AI Research Institute, delivered a keynote report on “Breakthroughs in Natural Language Understanding Technology.”

In this report, Dr. He Xiaodong first briefly reviewed how deep learning has driven progress in speech, language, and vision, and then focused on two strands of cutting-edge research in natural language processing (NLP): how to enable AI to understand humans through NLP technology, such as understanding intent, parsing semantics, recognizing emotion, and powering search and recommendation; and how to make AI's output comprehensible and acceptable to humans, such as text summarization, content generation, topic expansion, and emotional dialogue. Finally, he discussed the latest research progress in frontier directions such as multimodal intelligence, long-text generation, emotional and stylistic expression, and human-machine dialogue.


Dr. He Xiaodong delivering a speech

In March of this year, Dr. He Xiaodong joined JD as Executive Vice President of JD AI Research Institute and Director of the Deep Learning and Speech and Language Laboratory. He has made significant contributions to deep learning, natural language processing, speech recognition, computer vision, and information retrieval; his work includes the Deep Structured Semantic Model (DSSM) and the image-captioning bot CaptionBot. Before joining JD, he worked at Microsoft Research Redmond as a Principal Researcher and Head of the Deep Learning Technology Center (DLTC). He received his bachelor's degree from Tsinghua University in 1996, a master's degree from the Chinese Academy of Sciences in 1999, and a Ph.D. from the University of Missouri-Columbia in 2003.

The following is the specific content of Dr. He Xiaodong’s report.

Development History of Deep Learning


The predecessor of deep learning, then simply called "neural networks," was popular in the 1980s. Expectations for neural networks peaked in the 1990s, but it became clear that they could not solve many problems; in some speech recognition tasks, for instance, neural networks performed no better than other statistical models.

In the 2000s, deep learning was not yet widely recognized. In 2008, my Microsoft colleague Li Deng and I organized a workshop at NIPS, inviting Geoff Hinton and others to present the latest advances in deep learning. It was not until around 2010 that deep neural network models began to make significant breakthroughs in large-scale speech recognition. From that point on, confidence in neural networks and deep learning was renewed, and with further work they achieved major breakthroughs in image recognition in 2012 and in machine translation in 2014 and 2015. Since then, neural network technology has had a growing impact across more and more AI fields.


Take speech recognition as an example. Before 2000, continuous improvements kept pushing the recognition error rate steadily downward.

Starting in 2000, however, the technology entered a bottleneck period. Although new techniques continued to appear each year, from 2000 to 2010 the error rate on large-scale test sets did not decrease significantly; technically, the decade was essentially stagnant.

Beginning in 2010, Geoff Hinton and Microsoft started applying deep learning to speech recognition research. By 2011, error rates on some large-scale speech recognition datasets had dropped by 20%-30%. As more researchers got involved, the error rate fell rapidly. On Switchboard, an important telephone-speech test set, Microsoft's recognition error rate reached about 5% last year, on par with a professional transcriber. It can therefore be said that, as of 2017, machines reached human parity on Switchboard.


Deep learning has made significant progress not only in speech but also in image recognition. Around 2009, Fei-Fei Li's team released the ImageNet dataset, and beginning in 2010 her team has run a challenge on it every year. In 2010 and 2011, the best systems had an error rate of around 25% on this dataset. In 2012, Hinton and his students entered with a deep convolutional neural network; CNNs were not their invention, but they scaled the architecture up dramatically and incorporated new techniques, cutting the error rate from 25% to 16%.

In 2015, my former colleague Sun Jian's team proposed a new model, ResNet, that pushed deep learning to a new height, deepening the network to 152 layers and reducing the error rate to 3.57%. Since humans also make mistakes, with a human error rate of about 5% on this task, a machine achieving 3.5% means that, from that point on, computers surpassed the average person at image recognition on this particular dataset.

Having seen clear breakthroughs in speech and image recognition, the next question is whether we can achieve deeper breakthroughs in natural language. Many higher animals also have strong visual and auditory capabilities, but language is an intelligence unique to humans. We therefore hope that one day computers, and artificial intelligence generally, can understand language as fully as humans do.

Cutting-edge Research in Natural Language Processing

Natural language processing can basically be divided into two directions:

1. AI Understanding Humans

Humans express all kinds of intents and emotions through text, in tasks such as intent recognition and search. We therefore need to enable AI to understand humans.

The first step in language understanding is slot value extraction.

If you say a sentence, the computer needs to understand the intent behind it and extract the slot values it contains. For example, if you are looking for a flight, the computer must identify the city and the time. In 2013, we collaborated with Yoshua Bengio to apply RNNs successfully to this problem for the first time.
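
To make this concrete, here is a minimal PyTorch sketch of an RNN slot tagger (not the original 2013 system): a bidirectional RNN reads the word sequence and emits one slot label per token. The vocabulary size, label set, and example sentence are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RNNSlotTagger(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_labels=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)  # one IOB slot label per token

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))  # (batch, seq_len, 2*hidden_dim)
        return self.out(h)                      # per-token label logits

# e.g. "book a flight to boston tomorrow" -> O O O O B-city B-date (hypothetical labels)
tagger = RNNSlotTagger()
logits = tagger(torch.randint(0, 10000, (1, 6)))  # (1, 6, 7)
labels = logits.argmax(-1)                        # predicted slot label per word
```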

Another task is how to classify intent.

Human speech is very complex. A passage may describe your opinion of a restaurant, but we want to extract your actual opinion from it, and why you hold it. That means that, within a passage, we need to know which sentences are more important and which are less so. We therefore designed a two-level attention-based neural network, the Hierarchical Attention Network. At the word level it extracts the most important information within each sentence, and at the sentence level it extracts the more important sentences; combining the two yields a complete representation of the passage.

The example passage conveys that the person likes the restaurant, and we can even highlight the important words and sentences: the deeper the highlight, the more important the item is for understanding the passage's intent. The model can thus explain both the overall meaning, that the person likes the restaurant, and why they like it. A minimal sketch of the architecture is shown below.
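
The sketch below follows the hierarchical attention idea just described, assuming GRU encoders and additive attention pooling; all dimensions and the two-class output are illustrative assumptions, not a published configuration. The returned attention weights are what allow the word- and sentence-level highlighting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Additive attention pooling: score each step, return weighted sum and weights."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))

    def forward(self, h):                        # h: (batch, steps, dim)
        u = torch.tanh(self.proj(h))             # (batch, steps, dim)
        a = F.softmax(u @ self.context, dim=1)   # importance weight per step
        return (a.unsqueeze(-1) * h).sum(1), a   # pooled vector, weights

class HierAttnNet(nn.Module):
    def __init__(self, vocab=10000, emb=100, hid=50, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.word_rnn = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.word_attn = Attention(2 * hid)
        self.sent_rnn = nn.GRU(2 * hid, hid, batch_first=True, bidirectional=True)
        self.sent_attn = Attention(2 * hid)
        self.out = nn.Linear(2 * hid, classes)

    def forward(self, docs):                     # docs: (batch, n_sents, n_words)
        b, s, w = docs.shape
        h, _ = self.word_rnn(self.embed(docs.view(b * s, w)))
        sent_vecs, word_w = self.word_attn(h)    # word_w: which words matter
        h, _ = self.sent_rnn(sent_vecs.view(b, s, -1))
        doc_vec, sent_w = self.sent_attn(h)      # sent_w: which sentences matter
        return self.out(doc_vec), word_w, sent_w # weights can be shown as highlights
```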

Semantic representation is a core issue in natural language understanding.

He Xiaodong: How AI Understands Humans Through NLP Technology

Natural language can vary infinitely, but it carries a semantic core. Semantic understanding is a very difficult problem, and we hope to design a deep neural network that extracts abstract semantic features from raw, or lightly processed, natural language. These features form a semantic space in which the meaning of each sentence maps to a point: sentences describing similar semantics land close together, even when their wording differs. We hope that, after training, the network can recognize that two such sentences sit near each other in the space.

Conversely, two sentences may overlap heavily in wording yet have completely different meanings; we also hope the network learns that such literally similar sentences mean entirely different things. This is a core problem in language understanding.

To solve this problem, in 2013 we proposed the Deep Structured Semantic Model (DSSM). It addresses exactly this issue. Consider "sports car," which might be rendered as "racing car" or, more literally, as "running car": although the literal overlap between "sports car" and "running car" may be higher, they are very different concepts, whereas "sports car" and "racing car" are close. After training, the model learns that "sports car" and "racing car" should have high similarity in the vector space, while the similarity between "sports car" and "running car" should be minimized, better separating these concepts in the semantic space.


Deep Structured Semantic Model (DSSM)

The training itself is quite involved. We do not care about the absolute values of the vectors, only their relative relationships, because only those relative relationships define semantics. Semantics is inherently an abstract notion: you may have seen an image or an object, but its meaning exists as a concept in the human mind and is expressed only through relative relations. We know that A and B are similar, so we train the model with a relative (contrastive) objective to obtain the semantic model.
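
A minimal sketch of this relative objective, in the spirit of DSSM: two towers map queries and documents into the semantic space, and a softmax over cosine similarities pushes each query toward its matching document and away from sampled negatives. The layer sizes and the letter-trigram input dimension below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Maps a bag-of-letter-trigrams vector to a point in the semantic space."""
    def __init__(self, in_dim=30000, sem_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 300), nn.Tanh(),
            nn.Linear(300, sem_dim), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)

def dssm_loss(query_vec, doc_vecs, gamma=10.0):
    """Relative objective: the positive doc (index 0) should score higher
    than the sampled negatives; only relative similarity matters."""
    sims = F.cosine_similarity(query_vec.unsqueeze(1), doc_vecs, dim=-1)  # (batch, 1+neg)
    target = torch.zeros(sims.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(gamma * sims, target)

query_tower, doc_tower = Tower(), Tower()
q = query_tower(torch.rand(8, 30000))    # 8 queries
d = doc_tower(torch.rand(8, 5, 30000))   # 1 positive + 4 sampled negatives each
loss = dssm_loss(q, d)
```

Note that only the ordering of the similarities affects the loss, which is exactly the "relative, not absolute" property described above.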

Another important issue is the knowledge graph.


Knowledge Graph Schematic

The points represent important objects and entities, and the edges describe the relationships between them. Obama, for example, has many relations: his birthplace is Hawaii, he is a Democrat, and the graph records his wife's and daughters' names. We often want to perform knowledge computations in a continuous space, measuring who is similar to whom and discovering relationships that were previously unknown. In 2015, I published a paper on representing knowledge graphs in a continuous space: each entity is represented by a semantic vector and each relation by a matrix. To check whether entities A and B stand in a particular relation M, we compute (vector of A) × (matrix of M) × (vector of B) and test whether the value is high, which amounts to a similarity measurement.


With a knowledge graph embedded in a continuous space, much more becomes possible. We can carry out inference directly in that space: knowing that Obama's birthplace is Hawaii and that Hawaii is in the United States, we can infer that his nationality is American. All of this can be computed within the knowledge space. By computing the distance between the matrices for the nationality and birthplace relations, if the distance is small enough, we can treat the two relations as equivalent.
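
Here is a minimal sketch of this bilinear formulation; the entity names, dimensions, and random initialization are illustrative stand-ins for trained embeddings.

```python
import torch

torch.manual_seed(0)
dim = 50
# Hypothetical embeddings: each entity is a vector, each relation a matrix.
entity = {name: torch.randn(dim) for name in ["Obama", "Hawaii", "USA"]}
relation = {name: torch.randn(dim, dim) for name in ["born_in", "nationality"]}

def score(head, rel, tail):
    """Bilinear score head^T * M_rel * tail; a high value suggests the triple holds."""
    return entity[head] @ relation[rel] @ entity[tail]

s = score("Obama", "born_in", "Hawaii")   # would be high after training

# Relation comparison: if two relation matrices are close (small distance),
# the relations behave near-equivalently in the semantic space.
dist = torch.norm(relation["born_in"] - relation["nationality"])
```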

With such a knowledge graph, many tasks become possible, such as knowledge-base question answering. For instance, "Who is Justin Bieber's sister?" can be answered through semantic parsing and search matching.

2. Enabling AI to Express in a Human-Understandable Way

This means enabling AI to generate content in ways humans can understand. We all know that AI writing poetry is no longer news; AI can also paint, summarize text, and even generate recipes. Our expectations for artificial intelligence keep rising: we hope it can not only understand what we say but also give us meaningful feedback.

How to apply reinforcement learning to natural language is also crucial. AlphaGo is the celebrated example of reinforcement learning, but we believe natural language understanding is a harder problem than playing Go: vast as the Go space is, the language space is larger still, and in fact infinite. This is especially true of the action space; in Go, every move is restricted to one of 361 points, whereas an utterance can vary without limit.


To solve this, we cannot score moves directly as AlphaGo does. In a language dialogue problem, each hypothesis H represents one sentence the machine could say, chosen from a space of expressions that varies infinitely. So rather than enumerating and selecting a specific H, we map every candidate H through a neural network into a semantic space and compute, in that space, which H to select at which moment, allowing the system to do deep reinforcement learning in a language environment.
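
A minimal sketch of this idea, assuming a simple bag-of-words encoder and an inner-product Q-value in the spirit of deep RL with a natural-language action space; all sizes and the candidate sets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextActionScorer(nn.Module):
    """Embed the dialogue state and every candidate utterance H into the same
    semantic space; the Q-value of saying H is the inner product of the two."""
    def __init__(self, vocab=10000, emb=100, hid=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, emb)        # simple bag-of-words encoder
        self.state_net = nn.Sequential(nn.Linear(emb, hid), nn.ReLU())
        self.action_net = nn.Sequential(nn.Linear(emb, hid), nn.ReLU())

    def forward(self, state_ids, action_ids):
        s = self.state_net(self.embed(state_ids))       # (1, hid)
        a = self.action_net(self.embed(action_ids))     # (n_candidates, hid)
        return a @ s.squeeze(0)                         # Q-value per candidate

scorer = TextActionScorer()
state = torch.randint(0, 10000, (1, 12))                # current dialogue context
candidates = torch.randint(0, 10000, (4, 8))            # 4 candidate replies
q_values = scorer(state, candidates)
best = q_values.argmax()                                # utterance to say next
```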

Human understanding differs from computer understanding; what a computer does often resembles keyword matching plus semantic analysis. Even so, many of the latest models have surpassed human-level performance on such benchmark datasets. We have also proposed new transfer-learning models, i.e., ways to carry a model trained in one domain over to another so that it quickly reaches a high level of performance.


Another major line of work is machine reading: we hope that after reading an article, AI can answer any question about it. To answer accurately, the computer must thoroughly understand the relationships within the text. Significant progress has been made here; Stanford recently released version 2.0 of its SQuAD dataset, and companies such as iFlytek, Google, Microsoft, and Alibaba have performed strongly in this area. A sketch of the extractive setup follows.
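
On SQuAD-style extractive tasks, one common formulation (a sketch, not any particular company's system) predicts the start and end positions of the answer span over a question-conditioned passage encoding; the hidden size and passage length below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpanReader(nn.Module):
    """Predict the start and end of the answer span over passage token encodings."""
    def __init__(self, hid=256):
        super().__init__()
        self.start = nn.Linear(hid, 1)
        self.end = nn.Linear(hid, 1)

    def forward(self, h):                     # h: (batch, passage_len, hid)
        return self.start(h).squeeze(-1), self.end(h).squeeze(-1)

reader = SpanReader()
h = torch.randn(1, 300, 256)                  # passage encoding from some encoder
start_logits, end_logits = reader(h)
answer = (start_logits.argmax(-1), end_logits.argmax(-1))  # predicted span bounds
```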

Next Breakthroughs


1. Multimodal Intelligence

Multimodal intelligence is inherently cross-disciplinary. Humans acquire intelligence from multiple sources, not through sight or sound alone.


We may know who Obama is and his background, but that knowledge is incomplete; seeing a picture tells us what Obama looks like, and visual information greatly supplements what language gives us. The same goes for audio: hearing Obama speak reveals his choice of words and deepens our understanding of him. Together, these modalities yield a deeper grasp of the whole body of knowledge. With deep learning models, we therefore hope to gradually extract the invariant semantic signals and concepts from inputs in different modalities and ultimately unify them in a single multimodal semantic space. In that space we can perform cross-modal tasks, such as reasoning between images and text, and even generate content in one modality from another. A minimal sketch of such a joint space follows.
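
Below is a minimal sketch of projecting two modalities into one semantic space; the feature dimensions (e.g., CNN image features and averaged word vectors) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpace(nn.Module):
    """Project image features and text features into one multimodal semantic
    space, so that cross-modal similarity can be computed directly."""
    def __init__(self, img_dim=2048, txt_dim=300, sem_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, sem_dim)
        self.txt_proj = nn.Linear(txt_dim, sem_dim)

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

model = JointSpace()
img = torch.randn(1, 2048)                 # e.g. CNN image features
txt = torch.randn(1, 300)                  # e.g. averaged word vectors
v, t = model(img, txt)
similarity = (v * t).sum(-1)               # high if the text matches the image
```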

2. Creating Complex Content


Another recent piece of work is poetry generation, which is also content creation. Suppose a scientist enters keywords about his daughter; the computer infers what emotion he wants to express and generates a poem that expresses it.

A core issue remains unsolved: the logic of writing. This calls for a model in which the theme and sub-theme structure can unfold and be reflected explicitly, ultimately producing a logically coherent and meaningful article.

3. Emotional Intelligence


Consider generating emotionally resonant dialogue. Suppose a woman buys a T-shirt and posts a photo of it on social media. The computer might describe the photo as "a lady wearing a blue T-shirt," but her real reason for posting may simply be to show off her new purchase. We hope the computer understands that emotional need: it should focus on how good the picture looks rather than merely describe her clothing and actions. In short, we want computers to do better at reading users' emotions and demands, and so understand users more deeply.

4. Multi-turn Human-Machine Dialogue


Ultimately, intelligent technology returns to one question: what is AI? In the 1950s, Turing proposed the Turing test: if a computer can converse extensively with a human, and after a long exchange the human cannot tell whether the other party is a person or a machine, then the computer is considered intelligent. In other words, Turing regarded language and dialogue as the mark of advanced intelligence; a dialogue system that passes the Turing test would truly possess intelligence.

Since then, generations of scientists have pursued this line of research, and over the past 50 years many dialogue systems have been built, spanning everything from acoustic modeling and speech recognition to semantic understanding.

Recently, at JD AI Research Institute, we developed an emotion-aware customer service robot. Emotion is one of the most important aspects of customer service, demanding precise emotional understanding: the agent must empathize with users while having sufficient conversational skill and adhering to appropriate social values.


For example, if a customer calls asking why their package hasn't arrived, we hope the AI model can accurately sense that the customer is angry. The robot first offers consolation and an apology to calm them, then asks for specifics: "What exactly happened?" If the customer responds with "yesterday," the robot checks and finds that the package has in fact been delivered, and informs them: "The system shows that the package has arrived." At this point the customer's emotion shifts from anger to anxiety, and the robot quickly picks up on the change, telling them, "Don't worry; we have insurance, so please rest assured."

Next it says, "Your neighbor signed for it; it hasn't been lost, so you can rest assured." The customer's emotion may change again; feeling relieved, they might say, "That's a relief, thank you." The robot detects the happy emotion, wishes the customer well, and the issue is resolved.

This emotion-aware dialogue robot is now live at JD and has provided online service for nearly 1 million customer inquiries.
