In 1950, Alan Turing published a paper titled "Computing Machinery and Intelligence", proposing the famous Turing Test. The test involves the automatic interpretation and generation of natural language, and it is often taken as the starting point for the development of Natural Language Processing (NLP).
Natural Language Processing is a field of study within computer science and artificial intelligence (AI) that focuses on processing natural languages such as English or Mandarin. This processing typically involves converting natural language into data (numbers) that computers can use to learn about the world. That understanding of the world is sometimes used, in turn, to produce natural language text that reflects it (natural language generation).
Language was invented to facilitate communication, and it is the foundation on which humans build consensus. Programmers working hard on natural language processing technology today share a common goal: to enable computers to understand human language.
The Charm of NLP – Creating Machines That Can Communicate
Machines have been processing language since computers were invented. However, these "formal" languages (such as the early programming languages Ada, COBOL, and Fortran) were designed to admit only one correct interpretation (or compilation).
Currently, Wikipedia lists over 700 programming languages. Ethnologue, by contrast, identifies roughly ten times as many natural languages in use by people around the world. Google's index of natural language documents is well over 100 billion gigabytes, and that is just the index; the actual amount of natural language content online certainly exceeds 1 trillion gigabytes, and even those indexed documents do not cover the entire internet.
The term "natural language" uses "natural" in the same sense as "natural world": things that evolved naturally, as opposed to the mechanical, artificial things designed and built by humans. Being able to design and build software that can read and process language, including the very language you are reading right now about how to build software that processes natural language, is a remarkable, almost magical capability.
Initially, search engines like Google required a few tricks to find what we were looking for, but they soon became smarter, accepting an ever wider range of search phrasings. Then the text auto-completion feature on smartphones began to advance, with the middle suggestion often being exactly the word we wanted. This is the charm of natural language processing: enabling machines to understand our intent.
More and more entertainment, advertising, and financial reporting content can be generated without human intervention. NLP bots can write entire movie scripts. Video games and virtual worlds often feature bots that converse with us, sometimes even about bots and artificial intelligence themselves. This "play within a play" will generate more metadata about movies, and bots in the real world will then write reviews based on it to help us decide which movie to watch.
With the development of NLP technology, the flow of information and computing power continues to increase. Now, we can simply input a few characters in the search bar to retrieve the precise information needed to complete a task. The first few auto-completion options provided by the search are often so appropriate that they make us feel as if someone is helping us search.
Basic Knowledge to Get Started with NLP
1. Regular Expressions
Regular expressions use a special class of formal language grammar called a regular grammar. Regular grammars have predictable, provable behavior, yet they are flexible enough to power some of the most sophisticated dialogue engines and chatbots on the market. Amazon Alexa and Google Now are both mainly pattern-based dialogue engines that rely on regular grammars. A complex regular-grammar rule can often be expressed in a single line of code called a regular expression. There are successful chatbot frameworks in Python, such as Will, that rely entirely on this kind of language to produce useful and interesting behavior. Amazon Echo, Google Home, and similar complex yet useful assistants also use it to encode the logic for most of their user interactions.
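To make the idea of pattern-based dialogue concrete, here is a minimal sketch of a single regex-driven chatbot rule in Python. The pattern, the `respond` function, and the replies are illustrative assumptions for this example, not the rules of any real assistant or framework:

```python
import re

# One pattern-based dialogue rule: match a greeting word, optionally
# followed by a name. Case-insensitive, with word boundaries so that
# "hi" inside another word does not trigger the rule.
greeting = re.compile(r"\b(hi|hello|hey)\b[ ,!]*([a-z]*)", re.IGNORECASE)

def respond(utterance):
    """Reply to a greeting if the regular expression matches."""
    match = greeting.search(utterance)
    if match:
        name = match.group(2)
        return "Hello" + ((" " + name) if name else "") + "!"
    return "Sorry, I didn't understand that."

print(respond("Hey Rosa"))           # -> Hello Rosa!
print(respond("What time is it?"))   # -> Sorry, I didn't understand that.
```

A real pattern-based engine is essentially a large, carefully ordered collection of rules like this one, which is why a single line of regular expression can carry so much of the interaction logic.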
2. Word Order and Grammar
Word order is important. The rules that control the order of words in sequences (such as sentences) are called the grammar of the language. This is exactly the information discarded by the earlier bag-of-words and word vector examples. Fortunately, for most short phrases, and even many complete sentences, those word vector approximations work well enough. If the goal is simply to encode the general meaning and sentiment of a short sentence, word order is not very important. Let's look at all the word-order permutations of the example "Good morning Rosa":
>>> from itertools import permutations
>>> [" ".join(combo) for combo in permutations("Good morning Rosa".split(), 3)]
['Good morning Rosa', 'Good Rosa morning', 'morning Good Rosa',
 'morning Rosa Good', 'Rosa Good morning', 'Rosa morning Good']
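A bag-of-words representation discards precisely this ordering information: every permutation of the same words maps to the same representation. A minimal sketch using the standard library `collections.Counter` (this particular implementation choice is illustrative, not from the text):

```python
from collections import Counter

# A bag of words is just a multiset of tokens, so word order is lost:
# all permutations of the same words produce identical counts.
bow1 = Counter("Good morning Rosa".split())
bow2 = Counter("Rosa morning Good".split())
print(bow1 == bow2)  # -> True
```

This is why all six permutations above would look the same to a bag-of-words model, and why such models can still capture the general meaning of short phrases where order matters little.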