Selected from Medium
Author: Adam Geitgey
Translated by Machine Heart
Contributors: Weng Junjian, Liu Xiaokun
Computers are much better at working with structured data, so asking them to understand human language, which is shaped largely by cultural habit, is a real challenge. How, then, has natural language processing succeeded? By structuring human language as much as possible. This article walks through each stage of the natural language processing pipeline with simple examples, showing how language is progressively structured, from sentence segmentation and tokenization all the way through to co-reference resolution. The author’s explanations are intuitive and easy to follow, making this a rare gem for anyone new to NLP.
How do computers understand human language?
Computers excel at using structured data, such as spreadsheets and database tables. However, we humans usually communicate in words rather than using spreadsheets. This is not a good situation for computers.
Unfortunately, throughout history, we have never lived in a world filled with structured data.
Much of the information in the world is unstructured—such as raw text in English or other human languages. So how can we make computers understand unstructured text and extract data from it?
Natural language processing, or NLP for short, is a subfield of AI that focuses on enabling computers to understand and process human language. Next, let’s see how NLP works and learn how to use Python programming to extract information from raw text.
Note: If you are not interested in how NLP works and just want to copy and paste some code, feel free to skip to the section “Implementing NLP Pipeline in Python”.
Can computers understand language?
As long as computers have existed, programmers have been trying to write programs that can understand languages like English. The reason is obvious—humans have written information for thousands of years, and it would be very helpful if computers could read and understand all this data.
Although computers cannot truly understand English like humans do, they can already do many things! In certain specific domains, you can use NLP techniques to perform seemingly magical tasks and apply NLP techniques in your projects to save a lot of time.
Conveniently, the latest advancements in NLP technology can be accessed through open-source Python libraries (such as spaCy, textacy, neuralcoref, etc.), requiring just a few lines of Python code to implement NLP techniques.
Extracting meaning from text is not easy
The process of reading and understanding English is very complex, and this process does not even take into account that English sometimes does not follow logical and consistent rules. For example, what does this news headline mean?
“Environmental regulators grill business owner over illegal coal fires.”
Do regulators question the business owner about illegal coal burning? Or are they actually grilling the business owner with coal? As you can see, parsing English with a computer can become very complex.
Doing anything complex in machine learning usually means building a pipeline. The idea is to break your problem down into very small parts and then use machine learning to solve each part separately, ultimately connecting several machine learning models that feed results to each other, allowing you to solve very complex problems.
This is exactly the strategy we will apply in NLP. We will break down the process of understanding English into small chunks and then see how each chunk works.
Building the NLP pipeline step by step
Let’s look at a passage from Wikipedia:
London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the southeast of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.
This passage contains some useful facts. If a computer could read this article and understand that London is a city, that London is in England, and that London was settled by the Romans, it would be fantastic. However, to achieve this, we must first teach the computer the most basic concepts of written language and then gradually build upon that.
Step 1: Sentence Segmentation
The first step in the pipeline is to break the text into individual sentences like this:
1. “London is the capital and most populous city of England and the United Kingdom.”
2. “Standing on the River Thames in the southeast of the island of Great Britain, London has been a major settlement for two millennia.”
3. “It was founded by the Romans, who named it Londinium.”
We can assume that each sentence in English expresses an independent thought or idea. Writing a program to understand a sentence is much easier than understanding an entire paragraph.
Coding a sentence segmentation model can be as simple as splitting apart sentences whenever you see a punctuation mark. However, modern NLP pipelines often use more complex techniques that also work when documents are not cleanly formatted.
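If you want to try this step on its own, here is a minimal sketch using spaCy (the library we install in the “Implementation of the NLP Pipeline in Python” section below); it assumes the en_core_web_lg model is available and simply prints each detected sentence:
import spacy
# Load the large English NLP model (installation is covered later in this article)
nlp = spacy.load('en_core_web_lg')
text = """London is the capital and most populous city of England and the
United Kingdom. Standing on the River Thames in the southeast of the island
of Great Britain, London has been a major settlement for two millennia. It
was founded by the Romans, who named it Londinium."""
# Running the pipeline gives us a Doc object; doc.sents yields one Span per sentence
doc = nlp(text)
for sentence in doc.sents:
    print(sentence.text)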
Step 2: Tokenization
Now that we have split the document into sentences, we can process one at a time. Let’s start with the first sentence in the document:
“London is the capital and most populous city of England and the United Kingdom.”
Our next step is to break this sentence into different words or tokens, which is called tokenization. Here are the results of tokenization:
“London”, “is”, “the”, “capital”, “and”, “most”, “populous”, “city”, “of”, “England”, “and”, “the”, “United”, “Kingdom”, “.”
Tokenization in English is quite straightforward: we simply split words apart wherever there is a space between them. We also treat punctuation marks as separate tokens, because punctuation carries meaning as well.
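As a rough sketch of what this looks like with spaCy (again assuming the en_core_web_lg model from the installation section below), iterating over a parsed document yields the individual tokens:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("London is the capital and most populous city of England and the United Kingdom.")
# Each element of the Doc is a Token; punctuation is kept as its own token
print([token.text for token in doc])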
Step 3: Predicting the Part of Speech for Each Token
Next, we look at each token and try to guess its part of speech: noun, verb, adjective, etc. Knowing what each word does in the sentence will help us understand the meaning of the sentence.
We can input each word (and some extra surrounding words for context) into a pre-trained part-of-speech classification model:
The part-of-speech model was initially trained by providing it with millions of English sentences, each with its parts of speech already labeled, and letting it learn to replicate that behavior.
It is important to note that this model is entirely based on statistics; it does not truly understand the meaning of words the way humans do. It only knows how to guess the part of speech based on similar sentences and words it has seen before.
After processing the entire sentence, we will get results like this:
With this information, we can start to extract some very basic meanings. For example, we can see that the nouns in the sentence include “London” and “capital”, so this sentence is likely about London.
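Here is a minimal sketch of this step with spaCy, whose pipeline predicts a coarse part of speech and a more detailed tag for every token:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("London is the capital and most populous city of England and the United Kingdom.")
# token.pos_ is the coarse part of speech, token.tag_ the detailed tag
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.tag_}")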
Step 4: Text Lemmatization
In English (and most languages), words appear in different forms. Look at these two sentences:
- I had a pony.
- I had two ponies.
Both sentences are discussing a noun – pony, but they use different inflections (one in singular form, one in plural form). When processing text in a computer, understanding the base form of each word is helpful so that you know both sentences are discussing the same concept. Otherwise, to the computer, the string “pony” and “ponies” look like two completely different words.
In NLP, we call this process lemmatization—finding the most basic form or lemma of each word in the sentence.
The same applies to verbs. We can also lemmatize verbs by converting them to their unconjugated root form. So “I had two ponies.” becomes “I [have] two [pony].”
Lemmatization is usually done through a lookup table based on parts of speech and may involve some custom rules for handling words you have never seen before.
Here is how our sentence looks after lemmatization, including the root forms of verbs:
The only change we made was turning “is” into “be.”
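In spaCy, the lemma is computed as part of the standard pipeline; here is a minimal sketch that prints each token alongside its base form:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("I had two ponies.")
# token.lemma_ is the base form, e.g. "had" -> "have" and "ponies" -> "pony"
for token in doc:
    print(f"{token.text} -> {token.lemma_}")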
Step 5: Identifying Stop Words
Next, we need to consider the importance of each word in the sentence. English has many filler words that frequently appear, such as “and”, “the”, and “a”. When performing statistical analysis on text, these words introduce a lot of noise because they appear more frequently than others. Some NLP pipelines label them as “stop words”, meaning these are words you might want to filter out before any statistical analysis.
Here is our sentence with stop words turned gray:
Stop words are typically identified by checking a hard-coded list of known stop words. However, there is no standard list of stop words that applies to all applications. The list of words to ignore can vary based on the application.
For example, if you are building a search engine for rock bands, you want to make sure you do not ignore the word “The”. Not only does this word appear in many band names, there is also a famous rock band from the 1980s called “The The”!
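In spaCy, every token carries an is_stop flag based on the library’s built-in stop word list; here is a quick sketch for filtering them out (remember that you may want a different list for your own application):
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("London is the capital and most populous city of England and the United Kingdom.")
# Keep only the tokens that are not in spaCy's default stop word list
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)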
Step 6a: Dependency Parsing
The next step is to figure out how all the words in our sentence relate to each other, which is called dependency parsing.
Our goal is to build a tree that assigns a single parent word to each word in the sentence. The root of the tree is the main verb in the sentence. Here’s what the parsing tree for our sentence looks like at the beginning:
But we can go further. In addition to identifying the parent word for each word, we can also predict the type of relationship that exists between two words:
This parsing tree tells us that the subject of the sentence is the noun “London”, which has a “be” relationship with “capital”. We finally know something useful: London is a capital! If we follow the complete parse tree for the sentence (beyond what is shown above), we would even find that London is the capital of England.
Just like we previously used machine learning models to predict parts of speech, dependency parsing can also work by inputting words into machine learning models and outputting results. However, parsing the dependencies of words is a particularly complex task that requires a full article to explain in detail. If you want to know how it works, a good starting point to read is Matthew Honnibal’s excellent article “Parsing English in 500 Lines of Python”.
However, although a note in that 2015 article claims the approach is now standard, it has actually become outdated and is no longer even used by its author. In 2016, Google released a new dependency parser called Parsey McParseface, which used a new deep learning approach, surpassed previous benchmarks, and quickly spread throughout the industry. A year later, they released an even newer model called ParseySaurus, which improved things further. In other words, parsing technology remains an active research area, constantly changing and improving.
It is also important to remember that many English sentences are ambiguous and difficult to parse. In such cases, the model will make guesses based on the parsing version of the sentence, but it is not perfect, and sometimes the model will lead to awkward errors. However, over time, our NLP models will continue to parse text in better ways.
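You do not need to implement a parser yourself to experiment with this step; a minimal sketch with spaCy prints, for each word, its parent (head) in the tree and the type of relationship between them:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("London is the capital and most populous city of England and the United Kingdom.")
# token.dep_ is the relationship label, token.head is the parent word in the tree
for token in doc:
    print(f"{token.text:10} --{token.dep_}--> {token.head.text}")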
Step 6b: Finding Noun Phrases
So far, we have treated each word in the sentence as an independent entity. However, sometimes it makes more sense to group together words that represent a single idea or thing. We can use the relevant information in the dependency parsing tree to automatically group all the words that discuss the same thing.
For example:
We can combine noun phrases to produce the following form:
Whether we do this step depends on our ultimate goal. If we do not need more detail to describe which words are adjectives but want to focus more on extracting complete ideas, this is a quick and simple way to do it.
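spaCy exposes these groupings directly as doc.noun_chunks, built from the dependency parse; here is a short sketch:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("London is the capital and most populous city of England and the United Kingdom.")
# doc.noun_chunks yields flat noun phrases grouped from the dependency parse
for chunk in doc.noun_chunks:
    print(chunk.text)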
Step 7: Named Entity Recognition (NER)
Now that we have completed all the hard work, we can finally move beyond elementary grammar and start extracting real ideas.
In our sentence, we have the following nouns:
Some of these nouns exist in reality. For example, “London”, “England”, and “United Kingdom” represent physical locations on the map. It is great to be able to detect this! With this information, we can automatically extract a list of real-world place names mentioned in the document using NLP.
The goal of named entity recognition (NER) is to detect and label these nouns with the real-world concepts they represent. Here is what our sentence looks like after running it through the NER tagging model:
However, NER systems are not just simple dictionary lookups. Instead, they use the context of how a word appears in the sentence and a statistical model to guess what type of noun the word represents. A good NER system can distinguish between the name “Brooklyn Decker” and the location “Brooklyn” based on contextual clues.
Here are some typical object types that NER systems can label:
- Person names
- Company names
- Geographic locations (physical and political)
- Product names
- Dates and times
- Monetary amounts
- Event names
NER has a lot of applications because it can easily extract structured data from text. This is one of the simplest ways to get valuable information from an NLP pipeline.
Step 8: Co-reference Resolution
At this point, we have a good representation of the sentence. We know the part of speech for each word, how the words relate to each other, and which words are discussing named entities.
However, we still have a big problem. English is filled with pronouns like he, she, and it. These are shortcuts we use to avoid writing names repeatedly in every sentence. Humans can track what these words refer to based on context. However, our NLP models do not know what pronouns mean because they only check one sentence at a time.
Let’s look at the third sentence in the document:
“It was founded by the Romans, who named it Londinium.”
If we parse this sentence with an NLP pipeline, we will learn that “it” was founded by the Romans. But it would be far more useful to know that “London” was founded by the Romans.
Humans reading this sentence can easily understand that “it” refers to “London”. The goal of co-reference resolution is to work out this same mapping by tracking pronouns across sentences. We want to find all the words that refer to the same entity.
Here is the result of co-reference resolution for the word “London” in our document:
By combining co-reference information with parsing trees and named entity information, we can extract a wealth of information from the document.
Co-reference resolution is one of the most challenging steps in implementing NLP pipelines. It is more difficult than sentence parsing. Recent advancements in deep learning have developed more accurate new methods, but they are still not perfect. If you want to learn more about how it works, please check out: https://explosion.ai/demos/displacy-ent.
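If you want to experiment with this step, the neuralcoref library mentioned earlier plugs into a spaCy 2.x pipeline; here is a minimal sketch, assuming neuralcoref is installed alongside a compatible version of spaCy:
import spacy
import neuralcoref
# neuralcoref targets spaCy 2.x and adds co-reference resolution to the pipeline
nlp = spacy.load('en_core_web_lg')
neuralcoref.add_to_pipe(nlp)
doc = nlp("London is the capital of England. It was founded by the Romans, who named it Londinium.")
print(doc._.coref_clusters)   # groups of mentions that refer to the same entity
print(doc._.coref_resolved)   # the text with pronouns replaced by what they refer to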
Implementation of the NLP Pipeline in Python
Here is an overview of our complete NLP pipeline:
Co-reference resolution is an optional step that does not necessarily have to be completed.
Wow, it looks like there are many steps!
Note: Before we continue, it is worth mentioning that these are the typical steps in an NLP pipeline, but you can skip certain steps or reorder them depending on what you want to do and how the NLP library is implemented. For example, some libraries, like spaCy, perform sentence segmentation later in the pipeline, using the results of dependency parsing.
So how should we code this pipeline? Thanks to magical Python libraries like spaCy, it has already been done! These steps are all coded and available for use.
First, assuming Python 3 is already installed, you can install spaCy like this:
# Install spaCy
pip3 install -U spacy
# Download the large English model for spaCy
python3 -m spacy download en_core_web_lg
# Install textacy which will also be useful
pip3 install -U textacy
Then, running the NLP pipeline code on a block of text looks like this:
import spacy
# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')
# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east
of the island of Great Britain, London has been a major settlement
for two millennia. It was founded by the Romans, who named it Londinium.
"""
# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)
# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
If you run this, you will get a list of the named entities and their types detected in our document:
London (GPE)
England (GPE)
the United Kingdom (GPE)
the River Thames (FAC)
Great Britain (GPE)
London (GPE)
two millennia (DATE)
Romans (NORP)
Londinium (PERSON)
You can find the full list of entity types here: https://spacy.io/usage/linguistic-features#entity-types.
Note that it made an error on “Londinium”, thinking it was a person’s name rather than a place. This may be because there was nothing similar in the training dataset, so it made the best guess. Named entity detection often requires a small amount of model fine-tuning (https://spacy.io/usage/training#section-ner) if you are parsing text with unique or proprietary terms.
Let’s take entity detection and flip it around to build a data scrubber. Suppose you are trying to comply with the new GDPR privacy rules (https://medium.com/@ageitgey/understand-the-gdpr-in-10-minutes-407f4b54111f), and you find that you have thousands of documents containing personally identifiable information, like people’s names. You are tasked with removing all names from the documents.
Searching through thousands of documents to find and remove all names could take years manually. But using NLP, it can be easily achieved. Here is a simple scrubber that removes all names it detects:
import spacy
# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')
# Replace a token with "REDACTED" if it is a name
def replace_name_with_placeholder(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    else:
        return token.string
# Loop through all the entities in a document and check if they are names
def scrub(text):
    doc = nlp(text)
    for ent in doc.ents:
        ent.merge()
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)
s = """
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky’s
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
"""
print(scrub(s))
If you run it, you will find that the result meets your expectations:
In 1950, [REDACTED] published his famous article "Computing Machinery and Intelligence". In 1957, [REDACTED]
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
Extracting Facts
There are many things you can do with spaCy. However, you can also use the output parsed by spaCy as input for more complex data extraction algorithms. There is a Python library called textacy that implements several common data extraction algorithms on top of spaCy. This is a great starting point.
One of the algorithms it implements is called semi-structured statement extraction. We can use it to search the parsing tree for simple statements where the subject is “London” and the verb is a form of “be”. This will help us find facts about London.
Here is its code implementation:
import spacy
import textacy.extract
# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')
# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the Romans,
who named it Londinium.
"""
# Parse the document with spaCy
doc = nlp(text)
# Extract semi-structured statements
statements = textacy.extract.semistructured_statements(doc, "London")
# Print the results
print("Here are the things I know about London:")
for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")
The results are as follows:
Here are the things I know about London:
- the capital and most populous city of England and the United Kingdom.
- a major settlement for two millennia.
Maybe this is not very impressive. But if you run the same code on the entire Wikipedia article about London instead of just three sentences, you will get more impressive results:
Here are the things I know about London:
- the capital and most populous city of England and the United Kingdom
- a major settlement for two millennia
- the world's most populous city from around 1831 to 1925
- beyond all comparison the largest town in England
- still very compact
- the world's largest city from about 1831 to 1925
- the seat of the Government of the United Kingdom
- vulnerable to flooding
- "one of the World's Greenest Cities" with more than 40 percent green space or open water
- the most populous city and metropolitan area of the European Union and the second most populous in Europe
- the 19th largest city and the 18th largest metropolitan region in the world
- Christian, and has a large number of churches, particularly in the City of London
- also home to sizeable Muslim, Hindu, Sikh, and Jewish communities
- also home to 42 Hindu temples
- the world's most expensive office market for the last three years according to world property journal (2015) report
- one of the pre-eminent financial centres of the world as the most important location for international finance
- the world top city destination as ranked by TripAdvisor users
- a major international air transport hub with the busiest city airspace in the world
- the centre of the National Rail network, with 70 percent of rail journeys starting or ending in London
- a major global centre of higher education teaching and research and has the largest concentration of higher education institutes in Europe
- home to designers Vivienne Westwood, Galliano, Stella McCartney, Manolo Blahnik, and Jimmy Choo, among others
- the setting for many works of literature
- a major centre for television production, with studios including BBC Television Centre, The Fountain Studios and The London Studios
- also a centre for urban music
- the "greenest city" in Europe with 35,000 acres of public parks, woodlands and gardens
- not the capital of England, as England does not have its own government
Now things are getting more interesting! This is a wealth of information we have automatically collected.
As a bonus, try installing the neuralcoref library and adding co-reference resolution to the pipeline. That will give you a few more facts, since it will catch sentences that talk about “it” instead of mentioning “London” directly.
What else can we do?
By browsing the spaCy documentation and textacy documentation, you can see many examples of what can be done with parsed text. So far, we have only seen a small example.
Here is another practical example: suppose you are building a website that allows users to view information about every city in the world using the information extracted in the last example.
If you have a search feature on your website, you can autocomplete common search queries like Google:
Google’s autocomplete suggestions for “London”
However, to do this, we need a list of possible completion suggestions to offer users. We can use NLP to quickly generate this data.
Here is one way to extract frequently mentioned noun chunks from the document:
import spacy
import textacy.extract
# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')
# The text we want to examine
text = """London is [.. shortened for space ..]"""
# Parse the document with spaCy
doc = nlp(text)
# Extract noun chunks that appear at least three times
noun_chunks = textacy.extract.noun_chunks(doc, min_freq=3)
# Convert noun chunks to lowercase strings
noun_chunks = map(str, noun_chunks)
noun_chunks = map(str.lower, noun_chunks)
# Print out any nouns that are at least 2 words long
for noun_chunk in set(noun_chunks):
    if len(noun_chunk.split(" ")) > 1:
        print(noun_chunk)
If you run this on the London Wikipedia article, you will get output like this:
westminster abbey
natural history museum
west end
east end
st paul's cathedral
royal albert hall
london underground
great fire
british museum
london eye
.... etc ....
Diving Deeper
This is just a tiny attempt to help you understand what can be done with NLP. In future articles, we will discuss other applications of NLP, such as text classification and how systems like Amazon Alexa parse questions.
But before that, install spaCy (https://spacy.io/) and start using it! You may not be a Python user, or you may end up using a different NLP library, but these ideas should be roughly the same.
Original link: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e