The Data Science Behind Natural Language Processing

Produced by Big Data Digest

Source: Medium

Translated by: Lu Zhen, Xia Yahui

Natural Language Processing (NLP) is a field within computer science and artificial intelligence.

NLP is about communication between humans and machines: enabling machines to interpret our language and respond to it effectively. The field has existed since the 1950s; you may have heard of the “Turing Test” proposed by Alan Turing, which measures how well a computer can respond to human questions.

If a third party cannot distinguish between a human and the computer, then the computer system is considered intelligent. Since the 1950s, humanity has worked hard on this, and today we have made significant progress in data science and linguistics.

This article will detail some of the basic functionalities of algorithms in the field of natural language processing, including some Python code examples.

Tokenization

To start, let’s look at a very simple text parsing example. Tokenization is the process of breaking a stream of text (such as a sentence) into its most basic constituent words. For example, the sentence “The red fox jumps over the moon.” contains 7 words.

Using Python to tokenize a sentence:

myText = 'The red fox jumps over the moon.'
myLowerText = myText.lower()
myTextList = myLowerText.split()
print(myTextList)
# OUTPUT: ['the', 'red', 'fox', 'jumps', 'over', 'the', 'moon.']

Part-of-Speech Tagging

Part-of-speech tagging is used to determine the syntactic function of words. The main parts of speech in English are: adjectives, pronouns, nouns, verbs, adverbs, prepositions, conjunctions, and interjections. Tagging is used to infer the role of a word from its usage. For example, “permit” can be both a noun and a verb. As a verb: “I permit you to go to the dance.” As a noun: “Did you get the permit from the county?”

Using Python to determine part of speech: (using the NLTK library)

You need to install NLTK, which is a Python library for natural language processing.

Instructions for NLTK:

https://www.geeksforgeeks.org/part-speech-tagging-stop-words-using-nltk-python/

import nltk

# Requires the 'punkt' tokenizer and 'averaged_perceptron_tagger' data
# (download once with nltk.download()).
myText = nltk.word_tokenize('the red fox jumps over the moon.')
print('Parts of Speech:', nltk.pos_tag(myText))
# OUTPUT: Parts of Speech: [('the', 'DT'), ('red', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('moon', 'NN'), ('.', '.')]

You can see how NLTK breaks the sentence down into individual words and identifies their parts of speech, such as (‘fox’, ‘NN’):

NN: noun, singular (‘fox’)

Stop Word Removal

Many sentences and paragraphs contain some words that are virtually meaningless, including “a”, “and”, “an”, and “the”. Stop word filtering refers to the removal of these words from a sentence or stream of words.

Using Python and NLTK for stop word filtering:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the 'stopwords' and 'punkt' data (download once with nltk.download()).
example_sent = "a red fox is an animal that is able to jump over the moon."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
# filtered_sentence = []
# for w in word_tokens:
#     if w not in stop_words:
#         filtered_sentence.append(w)
print(filtered_sentence)
# OUTPUT: ['red', 'fox', 'animal', 'able', 'jump', 'moon', '.']

Stemming

Stemming, sometimes called dictionary normalization, reduces the noise created by word variants: it trims the different forms of a word back to a common stem. For example, the stem of the word “fishing” is “fish”.

Stemming is used to simplify words to their basic meanings. Another good example is the word “like”, which is the stem of many words such as: “likes”, “liked”, and “liking”.

Search engines also use stemming: in many cases it is useful for a search on one word to return documents containing any of that word’s variants.

Using Python and the NLTK library to implement stemming:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
words = ["likes", "liked", "likely", "liking"]
for w in words:
    print(w, ":", ps.stem(w))
# OUTPUT: likes : like
#         liked : like
#         likely : like
#         liking : like
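The search-engine use mentioned above can be sketched in a few lines: stem both the query and the documents, and a query like “fished” matches documents that only contain “fish” or “fishing”. The documents and query here are hypothetical:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_set(text):
    # Reduce each word to its stem so that surface variants compare equal.
    return {ps.stem(w) for w in text.lower().split()}

documents = ["fish are caught in the river",
             "fishing gear for sale",
             "red fox jumps"]
query = "fished"

# A document matches if it contains any variant of the query word.
matches = [doc for doc in documents if ps.stem(query) in stem_set(doc)]
print(matches)
# OUTPUT: ['fish are caught in the river', 'fishing gear for sale']
```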

Lemmatization

Stemming and lemmatization are very similar; both find the root form of a word. This is called word normalization, and the two can produce the same output. However, they work very differently. Stemming simply truncates words, while lemmatization uses a vocabulary and the word’s part of speech (noun, verb, and so on) to return a proper dictionary form.

For example, stemming the word ‘saw’ returns ‘saw’, while lemmatizing it returns ‘see’ when it is treated as a verb and ‘saw’ when it is treated as a noun. Lemmatization usually returns a readable word, while stemming may not. You can see an example below to understand the differences.

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Requires the 'wordnet' corpus (download once with nltk.download('wordnet')).
lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()
# Pair each word with the part of speech the lemmatizer should assume:
# 'n' = noun, 'v' = verb, 'a' = adjective.
words = [("corpora", "n"), ("constructing", "v"), ("better", "a"),
         ("done", "v"), ("worst", "a"), ("pony", "n")]
for w, pos in words:
    print(w, "STEMMING:", ps.stem(w), "LEMMATIZATION:", lemmatizer.lemmatize(w, pos=pos))
# OUTPUT: corpora STEMMING: corpora LEMMATIZATION: corpus
# constructing STEMMING: construct LEMMATIZATION: construct
# better STEMMING: better LEMMATIZATION: good
# done STEMMING: done LEMMATIZATION: do
# worst STEMMING: worst LEMMATIZATION: bad
# pony STEMMING: poni LEMMATIZATION: pony

Conclusion

Linguistics is the study of language: its morphology, syntax, phonetics, and semantics. Together with data science and computing, these fields have exploded over the past 60 years. We have only just begun to explore some very simple text analysis in NLP. Google, Bing, and other search engines utilize this technology to help you find information on the global web.

Think about how easy it is to have Alexa play your favorite song, or how Siri helps you find directions. This is all thanks to NLP. Natural language in computing systems is not a gimmick or a toy; it is the future of seamlessly integrating computing systems into our lives.

Arcadia Data has just released version 5.0, which includes a natural language query feature we call Search Based BI. It utilizes some of the data science and text analysis functionalities described above.

Check out this video with our Search Based BI tool:

http://watch.arcadiadata.com/watch/NSf1mMENjYfTY2cjpuGWPS?

Related reports:

https://medium.com/dataseries/the-data-science-behind-natural-language-processing-69d6df06a1ff
