NLTK: An Elite Python Library for Natural Language Processing

Have you seen those impressive AIs that can understand and answer human questions, and even write poetry? That's all thanks to natural language processing technology. Today, let's talk about a super handy Python natural language processing toolkit: NLTK.

PART.1
What is NLTK?

NLTK stands for Natural Language Toolkit, which is essentially a natural language toolbox. It provides a plethora of ready-made tools and datasets that allow us to conveniently process human language. Whether it’s tokenization, part-of-speech tagging, semantic analysis, or sentiment analysis, NLTK can handle it all with ease.

Using NLTK is simple; just install it:

pip install nltk

Once installed, we can start playing around!
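One heads-up before we dive in: many NLTK features depend on data packages that are downloaded separately from the library itself. If you'd rather not hit download errors one by one, a quick setup sketch like this grabs everything used in this article (exact package names can vary slightly between NLTK versions):

import nltk

# one-time downloads for the examples below
for pkg in ["punkt",                       # tokenizer models
            "averaged_perceptron_tagger",  # part-of-speech tagger
            "maxent_ne_chunker", "words",  # named entity chunker + word list
            "wordnet",                     # lemmatizer corpus
            "stopwords",                   # stop word lists
            "tagsets"]:                    # tag documentation for nltk.help
    nltk.download(pkg)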

PART.2
Tokenization: Breaking Sentences into Words

Tokenization is the most fundamental and important step in natural language processing. NLTK provides several tokenizers, with the most commonly used being word_tokenize:

import nltk
from nltk.tokenize import word_tokenize

text = "NLTK is a super handy toolkit!"
tokens = word_tokenize(text)  # split the sentence into individual tokens
print(tokens)

Output:

['NLTK', 'is', 'a', 'super', 'handy', 'toolkit', '!']

As you can see, a sentence has been broken down into individual words. This allows for further analysis of each word.

Tip: The first time you use word_tokenize, you might encounter an error because it requires downloading some data. Just run nltk.download('punkt') as prompted.
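By the way, word_tokenize isn't the only tokenizer in the box. If you want to split text into sentences rather than words, sent_tokenize (which uses the same punkt data) does the trick:

from nltk.tokenize import sent_tokenize

text = "NLTK is super handy. It has tools for almost everything!"
print(sent_tokenize(text))

Output:

['NLTK is super handy.', 'It has tools for almost everything!']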

PART.3
Part-of-Speech Tagging: Labeling Each Word

Now that we have the individual words, the next step is to determine each word's part of speech: is it a noun, a verb, or an adjective? NLTK's pos_tag function handles this nicely:

from nltk import pos_tag

# pos_tag needs the tagger model: nltk.download('averaged_perceptron_tagger')
tagged = pos_tag(tokens)
print(tagged)

The output will look something like this:

[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('super', 'JJ'), ('handy', 'JJ'), ('toolkit', 'NN'), ('!', '.')] 

Each word is followed by a tag: NNP marks a proper noun, VBZ a third-person singular verb, JJ an adjective, and so on. The tags may look cryptic at first, but you'll get used to them over time.
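Better yet, you don't have to memorize them: NLTK can explain a tag for you (this uses the tagsets data from the setup step above):

import nltk

nltk.help.upenn_tagset('NNP')  # prints the definition of NNP plus example words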

PART.4
Named Entity Recognition: Identifying Important Nouns in Text

If you want to find all the names, locations, and organization names in a large block of text, you’ll need to use Named Entity Recognition (NER). NLTK also provides a ready-made NER tool:

from nltk import ne_chunk

# ne_chunk needs two data packages: nltk.download('maxent_ne_chunker') and nltk.download('words')
text = "The headquarters of Apple Inc. is in Cupertino, California."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)
print(entities)

The output is a tree structure with the recognized entities marked. For example, "Apple Inc." gets labeled as an organization (ORGANIZATION), while "Cupertino" and "California" get labeled as locations (GPE, short for geo-political entity).
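If you'd rather have the entities as plain strings instead of a printed tree, you can walk the tree yourself. Here's a minimal sketch reusing the entities variable from above (the exact labels you get can vary with NLTK's models):

from nltk.tree import Tree

# chunked entities are subtrees; ordinary tokens stay as (word, tag) tuples
for subtree in entities:
    if isinstance(subtree, Tree):
        name = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", name)  # e.g. GPE -> Cupertino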

PART.5
Stemming and Lemmatization: Normalizing Different Forms of Words

In English, you often encounter different forms of the same word, like go, goes, gone, going. In many cases, we want to treat all of them as the same word “go.” NLTK provides two methods to achieve this:

1. Stemming:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["go", "goes", "going", "gone"]
stems = [stemmer.stem(word) for word in words]  # chop suffixes off by rule
print(stems)

Output:

['go', 'goe', 'go', 'gone']

2. Lemmatization:

from nltk.stem import WordNetLemmatizer

# the lemmatizer relies on the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]  # pos='v' treats each word as a verb
print(lemmas)

Output:

['go', 'go', 'go', 'go']

As you can see, lemmatization gives cleaner results (stemming mangled "goes" into "goe"), but it is slower and needs extra data. Which one to use depends on your needs.

PART.6
Stop Words Filtering: Removing Meaningless Words

When analyzing text, filler words like "the", "is", and "a" carry little meaning and often skew the analysis results. NLTK ships stop word lists for many languages that make filtering them out easy:

from nltk.corpus import stopwords

text = "This is a very handy toolkit"
tokens = word_tokenize(text)
stops = set(stopwords.words('english'))  # build the set once instead of once per word
filtered = [word for word in tokens if word.lower() not in stops]
print(filtered)

Output:

['handy', 'toolkit']

See, the filler words are gone! (Even "very" got dropped, since it happens to be on NLTK's English stop word list.)

Tip: Remember to download the stop words dataset first by running nltk.download('stopwords').
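To wrap up, here's a minimal sketch that strings a few of these pieces together: tokenize a snippet, drop the stop words, and count the most frequent remaining words with NLTK's FreqDist (the sample text is made up for illustration):

from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = ("NLTK makes natural language processing fun. "
        "With NLTK, processing language data takes just a few lines.")
stops = set(stopwords.words('english'))
words = [w.lower() for w in word_tokenize(text)
         if w.isalpha() and w.lower() not in stops]
print(FreqDist(words).most_common(3))  # e.g. [('nltk', 2), ('language', 2), ('processing', 2)]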

NLTK has many other fun features, such as sentiment analysis, text classification, syntax analysis, and more. We’ll stop here for today and explore more next time. In short, with the powerful NLTK toolkit, processing natural language is incredibly enjoyable! Go ahead and try it out; I believe you will discover even more interesting ways to play with it.
