Natural Language Processing (NLP) is a tough nut to crack, and to crack it you need to get to know NLTK, the big shot of the field. NLTK stands for Natural Language Toolkit, and it is one of the best-known NLP libraries in Python. Not only is it powerful, it also ships with a huge collection of corpora and lexical resources, making it a real treasure chest for NLP work. Today, let’s take a look at what makes this big shot so special.
Installing NLTK
Installing NLTK is very simple; it takes just one command:
pip install nltk
After installation, you also need to download the NLTK data. Open the Python interpreter and enter:
import nltk
nltk.download()
This will pop up a download manager where you can choose to download everything or just the parts you need. The full collection is fairly large, so if you only use a few features, it’s quicker to grab just the packages you need, as in the sketch below.
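For example, the data packages this article relies on can be fetched individually. Here is a minimal sketch; note that exact package names can vary slightly between NLTK versions (newer releases use punkt_tab in place of punkt):
import nltk

# Download only the data packages used in this article
for pkg in [
    "punkt",                       # tokenizer models
    "averaged_perceptron_tagger",  # POS tagger model
    "maxent_ne_chunker",           # named entity chunker
    "words",                       # word list used by the chunker
    "stopwords",                   # stop word lists
    "wordnet",                     # lexical database used by the lemmatizer
]:
    nltk.download(pkg)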
Tokenization
Tokenization is the foundation of NLP. NLTK’s tokenizer is very easy to use; let’s take a look:
from nltk.tokenize import word_tokenize
text = "NLTK is a powerful tool for natural language processing!"
tokens = word_tokenize(text)
print(tokens)
Output:
['NLTK', 'is', 'a', 'powerful', 'tool', 'for', 'natural', 'language', 'processing', '!']
See, tokenization for English is easy. However, this tokenizer was not designed for Chinese, so for Chinese NLP it’s better to use a dedicated Chinese tokenizer like Jieba, as in the sketch below.
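For reference, here is a minimal sketch with the third-party jieba library (installed separately via pip install jieba; it is not part of NLTK):
import jieba

text = "自然语言处理很有趣"  # "Natural language processing is fun"
print(jieba.lcut(text))  # prints the segmented word list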
Part-of-Speech Tagging
Knowing the part of speech for each word helps in understanding sentence structure. NLTK’s POS tagger can help:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
Output:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
These are Penn Treebank tags: DT stands for determiner, JJ for adjective, NN for singular noun, VBZ for a verb in the third-person singular present, and IN for preposition or subordinating conjunction. They may look a bit cryptic at first, but no worries, you’ll remember them over time.
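If a tag is unfamiliar, you can also ask NLTK to explain it (this requires the 'tagsets' data package):
import nltk

nltk.help.upenn_tagset('VBZ')  # prints the definition of VBZ with examples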
Tip: NLTK’s POS tagging is primarily for English; its effectiveness on Chinese is not quite satisfactory. For Chinese POS tagging, you should look for specialized tools.
Named Entity Recognition
If you could automatically find entities like names, locations, and organizations from a pile of text, that would be impressive. NLTK can do that:
from nltk import ne_chunk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "Mark Zuckerberg is the CEO of Facebook."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)
print(entities)
Output:
(S
  (PERSON Mark/NNP Zuckerberg/NNP)
  is/VBZ
  the/DT
  CEO/NNP
  of/IN
  (ORGANIZATION Facebook/NNP)
  ./.)
See, Mark Zuckerberg is recognized as a person (PERSON), and Facebook is recognized as an organization (ORGANIZATION). This feature is very useful for information extraction.
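What ne_chunk returns is an NLTK Tree. If you want the entities as plain (label, text) pairs, here is a minimal sketch for walking it, continuing from the example above:
from nltk.tree import Tree

# Entity chunks are subtrees; ordinary words are (token, tag) tuples
for subtree in entities:
    if isinstance(subtree, Tree):
        name = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), "->", name)  # e.g. PERSON -> Mark Zuckerberg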
Stemming and Lemmatization
In English, a word often has several forms. For example, “running”, “runs”, and “ran” are all forms of “run”. Stemming and Lemmatization normalize these forms.
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["cats", "trouble", "troubling", "troubled", "having"]
for word in words:
    print(f"Original: {word}")
    print(f"Stem: {stemmer.stem(word)}")
    print(f"Lemma: {lemmatizer.lemmatize(word)}")
    print()
Output:
Original: cats
Stem: cat
Lemma: cat
Original: trouble
Stem: troubl
Lemma: trouble
Original: troubling
Stem: troubl
Lemma: troubling
Original: troubled
Stem: troubl
Lemma: troubled
Original: having
Stem: have
Lemma: having
Stemming is more aggressive: it simply chops off word endings, which is why you get stems like “troubl” that are not real words. Lemmatization is smarter and returns words to their dictionary form, but NLTK’s lemmatizer assumes every word is a noun unless you tell it otherwise, which is why “troubling” and “having” came back unchanged above. Which is better? It depends: if you want speed, use stemming; if you want accuracy, use lemmatization.
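For example, WordNetLemmatizer accepts a pos argument, which fixes the verb cases above:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Telling the lemmatizer these are verbs changes the result
print(lemmatizer.lemmatize("having", pos="v"))     # have
print(lemmatizer.lemmatize("troubling", pos="v"))  # trouble
print(lemmatizer.lemmatize("running", pos="v"))    # run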
Stop Words Removal
Stop words are those that occur frequently but contribute little meaning to the text, such as “is”, “the”, “and”. Removing these words can reduce noise and improve text analysis. NLTK comes with a large list of stop words:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "NLTK is a powerful tool for natural language processing!"
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
Output:
['NLTK', 'powerful', 'tool', 'natural', 'language', 'processing', '!']
See, words like “is”, “a”, and “for” have been removed. Note that the “!” is still there: punctuation is not in the stop word list, so you have to filter it out separately (see the sketch below). Also, NLTK’s stop word coverage for Chinese is not very comprehensive, so you might need to supplement it yourself in practice.
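A minimal sketch that also drops punctuation, using Python’s built-in string.punctuation:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "NLTK is a powerful tool for natural language processing!"
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
# Keep a token only if it is neither a stop word nor a punctuation mark
filtered_tokens = [word for word in tokens
                   if word.lower() not in stop_words
                   and word not in string.punctuation]
print(filtered_tokens)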
Text Similarity Calculation
To compare how similar two texts are, the simplest method is to see how many words they have in common. NLTK provides several methods for similarity calculation; let’s try the Jaccard similarity:
from nltk import word_tokenize
from nltk.metrics import jaccard_distance
text1 = "NLTK is a powerful tool for natural language processing"
text2 = "NLTK is a strong natural language processing tool"
set1 = set(word_tokenize(text1))
set2 = set(word_tokenize(text2))
similarity = 1 - jaccard_distance(set1, set2)
print(f"Similarity: {similarity}")
Output:
Similarity: 0.7
This similarity ranges from 0 to 1; the closer to 1, the more similar the texts. Here the two sentences share 7 of the 10 distinct words between them, so the Jaccard similarity is 7/10 = 0.7, which means they are quite similar.
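NLTK ships other metrics too. For example, edit distance counts the minimum number of single-character edits needed to turn one string into another, which is handy for comparing individual words:
from nltk.metrics import edit_distance

# The classic example: kitten -> sitting takes 3 edits (k->s, e->i, insert g)
print(edit_distance("kitten", "sitting"))  # 3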
The world of NLP is vast and profound, and NLTK is just one key to its door. With more practice and exploration, you too can become an NLP expert! Remember, with enough effort even an iron pestle can be ground into a needle. Well, that’s all for today, see you next time!
