Natural Language Processing (NLP) is a tough nut to crack, and to crack it you need to get to know NLTK, the big shot of the field. NLTK stands for Natural Language Toolkit, and it is one of the best-known NLP libraries in Python. Not only is it powerful, it also ships with a huge collection of corpora and lexical resources, making it a real treasure chest for NLP work. Today, let’s take a look at what makes this big shot so special.
Installing NLTK
Installing NLTK is very simple; it takes just one command:
pip install nltk
After installation, you also need to download the NLTK data. Open the Python interpreter and enter:
import nltk
nltk.download()
This will pop up a download manager where you can choose to download everything or just the parts you need. The full collection is fairly large, so if you only use a few features, it’s quicker to grab just the packages you need, as in the sketch below.
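For example, the data packages this article relies on can be fetched individually. Here is a minimal sketch; note that exact package names can vary slightly between NLTK versions (newer releases use punkt_tab in place of punkt):
import nltk

# Download only the data packages used in this article
for pkg in [
    "punkt",                       # tokenizer models
    "averaged_perceptron_tagger",  # POS tagger model
    "maxent_ne_chunker",           # named entity chunker
    "words",                       # word list used by the chunker
    "stopwords",                   # stop word lists
    "wordnet",                     # lexical database used by the lemmatizer
]:
    nltk.download(pkg)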
Tokenization
Tokenization is the foundation of NLP. NLTK’s tokenizer is very easy to use; let’s take a look:
from nltk.tokenize import word_tokenize
text = "NLTK is a powerful tool for natural language processing!"
tokens = word_tokenize(text)
print(tokens)
Output:
['NLTK', 'is', 'a', 'powerful', 'tool', 'for', 'natural', 'language', 'processing', '!']
See, tokenization for English is easy. However, this tokenizer was not designed for Chinese, so for Chinese NLP it’s better to use a dedicated Chinese tokenizer like Jieba, as in the sketch below.
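For reference, here is a minimal sketch with the third-party jieba library (installed separately via pip install jieba; it is not part of NLTK):
import jieba

text = "自然语言处理很有趣"  # "Natural language processing is fun"
print(jieba.lcut(text))  # prints the segmented word list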
Part-of-Speech Tagging
Knowing the part of speech for each word helps in understanding sentence structure. NLTK’s POS tagger can help:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
Output:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
These are Penn Treebank tags: DT stands for determiner, JJ for adjective, NN for singular noun, VBZ for a verb in the third-person singular present, and IN for preposition or subordinating conjunction. They may look a bit cryptic at first, but no worries, you’ll remember them over time.
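If a tag is unfamiliar, you can also ask NLTK to explain it (this requires the 'tagsets' data package):
import nltk

nltk.help.upenn_tagset('VBZ')  # prints the definition of VBZ with examples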
Tip: NLTK’s POS tagging is primarily for English; its effectiveness on Chinese is not quite satisfactory. For Chinese POS tagging, you should look for specialized tools.
Named Entity Recognition
If you could automatically find entities like names, locations, and organizations from a pile of text, that would be impressive. NLTK can do that:
from nltk import ne_chunk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "Mark Zuckerberg is the CEO of Facebook."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)
print(entities)
Output:
(S
  (PERSON Mark/NNP Zuckerberg/NNP)
  is/VBZ
  the/DT
  CEO/NNP
  of/IN
  (ORGANIZATION Facebook/NNP)
  ./.)
See, Mark Zuckerberg is recognized as a person (PERSON), and Facebook is recognized as an organization (ORGANIZATION). This feature is very useful for information extraction.
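What ne_chunk returns is an NLTK Tree. If you want the entities as plain (label, text) pairs, here is a minimal sketch for walking it, continuing from the example above:
from nltk.tree import Tree

# Entity chunks are subtrees; ordinary words are (token, tag) tuples
for subtree in entities:
    if isinstance(subtree, Tree):
        name = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), "->", name)  # e.g. PERSON -> Mark Zuckerberg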
Stemming and Lemmatization
In English, a word often has several forms. For example, “running”, “runs”, and “ran” are all forms of “run”. Stemming and Lemmatization normalize these forms.
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["cats", "trouble", "troubling", "troubled", "having"]
for word in words:
    print(f"Original: {word}")
    print(f"Stem: {stemmer.stem(word)}")
    print(f"Lemma: {lemmatizer.lemmatize(word)}")
    print()
Output:
Original: cats
Stem: cat
Lemma: cat
Original: trouble
Stem: troubl
Lemma: trouble
Original: troubling
Stem: troubl
Lemma: troubling
Original: troubled
Stem: troubl
Lemma: troubled
Original: having
Stem: have
Lemma: having
Stemming is more aggressive: it simply chops off word endings, which is why you get stems like “troubl” that are not real words. Lemmatization is smarter and returns words to their dictionary form, but NLTK’s lemmatizer assumes every word is a noun unless you tell it otherwise, which is why “troubling” and “having” came back unchanged above. Which is better? It depends: if you want speed, use stemming; if you want accuracy, use lemmatization.
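For example, WordNetLemmatizer accepts a pos argument, which fixes the verb cases above:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Telling the lemmatizer these are verbs changes the result
print(lemmatizer.lemmatize("having", pos="v"))     # have
print(lemmatizer.lemmatize("troubling", pos="v"))  # trouble
print(lemmatizer.lemmatize("running", pos="v"))    # run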
Stop Words Removal
Stop words are those that occur frequently but contribute little meaning to the text, such as “is”, “the”, “and”. Removing these words can reduce noise and improve text analysis. NLTK comes with a large list of stop words:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "NLTK is a powerful tool for natural language processing!"
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
Output:
['NLTK', 'powerful', 'tool', 'natural', 'language', 'processing', '!']
See, words like “is”, “a”, and “for” have been removed. Note that the “!” is still there: punctuation is not in the stop word list, so you have to filter it out separately (see the sketch below). Also, NLTK’s stop word coverage for Chinese is not very comprehensive, so you might need to supplement it yourself in practice.
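A minimal sketch that also drops punctuation, using Python’s built-in string.punctuation:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "NLTK is a powerful tool for natural language processing!"
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
# Keep a token only if it is neither a stop word nor a punctuation mark
filtered_tokens = [word for word in tokens
                   if word.lower() not in stop_words
                   and word not in string.punctuation]
print(filtered_tokens)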
Text Similarity Calculation
To compare how similar two texts are, the simplest method is to see how many words they have in common. NLTK provides several methods for similarity calculation; let’s try the Jaccard similarity:
from nltk import word_tokenize
from nltk.metrics import jaccard_distance
text1 = "NLTK is a powerful tool for natural language processing"
text2 = "NLTK is a strong natural language processing tool"
set1 = set(word_tokenize(text1))
set2 = set(word_tokenize(text2))
similarity = 1 - jaccard_distance(set1, set2)
print(f"Similarity: {similarity}")
Output:
Similarity: 0.7
This similarity ranges from 0 to 1; the closer to 1, the more similar the texts. Here the two sentences share 7 of the 10 distinct words between them, so the Jaccard similarity is 7/10 = 0.7, which means they are quite similar.
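NLTK ships other metrics too. For example, edit distance counts the minimum number of single-character edits needed to turn one string into another, which is handy for comparing individual words:
from nltk.metrics import edit_distance

# The classic example: kitten -> sitting takes 3 edits (k->s, e->i, insert g)
print(edit_distance("kitten", "sitting"))  # 3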
The world of NLP is vast and profound, and NLTK is just one key to its door. With more practice and exploration, you too can become an NLP expert! Remember, with enough effort even an iron pestle can be ground into a needle. Well, that’s all for today, see you next time!
