Imagine if you could make a computer understand and generate language; how cool would that be? No longer relying solely on typing, you could speak directly to the computer, ask questions, and it would be able to ‘understand’ and respond. Moreover, this ability is not just a feature of science fiction movies; it has quietly become a reality in our lives. From smart assistants to text analysis, Natural Language Processing (NLP) is changing the way we work and entertain ourselves.
Today, let’s talk about a very popular NLP tool in Python—NLTK (Natural Language Toolkit). It is a powerful library that helps us perform text processing, lexical analysis, syntactic analysis, sentiment analysis, and more. The best part is that it is very beginner-friendly, suitable for starting from scratch. Are you ready? Let’s explore this new world together!

What is NLTK?
NLTK is a Python library that includes many features for natural language processing. It provides many useful modules to help us process text data and perform various language analysis tasks, such as tokenization, part-of-speech tagging, named entity recognition (NER), and more. You just need to import NLTK to easily use these features.
Installing NLTK:
pip install nltk
You can install the NLTK library using this command. After installation, enter the following code in Python to test:
import nltk
nltk.download('punkt') # download the 'punkt' tokenizer models
This line downloads the 'punkt' models used by NLTK's tokenizers. (Recent NLTK releases may ask you to download 'punkt_tab' instead; follow whatever resource name the error message gives.)
NLTK Basics: Tokenization
Tokenization is a fundamental task in NLP that refers to splitting a piece of text into individual words or punctuation marks. You can think of it as cutting an article into small segments, each of which can stand alone for easier subsequent analysis.
Code Example:
from nltk.tokenize import word_tokenize
text = "Hello, how are you doing today?"
tokens = word_tokenize(text)
print(tokens)
Output:
['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
See? The tokenization process breaks down the sentence “Hello, how are you doing today?” into individual words and symbols, making it easier for subsequent processing. Note that each punctuation mark is treated as a token of its own, which is standard in NLP.
Tip: Sometimes we don’t want punctuation in the token list at all; in that case, a RegexpTokenizer that matches only word characters will skip it. And when you need sentence-level units instead of words, use sent_tokenize (sentence tokenization).
Part-of-Speech Tagging
Part-of-speech tagging (POS tagging) is another important task in NLP, aimed at labeling each word in a sentence with its grammatical role (such as noun, verb, adjective, etc.). In simple terms, the computer helps you tag each word to tell you what type of word it is.
Code Example:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger') # the default English tagger model
text = "NLTK is a great tool for NLP."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
Output:
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('NLP', 'NNP'), ('.', '.')]
The part after each word is its part-of-speech tag. For example, 'NNP' indicates a proper noun, while 'VBZ' indicates a verb in the third person singular. These tags are based on English grammar rules.
Tip: NLTK’s pos_tag uses the Penn Treebank tag set by default. Other taggers and corpora use different tag sets (for example, the Universal tag set), so check which set your tags come from before interpreting them.
Named Entity Recognition (NER)
Named Entity Recognition (NER) is a technique in NLP used to identify entities in text that have specific meanings (such as names of people, places, organizations, etc.). Through NER, the computer can understand that “Obama” is a name of a person and “New York” is the name of a place.
Code Example:
import nltk
from nltk import ne_chunk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('maxent_ne_chunker') # the default named-entity chunker model
nltk.download('words')             # word list the chunker relies on
text = "Barack Obama was born in Hawaii and became the president of the United States."
tokens = word_tokenize(text)
tags = pos_tag(tokens)
tree = ne_chunk(tags)
print(tree)
Output:
(S
  (GPE Barack/NNP)
  Obama/NNP
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  and/CC
  became/VBD
  the/DT
  president/NN
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  ./.)
As you can see, (GPE Barack/NNP) and (GPE Hawaii/NNP) are recognized as geo-political entities (GPE). This method helps us extract key people and places from an article.
NLTK Advanced: Text Preprocessing
In natural language processing, text preprocessing is a crucial step. It helps us clean the data, remove noise, and make subsequent analyses more precise. Common preprocessing steps include: removing stop words, stemming, and lemmatization.
Removing Stop Words:
Stop words are high-frequency words that carry little meaning for analysis. Every language has its own list; common stop words in English include “the”, “and”, “is”, etc.
Code Example:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords') # the stop-word lists ship as a separate corpus
text = "This is an example sentence, showing the removal of stop words."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
# Compare in lowercase so "This" is filtered along with "this"
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
Output:
['example', 'sentence', ',', 'showing', 'removal', 'stop', 'words', '.']
This little trick allows you to remove words that do not help in analysis, making the text data more concise and useful.
Practical Application Scenarios
Alright, we have learned how to use NLTK for basic text processing. So, how can we apply these skills in practice?
1. Sentiment Analysis: Determine the sentiment tendency of a piece of text (positive, negative, neutral). For example, you can use NLTK to analyze a comment or a post on social media to determine if it is positive or negative.
2. Text Classification: NLTK can be used for tasks like spam detection and news categorization. By tokenizing and tagging the text, you can extract features for classification.
3. Keyword Extraction: Through text analysis, we can extract the core keywords of an article, aiding in search engine optimization, information retrieval, and more.
4. Automated Question Answering System: NLTK can also be used to develop automated question answering systems, where users ask questions and the computer provides answers based on an existing text library.
Tip: For these types of applications, in addition to understanding basic text processing, you also need to have a certain understanding of algorithms (such as machine learning). If you want to delve deeper, remember to experiment more and gradually accumulate experience!
Conclusion
Today, we talked about the basics of the NLTK NLP tool. From tokenization to named entity recognition and text preprocessing, we practiced some classic techniques in natural language processing. Remember, NLP is not just about processing text; it is the key to enabling computers to understand human language. If you master these basic tools, you can further explore various complex language tasks, such as sentiment analysis and text generation.
Tip: NLTK is powerful, but there may be some areas where processing can be a bit slow or inflexible. If your tasks are more complex, don’t forget to also check out other libraries, like spaCy or transformers, which may offer greater advantages in performance and efficiency.