Hello everyone! I am Hao Ge. Today I want to introduce you to a super practical Python library – NLTK (Natural Language Toolkit). As a Python enthusiast, I have always found natural language processing particularly interesting. Imagine letting computers understand human language; isn’t that cool? NLTK is a powerful assistant to help us achieve this goal!
1. First Experience with NLTK
First, we need to install NLTK. Open the command line and enter:
bash
pip install nltk
After installation, we also need to download some necessary data:
python
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
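Note: newer NLTK releases may look for the tokenizer and tagger data under different resource names. If the code later in this article raises a LookupError asking for punkt_tab or averaged_perceptron_tagger_eng, download those as well (whether you need this depends on your installed version):
python
import nltk

# Only needed on newer NLTK versions that ask for these resource names
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')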
2. Tokenization – The First Step in Text Processing
Tokenization is the most basic operation in natural language processing. Just like we need to recognize characters when learning a language, computers also need to break sentences into individual words.
python
from nltk.tokenize import word_tokenize, sent_tokenize
text = "NLTK is really useful! It can help us process various texts. Let's start learning!"
# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentence tokenization result:", sentences)
# Word tokenization
words = word_tokenize(text)
print("Word tokenization result:", words)
3. Stop Words Handling
When analyzing text, function words such as "the", "is", and "a" in English (or "的", "了", and "着" in Chinese) usually carry little substantive information; we call them "stop words". NLTK provides stop word lists for many languages:
python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "I am learning Python natural language processing, which is really interesting!"
words = word_tokenize(text)
# Get English stop words (the sample text is English)
stop_words = set(stopwords.words('english'))
# Compare in lowercase so capitalized words like "I" are also filtered
filtered_words = [word for word in words if word.lower() not in stop_words]
print("After removing stop words:", filtered_words)
4. Part-of-Speech Tagging
Part-of-speech tagging can help us understand the grammatical structure of a sentence, similar to learning verbs, nouns, and adjectives in elementary school:
python
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog"
words = word_tokenize(text)
tagged = pos_tag(words)
print("Part-of-speech tagging result:", tagged)
💡 Tips:
- NLTK's part-of-speech tagger mainly supports English; for Chinese, consider a library such as jieba.
- In the tagging results, NN marks nouns, VB marks verbs, and JJ marks adjectives.
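If you are not sure what a tag like NN or JJ stands for, NLTK can print a short description for you, and the tagged output makes it easy to pull out just the adjectives. A small sketch (the 'tagsets' download name may vary slightly between NLTK versions):
python
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.help import upenn_tagset

nltk.download('tagsets')      # tag set documentation used by upenn_tagset
upenn_tagset('JJ')            # prints what the JJ (adjective) tag means

tagged = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]
print("Adjectives:", adjectives)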
5. Text Frequency Analysis
Let’s see how to count the frequency of words appearing in the text:
python
from nltk import FreqDist
from nltk.tokenize import word_tokenize
text = "Python is amazing! Python is powerful! I love Python programming!"
words = word_tokenize(text)
fdist = FreqDist(words)
# Check the most common words
print("Word frequency statistics:", fdist.most_common(3))
# Draw the word frequency distribution chart
import matplotlib.pyplot as plt
fdist.plot(30)
plt.show()
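A FreqDist also behaves like a dictionary of counts, so you can query individual words or totals directly (a small usage sketch based on the same text):
python
from nltk import FreqDist
from nltk.tokenize import word_tokenize

fdist = FreqDist(word_tokenize("Python is amazing! Python is powerful! I love Python programming!"))
print(fdist['Python'])   # how many times 'Python' appears: 3
print(fdist.N())         # total number of tokens counted
print(fdist.max())       # one of the most frequent tokens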
6. Practical Tips
When using NLTK, there are a few points to pay special attention to:
- When processing Chinese, it is recommended to tokenize with jieba first and then use NLTK for the later steps (see the sketch in section 3).
- When downloading data packages, you may run into network issues; consider using a mirror source.
- When processing large-scale texts, be mindful of memory usage; streaming the text instead of loading it all at once helps, as sketched below.
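For the last point, one simple approach is to read a large file line by line and update a FreqDist incrementally rather than holding the whole text in memory (big_corpus.txt is just a placeholder file name):
python
from nltk import FreqDist
from nltk.tokenize import word_tokenize

# Stream the file line by line instead of reading it all at once
fdist = FreqDist()
with open('big_corpus.txt', encoding='utf-8') as f:
    for line in f:
        fdist.update(word_tokenize(line))

print(fdist.most_common(10))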
Practice Tasks
Try to complete these small tasks:
- Count the number of adjectives in a piece of text.
- Find the five most commonly used nouns.
- Split a passage into sentences and calculate the average sentence length (a starting sketch follows the list).
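If you get stuck, here is one possible sketch for the third task; the other two follow the same pattern using pos_tag and FreqDist from the sections above:
python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLTK is really useful! It can help us process various texts. Let's start learning!"
sentences = sent_tokenize(text)
avg_len = sum(len(word_tokenize(s)) for s in sentences) / len(sentences)
print(f"Average sentence length: {avg_len:.1f} words")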
Friends, today’s Python learning journey ends here! Remember to try the code yourself, and feel free to ask me in the comments if you have any questions. I wish you all happy studying and continuous progress in Python!