Hello everyone! I am Hao Ge. Today, I want to introduce you to a super useful Python library – NLTK (Natural Language Toolkit). It is a great assistant for natural language processing: whether it's tokenization, part-of-speech tagging, or sentiment analysis, it handles them all with ease. Let's explore this magical toolkit together!
-
Installation and Setup of NLTK
First, we need to install the NLTK library and download the necessary data packages:
python code snippet
# Install NLTK (run this in your terminal)
pip install nltk

# Download the necessary data packages (run this in Python)
import nltk
nltk.download('punkt')                        # tokenizer models
nltk.download('averaged_perceptron_tagger')   # part-of-speech tagger
nltk.download('maxent_ne_chunker')            # named entity chunker
nltk.download('words')                        # word list used by the chunker
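If you are not sure whether a data package is already on your machine, you can check for it first and only download when it is missing. This is just an optional convenience sketch; the resource paths ('tokenizers/punkt', 'taggers/averaged_perceptron_tagger') are the locations NLTK reports in its LookupError messages:
python code snippet
import nltk

# Download a resource only if NLTK cannot find it locally.
def ensure_resource(path, package):
    try:
        nltk.data.find(path)      # raises LookupError when the resource is missing
    except LookupError:
        nltk.download(package)

ensure_resource('tokenizers/punkt', 'punkt')
ensure_resource('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger')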
-
Text Tokenization
Tokenization is the foundation of natural language processing: just as you would cut a complete sentence into small pieces, NLTK can split text into sentences and words:
python code snippet
from nltk.tokenize import word_tokenize, sent_tokenize
text = "NLTK is really useful! It helps us process various texts. This is the third sentence."
# Sentence segmentation
sentences = sent_tokenize(text)
print("Sentence segmentation result:", sentences)
# Word segmentation
words = word_tokenize(text)
print("Word segmentation result:", words)
-
Part-of-Speech Tagging
Just like learning parts of speech in language class, NLTK can help us tag the part of speech of each word:
python code snippet
from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog"
words = word_tokenize(text)
tagged = pos_tag(words)
print("Part-of-speech tagging result:", tagged)
-
Named Entity Recognition
NLTK can also recognize proper nouns such as names, places, and organizations in the text:
python code snippet
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

text = "Mark works at Google in New York"
words = word_tokenize(text)
tagged = pos_tag(words)
entities = ne_chunk(tagged)
print("Named entity recognition result:", entities)
💡 Tip:
- When processing Chinese with NLTK, it is recommended to preprocess with the jieba library first (see the sketch after this list).
- You may need a VPN to download the data packages.
- Be mindful of memory usage when processing large-scale texts.
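NLTK's built-in tokenizers are trained on English, so for Chinese a common approach is to segment the text with the third-party jieba library first and then hand the tokens to NLTK. A minimal sketch, assuming jieba is installed with pip install jieba:
python code snippet
import jieba
from nltk import FreqDist

text = "自然语言处理非常有趣，自然语言处理也很实用"
# jieba does the Chinese word segmentation...
words = jieba.lcut(text)
# ...and NLTK takes over from there, for example counting word frequencies.
fdist = FreqDist(words)
print(fdist.most_common(5))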
-
Word Frequency Statistics
Let’s see how to count the frequency of words appearing in the text:
python code snippet
from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "python python nltk programming coding python nltk"
words = word_tokenize(text)
fdist = FreqDist(words)
print("Word frequency statistics:")
for word, frequency in fdist.most_common():
    print(f"{word}: {frequency}")
-
Stop Words Removal
Remove words that carry little meaning on their own, such as "the", "is", and "a" in English (or "的", "了", "是" in Chinese):
python code snippet
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

text = "This is a sample sentence, removing stop words will make it more meaningful"
words = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
print("After removing stop words:", filtered_words)
⚠️ Notes:
- You need to download the corresponding data packages when using NLTK for the first time.
- Pay attention to encoding issues when processing non-English texts.
- Some functions may require additional corpus support.
Practical Exercise:
Try using NLTK to analyze a piece of your favorite article and count the most commonly used words!
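Here is one possible starting point that combines the steps above, assuming your article is saved as a plain-text file (the filename article.txt is just a placeholder):
python code snippet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist

# 'article.txt' is a placeholder: point it at any text file you like.
with open('article.txt', encoding='utf-8') as f:
    text = f.read()

# Tokenize, lowercase, and drop stop words and punctuation.
words = word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))
words = [w for w in words if w.isalpha() and w not in stop_words]

# Count and show the 10 most frequent meaningful words.
fdist = FreqDist(words)
print(fdist.most_common(10))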
Friends, that's it for today's Python learning journey! Remember to practice the code yourself, and feel free to ask me any questions in the comments. Wish you all happy learning, and may your Python skills improve steadily!