Getting Started with NLTK for Natural Language Processing

Hello everyone! I am Hao Ge. Today I want to introduce you to a super practical Python library – NLTK (Natural Language Toolkit). As a Python enthusiast, I have always found natural language processing particularly interesting. Imagine letting computers understand human language; isn’t that cool? NLTK is a powerful assistant to help us achieve this goal!

1. First Experience with NLTK

First, we need to install NLTK. Open the command line and enter:

pip install nltk

After installation, we also need to download some necessary data:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
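# Note: newer NLTK releases may also need the updated packages below;
# run these too if word_tokenize or pos_tag raises a LookupError:
# nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')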

2. Tokenization – The First Step in Text Processing

Tokenization is the most basic operation in natural language processing. Just like we need to recognize characters when learning a language, computers also need to break sentences into individual words.

from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLTK is really useful! It can help us process various texts. Let's start learning!"
# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentence tokenization result:", sentences)

# Word tokenization
words = word_tokenize(text)
print("Word tokenization result:", words)

3. Stop Words Handling

When analyzing text, words like "the", "is", and "a" often carry little substantive information on their own; we call them "stop words". NLTK ships ready-made stop word lists for many languages:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "I am learning Python natural language processing, which is really interesting!"
words = word_tokenize(text)

# Get the English stop word list (NLTK also ships lists for other languages)
stop_words = set(stopwords.words('english'))
# Compare in lowercase, since the stop word list is all lowercase
filtered_words = [word for word in words if word.lower() not in stop_words]
print("After removing stop words:", filtered_words)

4. Part-of-Speech Tagging

Part-of-speech tagging can help us understand the grammatical structure of a sentence, similar to learning verbs, nouns, and adjectives in elementary school:

from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog"
words = word_tokenize(text)
tagged = pos_tag(words)
print("Part-of-speech tagging result:", tagged)

💡 Tip:

  • NLTK’s part-of-speech tagging mainly supports English; for Chinese processing, consider using libraries like jieba.
  • In part-of-speech tagging results, NN marks nouns, VB marks verbs, and JJ marks adjectives; see below for a quick way to look up any tag.
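
If you forget what a tag means, NLTK can print its definition for you (this needs one extra data package, the 'tagsets' download):

import nltk
nltk.download('tagsets')

# Look up the meaning of the JJ (adjective) tag, with examples
nltk.help.upenn_tagset('JJ')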

5. Text Frequency Analysis

Let’s see how to count the frequency of words appearing in the text:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "Python is amazing! Python is powerful! I love Python programming!"
words = word_tokenize(text)
fdist = FreqDist(words)

# Check the most common words
print("Word frequency statistics:", fdist.most_common(3))

# Draw the word frequency distribution chart
import matplotlib.pyplot as plt
fdist.plot(10)  # plot up to the 10 most common tokens (all of them, in this tiny example)
plt.show()
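
One thing to watch out for: word_tokenize keeps punctuation, so "!" is counted as a token too. A minimal sketch for counting only the words (assuming you just want alphabetic tokens) looks like this:

# Keep only alphabetic tokens so punctuation such as "!" is not counted
word_only = [w for w in words if w.isalpha()]
fdist_words = FreqDist(word_only)
print(fdist_words.most_common(3))  # e.g. [('Python', 3), ('is', 2), ('amazing', 1)]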

6. Practical Tips

When using NLTK, there are a few points to pay special attention to:

  1. When processing Chinese, it is recommended to first use jieba for tokenization and then hand the tokens to NLTK for subsequent processing (see the sketch after this list).
  2. When downloading data packages, you may encounter network issues; consider using mirror sources.
  3. When processing large-scale texts, be mindful of memory usage.
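
For point 1, here is a minimal sketch of the jieba-then-NLTK pipeline (it assumes jieba is installed via pip install jieba; the Chinese sentence is just a stand-in example):

import jieba
from nltk import FreqDist

text = "我爱自然语言处理，自然语言处理很有趣"  # "I love NLP, NLP is fun"
# jieba handles the Chinese word segmentation...
words = jieba.lcut(text)
# ...and NLTK takes over from there, e.g. for frequency analysis
fdist = FreqDist(words)
print(fdist.most_common(3))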

Practice Task

Try to complete this small task:

  1. Count the number of adjectives in a piece of text.
  2. Find the five most commonly used nouns.
  3. Split a passage into sentences and calculate the average sentence length.
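
If you get stuck, here is one possible solution sketch (not the only way; it assumes the English data packages from section 1 and uses the Penn Treebank tag prefixes JJ for adjectives and NN for nouns):

from nltk import pos_tag, FreqDist
from nltk.tokenize import word_tokenize, sent_tokenize

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# 1. Count the adjectives (all adjective tags start with 'JJ')
tagged = pos_tag(word_tokenize(text))
print("Number of adjectives:", sum(1 for _, tag in tagged if tag.startswith('JJ')))

# 2. Find the five most common nouns (noun tags start with 'NN')
nouns = FreqDist(word for word, tag in tagged if tag.startswith('NN'))
print("Top nouns:", nouns.most_common(5))

# 3. Split into sentences and compute the average length in tokens
sentences = sent_tokenize(text)
avg_len = sum(len(word_tokenize(s)) for s in sentences) / len(sentences)
print("Average sentence length:", avg_len)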

Friends, that's it for today's Python learning journey! Remember to practice the code yourself, and feel free to ask me in the comments if you have any questions. Happy learning, and keep making progress with Python!