Hello everyone! I am Hao Ge. Today, I would like to share a practical and interesting topic: using Python for Natural Language Processing (NLP). NLP sounds sophisticated, but it is essentially a technology that enables computers to understand and process human languages. By the end of this tutorial, you will master the basic skills of text analysis using Python.
1. Preparation: Install Necessary Libraries
First, we need to install several commonly used NLP libraries. We will mainly use NLTK and jieba, where NLTK is used for English processing and jieba is used for Chinese word segmentation.
```python
# Install the necessary libraries first (run these in a terminal, not inside Python):
#   pip install nltk
#   pip install jieba

# Import required libraries
import nltk
import jieba
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download NLTK's tokenizer models and stop-word lists (one-time setup)
nltk.download('punkt')
nltk.download('stopwords')
```
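Note that the `nltk.download` calls need network access the first time. If the `punkt` data is unavailable, a minimal regex tokenizer can stand in for `word_tokenize` while you experiment — this is a rough sketch, not NLTK's actual algorithm:

```python
import re

def simple_tokenize(text):
    # Split into word tokens (letters, digits, apostrophes) and standalone punctuation
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9]", text)

print(simple_tokenize("Python is fun, isn't it?"))
# → ['Python', 'is', 'fun', ',', "isn't", 'it', '?']
```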
2. Text Preprocessing
Before performing natural language processing, we need to do some basic text processing. Here’s a simple example:
```python
# Sample text
text = "Python is a simple and easy-to-learn programming language! Python is very popular in the field of artificial intelligence."

# Use jieba for word segmentation (jieba is designed for Chinese, but it also
# splits space-delimited English text into word and whitespace tokens)
words = jieba.lcut(text)
print("Word segmentation result:", words)

# Remove punctuation and whitespace tokens, keeping Chinese characters and letters
import re
clean_words = [word for word in words if re.match(r'[\u4e00-\u9fa5a-zA-Z]+', word)]
print("After removing punctuation:", clean_words)
```
3. Word Frequency Statistics
Counting word frequency is a fundamental task in NLP. Let’s implement a simple word frequency counter:
```python
from collections import Counter

def count_words(text):
    # Word segmentation
    words = jieba.lcut(text)
    # Count word frequency
    word_counts = Counter(words)
    # Return the 10 most common words
    return word_counts.most_common(10)

# Sample text
sample_text = "Python is one of the most popular programming languages. Python is simple and easy to learn, and Python is powerful."
result = count_words(sample_text)
print("Word frequency statistics result:", result)
```
4. Sentiment Analysis
Next, we will implement a simple sentiment analyzer to determine the sentiment orientation of the text:
```python
def simple_sentiment_analysis(text):
    # Define sentiment dictionary (toy example)
    positive_words = ['like', 'good', 'great', 'excellent', 'outstanding']
    negative_words = ['hate', 'bad', 'terrible', 'failure', 'poor']
    # Word segmentation
    words = jieba.lcut(text)
    # Calculate score: +1 per positive word, -1 per negative word
    score = 0
    for word in words:
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

# Test
text1 = "This Python program is really good, very excellent!"
text2 = "This code is written too poorly, very terrible."
print("Text 1 sentiment:", simple_sentiment_analysis(text1))
print("Text 2 sentiment:", simple_sentiment_analysis(text2))
```
Tips:
- When processing Chinese text, jieba is recommended, as its segmentation is relatively accurate.
- Before processing text, remember to clean the data by removing unnecessary symbols and stop words.
- The accuracy of sentiment analysis depends on the coverage of the sentiment dictionary; practical applications need a far more comprehensive one.
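On the stop-word tip: the idea is simply to drop high-frequency function words before counting or classifying. A minimal sketch with a tiny hand-written list — real projects would use `nltk.corpus.stopwords` for English or a published Chinese stop-word list:

```python
# Tiny hand-written stop-word list, for illustration only
STOP_WORDS = {'is', 'a', 'the', 'and', 'in', 'of'}

def remove_stop_words(words):
    # Keep only tokens that are not stop words (case-insensitive)
    return [w for w in words if w.lower() not in STOP_WORDS]

print(remove_stop_words(['Python', 'is', 'the', 'best']))  # → ['Python', 'best']
```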
5. Practical Project: News Text Classifier
Let’s integrate the knowledge we’ve learned to create a simple news classifier:
```python
def create_news_classifier():
    # Example keyword lists per category (a stand-in for real training data).
    # Note: multi-word keywords such as 'artificial intelligence' will not match
    # single segmentation tokens; the single-word keywords carry the matching.
    news_data = {
        'Technology': ['artificial intelligence', 'programming', 'algorithm', 'data', 'internet'],
        'Sports': ['football', 'basketball', 'match', 'sports', 'player'],
        'Finance': ['stocks', 'fund', 'investment', 'wealth management', 'market']
    }

    def classify_news(text):
        words = jieba.lcut(text)
        scores = {}
        # Count keyword matches for each category
        for category, keywords in news_data.items():
            score = sum(1 for word in words if word in keywords)
            scores[category] = score
        # Return the category with the highest score
        return max(scores.items(), key=lambda x: x[1])[0]

    return classify_news

# Test the classifier
classifier = create_news_classifier()
test_news = "The latest artificial intelligence algorithm has made significant breakthroughs in data processing."
print("News category:", classifier(test_news))
```
Friends, today’s Python learning journey ends here! Remember to code along, and feel free to ask me any questions in the comments. Natural language processing is a very interesting field, and I hope this article helps you take your first step. We will learn more interesting NLP techniques in the future, so stay tuned! Wishing everyone a pleasant learning experience and continuous success in Python learning!