Practical Projects in Natural Language Processing with Python

Hello everyone! I am Hao Ge. Today I would like to share a practical and interesting topic: using Python for Natural Language Processing (NLP). NLP may sound sophisticated, but it is essentially technology that enables computers to understand and process human language. By the end of this tutorial, you will have mastered the basic skills of text analysis with Python.

1. Preparation: Install Necessary Libraries

First, we need to install several commonly used NLP libraries. We will mainly use NLTK and jieba, where NLTK is used for English processing and jieba is used for Chinese word segmentation.

python

# Install the necessary libraries first (run these commands in your terminal, not inside Python):
#   pip install nltk
#   pip install jieba

# Import the required libraries
import nltk
import jieba
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download NLTK tokenizer models and stop-word lists (only needed once)
nltk.download('punkt')
nltk.download('stopwords')

2. Text Preprocessing

Before doing any analysis, we need some basic text preprocessing. Note that jieba is designed for Chinese, but it passes English words through as whole tokens, so the example below works on English text as well:

python

import re

# Sample text
text = "Python is a simple and easy-to-learn programming language! Python is very popular in the field of artificial intelligence."

# Use jieba for word segmentation (jieba targets Chinese, but it also
# keeps English words together as single tokens)
words = jieba.lcut(text)
print("Word segmentation result:", words)

# Keep only tokens made of Chinese characters or English letters,
# which drops punctuation and whitespace
clean_words = [word for word in words if re.match(r'[\u4e00-\u9fa5a-zA-Z]+', word)]
print("After removing punctuation:", clean_words)

3. Word Frequency Statistics

Counting word frequency is a fundamental task in NLP. Let’s implement a simple word frequency counter:

python

from collections import Counter

def count_words(text):
    # Word segmentation
    words = jieba.lcut(text)
    # Count word frequency
    word_counts = Counter(words)
    # Return the 10 most common words
    return word_counts.most_common(10)

# Sample text
sample_text = "Python is one of the most popular programming languages. Python is simple and easy to learn, and Python is powerful."
result = count_words(sample_text)
print("Word frequency statistics result:", result)

4. Sentiment Analysis

Next, we will implement a simple sentiment analyzer to determine the sentiment orientation of the text:

python

def simple_sentiment_analysis(text):
    # Define sentiment dictionary (example)
    positive_words = ['like', 'good', 'great', 'excellent', 'outstanding']
    negative_words = ['hate', 'bad', 'terrible', 'failure', 'poor']
    
    # Word segmentation
    words = jieba.lcut(text)
    
    # Calculate score
    score = 0
    for word in words:
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
            
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

# Test
text1 = "This Python program is really good, truly excellent!"
text2 = "This code is badly written, really terrible."
print("Text 1 sentiment:", simple_sentiment_analysis(text1))
print("Text 2 sentiment:", simple_sentiment_analysis(text2))

Tips:

  1. When processing Chinese text, jieba is recommended; its segmentation results are reasonably accurate.
  2. Before processing text, remember to clean the data by removing unnecessary symbols and stop words.
  3. The accuracy of dictionary-based sentiment analysis depends on the coverage of the sentiment dictionary; practical applications need a much more comprehensive word list (see the sketch after these tips).
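
In practice, sentiment word lists are usually kept in external files rather than hard-coded. The sketch below loads one word per line from two hypothetical files, positive_words.txt and negative_words.txt; both the file names and the one-word-per-line format are assumptions for illustration.

python

def load_lexicon(path):
    # Hypothetical format: one word per line, UTF-8 encoded
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

# positive_words.txt and negative_words.txt are hypothetical example files
positive_words = load_lexicon('positive_words.txt')
negative_words = load_lexicon('negative_words.txt')
print(f"Loaded {len(positive_words)} positive and {len(negative_words)} negative words")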

5. Practical Project: News Text Classifier

Let’s integrate the knowledge we’ve learned to create a simple news classifier:

python

def create_news_classifier():
    # Keyword lists for each category (a simple stand-in for real training data)
    news_data = {
        'Technology': ['artificial intelligence', 'programming', 'algorithm', 'data', 'internet'],
        'Sports': ['football', 'basketball', 'match', 'sports', 'player'],
        'Finance': ['stocks', 'fund', 'investment', 'wealth management', 'market']
    }
    
    def classify_news(text):
        words = jieba.lcut(text)
        scores = {}
        
        # Calculate matching degree for each category
        for category, keywords in news_data.items():
            score = sum(1 for word in words if word in keywords)
            scores[category] = score
            
        # Return the category with the highest score; ties (including the
        # all-zero case) fall back to the first category in the dict
        return max(scores.items(), key=lambda x: x[1])[0]
    
    return classify_news

# Test classifier
classifier = create_news_classifier()
test_news = "The latest artificial intelligence algorithm has made significant breakthroughs in data processing."
print("News category:", classifier(test_news))

Friends, today’s Python learning journey ends here! Remember to code along, and feel free to ask me any questions in the comments. Natural language processing is a very interesting field, and I hope this article helps you take your first step. We will learn more interesting NLP techniques in the future, so stay tuned! Wishing everyone a pleasant learning experience and continuous success in Python learning!
