This article is authored by Wang Siyu, School of Finance, Zhongnan University of Economics and Law
Edited by: He Jiang
Technical Editor: Wen Heming
Stata and Python Data Analysis
The Crawling Club will hold custom training on Stata programming techniques and Python programming techniques from August 1 to 9, 2023 at Henan University (Kaifeng, Henan), with live streaming on online platforms to provide online learning opportunities. Both the online and offline training will have a dedicated Q&A team. If you are interested, please click the tweet link "Crawling Club 2023 First Programming Camp Registration is Open!" or click "read the original text" at the end of this article for course details and registration methods!

Introduction

NLTK (Natural Language Toolkit) is a Python library for natural language processing (NLP). It provides a range of tools and resources for processing human language data, including text preprocessing, language models, language analysis, classification, tagging, corpora, and language datasets.

Installation

pip install nltk
Next, you can use the nltk.download() function to download the required corpora and model files. If an error occurs during execution, you can visit https://github.com/nltk/nltk_data, download the files in the packages folder, and copy all of its subfolders into one of the directories that NLTK searches for data (listed in nltk.data.path) to resolve the issue.
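As a sketch of that step, the resources used later in this article can be fetched in one loop. nltk.download() is effectively a no-op when the data is already present, and the print at the end shows the folders where a manual copy of the nltk_data files would go:

```python
import nltk

# Resources used in the examples below; the try/except covers the case
# where the download fails (e.g. no network) and the manual copy from
# github.com/nltk/nltk_data is needed instead.
resources = ["punkt", "stopwords", "averaged_perceptron_tagger", "vader_lexicon"]
for name in resources:
    try:
        nltk.download(name, quiet=True)
    except Exception as err:
        print(f"could not download {name}: {err}")

print(nltk.data.path)  # the folders NLTK searches for downloaded data
```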

Specific Applications

Text Processing
Using the functions and tools provided by NLTK, text preprocessing can be performed, mainly including tokenization, stopword removal, and part-of-speech tagging. Below, we will detail the workflow of text processing using NLTK tools.
First, import the modules and functions, and input the text to be processed into Python:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag # Import modules and functions
text = "I really enjoy the movie called Lost In The Stars. It is amazing."
Next, process the imported text and view the preprocessed text data and results:
tokens = word_tokenize(text) # Tokenization
print(tokens)
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token.casefold() not in stop_words] # Remove stopwords
tags = pos_tag(tokens) # Part-of-speech tagging
print(tags)
Here, RB represents adverb, VB represents verb base form, NN represents noun, VBN represents verb past participle, NNP represents proper noun (singular), and . represents punctuation.
After basic text processing, NLTK can be used for noun phrase information extraction. First, we define the grammar rules:
from nltk import RegexpParser
grammar = r"""
NP: {<DT>?<JJ>*<NN.*>+}
""" # Define grammar rules
Here, <DT> represents determiners, <JJ> represents adjectives, and <NN.*> matches any noun tag (NN, NNS, NNP, NNPS). This rule states that a noun phrase consists of an optional determiner, any number of adjectives, and one or more nouns.
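As an aside, RegexpParser matches this pattern against the sequence of POS tags rather than the words themselves. The idea can be sketched with Python's re module on a hypothetical tag sequence similar to the one tagged above (this mimics the chunking rule, it is not how NLTK is invoked):

```python
import re

# The NP rule operates on a string of angle-bracketed POS tags; <NN.*>
# is written as <NN[^>]*> here so the wildcard cannot cross a tag boundary.
tag_string = "<RB><VB><DT><JJ><NN><VBN><NNP><NNP>"
np_rule = r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+"
noun_phrase_spans = [m.group() for m in re.finditer(np_rule, tag_string)]
print(noun_phrase_spans)  # -> ['<DT><JJ><NN>', '<NNP><NNP>']
```

The two matched spans correspond to a "determiner + adjective + noun" phrase and a run of proper nouns, which is exactly what the NP rule is designed to capture.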
Next, we create a regular expression chunker, use it to parse the text, and combine them into a string as the extracted noun phrases:
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(tags)
noun_phrases = []
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        np = ' '.join([word for word, tag in subtree.leaves()])
        noun_phrases.append(np)
print("Extracted noun phrases:", noun_phrases)
Finally, the extracted noun phrases are “movie”, “Lost Stars”, and “amazing”.
Sentiment Analysis
NLTK can also be used for sentiment analysis of text. First, import the NLTK-provided sentiment analyzer SentimentIntensityAnalyzer (which relies on the VADER sentiment lexicon, downloadable as vader_lexicon) and create a sentiment analysis object:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
Next, perform sentiment analysis on the text using the analyzer's polarity_scores() method and view the results:
scores = analyzer.polarity_scores(text)
print(scores)
Here, 'neg' represents the score for negative sentiment, 'neu' the score for neutral sentiment, 'pos' the score for positive sentiment, and 'compound' the overall sentiment score, normalized to the range [-1, 1]. In this example, the compound score is 0.7178, indicating that the text is positive.
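A small hypothetical helper can turn the compound score into a coarse label. The ±0.05 cut-offs below are the thresholds conventionally suggested for VADER, not an NLTK API:

```python
# Hypothetical helper: map a VADER compound score to a coarse label.
# The +/-0.05 thresholds are the conventional cut-offs, not an NLTK API.
def label_sentiment(compound):
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label_sentiment(0.7178))  # the compound score from the example -> positive
```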
Natural Language Generation
NLTK also provides some tools and techniques for natural language generation, including rule-based methods, template methods, and machine learning-based methods. Below, we will generate sentences based on a template method, first defining a sentence generation template:
template = "The {adjective} {noun} {verb} {adverb}."
Next, define a function that randomly selects a word from the given vocabulary lists and inserts it into the template to generate a complete sentence:
import random
def generate_sentence(template, adjectives, nouns, verbs, adverbs):
    adjective = random.choice(adjectives)
    noun = random.choice(nouns)
    verb = random.choice(verbs)
    adverb = random.choice(adverbs)
    sentence = template.format(adjective=adjective, noun=noun, verb=verb, adverb=adverb)
    return sentence
Finally, by customizing vocabulary lists and calling the function, we can generate sentences:
adjectives = ["happy", "sad", "angry", "excited", "calm", "crazy", "brave", "shy", "smart", "funny"]
nouns = ["cat", "dog", "book", "computer", "phone", "car", "tree", "house", "city", "music"]
verbs = ["runs", "jumps", "reads", "writes", "talks", "listens", "dances", "learns", "thinks", "dreams"]
adverbs = ["quickly", "slowly", "loudly", "quietly", "happily", "sadly", "angrily", "excitedly", "calmly", "crazily"]
sentence = generate_sentence(template, adjectives, nouns, verbs, adverbs)
print(sentence)
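Because the output depends on random.choice, each run prints a different sentence. Fixing the seed makes the sketch reproducible, as in this small, hypothetical variation of the function above:

```python
import random

template = "The {adjective} {noun} {verb} {adverb}."

def generate_sentence(template, adjectives, nouns, verbs, adverbs):
    # Same logic as above, condensed into a single format() call.
    return template.format(
        adjective=random.choice(adjectives),
        noun=random.choice(nouns),
        verb=random.choice(verbs),
        adverb=random.choice(adverbs),
    )

random.seed(42)  # fix the seed so repeated runs print the same sentence
sentence = generate_sentence(
    template,
    ["happy", "sad"], ["cat", "dog"], ["runs", "jumps"], ["quickly", "slowly"],
)
print(sentence)
```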
This way, a random sentence such as "The happy cat runs quickly." is generated from a simple template and a few vocabulary lists.
Above are some of the basic uses of NLTK; give them a try!

END

Exciting Benefits! In order to better serve the research needs of students, the Crawling Club will continue to provide various indicators required for financial research on the Xiaoetong platform, including top ten shareholders of listed companies, stock price crashes, investment efficiency, financing constraints, corporate tax avoidance, analyst tracking, return on equity, return on assets, the Big Four audits, Tobin's Q, the first largest shareholder's shareholding ratio, book-to-market ratio, commonly used control variables for A-share listed companies, and other series of deep-processed data. Based on the data disclosed by various exchanges, we will use Stata to achieve real-time data updates while continuously launching more data indicators. We hope to support everyone's research work with the most cutting-edge data processing technology, the best service quality, and the greatest sincerity! For related data, please visit https://appbqiqpzi66527.h5.xiaoeknow.com/homepage/10 or scan the QR code below:

If we receive more than 1000 yuan in cumulative rewards for our tweets, we can issue you an invoice, with the invoice category as “consultation fee”. We are committed to doing our best and appreciate your support!
XML Easily read: Discover data treasures with Python
Crawling Club Launches cnstata.com.cn
Markdown: Making Mathematical Formula Input More Convenient!
What’s new? Quick access to Stata 18
WeChat official account “Stata and Python Data Analysis” shares practical knowledge of data processing with Stata, Python, and other software. Reprints and rewards are welcome. We are a big data processing and analysis team composed of graduate and undergraduate students under the leadership of Professor Li Chuntao.
Wuhan String Data Technology Co., Ltd. has been providing data collection and analysis services to users. If you have such needs, please email [email protected], or directly contact our data platform chief engineer Mr. Si Haitao, phone: 18203668525, WeChat: super4ht. Mr. Si has long been engaged in research at the University of Hong Kong and is currently a doctoral student at a well-known 985 university, responsible for the courses on web crawling technology and regular expressions at the Crawling Club.
In addition, we welcome everyone to actively contribute articles introducing some data processing and analysis techniques related to Stata and Python.
