Reprinted from the public account: Deep Learning Natural Language Processing
Introduction
How do people browse a large amount of text and quickly pick out the information they need? The answer is, of course, through keywords. Think about it: how do we extract keywords? We have been exposed to language and grammar since childhood. When we hear or see a sentence, our brain automatically segments it according to rules (didn't we practice dividing sentences into their components in elementary school?), and we remember our language teacher telling us that the subject (noun), predicate (verb), and object (noun) are usually the key points. In other words, our brains have been labeling the words in sentences by part of speech and grammar since childhood, training classifiers; and as we are exposed to more and more corpora, those classifiers become increasingly accurate (if you work in linguistics, your classifiers are even more precise). However, relying solely on part of speech and grammar runs into a problem with long texts: an article can contain many subjects, predicates, and objects, and not all of those words can be keywords. How does our brain handle this? If we are very familiar with an article's background and theme, we can extract its keywords accurately; but when we encounter a relatively unfamiliar article, we often find it hard to do so.
Algorithm
The above corresponds to two approaches in machine learning: supervised and unsupervised learning. Supervised keyword extraction treats extraction as classification and requires labeled data to train the classifier; the downside is that it needs a large amount of labeled data, so the labor cost is high. Unsupervised methods, by contrast, need no labeled data. Common unsupervised keyword extraction algorithms include TF-IDF, TextRank, and topic-model algorithms (LDA, LSA, LSI). Here we focus on the LDA algorithm; the others will be discussed later.
I don't like to use a lot of difficult academic terms, so I will explain the principle of the LDA algorithm in plain language. We could define a topic simply as a collection of keywords: if an article contains these keywords, we declare that it belongs to that topic. But this definition has a flaw. For example, if an article mentions the name of a sports star, we might conclude that its topic is sports. You could immediately object that this is not certain: the article does mention a basketball star, but it is entirely about the star's scandals and has nothing to do with basketball; in that case the topic is really entertainment. So a single word cannot rigidly define a topic. If an article mentions a certain sports star, we can only say it has a high probability of belonging to the sports topic and a small probability of belonging to the entertainment topic. The same word occurs with different probabilities under different topics. LDA holds that articles are composed of basic vocabulary, and it characterizes topics through probability distributions over that vocabulary!
Thus, we can define the generative process of LDA:
1. For each word position in a document, draw a topic from the document's topic distribution.
2. Draw a word from the word distribution of the topic just drawn.
3. Repeat this process until every word in the document has been generated.
4. After these three steps, we can check whether the product of the two distributions matches the word distribution of the given article, and adjust the distributions accordingly.
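To make the process concrete, here is a minimal toy sketch of this generative story in Python. The two distributions below are made up purely for illustration and have nothing to do with the model trained later in this post:

import random

# Hypothetical P(topic | document) for a single document (illustrative values only)
doc_topic = {'sports': 0.7, 'entertainment': 0.3}
# Hypothetical P(word | topic) for each topic (illustrative values only)
topic_word = {
    'sports': {'basketball': 0.5, 'star': 0.3, 'scandal': 0.2},
    'entertainment': {'scandal': 0.6, 'star': 0.3, 'basketball': 0.1},
}

def generate_document(num_words=10):
    words = []
    for _ in range(num_words):
        # 1. Draw a topic from the document's topic distribution
        topic = random.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        # 2. Draw a word from that topic's word distribution
        word_dist = topic_word[topic]
        words.append(random.choices(list(word_dist), weights=list(word_dist.values()))[0])
    return words

print(generate_document())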
The training of LDA generates the document-topic distribution matrix and topic-word distribution matrix based on the existing dataset.
Thus, the core of LDA is actually this formula:
P(word | document) = Σ_topic P(word | topic) × P(topic | document)
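As a tiny numeric illustration of this decomposition (all numbers below are made up for illustration, not learned from any data):

import numpy as np

# Hypothetical 2-topic model: P(topic | document) for one document
p_topic_given_doc = np.array([0.7, 0.3])
# P('basketball' | topic) for each of the two topics
p_word_given_topic = np.array([0.5, 0.1])

# P('basketball' | document) = sum over topics of P(word | topic) * P(topic | document)
p_word_given_doc = np.dot(p_word_given_topic, p_topic_given_doc)
print(p_word_given_doc)  # 0.5 * 0.7 + 0.1 * 0.3 = 0.38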
Practical Exercise
Having said so much, let’s implement it with code. Gensim has a well-implemented training method that can be called directly. Gensim is an open-source third-party Python toolkit used for unsupervised learning of topic vector representations from raw unstructured text.
Training a keyword extraction algorithm requires the following steps:
- Load the existing document dataset.
- Load the stopword list.
- Segment the documents in the dataset.
- Filter out interference words based on the stopword list.
- Train the algorithm on the training set.
(Many blogs use jieba for segmentation, but I personally find jieba's segmentation not accurate enough; if the segmentation itself is inaccurate, how can the extracted keywords be accurate?) I use the perceptron segmenter from pyhanlp instead, which in my practical experience has been the most accurate.
a. Import relevant libraries
import math
import numpy as np
from pyhanlp import *
import functools
from gensim import corpora, models
b. Define the method to load the stopword list
def get_stopword_list():
    # The stopword file has one word per line; strip the trailing newline from each entry
    stop_word_path = 'stopwords.txt'
    stopword_list = [sw.replace('\n', '') for sw in open(stop_word_path, encoding='utf-8').readlines()]
    return stopword_list
c. Define a segmentation method
def seg_to_list(sentence, pos=False):
    seg_list = HanLP.newSegment("perceptron").seg(sentence)
    return seg_list
d. Define the interference word filtering method: filter interference words based on segmentation results
def word_filter(seg_list, pos=False):
    stopword_list = get_stopword_list()
    filter_list = [str(s.word) for s in seg_list if s.word not in stopword_list and len(s.word) > 1]
    return filter_list
e. Load the dataset, segment the data in the dataset, and filter out interference words, resulting in a list of terms composed of non-interference words for each text
def load_data(pos=False):
    # Treat each line of corpus.txt as one document: segment it and filter out interference words
    doc_list = []
    for line in open('corpus.txt', 'r', encoding='utf-8'):
        content = line.strip()
        seg_list = seg_to_list(content, pos)
        filter_list = word_filter(seg_list, pos)
        doc_list.append(filter_list)
    return doc_list
f. Train the LDA model
# doc_list: result of the load dataset method
# keyword_num: number of keywords
# model: specific algorithm of the topic model
# num_topics: number of topics in the topic model
class TopicModel(object):
    def __init__(self, doc_list, keyword_num, model='LDA', num_topics=4):
        # Use gensim's interface to convert text into a vectorized representation
        self.dictionary = corpora.Dictionary(doc_list)
        # Vectorize using the BOW model
        corpus = [self.dictionary.doc2bow(doc) for doc in doc_list]
        # Weight each word with TF-IDF to obtain a weighted vector representation
        self.tfidf_model = models.TfidfModel(corpus)
        self.corpus_tfidf = self.tfidf_model[corpus]
        self.keyword_num = keyword_num
        self.num_topics = num_topics
        self.model = self.train_lda()
        # Obtain the topic-word distribution of the dataset
        word_dic = self.word_dictionary(doc_list)
        self.wordtopic_dic = self.get_wordtopic(word_dic)
    def train_lda(self):
        lda = models.LdaModel(self.corpus_tfidf, num_topics=self.num_topics, id2word=self.dictionary)
        return lda
    def get_wordtopic(self, word_dic):
        wordtopic_dic = {}
        for word in word_dic:
            single_list = [word]
            wordcorpus = self.tfidf_model[self.dictionary.doc2bow(single_list)]
            wordtopic = self.model[wordcorpus]
            wordtopic_dic[word] = wordtopic
        return wordtopic_dic
    def get_simword(self, word_list):
        sentcorpus = self.tfidf_model[self.dictionary.doc2bow(word_list)]
        senttopic = self.model[sentcorpus]

        # Calculate cosine similarity between two topic distributions
        def calsim(l1, l2):
            a, b, c = 0.0, 0.0, 0.0
            for t1, t2 in zip(l1, l2):
                x1 = t1[1]
                x2 = t2[1]
                a += x1 * x2
                b += x1 * x1
                c += x2 * x2
            sim = a / math.sqrt(b * c) if not (b * c) == 0 else 0.0
            return sim

        # Rank candidate words by how similar their topic distribution is to the document's
        sim_dic = {}
        for k, v in self.wordtopic_dic.items():
            if k not in word_list:
                continue
            sim = calsim(v, senttopic)
            sim_dic[k] = sim
        for k, v in sorted(sim_dic.items(), key=lambda item: item[1], reverse=True)[:self.keyword_num]:
            print(k + "/", end='')
        print()

    # Word space construction and vectorization methods, generally used when there is no gensim interface
    def word_dictionary(self, doc_list):
        dictionary = []
        for doc in doc_list:
            dictionary.extend(doc)
        dictionary = list(set(dictionary))
        return dictionary

    def doc2bowvec(self, word_list):
        vec_list = [1 if word in word_list else 0 for word in self.dictionary]
        return vec_list
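The main block in the next step calls a topic_extract helper that is never defined in this post. Below is a minimal sketch of what it presumably does, simply wiring together the functions and the TopicModel class above; the name and signature are inferred from the call in the main block:

def topic_extract(word_list, model, pos=False, keyword_num=10):
    # Load and preprocess the training corpus, train the chosen topic model,
    # then print the keyword_num candidate words whose topic distributions
    # are most similar to that of the target text
    doc_list = load_data(pos)
    topic_model = TopicModel(doc_list, keyword_num, model=model)
    topic_model.get_simword(word_list)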
g. Call the main function to perform keyword extraction on the target text
if __name__ == '__main__':
    text = 'At the meeting, the China Social Assistance Foundation signed a contract with the "2nd China Caring Cities Conference" organizer, Jinjiang City. Chairman Xu Jialu accepted a donation of goods worth 4 million yuan for the "Million Elderly Care Action" aimed at national key poverty alleviation areas.'
    pos = False
    seg_list = seg_to_list(text, pos)
    filter_list = word_filter(seg_list, pos)
    print('LDA model results:')
    topic_extract(filter_list, 'LDA', pos)
LDA Model Results:
Key Points/Xu Jialu/Action/Contract/Million/Chairman/Caring/Goods/Jinjiang City/Accepted/
Overall, the results are quite accurate.