Source | Wujie Community Mixlab
Editor | An Ke
【PanChuang AI Introduction】: Lately everyone's feed has been flooded with the hit Qing Dynasty drama “Story of Yanxi Palace”. Its male lead, Emperor Qianlong, is often called a “big pig’s hoof” (slang for a fickle man) because he seems to fall for every woman he meets. As simple programmers, how can we untangle the drama's complex character relationships? This article briefly shows how to use word2vec to interpret the character relationships in “Story of Yanxi Palace”, and lets the technical analysis tell us whether Qian Xiaosi (a nickname for Qianlong) really is a “big pig’s hoof”. If you are not yet familiar with word2vec, we introduced it in an earlier article, “Training Word Vectors Based on Word2Vec”.
Reading Difficulty: ★★☆☆☆
Skill Requirements: Machine Learning, Python, Word Segmentation, Data Visualization
Word Count: 1500 words
Reading Time: 6 minutes
This article uses the recently popular TV series “Story of Yanxi Palace” as a case study and interprets its character relationships through data. We collected related novels, scripts, character introductions, and similar material from the internet, trained a word2vec model on them, built a character relationship graph from the result, and displayed it visually.
1
Graph
First, let’s take a look at the character relationship graph of the entire drama:
The larger a node is and the closer it sits to the center, the more complex that character's relationships with the other characters in the drama.
Let’s zoom in and observe:
We can find that:
Yingluo, Erqing, Jixiang, Mingyu, Jinxiu, Fu Heng
These six characters are key to driving the plot of the entire drama. From this perspective, “Story of Yanxi Palace” is a story about multiple palace maids and a guard.
This graph also shows the degree of association between each character and others.
Now let’s look at Qianlong’s relationships with others:
Of course, we can also query through code:
Try the interactive graph here:
https://shadowcz007.github.io/text2kg/
How was the above graph obtained?
2
Construction Ideas
Required data:
Story of Yanxi Palace novel
Story of Yanxi Palace script
Story of Yanxi Palace character names
Algorithm:
word2vec
Frontend:
ECharts
Development Environment:
Python
When processing the data, we need to remove punctuation marks and some useless words (e.g., chapter headings). We then run one pass of word segmentation with Jieba and remove tokens of length 1 (e.g., modal particles, quantifiers, etc.).
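A minimal sketch of this preprocessing step, not the script actually used for the article. The chapter-heading pattern, the function names, and the idea of registering character names through jieba.add_word are assumptions made for illustration. The segmenter is passed in as a parameter so the filtering logic can be tried without jieba installed.

```python
import re

# Assumed pattern for chapter headings such as "第十二章"; adjust to the real corpus.
CHAPTER_RE = re.compile(r"第[0-9零一二三四五六七八九十百千两]+[章回集]")
# Anything that is not a CJK character, ASCII letter, or digit is treated as punctuation.
NON_WORD_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def clean_line(line):
    """Drop chapter headings and replace punctuation runs with single spaces."""
    line = CHAPTER_RE.sub(" ", line)
    return NON_WORD_RE.sub(" ", line).strip()


def jieba_cutter(names=()):
    """Return jieba's list-producing cut function (requires: pip install jieba)."""
    import jieba
    for name in names:
        jieba.add_word(name)  # assumption: keep character names as single tokens
    return jieba.lcut


def tokenize(line, cut):
    """Segment one line with `cut` and drop tokens of length 1."""
    tokens = []
    for chunk in clean_line(line).split():
        tokens.extend(tok for tok in cut(chunk) if len(tok) > 1)
    return tokens
```

With jieba installed, tokenize(line, jieba_cutter(["璎珞", "傅恒"])) would return the tokens for one line, and collecting these lists over the whole corpus gives the sentences used for training.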
Finally processed into:
Once the data is ready, we use gensim for word2vec training. Gensim is a Python NLP package whose word2vec implementation was ported from Google’s original C code. Installing it is easy: just run “pip install gensim”.
3
word2vec
Word2vec is a technique for learning word embeddings: it converts words in natural language into dense vectors that a computer can work with. The relationships between words after conversion into vectors are shown in the following diagram:
Word2vec learns the relationships between words from a simple principle: related words tend to appear together, or in similar contexts, in text. Let’s look at the following diagram:
As the diagram shows, word2vec can learn various interesting relationships. For example, the word “king” often appears with “queen”, while “man” often appears with “woman”.
Through word2vec analysis, we can find that the vector for “king” is related in a simple, approximate way to the vectors for “queen”, “man”, and “woman”:
king ≈ queen - woman + man
Through the transformation from words to vectors, we can perform various operations based on vectors.
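To make the vector arithmetic concrete, here is a toy sketch. The 2-D vectors are made up by hand (one axis loosely standing for "royalty", the other for "gender") and are not the output of any trained model; the function names are also ours.

```python
import math

# Made-up 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
vectors = {
    "king":  (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1, 0.9),
    "woman": (0.1, -0.9),
    "apple": (-0.5, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    candidates = (w for w in vectors if w not in positive and w not in negative)
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy(positive=["queen", "man"], negative=["woman"])
print(result)
```

On a trained gensim model, the analogous query would be model.wv.most_similar(positive=['queen', 'man'], negative=['woman']).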
Beyond linguistics, the same idea has been applied in chemistry. Atom2Vec, for example, learns to distinguish atoms from the names of compounds formed by different combinations of elements (e.g., NaCl, KCl, H2O), which can help discover potential new compounds. The program borrows a simple concept from natural language processing:
The properties of a word can be inferred from the other words that appear around it; correspondingly, chemical elements can be clustered based on their chemical environments.
From this data, the AI program can discover that potassium and sodium have similar properties because both combine with halogens to form compounds: “just like king and queen are similar, potassium and sodium are also similar.”
The trained model can take compounds composed of different atoms as input for various vector operations, helping us discover new compounds.
4
Gensim Word2Vec Implementation
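You can start training the model with a single call. Here line_sent is the training corpus, an iterable in which each sentence is a list of tokens. Note that gensim 4.0 renamed the size parameter to vector_size; on gensim 3.x, keep size=100:
from gensim.models import Word2Vec
model = Word2Vec(line_sent, vector_size=100, window=5, min_count=1)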
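To give a feel for what the window and min_count arguments control, the sketch below generates (center, context) pairs from tokenized sentences, in the style of the skip-gram variant of word2vec. It is a simplification under our own assumptions, not gensim's training loop: gensim defaults to the CBOW variant and, among other things, samples a smaller effective window per word. The function name and toy sentences are hypothetical.

```python
from collections import Counter


def training_pairs(sentences, window=5, min_count=1):
    """Yield (center, context) pairs from tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    for sent in sentences:
        # min_count: ignore words that occur fewer than min_count times overall.
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, center in enumerate(kept):
            # window: context words lie at most `window` positions from the center.
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]


toy = [["璎珞", "皇后", "傅恒"], ["璎珞", "弘历"]]
pairs = list(training_pairs(toy, window=1, min_count=1))
print(pairs)
```

A larger window ties each word to more distant neighbors, while a higher min_count drops rare words before any pairs are formed. With min_count=1, as in the call above, every token is kept.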
After training, you can query for the words whose vectors are closest to a given word, here 璎珞 (Yingluo):
model.wv.similar_by_word('璎珞', topn=10)
You can also compute the similarity between two word vectors. Here are the similarities for two pairs of characters in the drama, 璎珞 (Yingluo) with 尔晴 (Erqing), and 皇后 (the Empress) with 弘历 (Hongli, i.e. Qianlong):
print(model.wv.similarity('璎珞', '尔晴'))
print(model.wv.similarity('皇后', '弘历'))
Similarity:
0.9175463897110617
0.8206695311318175
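Pairwise similarities like these are the kind of signal a relationship graph can be assembled from. The article does not spell out how the edges were chosen, so the following is only a sketch under our own assumptions: a link is drawn when the similarity passes a threshold (0.8 here), and a node grows with its number of links. The function name, the threshold, the size formula, and the third toy score are all hypothetical; the first two scores are the results above, rounded.

```python
import json
from itertools import combinations


def build_graph(names, similarity, threshold=0.8):
    """Build graph data: one link for each pair scoring at least `threshold`."""
    links = []
    degree = {name: 0 for name in names}
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            links.append({"source": a, "target": b, "value": round(score, 3)})
            degree[a] += 1
            degree[b] += 1
    # Better-connected characters get larger nodes.
    nodes = [{"name": n, "symbolSize": 10 + 5 * degree[n]} for n in names]
    return {"nodes": nodes, "links": links}


# Toy lookup standing in for model.wv.similarity; the third score is invented.
toy_scores = {
    frozenset(["璎珞", "尔晴"]): 0.92,
    frozenset(["皇后", "弘历"]): 0.82,
    frozenset(["璎珞", "弘历"]): 0.50,
}
graph = build_graph(
    ["璎珞", "尔晴", "皇后", "弘历"],
    lambda a, b: toy_scores.get(frozenset([a, b]), 0.0),
)
print(json.dumps(graph, ensure_ascii=False, indent=2))
```

The nodes and links lists follow the shape that an ECharts graph series accepts as its data and links options.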
You can also ask which word in a group does not belong with the others. Here is how that sorts the characters:
model.wv.doesnt_match("璎珞 皇后 弘历 傅恒 尔晴".split())
Result: 弘历
The result makes sense: 弘历 (Hongli) is the emperor, so he is not in the same category as the others.
Let’s look at another group:
model.wv.doesnt_match("璎珞 皇后 傅恒 尔晴".split())
Result: 傅恒
傅恒 (Fu Heng) is the only man in this group, so he too stands apart from the others.
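Roughly speaking, doesnt_match averages the unit-normalized vectors of the given words and returns the word least similar to that average. The sketch below follows that idea in plain Python; it is our reading of the behavior, not gensim's code, and the 2-D vectors are invented for illustration rather than taken from the trained model.

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def doesnt_match(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Invented 2-D vectors for illustration only.
toy_vectors = {
    "璎珞": (0.9, 0.3),
    "皇后": (0.8, 0.4),
    "尔晴": (0.9, 0.2),
    "傅恒": (0.2, 0.9),
}
odd_one = doesnt_match(["璎珞", "皇后", "傅恒", "尔晴"], toy_vectors)
print(odd_one)
```

Because the answer depends only on distances between vectors, the model has no notion of "emperor" or "male"; those labels are our interpretation of why a vector ends up far from the rest.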
That is the full walkthrough. Word2vec has other interesting applications: analyzing each character's personality by finding the words that describe it; discovering design languages by extracting the characteristics of a particular design style; and analyzing the style of a piece of writing so that machines can collaborate with us on creative work.