The Origins of Natural Language Processing: Markov and Shannon’s Language Modeling Experiments

Excerpt from Towards Data Science

Author: Raimi Karim

Translated by: Machine Heart

Contributors: Wang Zijia, Geek AI

Language modeling and text generation are currently two hot research topics in the field of natural language processing. Yet, over a century ago, scientific giants Markov and Shannon began their preliminary explorations in this area…
In 1913, the Russian mathematician Andrey Andreyevich Markov sat in his study in St. Petersburg, holding a literary masterpiece of the time: Alexander Pushkin's 19th-century novel in verse, Eugene Onegin.
Markov, however, was not actually reading the renowned work. Instead, he picked up a pen and a sheet of draft paper, stripped all punctuation and spaces from the first 20,000 letters of the book, and wrote them out as one long string of letters. He then arranged these letters into 200 grids of 10×10 characters each, counted the vowels in every row and column, and tabulated the results.
To an uninformed observer, Markov’s behavior might seem peculiar. Why would someone deconstruct a literary genius’s work in such a way, reducing it to an incomprehensible form?
In fact, Markov was not reading the book for its insights into life and human nature; he was searching for a more fundamental mathematical structure in the text.
Markov separated vowels from consonants to test a theory of probability he had been developing since 1909 (https://www.americanscientist.org/article/first-links-in-the-markov-chain).
Until then, research in probability had largely been limited to phenomena like roulette or coin tossing, where the outcome of one event does not alter the probability of the next. But Markov believed that most events occur in chains of cause and effect and depend on prior outcomes, and he sought a way to model such events probabilistically.
Markov considered language an example of such a system: the characters that came before partly determine the character that comes next. To confirm this, he set out to show that the probability of a given letter appearing in a text like Pushkin's novel depends, to some extent, on the letters that precede it.
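Restated in modern terms, Markov's test compares the unconditional frequency of vowels with the frequency of vowels conditioned on the class of the preceding letter: if the two differ, successive letters are not independent. A minimal sketch in Python, using a short English string as a stand-in for Pushkin's Russian text (the sample text and variable names are illustrative, not Markov's data):

```python
# Sketch of Markov's vowel/consonant dependence test on a toy sample.
from collections import Counter

TEXT = "itwasthebestoftimesitwastheworstoftimes"  # letters only, no spaces
VOWELS = set("aeiou")

# Classify each letter as vowel ("V") or consonant ("C").
seq = ["V" if ch in VOWELS else "C" for ch in TEXT]

# Unconditional probability of a vowel.
p_vowel = seq.count("V") / len(seq)

# Count adjacent-pair transitions, then condition on the previous class.
pairs = Counter(zip(seq, seq[1:]))
p_vowel_after_vowel = pairs[("V", "V")] / (pairs[("V", "V")] + pairs[("V", "C")])
p_vowel_after_consonant = pairs[("C", "V")] / (pairs[("C", "V")] + pairs[("C", "C")])

print(f"P(V)              = {p_vowel:.2f}")
print(f"P(V | previous V) = {p_vowel_after_vowel:.2f}")
print(f"P(V | previous C) = {p_vowel_after_consonant:.2f}")
```

Even on this tiny sample the dependence is stark: a vowel never follows a vowel, while a vowel follows a consonant more often than its unconditional frequency would predict. This mirrors Markov's own observation that vowels and consonants tend to alternate rather than occur independently.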
Hence the scene described at the beginning of this article, with Markov counting vowels in Eugene Onegin. Through this counting he found that 43% of the letters were vowels and 57% were consonants. Markov then sorted the 20,000 letters into vowel and consonant combinations: he discovered 1,104 pairs of
