Understanding Key Points for Learning Natural Language Processing (NLP)


This article is authored by Mr. Scofield and was originally published on the author’s personal blog. Lei Feng Network has obtained copyright for reprinting.

0. Preface

Some time ago, while combining deep learning with NLP, I kept pondering some questions, one of which is the most critical: how exactly does deep learning enable the resolution of various NLP tasks so perfectly? What exactly is happening to my data in the neural network?

Moreover, many terms like: word vectors, word embedding, distributed representation, word2vec, glove, etc., what do these terms represent, what are their specific relationships, and are they on an equal level?

Due to my obsessive-compulsive disorder in pursuing a complete knowledge structure, I kept researching, thinking, and revolving around these topics…

Then I felt a little progress. I thought, why not share my understanding, whether right or wrong, with my peers? Perhaps we can exchange something more meaningful?

The structure of this article is organized logically in order of concepts, peeling back layer by layer, level by level, comparing and explaining.

Additionally, I should mention that the entire text is relatively introductory, and some points may not be described objectively or correctly, limited by my current level of understanding… I hope you can be forgiving and correct me in the comments!

1. The Core Key of DeepNLP: Language Representation

Recently a new term has appeared: Deep Learning + NLP = DeepNLP. As conventional machine learning matured, it was gradually overtaken by deep learning, which is driving a new wave of excitement because it offers capabilities that traditional machine learning cannot match. So when deep learning entered the field of NLP, it naturally swept through a batch of ACL papers. And indeed, that is what happened.

First, consider the issue of feature representation. Data representation is a core problem in machine learning. In the earlier era of machine learning, feature engineering arose: large numbers of hand-crafted features were designed to represent data effectively. In deep learning, however, you can largely forget about that: training is end-to-end, and the network itself learns to extract the key features; what you mainly tune are the hyper-parameters.

So how can deep learning exert its real power in NLP? Obviously, before discussing how to design powerful network structures, or how to bring NN-based solutions to advanced tasks such as sentiment analysis, entity recognition, machine translation, and text generation, we must first clear the hurdle of language representation: how to turn language into a data type that a NN can process.

Let’s take a look at how images and speech represent data:

[Figure: how data is represented in the image and speech domains]

In speech, a matrix of audio spectrum sequence vectors is the front-end input fed to the NN, fine; in images, the pixel matrix is flattened into a sequence of vectors and fed to the NN, fine; but what about natural language? Whether you already know it or not: each word is represented by a vector! The idea sounds simple, and in a sense it really is that simple, but is it actually that simple? Probably not.

Some people point out that images and speech are relatively natural, low-level forms of data representation. In the image and speech domains, the most basic data are signals, and we can tell whether two signals are similar with distance metrics; to judge whether two images are similar, we can simply look at the images themselves. Language, however, is a highly abstract tool that humans have evolved over millions of years to express cognitive information. Text is symbolic data: as long as two words differ on the surface, it is hard to characterize their relationship. Even for synonyms such as “话筒 (huàtǒng)” and “麦克 (màikè)”, both meaning “microphone”, the symbols alone do not reveal that they mean the same thing (the semantic gap phenomenon). Representing this is not as simple as one plus one, and judging whether two words are similar requires additional background knowledge.

So can we confidently draw a conclusion here? How to represent language effectively is the key prerequisite for an NN to exert its powerful fitting and computational ability!

2. Types of Word Representation Methods in NLP

Next, following the above thoughts, I will introduce various word representation methods. According to current development, word representations are divided into one-hot representation and distributed representation.

1. One-hot Representation of Words

The most intuitive, and still the most commonly used, word representation method in NLP is One-hot Representation. This method represents each word as a very long vector: its dimension is the size of the vocabulary, the vast majority of its elements are 0, and only one dimension has the value 1, marking the current word. Material on one-hot encoding is everywhere, so a simple example will suffice:

“话筒 (huàtǒng)” is represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 …], “麦克 (màikè)” is represented as [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 …]

Each word is a single 1 in a vast sea of 0s. If stored sparsely, this one-hot representation is very concise: simply assign a numeric ID to each word. In the example above, “话筒 (huàtǒng)” is recorded as 3 and “麦克 (màikè)” as 8 (counting from 0). In code, you can use a hash table to map each word to its number. This concise representation, combined with algorithms such as maximum entropy, SVM, and CRF, has handled the mainstream tasks of NLP quite effectively.
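To make this concrete, here is a minimal Python sketch (my own illustration, not from the original article) of the hash-table idea: a dict assigns each word a numeric ID, from which its one-hot vector follows. The toy vocabulary is an assumption chosen so that “话筒 (huàtǒng)” happens to land at ID 3.

```python
# Minimal sketch: assign integer IDs with a dict (hash table) and build one-hot vectors.
import numpy as np

vocab = ["我", "爱", "北京", "话筒", "麦克"]            # toy vocabulary, for illustration only
word2id = {w: i for i, w in enumerate(vocab)}          # word -> numeric ID

def one_hot(word, vocab_size=len(vocab)):
    """Return the one-hot vector for `word`: all zeros except a 1 at its ID."""
    vec = np.zeros(vocab_size)
    vec[word2id[word]] = 1.0
    return vec

print(word2id["话筒"])    # 3
print(one_hot("话筒"))    # [0. 0. 0. 1. 0.]
```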

Now let's analyze its drawbacks. 1. The dimension of the vector grows with the size of the vocabulary (the number of distinct word types); 2. any two words are isolated from each other, so the representation cannot capture semantic relatedness between words, and this is fatal.

2. Distributed Representation of Words

Traditional one-hot representation only symbolizes words and does not contain any semantic information. How to incorporate semantics into word representation? The distributional hypothesis proposed by Harris in 1954 provides a theoretical basis for this idea: words that are contextually similar also have similar meanings. Firth further elaborated and clarified the distributional hypothesis in 1957: the meaning of a word is determined by its context (a word is characterized by the company it keeps).

So far, based on the distributional hypothesis, word representation methods can be divided into three main categories according to how they model the context: matrix-based distributional representation, clustering-based distributional representation, and neural network-based distributional representation. Although these methods use different technical means to obtain word representations, because they all rest on the distributional hypothesis, their core ideas consist of two parts: 1. choose a way to describe the context; 2. choose a model to characterize the relationship between a word (hereafter referred to as the “target word”) and its context.

3. NLP Language Models

Before detailing the distributed representation of words, it is necessary to clarify a key concept in NLP: language models. Language models include grammatical language models and statistical language models. Generally, we refer to statistical language models. The reason for placing language models before word representation methods is that the latter will soon use this concept.

Statistical language models treat language (a sequence of words) as a random event and assign it a probability describing how likely it is to belong to a given language. Given a vocabulary V, for a sequence S = ⟨w1, …, wT⟩ ∈ V^T composed of words from V, a statistical language model assigns the sequence a probability P(S) to measure the confidence that S conforms to the syntactic and semantic rules of natural language.

In simple terms, a language model is a model that computes the probability of a sentence. What is its significance? The higher the probability score of a sentence, the more it indicates that it is a natural sentence spoken by humans.

It is that simple. The most common statistical language models are N-gram models, in particular the unigram, bigram, and trigram models. Formally, a statistical language model defines a probability P(w1, w2, …, wm) for a string of length m, representing how likely that string is to occur, where w1 through wm are the words of the text. In practice, this probability is usually computed with the following formula:

P(w1, w2, …, wm) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) ⋯ P(wm | w1, …, w(m-1)) ≈ ∏ᵢ P(wi | w(i-n+1), …, w(i-1)),

i.e. the chain rule combined with the n-gram (Markov) assumption that each word depends only on the previous n-1 words.

At the same time, these methods can also retain certain word order information, thus capturing the context information of a word.
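To make the n-gram approximation concrete, here is a minimal bigram sketch (my own illustration, not from the article); the toy corpus, the sentence-boundary markers, and the unsmoothed maximum-likelihood estimates are all assumptions for illustration only.

```python
# Minimal bigram language model sketch: P(S) ≈ ∏ P(w_i | w_{i-1}),
# with probabilities estimated from raw counts on a toy corpus.
from collections import Counter

corpus = [["<s>", "I", "love", "Beijing", "Tiananmen", "</s>"],
          ["<s>", "I", "love", "NLP", "</s>"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def sentence_prob(sent):
    """Chain-rule probability under the bigram (first-order Markov) assumption."""
    p = 1.0
    for prev, cur in zip(sent[:-1], sent[1:]):
        p *= bigram[(prev, cur)] / unigram[prev]   # MLE estimate, no smoothing
    return p

print(sentence_prob(["<s>", "I", "love", "NLP", "</s>"]))    # natural order: higher score
print(sentence_prob(["<s>", "NLP", "love", "I", "</s>"]))    # scrambled order: zero score
```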

Specific details of language models are widely available; please search for them on your own.

4. Distributed Representation of Words

1. Matrix-based Distributed Representation

Matrix-based distributed representation is often referred to as distributional semantic models. In this representation, a row in the matrix becomes the representation of the corresponding word, which describes the distribution of the context of that word. Since the distributional hypothesis posits that words with similar contexts have similar meanings, in this representation, the semantic similarity between two words can be directly converted into the spatial distance between two vectors.

A well-known example is the Global Vectors model (GloVe), which obtains word representations by factorizing a “word-word” co-occurrence matrix, and therefore belongs to the matrix-based family of distributed representations.
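As a rough illustration of the matrix-based idea, one might build a word-word co-occurrence matrix from a toy corpus and compress it with truncated SVD. This is only a generic sketch of the family, not GloVe's actual weighted least-squares objective over log co-occurrence counts; corpus, window size, and dimensionality are assumptions.

```python
# Generic matrix-based sketch: word-word co-occurrence counts + truncated SVD.
import numpy as np

corpus = [["I", "love", "Beijing"], ["I", "love", "NLP"], ["Beijing", "NLP"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of size 1.
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# Keep the top-k singular directions as low-dimensional word vectors.
U, S, Vt = np.linalg.svd(M)
k = 2
word_vectors = U[:, :k] * S[:k]          # each row is one word's k-dim representation
print(dict(zip(vocab, word_vectors.round(2))))
```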

2. Neural Network-based Distributed Representation, Word Embedding

Neural network-based distributed representation is generally referred to as word vectors, word embedding, or distributed representation. This is our main focus today.

Neural network word vector representation technology models the context and the relationship between the context and the target word using neural network techniques. Due to the flexibility of neural networks, the greatest advantage of these methods is that they can represent complex contexts. In the previous matrix-based distributed representation method, the most common context is words. If n-grams containing word order information are used as context, the total number of n-grams will grow exponentially as n increases, leading to the curse of dimensionality. However, neural networks can model n-grams by combining several words, with the number of parameters increasing only at a linear rate. With this advantage, neural network models can model more complex contexts, embedding richer semantic information in word vectors.

5. Word Embedding

1. Concept

Neural network-based distributed representation is also known as word vectors or word embedding. The neural network word vector model, like other distributional representation methods, is based on the distributional hypothesis, with the core still being the representation of context and modeling the relationship between context and the target word.

As mentioned earlier, to choose a model that characterizes the relationship between a word (hereafter referred to as the “target word”) and its context, we need to capture the context information of a word in the word vector. At the same time, we just happened to mention that statistical language models have the ability to capture context information. Therefore, the most natural approach to construct the relationship between context and the target word is to use language models. Historically, early word vectors were merely by-products of neural network language models.

In 2001, Bengio et al. formally proposed the Neural Network Language Model (NNLM), which, while learning the language model, also obtained word vectors. So please note: word vectors can be considered as by-products of training language models using neural networks.

2. Understanding

As mentioned earlier, one-hot representation suffers from excessive dimensionality, so we now improve the vector in two ways: 1. change each element from an integer to a floating-point value, so that the full real-valued range can be used; 2. compress the originally sparse, high-dimensional vector into a much smaller dense space. As illustrated:

[Figure: compressing a sparse one-hot vector into a dense, low-dimensional real-valued vector]

This is also why word vectors are called word embeddings: each word is embedded into a low-dimensional continuous space.
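Viewed as code, the “embedding” is just a V×N matrix: multiplying a one-hot vector by it selects one row, which is exactly an embedding lookup. The sketch below uses illustrative sizes and random weights standing in for a trained matrix.

```python
# Sketch: a dense embedding is a V x N matrix; a one-hot vector times that matrix
# picks out one row, i.e. an "embedding lookup". Sizes and weights are illustrative.
import numpy as np

V, N = 10, 4                              # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))               # embedding matrix (learned during training)

word_id = 3                               # e.g. the ID of "话筒"
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

print(one_hot @ W)                        # dense N-dim vector for the word...
print(W[word_id])                         # ...identical to simply indexing row 3
```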

6. Neural Network Language Models and Word2Vec

Now that we have a clear picture of how the concepts of distributed representation and word embedding relate to each other, where does word2vec fit in?

1. Neural Network Language Models

As mentioned, word vectors can be obtained by training language models through neural networks, so what types of neural network language models are there? As far as I know, there are several:

● Neural Network Language Model, NNLM
● Log-Bilinear Language Model, LBL
● Recurrent Neural Network based Language Model, RNNLM
● The C&W model proposed by Collobert and Weston in 2008
● The CBOW (Continuous Bag of Words) and Skip-gram models proposed by Mikolov et al.

At this point some familiar terms come into view: CBOW and Skip-gram; those familiar with word2vec will recognize them. Let's continue.

2. Word2Vec and CBOW, Skip-gram

Now we formally introduce the hottest term: word2vec.

The five neural network language models listed above are only conceptual; they still need concrete implementations, and the tool that implements the CBOW (Continuous Bag of Words) and Skip-gram language models is the well-known word2vec! Likewise, the implementation of the C&W model is SENNA.

Thus, distributed word vectors were not invented by the authors of word2vec; they simply proposed a faster and better way to train language models. These are the Continuous Bag of Words Model (CBOW) and the Skip-Gram Model, both of which can train word vectors. In actual code you can choose either one, though according to the paper CBOW trains faster.
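As a usage sketch (not from the original article), here is how one might train both variants with the gensim library, assuming gensim 4.x; the tiny corpus and parameter values are purely illustrative.

```python
# Minimal gensim sketch (assuming gensim >= 4.0): sg=0 selects CBOW, sg=1 selects Skip-gram.
from gensim.models import Word2Vec

sentences = [["I", "love", "Beijing", "Tiananmen"],
             ["I", "love", "NLP"]]                    # toy corpus, purely illustrative

cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(cbow.wv["love"][:5])                # the learned 100-dim vector for "love"
print(skipgram.wv.most_similar("love"))   # nearest neighbours in the vector space
```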

By the way, let's look at these two language models more closely. A statistical language model computes the (posterior) probability of a word given that certain other words have appeared. CBOW, as its name suggests, predicts the probability of a word from the C words preceding it, or from the C consecutive words surrounding it on both sides. Skip-gram does the opposite: given a word, it predicts the probabilities of the words that appear around it.

For example, take the sentence “I love Beijing Tiananmen”, focus on the word “love”, and let C = 2; its context is then “I” and “Beijing Tiananmen”. The CBOW model takes the one-hot representations of “I” and “Beijing Tiananmen” as input, i.e. C vectors of size 1xV. Each is multiplied by the same VxN weight matrix W1, giving C vectors of size 1xN, which are averaged into a single hidden layer (this step is sometimes described as a “linear activation”, although strictly speaking there is no activation function at all). The hidden layer is then multiplied by another NxV weight matrix W2 to produce a 1xV output layer, where each element corresponds to the posterior probability of a word in the vocabulary. This output is compared with the ground truth, the one-hot vector of “love”, to compute the loss.

Note that V is usually very large, say several million, so this computation is expensive. Apart from the element for “love”, which obviously must enter the loss, word2vec speeds things up in two ways: hierarchical softmax based on Huffman coding, which avoids scoring most of the vocabulary, and negative sampling, which keeps only a handful of negative sample words, reducing the per-word cost from O(V) to roughly O(log V). The Skip-gram training process is similar, except that the input and output are swapped.
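To make those shapes concrete, here is a minimal numpy sketch of a single CBOW forward pass. It illustrates the description above rather than the actual word2vec implementation, which replaces the full softmax with hierarchical softmax or negative sampling and trains W1 and W2 by gradient descent; the vocabulary and random weights are assumptions.

```python
# One CBOW forward pass, following the shapes in the text:
# C one-hot context vectors (1xV) -> shared W1 (VxN) -> averaged hidden layer (1xN)
# -> W2 (NxV) -> softmax over the vocabulary -> cross-entropy against the true word.
import numpy as np

vocab = ["I", "love", "Beijing", "Tiananmen"]
V, N = len(vocab), 3
rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.1, size=(V, N))      # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(N, V))      # hidden -> output weights

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

context = ["I", "Beijing", "Tiananmen"]      # context of the target word "love"
hidden = np.mean([one_hot(w) @ W1 for w in context], axis=0)   # average of 1xN vectors

scores = hidden @ W2                             # 1xV scores
probs = np.exp(scores) / np.exp(scores).sum()    # softmax: posterior over the vocabulary
loss = -np.log(probs[vocab.index("love")])       # cross-entropy against the true word
print(probs.round(3), loss)
```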

Additionally, the training methods for word embedding can generally be divided into two categories: one is unsupervised or weakly supervised pre-training; the other is end-to-end (supervised) training. Unsupervised or weakly supervised pre-training is represented by word2vec and auto-encoders. The characteristic of this type of model is that it can obtain reasonably good embedding vectors without requiring a large number of manually labeled samples. However, due to the lack of task orientation, it may still be somewhat distant from the problem we are trying to solve. Therefore, we often fine-tune the entire model using a small number of manually labeled samples after obtaining the pre-trained embedding vectors.

In contrast, end-to-end supervised models have gained increasing attention in recent years. Compared to unsupervised models, end-to-end models are often more complex in structure. At the same time, due to having clear task orientation, the embedding vectors learned by end-to-end models are often more accurate. For example, a deep neural network formed by connecting an embedding layer and several convolutional layers to achieve sentiment classification of sentences can learn richer semantic representations of word vectors.
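As a sketch of that last example (not from the original article), here is how an embedding layer followed by a convolutional layer might be wired up in PyTorch for sentence-level sentiment classification; the class name, layer sizes, and random input are all illustrative assumptions.

```python
# Hypothetical end-to-end model: embedding layer + 1D convolution + pooling + classifier.
# The embedding weights are trained jointly with the sentiment task.
import torch
import torch.nn as nn

class SentimentCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # word IDs -> vectors
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)                           # max over positions
        self.fc = nn.Linear(64, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)             # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                     # (batch, embed_dim, seq_len) for Conv1d
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)              # (batch, 64)
        return self.fc(x)                         # sentiment logits

model = SentimentCNN()
logits = model(torch.randint(0, 10000, (4, 20)))  # a batch of 4 sentences, 20 tokens each
print(logits.shape)                               # torch.Size([4, 2])
```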

3. My Understanding of Word Embedding

Now that word vectors can both reduce dimensionality and capture the contextual information of a word within a sentence (expressed through distance relationships between vectors), we can use them with confidence and satisfaction to represent language sentences as input to a NN.

Another practical suggestion: when you need word vectors for a specific NLP task, either 1. use word vectors pre-trained by others, making sure they were trained on a corpus from the same content domain as your task; or 2. train your own. I recommend the former because… there are too many pitfalls otherwise.
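For option 1, loading vectors pre-trained by others might look like the sketch below (assuming gensim 4.x); the file name is hypothetical, so substitute whatever pre-trained vectors match your domain.

```python
# Sketch: load pre-trained word vectors in word2vec binary format (assuming gensim >= 4.0).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)  # hypothetical file

print(wv["话筒"][:5])               # dense vector for a word, if it is in the vocabulary
print(wv.most_similar("话筒"))      # its nearest neighbours in the embedding space
```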

7. Conclusion

At this point I do not intend to elaborate further, and I will not explain the mathematical principles and details of word2vec, because there are already far too many articles online explaining them, so many that almost all of them are the same. There is no need for me to copy out another one.

So, to understand the details of word2vec, CBOW, and Skip-gram, please search carefully. I believe that with the background knowledge of this series of prerequisite contexts, when you read articles related to the details of word2vec, it will not be too difficult.

Additionally, this reflects a bigger issue: the lack of originality and critical thinking in online articles.

Just a casual search for “word2vec” or “word vectors” yields a plethora of explanations about the mathematics of word2vec, CBOW, and Skip-gram, and they are all almost identical… But what is most incomprehensible is that almost no one elaborates on the context of their emergence, their developmental process, or their position within the entire related technology framework. This frustrates me…

In fact, I would like to share that in my personal methodology, a well-structured knowledge framework with complete context is, to some extent, much more important than detailed knowledge points. Because once a complete knowledge structure framework is built, all you need to do is fill in some fragmented details; the reverse is not possible; mere accumulation of knowledge will only confuse your thinking and will not get you far.

So here I also call on all bloggers to exercise their initiative, create things that do not yet exist, and share unique insights as a contribution to the Chinese blogging and CS community! Even if you repost someone else's original work, it is best to digest it and add your own insights before sharing!


— THE END —
