Excerpt from arXiv
Author: Noah A. Smith
Translated by Machine Heart
Contributors: Panda
The basics of natural language processing involve the representation of words. Noah Smith, a professor of Computer Science and Engineering at the University of Washington, recently published an introductory paper on arXiv that explains how words are processed and represented in natural language processing in a clear and accessible manner. Machine Heart focuses on sections 4 and 5, which involve context, skipping the basic introductions in sections 2 and 3. Interested readers can refer to the previous series of articles published by Machine Heart, “Word Embedding Series Blog: One, Two, Three”.
Paper: https://arxiv.org/abs/1902.06006
Abstract: The goal of this introductory paper is to tell the story of how computers process language. This is part of the field of natural language processing (NLP), a branch of artificial intelligence. The paper is aimed at a broad audience with a basic understanding of computer programming; it avoids detailed mathematical descriptions and does not present any algorithms. The focus is not on any specific application of NLP, such as translation, question answering, or information extraction. The ideas presented here were developed over decades by many researchers, so the references cited are not exhaustive but point readers to some papers that are, in the author's view, influential. After reading this paper, you should have a general understanding of word vectors (also known as word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what open questions remain about them. Readers already familiar with word vectors are encouraged to skip to section 5 for a discussion of the latest developments, "contextual word vectors".
1 Prerequisites
There are two ways to talk about words.
- word token, which is a word in the form of text. In some languages, determining the boundaries of a word token is a complex process (and speakers of that language might not agree on the "correct" rules), but in English, we often use spaces and punctuation to separate words; in this paper, we assume this has been "resolved".
- word type, which is an abstract word. Each word token is said to "belong" to its type. When we count the occurrences of a word in a piece of text (called a corpus), we are counting the number of tokens that belong to the same word type.
4 Words as Distributional Vectors: Context as Meaning
There is an important idea in linguistics: words (expressions) that can be used in similar ways are likely to have related meanings. In a large corpus, we can collect information about how a word type w is used, for example by counting how often it appears next to every other word. When we begin to consider the full distribution of contexts in which w appears in the corpus, we are taking a distributional view of word meaning.
One very successful family of methods based on this idea for automatically inferring features is clustering; for example, the clustering algorithm of Brown et al. (1992) automatically organizes words into clusters based on the contexts in which they appear in a corpus. Words that frequently appear in the same neighboring contexts (the same nearby words) are grouped into the same cluster, and clusters are then merged into larger clusters. The resulting hierarchy, although not the same as the expert-designed data structures in WordNet, is surprisingly good in terms of interpretability and practicality. It also has the advantage that it can be rebuilt on any given corpus, and every word observed there will be included. Suitable word clusters can thus be constructed for news text, biomedical articles, or Weibo posts.
Another class of methods creates word vectors directly, where each dimension corresponds to how often the word type appears in a certain context (Deerwester et al., 1990). For example, one dimension might correspond to "the" and count how many times the word appears immediately after "the". Contexts may come from the left or the right, and at different distances and lengths. The result can be a vector many times the size of the vocabulary, with each dimension carrying a small amount of information that may or may not be useful. Using methods from linear algebra (aptly named "dimensionality reduction"), these vectors can be compressed into much shorter vectors in which the redundancy across dimensions is folded together.
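To make the counting and compression concrete, here is a minimal sketch, not taken from the paper, that builds context-count vectors from a toy corpus and compresses them with a truncated SVD; the corpus, window size, and target dimensionality are all illustrative assumptions.

```python
import numpy as np

# Toy corpus; in practice this would be a large collection of text.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog played".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each word type appears within a +/-2 word window of each other word.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                counts[index[w], index[sent[j]]] += 1

# "Dimensionality reduction": keep only the top-k singular directions.
k = 3
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
reduced = U[:, :k] * S[:k]          # each row is a k-dimensional word vector
print(reduced[index["cat"]])
```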
These reduced vectors have several advantages. First, the NLP developer can choose the number of dimensions to suit the needs of the program. More compact vectors can be more efficient to compute with, and the information lost in compression may even be beneficial, since corpus-specific "noise" can disappear with it. There are trade-offs, however: longer, less heavily compressed vectors retain more of the original information in the distributional vectors. Although the individual dimensions of compressed vectors are difficult to interpret, we can use well-known algorithms to find a word's nearest neighbors in the vector space, and these have often been found to be semantically related words.
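A nearest-neighbor lookup in such a space can be sketched with cosine similarity. The vectors below are random placeholders standing in for real reduced vectors, so the printed neighbors carry no meaning; the point is only the mechanics of the lookup.

```python
import numpy as np

# Placeholder vectors; in practice these come from counting and compressing, as above.
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "mat", "rug", "played"]
vectors = rng.normal(size=(len(vocab), 3))
index = {w: i for i, w in enumerate(vocab)}

def nearest(word, k=3):
    """Return the k vocabulary words most similar to `word` by cosine similarity."""
    q = vectors[index[word]]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    order = np.argsort(-sims)                     # highest similarity first
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest("cat"))
```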
(In fact, these observations have given rise to the idea of vector space semantics (see Turney and Pantel, 2010), where algebraic operations can be applied to these word vectors to investigate what “meanings” they have learned. A famous example is: “man is to woman as king is to queen”; this analogy can yield a test method: v(man) – v(woman) = v(king) – v(queen). Some word vector algorithms have been designed to adhere to such properties.)
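The analogy test can be written as a small vector-arithmetic check. The four 3-dimensional vectors below are made-up stand-ins for real pretrained embeddings, chosen only to illustrate the computation.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for real pretrained embeddings.
v = {
    "man":   np.array([0.8, 0.1, 0.3]),
    "woman": np.array([0.7, 0.9, 0.3]),
    "king":  np.array([0.9, 0.1, 0.8]),
    "queen": np.array([0.8, 0.9, 0.8]),
}

# v(man) - v(woman) = v(king) - v(queen) is equivalent to asking whether
# v(king) - v(man) + v(woman) lands close to v(queen).
target = v["king"] - v["man"] + v["woman"]
closest = min(v, key=lambda w: np.linalg.norm(v[w] - target))
print(closest)  # "queen" for these toy vectors
```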
Dimensionality-reduced vectors also have a significant drawback: individual dimensions are no longer interpretable features and cannot be mapped back to intuitive building blocks that provide the meaning of the word. The meaning of the word is distributed across the entire vector; therefore, these vectors are sometimes referred to as “distributed representations”.
As the corpus grows, scalability becomes a major issue, because the number of observable contexts grows as well. All word vector algorithms are built on the idea that, for each word type, the value in each dimension of its vector is a parameter that is chosen, together with all the other parameters, to best fit the observed patterns of words in the data. Since we view these parameters as continuous values, and the notion of "fitting the data" can be expressed as a smooth, continuous objective function, iterative algorithms based on gradient descent can be used to select the parameter values. Using tools common in machine learning, researchers have developed faster methods based on stochastic optimization. The word2vec package (Mikolov et al., 2013) is a well-known collection of such algorithms. A common pattern now is for industry researchers with large corpora and powerful computing infrastructure to build word vectors using established (and often expensive) iterative methods, and then publish the vectors for anyone to use.
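As one concrete possibility, the gensim library's implementation of word2vec (assuming gensim 4.x) can be run on a small tokenized corpus; the corpus and hyperparameters below are illustrative choices, not the settings used by Mikolov et al. (2013).

```python
from gensim.models import Word2Vec

# Each sentence is a list of word tokens; a real run would use a much larger corpus.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects the skip-gram variant; vector_size, window, and epochs are illustrative.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

print(model.wv["cat"])                        # the learned 50-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=3))   # its nearest neighbors in the learned space
```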
There are still many people exploring new methods for obtaining distributional word vectors. Here are some interesting ideas:
- When we want to use neural networks for NLP problems, a useful approach is to first map each input word token to its vector and then "feed" these word vectors into the neural network model that performs the task, such as translation. The vectors can be fixed in advance (i.e., pre-trained on a corpus, typically by someone else, using methods like those above), or they can be treated as parameters of the neural network model and adjusted specifically for the task (e.g., Collobert et al., 2011). Fine-tuning refers to initializing the word vectors by pre-training and then adjusting them through a task-specific learning algorithm. The word vectors can also be initialized to random values and then estimated purely through task learning, which we can call "learning from scratch". (A minimal sketch of the frozen versus fine-tuned setup appears after this list.)
- Using expert-constructed data structures (such as WordNet) as additional input for creating word vectors. A method known as retrofitting first extracts word vectors from a corpus and then adjusts them so that word types that are related in WordNet are closer together in the vector space (Faruqui et al., 2015).
- Using bilingual dictionaries to align the vectors of words in two languages in a single vector space, so that corresponding word vectors have small Euclidean distances, such as the English word type "cucumber" and the French word type "concombre" (Faruqui and Dyer, 2014). By constructing a function that repositions all the English vectors into the French space (or vice versa), researchers hope to align all English and French words, not just those in the bilingual dictionary.
- Computing word vectors partly (or entirely) from their character sequences (Ling et al., 2015). These methods often use neural networks to map sequences of arbitrary length to fixed-length vectors. This has two interesting effects: (1) in languages with complex morphological systems, different inflected variants of the same root word may have similar vectors, and (2) different spelling variants of the same word may have similar vectors. These methods have been quite successful for social media text, where spelling variation is abundant. For example, the word "would" has many variants in social media messages that will have similar character-based word vectors because their spellings are similar: would, wud, wld, wuld, wouldd, woud, wudd, whould, woudl, w0uld.
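The distinction in the first item above, between frozen pre-trained vectors, fine-tuned vectors, and vectors learned from scratch, can be sketched in PyTorch. The embedding matrix here is random, standing in for actual pre-trained vectors, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 50

# Stand-in for a matrix of pre-trained word vectors (one row per word type).
pretrained = torch.randn(vocab_size, dim)

# freeze=True keeps the vectors fixed; freeze=False lets the task fine-tune them.
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
tuned_emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

# "Learning from scratch": random initialization, trained only through the task.
scratch_emb = nn.Embedding(vocab_size, dim)

tokens = torch.tensor([[3, 17, 42, 7]])        # a batch with one sentence of word ids
print(frozen_emb(tokens).shape)                # torch.Size([1, 4, 50])
print(frozen_emb.weight.requires_grad, tuned_emb.weight.requires_grad)  # False True
```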
5 Contextual Word Vectors
We began by distinguishing word tokens from word types. All along, we have assumed that each word type is represented in our NLP programs by a fixed data object (first an integer, then a vector). This is convenient, but such an assumption does not match the reality of language. Most importantly, many words have different meanings in different contexts. Experts captured this roughly when designing WordNet, in which, for example, "get" maps to more than 30 different meanings (senses). It is hard to reach consensus on how many senses should be assigned to different words, or on where the boundaries between one sense and another lie, not to mention that word senses shift and change over time. In practice, many neural-network-based NLP programs begin by passing each word token's type vector to a function that transforms it based on the token's neighboring context words, producing a new version of the word vector, now specific to the token in its unique context. In our earlier example sentence, the two instances of "be" would therefore have different vectors, because one appears between "will" and "signed" while the other appears between "we'll" and "able".
It now seems that representing word types independently of context actually makes the problem harder than necessary. Because words have different meanings in different contexts, we need representations of types that encompass all these possibilities (for example, the 30 meanings of “get”). Shifting to word token vectors simplifies this, requiring only that the representation of a word token captures its meaning in that context. For the same reason, a set of contexts in which a word type appears can provide clues about its meaning, and a specific token context will provide clues about its specific meaning. For example, you may not know the meaning of the word “blicket”, but if I say, “I ate a strawberry blicket for dessert”, you can probably guess quite well.
Returning to the basic concept of similarity, we can anticipate that words that are similar to each other can replace each other well. For example, what words can replace “gin” well? If we only consider word types, this question is difficult to answer. WordNet tells us that “gin” can refer to a type of liquor, a hunting trap, a machine for separating seeds from cotton fibers, or a card game. But if we consider the given context (for example, “I use two parts gin to one part vermouth.”), it becomes very simple. In fact, if vodka can replace gin, then we can expect that vodka will have a similar contextual word vector.
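As a rough illustration of contextual token vectors, the sketch below uses a pre-trained BERT model (discussed later in this article) through the Hugging Face transformers library as a convenient off-the-shelf contextualizer; the model choice and the assumption that "gin" and "vodka" each map to a single word piece are mine, not the paper's.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# bert-base-uncased is used here purely as a convenient pre-trained contextualizer.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_vector(sentence, word):
    """Return the contextual vector of `word` in `sentence`.
    Assumes the word appears as a single word piece in the tokenizer's vocabulary."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (sequence_length, hidden_size)
    pieces = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[pieces.index(word)]

gin = token_vector("i use two parts gin to one part vermouth .", "gin")
vodka = token_vector("i use two parts vodka to one part vermouth .", "vodka")
# Tokens that are good substitutes in context are expected to have similar vectors.
print(torch.cosine_similarity(gin, vodka, dim=0).item())
```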
ELMo, short for "embeddings from language models" (Peters et al., 2018a), brought a significant advance in the form of word token vectors, i.e., vectors for words in context, or contextual word vectors, that are pre-trained on large corpora. Two important insights underpin ELMo:
- If every word token is going to have its own vector, then the vector should depend on an arbitrarily long context of nearby words. To obtain such a "contextual vector", we can start from the word type vectors and pass them through a neural network that converts arbitrary-length sequences of preceding and following word vectors into a single fixed-length vector. Unlike word type vectors (which are essentially lookup tables), contextual word vectors are built from both the type-level vectors and the neural network parameters that "contextualize" each word. ELMo trains one neural network on preceding contexts (going back to the start of the sentence the token appears in) and another neural network on following contexts (running up to the end of the sentence). Longer contexts, beyond the sentence, are possible as well. (A drastically simplified sketch of this contextualization step follows this list.)
- Recall that estimating word vectors requires solving an optimization problem of "fitting the data" (the data being the corpus). NLP has a long-standing data-fitting problem known as language modeling: predicting the next word given the sequence of "history" words that came before it. Many of the word (type) vector algorithms in use are based on fixed-size contexts, collected across all instances of a word type in a corpus. ELMo went further, using arbitrary-length histories and the most effective language models known at the time (based on recurrent neural networks; Sundermeyer et al., 2012). Although recurrent networks were already widely used in NLP, training them as language models and then using the contextual vectors they provide for each word token as pre-trained word (token) vectors was a novel approach.
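A drastically simplified, ELMo-flavored sketch of these insights: run one recurrent network over the left context and one over the right (via a bidirectional LSTM) so that each token's vector depends on its whole sentence. This illustrates only the contextualization step; it omits the actual ELMo architecture and its language-modeling training objective.

```python
import torch
import torch.nn as nn

vocab_size, type_dim, hidden_dim = 1000, 50, 64

type_vectors = nn.Embedding(vocab_size, type_dim)        # one vector per word type
# bidirectional=True runs one LSTM over the left context and one over the right.
contextualizer = nn.LSTM(type_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.tensor([[3, 17, 42, 7, 99]])              # one sentence of word ids
states, _ = contextualizer(type_vectors(tokens))         # (1, 5, 2 * hidden_dim)

# Each token now has its own vector, shaped by the words around it.
print(states.shape)   # torch.Size([1, 5, 128])
```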
It is interesting to step back and consider the full progression of how words are handled in computers. The powerful idea that text data can reveal word meanings through the contexts in which words are observed now leads us to derive the meaning of a word token primarily from the specific context that token appears in. This means that every instance of "plant" gets a different word vector; the instances whose contexts suggest vegetation are expected to lie close to each other, while those whose contexts suggest a manufacturing facility will cluster in another part of the vector space. Will this development fully resolve the problem of words having different meanings? That remains to be seen, but research has shown that ELMo is very beneficial in several NLP applications, including:
- Question answering (9% relative error reduction on the SQuAD benchmark)
- Labeling the semantic roles of verbs (16% relative error reduction on the Ontonotes semantic role labeling benchmark)
- Labeling expressions in text that refer to people or organizations (4% relative error reduction on the CoNLL 2003 benchmark)
- Resolving which expressions refer to the same entity (10% relative error reduction on the Ontonotes coreference resolution benchmark)
Peters et al. (2018a) and subsequent researchers have also reported gains on other tasks. Howard and Ruder (2018) introduced a similar method, ULMFiT, which has proven helpful for text classification. The subsequent BERT method (bidirectional encoder representations from transformers; Devlin et al., 2018) introduced some innovations in the learning method and learned from more data, achieving a further 45% relative error reduction on the first task above (compared to ELMo) and 7% on the second. On the SWAG benchmark for basic commonsense reasoning (Zellers et al., 2018), Devlin et al. (2018) found that ELMo gave a 5% relative error reduction compared to non-contextual word vectors, and BERT a further 66% compared to ELMo.
As of this writing, many questions about the relative performance of these methods remain open. A full explanation of the differences between the learning algorithms (especially the neural network architectures) is beyond the scope of this paper, but it is fair to say that the space of possible learning methods has not been fully explored; some directions are discussed by Peters et al. (2018b). Some findings on BERT suggest that fine-tuning may be crucial. While ELMo grew out of language modeling, the modeling problem BERT solves (i.e., the objective function it minimizes during estimation) is rather different. The role of the datasets used to learn the language models has not been fully assessed, beyond one unsurprising pattern: larger datasets tend to be more beneficial.
6 Notes
- Word vectors are biased
- Language is far more than just words
- Natural language processing is not a single problem