Author: PRATEEK JOSHI
Translator: Han Guojun
Proofreader: Li Hao
This article is approximately 3,500 words; the recommended reading time is 15 minutes.
This article will introduce the principles of ELMo and how it differs from traditional word embeddings, followed by practical demonstrations of its effectiveness.
Introduction
I am dedicated to researching issues related to Natural Language Processing (NLP). Each NLP problem presents a unique challenge and reflects how complex, beautiful, and intricate human language is.
However, one major headache for NLP practitioners is that machines cannot understand the true meaning of sentences. Yes, I am referring to the context issue in natural language processing. Traditional NLP techniques and architectures perform well on basic tasks, but their effectiveness declines when we try to incorporate context as a variable.
Over the past 18 months, the landscape of the NLP field has changed significantly, with NLP models such as Google’s BERT and Zalando’s Flair able to analyze sentences and grasp contextual information.
ELMo Model
Understanding contextual meaning is a significant breakthrough in the field of NLP, thanks to ELMo (Embeddings from Language Models), which is a state-of-the-art NLP architecture developed by AllenNLP. By the end of this article, you will be a loyal fan of ELMo just like me.
In this article, we will explore ELMo (Embeddings from Language Models) and build an exciting NLP model using it on a real dataset with Python.
Note: This article assumes you are familiar with various word embeddings and LSTM (Long Short-Term Memory) structures. You can refer to the following articles for more information on these topics:
- An Intuitive Understanding of Word Embeddings (https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/?utm_medium=ELMoNLParticle&utm_source=blog)
- Essentials of Deep Learning: Introduction to Long Short Term Memory (https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_medium=ELMoNLParticle&utm_source=blog)
Table of Contents
1. What is ELMo?
2. Understanding How ELMo Works
3. How Does ELMo Differ from Other Word Embeddings?
4. Applying the ELMo Model for Text Classification in Python:
- Understanding the Problem Statement
- Dataset Introduction
- Importing Libraries
- Importing and Checking Data
- Text Cleaning and Preprocessing
- Brief Introduction to TensorFlow Hub
- Preparing ELMo Model Vectors
- Building and Evaluating the Model
5. What Else Can We Do with ELMo?
6. Conclusion
1. What is ELMo?
The ELMo we are discussing here is not the character from Sesame Street; that very ambiguity is itself a classic example of why contextual meaning matters.
ELMo is a new way of representing words as vectors or embeddings, and these embeddings help achieve state-of-the-art results on a range of NLP problems.
NLP researchers worldwide have begun using ELMo for NLP problems in both academic and applied fields. I recommend checking out the initial paper on ELMo (https://arxiv.org/pdf/1802.05365.pdf). Generally, I do not suggest reading academic papers because they tend to be long and complex, but this paper is different; it explains the principles and design process of ELMo well.
2. Understanding How ELMo Works
Before we get into practice, let’s intuitively understand how ELMo operates. Why is this step important?
Imagine the following scenario: You have successfully downloaded the ELMo Python code from GitHub and built a model on your text dataset, but you only get mediocre results, so you need to improve. If you don’t understand the architecture of ELMo, how will you know what to improve? If you haven’t studied it, how do you know which parameters need adjustment?
This mindset applies to all machine learning algorithms; you don’t need to understand their derivation process, but you must have enough knowledge to manipulate and improve your model.
Now, let’s return to how ELMo works.
As I mentioned earlier, ELMo word vectors are computed on top of a two-layer bidirectional language model (biLM). This model consists of two stacked layers, and each layer makes both a forward and a backward pass over the input.
- The architecture uses a character-level convolutional neural network (CNN) to convert the words of the text into raw word vectors.
- These raw word vectors are fed into the first layer of the bidirectional language model.
- The forward pass captures information about each word together with the context (the words) before it.
- The backward pass captures information about the words that follow it.
- Together, the forward and backward information forms the intermediate word vectors.
- These intermediate word vectors are fed into the next layer of the model.
- The final ELMo representation is a weighted sum of the raw word vectors and the two intermediate word vectors (see the sketch after this list).
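To make that last step concrete, here is a minimal numpy sketch of the weighted sum. The weight values below are made up for illustration; in the real model they are softmax-normalized scalars learned by the downstream task, along with a scale factor gamma.
import numpy as np
# Toy stand-ins for the three per-token representations (raw char-CNN
# vectors plus the two intermediate biLM layers), each 1024-dimensional.
num_tokens, dim = 8, 1024
layers = [np.random.rand(num_tokens, dim) for _ in range(3)]
# Softmax-normalized scalar weights and a scale factor (illustrative values;
# in the real model these are learned by the downstream task).
raw_weights = np.array([0.1, 0.5, 0.4])
s = np.exp(raw_weights) / np.exp(raw_weights).sum()
gamma = 1.0
# ELMo representation: gamma * sum_j s_j * h_j
elmo_repr = gamma * sum(w * h for w, h in zip(s, layers))
print(elmo_repr.shape)  # (8, 1024)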
Because the input to the bidirectional language model is built from characters rather than whole words, the model can capture the internal structure of words. For example, for “beauty” and “beautiful”, even without looking at the context in which these words appear, the biLM can recognize that they are related to some degree.
3. How Does ELMo Differ from Other Word Embeddings?
Unlike traditional word embeddings such as word2vec or GloVe, the ELMo vector assigned to a word is actually a function of the entire sentence containing that word. Therefore, the same word will have different word vectors in different contexts.
You might ask: How does this difference help me in dealing with NLP problems? Let me clarify with an example:
We have the following two sentences:
- I read the book yesterday.
- Can you read the letter now?
Take a moment to consider the difference between these two sentences: the verb “read” in the first sentence is in the past tense, while in the second sentence it is in the present tense. This is a case of polysemy, where the same word form carries more than one sense.
Traditional word embeddings generate the same vector for the word “read” in both sentences, so these architectures cannot distinguish between polysemous words; they cannot recognize the context of words.
In contrast, ELMo word vectors handle this problem elegantly. Because the entire sentence is fed into the model when computing an embedding, the word “read” receives a different ELMo vector in each of the two sentences above.
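To see this concretely, here is a hedged sketch that embeds both sentences and compares the two vectors for “read”. It assumes the TensorFlow Hub ELMo module that we set up in the implementation section below.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
sentences = ["i read the book yesterday", "can you read the letter now"]
embeddings = elmo(sentences, signature="default", as_dict=True)["elmo"]
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vecs = sess.run(embeddings)
# "read" is token 1 in the first sentence and token 2 in the second
read_1, read_2 = vecs[0, 1], vecs[1, 2]
cosine = np.dot(read_1, read_2) / (np.linalg.norm(read_1) * np.linalg.norm(read_2))
print(cosine)  # noticeably below 1.0: same word, different contextual vectors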
4. Implementation: Applying the ELMo Model for Text Classification in Python
Now comes the part you have been looking forward to—implementing ELMo in Python! Let’s go step by step:
Understanding the Problem Statement
The first step in addressing a data science problem is to clarify the problem statement, which will serve as the foundation for your subsequent actions.
For this article, we have the following problem statement:
Sentiment analysis has always been a key problem in NLP. This time, we have collected tweets in which consumers discuss several companies that make and sell high-tech products such as mobile phones and computers. Our task is to determine whether these tweets contain a negative evaluation.
This is clearly a binary text classification task that requires us to predict sentiment from the extracted tweets.
Dataset Introduction
We have split the dataset:
- The training set contains 7,920 tweets.
- The test set contains 1,953 tweets.
You can download the dataset from here:
https://datahack.analyticsvidhya.com/contest/linguipedia-codefest-natural-language-processing-1/#data_dictionary
Note: Most of the profanity in the dataset has been replaced with “$&@*#”, but some tweets may still contain some profanity.
Alright, let’s open our favorite Python IDE and start coding!
Importing Libraries
Import the libraries we will use:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle

pd.set_option('display.max_colwidth', 200)
Importing and Checking Data
# read data
train = pd.read_csv("train_2kmZucJ.csv")
test = pd.read_csv("test_oJQbWVk.csv")
train.shape, test.shape
Output:
((7920, 3), (1953, 2))
There are 7920 tweets in the training set and 1953 tweets in the test set. Next, let’s check the category distribution in the training set:
train['label'].value_counts(normalize=True)
Output:
0 0.744192
1 0.255808
Name: label, dtype: float64
Here, 1 represents negative tweets, and 0 represents non-negative tweets.
Now let’s take a look at the first five rows of the training set:
train.head()
We have three columns of data: the “tweet” column is the independent variable, and the “label” column is the target variable.
Text Cleaning and Preprocessing
Ideally we would have a tidy, well-structured dataset to work with, but in NLP clean text data remains hard to come by.
We need to spend some time cleaning the data to prepare for model building. Extracting features from cleaned text will become easier, and the features may even contain more information. You will find that the higher the quality of your data, the better your model’s performance will be.
So let’s clean the existing dataset first.
We can see that some tweets contain URL links, which do not help sentiment analysis, so we need to remove them.
# remove URLs from train and test
train['clean_tweet'] = train['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
test['clean_tweet'] = test['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
We use regular expressions to remove the URLs.
Note: You can learn about regular expressions from here:
https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/regular-expressions-python?utm_medium=ELMoNLParticle&utm_source=blog
Now we will do some routine data cleaning:
# remove punctuation marks
punctuation = '"!#$%&()*+-/:;<=>?@[\]^_`{|}~'
train['clean_tweet'] = train['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in set(punctuation)))
test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in set(punctuation)))

# convert text to lowercase
train['clean_tweet'] = train['clean_tweet'].str.lower()
test['clean_tweet'] = test['clean_tweet'].str.lower()

# remove numbers
train['clean_tweet'] = train['clean_tweet'].str.replace("[0-9]", " ")
test['clean_tweet'] = test['clean_tweet'].str.replace("[0-9]", " ")

# remove whitespaces
train['clean_tweet'] = train['clean_tweet'].apply(lambda x: ' '.join(x.split()))
test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ' '.join(x.split()))
Next, we will normalize the text, reducing words to their base form. For instance, “produces”, “production”, and “producing” become “product”. Generally speaking, the different inflected forms of a word are not important; we only need their base form.
We will use the popular spaCy library for normalization:
# import spaCy's language model
nlp = spacy.load('en', disable=['parser', 'ner'])
# function to lemmatize text
def lemmatization(texts):
    output = []
    for i in texts:
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output
Lemmatize the tweets in the training and test sets:
train['clean_tweet'] = lemmatization(train['clean_tweet'])
test['clean_tweet'] = lemmatization(test['clean_tweet'])
Now let’s compare the original tweets with the cleaned tweets:
train.sample(10)
Look carefully at the two tweet columns in the sample output: the cleaned tweets are noticeably clearer and easier to read.
However, there is still much more that can be done in the text cleaning step, and I encourage you to explore the data further for possible improvements.
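For example, here is one illustrative extra step, a sketch (not part of the original pipeline) that drops English stopwords using spaCy 2.x's built-in list; whether it actually helps this task is something you would need to validate.
# drop English stopwords with spaCy's built-in list (optional, illustrative)
from spacy.lang.en.stop_words import STOP_WORDS

def remove_stopwords(text):
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

train['clean_tweet'] = train['clean_tweet'].apply(remove_stopwords)
test['clean_tweet'] = test['clean_tweet'].apply(remove_stopwords)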
Brief Introduction to TensorFlow Hub
Wait, what does TensorFlow have to do with our tutorial?
TensorFlow Hub is a library of reusable machine learning modules that enables transfer learning across tasks. The ELMo model is one such module, which is why our implementation relies on TensorFlow Hub.
First, we need to install TensorFlow Hub; note that you must install or upgrade TensorFlow to version 1.7 or higher to use it:
$ pip install "tensorflow>=1.7.0"
$ pip install tensorflow-hub
Preparing ELMo Model Vectors
Now we need to import the pre-trained ELMo model. Note that this model is over 350 MB in size, so it may take some time to download.
import tensorflow_hub as hub
import tensorflow as tf
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
I will first show you how to generate ELMo vectors for a sentence; you just need to pass a list of strings to elmo:
# just a random sentence
x = ["Roasted ants are a popular snack in Columbia"]

# Extract ELMo features
embeddings = elmo(x, signature="default", as_dict=True)["elmo"]
embeddings.shape
Output:
TensorShape([Dimension(1), Dimension(8), Dimension(1024)])
This output is a three-dimensional tensor of shape (1, 8, 1024):
- The first dimension is the number of samples in the batch, here 1.
- The second dimension is the maximum number of tokens among the input strings; our single sentence has 8 words, so it is 8.
- The third dimension is the length of the ELMo vector.
So each word in the input gets an ELMo vector of length 1024.
Let’s start extracting ELMo vectors for the cleaned tweets in the test and training sets. To get the ELMo vector for the entire tweet, we need to take the average of the vectors for each word in the tweet.
We can achieve this by defining a function:
def elmo_vectors(x):
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings, 1))
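As an aside, the module's output dictionary also contains a "default" key which, according to the TF Hub documentation for this module, is already a fixed mean-pooling of the contextualized word representations. A sketch of the equivalent helper (we stick with the explicit reduce_mean version in this article):
def elmo_sentence_vectors(x):
    # "default" yields a mean-pooled sentence vector of shape [batch_size, 1024]
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["default"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        return sess.run(embeddings)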
If you use the code above to process all the tweets at once, you may exhaust all your memory. We can avoid this problem by splitting the training and testing sets into batches of 100 samples each and passing them sequentially to the elmo_vectors() function.
I choose to store these samples in a list:
list_train = [train[i:i+100] for i in range(0,train.shape[0],100)]
list_test = [test[i:i+100] for i in range(0,test.shape[0],100)]
Now let’s iterate over these samples and extract the ELMo vectors, which will take a while:
# Extract ELMo embeddings
elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]
elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]
Once we have all the vectors, we can concatenate them into an array:
elmo_train_new = np.concatenate(elmo_train, axis = 0)
elmo_test_new = np.concatenate(elmo_test, axis = 0)
I recommend saving these arrays, since computing the ELMo vectors took a long time. We can save them as pickle files:
# save elmo_train_new
pickle_out = open("elmo_train_03032019.pickle", "wb")
pickle.dump(elmo_train_new, pickle_out)
pickle_out.close()

# save elmo_test_new
pickle_out = open("elmo_test_03032019.pickle", "wb")
pickle.dump(elmo_test_new, pickle_out)
pickle_out.close()
Then use the following code to reload them:
# load elmo_train_new
pickle_in = open("elmo_train_03032019.pickle", "rb")
elmo_train_new = pickle.load(pickle_in)

# load elmo_test_new
pickle_in = open("elmo_test_03032019.pickle", "rb")
elmo_test_new = pickle.load(pickle_in)
Building and Evaluating the Model
Let’s build an NLP model using ELMo!
We can use the ELMo vectors from the training set to build a classification model. Then, we will use that model to make predictions on the test set. But before we do this, we need to split elmo_train_new into training and validation sets to evaluate our model.
from sklearn.model_selection import train_test_split
xtrain, xvalid, ytrain, yvalid = train_test_split(elmo_train_new, train['label'], random_state=42, test_size=0.2)
Since our goal is to set a baseline score, we will build a simple logistic regression model using the ELMo vectors as features:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)
It’s time to make predictions! First, on the validation set:
preds_valid = lreg.predict(xvalid)
We use the F1 metric to evaluate our model since it is the official evaluation metric for the competition:
f1_score(yvalid, preds_valid)
Output: 0.789976
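For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2 × precision × recall / (precision + recall), so it balances false positives against false negatives, which matters on a class-imbalanced dataset like ours.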
The F1 score on the validation set is quite good; next, we make predictions on the test set:
# make predictions on test set
preds_test = lreg.predict(elmo_test_new)
Prepare the submission file to upload to the competition page:
# prepare submission dataframe
sub = pd.DataFrame({'id': test['id'], 'label': preds_test})

# write predictions to a CSV file
sub.to_csv("sub_lreg.csv", index=False)
The public leaderboard shows that our predictions scored 0.875672, which is very good considering that we performed only fairly basic preprocessing and used a very simple model. With more advanced techniques the score should improve further; do try it, and let me know your results! One possible next step is sketched below.
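As one illustrative example of a “more advanced technique”, you could try a small feed-forward network on the same ELMo features. The hyperparameters below are assumptions, not tuned values:
# a small feed-forward network on the same ELMo features (illustrative sketch)
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300, random_state=42)
mlp.fit(xtrain, ytrain)
print(f1_score(yvalid, mlp.predict(xvalid)))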
5. What Else Can We Do with ELMo?
We have just seen how effective ELMo is for text classification; paired with a more sophisticated model, it would likely perform even better. And ELMo's application is not limited to text classification: it can be used whenever you need to vectorize text data.
Here are some NLP problems that can be addressed using ELMo:
- Machine Translation
- Language Modeling
- Text Summarization
- Named Entity Recognition
- Question-Answering Systems
6. Conclusion
ELMo is undoubtedly a significant advance in NLP, and it is here to stay. Given the rapid pace of NLP research, other new state-of-the-art word embeddings, such as Google's BERT and Zalando's Flair, have also emerged in recent months. An exciting era for NLP practitioners has arrived!
I strongly recommend you use ELMo on other datasets and personally experience the performance improvement process. If you have any questions or wish to share your experiences with me and the community, please do so in the comments section below. If you are just starting in the field of NLP, you should also check out the following NLP-related resources:
- Natural Language Processing (NLP) course (https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_medium=ELMoNLParticle&utm_source=blog)
- Certified Program: Natural Language Processing (NLP) for Beginners (https://courses.analyticsvidhya.com/bundles/nlp-combo?utm_medium=ELMoNLParticle&utm_source=blog)
Original Title: A Step-by-Step NLP Guide to Learn ELMo for Extracting Features from Text
Original Link: https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/
Edited by: Wang Jing
Proofread by: Lin Yilin
Translator's Profile
Han Guojun graduated from the School of Information at Renmin University of China and is currently a second-year master's student in Data Science at the University of Melbourne. He is interested in mining the information behind data and hopes to meet like-minded friends with whom to keep learning and improving.