A Guide to Solve 90% of Natural Language Processing Problems

Author: Emmanuel Ameisen

Source: Machine Heart

This article is approximately 5,000 words long, with an estimated reading time of 9 minutes. It explains how to approach natural language processing problems with machine learning.

Natural Language Processing (NLP), like Computer Vision (CV), is one of the two most important directions in the field of artificial intelligence. How can machine learning methods extract the meaning contained in human language from text? In this article, Emmanuel Ameisen from Insight AI briefly outlines the approach we can follow for most tasks.

The 5Ws and 1H of Text Data!

Text data is everywhere

Whether it’s an established company or one dedicated to launching new services, you can use text data to validate, improve, and expand product functionality. The science of extracting information and learning from text data is an active research topic in natural language processing (NLP).

NLP covers a wide range of fields, and exciting new results emerge every day. However, after collaborating with hundreds of companies, the Insight team found that several important applications appear particularly frequently:

Identifying different user/customer groups (e.g., predicting customer churn, customer lifetime value, product preferences)

Accurately detecting and extracting different categories of feedback (positive and negative comments/opinions, specific attributes mentioned, such as clothing size/fit)

Classifying text based on intent (e.g., seeking general help, urgent issues)

Although there are many NLP papers and tutorials available online, it is hard to find guidelines and tips on how to tackle these problems efficiently from the ground up.

This article will help you

Combining the experience of leading hundreds of project teams each year and advice from the top teams across the United States, we have completed this article, which will explain how to use machine learning solutions to address the aforementioned NLP problems. We will start with the simplest methods and then introduce more detailed approaches, such as feature engineering, word vectors, and deep learning.

After reading this article, you will know how to:

Collect, prepare, and validate data

Build simple models first and move to deep learning when necessary

Interpret and understand models, ensuring that what is captured is information rather than noise

This article will provide you with step-by-step guidance and can also serve as a high-level overview that provides effective standard methods.

This article includes an interactive notebook that demonstrates and applies all of these techniques. You can run the code freely and follow along:

https://github.com/hundredblocks/concrete_NLP_tutorial/blob/master/NLP_notebook.ipynb

Step 1: Collect Data

Examples of Data Sources

Every machine learning problem starts with data, such as emails, posts, or tweets. Common sources of textual information include:

Product reviews (from Amazon, Yelp, and various app stores)

User-generated content (tweets, Facebook posts, StackOverflow questions)

Troubleshooting (customer requests, support tickets, chat logs)

The “Disasters on Social Media” Dataset

In this article, we will use a dataset called “Disasters on Social Media” provided by CrowdFlower, where:

Contributors reviewed over 10,000 tweets collected through a variety of searches such as “ablaze,” “quarantine,” and “pandemonium,” and then noted whether each tweet referred to a disaster event (as opposed to a joke using the word, a movie review, or something else where no disaster occurred).

Our task is to detect which tweets are about catastrophic events, excluding unrelated topics like movies. Why? One possible application is to notify law enforcement only during emergencies (rather than when discussing the latest Adam Sandler movie).

In the rest of this article, we will refer to tweets about disasters as “disasters” and all other tweets as “non-related events.”

Labels

We have labeled the data, so we know the category to which each tweet belongs. As Richard Socher outlines below, finding and labeling enough data to train a model is often faster, simpler, and cheaper than trying to optimize complex unsupervised methods.

Richard Socher’s Tip

Step 2: Clean Data

The primary rule we follow is: “Your model is limited by your data.”

One of the essential skills of a data scientist is knowing whether the next step should be to work on the model or on the data. A good approach is to look at the data first and then clean it. A clean dataset allows the model to learn meaningful features rather than overfitting to irrelevant noise.

Here is a checklist for cleaning data (more details can be found in the code: https://github.com/hundredblocks/concrete_NLP_tutorial/blob/master/NLP_notebook.ipynb):

1. Remove all irrelevant characters, such as any non-alphanumeric characters

2. Tokenize the text into individual words for parsing

3. Remove irrelevant tokens, such as Twitter “@” mentions or URLs

4. Convert all characters to lowercase, standardizing words like “hello,” “Hello,” and “HELLO”

5. Consider mapping misspelled or alternately spelled words to a single representation (e.g., “cool”/“kewl”/“cooool”)

6. Consider lemmatization (unifying forms like “am,” “are,” “is” to a common form “be”)

After following these steps and checking for errors, you can use the cleaned, tokenized data to train the model!
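The first four items of the checklist can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's actual code: the function name `clean_tweet`, the regular expressions, and the example tweet are our own assumptions. Note that URLs and mentions are removed before punctuation is stripped, otherwise their fragments would survive as words.

```python
import re

# Hypothetical patterns chosen for illustration; real tweets may need more cases.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]")

def clean_tweet(text):
    """Return lowercase tokens with URLs, @mentions, and punctuation removed."""
    text = text.lower()                        # item 4: lowercase everything
    text = URL_PATTERN.sub(" ", text)          # item 3: drop URLs
    text = MENTION_PATTERN.sub(" ", text)      # item 3: drop @mentions
    text = NON_ALNUM_PATTERN.sub(" ", text)    # item 1: drop remaining symbols
    return text.split()                        # item 2: tokenize on whitespace

# A made-up tweet, not taken from the dataset.
tokens = clean_tweet("Forest FIRE near @CityNews!! Details: http://example.com/a1")
print(tokens)
```

Spelling normalization and lemmatization (items 5 and 6) usually rely on dictionaries or dedicated NLP libraries, so they are left out of this sketch.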

Step 3: Find Good Data Representations

The input to machine learning models is numerical. For image processing models, matrices represent the intensity of each pixel in each color channel.

A smiley face can be represented as a numerical matrix.

If our dataset consists of a series of sentences, to enable the algorithm to extract features from the data, we need to represent it in a form recognizable by the algorithm, such as a series of numbers.

One-hot Encoding (Bag of Words Model)

A natural way to represent text for a computer is to encode each character as a separate number (e.g., ASCII). If we fed this simple representation directly into a classifier, it would have to learn the structure of words from scratch based only on our data, which is impractical for most datasets. Therefore, we need a higher-level approach.

For example, we can build a vocabulary of all the distinct words in the dataset and associate each word with its own number (index). A sentence can then be represented as a list as long as the number of distinct words in the vocabulary. At each index in the list, we mark how many times the corresponding word occurs in the sentence. This is the Bag of Words model, which completely ignores the order of words in the sentence, as shown below.

Representing a sentence as a bag of words. The left shows the sentence, and the right shows the corresponding representation, where each number (index) in the vector represents a specific word.
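To make the idea concrete, here is a small dependency-free sketch. The helper names `build_vocabulary` and `bag_of_words` and the two toy sentences are assumptions made for illustration. Libraries such as scikit-learn provide an equivalent in `CountVectorizer`.

```python
def build_vocabulary(tokenized_sentences):
    """Map each distinct word to an index, in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def bag_of_words(tokens, vocabulary):
    """Count occurrences of each vocabulary word; word order is discarded."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:          # words outside the vocabulary are ignored
            vector[index] += 1
    return vector

# Toy sentences, already cleaned and tokenized.
sentences = [["mary", "is", "hungry"], ["mary", "likes", "pie", "pie"]]
vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(tokens, vocabulary) for tokens in sentences]
print(vocabulary)
print(vectors)
```

Shuffling the words of a sentence leaves its vector unchanged, which is exactly the loss of order described above.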

Visualizing Word Embeddings

In the example of “Disasters on Social Media,” there are about 20,000 words in the vocabulary, which means each sentence will be represented as a vector of length 20,000. There are many zeros in the vector because each sentence only contains a very small subset of the vocabulary.

To understand whether the word embeddings capture relevant information related to the problem (such as whether a tweet is about a disaster), a good approach is to visualize them and see how well these classes are separated. Since the vocabulary is large, visualizing data in 20,000 dimensions is impossible, so methods like Principal Component Analysis (PCA) are needed to reduce the data to two dimensions, as shown in the figure below.

Visualizing the embedded bag of words.
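For readers curious about what the projection does, the sketch below computes the two directions of greatest variance by power iteration and projects the data onto them. It is a teaching aid under our own assumptions (the function names, the fixed starting vector, and the iteration count are illustrative choices). With a 20,000-word vocabulary, building the full covariance matrix this way would be far too slow, so a real run would rely on a library implementation of PCA or truncated SVD instead.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def principal_direction(cov, earlier, iterations=300):
    """Power iteration for the dominant direction orthogonal to `earlier` ones."""
    vec = [1.0 / (i + 1) for i in range(len(cov))]    # fixed, non-zero start
    for _ in range(iterations):
        vec = [dot(row, vec) for row in cov]
        for prev in earlier:                          # remove earlier directions
            overlap = dot(vec, prev)
            vec = [v - overlap * p for v, p in zip(vec, prev)]
        norm = math.sqrt(dot(vec, vec))
        if norm < 1e-12:                              # no variance left to explain
            return [0.0] * len(cov)
        vec = [v / norm for v in vec]
    return vec

def pca_2d(rows):
    """Project each row onto the two directions of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        components.append(principal_direction(cov, components))
    return [[dot(row, comp) for comp in components] for row in centered]

# Toy data lying on a straight line: one direction explains all the variance.
points = pca_2d([[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]])
print(points)
```

Each sentence vector becomes a pair of coordinates that can be drawn on a scatter plot and colored by class.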

The two classes do not look well separated, which could be a property of our embeddings or simply of the dimensionality reduction. To find out whether the bag of words features are useful, we can train a classifier on them.

Step 4: Classifier

When first approaching a problem, a good practice is to start with the simplest tool that could do the job. For classifying data, a common choice is Logistic Regression, valued for its versatility and interpretability. It is straightforward to train, and the results are interpretable because it is easy to extract the most important coefficients from the model.

We split the data into a training set used to fit the model and a test set used to see how well it generalizes to unseen data. After training, the accuracy is 75.4%. Not bad! Always guessing the most frequent class (“non-related events”) would give only 57%. However, even if 75% accuracy were sufficient for our needs, we should never ship a model without trying to understand it.
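The comparison between a trained model and the majority-class baseline can be sketched as follows. This is a toy Logistic Regression trained by plain gradient descent, written without dependencies for illustration. The data, the feature columns, the learning rate, and the number of epochs are all made-up assumptions, and in practice a library implementation would be the usual choice.

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))      # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    weights, bias, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(features, labels):
            score = sum(w * x for w, x in zip(weights, row)) + bias
            error = sigmoid(score) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict(row, weights, bias):
    score = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always guessing the most frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    return sum(y == majority for y in test_labels) / len(test_labels)

# Toy bag of words counts for the columns [fire, flood, movie, lol]; 1 = disaster.
train_x = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
train_y = [1, 1, 0, 0, 0]
test_x = [[1, 1, 0, 0], [0, 0, 1, 1], [2, 0, 0, 0], [0, 0, 0, 2], [0, 0, 2, 0]]
test_y = [1, 0, 1, 0, 0]

weights, bias = train_logistic_regression(train_x, train_y)
predictions = [predict(row, weights, bias) for row in test_x]
print("model accuracy:", accuracy(predictions, test_y))
print("baseline accuracy:", majority_baseline_accuracy(train_y, test_y))
```

A model only earns its keep by the margin it gains over the baseline, which is why the 57% figure matters when reading the 75.4%.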

Step 5: Validation

Confusion Matrix

First, we need to know the types of errors our model makes and which errors are the least desirable. In our case, false positives are non-related tweets classified as disasters, while false negatives are disaster tweets classified as non-related events. If the priority is to react to every potential event, we want to reduce false negatives. If resources are limited, we would instead aim to reduce false positives to minimize false alarms. We can visualize this information with a confusion matrix, which compares our model’s predictions with the true labels. Ideally (if our predictions perfectly match the true labels), the matrix would be a diagonal matrix from the top left to the bottom right.

Confusion matrix (green proportion is large, blue proportion is small)
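The four cells of the matrix are easy to count by hand. The sketch below uses made-up labels and a hypothetical helper, `confusion_counts`, to show how the false negative and false positive rates are derived.

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

# Made-up labels: 1 = disaster, 0 = non-related event.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(truth, predicted)
# Share of real disasters that were missed.
false_negative_rate = counts["fn"] / (counts["fn"] + counts["tp"])
# Share of non-related tweets that raised a false alarm.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_negative_rate, false_positive_rate)
```

Comparing the two rates, rather than the raw counts, is what allows a proportional statement about which kind of error dominates.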

Proportionally, our classifier produces more false negatives than false positives. In other words, the model’s most common error is classifying disasters as non-related events. If false positives are costly for law enforcement, this could be a good bias for our classifier to have.

Interpreting the Model

To validate and interpret the model’s predictions, we need to see which words play a significant role in the predictions. If the data is biased, the classifier may make accurate predictions on the sample data, but the model’s predictive performance may not be ideal in actual applications. In the figure below, we provide important vocabulary related to disasters and non-related events. We can extract and compare the prediction coefficients in the model, making it straightforward to find important vocabulary using the bag of words model and Logistic Regression.

Bag of Words: Important Vocabulary
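Because each coefficient of a bag of words Logistic Regression corresponds to one word, ranking the words is a matter of sorting. The vocabulary and coefficient values below are invented for illustration and are not the ones learned in this article.

```python
def most_important_words(vocabulary, coefficients, top_k=3):
    """Return the top_k words pushing toward each class, strongest first."""
    index_to_word = {index: word for word, index in vocabulary.items()}
    ranked = sorted(range(len(coefficients)), key=lambda i: coefficients[i])
    lowest = [(index_to_word[i], coefficients[i]) for i in ranked[:top_k]]
    highest = [(index_to_word[i], coefficients[i])
               for i in reversed(ranked[-top_k:])]
    return highest, lowest

# Hypothetical word-to-index mapping and learned coefficients.
vocabulary = {"earthquake": 0, "movie": 1, "the": 2, "wildfire": 3, "lol": 4}
coefficients = [2.1, -1.4, 0.05, 1.7, -0.9]

disaster_words, unrelated_words = most_important_words(
    vocabulary, coefficients, top_k=2)
print(disaster_words)
print(unrelated_words)
```

Large positive coefficients push a tweet toward the disaster class and large negative ones toward the non-related class, while values near zero contribute little.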

Our classifier correctly picks up on some patterns (Hiroshima, Holocaust), but it is clearly overfitting to some meaningless terms (heyoo, x1392). Right now, our bag of words model is dealing with a massive vocabulary in which all words are treated equally. However, some words appear very frequently and only add noise to our predictions. Next, we will try a representation that accounts for how frequently words occur, to see whether we can pick up more signal from the data.

Step 6: Accounting for Vocabulary Structure

TF-IDF

To make the model focus more on meaningful words, we can apply TF-IDF (Term Frequency-Inverse Document Frequency) weighting on top of our bag of words model. TF-IDF weights words by how rare they are in the dataset, reducing the weight of high-frequency words that only add noise. This is the PCA projection of our new embeddings.

Visualizing TF-IDF embeddings.
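The weighting itself fits in a few lines. The sketch below uses the textbook definition, term frequency multiplied by log(N / document frequency), which is an assumption on our part. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Weight each word by its in-document frequency times log(N / doc frequency)."""
    n_docs = len(tokenized_docs)
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))   # count each word once per document
    weighted = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weighted.append({
            word: (count / len(tokens)) * math.log(n_docs / document_frequency[word])
            for word, count in counts.items()
        })
    return weighted

# Toy documents, already cleaned and tokenized.
docs = [["the", "fire", "spreads"],
        ["the", "movie", "was", "fire"],
        ["the", "flood", "rises"]]
weights = tf_idf(docs)
print(weights[0])
print(weights[2])
```

A word such as “the,” which appears in every document, receives a weight of zero, while a word found in a single document receives the largest boost.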

From the figure above, we see that the two colors are more clearly separated. This should make it easier for the classifier to separate the two groups. Let’s see whether this leads to better performance. Training another Logistic Regression on the new TF-IDF embeddings, we achieved an accuracy of 76.2%.

This is only a slight improvement. Has our model started picking up on more important words? If we get a better result while preventing the model from “cheating,” we can truly consider this model an upgrade.

TF-IDF: Important Vocabulary

The words extracted seem more relevant! Although the metrics on our test set have increased only slightly, we have much more confidence in the terms the model is using, and so we would feel more comfortable deploying it in a system that interacts with customers.

Step 7: Leveraging Semantics

Word2Vec

Our latest model manages to pick up on high-signal words. However, once the model is deployed, it is likely to encounter words it has not seen in the training set. The previous models will not be able to classify such tweets accurately, even if they saw very similar words during training.

To address this issue, we need to capture the meanings of words: we need to understand that “good” and “positive” are closer to each other than “apricot” and “continent” are. The tool we will use to capture word meanings is called Word2Vec.

Using Pre-trained Words

Word2Vec is a technique for finding continuous embeddings of words. It learns by reading large amounts of text and memorizing which words tend to appear in similar contexts. After training on sufficient data, it generates a 300-dimensional vector for each word in the vocabulary, with words of similar meaning lying closer to each other.
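“Closer” is commonly measured with cosine similarity. In the sketch below, the three-dimensional vectors are invented purely for illustration and are not real Word2Vec output, which would have 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # convention for all-zero vectors
    return dot / (norm_a * norm_b)

# Made-up toy embeddings, NOT taken from a trained model.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "positive": [0.8, 0.2, 0.1],
    "apricot": [0.0, 0.9, 0.4],
}

similar = cosine_similarity(toy_vectors["good"], toy_vectors["positive"])
dissimilar = cosine_similarity(toy_vectors["good"], toy_vectors["apricot"])
print(similar, dissimilar)
```

With real pre-trained vectors, the same comparison is what lets a classifier treat an unseen word like a known word with a similar meaning.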

The authors of the paper “Efficient Estimation of Word Representations in Vector Space” have open-sourced a model that was pre-trained on a very large corpus, which we can leverage to bring some knowledge of semantic meaning into our model. The pre-trained vectors can be found in the repository associated with this article: https://github.com/hundredblocks/concrete_NLP_tutorial.

Representing Sentences

A quick way to obtain a sentence embedding for the classifier is to average the Word2Vec vectors of all the words in the sentence. This is a bag of words approach just like before, but this time we lose only the syntax of the sentence while retaining some semantic information.

Below is a visualization of the new embeddings, produced with the same technique as before:
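Averaging can be sketched as follows. The function name `sentence_embedding`, the two-dimensional toy vectors, and the choice to skip unknown words are assumptions for illustration only.

```python
def sentence_embedding(tokens, word_vectors, dimensions):
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dimensions       # no known word: fall back to zeros
    return [sum(vector[i] for vector in known) / len(known)
            for i in range(dimensions)]

# Made-up 2-dimensional vectors standing in for 300-dimensional Word2Vec ones.
toy_vectors = {"forest": [1.0, 0.0], "fire": [0.0, 1.0]}

embedding = sentence_embedding(["forest", "fire", "zzz"], toy_vectors, 2)
reordered = sentence_embedding(["fire", "forest"], toy_vectors, 2)
print(embedding, reordered)
```

The two calls return the same vector, which shows that word order is still ignored at this stage.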

Visualizing Word2Vec embeddings

The two colors look even more clearly separated here, so our new embeddings should help the classifier find the separation between the two classes. After training the same model (Logistic Regression) a third time, we achieved an accuracy of 77.7%, our best result so far! It is time to inspect our model.

Trade-off Between Complexity and Interpretability

Because our embeddings are no longer vectors with one dimension per word, as in the previous models, it is harder to see which words are most relevant to the classification. While we can still use the coefficients of Logistic Regression, they relate to the 300 dimensions of our embeddings rather than to the indices of words.

For such a small gain in accuracy, giving up all interpretability seems like a harsh trade-off. However, for more complex models, we can use black box explainers like LIME to gain deeper insights into how the classifier works.

LIME

LIME can be found in an open-source package on Github: https://github.com/marcotcr/lime

Black box explainers allow users to explain the decisions of any classifier by perturbing the input and observing the changes in predictions for a specific example.
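The perturbation idea can be illustrated with a much simpler relative of LIME: remove one word at a time and record how much the predicted probability drops. This leave-one-out sketch is not LIME itself (LIME samples many perturbed inputs and fits a local linear model to them), and the keyword-based `toy_probability` function merely stands in for a real classifier.

```python
def word_importance(tokens, predict_probability):
    """Score each word by the drop in predicted probability when it is removed."""
    baseline = predict_probability(tokens)
    scores = []
    for position, token in enumerate(tokens):
        perturbed = tokens[:position] + tokens[position + 1:]
        scores.append((token, baseline - predict_probability(perturbed)))
    return scores

# Hypothetical stand-in for a trained model: any function from tokens to a
# probability of the "disaster" class could be passed in instead.
DISASTER_KEYWORDS = {"wildfire", "evacuate"}

def toy_probability(tokens):
    hits = sum(token in DISASTER_KEYWORDS for token in tokens)
    return min(1.0, 0.1 + 0.4 * hits)

scores = word_importance(
    ["wildfire", "forces", "town", "to", "evacuate"], toy_probability)
for token, drop in scores:
    print(token, round(drop, 3))
```

Only the classifier's outputs are used, never its internals, which is what makes this family of methods applicable to any model.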

Let’s look at the explanations for several sentences in the dataset.

The correct disaster words are picked up, and the tweet is classified as “related.”

Here, the contribution of the words to the classification seems less obvious.

However, we do not have time to explore thousands of examples in the dataset. Instead, we run LIME on a representative sample of test examples and see which words keep coming up as strong contributors. Using this approach, we can obtain word importance scores as we did for the previous models and validate the model’s predictions.

Word2Vec: Important Words

The model appears to pick up highly relevant words, which suggests that it makes understandable decisions. These seem to be the most relevant words of any of the previous models, so we would be more comfortable deploying this model to production.

Step 8: Using an End-to-End Approach

We have introduced a method for generating concise sentence embeddings quickly and effectively. However, by ignoring the order of words, we have skipped all grammatical information in the sentences. If the results provided by these methods are insufficient, we can use more complex models that input the entire sentence and predict labels without the need for intermediate representations. A common approach is to use Word2Vec or similar methods (like GloVe or CoVe) to treat sentences as a sequence of word vectors. This is what we will do next.

Efficient End-to-End Structure

Convolutional Neural Networks (CNNs) for sentence classification train very quickly and work well as an entry-level deep learning architecture. While CNNs are widely known for their use in image processing, they also yield excellent results in text-related tasks, and they usually train faster than most complex NLP approaches (such as LSTMs and Encoder/Decoder architectures). This model preserves the order of words and learns valuable information about which sequences of words are predictive of the target classes. Unlike the previous models, it can tell the difference between “Alex eats plants” and “Plants eat Alex.”
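The order sensitivity comes from filters that span several consecutive words. The sketch below applies a single hand-written filter of width two followed by max pooling. The one-hot toy embeddings, the filter values, and the single shared verb form are all assumptions for illustration. A real CNN learns many filters from data and adds non-linearities and a final classification layer.

```python
def conv1d_max_pool(word_vectors, kernel):
    """Slide a kernel over consecutive words and keep the strongest response."""
    width = len(kernel)
    if len(word_vectors) < width:
        raise ValueError("sentence is shorter than the kernel")
    responses = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        responses.append(sum(
            weight * value
            for kernel_row, vector in zip(kernel, window)
            for weight, value in zip(kernel_row, vector)
        ))
    return max(responses)

# Made-up one-hot embeddings; a real model would use Word2Vec-style vectors.
embeddings = {"alex": [1.0, 0.0, 0.0],
              "eats": [0.0, 1.0, 0.0],
              "plants": [0.0, 0.0, 1.0]}

# Hand-written filter that responds to "alex" immediately followed by "eats".
kernel = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]

alex_first = [embeddings[w] for w in ["alex", "eats", "plants"]]
plants_first = [embeddings[w] for w in ["plants", "eats", "alex"]]
print(conv1d_max_pool(alex_first, kernel), conv1d_max_pool(plants_first, kernel))
```

The two sentences contain the same words and therefore share the same averaged embedding, yet the filter responds differently to each because it sees the words in sequence.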

Training this model requires no more work than previous models, and it performs better, achieving an accuracy of 79.5%! See the code for details: https://github.com/hundredblocks/concrete_NLP_tutorial/blob/master/NLP_notebook.ipynb

As with the above models, the next step is to use this method to explore and explain predictions to verify whether it is the best model to provide to users. By now, you should be quite familiar with such issues.

Conclusion

Below is a brief review of the successful methods we employed:

Start with a simple and fast model

Interpret its predictions

Understand its types of errors

Based on the above knowledge, determine the next step: whether to work on the data or to move to a more complex model

We applied these methods to one specific example, using models tailored to understanding and leveraging short texts (tweets), but the same thinking applies to a wide variety of problems. We hope this article is helpful to you, and we welcome your comments and questions!

Original link: https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e
