Author Jay Alammar, Translated by QbitAI | WeChat Official Account QbitAI
BERT, as a key player in the field of natural language processing, is an unavoidable topic for NLPers.
However, for those with little hands-on experience and a weak foundation, mastering BERT can be a bit challenging.
Now, tech blogger Jay Alammar has created “A Visual Guide to Using BERT for the First Time,” which shows how to get started with BERT in a very simple and clear way: the illustrations cover everything from how BERT works to hands-on practice, with even more images than code. Here is QbitAI’s compiled translation~
This article mainly uses a variant of BERT for sentence classification as an example to demonstrate how to use BERT.
There is also a Colab link at the end.
Dataset: SST2
First, we need to use the SST2 dataset, which contains sentences from movie reviews.
If the reviewer expresses appreciation for the movie, the sentence is labeled “1”;
if the reviewer dislikes the movie and gives a negative review, it is labeled “0”.
The movie reviews in the dataset are written in English and look something like this:
| Sentence | Label |
| --- | --- |
| a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films | 1 |
| apparently reassembled from the cutting room floor of any given daytime soap | 0 |
| they presume their audience won’t sit still for a sociology lesson | 0 |
| this is a visually stunning rumination on love , memory , history and the war between art and commerce | 1 |
| jonathan parker ‘s bartleby should have been the be all end all of the modern office anomie films | 1 |
Sentence Sentiment Classification Model
Now, with the SST2 movie review dataset, we need to create a model that automatically classifies English sentences.
If the judgment is positive, label it as 1; if the judgment is negative, label it as 0.
The general logic is as follows:
Input a sentence, and the movie review sentence classifier outputs a positive or negative result.
This model is actually composed of two models.
DistilBERT is responsible for processing the sentence, extracting information from it, and passing it on to the next model. DistilBERT is a smaller version of BERT developed and open-sourced by the team at 🤗 Hugging Face; it is lighter and faster, and roughly matches the original’s performance.
The next model is a basic logistic regression model, which takes the output from DistilBERT as input and outputs either a positive or negative result.
The data we pass between the two models is a vector of size 768, which can be considered as a sentence embedding for classification.
Model Training Process
Although we will use two models, we only need to train the logistic regression model; DistilBERT can be used directly with the pre-trained version.
However, this model has never been trained or fine-tuned for the sentence classification task. Even so, we get some sentence classification capability from the general objectives BERT is pre-trained on. This is especially true for the output at the first position (the one associated with the [CLS] token): BERT’s second pre-training objective, next sentence classification, seemingly trains the model to encapsulate a sentence-wide meaning in the output at that first position.
The Transformers library provides us with an implementation of DistilBERT as well as a pre-trained version of the model.
Tutorial Overview
This is the entire plan for this tutorial. We will first use the trained DistilBERT to generate sentence embeddings for 2000 sentences.
After that step, we won’t touch DistilBERT anymore; from there on, everything is scikit-learn, where we do the usual train/test split on this dataset:
The train/test split of the output of the first model (DistilBERT) creates the dataset on which we train and evaluate the second model, the logistic regression.
Then we will train the logistic regression model on the training set:
How Single Prediction Works
Before we explore the code explaining how to train the model, let’s take a look at how a trained model makes predictions.
Let’s try to classify this sentence:
a visually stunning rumination on love
First, the BERT tokenizer splits the sentence into tokens;
Second, it adds the special tokens needed for sentence classification ([CLS] at the first position and [SEP] at the end of the sentence);
Third, the tokenizer replaces each token with its ID from the embedding table, a component that comes with the pre-trained model.
Note that the tokenizer accomplishes all steps in this line of code:
tokenizer.encode("a visually stunning rumination on love", add_special_tokens=True)
Now our input sentence is in the appropriate state to be passed to DistilBERT.
This step visualizes as follows:
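Running that single line returns a plain Python list of token IDs. As a quick check (a sketch, assuming the distilbert-base-uncased tokenizer loaded in the code section below), the list starts with 101, the ID of [CLS], and ends with 102, the ID of [SEP]:
ids = tokenizer.encode("a visually stunning rumination on love", add_special_tokens=True)
print(ids)               # a list of integers, one per token
print(ids[0], ids[-1])   # 101 and 102: the IDs of [CLS] and [SEP] in the uncased vocabulary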
Passing Through DistilBERT
The tokenized input passes through DistilBERT, which outputs a vector for each input token; each vector consists of 768 numbers.
Since this is a sentence classification task, we ignore all content except for the first vector (which is related to the [CLS] token) and take the first vector as input to the logistic regression model.
From here, the job of the logistic regression model is to classify this vector based on the experience it learned during training.
The process of this prediction calculation is as follows:
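In code terms, a single prediction can be sketched like this (an illustrative sketch, not code from the article; it assumes the tokenizer, model, and lr_clf objects that are defined in the code section below):
def predict_sentiment(sentence):
    # Tokenize and add the [CLS] / [SEP] special tokens
    ids = tokenizer.encode(sentence, add_special_tokens=True)
    input_ids = torch.tensor([ids])
    # Run DistilBERT without tracking gradients
    with torch.no_grad():
        output = model(input_ids)
    # Keep only the 768-number vector at the first position (the [CLS] token)
    cls_vector = output[0][:, 0, :].numpy()
    # The logistic regression model turns that vector into 1 (positive) or 0 (negative)
    return lr_clf.predict(cls_vector)[0]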
Code
Now, let’s look at the code for the entire process. You can also find the GitHub code and the runnable version on Colab through the links provided.
First, import the necessary tools.
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
You can find this dataset on GitHub, so we can directly import it into a pandas dataframe.
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
You can directly use df.head() to view the first five rows of the dataframe and see what the dataset looks like.
df.head()
Then it outputs the first five rows, with the review sentence in column 0 and its label in column 1:
Import Pre-trained DistilBERT Model and Tokenizer
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
Now we can tokenize this dataset.
Note that this step is different from the previous example, which only processed one sentence; we want to batch process all sentences.
Tokenization
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
This step turns each sentence into a list of IDs.
The dataset is currently a list of lists (held as a pandas Series); before DistilBERT can process it as input, we need to make all the vectors the same length by padding the shorter sentences with the token ID 0.
After padding with 0, we have a single matrix/tensor that can be fed to BERT.
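The padding step itself is not spelled out in code above; a minimal sketch (assuming the tokenized Series from the previous step) that builds the padded matrix used below could look like this:
# Find the length of the longest tokenized sentence
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

# Pad every sentence with token ID 0 up to that length
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])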
Processing with DistilBERT
Now, create an input tensor for the padded token matrix and send it to DistilBERT.
input_ids = torch.tensor(np.array(padded))
with torch.no_grad():
    last_hidden_states = model(input_ids)
After running this step, last_hidden_states retains the output from DistilBERT.
Opening BERT’s Output Tensor
Unpack this 3D output tensor and first check its dimensions:
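As a quick sketch (the first element of the model output is the hidden-state tensor), the dimension check is a one-liner:
last_hidden_states[0].shape  # (number of sentences, length of the longest sequence, 768 hidden units)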
Review the Sentence Processing Steps
Each row is associated with a sentence in our dataset. To review, the entire processing process is as follows:
Extracting Important Parts
For sentence classification, we are only interested in the output of BERT’s [CLS] token, so we only need to extract the important parts.
Here’s how to extract the 2D tensor we need from the 3D tensor:
# Slice the output for the first position for all the sequences, take all hidden unit outputs
features = last_hidden_states[0][:,0,:].numpy()
Now the features are a 2D numpy array containing the sentence embeddings for all sentences in our dataset.
Logistic Regression Dataset
Now that we have the output from BERT, we have assembled the dataset needed to train the logistic regression model. The 768 columns are the features, and the labels come from the original dataset.
After doing the traditional machine-learning train/test split, we can declare the logistic regression model and train it against this dataset.
Split the data into training and testing sets:
labels = df[1]
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
Next, train the logistic regression model on the training set:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
Now that the model is trained, let’s score it against the test set:
lr_clf.score(test_features, test_labels)
The model accuracy achieved is 81%.
Score Benchmarks
As a reference, the current highest accuracy score for this dataset is 96.8.
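At the lower end, a useful sanity check (a sketch that is not in the article’s code; it uses the cross_val_score import from earlier plus scikit-learn’s DummyClassifier) is the score of a classifier that ignores the features entirely; on this roughly balanced dataset it lands around chance level, about 50%:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier()
dummy_scores = cross_val_score(dummy_clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (dummy_scores.mean(), dummy_scores.std() * 2))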
On this task, DistilBERT itself can also be trained to improve the score, a process called fine-tuning, which updates BERT’s weights so that it performs better at sentence classification.
The fine-tuned DistilBERT can achieve an accuracy of 90.7, while the complete BERT model can reach an accuracy of 94.9.
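Fine-tuning means continuing to train DistilBERT’s own weights on the classification objective rather than freezing them. The article only reports the resulting scores; a minimal, illustrative sketch (assuming the transformers class DistilBertForSequenceClassification and the padded matrix and labels built above, and omitting attention masks, shuffling, and a validation split for brevity) could look like this:
model_ft = ppb.DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = torch.optim.AdamW(model_ft.parameters(), lr=2e-5)

model_ft.train()
for epoch in range(3):
    for start in range(0, len(padded), 32):  # mini-batches of 32 sentences
        batch_ids = torch.tensor(padded[start:start + 32])
        batch_labels = torch.tensor(labels.values[start:start + 32])
        outputs = model_ft(batch_ids, labels=batch_labels)
        loss = outputs[0]  # first element is the classification loss when labels are passed
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()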
Links
A Visual Guide to Using BERT for the First Time: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
Code: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb
Colab: https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb
DistilBERT: https://medium.com/huggingface/distilbert-8cf3380435b5
— The End —