Understanding BERT Principles for Beginners

Source: Machine Learning Beginners

This article is about 4,500 words long; an 8-minute read is recommended. We will explore the BERT model and understand how it works, an essential part of NLP (Natural Language Processing).

Introduction

Since Google announced BERT’s outstanding performance on 11 NLP tasks at the end of October 2018, BERT (Bidirectional Encoder Representations from Transformers) has become a hot topic in the field of NLP. In this article, we will study the BERT model and understand how it works.


Foreword
2018 can be considered the year of Natural Language Processing (NLP): we made significant advances in helping computers understand sentences in a way that captures the latent semantic relationships between words. In addition, open-source communities in the NLP field released many powerful components that we can freely download and use in our own model training. (This can be seen as NLP’s ImageNet moment, echoing the way computer vision developed a few years ago.)


The latest release, BERT, is a milestone model for NLP tasks, and its release is bound to usher in a new era for NLP. BERT is an algorithmic model that has broken records on many natural language processing tasks. Shortly after the BERT paper was published, Google’s research team also open-sourced the model’s code and provided pre-trained models, trained on large datasets, for download. Google’s open-sourcing of the code and pre-trained models lets anyone build an NLP system on top of BERT, saving the time, effort, expertise, and compute that training a language model from scratch would require.


BERT integrates several of the top recent ideas in the NLP field, including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al.).
To properly understand BERT, there are a number of concepts you need to be aware of. Before introducing the concepts involved in the model itself, let’s first look at the ways BERT can be used.
Example: Sentence Classification
The simplest way to use BERT is to create a text classification model, as shown in the structure below:


To train such a model (mainly to train the classifier), the BERT model itself needs only minimal changes during the training phase. This training process is called fine-tuning and has its roots in Semi-supervised Sequence Learning and ULMFiT.
To make this easier to understand, let’s walk through a classifier example. A classifier belongs to supervised learning, which means you need labeled data to train it. For a spam classifier, the labeled dataset consists of two parts: the content of each email and its category (either “spam” or “not spam”).
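As a concrete (if hedged) starting point, here is a minimal sketch of what such a fine-tuning setup can look like. It assumes the Hugging Face PyTorch implementation mentioned at the end of this article (today packaged as `transformers`); the example emails and labels are invented purely for illustration.

```python
# A minimal sketch of fine-tuning BERT as a spam classifier. Assumes the
# Hugging Face `transformers` package (the successor of the
# pytorch-pretrained-BERT repo linked at the end of this article);
# the emails and labels are made up for illustration.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

emails = ["Win a free iPhone now!!!", "Meeting moved to 3pm tomorrow."]
labels = torch.tensor([1, 0])  # 1 = spam, 0 = not spam

# Tokenize into WordPieces, pad to equal length, and build attention masks.
batch = tokenizer(emails, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: forward pass, cross-entropy loss, gradient update.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()

print(outputs.logits.shape)  # torch.Size([2, 2]): one spam/not-spam score pair per email
```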


Other examples of use cases include:
Sentiment Analysis
Input: Movie/Product reviews. Output: Is the review positive or negative?
Example dataset: SST
Fact Verification
Input: Sentence. Output: “Claim” or “No Claim”
A more ambitious/futuristic example: Input: Claim sentence. Output: “True” or “False”
Model Architecture
Now that you have seen an example of how BERT can be used, let’s take a closer look at how it works.


The BERT paper introduces two versions:
  • BERT BASE – Comparable in size to the OpenAI Transformer for performance comparison
  • BERT LARGE – A very large model that achieves state-of-the-art results presented in this paper.
The basic building block of BERT is the Encoder of the Transformer. For an introduction to the Transformer, you can read the author’s previous article: The Illustrated Transformer, which explains the basic concepts of the Transformer model – the foundation of BERT and the concepts we will discuss next.


Both BERT models have a large number of encoder layers (referred to as Transformer Blocks in the paper): the Base version has 12 layers, while the Large version has 24. They also have larger feed-forward networks (768 and 1024 hidden units, respectively) and more attention heads (12 and 16, respectively). This exceeds the reference configuration in the Transformer paper (6 encoder layers, 512 hidden units, and 8 attention heads).
Model Input

The first token of the input is a special [CLS] character, where CLS simply stands for Classification.

BERT encodes its input the same way a Transformer encoder stack does. It takes a sequence of tokens (up to a fixed maximum length) as input; the data flows up through the layers, with each layer applying self-attention and passing its result through a feed-forward network before handing it to the next encoder.
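To make the flow of data through one layer concrete, here is a simplified PyTorch sketch of a single encoder block: self-attention followed by a feed-forward network, with the residual connections and layer normalization the real implementation also uses. This is a toy illustration, not the code from modeling.py.

```python
# A simplified sketch of one encoder block (self-attention + feed-forward).
# The real implementation lives in modeling.py of the BERT repo and adds
# dropout and other details.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, hidden=768, heads=12, ffn=3072):  # BERT-Base sizes
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

    def forward(self, x):
        # Every position attends to every other position (bidirectional).
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)     # residual connection + LayerNorm
        x = self.norm2(x + self.ffn(x))  # feed-forward sub-layer
        return x

tokens = torch.randn(1, 10, 768)        # (batch, sequence length, hidden size)
print(EncoderBlock()(tokens).shape)     # torch.Size([1, 10, 768])
```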


This architecture seems to follow the Transformer architecture (except for the number of layers, which is a parameter we can set). So what is the difference between BERT and the Transformer? Perhaps we can find some clues in the model’s output.
Model Output
The output returned for each position is a vector of hidden layer size (768 for the base version of BERT). For text classification, we focus on the output at the first position (the first position is the classification identifier [CLS]). As shown below:


This vector can now be used as input for the classifier of our choice. The paper states that using a single-layer neural network as a classifier can achieve good results. The principle is as follows:


In this example there are only two categories, spam and not spam; if you have more labels, you simply increase the number of output neurons and apply softmax over them.
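A minimal sketch of that classification head (illustrative only; the exact head in run_classifier.py differs in detail): the vector at the [CLS] position passes through a single linear layer, and softmax turns the scores into class probabilities.

```python
# A sketch of the classification head described above: the 768-dimensional
# [CLS] vector goes through a single linear layer, and softmax turns the
# scores into class probabilities. The class counts are illustrative.
import torch
import torch.nn as nn

cls_vector = torch.randn(1, 768)        # output of BERT-Base at position 0 ([CLS])

classifier = nn.Linear(768, 2)          # 2 outputs: spam / not spam
probs = torch.softmax(classifier(cls_vector), dim=-1)
print(probs)                            # e.g. tensor([[0.31, 0.69]])

# With more labels, only the output size changes:
multi_classifier = nn.Linear(768, 5)    # e.g. 5 review categories
```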
Parallels with Convolutional Nets (BERT vs. CNNs)
For those with a background in computer vision, this hand-off of a vector should remind you of what happens between the convolutional part of a network such as VGGNet and the fully connected classification layers at the end. Thinking of it this way is quite convenient.
A New Era of Word Embeddings
The open-sourcing of BERT brings with it an update to how words are represented. Word embeddings have long been a major component of NLP models for processing natural language; methods such as Word2vec and GloVe have been widely used for this purpose. Before we turn to the new embeddings, it is worth reviewing how they developed.
A Review of Word Embeddings
To allow machines to learn the feature attributes of text, we need some way to represent text in a numerical form.
The Word2vec algorithm represents each word with a fixed-dimensional vector, computed in a way that captures the semantics of words and the relationships between them. Vectorized Word2vec representations can tell us whether words are similar or opposite, or that the relationship between “man” and “woman” mirrors that between “king” and “queen” (the classic example you have probably heard many times). They can also capture grammatical relationships, which is very useful in English: for example, the relationship between “had” and “has” is similar to that between “was” and “is”.
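If you want to reproduce these relationships yourself, here is a small sketch using the gensim library and its downloadable GloVe vectors; both are assumptions of this example and are not mentioned in the original article.

```python
# A quick way to inspect word-vector analogies, assuming the gensim library
# and its downloadable GloVe vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# "had" is to "has" roughly as "was" is to "is"
print(vectors.most_similar(positive=["was", "has"], negative=["had"], topn=1))
```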
With this approach, we can use a large amount of text data to pre-train a word embedding model, which can then be widely used for other NLP tasks. This is a good idea, as it allows some startups or companies with insufficient computational resources to complete NLP tasks by downloading already open-sourced word embedding models.
ELMo: The Contextual Issue
The word embedding methods introduced above have a significant problem: because they use pre-trained, fixed word vectors, every word gets a single vector regardless of its context. “Wait a minute,” said a number of NLP researchers (Peters et al., 2017, McCann et al., 2017, and again Peters et al., 2018 in the ELMo paper).
This is similar to polyphonic characters in Chinese, whose pronunciation and meaning depend on context. For example, the character ‘长’ means length in ‘长度’ (length) but to grow in ‘长高’ (grow taller). So why not determine its pronunciation and meaning from the surrounding context, such as whether it appears next to “degree” or “height”? This question gives rise to contextualized word embeddings.


Instead of fixing each word to a vector of a specified length the way Word2vec does, ELMo looks at the entire sentence before assigning each word its vector, using a bidirectional LSTM trained on a language-modeling objective to produce those word vectors.
ELMo made a significant contribution to solving the contextual problem in NLP: its LSTM is trained on a massive amount of text, and the trained model can then be used as a source of word vectors for other NLP tasks.
What is the Secret of ELMo?
ELMo trains a model that takes a sequence of words as input and outputs the word most likely to come next. Think of the next-word suggestions in your phone’s input method; it works on the same principle. In NLP this is called Language Modeling. Such a model is easy to train because we have huge amounts of text and need no labels.
Part of ELMo’s pre-training amounts to the following task: given the input “Let’s stick to”, predict the most likely next word. Trained on a large enough dataset, the model will usually predict the word we expect. For example, given the input ‘机器’ (machine), it is far more likely to predict ‘学习’ (learning, completing 机器学习, “machine learning”) than ‘买菜’ (grocery shopping).
Each unrolled LSTM step produces a hidden state, and the prediction is made from the hidden state of the final step.
Moreover, the real ELMo goes a step further: it trains a bidirectional LSTM (Bi-LSTM), so it models not only the next word but also the previous one.


ELMo produces a contextually meaningful word embedding by grouping together the hidden states of its layers and the initial embedding: the forward and backward states are concatenated, and the layers are then combined by a learned weighted summation.
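Here is a simplified PyTorch sketch of that combination step, a toy version of the idea rather than AllenNLP’s actual implementation: the per-layer states are weighted by learned, softmax-normalized scalars and summed.

```python
# A simplified sketch of how ELMo combines layers: hidden states from the
# embedding layer and each bi-LSTM layer are weighted by learned,
# softmax-normalized scalars and summed (plus a global scale).
import torch
import torch.nn as nn

num_layers, seq_len, dim = 3, 6, 1024      # embedding layer + 2 bi-LSTM layers
layer_states = torch.randn(num_layers, seq_len, dim)

weights = nn.Parameter(torch.zeros(num_layers))   # learned per-layer weights
gamma = nn.Parameter(torch.ones(1))               # learned global scale

mix = torch.softmax(weights, dim=0)               # normalize across layers
elmo_embedding = gamma * (mix[:, None, None] * layer_states).sum(dim=0)
print(elmo_embedding.shape)                       # torch.Size([6, 1024])
```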


ULM-FiT: Transfer Learning Applications in NLP
The ULM-FiT mechanism makes better use of what the model learns during pre-training: not just the embeddings, and not just contextualized embeddings, but the language model as a whole. ULM-FiT introduces a language model and an effective process for fine-tuning that language model for various NLP tasks. This finally gives NLP the kind of convenient transfer learning that computer vision has long enjoyed.
The Transformer: Beyond LSTM Structures
The release of the Transformer paper and code, along with the excellent results it achieved on tasks such as machine translation, led some researchers to see it as a replacement for the LSTM; the Transformer also handles long-term dependencies better than the LSTM does. Its encoder-decoder structure is well suited to machine translation, but how would you use it for a task like text classification? The answer: use it to pre-train a language model that can then be fine-tuned for other tasks.
OpenAI Transformer: Transformer Decoder Pre-training for Language Models
It turns out that we do not need the complete Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks; the decoder of the Transformer is enough. The decoder is a natural fit for language modeling (predicting the next word) because it is built to mask future tokens, a valuable feature when generating a translation word by word.


This model stacks twelve Decoder layers. Since there is no Encoder in this setup, these Decoders will not have the Encoder-Decoder attention layer that Transformer Decoder layers possess. Instead, there is a self-attention layer (masked so it doesn’t peek at future tokens).
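A toy illustration of that future-token masking (a sketch of the general idea, not the OpenAI code): before the softmax, attention scores for positions to the right of the current token are set to negative infinity, so they receive zero attention weight.

```python
# A toy illustration of the decoder-style mask: position i may only attend
# to positions <= i, so the model cannot "peek" at the words it must predict.
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                 # raw attention scores

# Upper-triangular positions (future tokens) get -inf before softmax,
# so they receive zero attention weight.
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(future, float("-inf"))
attention = torch.softmax(masked_scores, dim=-1)
print(attention[0])   # the first token can only attend to itself
```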
With this structural adjustment, we can continue to train the model on a language-modeling task: predicting the next word, using large amounts of unlabeled data. For example, the OpenAI transformer was pre-trained on roughly 7,000 books. Books are excellent training data, since they let the model learn to associate information separated by long stretches of text, something blogs and tweets rarely provide.


Transfer Learning to Downstream Tasks
Through the pre-training of OpenAI’s transformer and some fine-tuning, we can use the trained model for other downstream NLP tasks (for example, training a language model and then using its hidden state for classification). Below we introduce this operation (still using the above example: distinguishing between spam and non-spam).
The OpenAI paper outlines many examples of using transformers with transfer learning to handle different types of NLP tasks. As shown in the example below:


BERT: From Decoders to Encoders
The OpenAI transformer gave us a fine-tunable pre-trained model based on the Transformer. But something was lost in the move from LSTMs to Transformers: ELMo’s language model is bidirectional, while the OpenAI transformer trains only a forward (left-to-right) language model. Can we build a Transformer model whose language model looks in both directions, as a Bi-LSTM does?
BERT: “Hold my beer”
Masked Language Model
BERT says: “I will use the encoders of the transformer”
Ernie scoffs: “Heh, but then you can’t look at the text from both directions the way a Bi-LSTM can”
BERT confidently replies: “We will use masks”
To explain the Mask:
A language model normally predicts the next word from the words before it. In a bidirectional encoder, however, every position can see the word it is supposed to predict through self-attention, which would make the prediction trivial. BERT therefore randomly masks a portion of the input words (about 15%) and trains the model to predict those masked words. A minimal sketch of the masking step follows below:
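A toy sketch of the masking step (illustrative only; the real recipe also replaces some of the selected tokens with random words or leaves them unchanged):

```python
# A toy sketch of masked language modeling: about 15% of the token positions
# are selected, replaced with a [MASK] token, and the model is trained to
# predict the original words at exactly those positions.
import random

tokens = ["let's", "stick", "to", "improvisation", "in", "this", "skit"]
masked = list(tokens)
targets = {}

for i, tok in enumerate(tokens):
    if random.random() < 0.15:           # select ~15% of positions
        targets[i] = tok                 # the model must recover this word
        masked[i] = "[MASK]"

print(masked)   # e.g. ["let's", "stick", "to", "[MASK]", "in", "this", "skit"]
print(targets)  # e.g. {3: 'improvisation'}
```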


Two-sentence Tasks
Let’s review how the OpenAI transformer handles input transformations for different tasks. You will find that for certain tasks, we need two sentences as input and make some more intelligent judgments, such as whether they are similar. For instance, if we input a Wikipedia entry along with a question about that entry, can our algorithm model handle this question?
To teach BERT how to handle relationships between two sentences, the pre-training process includes an additional task: given two sentences A and B, is B the sentence that actually follows A, or not? (A binary 0/1 label, known as next-sentence prediction.)
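For concreteness, here is how a sentence pair is packed into a single BERT input, using the same assumed Hugging Face tokenizer as in the earlier classification sketch: [CLS], sentence A, [SEP], sentence B, [SEP], plus segment ids that mark which sentence each token belongs to.

```python
# How two sentences are packed into one BERT input: [CLS] A [SEP] B [SEP],
# plus segment (token type) ids. Same assumed tokenizer as in the earlier sketch.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(enc["token_type_ids"])   # 0s for sentence A, 1s for sentence B
```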


Special NLP Tasks
The BERT paper demonstrates several types of NLP tasks that BERT can handle:
  1. Sentence-pair classification (e.g., are two sentences similar or paraphrases?)
  2. Single-sentence classification
  3. Question answering (e.g., SQuAD)
  4. Single-sentence tagging (e.g., named entity recognition)


BERT as Feature Extractor
Fine-tuning is not the only way to use BERT. Just like ELMo, you can use pre-trained BERT to create contextualized word embeddings and then feed those embeddings to your existing models.
Which vector works best as a contextual embedding? I think it depends on the task. The paper examines six choices (compared with the fine-tuned model, which scores 96.4 on the named-entity-recognition task used for the comparison); concatenating the last four hidden layers turns out to work best among them.
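Here is a hedged sketch of the feature-extractor approach, again assuming the Hugging Face `transformers` package rather than the original repository: run BERT, collect the hidden states of every layer, and build a contextual embedding per token, for example by concatenating the last four layers.

```python
# A sketch of using BERT as a feature extractor: keep the hidden states of
# every layer and concatenate the last four as per-token embeddings.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Help me find a new apartment", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # 13 tensors: embeddings + 12 layers

# One option from the comparison above: concatenate the last four layers.
features = torch.cat(hidden_states[-4:], dim=-1)
print(features.shape)   # (1, number of WordPieces, 4 * 768)
```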
How to Use BERT

The best way to try out BERT is through the “BERT FineTuning with Cloud TPUs” notebook hosted on Google Colab:

(https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb).

If you haven’t used a Cloud TPU before, this is also a good opportunity to try one. Also, the BERT code runs on TPUs, CPUs, and GPUs alike.
The next step is to check the code in the BERT repository:
1. The model is constructed in modeling.py (the BertModel class) and is pretty much identical to a vanilla Transformer encoder.
https://github.com/google-research/bert/blob/master/modeling.py
2. run_classifier.py is an example of the fine-tuning process.
It also constructs the classification layer for the supervised model.
https://github.com/google-research/bert/blob/master/run_classifier.py
If you want to build your own classifier, check the create_model() method in that file.
3. You can download several pre-trained models, ranging from BERT Base to BERT Large, and from English and Chinese to a multilingual model covering 102 languages trained on Wikipedia data.
4. BERT does not treat words as tokens. Instead, it operates on WordPieces. tokenization.py is the tokenizer that converts your words into WordPieces suitable for BERT (see the sketch after the link below).
https://github.com/google-research/bert/blob/master/tokenization.py
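As a quick illustration of WordPiece tokenization, here is a sketch that assumes you have cloned the repository and downloaded the uncased BERT-Base checkpoint (whose folder contains the vocab.txt used below); tokenization.py itself also needs TensorFlow installed for its file I/O.

```python
# A sketch of WordPiece tokenization with the repo's tokenization.py,
# assuming google-research/bert has been cloned and the uncased BERT-Base
# checkpoint (containing vocab.txt) has been downloaded.
import tokenization  # from the cloned BERT repository

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("Here is the sentence I want embeddings for")
print(tokens)
# Rare words are split into WordPieces, e.g. "embeddings" becomes
# ['em', '##bed', '##ding', '##s']
ids = tokenizer.convert_tokens_to_ids(tokens)
```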
You can also check the PyTorch implementation of BERT.
https://github.com/huggingface/pytorch-pretrained-BERT
The AllenNLP library uses this implementation to allow BERT embeddings to be used with any model.
https://github.com/allenai/allennlp
https://github.com/allenai/allennlp/pull/2067
Author: jinjiajia95
Source:
https://blog.csdn.net/weixin_40746796/article/details/89951967
Original Author: Jay Alammar
Original Link:
https://jalammar.github.io/illustrated-bert/
Editor: Yu Tengkai
Proofreader: Lin Yilin
