BERT Paper Notes

Author: Prince Changqin (NLP Algorithm Engineer)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (paper notes)

Paper: https://arxiv.org/pdf/1810.04805.pdf

Code: https://github.com/google-research/bert

The core idea of BERT: a masked LM that exploits bidirectional context, plus multi-task pre-training (masked LM + next sentence prediction).

Abstract

BERT learns deep bidirectional representations of text by jointly conditioning on both left and right context across all layers.

Introduction

Two methods to apply pre-trained models to downstream tasks:

  • feature-based: for example, ELMo, using pre-trained representations as additional features

  • fine-tuning: for example, OpenAI GPT, which introduces minimal task-specific parameters and fine-tunes all pre-trained parameters on the downstream task

Existing techniques restrict pre-trained language models to a unidirectional (left-to-right) architecture, which is suboptimal for tasks such as question answering, where context from both directions matters.

BERT mitigates this limitation with a “masked language model” (MLM) objective: some tokens in the input are randomly masked, and the goal is to predict the original vocabulary id of each masked token using only its context, both left and right.

Additionally, a “next sentence prediction” task is jointly trained to represent text pairs.

Related Work

Unsupervised Feature-based Approaches

  • Word Embedding:

    • Left-to-right language model (Mnih and Hinton, 2009)

    • Discriminating correct words from incorrect ones given their left and right context (Mikolov et al., 2013)

  • Sentence Embedding:

    • Ranking a candidate set of next sentences (Jernite, 2017; Logeswaran and Lee, 2018)

    • Left-to-right generation of the next sentence’s words given a representation of the previous sentence (Kiros et al., 2015)

    • Denoising autoencoder (Hill, 2016)

ELMo uses both a left-to-right and a right-to-left language model; each token’s contextual representation is the concatenation of the representations from the two directions.

Melamud (2016) learns context representations through a task predicting a single word from left and right context (using LSTMs).

Unsupervised Fine-tuning Approaches

  • Pre-training only word-embedding parameters from unlabeled text (Collobert and Weston, 2008)

  • Training context representations from text and fine-tuning in downstream tasks (Dai and Le, 2015; Howard and Ruder, 2018; Radford, 2018)

  • OpenAI GPT (Radford, 2018)

Transfer Learning from Supervised Data

Some effective transfer learning:

  • Natural Language Inference

  • Machine Translation

BERT

Two steps:

  • pre-training: trained on unlabeled data with two unsupervised tasks (masked LM and next sentence prediction)

  • fine-tuning: initialize with pre-trained parameters, then refine on supervised tasks

Model Architecture

  • Multi-layer bidirectional Transformer encoder

  • Number of layers (Transformer blocks): L (Base 12, Large 24)

  • Hidden size: H (Base 768, Large 1024)

  • Number of self-attention heads: A (Base 12, Large 16); these sizes yield roughly 110M (Base) and 340M (Large) total parameters, as sketched below
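
As a sanity check on these sizes, here is a rough back-of-the-envelope parameter count (an approximation of my own, not a formula from the paper): embeddings contribute about V·H parameters, and each Transformer block about 12·H².

```python
# Rough parameter-count estimate for BERT-Base and BERT-Large.
# Assumptions: WordPiece vocabulary V = 30522; each block has ~4*H^2 parameters
# in the attention projections and ~8*H^2 in the feed-forward sublayer
# (intermediate size 4H); biases, LayerNorm, and position/segment embeddings
# are ignored, so the numbers are approximate.
V = 30522

def approx_params(num_layers: int, hidden: int) -> int:
    return V * hidden + num_layers * 12 * hidden * hidden

print(f"BERT-Base  ~{approx_params(12, 768) / 1e6:.0f}M params")   # ~108M; paper reports 110M
print(f"BERT-Large ~{approx_params(24, 1024) / 1e6:.0f}M params")  # ~333M; paper reports 340M
```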

Input/Output Representations

  • Input sentence pairs (for example, “Question, Answer”) are packed into a single token sequence; here a “sentence” is an arbitrary span of contiguous text, not necessarily a linguistic sentence

  • Uses WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary

  • The final hidden state of the special [CLS] token is used as the aggregate representation of the whole sequence (for classification tasks)

  • Sentence pairs are differentiated in two ways, used together: a special token ([SEP]) separates the two sentences, and a learned segment embedding added to every token marks whether it belongs to sentence A or sentence B

  • The input representation of each token is the sum of its token, segment, and position embeddings (see the sketch below)
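
A minimal PyTorch sketch of this summed input representation (sizes and token ids are illustrative; the real model also applies LayerNorm and dropout to the sum):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)     # WordPiece token embeddings
segment_emb = nn.Embedding(2, hidden)            # sentence A = 0, sentence B = 1
position_emb = nn.Embedding(max_len, hidden)     # learned absolute positions

def embed(input_ids, segment_ids):
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    return token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)

# "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]"  (ids are illustrative)
input_ids = torch.tensor([[101, 2026, 3899, 2003, 10140, 102, 2002, 7777, 2652, 2075, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
print(embed(input_ids, segment_ids).shape)       # torch.Size([1, 11, 768])
```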

Pre-training BERT

Instead of a conventional left-to-right (or right-to-left) language model, BERT is pre-trained with two unsupervised tasks.

Task #1: Masked LM (MLM)

Randomly mask a certain percentage (15%) of the input tokens and predict those tokens; the final hidden vectors at the masked positions are fed into a softmax over the vocabulary.

Only the masked tokens are predicted; the model does not reconstruct the entire input. A minimal sketch of this output layer follows.
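
A sketch of the MLM output layer, assuming PyTorch (shapes and the -100 label convention are my own illustration; the released model additionally applies a small transform and ties the output weights to the input embedding matrix):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab_size = 768, 30522
mlm_head = nn.Linear(hidden, vocab_size)                # projects hidden states onto the vocabulary

final_hidden = torch.randn(2, 128, hidden)              # (batch, seq_len, hidden) from the encoder
labels = torch.full((2, 128), -100, dtype=torch.long)   # -100 = position not masked, contributes no loss
labels[0, 5], labels[1, 17] = 2003, 3899                # original ids of the masked tokens

logits = mlm_head(final_hidden)                         # (batch, seq_len, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size),     # loss only at masked positions
                       labels.view(-1), ignore_index=-100)
print(loss.item())
```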

However, [MASK] never appears during fine-tuning, which creates a mismatch between pre-training and fine-tuning. So the tokens selected for masking are not all simply replaced with [MASK] (see the sketch after this list):

  • 80% replaced with [MASK]

  • 10% choose a random token

  • 10% still use the original token
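
A minimal sketch of this 80/10/10 replacement rule (simplified to a plain token list; a real implementation would skip special tokens like [CLS] and [SEP] and sample random replacements from the full WordPiece vocabulary):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "dog", "cute", "he", "likes", "play"]    # tiny stand-in vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    tokens, labels = list(tokens), [None] * len(tokens)
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            labels[i] = tokens[i]                        # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                tokens[i] = MASK                         # 80%: replace with [MASK]
            elif r < 0.9:
                tokens[i] = random.choice(VOCAB)         # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return tokens, labels

print(mask_tokens(["my", "dog", "is", "cute", "he", "likes", "playing"]))
```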

Task #2: Next Sentence Prediction (NSP)

For each pre-training example, sentence B is the actual next sentence 50% of the time (labeled IsNext) and a random sentence from the corpus the other 50% (labeled NotNext).
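
A sketch of how such pairs could be drawn from a document-level corpus (function and variable names are my own; the actual data pipeline differs in details such as packing sequences up to 512 tokens):

```python
import random

def make_nsp_example(documents, doc_idx, sent_idx):
    """documents: list of documents, each a list of sentences (strings)."""
    sent_a = documents[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(documents[doc_idx]):
        sent_b, label = documents[doc_idx][sent_idx + 1], "IsNext"      # actual next sentence
    else:
        rand_doc = random.randrange(len(documents))                     # ideally a *different* document
        sent_b, label = random.choice(documents[rand_doc]), "NotNext"   # random sentence
    return sent_a, sent_b, label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(docs, 0, 0))
```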

Pre-training data

Uses document-level corpora (BooksCorpus and English Wikipedia) rather than a shuffled sentence-level corpus, so that long contiguous sequences can be extracted.

Fine-tuning BERT

The sentence A and sentence B slots from pre-training map to different input pairs depending on the downstream task (see the sketch after this list):

  • Paraphrase: sentence pairs

  • Entailment: hypothesis-premise pairs

  • Question Answering: question-paragraph pairs

  • Text classification, sequence labeling: a degenerate text-∅ pair (sentence B is empty)
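
To make the pairing concrete, here is how these inputs could be packed into the [CLS] A [SEP] B [SEP] format using the Hugging Face transformers tokenizer (not part of the paper or its TensorFlow code; shown only as an illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

qa  = tok("Who wrote the paper?", "The paper was written by Devlin et al.")  # question-paragraph pair
nli = tok("A man is eating.", "Someone is having a meal.")                   # premise-hypothesis pair
clf = tok("This movie was great.")                                           # single text, B slot empty

# Each encoding starts with [CLS], separates/ends segments with [SEP],
# and token_type_ids marks sentence A (0) vs. sentence B (1).
print(tok.convert_ids_to_tokens(qa["input_ids"]))
print(qa["token_type_ids"])
```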

Experiments

  • GLUE: A set of natural language understanding tasks.

  • SQuAD v1.1: Question answering task.

  • SQuAD v2.0: Question answering task that also includes questions with no answer in the given paragraph.

  • SWAG: Grounded common-sense inference task (choose the most plausible continuation among four candidates).

Ablation Studies

Effect of Pre-training Tasks

Effect of Model Size

Feature-based Approach with BERT

Related Articles:

  • ERNIE Tutorial (Paper Notes + Practical Guide)

  • BERT_Paper_Chinese_Translation: BERT Paper Chinese Translation Version

Original link:

https://yam.gift/2019/08/05/Paper/2019-08-05-Bert-Paper/

This article was originally published on the AINLP public account with the author’s authorization.
