Selected from GitHub
Author: Junseong Kim
Translated by Machine Heart
Contributors: Lu Xue, Zhang Qian
Recently, Google AI published an NLP paper introducing a new language representation model, BERT, which is considered the strongest pre-trained NLP model, setting new state-of-the-art performance records on 11 NLP tasks. Today, Machine Heart discovered a PyTorch implementation of BERT on GitHub by Junseong Kim from Scatter Lab.
Introduction
The Google AI paper on BERT showcases the impressive results the model achieves across a wide range of NLP tasks, including an F1 score that exceeds human performance on the SQuAD v1.1 QA task. The paper demonstrates that a Transformer-based (self-attention) encoder can effectively replace the architectures used in earlier pre-trained language models. More importantly, it shows that this pre-trained language model can be applied to any NLP task without having to customize the model architecture for each task.
This article mainly walks through the implementation of BERT. The code is simple and easy to understand, and parts of it are based on The Annotated Transformer, an annotated implementation of the paper "Attention Is All You Need."
The project is still in progress. The code has not yet been validated.
Pre-training of Language Models
In the paper, the authors introduce a new way of pre-training language models built on two tasks: the "masked language model" (MLM) and "next sentence prediction."
Masked LM
See original paper: 3.3.1 Task #1: Masked LM
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
Rules:
15% of the input tokens are selected at random, and each selected token is handled according to the following sub-rules (a minimal sketch of the procedure follows the list):
- 80% of the selected tokens are replaced with the [MASK] token.
- 10% are replaced with a random token (another word from the vocabulary).
- 10% are left unchanged, but the model must still predict them.
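The masking procedure can be sketched in a few lines of Python. This is only an illustration of the 80/10/10 rule under assumed inputs (a plain token list and a small vocabulary), not the repository's actual code:

import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Illustrative 80/10/10 masking; returns (masked tokens, prediction labels)."""
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:          # select ~15% of tokens
            labels.append(token)                 # the model must recover the original
            roll = random.random()
            if roll < 0.8:                       # 80%: replace with [MASK]
                masked.append(mask_token)
            elif roll < 0.9:                     # 10%: replace with a random word
                masked.append(random.choice(vocab))
            else:                                # 10%: keep the original token
                masked.append(token)
        else:
            masked.append(token)
            labels.append(None)                  # not a prediction target
    return masked, labels

tokens = "the man went to the store with his dog".split()
vocab = ["penguin", "jungle", "milk", "store", "dog"]
print(mask_tokens(tokens, vocab))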
Next Sentence Prediction
See original paper: 3.3.2 Task #2: Next Sentence Prediction
Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext
Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext
The question being asked is: "Is the second sentence the actual continuation of the first?" The relationship between two sentences is not captured directly by language modeling, which is why this auxiliary task is added.
Rules (an illustrative sampling sketch follows the list):
- 50% of the time, the second sentence is the true next sentence from the corpus.
- 50% of the time, the second sentence is a randomly chosen, unrelated sentence.
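The pair sampling can be sketched similarly. The snippet below is an illustrative reconstruction, assuming the corpus has already been read into (first sentence, next sentence) pairs; it is not the project's code:

import random

def make_nsp_example(pairs, index):
    """Return (sentence_a, sentence_b, is_next) for one training example."""
    sent_a, sent_b = pairs[index]
    if random.random() < 0.5:
        return sent_a, sent_b, 1          # 50%: keep the true next sentence
    # 50%: substitute a sentence drawn at random (a real implementation would
    # also make sure it is not the true pair by coincidence)
    return sent_a, random.choice(pairs)[1], 0

pairs = [("the man went to the store", "he bought a gallon of milk"),
         ("welcome to the", "the jungle")]
print(make_nsp_example(pairs, 0))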
Usage
Note: Your corpus should contain two sentences per line, separated by a tab (\t) delimiter, for example (a small loading sketch follows these sample lines):
Welcome to the \t the jungle
I can stay here all night
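Assuming the tab-separated format above, reading the corpus into sentence pairs could look like the following sketch (the helper and file path are hypothetical, not part of the project):

def load_corpus(path, encoding="utf-8"):
    """Read one (sentence_a, sentence_b) pair per tab-separated line."""
    pairs = []
    with open(path, encoding=encoding) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs

# pairs = load_corpus("data/corpus.small")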
1. Build vocab from your own corpus
python build_vocab.py -c data/corpus.small -o data/corpus.small.vocab
usage: build_vocab.py [-h] -c CORPUS_PATH -o OUTPUT_PATH [-s VOCAB_SIZE]
[-e ENCODING] [-m MIN_FREQ]
optional arguments:
-h, --help show this help message and exit
-c CORPUS_PATH, --corpus_path CORPUS_PATH
-o OUTPUT_PATH, --output_path OUTPUT_PATH
-s VOCAB_SIZE, --vocab_size VOCAB_SIZE
-e ENCODING, --encoding ENCODING
-m MIN_FREQ, --min_freq MIN_FREQ
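Conceptually, this step counts whitespace-separated tokens, drops those below the minimum frequency, and reserves BERT's special tokens. The sketch below only illustrates the idea and is not the project's build_vocab.py; the special-token set and ordering are assumptions:

from collections import Counter

def build_vocab(corpus_path, vocab_size=None, min_freq=1, encoding="utf-8"):
    """Count tokens and keep the most frequent ones above min_freq."""
    counter = Counter()
    with open(corpus_path, encoding=encoding) as f:
        for line in f:
            counter.update(line.replace("\t", " ").split())
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    words = [w for w, c in counter.most_common(vocab_size) if c >= min_freq]
    return {token: idx for idx, token in enumerate(specials + words)}

# vocab = build_vocab("data/corpus.small", vocab_size=10000, min_freq=1)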
2. Build BERT training dataset using your own corpus
python build_dataset.py -c data/corpus.small -v data/corpus.small.vocab -o data/dataset.small
usage: build_dataset.py [-h] -v VOCAB_PATH -c CORPUS_PATH [-e ENCODING] -o
OUTPUT_PATH
optional arguments:
-h, --help show this help message and exit
-v VOCAB_PATH, --vocab_path VOCAB_PATH
-c CORPUS_PATH, --corpus_path CORPUS_PATH
-e ENCODING, --encoding ENCODING
-o OUTPUT_PATH, --output_path OUTPUT_PATH
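Each record produced in this step combines the two pre-training tasks: a (masked) token sequence wrapped in [CLS]/[SEP] markers, segment labels marking which sentence each token belongs to, and the is-next flag. Below is a minimal illustration of the record layout; the field names and the 1/2 segment values are assumptions, not the project's exact format:

def to_bert_example(sent_a_tokens, sent_b_tokens, is_next):
    """Assemble one training record: [CLS] A [SEP] B [SEP] plus segment labels."""
    tokens = ["[CLS]"] + sent_a_tokens + ["[SEP]"] + sent_b_tokens + ["[SEP]"]
    # Segment label 1 for the first sentence (including [CLS]/[SEP]), 2 for the second.
    segment = [1] * (len(sent_a_tokens) + 2) + [2] * (len(sent_b_tokens) + 1)
    return {"tokens": tokens, "segment_label": segment, "is_next": is_next}

example = to_bert_example("the man went to the store".split(),
                          "he bought a gallon of milk".split(), 1)
print(example["tokens"])
print(example["segment_label"])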
3. Train your own BERT model
python train.py -d data/dataset.small -v data/corpus.small.vocab -o output/
usage: train.py [-h] -d TRAIN_DATASET [-t TEST_DATASET] -v VOCAB_PATH -o
OUTPUT_DIR [-hs HIDDEN] [-n LAYERS] [-a ATTN_HEADS]
[-s SEQ_LEN] [-b BATCH_SIZE] [-e EPOCHS]
optional arguments:
-h, --help show this help message and exit
-d TRAIN_DATASET, --train_dataset TRAIN_DATASET
-t TEST_DATASET, --test_dataset TEST_DATASET
-v VOCAB_PATH, --vocab_path VOCAB_PATH
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
-hs HIDDEN, --hidden HIDDEN
-n LAYERS, --layers LAYERS
-a ATTN_HEADS, --attn_heads ATTN_HEADS
-s SEQ_LEN, --seq_len SEQ_LEN
-b BATCH_SIZE, --batch_size BATCH_SIZE
-e EPOCHS, --epochs EPOCHS
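The hyperparameters exposed by train.py (hidden size, number of layers, attention heads, sequence length) map directly onto a Transformer encoder. The sketch below shows how they could be wired together with stock PyTorch modules; it is not the repository's model code, and the default values are placeholders:

import torch
import torch.nn as nn

class TinyBERT(nn.Module):
    """Illustrative encoder parameterized like train.py's options."""
    def __init__(self, vocab_size, hidden=256, layers=8, attn_heads=8, seq_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.position_emb = nn.Embedding(seq_len, hidden)
        self.segment_emb = nn.Embedding(3, hidden)   # 0 = padding, 1/2 = sentence A/B
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=attn_heads,
            dim_feedforward=hidden * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.token_emb(token_ids)
             + self.position_emb(positions)
             + self.segment_emb(segment_ids))
        return self.encoder(x)

model = TinyBERT(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 64))
segments = torch.randint(1, 3, (2, 64))
print(model(tokens, segments).shape)   # torch.Size([2, 64, 256])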
Original link: https://github.com/codertimo/BERT-pytorch