The application of Bert in production environments requires compression, which demands a deep understanding of the Bert structure. This repository will interpret the Bert source code (PyTorch version) step by step. The repository can be found at
https://github.com/DA-southampton/NLP_ability
Code and Data Introduction
First, for the code, I referenced this repository.
I directly cloned the code and renamed it to bert_read_step_to_step in my repository.
I will use this code to run the Bert text classification code step by step while recording various details, including my own implementations.
Before running, two things need to be done.
Preparing the Pre-trained Model
The first is preparing the pre-trained model. I am using Google's Chinese pre-trained model, chinese_L-12_H-768_A-12.zip. The model is quite large, so I have not uploaded it. If you do not have it locally, click here to download it directly, or run the following command:
wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
After downloading the pre-trained model, unzip it, and convert the TensorFlow model to the corresponding PyTorch version. The corresponding code is as follows:
export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12
python convert_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
--bert_config_file $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin
After successful conversion, place the model in the corresponding location in the repository:
Read_Bert_Code/bert_read_step_to_step/prev_trained_model/
and rename it to:
bert-base-chinese
Preparing Text Classification Training Data
The second task is preparing the training data. I am doing a text classification task on the Tnews dataset, which can be found here and is split into training, test, and development sets. I have uploaded it to the repository, specifically located at
Read_Bert_Code/bert_read_step_to_step/chineseGLUEdatasets/tnews
It is important to note that since I am only trying to understand the internal code, accuracy is not my primary concern, so I am only using a portion of the data, with 1k for training, 1k for testing, and 1k for development.
Once prepared, import the project using PyCharm and get ready for debugging. My debugging file is run_classifier.py, with the following parameters:
--model_type=bert --model_name_or_path=prev_trained_model/bert-base-chinese --task_name="tnews" --do_train --do_eval --do_lower_case --data_dir=./chineseGLUEdatasets/tnews --max_seq_length=128 --per_gpu_train_batch_size=16 --per_gpu_eval_batch_size=16 --learning_rate=2e-5 --num_train_epochs=4.0 --logging_steps=100 --save_steps=100 --output_dir=./outputs/tnews_output/ --overwrite_output_dir
Then debug run_classifier.py, and I will detail the debugging steps below.
1. Entering the Main Function
First, set a breakpoint at the main function location, which is here, and then check the main function’s status:
## Set breakpoint at the main function
if __name__ == "__main__":
    main()  ## step into the main function from here
2. Parsing Command Line Parameters
This section parses the command-line parameters: model name, model path, whether to train or evaluate, and so on. It is standard practice and quite straightforward, so we can skip over it.
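For reference, the parsing boils down to standard argparse calls along these lines; only a handful of the repo's flags are shown here, and the defaults are illustrative rather than the actual ones.

import argparse

parser = argparse.ArgumentParser()
# model / task selection (mirrors the flags passed on the command line above)
parser.add_argument("--model_type", type=str, required=True)
parser.add_argument("--model_name_or_path", type=str, required=True)
parser.add_argument("--task_name", type=str, required=True)
# switches and hyperparameters
parser.add_argument("--do_train", action="store_true")
parser.add_argument("--do_eval", action="store_true")
parser.add_argument("--max_seq_length", type=int, default=128)
parser.add_argument("--learning_rate", type=float, default=5e-5)
args = parser.parse_args()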
3. Checking Various Conditions
This section involves some standard checks:
Checking if the output folder exists
Checking if remote debugging is needed
Checking whether to train on CPU, on a single machine with one or more GPUs, or distributed across multiple machines. This is controlled by two parameters, args.local_rank and args.no_cuda.
Specific code can be seen below:
if args.local_rank == -1 or args.no_cuda:
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    args.n_gpu = torch.cuda.device_count()
else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend='nccl')
    args.n_gpu = 1
4. Getting the Corresponding Processor for the Task
Next we get the processor for the task: a class we need to define to handle our own input files. The code is as follows:
processor = processors[args.task_name]()
Here we are using the following class:
TnewsProcessor(DataProcessor)
The specific code can be found here.
4.1 TnewsProcessor
Let’s analyze the TnewsProcessor closely. First, it inherits from DataProcessor.
## DataProcessor's position in the entire project: processors.utils.DataProcessor
class DataProcessor(object):
    def get_train_examples(self, data_dir):
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        raise NotImplementedError()

    def get_labels(self):
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab-separated value file."""
        with open(input_file, "r", encoding="utf-8-sig") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines

    @classmethod
    def _read_txt(cls, input_file):
        """Reads a txt file whose fields are separated by '_!_'."""
        with open(input_file, "r") as f:
            reader = f.readlines()
            lines = []
            for line in reader:
                lines.append(line.strip().split("_!_"))
            return lines
It contains five methods: getting training examples, getting development examples, getting the labels, and two helpers that read TSV files and '_!_'-separated TXT files.
Next, let’s take a look at the TnewsProcessor code format:
class TnewsProcessor(DataProcessor):
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_txt(os.path.join(data_dir, "toutiao_category_train.txt")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_txt(os.path.join(data_dir, "toutiao_category_dev.txt")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_txt(os.path.join(data_dir, "toutiao_category_test.txt")), "test")

    def get_labels(self):
        """See base class."""
        labels = []
        for i in range(17):
            if i == 5 or i == 11:
                continue
            labels.append(str(100 + i))
        return labels

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            if set_type == 'test':
                label = '0'
            else:
                label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
A reminder here: if you want to use your own training data, there are two options. The first is to convert your data to the same format as the Tnews files read above; the second is to modify the processor here so that it reads your own data format.
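For the second option, a processor for your own data might look like the sketch below. MyProcessor, the file names, and the two-column (label, text) layout are illustrative assumptions, not part of the repo; the class simply follows the TnewsProcessor pattern above.

import os

class MyProcessor(DataProcessor):
    """Hypothetical processor for a tab-separated file: label <tab> text."""

    def get_train_examples(self, data_dir):
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "my_train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "my_dev.tsv")), "dev")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            examples.append(InputExample(guid=guid, text_a=line[1], text_b=None, label=line[0]))
        return examples

You would also need to register such a class in the processors dictionary so that processors[args.task_name]() can find it.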
5. Loading the Pre-trained Model
The code here is quite simple: it just loads the pre-trained config, tokenizer, and model, so I will not go into detail.
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name)
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
6. Training the Model – The Most Important Part
Training the model, as seen from the main function, involves two steps: loading the required dataset and performing training. The code is located here. The general code looks like this:
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, data_type='train')
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
Let’s look at these two functions one by one:
6.1 Loading the Training Set
First, let’s look at the first function, load_and_cache_examples, which loads the training dataset. The code is located here. Let’s take a quick look at this code, which has three core operations.
The first core operation is as follows:
examples = processor.get_train_examples(args.data_dir)
This line uses the processor to read the training set; it is very simple. The returned examples look like this (the format matches the InputExample built in _create_examples above):
guid='train-0'
label='104'
text_a='The stock market is not doing well today'
text_b=None
The second core operation is convert_examples_to_features, which transforms the data, and it is also quite simple.
The code is located here. The code is as follows:
features = convert_examples_to_features(examples,
                                        tokenizer,
                                        label_list=label_list,
                                        max_length=args.max_seq_length,
                                        output_mode=output_mode,
                                        pad_on_left=bool(args.model_type in ['xlnet']),
                                        pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                                        pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)
Let’s look inside this function to see what’s going on. The location is:
processors.glue.glue_convert_examples_to_features
It performs label mapping, ‘100’ -> 0, ‘101’ -> 1…
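Concretely, that mapping is just label string to index, along these lines (a sketch of the idea rather than the exact library code):

label_list = processor.get_labels()                    # ['100', '101', '102', ...]
label_map = {label: i for i, label in enumerate(label_list)}
# label_map['100'] == 0, label_map['101'] == 1, ...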
Next, it obtains the serialized representation of the input text: input_ids, token_type_ids; the form is roughly as follows:
‘input_ids’=[101, 5500, 4873, 704, 4638, 4960, 4788, 2501, 2578, 102]
‘token_type_ids’=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
It calculates the current length and obtains the padding length. For example, if our length is 10, it needs to be padded to 128, requiring 118 zeros.
At this point, input_ids becomes the list above with 118 zeros appended (reaching length 128), attention_mask becomes ten 1s followed by 118 zeros, and since there is no second sentence, token_type_ids is simply 128 zeros.
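As a concrete illustration, here is the padding step for the 10-token example above, assuming max_seq_length=128 and pad token id 0 (a sketch, not the exact library code):

input_ids = [101, 5500, 4873, 704, 4638, 4960, 4788, 2501, 2578, 102]
attention_mask = [1] * len(input_ids)          # 1 marks real tokens
token_type_ids = [0] * len(input_ids)          # single sentence -> all zeros

padding_length = 128 - len(input_ids)          # 118
input_ids      = input_ids + [0] * padding_length       # 0 is the [PAD] id for bert-base-chinese
attention_mask = attention_mask + [0] * padding_length  # 0 marks padded positions
token_type_ids = token_type_ids + [0] * padding_length

assert len(input_ids) == len(attention_mask) == len(token_type_ids) == 128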
After processing each data point, we need to do the following:
features.append(InputFeatures(input_ids=input_ids,
                              attention_mask=attention_mask,
                              token_type_ids=token_type_ids,
                              label=label,
                              input_len=input_len))  ## input_len is the original length, here 10, not 128
InputFeatures is just a container class that holds the transformed features.
After all the original data has been converted into features, we obtain the features list and convert its elements into tensors. The third core operation is then to build the final dataset with TensorDataset and return it:
dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_lens, all_labels)
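The conversion from the features list to these tensors is mechanical; a sketch of what it looks like, with field names following the InputFeatures shown above (in the repo, labels have already been mapped to integers at this point):

import torch
from torch.utils.data import TensorDataset

all_input_ids      = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
all_lens           = torch.tensor([f.input_len for f in features], dtype=torch.long)
all_labels         = torch.tensor([f.label for f in features], dtype=torch.long)

dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_lens, all_labels)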
6.2 Training the Model – The Train Function
Now let’s look at the second function, which is the train operation.
6.2.1 Standard Operations
First, there are some standard operations.
Random sampling of data: RandomSampler
DataLoader reads the data
Calculating the total number of training steps (taking gradient accumulation into account), setting up the warm-up schedule, the optimizer, whether to use fp16, etc.
Then, training is done batch by batch. The core code here is to send the data and parameters into the model:
outputs = model(**inputs)
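Putting the pieces above together, a condensed sketch of the training loop might look like this. It reuses the train_dataset and model built earlier; the hyperparameters mirror the command-line flags, the optimizer choice and batch unpacking order are assumptions based on the TensorDataset above, and the repo's warm-up schedule, gradient accumulation, and fp16 support are omitted.

import torch
from torch.utils.data import DataLoader, RandomSampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(4):
    for batch in train_dataloader:
        batch = tuple(t.to(device) for t in batch)
        # order follows the TensorDataset built above: ids, mask, segments, lens, labels
        inputs = {"input_ids": batch[0], "attention_mask": batch[1],
                  "token_type_ids": batch[2], "labels": batch[4]}
        outputs = model(**inputs)
        loss = outputs[0]   # first element of the returned tuple is the loss when labels are passed
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()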
Since we are doing a text classification demo, the corresponding Bert class is BertForSequenceClassification.
Let’s directly enter this class to see its internal functions.
6.2.2 Bert Classification Model: BertForSequenceClassification
The main code is as follows:
## reference: transformers.modeling_bert.BertForSequenceClassification
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        ...
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, labels=None):
        ## Note: in __init__, self.bert is a BertModel, so we need to see how the data flows through BertModel
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        ...
        return outputs  # (loss), logits, (hidden_states), (attentions)
This class has two core parts: first it calls BertModel to obtain Bert's raw outputs, then it takes the cls (pooled) output for the classification step. The important piece is BertModel, so let's look directly at what happens inside the BertModel class.
6.2.2.1 BertModel
The code is as follows:
## reference: transformers.modeling_bert.BertModel
class BertModel(BertPreTrainedModel):
    def __init__(self, config):
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        ...

    def forward(self, input_ids, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None):
        ...
        ### First part: process the attention_mask and embed the input
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
        ### Second part: feed the embeddings into the encoder
        encoder_outputs = self.encoder(embedding_output,
                                       extended_attention_mask,
                                       head_mask=head_mask)
        ...
        return outputs
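To see concretely what the three extended_attention_mask lines do, here is a tiny worked example (values only; the real code also casts the mask to the model's dtype for fp16):

import torch

attention_mask = torch.tensor([[1, 1, 1, 0]])                  # 1 = real token, 0 = padding
extended = attention_mask.unsqueeze(1).unsqueeze(2).float()    # shape [1, 1, 1, 4]
extended = (1.0 - extended) * -10000.0
print(extended)   # tensor([[[[-0., -0., -0., -10000.]]]])
# Added to the raw attention scores, the -10000.0 entries are crushed to ~0 by softmax,
# so no token attends to the padding position.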
BertModel can thus be divided into two parts: the first operates on the attention_mask and embeds the input; the second feeds the embeddings into the encoder, which is a BertEncoder. Let's look at the BertEncoder class directly.
6.2.2.1.1 BertEncoder
The code is as follows:
## reference: transformers.modeling_bert.BertEncoder
class BertEncoder(nn.Module):
    def __init__(self, config):
        super(BertEncoder, self).__init__()
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states, attention_mask=None, head_mask=None):
        all_hidden_states = ()
        all_attentions = ()
        for i, layer_module in enumerate(self.layer):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)
            layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
            hidden_states = layer_outputs[0]
            if self.output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)
        # Add last layer
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)
        outputs = (hidden_states,)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            outputs = outputs + (all_attentions,)
        return outputs  # last-layer hidden state, (all hidden states), (all attentions)
A small detail about BertEncoder: if output_hidden_states is True, it returns the output of every layer, including the embedding output. So with twelve layers you get 13 tensors, the first being the word-embedding result, each of shape [batch_size, seq_length, hidden_size] (strictly, the first one is [batch_size, seq_length, embedding_size], but in Bert the embedding size equals the hidden size, so the shapes are identical).
Another detail worth noting is the head_mask argument, which lets you mask out individual attention heads. I remember a paper that studied how much each head contributes to the results; head_mask makes that kind of experiment straightforward.
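Both details are easy to poke at interactively. Below is a minimal sketch, assuming the local bert-base-chinese directory prepared earlier and the tuple-style outputs shown in this walkthrough; exact output indices and the accepted head_mask shapes can vary between transformers versions.

import torch
from transformers import BertConfig, BertModel, BertTokenizer

model_path = "prev_trained_model/bert-base-chinese"
config = BertConfig.from_pretrained(model_path, output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertModel.from_pretrained(model_path, config=config)
model.eval()

input_ids = torch.tensor([tokenizer.encode("今天股票行情不好")])

# 1) Inspect the 13 hidden-state tensors (embedding output + 12 encoder layers)
with torch.no_grad():
    outputs = model(input_ids)
hidden_states = outputs[2]          # tuple of 13 tensors
print(len(hidden_states))           # 13
print(hidden_states[0].shape)       # [1, seq_len, 768] -- the embedding output

# 2) Zero out a single attention head via head_mask ([num_layers, num_heads])
head_mask = torch.ones(12, 12)
head_mask[0, 3] = 0.0               # drop head 3 of the first layer
with torch.no_grad():
    outputs_masked = model(input_ids, head_mask=head_mask)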
The most important part of the BertEncoder is the BertLayer.
- BertLayer

BertLayer consists of three sub-modules: BertAttention, BertIntermediate, and BertOutput. Let's look at them one by one.
- BertAttention

BertAttention itself contains two sub-modules, BertSelfAttention and BertSelfOutput.

- BertSelfAttention
def forward(self, hidden_states, attention_mask=None, head_mask=None):
    ## Receives the arguments described above
    mixed_query_layer = self.query(hidden_states)  ## Generate the query [16,32,768]: 16 is batch_size, 32 is the sequence length in this batch, 768 is the hidden dimension
    mixed_key_layer = self.key(hidden_states)
    mixed_value_layer = self.value(hidden_states)

    query_layer = self.transpose_for_scores(mixed_query_layer)  ## Reshape the query; the shape is now [16,12,32,64]: [batch_size, num_heads, seq_len, head_dim]
    key_layer = self.transpose_for_scores(mixed_key_layer)
    value_layer = self.transpose_for_scores(mixed_value_layer)

    # Take the dot product between "query" and "key" to get the raw attention scores.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    ## After this operation, attention_scores has shape torch.Size([16, 12, 32, 32])
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    if attention_mask is not None:
        # Apply the attention mask (precomputed for all layers in BertModel's forward() function)
        attention_scores = attention_scores + attention_mask
        ## The mask is simply added: padded positions hold a very large negative value, so after softmax they become close to 0

    # Normalize the attention scores to probabilities.
    attention_probs = nn.Softmax(dim=-1)(attention_scores)

    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = self.dropout(attention_probs)  ## Shape torch.Size([16, 12, 32, 32])

    # Mask heads if we want to
    if head_mask is not None:
        attention_probs = attention_probs * head_mask

    context_layer = torch.matmul(attention_probs, value_layer)  ## Shape torch.Size([16, 12, 32, 64])
    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()  ## Shape torch.Size([16, 32, 12, 64])
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)  ## new_context_layer_shape: torch.Size([16, 32, 768])
    context_layer = context_layer.view(*new_context_layer_shape)  ## Shape becomes torch.Size([16, 32, 768])

    outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)
    return outputs
At this point, the BertSelfAttention returns a result of dimension torch.Size([16, 32, 768]), which is used as input for BertSelfOutput.
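The shape bookkeeping done by transpose_for_scores (splitting 768 into 12 heads of 64 and moving the head axis forward) can be illustrated in isolation; this is a sketch of the reshape, not the library method itself:

import torch

batch_size, seq_len, hidden_size = 16, 32, 768
num_heads, head_dim = 12, 64                            # 12 * 64 == 768

x = torch.randn(batch_size, seq_len, hidden_size)       # e.g. mixed_query_layer
x = x.view(batch_size, seq_len, num_heads, head_dim)    # [16, 32, 12, 64]
x = x.permute(0, 2, 1, 3)                               # [16, 12, 32, 64]
print(x.shape)  # torch.Size([16, 12, 32, 64]) -- [batch, heads, seq_len, head_dim]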
- BertSelfOutput
class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super(BertSelfOutput, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        ## A linear layer that keeps the dimension unchanged
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
After these two modules, BertSelfAttention and BertSelfOutput, the attention result is returned. Next comes the next sub-module of BertLayer: BertIntermediate.
- BertIntermediate
This module is quite simple: a Linear layer followed by a GELU activation. Its output has shape torch.Size([16, 32, 3072]), and this result then goes into the BertOutput module.
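Since no code is shown above for this module, here is a minimal sketch of what BertIntermediate does; the real class lives in transformers.modeling_bert and reads its sizes from the config, while the bert-base numbers are hard-coded here for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BertIntermediateSketch(nn.Module):
    """Linear from hidden_size to intermediate_size, followed by GELU."""

    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.dense = nn.Linear(hidden_size, intermediate_size)

    def forward(self, hidden_states):
        return F.gelu(self.dense(hidden_states))

# [16, 32, 768] in -> [16, 32, 3072] out
out = BertIntermediateSketch()(torch.randn(16, 32, 768))
print(out.shape)   # torch.Size([16, 32, 3072])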
- BertOutput
This is also simple: Linear (3072 -> 768), then Dropout, then LayerNorm applied to the sum with the attention output (a residual connection); the output has shape torch.Size([16, 32, 768]).
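A matching sketch of BertOutput, where the residual connection and the LayerNorm are the details worth noticing; again, sizes are bert-base values and the real class uses BertLayerNorm and the config.

import torch.nn as nn

class BertOutputSketch(nn.Module):
    """Linear back down to hidden_size, Dropout, then LayerNorm over a residual sum."""

    def __init__(self, hidden_size=768, intermediate_size=3072, dropout=0.1, eps=1e-12):
        super().__init__()
        self.dense = nn.Linear(intermediate_size, hidden_size)   # 3072 -> 768
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=eps)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states, input_tensor):
        # hidden_states: output of BertIntermediate; input_tensor: the attention output (residual)
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        return self.LayerNorm(hidden_states + input_tensor)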
The output result of BertOutput is returned to the BertEncoder class.
The result of BertEncoder is returned to the BertModel class as encoder_outputs; its first element (the last hidden state) has shape torch.Size([16, 32, 768]).
The return value of the BertModel is outputs = (sequence_output, pooled_output,) + encoder_outputs[1:].
sequence_output: torch.Size([16, 32, 768]).
pooled_output: torch.Size([16, 768]) is the cls output after passing through a pooling layer (which is actually just a linear operation with unchanged dimensions + tanh).
The outputs are returned to BertForSequenceClassification, which performs classification on pooled_output.
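For completeness, the pooler mentioned above amounts to the following (a sketch with bert-base sizes, not the exact library class):

import torch.nn as nn

class BertPoolerSketch(nn.Module):
    """Take the hidden state at the [CLS] position and apply Linear + tanh."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):              # [batch, seq_len, hidden]
        first_token_tensor = hidden_states[:, 0]   # the [CLS] position
        return self.activation(self.dense(first_token_tensor))   # [batch, hidden]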