In-Depth Analysis of BERT Source Code

By Gao Kaiyuan | Image source: Internet

Introduction

I have been reviewing materials related to Paddle, so I decided to take a closer look at the source code of Baidu’s ERNIE. When I skimmed through it before, I noticed that ERNIE 2.0 and ERNIE-tiny are quite similar to BERT. I wonder what changes have been made in the updates~ After finishing, I will organize a summary similar to the one below. If any colleagues researching Paddle or ERNIE want to discuss, feel free to add me, haha.

Original Content @2019.05.16

The BERT model has been out for a while now. I have previously read papers and blogs interpreting it: NLP’s Killer Tool BERT Model Interpretation[1], but I had not looked closely at the implementation in the source code. Recently, I had the opportunity to use it, so I took some time to examine it and record my findings for discussion.

Note that the source code reading series requires prior knowledge of NLP-related concepts, such as the attention mechanism, transformer framework, as well as basic knowledge of Python and TensorFlow. The principles of BERT are not the focus of this article.

Here is a compilation of materials related to BERT: Summary of BERT-related papers, articles, and code resources[2]

Today, I will introduce the main model implementation of BERT – BertModel, with the code located in:

  • modeling.py module[3]

Comments appear both inside the code blocks and in the surrounding text.

If there are any misunderstandings in the interpretation, please feel free to point them out~

1. Configuration Class (BertConfig)

This part of the code mainly defines the default hyper-parameters of the BERT model, plus helpers for loading the configuration from a Python dict or a JSON file.

class BertConfig(object):
  """Configuration class for the BERT model."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

Parameter Meanings

  • vocab_size: Size of the vocabulary
  • hidden_size: Number of neurons in the hidden layer
  • num_hidden_layers: Number of hidden layers in the Transformer encoder
  • num_attention_heads: Number of heads in multi-head attention
  • intermediate_size: Number of neurons in the intermediate hidden layer (e.g., feed-forward layer)
  • hidden_act: Activation function for the hidden layer
  • hidden_dropout_prob: Dropout rate for the hidden layer
  • attention_probs_dropout_prob: Dropout rate for the attention part
  • max_position_embeddings: Maximum position encoding
  • type_vocab_size: Vocabulary size for token_type_ids
  • initializer_range: Standard deviation for truncated_normal_initializer initialization method

One thing to note is that the type_vocab_size parameter can be a bit confusing at first. It actually corresponds to the two segment ids, Segment A and Segment B, used in the next sentence prediction task. This is also reflected in the downloaded bert_config.json file, where the value is 2. Refer to this Issue[4].
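
For reference, a released config can be loaded directly with from_json_file. A minimal sketch (the model directory path here is only illustrative):

# Sketch: loading a released bert_config.json; the path is just an example.
import modeling

config = modeling.BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.type_vocab_size)    # 2 in the released configs
print(config.to_json_string())   # dump all hyper-parameters as a JSON string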

2. Obtaining Word Vectors (embedding_lookup)

For the input word ids, this function looks up the corresponding word embeddings and also returns the embedding table. The lookup can be done either via one-hot matrix multiplication or with tf.gather() (a small equivalence sketch follows the parameter list below).

def embedding_lookup(input_ids,  # word_id: 【batch_size, seq_length】
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):

  # The default input shape is 【batch_size, seq_length, input_num】
  # If the input is 2D 【batch_size, seq_length】, expand to 【batch_size, seq_length, 1】
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])  # 【batch_size*seq_length*input_num】
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:  # Indexing
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  # output: [batch_size, seq_length, num_inputs]
  # Reshape to: [batch_size, seq_length, num_inputs*embedding_size]
  output = tf.reshape(output,
        input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

Parameter Meanings

  • input_ids: Word ID 【batch_size, seq_length】
  • vocab_size: Size of the embedding vocabulary
  • embedding_size: Embedding dimension
  • initializer_range: Embedding initialization range
  • word_embedding_name: Name for the embedding table
  • use_one_hot_embeddings: Whether to use one-hot embeddings
  • Return: the looked-up embeddings 【batch_size, seq_length, embedding_size】 together with the embedding table
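
To see why the two lookup paths are interchangeable, here is a small sketch (assuming TF 1.x) showing that one-hot matrix multiplication and tf.gather select the same rows of the embedding table:

import tensorflow as tf

vocab_size, embedding_size = 6, 4
embedding_table = tf.get_variable(
    "toy_embeddings", shape=[vocab_size, embedding_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
flat_input_ids = tf.constant([1, 3, 3, 0])

# Path 1: one-hot matrix multiplication (the use_one_hot_embeddings branch)
one_hot_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
lookup_a = tf.matmul(one_hot_ids, embedding_table)      # [4, embedding_size]
# Path 2: direct indexing
lookup_b = tf.gather(embedding_table, flat_input_ids)   # [4, embedding_size]

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  a, b = sess.run([lookup_a, lookup_b])
  print((a == b).all())  # True: both paths return the same embeddings

The one-hot path is mainly intended for TPUs; on CPU/GPU, tf.gather is usually the faster choice.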

3. Subsequent Processing of Word Vectors (embedding_postprocessor)

We know that the BERT model’s input consists of three parts: token embedding, segment embedding, and position embedding. In the previous section we only obtained the token embedding. This part of the code adds the remaining two, applies layer normalization and dropout, and outputs the final embedding. Note that in the Transformer paper the position embedding is a fixed value generated by sin/cos functions, whereas in this implementation it is initialized randomly like an ordinary word embedding and is therefore trainable. The likely reason for this choice is that the data used to pre-train BERT is much larger than that of the original Transformer, so the model can learn the positions on its own. (A sketch of the fixed sin/cos alternative follows the code below.)

def embedding_postprocessor(input_tensor,  # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,  # Typically 2
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,  # Max position encoding, must be >= max_seq_len
                            dropout_prob=0.1):

  input_shape = get_shape_list(input_tensor, expected_rank=3)  # 【batch_size, seq_length, embedding_size】
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Segment position information
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # Since token-type-table is relatively small, one-hot embedding is used here for acceleration
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Position embedding information
  if use_position_embeddings:
    # Ensure seq_length is less than or equal to max_position_embeddings
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))

      # Here, position embedding is a learnable parameter, [max_position_embeddings, width]
      # However, the actual input sequence usually does not reach max_position_embeddings
      # Thus, to improve training speed, use tf.slice to extract the embedding of the sentence length
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Word embedding tensor is [batch_size, seq_length, width]
      # Since position encoding is independent of input content, its shape is always [seq_length, width]
      # so it cannot be added directly to the word embedding tensor
      # Therefore, we need to expand the position encoding to [1, seq_length, width]
      # Then we can add them through broadcasting.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output
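
For contrast with the learned position table above, here is a minimal numpy sketch of the fixed sin/cos position encoding from the original Transformer paper (for illustration only; BERT does not use it):

import numpy as np

def sinusoidal_position_encoding(max_len, width):
  positions = np.arange(max_len)[:, np.newaxis]       # [max_len, 1]
  dims = np.arange(width)[np.newaxis, :]              # [1, width]
  angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float64(width))
  angles = positions * angle_rates                    # [max_len, width]
  encoding = np.zeros((max_len, width))
  encoding[:, 0::2] = np.sin(angles[:, 0::2])         # even dimensions use sin
  encoding[:, 1::2] = np.cos(angles[:, 1::2])         # odd dimensions use cos
  return encoding

print(sinusoidal_position_encoding(512, 768).shape)   # (512, 768)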

4. Constructing Attention Mask

This part of the code constructs the attention_mask that limits what each position may attend to. Because every sample is padded to a fixed length, the padded positions should not be attended to during self-attention. The inputs are the padded input_ids of shape [batch_size, from_seq_length,…] and the mask vector of shape [batch_size, to_seq_length] that marks the valid tokens; the output is a mask of shape [batch_size, from_seq_length, to_seq_length] (a worked example follows the code).

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  mask = broadcast_ones * to_mask

  return mask
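
As promised above, a worked example (assuming TF 1.x and that modeling.py is importable): for a batch of two sequences padded to length 5, every query position in a sample receives the same row indicating which key positions are real tokens:

import tensorflow as tf
import modeling

input_ids = tf.constant([[25, 120, 34, 0, 0],
                         [11,  42,  0, 0, 0]])   # padded word ids
input_mask = tf.constant([[1, 1, 1, 0, 0],
                          [1, 1, 0, 0, 0]])      # 1 = real token, 0 = padding

mask = modeling.create_attention_mask_from_input_mask(input_ids, input_mask)
# mask shape: [2, 5, 5]

with tf.Session() as sess:
  print(sess.run(mask)[0])
# Every from-position of the first sample gets the same row [1. 1. 1. 0. 0.]:
# it may attend to the three real tokens but not to the padding.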

5. Attention Layer (attention_layer)

This part of the code implements multi-head attention, primarily derived from the paper “Attention is All You Need”. In the attention mechanism, the from_tensor serves as the query, while the to_tensor serves as the key and value. When both are the same, it is self-attention. For a more detailed introduction to attention, please refer to 【Understanding Attention Mechanism Principles and Models[5]】.

def attention_layer(from_tensor,   # 【batch_size, from_seq_length, from_width】
                    to_tensor,  # 【batch_size, to_seq_length, to_width】
                    attention_mask=None,  # 【batch_size, from_seq_length, to_seq_length】
                    num_attention_heads=1,  # Number of attention heads
                    size_per_head=512,  # Size of each head
                    query_act=None,  # Activation function for query transformation
                    key_act=None,  # Activation function for key transformation
                    value_act=None,  # Activation function for value transformation
                    attention_probs_dropout_prob=0.0,  # Dropout for attention layer
                    initializer_range=0.02,  # Initialization range
                    do_return_2d_tensor=False,  # Whether to return a 2D tensor.
# If True, output shape is 【batch_size*from_seq_length, num_attention_heads*size_per_head】
# If False, output shape is 【batch_size, from_seq_length, num_attention_heads*size_per_head】
                    batch_size=None,  # If input is 3D,
# then batch is the first dimension, but 3D may have been compressed to 2D, so batch_size needs to be specified
                    from_seq_length=None,  # Same as above
                    to_seq_length=None):  # Same as above

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])  #[batch_size,  num_attention_heads, seq_length, width]
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # To facilitate shape annotation, the following abbreviations are used:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  # Compress from_tensor and to_tensor into 2D tensors
  from_tensor_2d = reshape_to_matrix(from_tensor)  # 【B*F, hidden_size】
  to_tensor_2d = reshape_to_matrix(to_tensor)  # 【B*T, hidden_size】

  # Input from_tensor into a fully connected layer to obtain query_layer
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # Input from_tensor into a fully connected layer to obtain key_layer
  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # Same as above
  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # Transform query_layer into multi-head: [B*F, N*H] ==> [B, F, N, H] ==> [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # Transform key_layer into multi-head: [B*T, N*H] ==> [B, T, N, H] ==> [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Calculate attention scores by performing dot product between query and key, then scale, refer to the original paper for the formula
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # If the elements in attention_mask are 1, then through the following operation, (1-1)*-10000, adder is 0
    # If the elements in attention_mask are 0, then through the following operation, (1-0)*-10000, adder is -10000
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # The attention score we ultimately obtain is generally not very large,
    # so the operation above makes the score for mask=0 effectively negative infinity
    attention_scores += adder

  # Negative infinity becomes 0 after softmax, which means that the position with mask=0 does not compute attention scores
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # Apply dropout to attention_probs, although this seems a bit strange, the original Transformer paper does it this way
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
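
To make the mask trick above concrete, here is a small numeric check (plain numpy, for illustration only) of what the -10000 adder does to the softmax:

import numpy as np

scores = np.array([2.1, 0.3, -1.0, 1.5])   # raw scaled attention scores for one query
mask   = np.array([1.0, 1.0, 0.0, 1.0])    # the third key position is padding
adder  = (1.0 - mask) * -10000.0           # 0 for real tokens, -10000 for padding

masked_scores = scores + adder
probs = np.exp(masked_scores) / np.exp(masked_scores).sum()
print(probs.round(4))   # [0.5834 0.0964 0.     0.3202]; the padded position gets zero weight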

To summarize, the main flow of the attention layer is:

  • Verify the shape of the input tensor and extract batch_size, from_seq_length, to_seq_length;
  • Convert the input to a 2D matrix if it is a 3D tensor;
  • Use from_tensor as the query and to_tensor as the key and value, obtaining query_layer, key_layer, and value_layer after passing through a fully connected layer;
  • Transform the above tensors into multi-head using transpose_for_scores;
  • Calculate attention_scores and attention_probs according to the formula in the paper (note the trick with attention_mask);
  • Multiply the resulting attention_probs with the value and return a 2D or 3D tensor.
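
A minimal self-attention call (a sketch assuming TF 1.x and an importable modeling.py) just to confirm the output shape:

import tensorflow as tf
import modeling

layer_input = tf.random_uniform([2, 5, 768])   # [batch_size, seq_length, hidden_size]

with tf.variable_scope("toy_attention"):
  context = modeling.attention_layer(
      from_tensor=layer_input,
      to_tensor=layer_input,        # query source == key/value source: self-attention
      num_attention_heads=12,
      size_per_head=64)             # 12 heads * 64 = 768

print(context.shape)                # (2, 5, 768), i.e. [B, F, N*H]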

6. Transformer

The following code is the core code of the well-known Transformer, which can be considered a reproduction of the original code from “Attention is All You Need”. Refer to the 【original paper[6]】 and 【original code[7]】.

def transformer_model(input_tensor,  # 【batch_size, seq_length, hidden_size】
                      attention_mask=None,  # 【batch_size, seq_length, seq_length】
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,  # Activation function for feed-forward layer
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):

  # Note: the concatenated multi-head output must have hidden_size dimensions.
  # There are num_attention_heads heads, each producing size_per_head dimensions,
  # so hidden_size must equal num_attention_heads * size_per_head.
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # Since there are residual operations in the encoder, the shapes need to be the same
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # Reshape operations are fast on CPU/GPU, but not friendly on TPU
  # Therefore, to avoid frequent reshaping between 2D and 3D, we represent all 3D tensors as 2D matrices
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
      # Multi-head attention
        attention_heads = []
        with tf.variable_scope("self"):
        # Self-attention
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # If there are multiple heads, concatenate them
          attention_output = tf.concat(attention_heads, axis=-1)

        # Perform a linear mapping on the attention output, to make the shape consistent with the input
        # Then dropout + residual + norm
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # Feed-forward
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Use linear transformation on the output of the feed-forward layer to return to ‘hidden_size’
      # Then dropout + residual + norm
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

Referring to the Transformer architecture diagram in the original paper helps in understanding this: BERT contains only the encoder stack, with no decoder.
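
A shape-level sketch of running the encoder stack on placeholder embeddings (assuming TF 1.x; real usage feeds in the embedding output from Section 3):

import tensorflow as tf
import modeling

embedding_output = tf.random_uniform([2, 5, 768])   # [batch_size, seq_length, hidden_size]
attention_mask = tf.ones([2, 5, 5])                  # let every position attend everywhere

all_layers = modeling.transformer_model(
    input_tensor=embedding_output,
    attention_mask=attention_mask,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    do_return_all_layers=True)

print(len(all_layers))          # 12: one output per encoder layer
print(all_layers[-1].shape)     # (2, 5, 768): the usual sequence_output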

7. Function Entry (__init__)

The constructor of the BertModel class. With the groundwork laid in the previous sections, we can implement the BERT model now.

def __init__(self,
               config,  # BertConfig object
               is_training,
               input_ids,  # 【batch_size, seq_length】
               input_mask=None,  # 【batch_size, seq_length】
               token_type_ids=None,  # 【batch_size, seq_length】
               use_one_hot_embeddings=False,  # Whether to use one-hot; otherwise tf.gather()
               scope=None):

    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]
    # If no mask is provided, treat every position as valid (all ones)
    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Word embedding
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add position embedding and segment embedding
        # Layer norm + dropout
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):

        # input_ids is the padded word_ids: [25, 120, 34, 0, 0]
        # input_mask is the valid word marking: [1, 1, 1, 0, 0]
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Stacking transformer modules
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      # `self.sequence_output` is the output of the last layer, with shape 【batch_size, seq_length, hidden_size】
      self.sequence_output = self.all_encoder_layers[-1]

      # The 'pooler' part converts the encoder output 【batch_size, seq_length, hidden_size】
      # to 【batch_size, hidden_size】
      with tf.variable_scope("pooler"):
        # Take the first token tensor corresponding to the last layer's [CLS], which is crucial for classification tasks
        # sequence_output[:, 0:1, :] gives [batch_size, 1, hidden_size],
        # so we use squeeze to remove the second dimension
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        # Then add a fully connected layer, output remains [batch_size, hidden_size]
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

Conclusion

With this deeper understanding of the source code, using BertModel becomes much more straightforward. Here is a simple example of using the model:

# Assuming the input has been tokenized into word_ids. shape=[2, 3]
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
# segment_embedding. Indicates that the first two words of the first sample belong to sentence 1, and the last word belongs to sentence 2.
# The first word of the second sample belongs to sentence 1, the second word belongs to sentence 2, and the third element is 0 indicating padding.
# The original example is written this way; using a 2 here seems unnecessary to me, unless I am misunderstanding something.
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

# Create an instance of BertConfig
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
         num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

# Create an instance of BertModel
model = modeling.BertModel(config=config, is_training=True,
     input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

label_embeddings = tf.get_variable(...)
# Get the first token of the last layer, which is the [CLS] vector representation, can be considered as a sentence embedding
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
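
For token-level tasks such as sequence labeling, the per-token encoder output is also available; continuing the example above (num_labels is just a hypothetical parameter for illustration):

num_labels = 5  # hypothetical number of tags, for illustration only
# [batch_size, seq_length, hidden_size]
sequence_output = model.get_sequence_output()
# A simple token-level classification head: [batch_size, seq_length, num_labels]
token_logits = tf.layers.dense(sequence_output, num_labels)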

The main flow of building the BERT model is as follows:

  • Embed the input sequence (three parts), then proceed with ‘Attention is all you need’;
  • Simply put, input the embedding into the transformer to obtain the output;
  • In detail, it is embedding –> N * [multi-head attention –> Add(Residual) & Norm –> Feed-Forward –> Add(Residual) & Norm];
  • Isn’t it simple~
  • There are also some other auxiliary functions in the source code, which are not very difficult to understand, so I won’t elaborate further.

References

[1] NLP’s Killer Tool BERT Model Interpretation: https://blog.csdn.net/Kaiyuan_sjtu/article/details/83991186
[2] Summary of BERT-related papers, articles, and code resources: http://www.52nlp.cn/bert-paper-%E8%AE%BA%E6%96%87-%E6%96%87%E7%AB%A0-%E4%BB%A3%E7%A0%81%E8%B5%84%E6%BA%90%E6%B1%87%E6%80%BB
[3] modeling.py module: https://github.com/google-research/bert/blob/master/modeling.py
[4] Refer to this Issue: https://github.com/google-research/bert/issues/16
[5] Understanding Attention Mechanism Principles and Models: https://blog.csdn.net/Kaiyuan_sjtu/article/details/81806123
[6] Original Paper: https://arxiv.org/abs/1706.03762
[7] Original Code: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
