How to Quickly Use BERT?


Source | Zhihu

Address | https://zhuanlan.zhihu.com/p/112235454

Author | TotoroWang

Editor | Machine Learning Algorithms and Natural Language Processing Public Account

This article is reposted with the author's authorization; further reproduction is prohibited.

Introduction

Since I have been working on BERT models recently, I will record several common ways to quickly use the BERT model here~

BERT Model

As one of the strongest pre-trained models currently available, BERT has set multiple records in the NLP field. Although it achieves state-of-the-art results on many tasks, training BERT from scratch is very demanding: with hundreds of millions of parameters, it is impractical for most users to train on their own. Fortunately, Google has open-sourced several versions of the BERT pre-trained model, specifically the following:

  • BERT-Large, Uncased (Whole Word Masking)

    • Language: English

    • Network Structure: 24-layer, 1024-hidden, 16-heads

    • Parameter Size: 340M

  • BERT-Large, Cased (Whole Word Masking)

    • Language: English

    • Network Structure: 24-layer, 1024-hidden, 16-heads

    • Parameter Size: 340M

  • BERT-Base, Uncased

    • Language: English

    • Network Structure: 12-layer, 768-hidden, 12-heads

    • Parameter Size: 110M

  • BERT-Large, Uncased

    • Language: English

    • Network Structure: 24-layer, 1024-hidden, 16-heads

    • Parameter Size: 340M

  • BERT-Base, Cased

    • Language: English

    • Network Structure: 12-layer, 768-hidden, 12-heads

    • Parameter Size: 110M

  • BERT-Large, Cased

    • Language: English

    • Network Structure: 24-layer, 1024-hidden, 16-heads

    • Parameter Size: 340M

  • BERT-Base, Multilingual Cased (New, recommended)

    • Language: 104 languages

    • Network Structure: 12-layer, 768-hidden, 12-heads

    • Parameter Size: 110M

  • BERT-Base, Multilingual Uncased (Original, not recommended)

    • Language: 102 languages

    • Network Structure: 12-layer, 768-hidden, 12-heads

    • Parameter Size: 110M

  • BERT-Base, Chinese

    • Language: Chinese

    • Network Structure: 12-layer, 768-hidden, 12-heads

    • Parameter Size: 110M

From the list above, the models cover three language settings: Chinese, English, and multilingual. The English and multilingual models are further divided into cased and uncased variants, where cased preserves letter case and uncased lowercases all input. The network structure comes in two sizes, Base and Large; the Base version is smaller than the Large version, with about 110M parameters versus 340M. The Chinese pre-trained model is available in only one version, trained with the Base network structure. The detailed architecture and principles of BERT can be found in the BERT paper and are not elaborated here.

Using the BERT Model

The BERT model can be used mainly for two purposes: first, as a tool for text feature extraction, similar to the Word2vec model; second, as a trainable layer that can be connected to a customized network for transfer learning.

Feature Extraction Tool

To use BERT as a feature extraction tool, a simple option is bert-as-service. First, install the server and client packages:

pip install bert-serving-server # Server
pip install bert-serving-client # Client

The environment requirements are Python >= 3.5 and TensorFlow >= 1.10. Next, download a BERT pre-trained model from the list above; for example, the Chinese model BERT-Base, Chinese. After downloading, unzip it to a local directory, e.g. /tmp/chinese_L-12_H-768_A-12/. Then open a terminal and run the following command to start the service:

bert-serving-start -model_dir /tmp/chinese_L-12_H-768_A-12/ -num_worker=2

Here, the model_dir parameter is the path to the unzipped BERT pre-trained model, and num_worker is the number of worker processes. Note that num_worker should not exceed the number of available CPU cores or GPU devices.

Finally, write the client code:

from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['龙猫小组真棒', '关注有惊喜', '哈哈'])
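
With the default pooling strategy, encode reduces each sentence to a single fixed-length vector (768 dimensions for a Base model). A minimal check, assuming the service started above is running:

vecs = bc.encode(['龙猫小组真棒', '关注有惊喜', '哈哈'])
# With the default pooling strategy and a Base model, encode returns a
# NumPy array with one 768-dimensional vector per sentence.
print(vecs.shape)  # expected: (3, 768)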

For encoding sentence pairs, you can use the ||| symbol as a separator between the two sentences:

bc.encode(['龙猫小组真棒 ||| 关注有惊喜'])

Of course, there are also some parameters that can be customized, and different APIs can be used. For details, please refer to: bert-as-service.
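
As a hedged example of such customization (only a few of the available options are shown; see the bert-as-service documentation for the full list), the maximum sequence length, pooling strategy, and service ports can be set when starting the server, and the client can connect to a server on another machine by IP:

bert-serving-start -model_dir /tmp/chinese_L-12_H-768_A-12/ \
    -num_worker=2 -max_seq_len=64 -pooling_strategy=REDUCE_MEAN \
    -port=5555 -port_out=5556

from bert_serving.client import BertClient
# The IP address below is hypothetical; replace it with that of your server
bc = BertClient(ip='192.168.1.100', port=5555, port_out=5556)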

In addition to this usage method, you can also use deep learning frameworks such as Tensorflow and Keras to rebuild the BERT pre-trained model, and then use the rebuilt BERT model to obtain text vector representations.

Loading the BERT Model with Tensorflow

Common methods for loading the BERT model with Tensorflow are:

  • Using the open-source code of BERT

  • Using Tensorflow_hub (recommended)

The advantage of the official BERT open-source code is its detailed documentation, which makes it easy to fine-tune the pre-trained model for tasks such as text classification. However, the code was developed against TensorFlow 1.11.0, which is not very friendly to TensorFlow 2.0 users; to run it on TensorFlow 2.0 or higher, some of the source code has to be modified (follow the column and send me a private message to get the modified version directly~).

The core code for loading the model with Tensorflow is as follows:

import tensorflow as tf
from bert import modeling  # modeling.py from the official google-research/bert repository

tf.compat.v1.disable_eager_execution()

def convert_ckpt_to_saved_model(bert_config, init_checkpoint):
    # BERT configuration file
    bert_config = modeling.BertConfig.from_json_file(bert_config)

    # Create BERT input
    input_ids = tf.compat.v1.placeholder(shape=(None, None), dtype=tf.int32, name='input_ids')
    input_mask = tf.compat.v1.placeholder(shape=(None, None), dtype=tf.int32, name='input_mask')
    segment_ids = tf.compat.v1.placeholder(shape=(None, None), dtype=tf.int32, name='segment_ids')

    # Create BERT model
    model = modeling.BertModel(
        config=bert_config,
        is_training=True,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False, # Set to True for TPU, False for CPU/GPU.
    )
    
    # Get all trainable parameters in the model
    tvars = tf.compat.v1.trainable_variables()

    # Load BERT model
    (assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars=tvars, init_checkpoint=init_checkpoint)
    tf.compat.v1.train.init_from_checkpoint(init_checkpoint, assignment_map)

    # Print the parameters loaded into the model
    tf.compat.v1.logging.info("  **** Trainable Variables ****")
    for var in tvars:
        init_string = ""
        if var.name in initialized_variable_names:
            init_string = ", *INIT_FROM_CKPT*"
        tf.compat.v1.logging.info("  name = {}, shape = {}{}".format(var.name, var.shape, init_string))
     
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())

Here, bert_config is the path to the bert_config.json file of the pre-trained model, and init_checkpoint is the path to the bert_model.ckpt checkpoint. If you want to save the model parameters and graph as a pb file that is easier to deploy, i.e. in the SavedModel format, you can add the following code (it uses sess, input_ids, and model, so it belongs inside the session block above):

builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(saved_model_path)
    
model_signature = tf.compat.v1.saved_model.signature_def_utils.build_signature_def(
    inputs={
        "input_ids": tf.compat.v1.saved_model.utils.build_tensor_info(input_ids),
        "input_mask": tf.compat.v1.saved_model.utils.build_tensor_info(input_mask),
        "segment_ids": tf.compat.v1.saved_model.utils.build_tensor_info(segment_ids)},
    outputs={
        "pooled_output": tf.compat.v1.saved_model.utils.build_tensor_info(model.pooled_output),
        "sequence_output": tf.compat.v1.saved_model.utils.build_tensor_info(model.sequence_output)},
    method_name=tf.compat.v1.saved_model.signature_constants.PREDICT_METHOD_NAME)
        
builder.add_meta_graph_and_variables(
    sess,
    [tf.saved_model.TRAINING, tf.saved_model.SERVING],
    strip_default_attrs=False,
    signature_def_map={
        tf.compat.v1.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
        model_signature})
builder.save()

saved_model_path is the storage path for the SavedModel model.
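
Once exported, the SavedModel can be loaded back for inference with the TF1-compat loader. The sketch below is only illustrative: the tag set and signature keys match the export code above, while input_ids_batch, input_mask_batch, and segment_ids_batch are hypothetical NumPy arrays of token ids produced by the BERT tokenizer:

import tensorflow as tf

with tf.compat.v1.Session(graph=tf.Graph()) as sess:
    # The tag set must match the one used by the builder above
    meta_graph = tf.compat.v1.saved_model.loader.load(
        sess, [tf.saved_model.TRAINING, tf.saved_model.SERVING], saved_model_path)
    signature = meta_graph.signature_def[
        tf.compat.v1.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]

    # Look up tensor names from the exported signature and run the model
    feed = {
        signature.inputs["input_ids"].name: input_ids_batch,
        signature.inputs["input_mask"].name: input_mask_batch,
        signature.inputs["segment_ids"].name: segment_ids_batch,
    }
    pooled = sess.run(signature.outputs["pooled_output"].name, feed_dict=feed)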

In addition to the above usage methods, there is an easier way to use the BERT model, which is to use Tensorflow_hub. You need to install Tensorflow_hub:

pip install tensorflow_hub

Then, download a BERT SavedModel that the hub module can use from TensorFlow Hub (tfhub.dev). After downloading, unzip it locally, and you can load the model as follows:

import tensorflow_hub as hub

# BERT_PATH is the local directory where the downloaded SavedModel was unzipped
bert_layer = hub.KerasLayer(BERT_PATH, trainable=True, name='bert_layer')

Of course, you can also load the model directly by passing the URL address of the model without downloading it locally:

BERT_URL = "https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/1"
bert_layer = hub.KerasLayer(BERT_URL, trainable=True, name='bert_layer')
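
The resulting KerasLayer can then be wired into a Keras model. The sketch below assumes that this version of the hub module takes [input_word_ids, input_mask, segment_ids] and returns (pooled_output, sequence_output); the max_seq_len value and the two-class Dense head are hypothetical:

import tensorflow as tf

max_seq_len = 128  # assumed maximum sequence length

input_word_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32, name='input_word_ids')
input_mask = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32, name='input_mask')
segment_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32, name='segment_ids')

# This hub module version returns the pooled [CLS] vector and the per-token sequence output
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# Hypothetical 2-class classification head on top of the pooled output
logits = tf.keras.layers.Dense(2, activation='softmax')(pooled_output)
model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=logits)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])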

Loading the BERT Model with Keras

In Keras, you can likewise load the BERT model through Tensorflow_hub as a KerasLayer, as shown above. Another convenient method is the keras_bert package, used as follows.

First, install keras_bert:

pip install keras-bert

Then, you can directly load the checkpoint format BERT model using the load_trained_model_from_checkpoint method:

import os
from keras_bert import load_trained_model_from_checkpoint

def BertLayer(bert_path, trainable=True, training=False, seq_len=None, name='bert_layer'):

    # Paths to the config and checkpoint files inside the pre-trained model directory
    bert_config_path = os.path.join(bert_path, 'bert_config.json')
    bert_checkpoint_path = os.path.join(bert_path, 'bert_model.ckpt')

    # Build the BERT model and load the checkpoint weights
    bert_layer = load_trained_model_from_checkpoint(
            bert_config_path, bert_checkpoint_path, training=training, seq_len=seq_len)

    bert_layer.name = name

    # Freeze or unfreeze all layers according to the trainable flag
    for layer in bert_layer.layers:
        layer.trainable = trainable

    return bert_layer
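
A hedged usage sketch follows. It wraps BertLayer in a simple classifier, assuming the Chinese model directory from earlier and a hypothetical binary classification head; the model returned by keras_bert takes token ids and segment ids as inputs and outputs the per-token sequence output, so the [CLS] vector at position 0 is taken as the sentence representation:

import keras

bert_path = '/tmp/chinese_L-12_H-768_A-12/'  # assumed path to the unzipped Chinese model
bert_layer = BertLayer(bert_path, trainable=True, training=False, seq_len=128)

# Take the [CLS] vector (position 0) of the sequence output as the sentence vector
cls_vector = keras.layers.Lambda(lambda x: x[:, 0])(bert_layer.output)
# Hypothetical binary classification head
output = keras.layers.Dense(2, activation='softmax')(cls_vector)

model = keras.Model(bert_layer.inputs, output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])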

Postscript

This article mainly records several common methods for quickly using the BERT model, both as a reference for myself and for anyone who needs to get started with BERT quickly. In the future, I will further share how to use BERT for specific NLP tasks~


