Author: GjZero
Tags: Bert, Chinese Classification, Sentence Vector
This article is about 1,500 words; the recommended reading time is 8 minutes.
This article takes a hands-on approach, walking you through Chinese text classification with Bert and through using Bert as a sentence encoder.
Bert Introduction
The Bert model is a language representation model released by Google in October 2018. Bert swept the state-of-the-art results on 11 NLP tasks, making it one of the most significant breakthroughs in NLP to date. Its full name is Bidirectional Encoder Representations from Transformers, and it is pre-trained with a masked language model objective together with a next-sentence-prediction task. Readers interested in the details of Bert's training and underlying principles can refer to the original paper on arXiv.
For the text classification task, the N characters of a sentence correspond to N input embeddings E_1, …, E_N. Classification then amounts to taking T_1, the output BERT produces at the first position (the [CLS] token), and connecting it to a fully connected layer for multi-class classification.
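The snippet below is a minimal sketch of that classification head in TensorFlow 1.x style: dropout over the sentence-level output, a single dense layer, and a softmax over the labels. The function name, dropout rate, and shapes are illustrative; the actual implementation used for fine-tuning lives in create_model() inside run_classifier.py.
python
import tensorflow as tf

def classification_head(pooled_output, num_labels, is_training):
  """Sketch: sentence-level BERT output -> dropout -> dense layer -> softmax."""
  hidden_size = pooled_output.shape[-1].value  # 768 for Bert-Base
  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())
  if is_training:
    pooled_output = tf.nn.dropout(pooled_output, keep_prob=0.9)  # illustrative rate
  logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)
  probabilities = tf.nn.softmax(logits, axis=-1)
  return logits, probabilities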
Preparation Work
1. Download Bert
Enter the following command in the terminal:
git clone https://github.com/google-research/bert.git
2. Download the Bert Pre-trained Model
Google provides a number of pre-trained Bert models for different languages and model sizes. For Chinese we use Bert-Base, Chinese. Downloading this model may require a VPN in some regions. Download links for the other models (English and other languages) can be found under "Pre-trained models" in the Bert repository.
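If you prefer to fetch the model programmatically, a small download-and-unzip script works as well. The URL below is the one listed for Bert-Base, Chinese in the repository's "Pre-trained models" table at the time of writing; treat it as an assumption and check the repository if it has moved.
python
import os
import urllib.request
import zipfile

# Assumed download URL; verify it against the Bert repository's "Pre-trained models" section.
MODEL_URL = "https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip"
MODEL_ZIP = "chinese_L-12_H-768_A-12.zip"

if not os.path.exists(MODEL_ZIP):
    urllib.request.urlretrieve(MODEL_URL, MODEL_ZIP)

# Extract vocab.txt, bert_config.json and the bert_model.ckpt files,
# which run.sh will reference later.
with zipfile.ZipFile(MODEL_ZIP) as z:
    z.extractall(".")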
3. (Optional) Install bert-as-service
bert-as-service is a service that uses the Bert model to map sentences to fixed-length vectors.
Enter the following command in the terminal:
pip install bert-serving-server # server
pip install bert-serving-client # client, independent of 'bert-serving-server'
This service requires TensorFlow 1.10 or higher.
Prepare Data
Data Format
For the Chinese text classification problem, you need to organize the dataset into a usable format. Different formats correspond to different DataProcessor classes. You can save the data in the following format:
game APEX is a newly released battle royale game.
technology Google is about to release TensorFlow 2.0.
Each line is one example: a label, a tab, and the body text.
Split the data into three files: train.tsv (training set), dev.tsv (validation set), and test.tsv (test set), and place all three in the same data_dir folder.
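If your labeled data starts out as a single file, a small helper like the following can produce the three files. The file names, split ratios, and helper name are illustrative.
python
import os
import random

def split_dataset(all_path, data_dir, seed=42):
    """Split one 'label<TAB>text' file into train/dev/test (8:1:1)."""
    with open(all_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    random.Random(seed).shuffle(lines)
    n = len(lines)
    n_train, n_dev = int(n * 0.8), int(n * 0.1)
    splits = {
        "train.tsv": lines[:n_train],
        "dev.tsv": lines[n_train:n_train + n_dev],
        "test.tsv": lines[n_train + n_dev:],
    }
    for name, part in splits.items():
        with open(os.path.join(data_dir, name), "w", encoding="utf-8") as out:
            out.writelines(part)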
Write DataProcessor Class
In run_classifier.py in the Bert folder, register your processor in the processors dictionary inside the def main(_): function as follows:
python
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "mytask": MyTaskProcessor,
}
Then implement the MyTaskProcessor class (a subclass of DataProcessor) shown below and place it alongside the other processors in run_classifier.py.
In the __init__(self) method, self.labels holds all classification labels; in this example we classify text into three categories: game, fashion, houseliving.
python
class MyTaskProcessor(DataProcessor):
  """Processor for the News data set (GLUE version)."""

  def __init__(self):
    self.labels = ["game", "fashion", "houseliving"]

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    return self.labels

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[1])
      label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
If your data format is not a label followed by a tab and the text, you need to modify the implementation of _create_examples() accordingly; a sketch of such a variant follows.
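For instance, if each line had the text first and the label last (still tab-separated, so _read_tsv can read it unchanged), the method inside MyTaskProcessor could be adjusted roughly as follows; the column indices describe this hypothetical format only.
python
  def _create_examples(self, lines, set_type):
    """Variant for lines formatted as 'text<TAB>label'."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[0])   # text is now the first column
      label = tokenization.convert_to_unicode(line[-1])   # label is the last column
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples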
Write Run Script
Create a run script named run.sh with the following content:
bash
export DATA_DIR=/media/ganjinzero/Code/bert/data/
export BERT_BASE_DIR=/media/ganjinzero/Code/bert/chinese_L-12_H-768_A-12

python run_classifier.py \
  --task_name=mytask \
  --do_train=true \
  --do_eval=true \
  --data_dir=$DATA_DIR/ \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/mytask_output
Here DATA_DIR is the folder containing your training data and BERT_BASE_DIR is the path to the Bert pre-trained model. task_name must match the key you registered for your DataProcessor class. Among the remaining flags, do_train controls whether to fine-tune, do_eval controls whether to evaluate, and the flag not used here, do_predict, controls whether to run prediction. If you do not need fine-tuning, or your GPU is too limited, you can drop do_train. max_seq_length is the maximum sentence length and can be reduced if memory is insufficient.
Make Predictions
Run the script:
bash
./run.sh
You can get results similar to the following:
***** Eval results *****
eval_accuracy = 0.845588
eval_loss = 0.505248
global_step = 343
loss = 0.505248
If you see output like this, the run succeeded. In the output_dir specified in run.sh you will find the model's evaluation results and the fine-tuned model checkpoint.
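If you also pass --do_predict=true (with a test.tsv in data_dir), run_classifier.py writes a test_results.tsv into output_dir with one row of tab-separated class probabilities per test example. A small sketch like the following turns those probabilities back into label names; it assumes the column order matches get_labels(), and the path is illustrative.
python
import csv

labels = ["game", "fashion", "houseliving"]  # assumed to match get_labels() order

with open("/mytask_output/test_results.tsv", encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f, delimiter="\t")):
        probs = [float(p) for p in row]
        best = probs.index(max(probs))
        print(i, labels[best], probs[best])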
Using Bert as Sentence Vectors
If you want to combine Bert's encodings with other models, it is useful to use Bert as a sentence encoder, that is, to obtain sentence-level vectors. We can do this with bert-as-service.
Once bert-as-service is installed, you can map sentences to fixed-length vectors. Start the service from the terminal:
bash
bert-serving-start -model_dir /media/ganjinzero/Code/bert/chinese_L-12_H-768_A-12 -num_worker=4
The argument after model_dir is the folder containing the Bert pre-trained model; set num_worker according to the number of CPU cores or GPUs available.
Now you can call the following command in Python:
python
from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['一二三四五六七八', '今天您吃了吗?'])
It is best to pass a list of sentences to bc.encode() rather than calling it on single strings, as batching improves efficiency.
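As a simple illustration of combining these vectors with another model, the matrix returned by bc.encode() (one 768-dimensional row per sentence for Bert-Base) can be fed straight into a conventional classifier. scikit-learn is assumed here and is not part of bert-as-service; the toy sentences and labels are only illustrative.
python
from bert_serving.client import BertClient
from sklearn.linear_model import LogisticRegression

bc = BertClient()

train_texts = ["APEX是一款新出的大逃杀游戏", "谷歌即将发布TensorFlow 2.0"]
train_labels = ["game", "technology"]

# Encode the sentences once and use the vectors as an ordinary feature matrix.
X_train = bc.encode(train_texts)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

print(clf.predict(bc.encode(["王者荣耀更新了新版本"])))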
Reference Documents
[Github: bert](https://github.com/google-research/bert)
[arXiv:bert](https://arxiv.org/pdf/1810.04805.pdf)
[Github:bert-as-service](https://github.com/hanxiao/bert-as-service)
Author Introduction

GjZero is a second-year PhD student at Tsinghua University's Statistics Center. His research focuses on natural language processing for medical informatics. He is also interested in game-theoretic games such as poker and mahjong.
Github:
https://github.com/GanjinZero
Personal Homepage:
https://ganjinzero.github.io/
Editor: Wenjing
Proofreader: Lin Yilin