Big Data Digest authorized repost from Data Party THU
Author: Chen Zhiyan
Twitter has always been an important source of news, and during the COVID-19 pandemic the public has used it to voice their anxieties. However, manually classifying, filtering, and summarizing the massive amount of COVID-19 information on Twitter is nearly impossible. This daunting task falls to BERT, the go-to machine learning tool in natural language processing (NLP). A BERT model adapted to automatically classify, filter, and summarize the vast amount of COVID-19 content on Twitter, and thereby to improve the analysis and understanding of that content, is called the COVID-Twitter-BERT model, abbreviated CT-BERT.
The model is based on BERT-LARGE (English, uncased, whole-word masking). BERT-LARGE is trained mainly on large raw-text corpora such as English Wikipedia (3.5B words) and free book corpora (0.8B words). Although these datasets are enormous, they contain little information from specialized subfields. In several professional fields, transformer models have already been pre-trained on domain-specific corpora, for example BioBERT and SciBERT; these all use the same unsupervised training objectives (MLM / NSP / SOP) and require enormous hardware resources. A more common and general approach is to start from the weights of a general-purpose model, continue pre-training on the specialized domain, and then feed the domain-adapted weights, rather than the general-domain weights, into downstream task training.
1. Training Process
The CT-BERT model is trained on a corpus of 160 million tweets about the coronavirus, collected between January 12, 2020, and April 16, 2020, using the Twitter filter API to listen to a set of English keywords related to COVID-19. Before training, retweet tags are cleaned from the raw corpus, usernames in each tweet are replaced with a generic text token, and all URLs and web links are handled in the same way. In addition, all unicode emojis are converted to textual ASCII representations with the Python emoji library (for example, a smiley face is replaced with 'smile'). Finally, all retweets and duplicates are removed from the dataset, leaving a final corpus of 22.5 million tweets totaling 0.6B words, about one-seventh the size of the general-domain dataset used to train the original model. Each tweet is treated as an independent document and split into individual sentences with the spaCy library.
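The exact cleaning scripts live in the project repository; purely as an illustration of the steps described above, a minimal Python sketch using the emoji and spaCy libraries might look like the following (the placeholder tokens are assumptions, not the authors' exact choices):

```python
# Minimal preprocessing sketch, not the authors' exact script.
# Assumes `pip install emoji spacy` and `python -m spacy download en_core_web_sm`.
import re
import emoji
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_tweet(text):
    # Replace user mentions and URLs with generic placeholder tokens (illustrative names).
    text = re.sub(r"@\w+", "@user", text)
    text = re.sub(r"https?://\S+", "http://url", text)
    # Convert unicode emojis to ASCII text, e.g. a mask emoji becomes ":face_with_medical_mask:".
    return emoji.demojize(text).strip()

def split_sentences(text):
    # Treat the tweet as one document and split it into sentences with spaCy.
    return [sent.text for sent in nlp(clean_tweet(text)).sents]

print(split_sentences("Stay safe everyone 😷 More info here: https://example.com"))
```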
All sequences fed into BERT are converted into tokens drawn from a 30,000-word vocabulary. Since each tweet is limited to 280 characters, the maximum sequence length is set to 96 tokens. The training batch size is increased to 1,024, which ultimately yields 285M training examples and 2.5M validation examples from the dataset. The learning rate is kept constant at 2e-5, and the remaining parameters for pre-training on the domain-specific dataset follow the values recommended by Google on GitHub.
During pre-training, loss and accuracy are computed and a checkpoint is saved every 100,000 training steps, so that the model can later be plugged into various downstream classification tasks. Distributed training on a TPUv3-8 (128GB) under TensorFlow 2.2 ran continuously for 120 hours.
CT-BERT is a transformer-based model pre-trained on a large corpus of tweets about COVID-19. The v2 model is trained on 97M tweets (1.2B training examples).
When CT-BERT is used to train on datasets from its specialized domain, the evaluation results show a 10-30% performance improvement over the standard BERT-Large model; the improvement is especially pronounced on COVID-19-related Twitter datasets.
2. Training Method
If you are familiar with fine-tuning transformer models, the CT-BERT model can be downloaded from two channels: TFHub or Huggingface.
Figure 1
Huggingface
Load the pre-trained model from huggingface:
Figure 2
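For reference, loading it with the transformers library typically looks like the sketch below; the model identifier is assumed to be digitalepidemiologylab/covid-twitter-bert-v2 (check the Huggingface model hub for the exact name), and PyTorch is assumed to be installed:

```python
# Hedged sketch: loading CT-BERT via the Huggingface transformers library.
from transformers import AutoTokenizer, AutoModel

model_name = "digitalepidemiologylab/covid-twitter-bert-v2"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Wearing a mask reduces transmission.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 1024) for a BERT-Large encoder
```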
Use the built-in pipeline to predict masked tokens:
Figure 3
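A sketch of what such a masked-token prediction might look like with the fill-mask pipeline is shown below (again assuming the digitalepidemiologylab/covid-twitter-bert-v2 identifier):

```python
# Hedged sketch: masked-token prediction with the transformers fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="digitalepidemiologylab/covid-twitter-bert-v2")

# [MASK] is BERT's mask token; the pipeline returns the top candidate fillers.
for prediction in fill_mask("Everybody should wear a [MASK] in public."):
    print(prediction["token_str"], round(prediction["score"], 3))
```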
Load the pre-trained model from TF-Hub:
Figure 4
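A rough TF-Hub version is sketched below; both the hub handle and the input/output signature (word ids, mask, and segment ids in; pooled and sequence outputs out, as in the standard TF2 BERT modules) are assumptions to verify against the published module:

```python
# Hedged sketch: building a Keras model around the CT-BERT TF-Hub module.
import tensorflow as tf
import tensorflow_hub as hub

# Assumed handle; check tfhub.dev for the exact address and version.
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/digitalepidemiologylab/covid-twitter-bert/1", trainable=True
)

max_seq_len = 96  # matches the sequence length used for pre-training
input_word_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32, name="segment_ids")

# Assumed signature: [pooled_output, sequence_output], as in standard TF2 BERT hub modules.
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
logits = tf.keras.layers.Dense(2, activation="softmax")(pooled_output)
model = tf.keras.Model([input_word_ids, input_mask, segment_ids], logits)
model.summary()
```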
Fine-tune CT-BERT with the following script
The run_finetune.py script can be used to train the classifier; the code relies on the official BERT model implementation under the TensorFlow 2.2 / Keras framework.
Before running the code, the following settings are required:
- a Google Cloud bucket;
- a Google Cloud VM running TensorFlow 2.2;
- a TPU running TensorFlow 2.2, in the same zone as the VM.
If you are doing research work, you can apply for access to TPU and/or Google Cloud.
Installation
Recursively clone the repository:
Figure 5
The code was developed with tf-nightly and is kept backward compatible so that it runs on TensorFlow 2.2. It is recommended to use Anaconda to manage Python versions:
Figure 6
Install the dependencies from requirements.txt:
Figure 7
3. Data Preparation
Split the data into a training set, train.tsv, and a validation set, dev.tsv, in the following format:
Figure 8
Place the two prepared dataset files under data/finetune/originals/<dataset_name>/(train|dev).tsv.
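The exact column layout is defined by the repository's preprocessing script (see the format figure above); purely as a hypothetical illustration, assuming a simple label/text layout and a dataset folder named my_dataset, the files could be created like this:

```python
# Hypothetical illustration only: column names and layout are assumptions,
# not the repository's documented format.
import csv
import os

rows = [
    ("positive", "Vaccines are rolling out faster than expected!"),
    ("negative", "Another lockdown announced, this is exhausting."),
]

os.makedirs("data/finetune/originals/my_dataset", exist_ok=True)
for split in ("train", "dev"):
    with open(f"data/finetune/originals/my_dataset/{split}.tsv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["label", "text"])  # assumed header
        writer.writerows(rows)
```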
Then run:
Figure 9
TFRecord files are then generated in the data/finetune/run_2020-05-19_14-14-53_517063_test_run/<dataset_name>/tfrecords directory. Load the data with:
Figure 10
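The original figure shows the project's own command; independently of it, a quick way to sanity-check the generated TFRecord files with tf.data is sketched below (the file extension and feature names are assumptions):

```python
# Hedged sketch: inspecting the generated TFRecord files; not a replacement for
# the repository's own data-loading step.
import tensorflow as tf

files = tf.io.gfile.glob(
    "data/finetune/run_2020-05-19_14-14-53_517063_test_run/*/tfrecords/*.tfrecord*"
)
dataset = tf.data.TFRecordDataset(files)

for raw_record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    # Prints whatever features were serialized, e.g. input ids, mask, label ids.
    print(sorted(example.features.feature.keys()))
```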
4. Pre-training
The pre-training code starts from an existing pre-trained model (such as BERT-Large) and performs unsupervised pre-training on target-domain data (in this case, Twitter data). In principle, this code can be used to pre-train on any specialized domain dataset.
Data Preparation Phase
The data preparation phase consists of two steps:
Data Cleaning Phase
In the first step, run the following script to clean usernames, URLs, and other fields and to convert ('asciify') emojis to text:
Figure 11
In the second step, generate the TFRecord files for pre-training:
Figure 12
This process consumes a lot of memory and takes a long time (the number of CPUs used can be limited with max_num_cpus), so you may prefer to use the already prepared TFRecord files; the preprocessed data is stored in the data/pretrain/<run_name>/tfrecords/ directory.
Simply sync the prepared data:
Figure 13
Pre-training
Before pre-training the model, make sure the archive located at gs://cloud-tpu-checkpoints/bert/keras_bert/wwm_uncased_L-24_H-1024_A-16.tar.gz has been extracted and the pre-trained model copied into the gs://<bucket_name>/pretrained_models/bert/keras_bert/wwm_uncased_L-24_H-1024_A-16/ directory. Once the model and TFRecord files are in place, access the TPU and bucket from the Google Cloud VM (both must be located in the same zone):
Figure 14
Run log files and model checkpoint files will be generated in the gs://<bucket_name>/pretrain/runs/<run_name> directory.
5. Fine-tuning
Use the following command to fine-tune CT-BERT on this dataset:
Figure 15
Run the training with this configuration; the run log files are saved to gs://<bucket_name>/covid-bert/finetune/runs/run_2020-04-29_21-20-52_656110_<run_prefix>/. Among the TensorFlow log files, run_logs.json contains all relevant training information.
Figure 16
Run the sync_bucket_data.py script on your local computer to download the training log files:
Figure 17
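Once the logs have been downloaded locally, a small sketch like the following can be used to inspect the run_logs.json files; no particular field names are assumed, it simply lists whatever keys the training script recorded:

```python
# Hedged sketch: listing the contents of downloaded run_logs.json files.
import json
from pathlib import Path

# Adjust the search root to wherever the sync script placed the downloaded files.
for log_path in Path("data").rglob("run_logs.json"):
    with open(log_path) as f:
        run_log = json.load(f)
    # Print the recorded keys (e.g. hyperparameters and evaluation metrics).
    print(log_path, sorted(run_log.keys()))
```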
The model training utilized resources provided by TensorFlow Research Cloud (TFRC) and Google Cloud COVID-19 research.
6. Model Evaluation
Five independent training sets were selected to evaluate the model's practical performance on downstream tasks. Three are public datasets and two come from unpublished internal projects; the datasets consist largely of COVID-19-related content from Twitter.
Figure 18: Overview of the evaluation datasets. All five are labeled multi-class datasets, visualized by the width of the proportion bars in the label column; N and Neg denote negative sentiment, while Disc and A denote frustration and uncertainty, respectively.
7. Training Results
Figure 19 shows CT-BERT's results on the validation set, evaluated every 25k pre-training steps using 1k validation steps, with all metrics tracked throughout training. The most notable improvement is in the MLM loss, which reaches a final value of 1.48. The NSP loss improves only slightly, since its initial performance was already good. Training was halted at 500,000 steps; with a batch size of 1,024 this corresponds to roughly 512M training examples, or about 1.8 epochs over the 285M-example dataset. All MLM and NSP metrics improved steadily throughout training; however, the loss/metrics of these tasks are of limited use for deciding when to stop training.
Figure 19: CT-BERT Domain-Specific Dataset Pre-training Evaluation Metrics. Displays the loss and accuracy for the masked language model (MLM) and next sentence prediction (NSP) tasks.
The experiments show that downstream performance begins to improve rapidly once roughly 200,000 pre-training steps have been completed, with further gains accumulating more gradually beyond that point. For the COVID-19-related datasets, downstream performance improves markedly after 200,000 steps. The only non-Twitter dataset, SST-2, responds much more slowly, only beginning to improve after 200,000 steps.
Even when running the same model on the same dataset, some variation in performance can be observed. This variation is dataset-dependent, but it does not increase noticeably over the course of pre-training and is roughly in line with the variation observed when running BERT-LARGE. Training on the SE dataset is the most stable and training on SST-2 the least stable, with most of the differences falling within the margin of error.
8. Conclusion
Although CT-BERT significantly improves the performance of classification tasks, experiments applying it to other natural language processing tasks have not yet been conducted. In addition, at the time of writing, only one COVID-19-related dataset was available. A next step could be to further improve the model's performance by tuning hyperparameters such as the learning rate, training batch size, and optimizer. Future work may also include evaluating the training results on other datasets.