Why Is Your Saved BERT Model So Large?

Produced by Machine Learning Algorithms and Natural Language Processing

Original column author | Liu Cong, NLP Algorithm Engineer

A while ago, a friend asked me this question: the ckpt file of the bert-base model released by Google is only about 400 MB, so why is the ckpt I saved after fine-tuning 1.19 GB?

My answer at the time was: the ckpt file of Google's bert-base model only contains the parameters of the BERT Transformer layers and nothing else, whereas you added some extra parameters during fine-tuning, which is why yours is larger.

Thinking about it now, that answer was rather vague, and other students may well have the same question. So today I will explain in detail exactly which parameters get saved and why.

Practice is the sole criterion for testing truth.

First, we use a checkpoint reader (NewCheckpointReader) to read the ckpt file of Google's bert-base model directly. The advantage of this approach is that we can see every saved node without rebuilding and reloading the model.

from tensorflow.python import pywrap_tensorflow

# Path to the Google-released Chinese BERT-base checkpoint
ckpt_model_path = "chinese_L-12_H-768_A-12/bert_model.ckpt"
model_ckpt = pywrap_tensorflow.NewCheckpointReader(ckpt_model_path)
# Mapping from variable name to shape for every tensor stored in the checkpoint
var_dict = model_ckpt.get_variable_to_shape_map()
for key in var_dict:
    print("bert_parameter:", key)

The results obtained are as follows (due to too many parameters, only part of the parameters are listed):

bert_parameter: bert/embeddings/LayerNorm/beta
bert_parameter: bert/embeddings/LayerNorm/gamma
bert_parameter: bert/encoder/layer_9/attention/output/LayerNorm/beta
bert_parameter: bert/encoder/layer_9/attention/output/dense/bias
bert_parameter: bert/encoder/layer_9/attention/output/dense/kernel
bert_parameter: bert/encoder/layer_9/attention/self/key/kernel
bert_parameter: bert/encoder/layer_9/attention/self/query/bias
bert_parameter: bert/encoder/layer_9/attention/self/query/kernel
bert_parameter: bert/encoder/layer_9/intermediate/dense/bias
bert_parameter: bert/encoder/layer_9/intermediate/dense/kernel
bert_parameter: bert/encoder/layer_9/output/LayerNorm/beta
bert_parameter: bert/encoder/layer_9/output/dense/bias
bert_parameter: bert/encoder/layer_9/output/dense/kernel
bert_parameter: bert/pooler/dense/bias
bert_parameter: bert/pooler/dense/kernel
bert_parameter: cls/predictions/transform/LayerNorm/beta
bert_parameter: cls/predictions/transform/LayerNorm/gamma
bert_parameter: cls/predictions/transform/dense/bias
bert_parameter: cls/predictions/transform/dense/kernel
bert_parameter: cls/seq_relationship/output_bias
bert_parameter: cls/seq_relationship/output_weights

We can see that the ckpt file of Google's bert-base model saves not only the parameters of every BERT Transformer layer, but also the embedding parameters and the prediction heads needed for pre-training (for example, the fully connected layer used for NSP, next sentence prediction). So my earlier answer to that friend was not entirely accurate.
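As a side note, you can roughly confirm the 400 MB figure from the same shape map. The sketch below is only an estimate and assumes every value is stored as float32 (4 bytes per parameter); it sums the shapes of all tensors in the checkpoint.

import numpy as np
from tensorflow.python import pywrap_tensorflow

ckpt_model_path = "chinese_L-12_H-768_A-12/bert_model.ckpt"
reader = pywrap_tensorflow.NewCheckpointReader(ckpt_model_path)
var_dict = reader.get_variable_to_shape_map()

# Total number of values stored in the checkpoint
total_params = sum(int(np.prod(shape)) for shape in var_dict.values())
print("total parameters:", total_params)
# Assuming float32 (4 bytes per value), this is roughly the ckpt size on disk
print("approx. size: %.0f MB" % (total_params * 4 / 1024 ** 2))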

Next, we will use tf.train.NewCheckpointReader to read our model after fine-tuning to see what parameters are saved.

from tensorflow.python import pywrap_tensorflow

# Path to the checkpoint saved after fine-tuning
ckpt_model_path = "my_model/bert_model.ckpt"
model_ckpt = pywrap_tensorflow.NewCheckpointReader(ckpt_model_path)
# Mapping from variable name to shape for every tensor stored in the checkpoint
var_dict = model_ckpt.get_variable_to_shape_map()
for key in var_dict:
    print("bert_parameter:", key)

The results obtained are as follows (again, only part of the parameters are listed):

bert_parameter: bert/embeddings/LayerNorm/beta
bert_parameter: bert/embeddings/LayerNorm/beta/adam_v
bert_parameter: bert/encoder/layer_9/attention/output/LayerNorm/beta/adam_m
bert_parameter: bert/encoder/layer_9/attention/output/LayerNorm/beta/adam_v
bert_parameter: bert/encoder/layer_9/attention/output/dense/bias/adam_m
bert_parameter: bert/encoder/layer_9/attention/output/dense/bias/adam_v
bert_parameter: bert/encoder/layer_9/attention/output/dense/kernel
bert_parameter: bert/encoder/layer_9/attention/output/dense/kernel/adam_m
bert_parameter: bert/encoder/layer_9/attention/output/dense/kernel/adam_v
bert_parameter: bert/encoder/layer_9/attention/self/key/kernel
bert_parameter: bert/encoder/layer_9/attention/self/key/kernel/adam_m
bert_parameter: bert/encoder/layer_9/attention/self/key/kernel/adam_v
bert_parameter: bert/encoder/layer_9/attention/self/query/kernel/adam_m
bert_parameter: bert/encoder/layer_9/attention/self/query/kernel/adam_v
bert_parameter: bert/encoder/layer_9/attention/self/value/bias
bert_parameter: bert/encoder/layer_9/attention/self/value/bias/adam_m
bert_parameter: bert/encoder/layer_9/attention/self/value/bias/adam_v
bert_parameter: bert/encoder/layer_9/attention/self/value/kernel
bert_parameter: bert/encoder/layer_9/attention/self/value/kernel/adam_m
bert_parameter: bert/encoder/layer_9/attention/self/value/kernel/adam_v
bert_parameter: bert/encoder/layer_9/intermediate/dense/bias
bert_parameter: bert/encoder/layer_9/intermediate/dense/bias/adam_m
bert_parameter: bert/encoder/layer_9/intermediate/dense/bias/adam_v
bert_parameter: bert/encoder/layer_9/output/LayerNorm/beta/adam_v
bert_parameter: bert/encoder/layer_9/output/dense/bias
bert_parameter: bert/encoder/layer_9/output/dense/kernel/adam_m
bert_parameter: bert/encoder/layer_9/output/dense/kernel/adam_v
bert_parameter: bert/pooler/dense/bias
bert_parameter: bert/pooler/dense/bias/adam_m
bert_parameter: bert/pooler/dense/bias/adam_v
bert_parameter: bert/pooler/dense/kernel
bert_parameter: bert/pooler/dense/kernel/adam_m
bert_parameter: bert/pooler/dense/kernel/adam_v

Seeing this result, everything should fall into place: we did not actually add many new variables during fine-tuning. The reason the saved ckpt file grows to 1.19 GB is that, for every variable, we also saved its adam_m and adam_v. Each variable therefore becomes three, which is exactly the jump from roughly 400 MB to 1.19 GB.
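The back-of-the-envelope arithmetic matches. The snippet below is only a rough estimate that assumes around 110 million float32 parameters for bert-base, but it shows where the factor of three comes from:

# Rough estimate, assuming ~110 million float32 parameters (4 bytes each)
params = 110 * 1000 * 1000
weights_only = params * 4 / 1024.0 ** 2      # weights alone: roughly 400+ MB
with_adam = params * 3 * 4 / 1024.0 ** 2     # weights + adam_m + adam_v: roughly 1.2+ GB
print("weights only: %.0f MB" % weights_only)
print("with Adam slots: %.2f GB" % (with_adam / 1024))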

Next, some students might ask: what are adam_m and adam_v, and why are these parameters saved?

The answer: during training the model is optimized by backpropagation, usually with the Adam optimizer. For every parameter, Adam maintains a moving average of the gradients (the first moment, corresponding to adam_m) and a moving average of the squared gradients (the second moment, corresponding to adam_v) so that the updates stay smooth.
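For reference, here is a minimal sketch of the update Adam performs for a single weight. It is not the exact TensorFlow/BERT implementation (BERT's AdamWeightDecayOptimizer, for instance, skips the bias correction), but it shows the two extra buffers m and v that end up in the checkpoint as adam_m and adam_v.

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: moving average of the gradient -> stored as .../adam_m
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: moving average of the squared gradient -> stored as .../adam_v
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the early steps
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v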

Simply put, during training every parameter needs extra companion variables that store the information used for backpropagation and gradient updates. Once training stops, these extra variables no longer serve any purpose: they are not needed at prediction time, nor when the checkpoint is used to initialize another model. However, when saving the model we save all variables by default, which is why the saved BERT model ends up at 1.19 GB.

To make our saved model smaller and alleviate the pressure on our hard disk, we can perform the following operation when saving the model:

tf.train.Saver(tf.trainable_variables()).save(sess, save_model_path)

By doing so, the saved model contains only the trainable parameters, and the extra optimizer variables are dropped. Changing a single line of code cuts the disk footprint by about two thirds, so why not do it!
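You can verify the effect by pointing the same checkpoint reader at the slimmed checkpoint; none of the saved names should end in adam_m or adam_v anymore (save_model_path below is whatever path you passed to save):

from tensorflow.python import pywrap_tensorflow

reader = pywrap_tensorflow.NewCheckpointReader(save_model_path)
saved_keys = reader.get_variable_to_shape_map().keys()
# No optimizer slots should remain in the slimmed checkpoint
assert not any(k.endswith("adam_m") or k.endswith("adam_v") for k in saved_keys)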

Below is the code we usually use to save the model, which saves all parameters.

tf.train.Saver().save(sess, save_model_path)
# Roughly equivalent to
# tf.train.Saver(tf.global_variables()).save(sess, save_model_path)
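If you want to see the difference before saving anything, you can also compare the two variable collections directly; this is just a quick sanity check that assumes a TF 1.x fine-tuning graph (BERT plus the Adam optimizer) has already been built in the current session:

import tensorflow as tf  # TF 1.x

# Only the model weights
print("trainable variables:", len(tf.trainable_variables()))
# Weights plus optimizer slots such as adam_m / adam_v, roughly three times as many
print("global variables:", len(tf.global_variables()))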

Practice is the sole criterion for testing truth. Sometimes what you think is only what you think; only what you actually try out turns out to be true.
