BERT Model – Deeper and More Efficient

1 Algorithm Introduction

The full name of BERT is Bidirectional Encoder Representations from Transformers, and it is a pre-trained language representation model. Its key point is that pre-training no longer relies on a traditional unidirectional language model, or on shallowly concatenating two unidirectional language models, but instead uses a masked language model (MLM) to produce deep bidirectional language representations. The BERT paper reported new state-of-the-art results on 11 NLP (Natural Language Processing) tasks, which is remarkable.

The model has the following main characteristics:

1) It uses MLM to pre-train bidirectional Transformers to generate deep bidirectional language representations.

2) After pre-training, only an additional output layer needs to be added for fine-tuning, enabling state-of-the-art performance across various downstream tasks without requiring task-specific structural modifications to BERT.
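
To make the MLM objective concrete, here is a minimal sketch (not BERT's actual training code) of how a batch of token ids could be randomly masked before being fed to the encoder; the 15% masking rate and the 80/10/10 replacement scheme follow the original paper, while the function name, tensor shapes, and the use of -100 as an ignore index are illustrative assumptions.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Illustrative MLM masking: hide ~15% of positions and label only those."""
    labels = input_ids.clone()

    # Choose which positions to mask (special-token handling omitted for brevity).
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100  # unmasked positions do not contribute to the loss

    # 80% of masked positions become [MASK].
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[to_mask] = mask_token_id

    # 10% become a random token; the remaining 10% are left unchanged.
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]

    return input_ids, labels
```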

2 Algorithm Principles

The input to BERT is the representation of each token (in the figure, the pink blocks are tokens and the yellow blocks are their corresponding representations), and the vocabulary is built with the WordPiece algorithm. To support classification tasks, in addition to the word tokens, the authors insert a special classification token ([CLS]) at the beginning of every input sequence; its output at the last Transformer layer is used to aggregate the representation of the whole sequence.
Since BERT is a pre-trained model, it must adapt to a variety of natural language tasks, so the input sequence must be able to hold either a single sentence (text sentiment classification, sequence labeling) or more than one sentence (text summarization, natural language inference, question answering). How, then, does the model tell which range belongs to sentence A and which belongs to sentence B? BERT uses two mechanisms:

1) A separator token ([SEP]) is inserted after each sentence in the token sequence to separate the tokens of different sentences.

2) A learnable segment embedding is added to each token representation to indicate whether it belongs to sentence A or sentence B.

Thus, the final input sequence of tokens for the model is as shown in the figure (if the input contains only one sentence, the [SEP] between sentences and the tokens after it are absent):

Figure: Input Sequence of the Model
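
For concreteness, here is a minimal sketch (assuming whitespace splitting in place of WordPiece) of how such an input sequence and its segment ids could be assembled for a sentence pair; the function name and the example sentences are illustrative.

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Assemble [CLS] ... [SEP] ... [SEP] tokens with their segment ids."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)            # sentence A -> segment 0
    if sentence_b is not None:
        b_tokens = sentence_b.split() + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)     # sentence B -> segment 1
    return tokens, segment_ids

tokens, segment_ids = build_bert_input("my dog is cute", "he likes playing")
# tokens:      ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
# segment_ids: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```
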
As mentioned, the input to BERT is the representation of each token, which is actually the sum of three parts: the corresponding token, segment, and position embeddings, as shown in the figure below:

Figure: Composition of Token Representation
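
The way the three embeddings combine can be sketched as follows; the hidden size of 768 matches BERT-Base, while the vocabulary size, maximum length, and class name are assumptions for illustration (BERT additionally applies layer normalization and dropout to the sum, omitted here).

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings summed into one representation."""

    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, input_ids, segment_ids):
        # One learned position embedding per index 0..seq_len-1.
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        # The three embeddings are simply added element-wise.
        return self.token(input_ids) + self.segment(segment_ids) + self.position(positions)
```
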
Up to this point, the input to BERT has been introduced, and it can be seen that its design is simple and effective.
With the input covered, the output of BERT follows naturally, because a Transformer produces one output for each input position, as shown in the figure below:

Figure: Output of BERT
C is the last-layer Transformer output corresponding to the classification token ([CLS]), and Ti denotes the last-layer outputs of the other tokens. For token-level tasks (e.g., sequence labeling and question answering), the Ti are fed into an additional output layer for prediction. For sentence-level tasks (e.g., natural language inference and sentiment classification), C is fed into an additional output layer, which is why the special classification token is inserted at the beginning of every token sequence.
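
As a rough illustration of how C and the Ti feed different output layers, the sketch below attaches a sentence-level head to C and a token-level head to the remaining positions; the batch size, sequence length, and label counts are assumed purely for illustration, and a real encoder output would replace the random tensor.

```python
import torch
import torch.nn as nn

hidden, num_sentence_labels, num_token_labels = 768, 2, 9  # illustrative sizes

# Stand-in for the last Transformer layer's output: (batch, seq_len, hidden),
# where position 0 corresponds to the [CLS] token.
encoder_out = torch.randn(4, 128, hidden)

sentence_head = nn.Linear(hidden, num_sentence_labels)  # applied to C
token_head = nn.Linear(hidden, num_token_labels)        # applied to each Ti

C = encoder_out[:, 0, :]                                # output at [CLS]
sentence_logits = sentence_head(C)                      # (batch, num_sentence_labels)
token_logits = token_head(encoder_out[:, 1:, :])        # (batch, seq_len-1, num_token_labels)
```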

3 Algorithm Applications

BERT is a powerful pre-trained model built on Transformers, and its pre-trained representations can be transferred to downstream tasks. For example, in research on named entity recognition for traditional Chinese medicine cases, researchers added the BERT language model to a BiLSTM-CRF architecture to improve recognition performance. Because the model uses a bidirectional Transformer encoder, the generated character vectors fully integrate the context on both sides of each character, so it represents the polysemy of characters better than traditional language models. The experimental results also indicate that BERT is effective at extracting the relational features between characters in text and improves performance, showing clear advantages over other models for named entity recognition in traditional Chinese medicine cases. A rough sketch of this architecture is given below.
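
As a sketch of the BERT-BiLSTM-CRF idea described above (not the authors' exact implementation), BERT's contextual character vectors could be fed into a BiLSTM whose outputs are projected to per-tag emission scores; the CRF layer that decodes the best tag sequence from these emissions is only indicated in a comment, and the pretrained model name, tag count, and hidden size are assumptions.

```python
import torch.nn as nn
from transformers import BertModel  # assumes the Hugging Face transformers package

class BertBiLstmTagger(nn.Module):
    """Sketch of a BERT -> BiLSTM -> emission-score stack for NER.

    A CRF layer would normally sit on top to decode the most likely tag
    sequence from the emission scores; it is omitted here for brevity.
    """

    def __init__(self, num_tags=9, lstm_hidden=256,
                 pretrained="bert-base-chinese"):  # illustrative choices
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        # Contextual character vectors from the bidirectional Transformer encoder.
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        return self.emissions(lstm_out)  # per-character tag scores for the CRF
```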


4 Conclusion

Compared with the original RNN and LSTM, BERT can process tokens in parallel while extracting the relational features of words in a sentence, and it can extract such features at multiple levels, thereby reflecting sentence semantics more comprehensively. Compared with word2vec, it derives word meanings from sentence context, which helps avoid ambiguity. However, BERT also has drawbacks: it has a very large number of parameters, the model is large, it is prone to overfitting when training data are limited, and its support for generative tasks and long-sequence modeling is weak.

  • References:
    [1] Zhihu Column. “Understanding BERT, This Article Is Enough”. Accessed February 28, 2024. https://zhuanlan.zhihu.com/p/403495863.
    [2] Hu Wei, Liu Wei, Shi Yujing. Method for Named Entity Recognition of Traditional Chinese Medicine Cases Based on BERT-BiLSTM-CRF [J]. Computer Era, 2022(09): 119-122+135. DOI: 10.16644/j.cnki.cn33-1094/tp.2022.09.027.
