Transformer: A Deep Learning Model Based on Self-Attention Mechanism

1. Algorithm Introduction

Deep learning (DL) is a newer research direction within machine learning. By simulating the structure of the human brain's neural networks, it can analyze and process complex data, overcoming the difficulties traditional machine learning methods face with unstructured data, and it has brought significant performance gains in areas such as image recognition, speech recognition, and natural language processing. Common deep learning models include Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), and Recurrent Neural Networks (RNN). Early on, CNNs achieved remarkable success in image recognition and natural language processing. As task complexity increased, sequence-to-sequence models and RNNs became the usual methods for processing sequential data, but RNNs and their variants often suffer from vanishing gradients and model degradation on long sequences. The Transformer model was proposed to address these problems, and subsequent large models such as GPT and BERT have demonstrated outstanding performance built on the Transformer architecture.

The Transformer is a sequence model based on the attention mechanism, originally proposed by Google's research team for machine translation. Unlike traditional RNNs and CNNs, the Transformer relies entirely on the self-attention mechanism to process input and output sequences, which allows parallel computation and greatly improves computational efficiency.
2. Algorithm Principles

The Transformer model, built on the seq2seq (encoder-decoder) architecture, can accomplish typical NLP tasks such as machine translation and text generation, and can also be used to build pre-trained language models for transfer learning across different tasks.

2.1 Overall Architecture of Transformer
The overall architecture consists of four parts (a minimal code sketch of one encoder layer is given after this list):
(1) Input part

1) Source text embedding layer and its positional encoder

2) Target text embedding layer and its positional encoder

(2) Output part

1) Linear layer: projects the output of the previous step to the specified output dimension, i.e. it performs the dimension conversion;

2) Softmax processor: converts the values along the last dimension into probabilities between 0 and 1 that sum to 1.

(3) Encoder part

1) Composed of N stacked encoder layers;

2) Each encoder layer consists of two sub-layer connection structures;

3) The first sub-layer connection structure includes a multi-head self-attention sub-layer, a normalization layer, and a residual connection;

4) The second sub-layer connection structure includes a feedforward fully connected sub-layer, a normalization layer, and a residual connection.

(4) Decoder part

1) Composed of N stacked decoder layers; each decoder layer consists of three sub-layer connection structures;

2) The first sub-layer connection structure includes a multi-head self-attention sub-layer, a normalization layer, and a residual connection;

3) The second sub-layer connection structure includes a multi-head attention sub-layer (attending over the encoder output), a normalization layer, and a residual connection;

4) The third sub-layer connection structure includes a feedforward fully connected sub-layer, a normalization layer, and a residual connection.
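
As a concrete reference for the structure listed above, the following is a minimal PyTorch sketch of a single encoder layer. It is an illustrative assumption, not the implementation of any specific system described here; the hyperparameters (d_model=512, 8 heads, d_ff=2048) follow the original Transformer paper.

```python
# Minimal sketch of one encoder layer: two sub-layer connection structures,
# each with a residual connection and layer normalization.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # First sub-layer: multi-head self-attention
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Second sub-layer: feedforward fully connected network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection + normalization around self-attention
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Residual connection + normalization around the feedforward sub-layer
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

A full encoder stacks N such layers. A decoder layer differs by inserting the additional multi-head attention sub-layer over the encoder output between these two sub-layers, as described in (4) above.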


2.2 Transformer Attention Mechanism

When the embedding vectors are passed as input to the Transformer's Attention module, the module computes attention weights from Q, K, and V. This lets the vectors attend to one another, so the embeddings can "communicate" and update their values based on each other's information.

The main function of the Attention module is to determine which embedding vectors are most relevant to the current task in the given context and to update or adjust the representations of these embedding vectors accordingly.

Attention mechanism calculation formula: in the attention mechanism, the Query (Q), Key (K), and Value (V) vectors are obtained from the input through mapping (projection) matrices. The dot-product similarity between Q and K is computed and normalized with softmax to produce weights, which are then used to form a weighted sum of V as the output.
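
The description above corresponds to the standard scaled dot-product attention formula from the original Transformer paper (the scaling by the square root of d_k, the dimension of the key vectors, is part of the standard formulation):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```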

In the Transformer architecture, there are three different attention layers: Self Attention, Cross Attention, and Causal Attention.
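
As a rough illustration of how these three variants relate, the following minimal NumPy sketch (an illustrative assumption, not the implementation of any system described here) shows that they share the same calculation and differ only in where Q, K, and V come from and whether a causal mask is applied:

```python
import numpy as np

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # dot-product similarity between Q and K
    if causal:
        # Causal attention: each position may only attend to itself and earlier positions
        mask = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax normalization
    return weights @ V                         # weighted sum of V

enc = np.random.randn(5, 64)   # encoder-side embeddings (5 tokens, dimension 64)
dec = np.random.randn(3, 64)   # decoder-side embeddings (3 tokens, dimension 64)

self_out   = attention(enc, enc, enc)               # Self Attention: Q, K, V from the same sequence
cross_out  = attention(dec, enc, enc)               # Cross Attention: Q from decoder, K, V from encoder
causal_out = attention(dec, dec, dec, causal=True)  # Causal Attention: masked self-attention in the decoder
```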


3. Algorithm Applications
The attention-based Transformer model not only outperforms earlier deep learning models but also offers better interpretability, making it well suited to translation-style deep learning on traditional Chinese medicine data, such as mapping symptoms to prescriptions. An intelligent prescription system built with this method can collect, analyze, and pass on expert experience, and can be further extended to clinical and educational settings, thereby promoting the inheritance and development of traditional Chinese medicine.

In a specific application, Yang Yun and colleagues [1] selected patients treated between October 19, 2005 and April 28, 2021 at the Oncology Clinical Medical Center of the Shanghai Traditional Chinese Medicine Hospital and the TCM Oncology Department of Shanghai University of Traditional Chinese Medicine, and used the Transformer model to construct and evaluate an artificial intelligence lung cancer prescription prediction system. The construction process comprised four steps.

(1) Data collection: patients' general information and treatment records were organized in Excel 2010, including gender, age, chief complaint, medical history, Western medical diagnosis (TNM staging, sites of late-stage metastasis, pathological diagnosis type), prior treatment history (surgery and radiotherapy/chemotherapy), symptoms, syndrome types, tongue and pulse signs, treatment methods, and Chinese medicines used.

(2) Data processing: the Pandas library in Python (https://www.python.org) was used to process the data and extract "symptoms." A data cleaning and standardization platform was built around a mapping from raw symptoms to standard symptoms, replacing terms that essentially refer to the same symptom or medicine with a standard symptom or standard medicine name.

(3) Model computation: the input was the symptoms and the output was the medicines and syndrome types. The computation and operation process are shown in Figure 1. Symptoms and medicines were first embedded as 512-dimensional vectors; the input was then fed into the Transformer model for training; during training the input and output can interact, but the medicines are predicted one at a time.

(4) Evaluation criteria: standard multi-label classification metrics were used. With the predicted medicine set {Pred}, the correct medicine set {True}, and their intersection {Ins} = {Pred} ∩ {True}, precision (the proportion of correctly predicted medicines in the predicted prescription) is Precision = {Ins}/{Pred}; recall (the proportion of correctly predicted medicines in the actual prescription) is Recall = {Ins}/{True}; and the F1 value combines the two to reflect overall performance, F1 = 2 × Precision × Recall / (Precision + Recall). In addition, three senior traditional Chinese medicine oncology experts were invited to score the prescriptions predicted by the system (0 to 10 points).
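
The evaluation metrics in step (4) can be computed directly from set operations. The following minimal Python sketch uses hypothetical prescriptions solely to show the calculation; the medicine names are made up for illustration:

```python
# Hypothetical predicted and actual prescriptions for one patient
pred = {"huang qi", "bai zhu", "fu ling", "ban xia"}   # predicted medicine set {Pred}
true = {"huang qi", "fu ling", "chen pi"}              # actual medicine set {True}

ins = pred & true                                      # {Ins} = {Pred} ∩ {True}
precision = len(ins) / len(pred)                       # correctly predicted / all predicted
recall = len(ins) / len(true)                          # correctly predicted / all actual
f1 = 2 * precision * recall / (precision + recall)     # harmonic mean of precision and recall

print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
```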

The results show: (1) Using cross-validation with a training : validation : test set ratio of 7:2:1, the average results were Precision 69.3%, Recall 66.9%, and F1 68.0%; a test on a sample of 800 patients gave overall results similar to these averages (see Table 1). (2) For overall syndrome differentiation accuracy, the prescription prediction system's syndrome types predicted from symptoms were compared against the patients' actual syndrome types. (3) The three experts gave scores of 8.5, 8.6, and 8.7, with an average of 8.6.

Table 1 Test Results Classified by Frequency of Medicine Appearance (%)


4. Summary

The Transformer model in deep learning has advantages such as efficient parallel computing capabilities, strong representational abilities, and adaptability to long sequence data, demonstrating outstanding performance in various fields like natural language processing and computer vision. However, it also has drawbacks, such as relatively low parameter efficiency, sensitivity to input data, and difficulty in handling spatiotemporal dynamic changes. With continuous technological development, we look forward to overcoming these limitations in the future and further expanding the application range of the Transformer model. Additionally, combining other technical approaches, such as lightweight Transformer models and knowledge distillation, can maintain high performance while reducing the model’s complexity and computational costs, making the Transformer model more practically valuable.
References:
[1] Yang Yun, Pei Chaohan, Cui Ji, et al. Construction of an Artificial Intelligence Lung Cancer Prescription Prediction System Based on the Transformer Model [J]. Beijing Traditional Chinese Medicine, 2024, 43(02): 208-211.
[2] Deep Learning | Transformer Basic Principles. CSDN Blog. https://blog.csdn.net/m0_64140451/article/details/140395012. Accessed August 1, 2024.
[3] A Comprehensive Understanding of Transformer: The Attention Mechanism. CSDN Blog. https://blog.csdn.net/m0_59164304/article/details/140279543. Accessed August 1, 2024.
[4] The Ten Core Algorithms of Deep Learning. CSDN Blog. https://blog.csdn.net/m290345792/article/details/135115445. Accessed August 1, 2024.
