The MLNLP (Machine Learning Algorithms and Natural Language Processing) community is a well-known natural language processing community in China and abroad, whose members include NLP master's and doctoral students, university professors, and industry researchers.
The community's vision is to promote communication between academia and industry in natural language processing and machine learning, and among enthusiasts, especially beginners.
This article is reprinted from | Jishi Platform
Source | Harbin Institute of Technology SCIR
Author | Alexander Rush
Below is a detailed blog post about the Transformer from Harvard University, translated by our laboratory.

The Transformer network structure proposed in the paper “Attention is All You Need” has recently attracted a lot of attention. The Transformer not only significantly improves translation quality but also provides a new architecture for many other NLP tasks. Although the original paper is written clearly, many people have reported difficulty implementing the model correctly.
Therefore, we have written an annotated version of the paper and provide a line-by-line implementation of the Transformer. This document removes some sections of the original paper, rearranges others, and adds annotations throughout. It is presented as a Jupyter notebook of directly executable code, with a total of 400 lines of library code, capable of processing 27,000 tokens per second on 4 GPUs.
To run this work, you first need to install PyTorch. The complete notebook file and dependencies can be found on GitHub or Google Colab.
It should be noted that this annotated document and its code are intended as an introductory tutorial for researchers and developers. The code provided is based mainly on the OpenNMT implementation; for other implementations of this model, see Tensor2Tensor (TensorFlow version) and Sockeye (MXNet version).
– Alexander Rush (@harvardnlp or [email protected])
0
Preparation Work
# Uncomment and run to install the dependencies (PyTorch 0.3, numpy, matplotlib, spacy, torchtext, seaborn):
# !pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib spacy torchtext seaborn

Table of Contents
Preparation Work
Background
Model Structure
– Encoder and Decoder
– Encoder
– Decoder
– Attention
– Application of Attention in the Model
– Position-wise Feedforward Network
– Embedding and Softmax
– Position Encoding
– Complete Model
(Due to the length of the original text, the remaining parts will be in the next article)
Training
– Batches and Masks
– Training Loop
– Training Data and Batching
– Hardware and Training Progress
– Optimizer
– Regularization
– Label Smoothing
First Example
– Data Generation
– Loss Calculation
– Greedy Decoding
Real Example
– Data Loading
– Iterators
– Multi-GPU Training
– Training System Add-ons: BPE, Search, Averaging
Results
– Attention Visualization
Conclusion
All annotations in this article are given as block quotes; the main text comes from the original paper.
1
Background
Reducing sequential computation is an important goal, and it is the motivation behind networks such as the Extended Neural GPU, ByteNet, and ConvS2S, all of which are CNN-based and compute hidden representations in parallel for all input and output positions.
In these models, the number of operations required to relate signals from two arbitrary input or output positions increases with the distance between positions; for example, ConvS2S increases linearly, while ByteNet increases logarithmically, making it more difficult to learn dependencies between two positions that are far apart. In contrast, in the Transformer, the number of operations is reduced to a constant level.
Self-attention, sometimes referred to as intra-attention, performs attention over different positions of a single sequence in order to compute a representation of that sequence. It has been applied successfully to many tasks, including reading comprehension, summarization, textual entailment, and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and there is evidence that they perform well on simple-language question answering and language modeling tasks.
To our knowledge, the Transformer is the first model that relies entirely on Self-attention without using sequence-aligned RNNs or convolutions to compute input-output representations.
2
Model Structure
Currently, most popular neural sequence transduction models have an Encoder-Decoder structure. The Encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn).
Given z, the Decoder generates one symbol at a time until the complete output sequence has been produced. At each decoding step the model is autoregressive, using the previously generated symbols as additional input when generating the next symbol.
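As a minimal PyTorch sketch of this generic Encoder-Decoder skeleton (the names EncoderDecoder and Generator and all signatures here are illustrative, not necessarily those of the notebook's code):

import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoder(nn.Module):
    """A standard Encoder-Decoder architecture: encode the source once,
    then decode conditioned on the encoded memory."""
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed   # source token ids -> embeddings (+ positions)
        self.tgt_embed = tgt_embed   # target token ids -> embeddings (+ positions)
        self.generator = generator   # final hidden states -> vocabulary distribution

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    def forward(self, src, tgt, src_mask, tgt_mask):
        # Teacher-forced training pass: decode the whole (shifted) target at once;
        # at inference time, decode() is called once per generated symbol.
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

class Generator(nn.Module):
    "Linear projection followed by log-softmax over the output vocabulary."
    def __init__(self, d_model, vocab):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)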

The overall structure of the Transformer is shown in the figure below: stacked Self-attention layers and point-wise, fully connected layers are used in both the Encoder and the Decoder, whose general structures are shown in the left and right halves of the figure, respectively.

3
Encoder and Decoder
Encoder
The Encoder consists of N=6 identical layers.

We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [12].



Each layer consists of two sub-layers. The first sub-layer implements the “multi-head” Self-attention, while the second sub-layer is a simple Position-wise fully connected feedforward network.
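Here is a sketch, under the same assumptions as the skeleton above, of the N=6 encoder stack with a residual connection and layer normalization around each of its two sub-layers (nn.LayerNorm is used instead of a hand-written layer norm).

import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical copies of a module."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class SublayerConnection(nn.Module):
    "Residual connection around a sub-layer, with layer normalization and dropout."
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Norm-first variant used by many implementations; the paper applies
        # the normalization after the residual sum instead.
        return x + self.dropout(sublayer(self.norm(x)))

class EncoderLayer(nn.Module):
    "One encoder layer: multi-head self-attention, then a position-wise feed-forward network."
    def __init__(self, size, self_attn, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda y: self.self_attn(y, y, y, mask))
        return self.sublayer[1](x, self.feed_forward)

class Encoder(nn.Module):
    "A stack of N identical encoder layers followed by a final layer normalization."
    def __init__(self, layer, N=6):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)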

Decoder
The Decoder also consists of N=6 identical layers.

In addition to the two sub-layers found in each encoder layer, each decoder layer inserts a third sub-layer, which performs “multi-head” Attention over the output of the encoder stack. As in the encoder, we employ a residual connection around each sub-layer, followed by layer normalization.
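A matching sketch of the decoder stack, reusing clones and SublayerConnection from the encoder sketch above; each layer has three sub-layers: masked self-attention, attention over the encoder output, and the position-wise feed-forward network.

import torch.nn as nn

class DecoderLayer(nn.Module):
    "One decoder layer: masked self-attention, encoder-decoder attention, feed-forward."
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super().__init__()
        self.size = size
        self.self_attn = self_attn        # masked self-attention over previous outputs
        self.src_attn = src_attn          # attention over the encoder stack's output
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda y: self.self_attn(y, y, y, tgt_mask))
        x = self.sublayer[1](x, lambda y: self.src_attn(y, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

class Decoder(nn.Module):
    "A stack of N identical decoder layers followed by a final layer normalization."
    def __init__(self, layer, N=6):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)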



4
Attention



The “multi-head” mechanism allows the model to jointly attend to information from different positions and from different representation subspaces; a single attention head, which averages over this information, generally cannot capture these different relationships.
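For reference, the paper defines scaled dot-product attention as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, and “multi-head” attention runs h such attentions in parallel over h learned projections of Q, K, and V. Below is a minimal sketch of both, reusing the clones helper from the encoder sketch above.

import math
import torch
import torch.nn as nn

def attention(query, key, value, mask=None, dropout=None):
    "Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)   # block illegal connections
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h                                  # dimension per head
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)    # W_Q, W_K, W_V, W_O
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)                             # same mask for every head
        nbatches = query.size(0)
        # 1) Project and split into h heads: (batch, h, seq_len, d_k).
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        # 2) Apply attention in every head in parallel.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # 3) Concatenate the heads and apply the final linear projection.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)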



5
Application of Attention in the Model
In the Transformer, “multi-head” Attention is used in three different ways:
1) In the “Encoder-Decoder Attention” layer, the Query comes from the previous decoder layer, while the Key and Value come from the encoder’s output. Each position in the Decoder attends to all positions in the input sequence, consistent with the classic Encoder-Decoder Attention mechanism in Seq2Seq models.
2) In the Self-attention layer of the Encoder. In the Self-attention layer, all Keys, Values, and Queries come from the same place, which is the output of the previous layer in the Encoder. Each position in the current layer of the Encoder can attend to all positions in the previous layer.
3) Similarly, the Self-attention layers in the decoder allow each position in the decoder to attend to the current decoding position and all earlier positions. Masking is required here to prevent leftward information flow in the decoder and preserve the autoregressive property. This is implemented inside the scaled dot-product Attention by masking out (setting to negative infinity) all illegal connections in the input to the Softmax; see the sketch below.
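As a concrete illustration of point 3), here is a small sketch of such a mask (the name subsequent_mask is illustrative): position i may attend only to positions j <= i, and the blocked entries are exactly the ones that attention() above fills with a large negative value before the Softmax.

import torch

def subsequent_mask(size):
    "Boolean mask where entry (i, j) is True iff position i may attend to position j."
    attn_shape = (1, size, size)
    return torch.tril(torch.ones(attn_shape)).bool()

# Example: for a length-3 target, position 0 sees only itself, position 2 sees all three.
print(subsequent_mask(3))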
6
Position-wise Feedforward Network
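The paper defines this sub-layer as FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position separately and identically, with d_model = 512 and inner dimension d_ff = 2048 in the base model. A minimal sketch:

import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    "FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))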

7
Embedding and Softmax
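In the model, learned embeddings convert the input and output tokens to d_model-dimensional vectors, and the embedding weights are multiplied by sqrt(d_model); a learned linear transformation followed by Softmax (the Generator sketched earlier) converts the decoder output into next-token probabilities, and the paper also shares the same weight matrix between the two embedding layers and the pre-Softmax linear transformation. A minimal sketch of the embedding side:

import math
import torch.nn as nn

class Embeddings(nn.Module):
    "Token embedding scaled by sqrt(d_model)."
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)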

8
Position Encoding



We also experimented with learned positional Embeddings instead, and found that the two versions produced essentially the same results. We chose the sinusoidal version because it may allow the model to handle sequences longer than the maximum sequence length seen in the training corpus.
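A minimal sketch of this sinusoidal encoding, where PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) are added to the (dropout-regularized) embeddings:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    "Add fixed sinusoidal position encodings to the token embeddings."
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model), not a parameter

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions.
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)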
9
Complete Model
The function that assembles the complete model and sets its hyperparameters is defined below.
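A sketch of such a helper, wiring together the components sketched in the previous sections (EncoderDecoder, Encoder/Decoder, MultiHeadedAttention, PositionwiseFeedForward, PositionalEncoding, Embeddings, Generator) with the paper's base hyperparameters N=6, d_model=512, d_ff=2048, h=8, dropout=0.1; the name make_model is illustrative.

import copy
import torch.nn as nn

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Build a full Transformer from the components sketched above."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    # Initialize all weight matrices with Glorot / Xavier uniform initialization.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

# Example: a small toy model over 10-symbol vocabularies with 2 layers.
# tmp_model = make_model(10, 10, N=2)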

Reference Links
[1] https://arxiv.org/abs/1706.03762
[2] https://pytorch.org/
[3] https://github.com/harvardnlp/annotated-transformer
[4] https://drive.google.com/file/d/1xQXSv6mtAOLXxEMi8RvaW8TW-7bvYBDF/view?usp=sharing
[5] http://opennmt.net
[6] https://github.com/tensorflow/tensor2tensor
[7] https://github.com/awslabs/sockeye
[8] https://twitter.com/harvardnlp
[9] https://arxiv.org/abs/1409.0473
[10] https://arxiv.org/abs/1308.0850
[11] https://arxiv.org/abs/1512.03385
[12] https://arxiv.org/abs/1607.06450
[13] https://arxiv.org/abs/1409.0473
[14] https://arxiv.org/abs/1703.03906
[15] https://arxiv.org/abs/1609.08144
[16] https://arxiv.org/abs/1608.05859
[17] https://arxiv.org/pdf/1705.03122.pdf