The MLNLP (Machine Learning Algorithms and Natural Language Processing) community is a well-known natural language processing community in China and abroad, whose members include NLP master's and doctoral students, university professors, and industry researchers.
The community's vision is to promote communication between academia and industry in natural language processing and machine learning, and among enthusiasts, especially beginners.
This article is reprinted from | Jishi Platform
Source | Harbin Institute of Technology SCIR
Author | Alexander Rush
Below is a detailed blog post about the Transformer from Harvard University, translated by our laboratory.

The Transformer network structure proposed in the paper “Attention is All You Need” has recently attracted a lot of attention. The Transformer not only significantly improves translation quality but also provides a new architecture for many other NLP tasks. Although the original paper is written clearly, many people have reported difficulty implementing the model correctly.
Therefore, we have written an annotated version of the paper and provide a line-by-line implementation of the Transformer. This document removes some sections of the original paper, rearranges others, and adds annotations throughout. It is presented as a Jupyter notebook of directly executable code, with a total of 400 lines of library code, capable of processing 27,000 tokens per second on 4 GPUs.
To run this work, you first need to install PyTorch. The complete notebook file and dependencies can be found on GitHub or Google Colab.
It should be noted that this annotated document and its code are intended as an introductory tutorial for researchers and developers. The code provided is based mainly on the OpenNMT implementation; for other implementations of this model, see Tensor2Tensor (TensorFlow version) and Sockeye (MXNet version).
– Alexander Rush (@harvardnlp or [email protected])
0
Preparation Work
# Uncomment and run to install the dependencies (PyTorch 0.3, numpy, matplotlib, spacy, torchtext, seaborn):
# !pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib spacy torchtext seaborn

Table of Contents
Preparation Work
Background
Model Structure
– Encoder and Decoder
– Encoder
– Decoder
– Attention
– Application of Attention in the Model
– Position-wise Feedforward Network
– Embedding and Softmax
– Position Encoding
– Complete Model
(Due to the length of the original text, the remaining parts will be in the next article)
Training
– Batches and Masks
– Training Loop
– Training Data and Batching
– Hardware and Training Progress
– Optimizer
– Regularization
– Label Smoothing
First Example
– Data Generation
– Loss Calculation
– Greedy Decoding
Real Example
– Data Loading
– Iterators
– Multi-GPU Training
– Training System Add-ons: BPE, Search, Averaging
Results
– Attention Visualization
Conclusion
All annotations in this article are given as block quotes; the main text comes from the original paper.
1
Background
Reducing sequential computation is an important goal, and it is the motivation behind networks such as the Extended Neural GPU, ByteNet, and ConvS2S, all of which are CNN-based and compute hidden representations in parallel for all input and output positions.
In these models, the number of operations required to relate signals from two arbitrary input or output positions increases with the distance between positions; for example, ConvS2S increases linearly, while ByteNet increases logarithmically, making it more difficult to learn dependencies between two positions that are far apart. In contrast, in the Transformer, the number of operations is reduced to a constant level.
Self-attention, sometimes referred to as intra-attention, performs attention over different positions of a single sequence in order to compute a representation of that sequence. It has been applied successfully to many tasks, including reading comprehension, summarization, textual entailment, and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and there is evidence that they perform well on simple-language question answering and language modeling tasks.
To our knowledge, the Transformer is the first model that relies entirely on Self-attention without using sequence-aligned RNNs or convolutions to compute input-output representations.
2
Model Structure
Currently, most popular neural sequence transduction models have an Encoder-Decoder structure. The Encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn).
Given z, the Decoder generates one symbol at a time until the complete output sequence has been produced. At each decoding step the model is autoregressive, using the previously generated symbols as additional input when generating the next symbol.
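As a minimal PyTorch sketch of this generic Encoder-Decoder skeleton (the names EncoderDecoder and Generator and all signatures here are illustrative, not necessarily those of the notebook's code):

import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoder(nn.Module):
    """A standard Encoder-Decoder architecture: encode the source once,
    then decode conditioned on the encoded memory."""
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed   # source token ids -> embeddings (+ positions)
        self.tgt_embed = tgt_embed   # target token ids -> embeddings (+ positions)
        self.generator = generator   # final hidden states -> vocabulary distribution

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    def forward(self, src, tgt, src_mask, tgt_mask):
        # Teacher-forced training pass: decode the whole (shifted) target at once;
        # at inference time, decode() is called once per generated symbol.
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

class Generator(nn.Module):
    "Linear projection followed by log-softmax over the output vocabulary."
    def __init__(self, d_model, vocab):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)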

The overall structure of the Transformer is shown in the figure below: stacked Self-attention layers and point-wise, fully connected layers are used in both the Encoder and the Decoder, whose general structures are shown in the left and right halves of the figure, respectively.

3
Encoder and Decoder
Encoder
The Encoder consists of N=6 identical layers.

We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [12].



Each layer consists of two sub-layers. The first sub-layer implements the “multi-head” Self-attention, while the second sub-layer is a simple Position-wise fully connected feedforward network.
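Here is a sketch, under the same assumptions as the skeleton above, of the N=6 encoder stack with a residual connection and layer normalization around each of its two sub-layers (nn.LayerNorm is used instead of a hand-written layer norm).

import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical copies of a module."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class SublayerConnection(nn.Module):
    "Residual connection around a sub-layer, with layer normalization and dropout."
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Norm-first variant used by many implementations; the paper applies
        # the normalization after the residual sum instead.
        return x + self.dropout(sublayer(self.norm(x)))

class EncoderLayer(nn.Module):
    "One encoder layer: multi-head self-attention, then a position-wise feed-forward network."
    def __init__(self, size, self_attn, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda y: self.self_attn(y, y, y, mask))
        return self.sublayer[1](x, self.feed_forward)

class Encoder(nn.Module):
    "A stack of N identical encoder layers followed by a final layer normalization."
    def __init__(self, layer, N=6):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)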

Decoder
The Decoder also consists of N=6 identical layers.

In addition to the two sub-layers found in each encoder layer, each decoder layer inserts a third sub-layer, which performs “multi-head” Attention over the output of the encoder stack. As in the encoder, we employ a residual connection around each sub-layer, followed by layer normalization.
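A matching sketch of the decoder stack, reusing clones and SublayerConnection from the encoder sketch above; each layer has three sub-layers: masked self-attention, attention over the encoder output, and the position-wise feed-forward network.

import torch.nn as nn

class DecoderLayer(nn.Module):
    "One decoder layer: masked self-attention, encoder-decoder attention, feed-forward."
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super().__init__()
        self.size = size
        self.self_attn = self_attn        # masked self-attention over previous outputs
        self.src_attn = src_attn          # attention over the encoder stack's output
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda y: self.self_attn(y, y, y, tgt_mask))
        x = self.sublayer[1](x, lambda y: self.src_attn(y, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

class Decoder(nn.Module):
    "A stack of N identical decoder layers followed by a final layer normalization."
    def __init__(self, layer, N=6):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)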



4
Attention



The “multi-head” mechanism allows the model to jointly attend to information from different positions and from different representation subspaces; a single attention head, which averages over this information, generally cannot capture these different relationships.
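For reference, the paper defines scaled dot-product attention as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, and “multi-head” attention runs h such attentions in parallel over h learned projections of Q, K, and V. Below is a minimal sketch of both, reusing the clones helper from the encoder sketch above.

import math
import torch
import torch.nn as nn

def attention(query, key, value, mask=None, dropout=None):
    "Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)   # block illegal connections
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h                                  # dimension per head
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)    # W_Q, W_K, W_V, W_O
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)                             # same mask for every head
        nbatches = query.size(0)
        # 1) Project and split into h heads: (batch, h, seq_len, d_k).
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        # 2) Apply attention in every head in parallel.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # 3) Concatenate the heads and apply the final linear projection.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)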



5
Application of Attention in the Model
In the Transformer, “multi-head” Attention is used in three different ways:
1) In the “Encoder-Decoder Attention” layer, the Query comes from the previous decoder layer, while the Key and Value come from the encoder’s output. Each position in the Decoder attends to all positions in the input sequence, consistent with the classic Encoder-Decoder Attention mechanism in Seq2Seq models.
2) In the Self-attention layer of the Encoder. In the Self-attention layer, all Keys, Values, and Queries come from the same place, which is the output of the previous layer in the Encoder. Each position in the current layer of the Encoder can attend to all positions in the previous layer.
3) Similarly, the Self-attention layers in the decoder allow each position in the decoder to attend to the current decoding position and all earlier positions. Masking is required here to prevent leftward information flow in the decoder and preserve the autoregressive property. This is implemented inside the scaled dot-product Attention by masking out (setting to negative infinity) all illegal connections in the input to the Softmax; see the sketch below.
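As a concrete illustration of point 3), here is a small sketch of such a mask (the name subsequent_mask is illustrative): position i may attend only to positions j <= i, and the blocked entries are exactly the ones that attention() above fills with a large negative value before the Softmax.

import torch

def subsequent_mask(size):
    "Boolean mask where entry (i, j) is True iff position i may attend to position j."
    attn_shape = (1, size, size)
    return torch.tril(torch.ones(attn_shape)).bool()

# Example: for a length-3 target, position 0 sees only itself, position 2 sees all three.
print(subsequent_mask(3))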
6
Position-wise Feedforward Network
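The paper defines this sub-layer as FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position separately and identically, with d_model = 512 and inner dimension d_ff = 2048 in the base model. A minimal sketch:

import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    "FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))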

7
Embedding and Softmax
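In the model, learned embeddings convert the input and output tokens to d_model-dimensional vectors, and the embedding weights are multiplied by sqrt(d_model); a learned linear transformation followed by Softmax (the Generator sketched earlier) converts the decoder output into next-token probabilities, and the paper also shares the same weight matrix between the two embedding layers and the pre-Softmax linear transformation. A minimal sketch of the embedding side:

import math
import torch.nn as nn

class Embeddings(nn.Module):
    "Token embedding scaled by sqrt(d_model)."
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)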

8
Position Encoding



We also experimented with learned positional Embeddings instead, and found that the two versions produced essentially the same results. We chose the sinusoidal version because it may allow the model to handle sequences longer than the maximum sequence length seen in the training corpus.
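A minimal sketch of this sinusoidal encoding, where PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) are added to the (dropout-regularized) embeddings:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    "Add fixed sinusoidal position encodings to the token embeddings."
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model), not a parameter

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions.
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)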
9
Complete Model
The function that assembles the complete model and sets its hyperparameters is defined below.
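A sketch of such a helper, wiring together the components sketched in the previous sections (EncoderDecoder, Encoder/Decoder, MultiHeadedAttention, PositionwiseFeedForward, PositionalEncoding, Embeddings, Generator) with the paper's base hyperparameters N=6, d_model=512, d_ff=2048, h=8, dropout=0.1; the name make_model is illustrative.

import copy
import torch.nn as nn

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Build a full Transformer from the components sketched above."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    # Initialize all weight matrices with Glorot / Xavier uniform initialization.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

# Example: a small toy model over 10-symbol vocabularies with 2 layers.
# tmp_model = make_model(10, 10, N=2)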

Reference Links
[1] https://arxiv.org/abs/1706.03762
[2] https://pytorch.org/
[3] https://github.com/harvardnlp/annotated-transformer
[4] https://drive.google.com/file/d/1xQXSv6mtAOLXxEMi8RvaW8TW-7bvYBDF/view?usp=sharing
[5] http://opennmt.net
[6] https://github.com/tensorflow/tensor2tensor
[7] https://github.com/awslabs/sockeye
[8] https://twitter.com/harvardnlp
[9] https://arxiv.org/abs/1409.0473
[10] https://arxiv.org/abs/1308.0850
[11] https://arxiv.org/abs/1512.03385
[12] https://arxiv.org/abs/1607.06450
[13] https://arxiv.org/abs/1409.0473
[14] https://arxiv.org/abs/1703.03906
[15] https://arxiv.org/abs/1609.08144
[16] https://arxiv.org/abs/1608.05859
[17] https://arxiv.org/pdf/1705.03122.pdf