Understanding Transformer Architecture: A PyTorch Implementation

Author: Alexander Rush

Source: Harbin Institute of Technology SCIR, Editor: Jishi Platform

Below, we share a detailed blog post about Transformers from Harvard University, translated by our lab.

The Transformer architecture proposed in the paper “Attention is All You Need” has attracted a great deal of attention recently. The Transformer not only significantly improves translation quality, but also provides a new architecture for many other NLP tasks. Although the paper itself is clearly written, many people find the model difficult to implement correctly.
This document therefore presents an annotated version of the paper in the form of a line-by-line implementation. It reorders and removes some sections of the original paper and adds commentary throughout. The document itself is a runnable Jupyter notebook: in total, about 400 lines of library code capable of processing 27,000 tokens per second on 4 GPUs.
To run this work, you first need to install PyTorch. The complete notebook file and dependencies can be found on GitHub or Google Colab.
It should be noted that this annotated document and code are intended only as an introductory tutorial for researchers and developers. The code relies heavily on the OpenNMT implementation. For other implementations of this model, see Tensor2Tensor (TensorFlow) and Sockeye (MXNet).

0. Preparation

# !pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib spacy torchtext seaborn
Table of Contents
Preparation
Background
Model Structure
– Encoder and Decoder
– Encoder
– Decoder
– Attention
– Application of Attention in the Model
– Position-wise Feedforward Networks
– Embedding and Softmax
– Positional Encoding
– Complete Model
(Due to the length of the original text, the remaining parts will be in the next article)
Training
– Batches and Masks
– Training Loop
– Training Data and Batching
– Hardware and Training Progress
– Optimizer
– Regularization
– Label Smoothing
First Example
– Data Generation
– Loss Calculation
– Greedy Decoding
Real Example
– Data Loading
– Iterators
– Multi-GPU Training
– Additional Components for Training System: BPE, Search, Averaging
Results
– Attention Visualization
Conclusion
The annotated parts of this document are given in quotation form; the main content comes from the original text.

1. Background

Reducing the computational load of sequence processing tasks is an important issue and the motivation behind networks such as Extended Neural GPU, ByteNet, and ConvS2S. The networks mentioned above are CNN-based and compute hidden representations for all input and output positions in parallel.
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions: linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between positions that are far apart. In the Transformer, by contrast, the number of operations is reduced to a constant.
Self-attention, sometimes called intra-attention, relates different positions of a single sequence in order to compute a representation of that sequence. It has been applied successfully to many tasks, including reading comprehension, summarization, textual entailment, and task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and there is evidence that they perform well on simple-language question answering and language modeling tasks.
To our knowledge, the Transformer is the first model to compute input-output representations entirely based on self-attention without using sequence-aligned RNNs or convolutions.

2. Model Structure

Most competitive neural sequence transduction models have an Encoder-Decoder structure. The Encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn).
Given z, the Decoder then generates the output sequence one symbol at a time until the complete sequence has been produced. At each step the model is autoregressive: the symbols generated so far are consumed as additional input when generating the next symbol.
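The original notebook expresses this as a small PyTorch base class plus a generator module. Below is a minimal sketch in that spirit; the names EncoderDecoder and Generator follow the reference notebook, but the details should be read as illustrative rather than as the exact original code.

import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture, the base for this and
    many other sequence transduction models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed    # input embedding + positional encoding
        self.tgt_embed = tgt_embed    # output embedding + positional encoding
        self.generator = generator    # final linear + log-softmax over the vocabulary

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Encode the masked source sequence, then decode conditioned on it."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

class Generator(nn.Module):
    "The final linear projection followed by a log-softmax over the vocabulary."
    def __init__(self, d_model, vocab):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)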
The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers in both the Encoder and Decoder. The approximate structures of the Encoder and Decoder are shown in the left and right halves of the figure below, respectively.
[Figure: the Transformer model architecture, with the Encoder stack on the left and the Decoder stack on the right.]

2. Encoder and Decoder

Encoder
The Encoder consists of N=6 identical layers.
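The original notebook implements this with a small helper that clones a layer N times plus an Encoder module. A rough sketch (the clones helper follows the reference notebook; nn.LayerNorm is used here in place of the notebook's hand-written layer norm):

import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical copies of a module, each with its own parameters."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    "Core encoder: a stack of N identical layers followed by a final layer norm."
    def __init__(self, layer, N):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and its mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)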
We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [12].
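A sketch of the residual-plus-normalization wrapper, along the lines of the notebook's SublayerConnection; for code simplicity the norm is applied to the input of each sub-layer rather than to its output:

import torch.nn as nn

class SublayerConnection(nn.Module):
    """
    A residual connection around any sub-layer, combined with layer norm
    and dropout. The sub-layer is passed in as a callable.
    """
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply the residual connection to any sub-layer of matching size."
        return x + self.dropout(sublayer(self.norm(x)))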
Each layer consists of two sub-layers. The first sub-layer implements “multi-head” self-attention, while the second sub-layer is a simple position-wise fully connected feedforward network.
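A sketch of one encoder layer built from the pieces above; self_attn and feed_forward are assumed to be the multi-head attention and position-wise feedforward modules defined later in this post:

import torch.nn as nn

class EncoderLayer(nn.Module):
    "One encoder layer: self-attention, then a position-wise feedforward network."
    def __init__(self, size, self_attn, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "First sub-layer: self-attention; second sub-layer: feedforward."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)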
Decoder
The Decoder also consists of N=6 identical layers.
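A sketch of the decoder stack, mirroring the Encoder above but threading through the encoder output (memory) and the two masks:

import torch.nn as nn

class Decoder(nn.Module):
    "Generic N-layer decoder with masking."
    def __init__(self, layer, N):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)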
In addition to the two sub-layers found in each encoder layer, each decoder layer includes a third sub-layer that performs “multi-head” attention over the output of the encoder stack. As in the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
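A sketch of one decoder layer with its three sub-layers; the mask used for tgt_mask is shown later, in the section on how attention is applied in the model:

import torch.nn as nn

class DecoderLayer(nn.Module):
    "One decoder layer: masked self-attention, encoder-decoder attention, feedforward."
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super().__init__()
        self.size = size
        self.self_attn = self_attn      # attends over previously generated positions
        self.src_attn = src_attn        # attends over the encoder output ("memory")
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)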

3. Attention

An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. The Transformer uses “Scaled Dot-Product Attention”: the dot products of the query with all keys are divided by sqrt(d_k) and passed through a Softmax to obtain the weights on the values, i.e. Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
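A sketch of scaled dot-product attention as a plain function; following the reference notebook, the constant -1e9 stands in for negative infinity when masking:

import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    "Compute scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Illegal connections get a very large negative score, so they
        # receive essentially zero weight after the softmax.
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn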
The “multi-head” mechanism allows the model to jointly attend to information at different positions and to capture different kinds of relationships in different representation subspaces, which a single attention head generally cannot do.
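A sketch of multi-head attention built on the attention function and clones helper above; in the paper's base model there are h = 8 heads and d_k = d_model / h = 64:

import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    "Project Q, K, V into h heads, attend within each head, then recombine."
    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)  # W^Q, W^K, W^V, W^O
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)   # the same mask is applied to every head
        nbatches = query.size(0)

        # 1) Linear projections: d_model => h heads of dimension d_k.
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        # 2) Scaled dot-product attention in all heads in parallel.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # 3) Concatenate the heads and apply the final linear layer.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)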

4. Application of Attention in the Model

In the Transformer, “multi-head” attention is used in three different ways:
1) In the “Encoder-Decoder Attention” layer, the Query comes from the previous decoder layer, while the Key and Value come from the output of the Encoder. Each position in the Decoder attends to all positions in the input sequence, consistent with the classic Encoder-Decoder Attention mechanism in Seq2Seq models.
2) In the Self-attention layer of the Encoder. In the Self-attention layer, all Keys, Values, and Queries come from the same source, which is the output of the previous layer in the Encoder. Each position in the current layer of the Encoder can attend to all positions in the previous layer.
3) Similarly, the Self-attention layers in the decoder allow each position in the decoder to attend to the current decoding position and all previous positions. To preserve the autoregressive property, leftward information flow in the decoder must be prevented. This is implemented inside scaled dot-product attention by masking out (setting to negative infinity) all values in the input of the Softmax that correspond to illegal connections; see the masking sketch below.
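A sketch of the mask that hides subsequent positions; a tensor like this can be passed as tgt_mask to the decoder's self-attention:

import torch

def subsequent_mask(size):
    "Boolean mask where entry (i, j) is True iff position i may attend to position j."
    attn_shape = (1, size, size)
    mask = torch.triu(torch.ones(attn_shape, dtype=torch.uint8), diagonal=1)
    return mask == 0

# For a length-4 target, position 2 may attend to positions 0..2 only:
# subsequent_mask(4)[0] ->
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])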

5. Position-wise Feedforward Networks

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feedforward network that is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW1 + b1)W2 + b2. The input and output dimensionality is d_model = 512, and the inner layer has dimensionality d_ff = 2048.
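A sketch of this position-wise feedforward network as a module, with dropout between the two linear layers as in the reference notebook:

import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    "FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # paper: d_model = 512, d_ff = 2048
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))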

6. Embedding and Softmax

As in other sequence transduction models, learned embeddings convert the input and output tokens to vectors of dimension d_model, and the usual learned linear transformation plus Softmax converts the decoder output into predicted next-token probabilities. In the Transformer, the two embedding layers and the pre-Softmax linear transformation share the same weight matrix [16], and the embedding weights are multiplied by sqrt(d_model).
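A sketch of the embedding layer with the sqrt(d_model) scaling; the pre-Softmax linear transformation is the Generator shown earlier, and the weight sharing between the two is left out of this sketch for brevity:

import math
import torch.nn as nn

class Embeddings(nn.Module):
    "Token embeddings scaled by sqrt(d_model)."
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)   # lookup table
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)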

7. Positional Encoding

Since the model contains no recurrence and no convolution, positional encodings are added to the input embeddings at the bottom of the encoder and decoder stacks so that the model can make use of the order of the sequence. The positional encodings have the same dimension d_model as the embeddings, so the two can be summed. The Transformer uses sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position and i is the dimension. Dropout is also applied to the sums of the embeddings and the positional encodings.
We also experimented with learned positional embeddings and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than any encountered during training.
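A sketch of the sinusoidal positional encoding as a module that adds the encoding (plus dropout) to its input; following the reference notebook, the frequencies are computed in log space for numerical stability:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    "Add sinusoidal position information to the token embeddings."
    def __init__(self, d_model, dropout, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Precompute the encodings once for all positions up to max_len.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even indices: sine
        pe[:, 1::2] = torch.cos(position * div_term)   # odd indices: cosine
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)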

8. Complete Model

Below is the function that defines the complete model and sets the hyperparameters.
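A sketch of such a constructor, wiring together the components sketched above with the base-model hyperparameters from the paper (N=6 layers, d_model=512, d_ff=2048, h=8 heads, dropout=0.1); the name make_model and the Xavier initialization follow the reference notebook:

import copy
import torch.nn as nn

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Construct a full Transformer from the components sketched above."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    # Initialize parameters with Xavier/Glorot uniform initialization.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

# Example: a small two-layer model over a toy vocabulary of size 11.
tmp_model = make_model(11, 11, N=2)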

END. Reference Links

[1] https://arxiv.org/abs/1706.03762
[2] https://pytorch.org/
[3] https://github.com/harvardnlp/annotated-transformer
[4] https://drive.google.com/file/d/1xQXSv6mtAOLXxEMi8RvaW8TW-7bvYBDF/view?usp=sharing
[5] http://opennmt.net
[6] https://github.com/tensorflow/tensor2tensor
[7] https://github.com/awslabs/sockeye
[8] https://twitter.com/harvardnlp
[9] https://arxiv.org/abs/1409.0473
[10] https://arxiv.org/abs/1308.0850
[11] https://arxiv.org/abs/1512.03385
[12] https://arxiv.org/abs/1607.06450
[13] https://arxiv.org/abs/1409.0473
[14] https://arxiv.org/abs/1703.03906
[15] https://arxiv.org/abs/1609.08144
[16] https://arxiv.org/abs/1608.05859
[17] https://arxiv.org/abs/1705.03122
Original text: http://nlp.seas.harvard.edu/2018/04/03/attention.html
