Understanding the Mathematical Principles of Transformers

Author: Fareed Khan
Translator: Zhao Jiankai, Proofreader: Zhao Ruxuan
The transformer architecture may seem intimidating, and you may have seen various explanations on YouTube or blogs. However, in my blog, I will clarify its principles by providing a comprehensive mathematical example. By doing so, I hope to simplify the understanding of the transformer architecture.
Let’s get started!

Inputs and Positional Encoding

Let’s tackle the initial part where we will identify our inputs and calculate their positional encodings.

Step 1 (Defining the Data)

The first step is to define our dataset (corpus).
In our dataset, there are 3 sentences (dialogues) taken from the TV series ‘Game of Thrones’. Although this dataset seems small, it is sufficient to help us understand the subsequent mathematical formulas.

Step 2 (Finding the Vocab Size)

To determine the vocabulary size, we need to find the total number of unique words in the dataset. This is crucial for encoding (i.e., converting data into numbers).
We will break our dataset down into a list of tokens, represented as N, where each word counts as a single token.
After obtaining the token list (represented as N), we can apply the formula to calculate the vocabulary size.
The specific formula works as follows:
vocab_size = count(set(N))
Using set operations helps to eliminate duplicates, and then we can count the unique words to determine the vocabulary size. Therefore, the vocabulary size is 23, as there are 23 unique words in the given list.
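As a rough sketch of these two steps in Python (the tokenization details and the two placeholder dialogues are my assumptions, not taken from the article):

```python
# A minimal sketch of Steps 1-2: build the token list N and count unique words.
# Only the first dialogue is quoted in the article; the other two entries are
# placeholders, so the printed value will only be 23 for the real corpus.
corpus = [
    "When you play the game of thrones",
    "...",  # second dialogue of the dataset
    "...",  # third dialogue of the dataset
]

# N: the flat list of all tokens (lowercased and split on whitespace)
N = [word.lower() for sentence in corpus for word in sentence.split()]

# set() removes duplicate words; len() counts what remains
vocab_size = len(set(N))
print(vocab_size)
```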

Step 3 (Encoding and Embedding)

Next, we assign an integer as an index to each unique word in the dataset.
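A small sketch of this encoding step, continuing from the N list above (the sorted ordering of the vocabulary is an assumption; the article may assign indices differently):

```python
# Assign an integer index to every unique word in the dataset
vocab = {word: idx for idx, word in enumerate(sorted(set(N)))}

# Encode the chosen input sentence as a list of integer indices
sentence = "When you play the game of thrones"
encoded = [vocab[word.lower()] for word in sentence.split()]
print(encoded)  # seven integers, one per word
```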
After encoding our entire dataset, it is time to select our input. We will choose a sentence from the corpus to start:
“When you play the game of thrones”
Each word passed as input will be represented by its integer encoding, and each integer value will have an associated embedding vector linked to it.
  • These embeddings can be obtained with models such as Google's Word2Vec (vector representations of words). In our numerical example, we will assume that the embedding vector for each word is filled with random values between 0 and 1.

  • Additionally, the original paper uses an embedding vector of 512 dimensions, while we will consider a very small dimension of 5 for the numerical example.

Now, each word embedding is represented by a 5-dimensional embedding vector and is filled with random values using the Excel function RAND().
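A sketch of these random embeddings, with NumPy standing in for Excel's RAND() (the fixed seed is only for reproducibility):

```python
import numpy as np

d_model = 5                 # the paper uses 512; the example uses 5
rng = np.random.default_rng(0)

# One random 5-dimensional embedding per token of the input sentence,
# with values drawn uniformly from [0, 1), like Excel's RAND()
embeddings = rng.random((len(encoded), d_model))
print(embeddings.shape)     # (7, 5)
```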

Step 4 (Positional Embedding)

Let’s consider the first word, “when”, and calculate its positional embedding vector. The positional embedding has two formulas:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
The POS value for the first word “when” will be zero since it corresponds to the starting index of the sequence. Additionally, the value of i (depending on whether it is even or odd) determines the formula used to calculate the PE value. The dimension value represents the dimension of the embedding vector, which in our case is 5.
Continuing to calculate the positional embedding, we will assign the pos value of 1 for the next word “you” and continue to increment the pos value for each subsequent word in the sequence.
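A sketch of the positional-embedding computation using the two formulas above (the helper function name is my own):

```python
def positional_embedding(seq_len, d_model):
    # pos runs over word positions (0 for "when", 1 for "you", ...);
    # even dimensions use sin, odd dimensions use cos
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            pe[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

pe = positional_embedding(len(encoded), d_model)
print(pe[0])  # the vector for "when": [0, 1, 0, 1, 0] because pos = 0
```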
After finding the positional embedding, we can link it with the original word embedding.
The resulting vectors are e1+p1, e2+p2, e3+p3, and so forth: each word embedding summed element-wise with its positional embedding.
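In code, this is just an element-wise sum of the two matrices from the sketches above:

```python
# e1+p1, e2+p2, ...: word embeddings plus their positional embeddings
encoder_input = embeddings + pe
print(encoder_input.shape)  # (7, 5), the matrix that feeds the encoder
```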
The output of the initial part of the transformer architecture will later serve as the input to the encoder.

Encoder

In the encoder, we perform complex operations involving query, key, and value matrices. These operations are crucial for transforming the input data and extracting meaningful representations.
Within the multi-head attention mechanism, a single attention layer consists of several key components. These components include:
Note that the yellow box represents the single-head attention mechanism. What makes it a multi-head attention mechanism is the stacking of multiple yellow boxes. For the sake of the example, we will only consider a single-head attention mechanism as shown in the figure above.

Step 1 (Performing Single Head Attention)

The attention layer has three inputs:
  • Query

  • Key

  • Value

In the figure above, the three input matrices (pink matrices) represent the transposed outputs obtained from the previous step of adding positional embeddings to the word embedding matrix. The linear weight matrices (yellow, blue, and red) represent the weights used in the attention mechanism. The columns of these matrices can have any number of dimensions, but the number of rows must match the number of columns in the input matrices used for multiplication. In our example, we will assume that the linear matrices (yellow, blue, and red) contain random weights. These weights are typically initialized randomly and then adjusted during training through techniques like backpropagation and gradient descent. So let's compute the Query, Key, and Value matrices:
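A sketch of the three projections, continuing from the matrices above. The input is kept with one row per token, so no explicit transpose is needed here, and d_k = 3 is an assumption chosen to match the 3-column head output mentioned later:

```python
d_k = 3  # assumed width of the query/key/value projections

# Randomly initialised linear weights, standing in for the yellow, blue,
# and red matrices; in a real model they are learned by backpropagation
W_q = rng.random((d_model, d_k))
W_k = rng.random((d_model, d_k))
W_v = rng.random((d_model, d_k))

Q = encoder_input @ W_q  # (7, 3) query matrix
K = encoder_input @ W_k  # (7, 3) key matrix
V = encoder_input @ W_v  # (7, 3) value matrix
```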
Once we have the query, key, and value matrices, we proceed with additional matrix multiplications: the query matrix is multiplied by the transpose of the key matrix, the result is scaled by the square root of the key dimension, and a softmax is applied to each row to obtain the attention weights.
Now, we will multiply the resulting matrix by the value matrix we computed earlier:
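Putting the whole single-head computation together as a sketch, assuming the standard scaled dot-product form (scores = QKᵀ / √d_k, row-wise softmax, then multiplication by V); the softmax helper is my own:

```python
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T / np.sqrt(d_k)     # (7, 7) raw attention scores
weights = softmax(scores, axis=-1)  # each row now sums to 1
head_output = weights @ V           # (7, 3) output of one attention head
```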
If we have multiple attention heads, each head will produce a matrix of dimension (6×3), and the next step is to concatenate these matrices together.
In the next step, we will again perform a linear transformation similar to the one used to obtain the query, key, and value matrices. This linear transformation is applied to the concatenated matrix obtained from multiple heads of attention.
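A sketch of the concatenation and the final linear layer; the head count of 2, and reusing the same head output as a stand-in for every head, are simplifications for illustration only:

```python
num_heads = 2                                    # assumed number of heads
heads = [head_output for _ in range(num_heads)]  # stand-ins for per-head outputs

# Concatenate the heads column-wise, then project back to d_model columns
concat = np.concatenate(heads, axis=-1)          # (7, num_heads * 3)
W_o = rng.random((concat.shape[-1], d_model))
multi_head_output = concat @ W_o                 # (7, 5)
print(multi_head_output.shape)
```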

