Source | AI Technology Review
Translated by | bluemin
Proofread by | Chen Caixian
The Transformer architecture has become a hot research topic in machine learning (especially in NLP) and has brought us many important achievements, such as: writing bots like GPT-2 and GPT-3; the first-generation GPT and its more capable successor, BERT, which achieved the most accurate results on many language-understanding tasks with unprecedented data efficiency and almost no fine-tuning (work that used to take a month can now be done in about 30 minutes, with better results); AlphaStar; and more.
It is clear that the power of the Transformer is truly exceptional!
In 2017, the Google team first proposed the Transformer model, summarizing it in one sentence: “Attention Is All You Need.” However, that sentence alone does not give an intuitive understanding of what the Transformer is. The author therefore hopes to explain the Transformer model in plain terms from the perspective of its historical development.
Classic Fully Connected Neural Networks
In classic fully connected neural networks, each distinct input variable is a unique snowflake. When a fully connected network learns to recognize a particular variable or combination of variables, it does not automatically generalize to other variables or combinations.
When you perform regression analysis in a social-science or medical research project, the inputs might be demographic variables (like “age” or “weekly alcohol consumption”), and the principle above works fine. However, when the input variables have a known structure, such as a spatial or temporal layout, fully connected networks perform poorly because they ignore that structure.
If the input is the pixels in an image, then a fully connected network cannot learn patterns like “the left pixels are brighter than the right pixels,” but must learn separately “(0, 0) is brighter than (1, 0),” “(1, 0) is brighter than (2, 0),” and so on.
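To make this concrete, here is a minimal sketch of the problem. The article names no library; PyTorch and the shapes used here are my own illustrative assumptions. The point is simply that a fully connected layer assigns each pixel position its own weights, so nothing learned about one pair of positions transfers to any other pair.

```python
import torch
import torch.nn as nn

# A 28x28 grayscale image, flattened into 784 "independent" input variables.
image = torch.rand(1, 28 * 28)

# A fully connected layer: every one of the 784 pixel positions gets its own
# weight for every output unit; nothing ties position (0, 0) to position (1, 0).
fc = nn.Linear(28 * 28, 10)
print(fc.weight.shape)   # torch.Size([10, 784]) -- one weight per (output, pixel) pair

# A "left pixel brighter than right pixel" detector learned for one pair of
# pixel columns is not reused for any other pair of columns.
out = fc(image)
```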
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) understand the spatial layout of the input and process the input in relative terms: CNNs do not learn “the pixel at position (572, 35),” but rather learn “the pixel in the center I am looking at,” “the pixel to the left,” and so forth. Then, they slide while “looking” at different parts of the image, searching for the same pattern relative to the center in each region.
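A hedged sketch of the same idea with a convolution (again assuming PyTorch; the hand-set kernel is purely illustrative): a single 3x3 kernel is slid over the whole image, so a pattern like “brighter on the left than on the right” is learned once and applied at every position.

```python
import torch
import torch.nn as nn

image = torch.rand(1, 1, 28, 28)           # batch, channels, height, width

# A single 3x3 kernel, reused at every spatial position (weight sharing).
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

# Hand-set the kernel to a crude "brighter on the left than on the right" detector.
with torch.no_grad():
    conv.weight.copy_(torch.tensor([[[[1., 0., -1.],
                                      [1., 0., -1.],
                                      [1., 0., -1.]]]]))

response = conv(image)                      # the same 9 weights applied at every location
print(conv.weight.numel(), response.shape)  # 9, torch.Size([1, 1, 28, 28])
```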
CNNs differ from fully connected networks in two ways: weight sharing and locality.
Weight sharing is crucial for any spatial or temporal structured input (including text).
However, the locality of CNNs is not suitable for processing text.
I think of it this way: each salient object in an image (a dog, a dog’s nose, an edge, a small patch of color) can be understood on its own, without looking at anything outside that object. Images usually contain nothing like a pronoun, no reference structure that forces you to look at something else in order to grasp what a thing is.
Except in some bizarre scenarios, it is generally not the case that “oh, I see a dog now, but I have to look at things outside the dog to confirm that it is a dog.” So you can start from small details and build upward: “Ah, this is an edge –> Ah, that is a rectangular object made of edges –> Ah, that is the dog’s nose –> Ah, that is the dog’s head –> Ah, that is a dog.” Each part of the object is defined by the smaller features it contains.
But this approach does not work for text. For example, a pronoun may appear at the beginning of a sentence while its antecedent appears at the end. We cannot cleanly decompose a sentence into independently understandable clauses, understand each one, and then link them back together without changing the meaning. So the locality principle of CNNs is not well suited to text. That said, many people have used CNNs for text processing; CNNs can solve many text problems, but they shine more in other domains.
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) slide along the input in sequence, performing roughly the same calculation steps at each position (with weight sharing).
However, an RNN does not look at the current position plus a small local window around it; instead, it looks at the following:
- The current position
- The output it produced after looking at the previous position
When the input is in text format, it feels like “reading”: RNNs process the first word, summarizing all the information collected at that time; then they process the second word based on the summarized information, updating the summary; then they process the third word based on the new summary, updating the information again, and so on.
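A minimal sketch of that “reading” loop (the name `summary` and the use of PyTorch’s `nn.RNNCell` are my own illustration, not something the article specifies): one shared cell is applied word by word, each step mixing the current word into the running summary.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64
cell = nn.RNNCell(embed_dim, hidden_dim)   # the same weights are used at every position

sentence = torch.rand(5, embed_dim)        # 5 word embeddings
summary = torch.zeros(1, hidden_dim)       # empty summary before reading anything

for word in sentence:                      # "read" the sentence left to right
    # Update the summary from the current word and the previous summary.
    summary = cell(word.unsqueeze(0), summary)

# After the loop, `summary` is the fixed-size vector that carries
# everything the network remembers about the sentence.
```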
People usually use RNN architectures that can learn when to forget information (removing information from the summary) and when to pass information along (LSTMs or GRUs). For example, people will specifically remember “I still haven’t figured out what ‘this’ refers to,” and then pass that information as widely as possible in search of a suitable antecedent.
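The “forgetting” mechanism can be sketched as follows. This is a simplified illustration of just the forget gate of an LSTM-style cell, with made-up numbers, not a full LSTM: a sigmoid gate, computed in a real cell from the current word and the previous summary, scales each entry of the old memory toward zero before new information is added.

```python
import torch

memory = torch.tensor([0.9, -0.5, 0.3, 0.7])    # previous cell memory

# In a real LSTM the pre-activation is a learned function of the current word
# and the previous summary; here it is an arbitrary vector for illustration.
forget_preact = torch.tensor([3.0, -3.0, 0.0, 3.0])
forget_gate = torch.sigmoid(forget_preact)       # values near 1 keep, values near 0 erase

kept_memory = forget_gate * memory
print(kept_memory)  # the second entry is almost wiped out, the others are mostly kept
```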
Although what RNNs do somewhat resembles sequential reading, this approach comes with some tricky problems.
RNNs can only “read” in one direction at a time, which creates an asymmetry problem: near the beginning of a sentence, the output can only use information from a few words; near the end of a sentence, the output can use information from all words. (This is in contrast to CNNs, which process information at each position in the same way.)
In this case, if the words at the beginning of a sentence can only be understood based on words that appear later, problems arise. RNNs can understand later words based on earlier words (which is the core idea of RNNs), but cannot understand earlier words based on later words.
This problem can be partially mitigated in two ways: one is to stack multiple RNN layers, where the later layers act like additional “read-throughs” of the text; the other is to use two RNNs reading in opposite directions (the basic idea behind “BiLSTMs”), as sketched below.
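Both workarounds are easy to see in code. The sketch below uses PyTorch’s `nn.LSTM` as an assumed stand-in (the article does not name a library): two stacked layers give extra read-throughs, and the bidirectional flag reads the sentence in both directions, so each position’s output combines a left-to-right summary with a right-to-left one.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 32, 64
sentence = torch.rand(5, 1, embed_dim)       # (sequence length, batch, embedding)

# Two stacked layers ("extra read-throughs") and two reading directions (a BiLSTM).
bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, bidirectional=True)

outputs, _ = bilstm(sentence)
# Each position now has a forward summary and a backward summary concatenated.
print(outputs.shape)                         # torch.Size([5, 1, 128])
```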
But beyond that, the structure of RNNs still faces a fundamental challenge: RNNs can only use a limited length of