BERT and GPT Outperform Transformers Without Attention or MLPs

Reported by Machine Heart

Editors: Du Wei, Ze Nan

This article explores the Monarch Mixer (M2), a new architecture that is sub-quadratic in both sequence length and model dimension, demonstrating high hardware efficiency on modern accelerators.

From language models like BERT, GPT, and Flan-T5 to image models like SAM and Stable Diffusion, Transformers are sweeping the world with unstoppable momentum. However, one cannot help but ask: Is the Transformer the only choice?

A research team from Stanford University and the State University of New York at Buffalo has not only provided a negative answer to this question but also proposed a new alternative technology: the Monarch Mixer. Recently, the team published a related paper along with some checkpoint models and training code on arXiv. It is worth mentioning that the paper has been selected for an Oral Presentation at NeurIPS 2023.


Paper link: https://arxiv.org/abs/2310.12109

Code link: https://github.com/HazyResearch/m2

This method eliminates the costly attention and MLP in Transformers, replacing them with expressive Monarch matrices, achieving superior performance at a lower cost in both language and image experiments.

This is not the first time Stanford has proposed an alternative technology to Transformers. In June of this year, another team from the university proposed a technique called Backpack. Refer to the Machine Heart article “Stanford Trains Transformer Alternative Model: 170 Million Parameters, Bias Removal, and Strong Controllable Interpretability”. Of course, to achieve true success, these technologies still need further validation from the research community and to be transformed into practical products in the hands of application developers.

Now, let’s take a look at the introduction to Monarch Mixer in this paper and some experimental results.

Paper Introduction

In the fields of natural language processing and computer vision, machine learning models can now handle longer sequences and higher-dimensional representations, thereby supporting longer contexts and higher quality. However, the time and space complexity of existing architectures grows quadratically with sequence length and/or model dimension, limiting context length and increasing scaling costs. For example, the attention and MLP in Transformers grow quadratically with sequence length and model dimension.

To address this issue, this research team from Stanford University and the State University of New York at Buffalo claims to have found a high-performance architecture whose complexity grows sub-quadratically with sequence length and model dimension.

Their research inspiration comes from MLP-Mixer and ConvMixer: these two works observed that many machine learning models operate by mixing information along the sequence and model-dimension axes, often using a single operator for both axes.

Finding mixing operators that are expressive, sub-quadratic, and hardware-efficient is challenging. For instance, the MLPs in MLP-Mixer and the convolutions in ConvMixer are expressive, but both scale quadratically with the input dimension. Several recent studies have proposed sub-quadratic sequence mixers based on long convolutions or state-space models, but these models have low FLOP utilization and still scale quadratically in the model dimension. Meanwhile, there has been promising progress on sparsifying dense MLP layers without sacrificing quality, but due to low hardware utilization, some of these models can actually be slower than their dense counterparts.

Based on these inspirations, this research team proposed the Monarch Mixer (M2), which utilizes a class of expressive sub-quadratic structured matrices: Monarch matrices.

Monarch matrices are a class of structured matrices that generalize the Fast Fourier Transform (FFT) and have been shown to capture a wide range of linear transforms, including Hadamard transforms, Toeplitz matrices, AFDF matrices, and convolutions. They are parameterized as products of block-diagonal matrices, called Monarch factors, interleaved with permutations.

Their computation is sub-quadratic: with the number of factors set to p and an input length of N, the computational cost is O(p N^((p+1)/p)), which interpolates between O(N log N) when p = log N and O(N^(3/2)) when p = 2.
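To make that scaling concrete, here is a back-of-the-envelope sketch (ours, for illustration only): an order-p Monarch matrix applies p block-diagonal factors whose blocks have size about N^(1/p), and each such factor costs about N * N^(1/p) multiply-adds.

```python
import math

def monarch_matmul_flops(N: int, p: int) -> float:
    """Rough multiply-add count for applying an order-p Monarch matrix of size N x N:
    p block-diagonal factors, each with blocks of size ~N**(1/p), so each factor
    costs ~N * N**(1/p) multiply-adds (the interleaved permutations are free)."""
    block_size = N ** (1 / p)
    return p * N * block_size  # = p * N**((p + 1) / p)

N = 64 * 1024
print(monarch_matmul_flops(N, 2))                    # p = 2: ~N**1.5
print(monarch_matmul_flops(N, round(math.log2(N))))  # p = log N: ~N log N (blocks of size ~2)
print(float(N) * N)                                  # dense matrix-vector product, for comparison
```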

M2 uses Monarch matrices to mix information along both the sequence axis and the model-dimension axis. The method is not only easy to implement but also highly hardware-efficient: the block-diagonal Monarch factors can be computed with GEMM (general matrix multiplication) kernels, which modern accelerators execute very efficiently.


As a proof of concept, the research team implemented an M2 layer entirely in PyTorch in fewer than 40 lines of code (including import statements), relying only on matrix multiplication, transposition, reshaping, and elementwise products (see the pseudocode in the middle of Figure 1). For an input size of 64k, this code achieved 25.6% FLOP utilization on an A100 GPU. On newer architectures such as the RTX 4090, a simple CUDA implementation reached 41.4% FLOP utilization for the same input size.
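For readers who want a feel for what such a layer looks like, the sketch below mirrors the structure of that pseudocode: a Monarch matrix built from two block-diagonal factors interleaved with transpose permutations, and an M2 layer that mixes along the sequence axis and then the model-dimension axis. It is our simplified reconstruction for illustration (the class names, initialization, and the einops dependency are ours); the released repository is the authoritative implementation.

```python
import torch
from torch import nn
from einops import rearrange

def blockdiag_matmul(x, w):
    # w: (blocks, b, b) block-diagonal factor; x: (..., blocks * b)
    return torch.einsum(
        "bnm,...bm->...bn", w, x.view(*x.shape[:-1], w.shape[0], w.shape[-1])
    ).reshape(*x.shape)

class MonarchMatrix(nn.Module):
    """A square Monarch matrix of size sqrt_n**2: permute, block-diagonal L,
    permute, block-diagonal R, permute."""
    def __init__(self, sqrt_n: int):
        super().__init__()
        self.sqrt_n = sqrt_n
        self.L = nn.Parameter(torch.randn(sqrt_n, sqrt_n, sqrt_n) / sqrt_n)
        self.R = nn.Parameter(torch.randn(sqrt_n, sqrt_n, sqrt_n) / sqrt_n)

    def forward(self, x):
        x = rearrange(x, "... (m n) -> ... (n m)", n=self.sqrt_n)
        x = blockdiag_matmul(x, self.L)
        x = rearrange(x, "... (m n) -> ... (n m)", n=self.sqrt_n)
        x = blockdiag_matmul(x, self.R)
        return rearrange(x, "... (m n) -> ... (n m)", n=self.sqrt_n)

class M2Layer(nn.Module):
    """Mix along the sequence axis with (m1, m2), then along the model
    dimension with (m3, m4); elementwise kernels provide gating."""
    def __init__(self, sqrt_n: int, sqrt_d: int):
        super().__init__()
        self.m1, self.m2 = MonarchMatrix(sqrt_n), MonarchMatrix(sqrt_n)
        self.m3, self.m4 = MonarchMatrix(sqrt_d), MonarchMatrix(sqrt_d)
        self.n_kernel = nn.Parameter(torch.randn(sqrt_d ** 2, sqrt_n ** 2))
        self.d_kernel = nn.Parameter(torch.randn(1, sqrt_d ** 2))
        self.norm = nn.LayerNorm(sqrt_d ** 2)

    def forward(self, x):  # x: (batch, n, d) with n = sqrt_n**2, d = sqrt_d**2
        x_seq = self.m2(
            torch.relu(self.n_kernel * self.m1(x.transpose(-1, -2)))
        ).transpose(-1, -2)                                       # sequence mixing
        y = self.m4(torch.relu(self.d_kernel * self.m3(x_seq)))   # dimension mixing
        return self.norm(y + x_seq)                               # residual + layer norm
```

Because each block-diagonal multiply is just a batched matrix multiplication, the whole layer maps directly onto GEMM kernels, which is where the hardware efficiency comes from.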


For more mathematical descriptions and theoretical analyses of the Monarch Mixer, please refer to the original paper.

Experiments

The research team compared the Monarch Mixer and Transformers on three tasks where Transformers have dominated: BERT-style non-causal masked language modeling, ViT-style image classification, and GPT-style causal language modeling.

On each task, the new method matched the performance of Transformers without using attention or MLPs. In the BERT setting, the team also measured the speedup of the new method over strong Transformer baselines.

Non-Causal Language Modeling

For the non-causal language modeling task, the team built an M2-based architecture, M2-BERT, which serves as a drop-in replacement for BERT-style language models; BERT itself is a major application of the Transformer architecture. M2-BERT was trained with masked language modeling on C4, using the bert-base-uncased tokenizer.

M2-BERT is based on the Transformer backbone, but the attention layers and MLP are replaced with M2 layers, as shown in Figure 3.


In the sequence mixer, attention is replaced with bidirectional gated convolutions with a residual connection (see the left side of Figure 3). To recover convolutions, the team set the Monarch matrices to the DFT and inverse DFT matrices. They also added depthwise convolutions after the projection step.
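As a rough illustration of that design choice (not code from the paper's repository), the snippet below shows the FFT-based long convolution that falls out when the two sequence-mixing Monarch matrices are fixed to the DFT and inverse DFT; the gating mentioned above would wrap this convolution with learned elementwise multiplications.

```python
import torch

def long_conv_via_fft(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Convolve sequences u (batch, d, seqlen) with per-channel filters k (d, seqlen)
    by pointwise multiplication in frequency space; the DFT and inverse DFT here
    play the role of the two sequence-mixing Monarch matrices."""
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen  # zero-pad so the convolution is linear, not circular
    u_f = torch.fft.rfft(u.float(), n=fft_size)
    k_f = torch.fft.rfft(k.float(), n=fft_size)
    return torch.fft.irfft(u_f * k_f, n=fft_size)[..., :seqlen].type_as(u)

# In the gated sequence mixer, this convolution sits between learned elementwise
# gates, roughly: y = v * long_conv_via_fft(q * u, k)
```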

In the dimension mixer, the two dense matrices of the MLP are replaced with learned block-diagonal matrices (order-1 Monarch matrices, b = 4).
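To illustrate the savings, here is a hypothetical block-diagonal stand-in for a square dense linear layer (kept square for simplicity; in the actual MLP the hidden width also changes): with b blocks, the parameter and FLOP count drops from d^2 to d^2 / b.

```python
import torch
from torch import nn

class BlockDiagLinear(nn.Module):
    """Learned block-diagonal (order-1 Monarch) replacement for a dense
    nn.Linear(dim, dim): parameters drop from dim**2 to dim**2 / b."""
    def __init__(self, dim: int, b: int = 4):
        super().__init__()
        assert dim % b == 0
        self.b, self.block = b, dim // b
        self.weight = nn.Parameter(
            torch.randn(b, self.block, self.block) / self.block ** 0.5
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., dim)
        xb = x.view(*x.shape[:-1], self.b, self.block)    # split features into b blocks
        yb = torch.einsum("bij,...bj->...bi", self.weight, xb)
        return yb.reshape(*x.shape)
```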

The researchers pre-trained four M2-BERT models: two of them are M2-BERT-base models with sizes of 80M and 110M, and the other two are M2-BERT-large models with sizes of 260M and 341M, corresponding to BERT-base and BERT-large.

Table 3 shows the performance of the model corresponding to BERT-base, and Table 4 shows the performance of the model corresponding to BERT-large.


From the tables, we can see that on the GLUE benchmark, the performance of M2-BERT-base is comparable to BERT-base while having 27% fewer parameters; when both have the same number of parameters, M2-BERT-base outperforms BERT-base by 1.3 points. Similarly, M2-BERT-large, which has 24% fewer parameters, performs comparably to BERT-large, while when the number of parameters is the same, M2-BERT-large has an advantage of 0.7 points.

Table 5 shows the forward throughput of the model corresponding to BERT-base. It reports the number of tokens processed per millisecond on an A100-40GB GPU, reflecting inference time.


It can be seen that the throughput of M2-BERT-base exceeds even that of highly optimized BERT models; at a sequence length of 4k, it reaches 9.1 times the throughput of the standard HuggingFace implementation.

Table 6 reports the CPU inference times for M2-BERT-base (80M) and BERT-base—results obtained directly from running the PyTorch implementations of these two models.


At short sequence lengths, data-locality effects still dominate the FLOP reduction, and operations such as filter generation (which BERT does not need) carry a relatively higher cost. Once the sequence length exceeds 1K, however, the speed advantage of M2-BERT-base emerges, reaching a 6.5x speedup at a sequence length of 8K.

Image Classification

To verify that the new method's advantages in non-causal modeling carry over from language to images, the team also evaluated M2 on image classification.

Table 7 shows the performance of Monarch Mixer, ViT-b, HyenaViT-b, and ViT-b-Monarch (which replaces the MLP blocks of the standard ViT-b with Monarch matrices) on ImageNet-1k.


The advantages of Monarch Mixer are very obvious: it outperforms the original ViT-b model with only half the number of parameters. Even more surprisingly, the parameter-efficient Monarch Mixer significantly outperforms ResNet-152, which is specifically designed for the ImageNet task.

Causal Language Modeling

GPT-style causal language modeling is a major application of Transformers. The team constructed an M2-based architecture for causal language modeling: M2-GPT.

For the sequence mixer, M2-GPT combines the convolutional filters of Hyena, the current state-of-the-art attention-free language model, with cross-head parameter sharing from H3. The team replaced the FFTs in these architectures with a causal Monarch parameterization and removed the MLP layers entirely. The resulting architecture contains no attention and no MLPs at all.

They pre-trained M2-GPT on the Pile, a standard dataset for causal language modeling. The results are shown in Table 8.


It can be seen that, despite having no attention and no MLPs at all, the model based on the new architecture still outperforms Transformers and Hyena on pre-training perplexity. These results indicate that models that differ significantly from Transformers can also achieve excellent performance in causal language modeling.

For more content, please refer to the original paper.


© THE END

For reprints, please contact this public account for authorization.

Submissions or inquiries for coverage: [email protected]
