BERT and GPT Outperform Transformers Without Attention or MLPs

Machine Heart report. Editors: Du Wei, Ze Nan

This article explores Monarch Mixer (M2), a new architecture that is sub-quadratic in both sequence length and model dimension and achieves high hardware efficiency on modern accelerators. From language models like BERT, GPT, and Flan-T5 to image models like SAM and Stable Diffusion, Transformers are sweeping the …
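To make the "sub-quadratic" claim concrete, the sketch below shows the core idea behind a Monarch matrix multiply: instead of mixing a length-n vector with a dense n × n matrix (O(n²) work), the mixing matrix is factored into two block-diagonal matrices interleaved with a fixed stride permutation, which costs roughly O(n^1.5). This is a minimal illustration in NumPy, not the authors' implementation; the function name `monarch_multiply`, the block shapes, and the exact permutation/ordering of the factors are assumptions for the sake of the example.

```python
import numpy as np

def monarch_multiply(x, L_blocks, R_blocks):
    """Apply a Monarch-style structured multiply to x of length n = b*b.

    L_blocks, R_blocks: arrays of shape (b, b, b), i.e. b dense (b x b)
    blocks each, acting as block-diagonal matrices. A reshape/transpose
    plays the role of the fixed stride permutation between them.
    Total cost is ~2 * b^3 = 2 * n^1.5 multiply-adds, versus n^2 for a
    dense mixing matrix. (Illustrative sketch; details differ in M2.)
    """
    b = L_blocks.shape[0]
    n = b * b
    assert x.shape == (n,)

    # Left block-diagonal multiply: view x as b chunks of size b,
    # each chunk multiplied by its own (b x b) block.
    x = x.reshape(b, b)
    x = np.einsum('bij,bj->bi', L_blocks, x)

    # Fixed permutation: transpose the (b, b) grid (stride permutation).
    x = x.T

    # Right block-diagonal multiply on the permuted chunks.
    x = np.einsum('bij,bj->bi', R_blocks, x)

    # Undo the permutation and flatten back to length n.
    return x.T.reshape(n)

# Usage: mix a length-16 vector with 4 blocks of size 4.
b = 4
rng = np.random.default_rng(0)
x = rng.standard_normal(b * b)
L = rng.standard_normal((b, b, b))
R = rng.standard_normal((b, b, b))
y = monarch_multiply(x, L, R)
print(y.shape)  # (16,)
```

In M2, structured multiplies of this kind are applied along both the sequence axis (replacing attention) and the model axis (replacing the MLP), which is why the architecture is sub-quadratic in both dimensions.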