Reprinted from: Machine Heart | Edited by: Egg Sauce, Xiao Zhou
SVM is all you need, support vector machines never go out of style.
The claim that the Transformer is a support vector machine (SVM) has sparked a new theoretical discussion in academia.
Last weekend, a paper from the University of Pennsylvania and the University of California, Riverside set out to study the principles of the Transformer, the foundational architecture of large models, and established a formal equivalence between the optimization geometry of the attention layer and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens.
On Hacker News, the authors stated that the theory casts attention as an SVM that separates the "good" tokens from the "bad" tokens within each input sequence. This SVM acts as an efficient token selector and is fundamentally different from traditional SVMs, which assign 0-1 labels to inputs.
This theory also explains how attention causes sparsity through softmax: the “bad” tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while the “good” tokens are those that ultimately have a non-zero softmax probability. It is worth mentioning that this SVM originates from the exponential nature of softmax.
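To see the mechanism numerically, here is a minimal sketch (made-up scores, not taken from the paper): as the attention logits are scaled up, softmax drives the probability of lower-scoring tokens toward zero and leaves non-zero mass only on the selected token.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy attention logits for 4 tokens in one sequence; token 0 scores highest
# under the learned direction, the others fall on the "wrong" side of the margin.
scores = np.array([2.0, 1.2, 0.5, -0.3])

# Scaling the logits mimics the attention weights growing in norm during training.
for c in [1, 5, 20]:
    print(f"scale={c:>2}  softmax probabilities = {np.round(softmax(c * scores), 4)}")

# The "bad" tokens are driven to (numerically) zero probability while the
# "good" token keeps essentially all the mass: the sparsity described above.
```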
After the paper was uploaded to arXiv, comments poured in, with some remarking that the direction of AI research really is a spiral: are we about to circle back again?
Having come full circle, support vector machines are still not out of date.
Since the classic paper "Attention is All You Need" was published, the Transformer architecture has brought revolutionary progress to the field of natural language processing (NLP). The attention layer in the Transformer accepts a series of input tokens X and evaluates the relevance between tokens through softmax(XQK^⊤X^⊤), where (K, Q) are the trainable key-query parameters, ultimately capturing long-range dependencies effectively.
Now, a new paper titled "Transformers as Support Vector Machines" establishes a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer products of token pairs.
Paper link: https://arxiv.org/pdf/2308.16898.pdf
This formal equivalence is based on the paper “Max-Margin Token Selection in Attention Mechanism” by Davoud Ataee Tarzanagh et al., which describes the implicit bias of a one-layer transformer optimized through gradient descent:
(1) The optimization of the attention layer parameterized by (K, Q) converges, under vanishing regularization, to an SVM solution that minimizes the nuclear norm of the combined parameter W := KQ^⊤. In contrast, direct parameterization through W minimizes a Frobenius-norm SVM objective. The paper describes this convergence and emphasizes that it can occur toward a locally optimal direction rather than a globally optimal one.
(2) The paper also proves the local/global direction convergence of gradient descent for W parameterization under appropriate geometric conditions. Importantly, over-parameterization facilitates global convergence by ensuring the feasibility of the SVM problem and guaranteeing a benign optimization environment without stationary points.
(3) Although the theory primarily applies to linear prediction heads, the research team proposes a more general SVM equivalent that can predict the implicit bias of a one-layer transformer with a non-linear head/MLP.
Overall, the results of this research apply to general datasets, can be extended to cross-attention layers, and the practical validity of the research conclusions has been verified through thorough numerical experiments. This research establishes a new research perspective, viewing multi-layer transformers as a hierarchical SVM for separating and selecting the best tokens.
Specifically, given an input sequence X ∈ R^{T×d} with length T and embedding dimension d, the research analyzes the core cross-attention and self-attention models:

f_cross(X, Z) = S(ZQK^⊤X^⊤) X V,    (1)
f_self(X) = S(XQK^⊤X^⊤) X V,        (2)

where K, Q, and V are the trainable key, query, and value matrices, and S(·) denotes the softmax non-linearity, applied row-wise to the attention scores XQK^⊤X^⊤ (respectively ZQK^⊤X^⊤). The study assumes that the first token of Z (denoted z) is used for prediction. Specifically, given a training dataset (Y_i, X_i, Z_i)_{i=1}^n with labels Y_i ∈ {−1, 1}, the study minimizes the empirical risk with a decreasing loss function ℓ(·): R → R:

L(K, Q) = (1/n) Σ_{i=1}^n ℓ(Y_i · h(X_i^⊤ S(X_i K Q^⊤ z_i)))

(with the combined parameter W := KQ^⊤, the attention scores for query z_i are X_i W z_i). Here, h(·): R^d → R is the prediction head that absorbs the value weights V. In this formulation, the model f(·) accurately represents a single-layer transformer in which an MLP follows the attention layer. The authors recover the self-attention in (2) by setting Z_i ← X_i, so that z_i = x_i, where x_i denotes the first token of sequence X_i. Due to the non-linear nature of the softmax operation, this objective poses significant challenges for optimization: even when the prediction head is fixed and linear, the problem is non-convex and non-linear. In this study, the authors focus on optimizing the attention weights (K, Q, or W) and overcome these challenges, thereby establishing a fundamental equivalence with SVM.
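To make the formulation concrete, below is a toy NumPy reimplementation of this single-layer model and its empirical risk (a minimal sketch under the notational assumptions above, with a fixed linear head h, combined parameter W = KQ^⊤, a logistic loss, and made-up sizes and data; it is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 6, 4, 8            # toy sequence length, embedding dimension, dataset size

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attn_output(X, W, z):
    """First-token attention output X^T S(X W z): a convex combination of the tokens of X."""
    return X.T @ softmax(X @ W @ z)

def empirical_risk(W, data, h):
    """Decreasing (logistic) loss of Y_i * h^T f(X_i), averaged over the dataset."""
    return np.mean([np.log1p(np.exp(-Y * (h @ attn_output(X, W, z))))
                    for Y, X, z in data])

# Toy dataset: labels Y_i in {-1, +1}; the self-attention case uses z_i = x_i,
# the first token of X_i, as the query.
data = []
for _ in range(n):
    X = rng.normal(size=(T, d))
    data.append((rng.choice([-1.0, 1.0]), X, X[0]))

W = 0.1 * rng.normal(size=(d, d))   # combined key-query parameter W = K Q^T
h = rng.normal(size=d)              # fixed linear prediction head
print("toy empirical risk:", empirical_risk(W, data, h))
```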
The structure of the paper is as follows: Chapter 2 introduces preliminaries on self-attention and optimization; Chapter 3 analyzes the optimization geometry of self-attention, showing that the regularization path (RP) of the attention parameters converges to maximum-margin solutions; Chapters 4 and 5 present the global and local gradient descent analyses, showing that the key-query variable W converges to solutions of (Att-SVM); Chapter 6 provides results on non-linear prediction heads and the generalized SVM equivalence; Chapter 7 extends the theory to sequential and causal prediction; Chapter 8 discusses related literature. Finally, Chapter 9 concludes with open questions and future research directions.
The main content of the paper is as follows:
Implicit Bias of the Attention Layer (Chapters 2-3)
Optimizing the attention parameters (K, Q) under vanishing regularization converges in direction to a maximum-margin solution whose objective is the nuclear norm of the combined parameter W := KQ^⊤. When cross-attention is instead parameterized directly by the combined parameter W, the regularization path (RP) converges in direction to the (Att-SVM) solution with the Frobenius norm as the objective.
This is the first formal result distinguishing the optimization dynamics of W from those of (K, Q), revealing the low-rank bias of the latter. The theory of this study clearly characterizes the optimality of the selected tokens and naturally extends to sequence-to-sequence or causal classification settings.
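For concreteness, here is a sketch of the (Att-SVM) problem these statements refer to, reconstructed from the description above (the margin normalization and the indexing by an optimal token opt_i are assumptions, so the paper's exact statement may differ): writing opt_i for the index of the optimal token of sequence X_i and z_i for its query token,

```latex
% Sketch of (Att-SVM), reconstructed from the summary above; normalization may differ.
\min_{W}\ \|W\|
\quad\text{subject to}\quad
(x_{i,\mathrm{opt}_i} - x_{i,t})^{\top} W z_i \;\ge\; 1
\qquad \text{for all } t \neq \mathrm{opt}_i,\ i = 1,\dots,n.
```

Here ||·|| is the Frobenius norm when cross-attention is parameterized directly by W, and the nuclear norm when it is parameterized by (K, Q) with W := KQ^⊤; each constraint is linear in W through the outer product (x_{i,opt_i} − x_{i,t}) z_i^⊤, which is exactly the "linear constraints on the outer products of token pairs" mentioned earlier.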
Convergence of Gradient Descent (Chapters 4-5)
With appropriate initialization and a linear head h(·), gradient descent (GD) iterations on the combined key-query variable W converge in direction to a locally optimal solution of (Att-SVM) (Chapter 5). To achieve local optimality, the selected tokens must have higher scores than their neighboring tokens.
The locally optimal direction is not necessarily unique and is characterized by the geometric properties of the problem [TLZO23]. As an important contribution, the authors identify geometric conditions that guarantee convergence to a globally optimal direction (Chapter 4). These conditions include:
- the optimal tokens are clearly separated from the others in score;
- the initial gradient direction is aligned with the optimal tokens.
In addition, the paper demonstrates that over-parameterization (i.e., a large dimension d, together with equivalent conditions) catalyzes global convergence by ensuring (1) the feasibility of (Att-SVM), and (2) a benign optimization landscape, i.e., one with no stationary points and no spurious locally optimal directions (see Section 5.2).
Figures 1 and 2 illustrate this.
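As a quick illustration of the kind of behaviour those figures depict, here is a toy gradient-descent run on the combined parameter W (a minimal sketch with made-up data and a finite-difference gradient, not the paper's experimental setup): in a typical run the norm of W keeps growing while its direction stabilizes, consistent with convergence in direction to a max-margin solution.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, n = 4, 6, 4                      # small T, larger d: the over-parameterized regime
softmax = lambda s: np.exp(s - s.max()) / np.exp(s - s.max()).sum()

# Toy data: labels, sequences, and query token z_i = x_i (self-attention case).
data = [(rng.choice([-1.0, 1.0]), X, X[0]) for X in rng.normal(size=(n, T, d))]
h = rng.normal(size=d)                 # fixed linear head

def risk(W):
    # Logistic loss of Y_i * h^T X_i^T S(X_i W z_i), averaged over the toy dataset.
    return np.mean([np.log1p(np.exp(-Y * (h @ (X.T @ softmax(X @ W @ z)))))
                    for Y, X, z in data])

def num_grad(W, eps=1e-5):
    # Crude central-difference gradient; good enough for a 6x6 toy parameter.
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W); E[idx] = eps
        G[idx] = (risk(W + E) - risk(W - E)) / (2 * eps)
    return G

W = 0.01 * rng.normal(size=(d, d))
prev_dir = W / np.linalg.norm(W)
for step in range(1, 2001):
    W -= num_grad(W)                   # plain gradient descent, step size 1
    if step % 500 == 0:
        direction = W / np.linalg.norm(W)
        print(f"step {step:4d}  ||W|| = {np.linalg.norm(W):7.3f}  "
              f"cos(current, previous dir) = {direction.ravel() @ prev_dir.ravel():.4f}")
        prev_dir = direction
# In a typical run the norm keeps growing while the cosine between successive
# directions approaches 1, i.e. W converges in direction rather than in value.
```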
Generality of SVM Equivalence (Chapter 6)
When optimizing with a linear head h(·), the attention layer inherently tends to select a single token from each sequence (also known as hard attention). This is reflected in (Att-SVM), where the output is a convex combination of the input tokens. In contrast, the authors show that non-linear heads require composing the output from multiple tokens, highlighting the importance of such compositions in transformer dynamics (Section 6.1). Leveraging insights gained from the theory, the authors propose a more general SVM equivalence.
Notably, they demonstrate that in general scenarios not covered by the theory (e.g., when h(·) is an MLP), the approach in this paper can still accurately predict the implicit bias of attention trained through gradient descent. Specifically, the general formula decouples the attention weights into two parts: a directional component governed by an SVM, which selects tokens by applying a 0-1 mask, and a finite component, which determines the precise composition of the selected tokens by adjusting the softmax probabilities.
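A small softmax computation illustrates this decoupling (a toy illustration with made-up scores, not the paper's construction): scaling up a 0-1 "mask" component suppresses the unselected tokens, while a finite component fixes the relative weights of the tokens that survive the mask.

```python
import numpy as np

softmax = lambda s: np.exp(s - s.max()) / np.exp(s - s.max()).sum()

# Toy attention scores for 5 tokens: the "directional" part marks which tokens
# are selected (1) or masked out (0); the "finite" part sets their composition.
mask_scores   = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # SVM-governed 0-1 mask
finite_scores = np.array([0.7, 0.0, 0.3, 0.9, -0.5])  # finite correction

for c in [1, 10, 100]:   # c plays the role of the growing norm of the directional part
    print(f"c={c:>3}  probs={np.round(softmax(c * mask_scores + finite_scores), 4)}")

# As c grows, tokens with mask 0 get vanishing probability, while the two selected
# tokens keep a fixed ratio exp(0.7)/exp(0.0) determined by the finite component.
```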
An important feature of these findings is that they apply to arbitrary datasets (as long as SVM is feasible) and can be numerically validated. The authors extensively verify the maximum margin equivalence and implicit bias of transformers through experiments. They believe these findings contribute to understanding transformers as a hierarchical maximum margin token selection mechanism, laying the groundwork for upcoming research on their optimization and generalization dynamics.
References:
https://news.ycombinator.com/item?id=37367951
https://twitter.com/vboykis/status/1698055632543207862