DeepSeek Technology Interpretation: Understanding MLA

This article focuses on explaining MLA (Multi-Head Latent Attention).
Note: During my own learning I often run into knowledge blind spots or inaccuracies, and I end up recursively studying the surrounding context. Accordingly, this article walks step by step through the background behind MLA's proposal, the problems it aims to solve, and its final effects, together with the necessary background knowledge.
MLA primarily optimizes the KV cache to reduce GPU memory usage and thereby improve inference performance. Stating this conclusion up front may not be easy to digest, so let's first look at what a complete inference pass of a generative model looks like and where its performance problems lie.

02

LLM Model Inference Process

LLM inference is divided into two phases: prefill phase and decode phase.
  • Prefill Phase: This is when the model computes all the prompt tokens in parallel at once, ultimately generating the first output token.
  • Decode Phase: This generates one token at a time until it produces the EOS (end-of-sequence) token, yielding the final response.
During inference, because the model stacks many transformer layers, the core computational cost lies inside the transformer block, in operations such as MHA and the FFN. MHA computes the Q, K, and V matrices to carry out multi-head attention.
LLM generation is a token-by-token prediction process conditioned on the preceding sequence of tokens. Tokens in the sequence (whether in the prefill phase or the decode phase) only interact with the preceding tokens when computing attention, which is why it is called causal attention. This is enforced through a lower-triangular causal attention mask that lets each token attend only to the positions before it. Figure 1 presents the internal details of the Transformer:


Figure 1: Internal Computational Details of the Transformer
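To make the causal mask described above concrete, below is a minimal single-head sketch in NumPy (toy dimensions, purely illustrative, not DeepSeek's implementation): row t of the resulting weight matrix places non-zero weight only on positions up to t.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Toy single-head attention weights with a lower-triangular causal mask.

    q, k: (seq_len, d_h) arrays. Row t of the result only attends to positions <= t.
    """
    seq_len, d_h = q.shape
    scores = q @ k.T / np.sqrt(d_h)                 # raw attention scores
    mask = np.tril(np.ones((seq_len, seq_len)))     # lower-triangular causal mask
    scores = np.where(mask == 1, scores, -np.inf)   # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

attn = causal_attention_weights(np.random.randn(4, 8), np.random.randn(4, 8))
print(np.round(attn, 2))   # upper triangle is all zeros
```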
Taking a token at position t in a sequence as an example, we can calculate the attention process of one layer of the Transformer, as shown in the following formula:


Figure 2: Attention Calculation Formula in DeepSeek-V3
The symbols in the formula: t represents the tth token in the sequence; the two subscripts in q, k, v, o denote the token position and the corresponding head index.
From the formula, we can see that when calculating attention, q at position t only interacts with k and v at preceding positions, leading us to the following two conclusions:
  1. Calculating the preceding k, v is not influenced by later tokens.

  2. For later calculations of attention at positions t+1, t+2, …, t+n, the values of k and v at preceding positions 1→t remain unchanged.

Therefore, to speed up training and inference, researchers proposed caching the already-computed k and v during the token-by-token generation process, which is today's mainstream KV-cache mechanism. In essence, KV cache trades space for time. Current LLMs are large and GPU memory is precious, so storing the KV cache in GPU memory inevitably creates a memory-access bottleneck. In other words, if the model computes directly without a KV cache (repeatedly recomputing the preceding k and v), inference is a compute-intensive task; with a KV cache, k and v are read from the "storage medium" instead of being recomputed, so data moves back and forth between the GPU compute cores and the storage medium, and inference becomes a memory-access-intensive task. Thus, while the KV-cache mechanism solves the repeated-computation problem, the speed of memory access now directly affects training and inference speed.
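As a toy illustration of this mechanism, the following sketch (a single attention head in NumPy with made-up dimensions; not any real inference engine) computes k and v only for the newest token at each decode step, appends them to the cache, and reuses everything cached so far:

```python
import numpy as np

d_h = 8                                           # toy head dimension
W_q, W_k, W_v = [np.random.randn(d_h, d_h) for _ in range(3)]

def decode_step(x_new, k_cache, v_cache):
    """One decode step: compute q/k/v only for the new token and reuse the cache."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)                   # cache k for the new token
    v_cache.append(x_new @ W_v)                   # cache v for the new token
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d_h)                 # the new token attends to all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                            # attention output for the new token

k_cache, v_cache = [], []
for _ in range(5):                                # token-by-token generation
    out = decode_step(np.random.randn(d_h), k_cache, v_cache)
print(len(k_cache), "k/v pairs cached after 5 steps")   # -> 5
```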
Next, we will take a closer look at the memory access rates at different levels for a typical inference architecture, what data needs to be stored during the model inference process, and how to allocate storage.

03

Memory Usage During LLM Inference Phase

3.1 Memory Access Rate Levels

To intuitively understand the memory access rates, let’s take a distributed inference architecture as an example.
For instance, with two machines, each with 8 A100 GPUs, the data-access efficiency within a card, between cards, and between machines is illustrated in Figure 3. Note: in this example we only describe one memory medium, HBM (what we usually call GPU memory). The storage media involved in GPU computing also include SRAM and DRAM. SRAM, also known as on-chip memory, is the fastest storage for the GPU compute units: all computation must first be staged into SRAM before it can execute. It is typically only a few tens of MB in size, with a bandwidth of roughly 20 TB/s; because it is tightly bound to the compute units, it is generally not treated as a storage tier during inference. DRAM is what we usually call CPU (host) memory, and because its access is slow it is generally not used during inference either. So the inference storage medium we discuss below refers to HBM (GPU memory).


Figure 3: Memory and Bandwidth Storage in Distributed Inference Architecture
From the memory-access bandwidths shown in the figure, the bandwidth within a card is roughly three times the inter-card bandwidth and twenty times the inter-machine bandwidth. We should therefore prefer keeping data in on-card memory, then within a single machine, and only then consider cross-machine storage.
Next, let’s look at what data needs to be stored in GPU memory during the inference process.

3.2 Memory Allocation During Model Inference Phase

Below is a diagram I created, as shown in Figure 4, indicating that during the inference phase, three main types of data will be stored in GPU memory.
  • KV Cache: As mentioned in the previous section, the results of k and v calculated from the preceding token sequence will gradually be stored in GPU memory during the subsequent token inference process. The amount stored varies dynamically with the batch size and sequence length.
  • Model Parameters: This includes parameters for transformers, embeddings, etc., which will be stored in GPU memory. Once the model size is fixed, this storage space remains constant.
  • Runtime Intermediate Data: Some intermediate data produced during inference will be temporarily stored in GPU memory, used and released immediately, generally occupying a relatively small space.


Figure 4: Memory Usage During Inference Phase
From the above, we can see that the main storage consumption during the inference phase is from two parts: Model Parameters and KV Cache. So what proportion do model parameters occupy, and how much does KV Cache occupy?
First, let's look at the computation for a single token to see how much k and v must be stored per token. For convenience we take the Qwen-72B model (Qwen-72B-Chat) as the example. Its configuration: 80 layers, 64 heads per layer, and a per-head vector dimension of 128.
Note: here we ignore Qwen-72B's actual GQA setting (which already compresses the KV) and consider only a naive MHA structure with no compression applied; GQA will be discussed in detail later.
As shown in Figure 5, calculating for one token, each head in every transformer layer needs to store a pair of k and v.


Figure 5: KV Cache Data for a Single Token
For one token, the total amount of cached data for k and v is:

2 (one k and one v) × 80 (layers) × 64 (heads) = 10240 vectors   (1)

Formula (1) shows that for a single token we need to cache 10,240 k and v vectors. Isn't that a bit surprising? How much storage do these k and v occupy? Assuming inference uses half-precision (bf16) parameters, each value occupies 2 bytes, so the storage for one token is given by formula (2):

10240 (vectors) × 128 (dimension per vector) × 2 (bytes) = 2,621,440 bytes ≈ 2.5 MB   (2)

We now know how many k and v need to be cached after computing one token and the amount of storage required. For an actual inference scenario, we also need to consider the batch size (B) and sequence length (S) dimensions to determine the overall storage consumption of KV Cache. These two dimensions can typically vary dynamically. Let’s look at the following two scenarios:
Scenario 1: Single Short Text Scenario
Batch and sequence settings: B = 1, S = 2048. At this time, the total amount of k and v cache is:
1 × 2048 × 2.5 MB = 5,120 MB = 5 GB
Scenario 2: Concurrent Long Text Scenario
Batch and sequence settings: B = 32, S = 4096. At this time, the total amount of k and v cache is:
32 × 4096 × 2.5 MB = 327,680 MB = 320 GB
In addition to the storage space consumed by k and v, we know that model parameters also occupy storage space.
The storage space occupied by model parameters during the inference phase is fixed and easy to calculate. With a parameter count of 72B and bf16 half-precision inference, the parameter storage is:

72B × 2 bytes ≈ 144 GB

Combining the above two scenarios, let’s look at the overall allocation of GPU memory:
  • Scenario 1: model storage ≈ 144 GB, KV Cache storage ≈ 5 GB. Model parameter storage dominates; with 80 GB A100s, at least 2 cards are needed for inference.
  • Scenario 2: model storage ≈ 144 GB, KV Cache storage ≈ 320 GB. KV Cache storage dominates; with 80 GB A100s, at least 7 cards are needed for inference. (A short sketch of this arithmetic follows below.)
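The arithmetic behind both scenarios can be written out in a few lines (a rough estimate under the assumptions above: naive MHA, bf16, Qwen-72B-like shape; units follow the article's rounding, not an exact measurement):

```python
bytes_per_val = 2                                   # bf16
layers, heads, head_dim = 80, 64, 128               # Qwen-72B-like MHA shape from above

kv_per_token = 2 * layers * heads * head_dim * bytes_per_val   # one k and one v per head per layer
print(kv_per_token / 1024**2, "MB per token")                  # -> 2.5 MB

for name, (batch, seq_len) in {"scenario 1": (1, 2048), "scenario 2": (32, 4096)}.items():
    kv_total_gb = kv_per_token * batch * seq_len / 1024**3
    print(f"{name}: KV cache ≈ {kv_total_gb:.0f} GB")          # -> 5 GB and 320 GB

model_params = 72e9
print(f"model weights ≈ {model_params * bytes_per_val / 1e9:.0f} GB")   # -> 144 GB
```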
A further note: the inference batch size, depending on whether the workload is offline or online, is really a balancing act. A small batch offers little concurrency, but the complete model parameters and KV Cache may fit on a single card; intra-card bandwidth is then high and performance can still be excellent, and it may be worth increasing the batch size to make full use of the card's memory and squeeze out more performance. Once the batch grows beyond what a single card, or even a single machine, can hold, concurrency rises, but cross-card or cross-machine memory access becomes the bottleneck, GPU compute utilization drops, and overall inference performance may end up low. When setting the inference batch size, it is therefore crucial to find the optimal balance point through empirical testing.
Today's LLMs are large, and memory capacity and access speed are hierarchical. Reducing cross-card and cross-machine memory reads and writes during inference is therefore an effective path to better inference performance: on one hand, the less data read or written each time, the faster each step; on the other hand, the less total GPU memory used, the more data can stay on a single card or within a single machine, where read/write bandwidth is higher.
The MLA we are going to study aims to reduce KV Cache to compress memory usage, thereby optimizing inference speed. Before we delve into MLA, let’s first look at the current methods for optimizing KV Cache.

04

Methods to Reduce KV Cache

4.1 Summary of KV Cache Optimization Methods

In the industry, many methods have been derived for optimizing KV Cache. Here, I summarize based on my own accumulation, briefly describing the optimization ideas without going into too much detail.
There are mainly four types of methods:
  • Shared KV: Multiple heads share one set of KV, transforming the original one KV per head into one KV per group of heads, thereby compressing KV storage. Representative methods: GQA, MQA, etc.
  • Window KV: Controls a computation window for KV for long sequences, where the KV cache only retains results within the window (the window length is far less than the sequence length), discarding KV beyond the window. This method can reduce KV storage but may also lose some long-text inference effectiveness. Representative method: Longformer, etc.
  • Quantization Compression: use quantization to store KV at lower bit widths, further shrinking each cached k and v. Representative approaches: low-bit KV-cache quantization (e.g., INT8/INT4 KV cache).
  • Computational Optimization: optimize the computation process to reduce the number of memory accesses, keeping more of the computation in on-chip SRAM and thereby improving inference performance. Representative method: FlashAttention, etc.
The MLA discussed in this article is an optimization method under the shared KV branch. Next, let’s explore what shared KV methods exist, as these methods are also used for comparison in MLA.

4.2 Shared KV Optimization Methods

Shared KV mainly has two methods: MQA and GQA, both proposed by Google, see: MQA(2019), GQA(2023), as shown in Figure 6.


Figure 6: KV Cache Optimization Methods – Shared KV Method
4.2.1 MQA (Multi-Query Attention)
The MQA method is relatively simple: as shown in the rightmost part of Figure 6, all heads within a layer share the same k and v when computing attention. Compared with MHA, which needs to cache 2×l×n_h vectors per token, MQA needs only 2×l, i.e., each layer keeps just one shared k vector and one shared v vector.
4.2.2 GQA (Group-Query Attention)
GQA is a compromise between MQA and MHA: instead of one KV per head or one KV shared by all heads, the heads are divided into g groups, and the n_h/g heads within each group share one k and v. When g = 1, GQA is equivalent to MQA; when g = n_h, GQA is equivalent to MHA.
To make GQA and MQA clearer to myself, I extended the single-token KV diagram of Figure 5 to all layers and added some annotations, comparing the three methods side by side.


Figure 7: Comparison of MHA, MQA, GQA KV Cache
Let's summarize how much KV Cache each method stores when computing a single token (where l is the number of model layers and n_h is the number of heads per layer); a small sketch of this arithmetic follows the list:
  • MHA caches 2×l×n_h k and v vectors in total.
  • MQA caches 2×l k and v vectors in total.
  • GQA caches 2×l×g k and v vectors in total, where g is the number of groups, 1 ≤ g ≤ n_h, normally chosen as a divisor of n_h.
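As a quick sketch of these counts (pure arithmetic, reusing the Qwen-72B-like l = 80, n_h = 64 example from earlier; the group count g = 8 is just an illustrative choice):

```python
def kv_vectors_per_token(layers, kv_heads):
    """Total k and v vectors cached per token: 2 (one k, one v) per kv-head per layer."""
    return 2 * layers * kv_heads

layers, n_heads = 80, 64
print("MHA:", kv_vectors_per_token(layers, n_heads))   # 2*l*n_h = 10240
print("GQA:", kv_vectors_per_token(layers, 8))         # 2*l*g   = 1280 (illustrative g = 8)
print("MQA:", kv_vectors_per_token(layers, 1))         # 2*l     = 160
```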
The MLA discussed in this article is also a variant of the shared KV optimization. Now, let’s look at the principles and details of MLA.

05

MLA

5.1 Overview of MLA KV Optimization

Let’s take a quick look at MLA’s computation method and compare its KV compression effects with MQA and GQA.
First, let’s look at the complete formula for calculating attention in MLA, as shown in Figure 8.


Figure 8: MLA Attention Calculation Formula
The paper notes that each transformer layer caches only the two vectors in the blue box of the formula above: the compressed latent c_t^KV and the decoupled RoPE key k_t^R, whose dimensions are:
  • c_t^KV: dimension d_c = 4 × d_h = 512
  • k_t^R: dimension d_h^R = d_h / 2 = 64
Compared with MQA (where each layer caches one k and one v of dimension d_h each, i.e., 2 × d_h values), MLA stores 2.25 times as much. Yet DeepSeek claims the method is not only stronger than MQA but even better than the original MHA without shared KV; we return to this in section 5.3.
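The 2.25× figure falls straight out of the cached dimensions above; a quick check (d_h = 128, as implied by d_c = 4 × d_h = 512):

```python
d_h = 128                        # single-head dimension (implied by d_c = 4 * d_h = 512)
d_c = 4 * d_h                    # compressed latent c^KV: 512
d_h_rope = d_h // 2              # shared RoPE key k^R: 64

mla_per_layer = d_c + d_h_rope   # values MLA caches per token per layer: 576
mqa_per_layer = 2 * d_h          # MQA caches one k and one v of dimension d_h: 256
print(mla_per_layer / mqa_per_layer)   # -> 2.25
```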
MLA claims to be fast, economical, and powerful. In the next section, we will gradually explore the specific implementation.

5.2 Interpretation of MLA Principles

Next, let's walk through the MLA calculation process with reference to the formulas in Figure 8. First, the variables used in the formulas:
  • d_c: the low-rank compression dimension for KV in MLA (512 in DeepSeek-V3)
  • d_h: the vector dimension of a single head
  • n_h: the number of heads per layer
  • d: the hidden-layer dimension (7168 in DeepSeek-V3)
  • W: the low-rank transformation matrices
1. First, let’s look at the KV calculation process
  • First, formula (41) applies low-rank compression to the input hidden state h_t, projecting the d-dimensional input down to a d_c-dimensional latent c_t^KV. In DeepSeek-V3, d = 7168 and d_c = 512.

c_t^KV = W^DKV · h_t   (41)

  • Then, through formulas (42) and (45), two up-projection matrices (W^UK and W^UV) expand c_t^KV back into full keys and values, giving each head its own d_h-dimensional k and v (the same number of k and v as in MHA).
[k_{t,1}^C; k_{t,2}^C; …; k_{t,n_h}^C] = k_t^C = W^UK · c_t^KV   (42)
[v_{t,1}^C; v_{t,2}^C; …; v_{t,n_h}^C] = v_t^C = W^UV · c_t^KV   (45)
Note: this compress-then-expand structure looks very similar to LoRA's low-parameter fine-tuning: using two low-rank matrices to first compress and then expand does reduce the parameter count. But the emphasis is different. LoRA is about reducing the number of parameters; MLA's operations also reduce parameters (under DeepSeek-V3's configuration, the two low-rank matrices have roughly 2 × d_c × d = 2 × 512 × 7168 parameters, versus d × d = 7168 × 7168 for an ordinary MHA projection matrix), but what MLA really targets is reducing the KV cache, i.e., the activation values of k and v. At this point we cannot yet see how the activations are reduced: in terms of the number and dimensions of k and v it is on par with MHA, even more than GQA and MQA, and an extra computation step has been added. It is rather confusing at this stage, so let's keep going.
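Below is a minimal NumPy sketch of this compress-then-expand path. The sizes d = 7168 and d_c = 512 come from the text; the head layout (n_h = 128 heads of dimension d_h = 128) is an assumption for illustration, and the weights are random, so this only demonstrates shapes and rough parameter counts, not the real model:

```python
import numpy as np

d, d_c, n_h, d_h = 7168, 512, 128, 128     # d and d_c from the text; n_h and d_h assumed

W_DKV = np.random.randn(d_c, d) * 0.02             # down-projection: hidden state -> latent (41)
W_UK  = np.random.randn(n_h * d_h, d_c) * 0.02     # up-projection to per-head keys        (42)
W_UV  = np.random.randn(n_h * d_h, d_c) * 0.02     # up-projection to per-head values      (45)

h_t = np.random.randn(d)
c_kv = W_DKV @ h_t                          # only this d_c-dimensional latent is cached
k_heads = (W_UK @ c_kv).reshape(n_h, d_h)   # full per-head keys, recovered on the fly
v_heads = (W_UV @ c_kv).reshape(n_h, d_h)   # full per-head values, recovered on the fly

print("cached per token per layer:", c_kv.shape)                  # (512,)
print("low-rank projection params:", W_DKV.size + W_UK.size + W_UV.size)
print("a dense d x d projection would have:", d * d)              # rough comparison, as in the note
```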
2. Now let’s look at the Q calculation process
  • Formulas (37) and (38) follow the same logic as the KV path, using two matrices (a down-projection W^DQ and an up-projection W^UQ) to apply a low-rank transformation to q. In DeepSeek-V3 the query compression dimension is d_q = 1536, three times the KV compression dimension d_c, but still far smaller than d = 7168.
c_t^Q = W^DQ · h_t   (37)
[q_{t,1}^C; q_{t,2}^C; …; q_{t,n_h}^C] = q_t^C = W^UQ · c_t^Q   (38)
3. Adding RoPE positional encoding to q and k
  • Note that RoPE is not applied by multiplying the previously computed q and k by RoPE's rotation matrix. Instead, two separate position-encoded components, q_t^R and k_t^R, are computed, as shown in formulas (39) and (43).

[q_{t,1}^R; q_{t,2}^R; …; q_{t,n_h}^R] = q_t^R = RoPE(W^QR · c_t^Q)   (39)
k_t^R = RoPE(W^KR · h_t)   (43)

Note that the calculations of q_t^R and k_t^R involve two details:

1. q_t^R and k_t^R have a relatively small vector dimension d_h^R, which DeepSeek sets to half of a single attention head's dimension: d_h^R = d_h / 2 = 64.

2. The calculation of k_t^R is essentially an MQA-style calculation: all heads in the same layer share one k_t^R.

Then, following formulas (40) and (44), the computed q_t^R and k_t^R are concatenated with the compressed parts to form the complete q_{t,i} and k_{t,i} vectors (the subscript i denotes the attention head index).

q_{t,i} = [q_{t,i}^C; q_{t,i}^R]   (40)
k_{t,i} = [k_{t,i}^C; k_t^R]   (44)
So far we have seen that q and k are each composed of two parts: one part comes from the low-rank compressed path (q^C and k^C), and the other part carries RoPE positional encoding (q^R and k^R), with k^R shared by all heads in a layer and computed in the MQA style.
How can we understand the above operation process? This is also the core of the MLA method.
Let's refer to the explanation given in the DeepSeek-V2 paper (paraphrased here):
RoPE is used for positional encoding, but RoPE is incompatible with low-rank KV compression. Specifically, RoPE is position-sensitive for both q and k. If we apply RoPE to the keys, the key transformation matrix W_k becomes coupled with the position-sensitive RoPE matrix, and during inference it can no longer be absorbed into the query-side matrix W_q, because a RoPE matrix that depends on the currently generated token's position sits between W_q and W_k and matrix multiplication is not commutative. As a result, we would have to recompute the keys of all prefix tokens during inference, which would greatly reduce inference efficiency.
The paper mentions the concept of “matrix absorption calculation,” which is important for understanding MLA. Let’s use a simple example to understand this:
Assume there are two 3-dimensional vectors x and y, and two fixed transformation matrices W_1 and W_2 that linearly transform them into new vectors x' and y'. We finally want to compute the inner product of x' and y'.
Method 1: Conventional Calculation
x' = W_1 x   (a)
y' = W_2 y   (b)
x'^T y' = (W_1 x)^T (W_2 y) = x^T W_1^T W_2 y   (c)
Method 2: Matrix Absorption Calculation
We know that matrix multiplication satisfies the associative property. For formula (c), we can first compute the product of the two transformation matrices:
W' = W_1^T W_2   (d)
Then we transform x with W' (the merged matrix) and multiply the result directly with y, without applying any transformation to y:
x'' = x^T W'   (e)
Now we can calculate the product of x’ and y’ .
x'^T y' = x'' y = x^T W_1^T W_2 y   (f)
Through the above example, we can see that the results obtained by the two methods are the same, but the second method first performs matrix multiplication, which is equivalent to absorbing the transformation matrix W_1 into W_2.
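The same identity can be checked numerically in a few lines (random 3-dimensional vectors and matrices, purely illustrative):

```python
import numpy as np

x, y = np.random.randn(3), np.random.randn(3)
W1, W2 = np.random.randn(3, 3), np.random.randn(3, 3)

# Method 1: transform both vectors, then take the inner product
method1 = (W1 @ x) @ (W2 @ y)

# Method 2: absorb W1 into W2 first, then transform only one side
W_merged = W1.T @ W2
method2 = x @ (W_merged @ y)

print(np.allclose(method1, method2))   # True: (W1 x)·(W2 y) = x^T (W1^T W2) y
```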
Understanding the above example, let’s look at the issue of “RoPE being incompatible with low-rank KV and not being able to perform matrix absorption calculations”.
a) Without Adding RoPE
Assume first that we do not add RoPE. The product q × k is then computed as follows, where the superscript (i) denotes the slice of the transformation matrix belonging to the i-th head:
q_{t,i}^T k_{j,i} = (W_q^(i) c_t^Q)^T (W_k^(i) c_j^KV) = (c_t^Q)^T (W_q^(i))^T W_k^(i) c_j^KV
Without RoPE, the product (W_q^(i))^T W_k^(i) can be pre-computed, which is what we mean by absorbing W_k into W_q. This way, while applying the W_q transformation, the multiplication by W_k is effectively carried out at the same time.
The benefit is that we only need to cache the low-rank latent c_j^KV, rather than caching the up-projected k of every head. The latent has dimension d_c = 512, whereas the recovered keys span n_h heads of dimension d_h each, which is back at the full hidden-layer scale. This is the core principle of MLA's KV Cache compression.
b) Now Assume We Add RoPE
Now let's see what happens when RoPE is added. The product q × k acquires an extra matrix that depends on the relative position, as shown below:
q_{t,i}^T k_{j,i} = (R_t W_q^(i) c_t^Q)^T (R_j W_k^(i) c_j^KV) = (c_t^Q)^T (W_q^(i))^T (R_t^T R_j) W_k^(i) c_j^KV
Here the middle term R_t^T R_j changes with the relative position of the two tokens. Since it is not a fixed matrix, it cannot be pre-computed and absorbed, which is why the paper says RoPE is incompatible with low-rank transformation.
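A tiny 2-D rotation demo (a single RoPE frequency, illustrative only) makes this concrete: the matrix sandwiched between the two projections depends only on the relative position t − j, so it changes from pair to pair and cannot be folded into one pre-computed matrix.

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix, the basic building block of RoPE."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

q, k = np.random.randn(2), np.random.randn(2)
for t, j in [(5, 2), (9, 6), (7, 1)]:
    score = (rot(t * 0.1) @ q) @ (rot(j * 0.1) @ k)   # RoPE applied to q at position t, k at position j
    via_relative = q @ rot((j - t) * 0.1) @ k         # same score, written via the relative rotation
    print(t - j, round(score, 6), round(via_relative, 6))
# (5,2) and (9,6) give identical scores (same t-j); (7,1) differs:
# the "absorbed" matrix would have to change with t-j, so it cannot be pre-computed once.
```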
c) Introducing RoPE with a Very Small Component
To still introduce positional information, the authors add a separate key k^R computed in the MQA way: within each layer, all heads share a single k_t^R (formula (43) in the paper). The dimension of this positional component is kept very small: d_h^R = 64.
Thus the final q and k vectors are each formed by concatenating two parts, and the attention weight is obtained by multiplying the two parts separately and summing the results, as shown below:
q_{t,i}^T k_{j,i} = [q_{t,i}^C; q_{t,i}^R]^T [k_{j,i}^C; k_j^R] = (q_{t,i}^C)^T k_{j,i}^C + (q_{t,i}^R)^T k_j^R
The first term, (q_{t,i}^C)^T k_{j,i}^C, is handled exactly as in the no-RoPE case above: through matrix absorption, all heads only need the single cached latent c_j^KV. The second term, (q_{t,i}^R)^T k_j^R, is computed in the ordinary MQA way, where all heads share one cached k_j^R.
Following the same idea, the value transformation can be handled by absorbing W_v into the output projection matrix W_o. In this way we never need to explicitly compute and cache v; caching the same latent c^KV used for k is enough.
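Putting the pieces together, here is a small NumPy sketch of the per-head score computation in absorbed form, checked against the naive "up-project every cached key" form. The dimensions d_c = 512, d_q = 1536 and d_h^R = 64 follow the text, d_h = 128 is assumed, and all weights and caches are random, so this is only a shape-level illustration of the technique, not DeepSeek's implementation.

```python
import numpy as np

d_c, d_q, d_h, d_hr = 512, 1536, 128, 64      # KV latent, q latent, head dim (assumed), RoPE dim
n_past = 4                                    # number of cached past tokens

W_UQ = np.random.randn(d_h, d_q) * 0.02       # up-projection for one head's q^C
W_UK = np.random.randn(d_h, d_c) * 0.02       # up-projection for one head's k^C

c_q = np.random.randn(d_q)                    # query latent of the current token
c_kv = np.random.randn(n_past, d_c)           # cached latents c^KV of the past tokens
q_rope = np.random.randn(d_hr)                # this head's RoPE query part
k_rope = np.random.randn(n_past, d_hr)        # cached shared RoPE keys, one per past token

# Naive form: up-project every cached latent back to a full key, then dot with the full query.
naive = (W_UQ @ c_q) @ (c_kv @ W_UK.T).T + k_rope @ q_rope

# Absorbed form: fold W_UK into W_UQ once; past tokens never leave the d_c-dimensional latent space.
W_merged = W_UK.T @ W_UQ                      # shape (d_c, d_q), computed once per head
absorbed = c_kv @ (W_merged @ c_q) + k_rope @ q_rope

print(np.allclose(naive, absorbed))           # True: same scores, but only c^KV and k^R are cached
```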
We have now fully covered the principle behind MLA's KV Cache compression. To review, the vectors MLA actually caches (the blue box in Figure 8) are:
  • c_t^KV: the low-rank compressed latent, dimension d_c = 512
  • k_t^R: the RoPE key shared by all heads in a layer, dimension d_h^R = 64
Note: this principle has been explained very clearly by Su Jianlin (see: Caching and the Extreme Trade-off Between Performance and Effect: From MHA, MQA, GQA to MLA – Scientific Spaces); the explanation in this article follows the same logic to outline the key ideas. Thanks to Su for sharing.

5.3 Comparison of MLA with MQA and GQA

Finally, let’s briefly compare the various methods, directly quoting from the DeepSeek-V2 paper as follows:


Figure 9: Comparison of MLA, MHA, GQA, MQA
From the figure above, although the latent KV cached by MLA is fairly small (about 2.25 times MQA's cache), MLA retains the ability to recover the full k and v, so its representational capacity is significantly stronger than that of GQA and MQA. This is how MLA manages to be fast, economical, and powerful at the same time. The paper also provides the following data:


Figure 10: Comparison of Compression Performance and Effectiveness of MLA with Other Methods
Note: Regarding the comparison of capabilities in the figure, I am somewhat skeptical about the claim of being stronger than MHA, as I have not seen any ablation experiments for comparison, and it is also difficult to explain from a theoretical perspective.

06

Summary

This article attempts to introduce more foundational knowledge and auxiliary information to deeply understand MLA. The content is relatively long and may seem a bit verbose. This is a recursive summary of some extended information during my understanding of MLA, ultimately organizing a systematic context for reference.

07

References

  1. DeepSeek-V1: https://arxiv.org/pdf/2401.02954
  2. DeepSeek-V2: https://arxiv.org/pdf/2405.04434
  3. DeepSeek-V3: https://arxiv.org/pdf/2412.19437
  4. Caching and the Extreme Trade-off Between Performance and Effect: From MHA, MQA, GQA to MLA – Scientific Spaces
  5. https://zhuanlan.zhihu.com/p/659770503
  6. GQA: https://arxiv.org/pdf/2305.13245
  7. MQA: https://arxiv.org/pdf/1911.02150
