Source: Data STUDIO
Why does the error message indicate enough memory, yet still encounter OOM? What is the multi-level allocation mechanism of memory? Why is it designed this way?
The PyTorch memory management mechanism is analyzed here mainly to reduce the impact of “memory fragmentation”. A simple example:

As shown in the above image, suppose we want to allocate 800MB of memory. Although the total free memory is 1000MB, the free memory in the upper image consists of two 500MB blocks that are not contiguous in address, which is insufficient to allocate the 800MB of memory; while in the lower image, if the two 500MB free blocks are contiguous in address, we can consolidate the memory fragments to form a contiguous block of 1000MB, sufficient to allocate 800MB. The situation in the upper image is referred to as “Memory Fragmentation”.
“Solution”: When multiple Tensors can be released, prioritize releasing those Tensors that have free memory adjacent to them. This makes it easier for PyTorch to consolidate these free blocks into larger contiguous blocks.
“Core Issue/Prerequisite”: Can we retrieve the current block we want to release and the size of the free space adjacent to it at a very low cost (O(1) or O(logn) complexity)? With this, we can sort Tensors and prioritize releasing the Tensor with the largest “adjacent free blocks + its own size”.
This prerequisite can be transformed into finding the following two functions (pseudo code):
```python
block = get_block(tensor)             # find the Block where this Tensor is stored
size = get_adjacent_free_size(block)  # return the size of the free space before and after this Block, for sorting
```
After studying the PyTorch memory management mechanism, you may be able to answer the following questions:
- Why does the error message indicate enough memory, yet still encounter OOM?
- What is the multi-level allocation mechanism of memory? Why is it designed this way?
2. Source Code Interpretation
Mainly look at the DeviceCachingAllocator class in c10/cuda/CUDACachingAllocator.cpp (L403)
Link: https://github.com/pytorch/pytorch/blob/a5b848aec10b15b1f903804308eed4140c5263cb/c10/cuda/CUDACachingAllocator.cpp#L403
2.1 Main Data Structures
Block:
- The basic unit for allocating/managing memory; a triplet (stream_id, size, ptr) uniquely identifies a Block, i.e., a Block maintains a pointer ptr to a memory region of size bytes that belongs to the CUDA Stream stream_id.
- All contiguous Blocks (whether free or not, as long as they were obtained from Allocator::malloc) are organized in a doubly linked list, so that when a Block is released, adjacent free fragments can be found quickly and, if they exist, merged with it into a single Block.
```cpp
struct Block {
  int device;             // gpu
  // Block is bound to the stream of the caller at allocation; even after being freed,
  // it cannot be used by other streams
  cudaStream_t stream;    // allocation stream
  stream_set stream_uses; // streams on which the block was used
  size_t size;            // block size in bytes
  BlockPool* pool;        // owning memory pool
  void* ptr;              // memory address
  bool allocated;         // in-use flag
  Block* prev;            // prev block if split from a larger allocation
  Block* next;            // next block if split from a larger allocation
  int event_count;        // number of outstanding CUDA events
};
```
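With the prev/next pointers above, the get_adjacent_free_size primitive from the pseudo code earlier becomes an O(1) lookup. A minimal sketch (this helper is not part of the PyTorch source; it only assumes the simplified Block struct above):

```cpp
// Sketch only: sum the sizes of the immediately adjacent free Blocks, which is all the
// information needed to prioritize releasing the Tensor with the largest contiguous gain.
size_t get_adjacent_free_size(const Block* block) {
  size_t free_size = 0;
  if (block->prev != nullptr && !block->prev->allocated) {
    free_size += block->prev->size; // free fragment immediately before this Block
  }
  if (block->next != nullptr && !block->next->allocated) {
    free_size += block->next->size; // free fragment immediately after this Block
  }
  return free_size;
}
```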
BlockPool:
- A memory pool that stores Block pointers in a std::set, sorted in ascending order by the key (cuda_stream_id -> block size -> addr). All Blocks stored in a BlockPool are free.
```cpp
struct BlockPool {
  BlockPool(
      Comparison comparator,
      bool small,
      PrivatePool* private_pool = nullptr)
      : blocks(comparator), is_small(small), owner_PrivatePool(private_pool) {}
  std::set<Block*, Comparison> blocks;
  const bool is_small;
  PrivatePool* owner_PrivatePool; // for use with CUDA Graphs, not detailed here
};
```
- The DeviceCachingAllocator maintains two kinds of BlockPool (large_blocks and small_blocks), which hold larger and smaller Blocks respectively (to speed up large and small requests separately); Blocks <= 1MB are simply classified as small and Blocks > 1MB as large. Block and BlockPool can be understood intuitively from the figure below:
BlockPool diagram: the two 500MB Blocks on the left have the same size, so they are ordered by address, with the upper one at the smaller address.
“Summary”: Blocks within the Allocator are organized in two ways: one is explicitly organized in BlockPool (red-black tree) sorted by size; the other is implicitly organized in a doubly linked list for contiguous address Blocks (via the prev and next pointers within the structure), allowing for O(1) time to check if the previous and next Blocks are free, facilitating the merging of fragments when releasing the current Block.
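The ordering of the std::set is what makes the “explicit” organization work: Blocks are compared by stream first, then by size, then by address. A condensed sketch of the comparator (the source defines an equivalent BlockComparator; the exact casts here are illustrative):

```cpp
#include <cstdint>

// Sort key: (stream, size, ptr) ascending, so a lower_bound() on a (stream, size) search key
// lands on the smallest free Block that can satisfy a request on that stream.
bool block_comparator(const Block* a, const Block* b) {
  if (a->stream != b->stream) {
    return reinterpret_cast<uintptr_t>(a->stream) < reinterpret_cast<uintptr_t>(b->stream);
  }
  if (a->size != b->size) {
    return a->size < b->size;
  }
  return reinterpret_cast<uintptr_t>(a->ptr) < reinterpret_cast<uintptr_t>(b->ptr);
}
```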
2.2 malloc Function Analysis: Returns an Available Block (L466)
A Distinction:
For convenience, the rest of the text distinguishes between these two terms without restating the definitions.
- “size”: The requested size of memory allocation.
- “alloc_size”: The size actually requested from cudaMalloc.
A Hard-Coded Rule:
The actual alloc_size is determined based on size (get_allocation_size function, L1078):
- Allocate 2MB for sizes less than 1MB;
- Allocate 20MB for sizes between 1MB and 10MB;
- For sizes >= 10MB, allocate size rounded up to the nearest multiple of 2MB.
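The rounding rule above can be written down directly. A minimal sketch of get_allocation_size; the constant names mirror those used in the source (kSmallBuffer, kLargeBuffer, etc.), but treat the exact names as an assumption here:

```cpp
#include <cstddef>

constexpr size_t kSmallSize     = 1048576;   // 1 MB: boundary between "small" and "large" requests
constexpr size_t kSmallBuffer   = 2097152;   // 2 MB: cudaMalloc granularity for small requests
constexpr size_t kLargeBuffer   = 20971520;  // 20 MB: granularity for requests between 1 MB and 10 MB
constexpr size_t kMinLargeAlloc = 10485760;  // 10 MB
constexpr size_t kRoundLarge    = 2097152;   // 2 MB: rounding unit for requests >= 10 MB

size_t get_allocation_size(size_t size) {
  if (size <= kSmallSize) {
    return kSmallBuffer;                     // <= 1 MB  -> ask cudaMalloc for 2 MB
  } else if (size < kMinLargeAlloc) {
    return kLargeBuffer;                     // 1-10 MB  -> ask cudaMalloc for 20 MB
  } else {
    return kRoundLarge * ((size + kRoundLarge - 1) / kRoundLarge); // >= 10 MB -> round up to a 2 MB multiple
  }
}
```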
Finding a suitable Block involves “five steps”; if none of them yields a usable Block, the classic [CUDA out of memory. Tried to allocate …] error is raised (see “Cases of malloc Allocation Failure” below).
2.3 Step One: get_free_block Function (L1088)
“TLDR”: Attempt to find a suitably sized free Block in the Allocator’s own maintained pool.
- TLDR = Too Long; Didn’t Read
- Create a Block Key using the current (size, stream_id) pair to search in the corresponding BlockPool;
- The environment variable PYTORCH_CUDA_ALLOC_CONF specifies a threshold max_split_size_mb, and there are two scenarios where allocation will not occur in this step:
- The required size is less than the threshold but the found Block is larger than the threshold (to avoid wasting blocks);
- Both are greater than the threshold, but the Block's size exceeds the required size by more than the buffer (here 20MB, so the largest leftover fragment never exceeds the buffer size).
- This threshold max_split_size_mb involves an interesting feature, which will be discussed later.
- If a Block is successfully allocated, it is removed from the BlockPool; when it is later used up and released (free), it is inserted back (see free_block: L976).
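A condensed sketch of this lookup, assuming the Block struct and comparator from 2.1, the kLargeBuffer constant from the rounding sketch, and a max_split_size derived from max_split_size_mb (the real function operates on an AllocParams bundle; the signature here is simplified):

```cpp
#include <set>
#include <cstddef>
#include <cuda_runtime.h>

using BlockSet = std::set<Block*, bool (*)(const Block*, const Block*)>;

Block* try_get_free_block(BlockSet& pool, cudaStream_t stream, size_t size,
                          size_t max_split_size) {
  Block key{};                   // search key: only stream and size matter for the ordering
  key.stream = stream;
  key.size = size;
  auto it = pool.lower_bound(&key);
  if (it == pool.end() || (*it)->stream != stream) return nullptr;
  // a small request must not grab an oversize Block (keep those for large requests)
  if (size < max_split_size && (*it)->size >= max_split_size) return nullptr;
  // a large request may be rounded up, but only within the 20 MB buffer
  if (size >= max_split_size && (*it)->size >= size + kLargeBuffer) return nullptr;
  Block* block = *it;
  pool.erase(it);                // a Block leaves the pool while it is in use
  return block;
}
```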
2.4 Step Two: trigger_free_memory_callbacks Function (L1106)
“TLDR”: Manually perform a round of garbage collection, reclaiming unused Blocks, then execute step one.
- If the first step’s get_free_block fails, then in the second step, first call trigger_free_memory_callbacks, then call get_free_block again;
- trigger_free_memory_callbacks invokes the registered callbacks, which in practice call CudaIPCCollect, which in turn calls CudaIPCSentDataLimbo::collect (torch/csrc/CudaIPCTypes.cpp: L64);
- The CudaIPCSentDataLimbo class manages all Blocks whose reference count has not yet reached 0, so the call to collect is a lazy update: it is only invoked when no Block is available for allocation, to clean up the Blocks whose reference count has already dropped to 0 (it is worth mentioning that the class destructor calls collect first, see torch/csrc/CudaIPCTypes.cpp: L58);
- Related source code can be found in torch/csrc/CudaIPCTypes.h & .cpp.
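Inside malloc, these first two steps are wired together in a single short-circuit expression. A condensed sketch mirroring the structure of the source (params stands for the allocation request being processed):

```cpp
// Step one, then step two: only when the first pool lookup fails are the registered
// free-memory callbacks (CudaIPCCollect) fired, after which the pool is searched once more.
bool block_found = get_free_block(params)
    || (trigger_free_memory_callbacks(params) && get_free_block(params));
```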
2.5 Step Three: alloc_block Function (L1115)
“TLDR”: When the Allocator cannot find a suitable Block among the existing ones, it calls cudaMalloc to create a new Block.
- If the reuse of blocks in steps one and two fails, then use cudaMalloc to allocate memory, sized alloc_size;
- Note that a flag set_fraction limits the allocatable memory to the current remaining memory * fraction (the allocation fails if it would exceed this limit); the fraction is configured via torch.cuda.set_per_process_memory_fraction;
- The pointer to the newly allocated memory is used to create a new Block, whose device and cuda_stream_id match the caller's.
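A condensed sketch of this step, assuming the simplified Block struct above (the real alloc_block takes an AllocParams bundle and also updates allocator statistics):

```cpp
#include <cuda_runtime.h>

// Ask the driver for alloc_size bytes, honouring the optional fraction cap, and wrap the
// resulting pointer in a fresh Block bound to the caller's device and stream.
Block* alloc_new_block(size_t alloc_size, int device, cudaStream_t stream, BlockPool* pool,
                       bool set_fraction, size_t allowed_memory_maximum,
                       size_t& total_allocated_memory) {
  if (set_fraction && total_allocated_memory + alloc_size > allowed_memory_maximum) {
    return nullptr;                              // would exceed the configured fraction limit
  }
  void* ptr = nullptr;
  if (cudaMalloc(&ptr, alloc_size) != cudaSuccess) {
    return nullptr;                              // the driver has no suitable region left
  }
  total_allocated_memory += alloc_size;
  Block* block = new Block{};                    // metadata only; the memory itself is at ptr
  block->device = device;
  block->stream = stream;
  block->size = alloc_size;
  block->pool = pool;
  block->ptr = ptr;
  return block;
}
```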
The above steps attempt to find some “free memory”, while the following two steps attempt to perform “fragmentation management” to create a larger contiguous block of memory.
2.6 Step Four: release_available_cached_blocks Function (L1175)
“TLDR”: First release some larger Blocks in its own pool, then try to allocate using cudaMalloc
- If the alloc_block above fails, this function is tried first: starting from the largest releasable cached Block, it releases Blocks from large to small until the total released size >= size (note that only Blocks larger than the threshold max_split_size_mb are released in this step, i.e., the larger cached Blocks are freed first);
- The actual release is done by release_block (L1241): it mainly cudaFrees the pointer, handles some CUDA Graphs dependencies, updates other data structures, and finally removes the Block from its BlockPool;
- The Blocks that can be released in the entire BlockPool (as a reminder, the Allocator has two BlockPools, here referring to the initially specified large or small pool based on size) need to meet two conditions: the same cuda_stream_id and the size must be greater than the threshold max_split_size_mb. If the total space released by such Blocks is still less than size, then this step will fail.
- After a batch of Blocks has been released, the alloc_block function from step three is called again to create a new Block.
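A simplified sketch of the release loop in this step, reusing the BlockSet alias from the step-one sketch (the source first tries to find a single releasable Block that is already large enough, and only then falls back to this largest-first loop; that refinement is omitted here):

```cpp
#include <set>
#include <vector>
#include <cstddef>
#include <cuda_runtime.h>

// Walk the pool from its largest Block downwards and release oversize cached Blocks
// (size >= max_split_size) on the requesting stream until the freed total covers `size`.
bool release_oversize_blocks(BlockSet& pool, cudaStream_t stream, size_t size,
                             size_t max_split_size) {
  std::vector<Block*> victims;
  size_t released = 0;
  for (auto it = pool.rbegin(); it != pool.rend() && released < size; ++it) {
    Block* block = *it;
    if (block->stream != stream || block->size < max_split_size) continue;
    victims.push_back(block);
    released += block->size;
  }
  if (released < size) return false;             // even the big cached Blocks are not enough
  for (Block* block : victims) {
    cudaFree(block->ptr);                        // release_block(): return the memory to CUDA
    pool.erase(block);
    delete block;
  }
  return true;
}
```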
2.7 Step Five: release_cached_blocks Function (L1214)
- If releasing part of the Blocks is still not enough, all Blocks in the Allocator's large/small pools are released (again via release_block: L1241), and alloc_block is called once more.
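Putting the five steps together, the decision chain inside DeviceCachingAllocator::malloc is a single short-circuit expression. A condensed rendering that mirrors the structure of the source (helper signatures simplified):

```cpp
bool block_found =
    // 1. search the matching BlockPool for a cached free Block
    get_free_block(params)
    // 2. run the registered free-memory callbacks, then search again
    || (trigger_free_memory_callbacks(params) && get_free_block(params))
    // 3. cudaMalloc a brand-new Block of alloc_size bytes
    || alloc_block(params, /*isRetry=*/false)
    // 4. release cached oversize Blocks (>= max_split_size_mb) and retry cudaMalloc
    || (release_available_cached_blocks(params) && alloc_block(params, /*isRetry=*/false))
    // 5. release everything cached in the pools and retry cudaMalloc one last time
    || (release_cached_blocks() && alloc_block(params, /*isRetry=*/true));
// if block_found is still false, the classic "CUDA out of memory. Tried to allocate ..."
// error described in 2.8 is raised.
```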
2.8 Cases of malloc Allocation Failure
The classic CUDA out of memory. Tried to allocate … error will be raised, for example:
CUDA out of memory. “Tried to allocate” 1.24 GiB (GPU 0; 15.78 GiB “total capacity”; 10.34 GiB “already allocated”; 435.50 MiB “free”; 14.21 GiB “reserved” in total by PyTorch)
- “Tried to allocate”: Refers to the expected alloc_size during this malloc;
- “Total capacity”: The total memory of the device returned by cudaMemGetInfo;
- “Already allocated”: The total size recorded of the requested allocations so far;
- “Free”: The remaining memory of the device returned by cudaMemGetInfo;
- “Reserved”: The total size of all Blocks in the BlockPools plus the total size of the already allocated Blocks.
That is, [reserved] = [already allocated] + [sum size of 2 BlockPools]
Note that reserved + free does not equal total capacity, because reserved only records the memory allocated through PyTorch; if the user manually calls cudaMalloc or allocates memory by other means, it cannot be tracked in this error message (and since the memory allocated by PyTorch usually accounts for most of the total, allocation failures are also usually reported by PyTorch).
In this example, the device has only 435.5MB left, which is not enough for 1.24GiB, while PyTorch has reserved 14.21GiB (stored in Blocks), of which 10.34GiB is allocated, leaving roughly 3.9GiB. So why can't 1.24GiB be allocated from this remaining 3.9GiB? The reason can only be fragmentation, and in a form that even the cleanup steps above cannot fix.
3. Key Features for Implementing Automatic Fragmentation Management: Split
As mentioned earlier, the environment variable specifies a threshold max_split_size_mb, which actually indicates the maximum size of a Block that can be “split”.
As explained in 2.2, the size of the Block that alloc_block creates through cudaMalloc is determined by alloc_size rather than by the size requested through malloc. Therefore, when the five steps above successfully return a Block, the Allocator uses the function should_split (L1068) to check whether it should be split:
- Because more memory may be allocated than requested, an internal fragment is left inside the Block; its size is remaining = alloc_size - size;
- If the Block belongs to small_blocks and remaining >= 512 bytes, or if it belongs to large_blocks, remaining exceeds 1MB, and the requested size does not exceed the threshold (so oversized Blocks reserved for large requests are never split), then the Block will be split (a condensed sketch follows).
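A condensed sketch of should_split, reusing kSmallSize from the rounding sketch; kMinBlockSize = 512 bytes follows the rule stated above:

```cpp
constexpr size_t kMinBlockSize = 512; // smallest fragment worth keeping as a separate Block

bool should_split(const Block* block, size_t size, size_t max_split_size) {
  size_t remaining = block->size - size;          // internal fragment left inside the Block
  if (block->pool->is_small) {
    return remaining >= kMinBlockSize;            // small pool: split if >= 512 B is left over
  }
  // large pool: only requests below the threshold are split, and only when more than 1 MB remains
  return size < max_split_size && remaining > kSmallSize;
}
```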
The split operation is very simple: the current Block is divided into two Blocks, the first with size exactly equal to the requested allocation size and the second with size remaining, linked to the first Block's next pointer (this process can be seen in the source around L570-L584). The addresses of the two Blocks are naturally contiguous. As the program runs, larger Blocks (as long as they remain below the threshold max_split_size_mb) are continually split into smaller Blocks. It is worth noting that, since the only way to produce a new Block is step three's alloc_block, i.e., cudaMalloc, there is no guarantee that a new Block is address-contiguous with any other Block; therefore, every contiguous address space maintained by a doubly linked list originates from a single initially allocated Block.
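A simplified sketch of the split itself (the source instead allocates a new Block object for the head and keeps the original object as the remainder, but the resulting layout is the same):

```cpp
// Cut `block` into [block (size bytes, returned to the caller)][remaining (free)], keeping the
// two address-contiguous and linked through prev/next; the remainder goes back to the pool.
Block* split_block(Block* block, size_t size, BlockPool& pool) {
  Block* remaining = new Block(*block);                    // same device/stream/pool as the original
  remaining->ptr = static_cast<char*>(block->ptr) + size;  // remainder starts right after `size` bytes
  remaining->size = block->size - size;
  remaining->allocated = false;
  remaining->prev = block;                                 // ... <-> block <-> remaining <-> old next
  remaining->next = block->next;
  if (remaining->next != nullptr) {
    remaining->next->prev = remaining;
  }
  block->next = remaining;
  block->size = size;                                      // the caller gets exactly `size` bytes
  pool.blocks.insert(remaining);                           // the remainder is cached as a free Block
  return block;
}
```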
A contiguous address space (composed of Blocks organized in a doubly linked list) is shown in the figure below:
Internal doubly linked list of contiguous space (the arrows of the list have been omitted)
When a Block is released, the allocator checks whether its prev and next pointers are non-null and, if so, whether those Blocks are in use; if a neighbour is free, try_merge_blocks (L1000) merges it in. Since this check is performed on every release, two adjacent free Blocks can never coexist, so only the immediate neighbours need to be examined; the check can be seen in the free_block function (L952). And since releasing a Block is the only event that can make free Blocks become address-contiguous, fragmentation management only needs to be performed when a Block is released.
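A condensed sketch of try_merge_blocks (the source additionally refuses to merge Blocks that are still recorded as used by other streams):

```cpp
// Absorb the free neighbour `src` into `dst`, fix the doubly linked list, and remove `src`
// from the pool; returns the number of bytes gained (0 if the neighbour cannot be merged).
size_t try_merge_blocks(Block* dst, Block* src, BlockPool& pool) {
  if (src == nullptr || src->allocated || src->event_count > 0) {
    return 0;                          // neighbour missing, still in use, or has pending events
  }
  if (dst->prev == src) {              // layout [src][dst]: dst grows backwards
    dst->ptr = src->ptr;
    dst->prev = src->prev;
    if (dst->prev) dst->prev->next = dst;
  } else {                             // layout [dst][src]: dst grows forwards
    dst->next = src->next;
    if (dst->next) dst->next->prev = dst;
  }
  const size_t subsumed = src->size;
  dst->size += subsumed;
  pool.blocks.erase(src);              // src no longer exists as an independent free Block
  delete src;
  return subsumed;
}
```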
Regarding the threshold max_split_size_mb: intuitively, one might expect the larger Blocks to be the ones that get split into smaller pieces, yet here it is the Blocks below the threshold that are split. My personal understanding is that PyTorch assumes that, statistically, most memory requests fall below some threshold, and Blocks of those sizes are handled in the usual way with splitting and fragmentation management; for Blocks above the threshold, PyTorch assumes that such large requests carry high overhead (time, risk of failure) and are better kept intact for future large requests, hence they are not split. By default, the threshold variable max_split_size_mb is set to INT_MAX, meaning all Blocks can be split.
Author | Mi Aro
Source | https://zhuanlan.zhihu.com/p/486360176
Editor: Huang Jiyan