Author丨Mialo@Zhihu
Source丨https://zhuanlan.zhihu.com/p/486360176
1. Background Introduction
Analyzing the PyTorch memory management mechanism primarily aims to reduce the impact of “memory fragmentation”. A simple example is as follows:

As shown in the figure above, suppose we want to allocate 800MB of memory. Although the total free memory is 1000MB, in the upper figure it consists of two non-contiguous 500MB blocks, so there is no single contiguous region large enough for the 800MB request; in the lower figure, the two 500MB free blocks are adjacent, so they can be consolidated into a single 1000MB block, which is enough to satisfy the 800MB request. The situation in the upper figure is what we call “memory fragmentation”.
“Solution”: When multiple Tensors could be released, prioritize releasing those whose neighboring memory is already free. This makes it easier for PyTorch to consolidate the freed fragments into a larger block.
“Core Issue/Prerequisite”: Can we obtain, at very low cost (O(1) or O(log n)), the Block we want to release together with the sizes of its adjacent free space? If so, we can sort the candidate Tensors and prioritize releasing the one with the largest “adjacent free blocks + itself” size.
This prerequisite can be transformed into finding the following two functions (pseudo code):
block = get_block(tensor) # Find the Block storing this Tensor
size = get_adjacent_free_size(block) # Return the free space before and after this Block for sorting
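For illustration only, here is a hedged sketch of how these two functions could be used to pick which Tensor to release; get_block and get_adjacent_free_size are the hypothetical helpers from the pseudo code above, not real PyTorch APIs.

def pick_tensor_to_release(releasable_tensors):
    def reclaimable(tensor):
        block = get_block(tensor)                     # Block storing this Tensor
        free_around = get_adjacent_free_size(block)   # free space before + after the Block
        return free_around + block.size               # contiguous region freed by releasing it
    # prioritize the Tensor whose release yields the largest contiguous region
    return max(releasable_tensors, key=reclaimable)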
Questions that may be answered after studying the PyTorch memory management mechanism:
- Why does the error message indicate there is enough memory, but I still encounter OOM?
- What is the multi-level allocation mechanism of memory? Why is it designed this way?
2. Source Code Interpretation
This analysis mainly looks at c10/cuda/CUDACachingAllocator.cpp and the DeviceCachingAllocator class (L403); line numbers below refer to this file.
Link: https://github.com/pytorch/pytorch/blob/a5b848aec10b15b1f903804308eed4140c5263cb/c10/cuda/CUDACachingAllocator.cpp#L403
2.1 Main Data Structures
Block:
- The basic unit for allocating/managing memory. A triplet (stream_id, size, ptr) uniquely identifies a Block, i.e., the Block maintains a ptr pointing to a memory region of size bytes.
- All Blocks with contiguous addresses (whether free or not, as long as they were obtained from Allocator::malloc) are organized in a doubly linked list, so that when a Block is released, its neighbors can be checked quickly and the released Block can be merged with any adjacent free Blocks into one.
struct Block {
int device; // gpu
// Block is bound to the stream of the caller at allocation, and cannot be used by other streams even if released
cudaStream_t stream; // allocation stream
stream_set stream_uses; // streams on which the block was used
size_t size; // block size in bytes
BlockPool* pool; // owning memory pool
void* ptr; // memory address
bool allocated; // in-use flag
Block* prev; // prev block if split from a larger allocation
Block* next; // next block if split from a larger allocation
int event_count; // number of outstanding CUDA events
};
BlockPool:
- A memory pool that stores pointers to Blocks in a std::set, sorted in ascending order by the key (cuda_stream_id -> block size -> addr). All Blocks stored in a BlockPool are free.
struct BlockPool {
BlockPool(
Comparison comparator,
bool small,
PrivatePool* private_pool = nullptr)
: blocks(comparator), is_small(small), owner_PrivatePool(private_pool) {}
std::set<Block*, Comparison> blocks;
const bool is_small;
PrivatePool* owner_PrivatePool; // for CUDA Graphs, not discussed here
};
- The DeviceCachingAllocator maintains two BlockPools (small_blocks, large_blocks) that store smaller and larger blocks respectively (to speed up small and large requests), simply categorizing blocks <= 1MB as small and blocks > 1MB as large. Block and BlockPool can be understood intuitively from the figure below:

“Summary”: Blocks are organized in two ways within the Allocator: explicitly organized in BlockPool (red-black tree) sorted by size; and implicitly organized in a doubly linked list for Blocks with contiguous addresses (through prev and next pointers in the structure), allowing O(1) time to check if adjacent Blocks are free, facilitating the merging of fragments when releasing the current Block.
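To make the ordering concrete, here is a rough Python analogue (a sketch, not the PyTorch source) of how a pool keyed by (stream, size, address) supports a best-fit lookup; the Block objects are assumed to carry the stream/size/ptr fields of the struct shown earlier.

import bisect

def pool_key(block):
    return (block.stream, block.size, block.ptr)

def find_best_fit(sorted_free_blocks, stream, size):
    # sorted_free_blocks must already be sorted by pool_key
    keys = [pool_key(b) for b in sorted_free_blocks]
    i = bisect.bisect_left(keys, (stream, size, 0))
    if i < len(sorted_free_blocks) and sorted_free_blocks[i].stream == stream:
        return sorted_free_blocks[i]   # smallest free Block >= size on this stream
    return None                        # nothing suitable: fall through to the next malloc step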
2.2 malloc Function Analysis: Returns a Usable Block (L466)
A clarification: for brevity, the rest of the article distinguishes between the following two terms without repeating their definitions.
- “size”: the memory size requested by the caller;
- “alloc_size”: the memory size actually requested from cudaMalloc.
A hard-coded rule: the actual alloc_size is determined from size (get_allocation_size function, L1078), as sketched below:
- for sizes less than 1MB, allocate 2MB;
- for sizes between 1MB and 10MB, allocate 20MB;
- for sizes >= 10MB, allocate size rounded up to the nearest multiple of 2MB.
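A minimal runnable sketch of this rounding rule (the constant names mirror those in the source, but treat the snippet as an illustration rather than the implementation):

MB = 1024 * 1024
kSmallSize = 1 * MB       # requests <= 1MB are "small"
kSmallBuffer = 2 * MB     # cudaMalloc granularity for small requests
kLargeBuffer = 20 * MB    # cudaMalloc granularity for 1MB-10MB requests
kMinLargeAlloc = 10 * MB
kRoundLarge = 2 * MB      # requests >= 10MB round up to a 2MB multiple

def get_allocation_size(size):
    if size <= kSmallSize:
        return kSmallBuffer
    elif size < kMinLargeAlloc:
        return kLargeBuffer
    else:
        return kRoundLarge * ((size + kRoundLarge - 1) // kRoundLarge)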
Searching for a suitable block involves “five steps”; if none of these steps find a suitable Block, the classic [CUDA out of memory. Tried to allocate …] error will be raised (see the section on “allocation failure” for details).
2.3 Step One: get_free_block Function (L1088)
“TLDR”: Attempt to find a suitably sized free Block in the pool maintained by the Allocator.
- TLDR = Too Long; Didn't Read
- A Block key is constructed from the current (size, stream_id) pair and used to search the corresponding BlockPool;
- The environment variable PYTORCH_CUDA_ALLOC_CONF specifies a threshold max_split_size_mb, and there are two cases in which this step will not hand out a Block (both are sketched after this list):
  - the requested size is below the threshold but the Block found is above it (to avoid wasting a large Block on a small request);
  - both are above the threshold, but the Block found is excessively larger than the requested size (the buffer here is 20MB, which ensures that the largest leftover fragment does not exceed the buffer size).
- This threshold max_split_size_mb has an interesting characteristic that will be discussed later.
- If a Block is successfully obtained, it is removed from the BlockPool; when it is later freed, it will be inserted back (see free_block: L976).
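For concreteness, a hedged sketch of the two rejection rules above (kLargeBuffer stands in for the 20MB buffer; this illustrates the rules as described, not the literal source):

kLargeBuffer = 20 * 1024 * 1024  # the 20MB buffer mentioned above

def block_is_acceptable(found_size, requested_size, max_split_size):
    # rule 1: a small request must not grab an "unsplittable" oversize Block
    if requested_size < max_split_size and found_size >= max_split_size:
        return False
    # rule 2: an oversize request must not accept a Block that would leave
    # a leftover fragment larger than the 20MB buffer
    if requested_size >= max_split_size and found_size >= requested_size + kLargeBuffer:
        return False
    return True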
2.4 Step Two: trigger_free_memory_callbacks Function (L1106)
“TLDR”: Perform a manual garbage collection to reclaim unused Blocks, then execute step one.
- If step one's get_free_block fails, step two first calls trigger_free_memory_callbacks and then calls get_free_block again;
- trigger_free_memory_callbacks essentially calls the CudaIPCCollect function through registered callbacks, which in turn calls the CudaIPCSentDataLimbo::collect function (torch/csrc/CudaIPCTypes.cpp : L64);
- The CudaIPCSentDataLimbo class manages all Blocks whose reference count is non-zero, so calling collect is a lazy update: the “Blocks whose reference count has already dropped to 0” are only cleaned up when no Block is left to allocate (notably, the class's destructor calls collect first, see torch/csrc/CudaIPCTypes.cpp : L58);
- The relevant source code is in torch/csrc/CudaIPCTypes.h & .cpp.
2.5 Step Three: alloc_block Function (L1115)
“TLDR”: If the Allocator cannot find a suitable Block, it calls cudaMalloc to create a new Block.
- If reusing Blocks in steps one and two fails, cudaMalloc is used to allocate memory of size alloc_size;
- Note that when set_fraction is enabled, the Allocator limits the memory it may allocate to a fraction of the device memory (if the requested allocation would exceed this limit, this step fails); where this fraction is configured is still unclear to the author (TODO; see the note after this list);
- The newly allocated memory pointer is used to create a new Block, whose device and cuda_stream_id are consistent with the caller's.
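For what it's worth, the user-facing knob that appears to control this fraction is torch.cuda.set_per_process_memory_fraction; treat the connection as my assumption rather than something stated in the article.

import torch

# Cap this process's CUDA caching allocator at 50% of GPU 0's memory;
# allocations beyond the cap fail inside alloc_block instead of reaching cudaMalloc.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)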
The above steps all attempt to find some “free memory”, while the next two steps attempt to perform “fragmentation management” to create a larger block of memory.
2.6 Step Four: release_available_cached_blocks Function (L1175)
“TLDR”: First release some larger Blocks in its pool, then try to allocate with cudaMalloc.
- If alloc_block above fails, this function tries, starting from the largest Block smaller than size, to release Blocks in descending size order until the total released size >= size (note that this step only releases Blocks larger than the threshold max_split_size_mb, i.e., it releases the larger cached Blocks first); a sketch of this strategy follows the list;
- The actual release is done by release_block (L1241), which mainly calls cudaFree on the pointer, handles some CUDA graph dependencies, updates other data structures, and finally removes the Block from its BlockPool;
- Currently, for Blocks in the BlockPool (as a reminder, the Allocator has two BlockPools; here it is the large or small pool initially chosen based on size) to be released, two conditions must hold: the cuda_stream_id must match, and the Block size must be greater than the threshold max_split_size_mb. If the total space released from such Blocks is still smaller than size, this step fails.
- After releasing a batch of Blocks, the alloc_block function from step three is executed again to create a new Block.
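A hedged sketch of this release strategy (release_block here is a stand-in for the real routine that calls cudaFree and updates the bookkeeping; the real code also handles the case where a single cached Block already covers the request):

def release_available_cached_blocks(free_blocks, requested_size, max_split_size, release_block):
    # only "unsplittable" Blocks at or above the threshold are candidates,
    # released from largest to smallest until the request is covered
    candidates = sorted(
        (b for b in free_blocks if b.size >= max_split_size),
        key=lambda b: b.size,
        reverse=True,
    )
    released = 0
    for block in candidates:
        if released >= requested_size:
            break
        release_block(block)
        released += block.size
    return released >= requested_size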
2.7 Step Five: release_cached_blocks Function (L1214)
- If releasing some Blocks is still not enough, all Blocks in the Allocator's large/small pools are released (again via release_block: L1241), and the alloc_block function is called once more.
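Incidentally, releasing every cached Block back to CUDA is also what the user-facing torch.cuda.empty_cache() does, which is handy when another process or library needs the memory:

import torch

x = torch.empty(1024, 1024, 256, device="cuda")  # roughly a 1GB allocation
del x                        # the Block goes back to the BlockPool; memory stays "reserved"
torch.cuda.empty_cache()     # cudaFree all unused cached Blocks, shrinking "reserved"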
2.8 Conditions for malloc Allocation Failure
It will raise the classic CUDA out of memory. Tried to allocate ... error, for example:
CUDA out of memory. “Tried to allocate” 1.24 GiB (GPU 0; 15.78 GiB “total capacity”; 10.34 GiB “already allocated”; 435.50 MiB “free”; 14.21 GiB “reserved” in total by PyTorch)
- “Tried to allocate”: the alloc_size expected for this malloc;
- “total capacity”: total memory reported by cudaMemGetInfo;
- “already allocated”: the sum of the sizes requested so far, as recorded by the Allocator;
- “free”: remaining memory reported by cudaMemGetInfo;
- “reserved”: the total size of all Blocks in the BlockPools plus the size of Blocks already allocated.
That is, [reserved] = [already allocated] + [sum size of the 2 BlockPools].
Note that reserved + free does not equal total capacity, because reserved only records memory allocated through PyTorch; if the user manually calls cudaMalloc or allocates memory through other means, that memory cannot be tracked in this error message (and since most memory is generally allocated through PyTorch, OOM errors are usually reported by PyTorch as well).
In this case, the device has only 435.5MB free, which is not enough for 1.24GB, while PyTorch has reserved 14.21GB (stored in Blocks), of which 10.3GB has been allocated, leaving 3.9GB cached. So why can't the 1.24GB be allocated from this remaining 3.9GB? The reason is certainly fragmentation, in a situation that even the consolidation steps above could not resolve.
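To inspect these quantities outside of an OOM message, the allocator's counters can be queried directly (a minimal example; torch.cuda.mem_get_info requires a reasonably recent PyTorch):

import torch

allocated = torch.cuda.memory_allocated()   # "already allocated"
reserved = torch.cuda.memory_reserved()     # "reserved" = allocated + cached BlockPools
free, total = torch.cuda.mem_get_info()     # "free" / "total capacity" from cudaMemGetInfo
print(f"allocated={allocated}, reserved={reserved}, "
      f"cached={reserved - allocated}, free={free}, total={total}")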
3. Key Features for Implementing Automatic Fragmentation Management: Split
As previously mentioned, an environment variable specifies a threshold max_split_size_mb, which indicates the maximum size of a Block that can be “split”.
As explained at the beginning of Section 2.2, the alloc_block function creates the new Block based on alloc_size rather than the size requested through malloc. Therefore, if malloc successfully returns a Block through any of the five steps above, the Allocator checks it with the should_split function (L1068):
- Because more memory was allocated than requested, the fragment left inside the Block, called remaining, has size alloc_size - size;
- If the Block belongs to small_blocks and remaining >= 512 bytes, or if remaining >= 1MB and the Block's size does not exceed the aforementioned threshold, then the Block will be split.
The splitting operation itself is straightforward: the current Block is divided into two Blocks, the first with the requested size and the second with size remaining, and the second is linked to the next pointer of the current Block (see the source code at L570-L584); their addresses are thus naturally contiguous. As the program runs, larger Blocks (as long as they stay below the threshold max_split_size_mb) keep being split into smaller ones. Note that since the only way to produce a new Block is via cudaMalloc in the alloc_block function, there is no guarantee that a new Block is address-contiguous with any other Block; hence all Blocks kept in one doubly linked list over a contiguous address range are derived from a single initially allocated Block. The should_split check and the split operation are sketched below.
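A minimal, self-contained sketch of both steps (the Block dataclass is a stand-in for the C++ struct; constants and conditions follow the description above rather than the literal source):

from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024
kMinBlockSize = 512  # bytes

@dataclass
class Block:
    ptr: int                           # start address
    size: int                          # size in bytes
    allocated: bool = False
    prev: Optional["Block"] = None
    next: Optional["Block"] = None

def should_split(block, requested_size, is_small_pool, max_split_size):
    remaining = block.size - requested_size      # the would-be fragment
    if is_small_pool:
        return remaining >= kMinBlockSize
    return block.size <= max_split_size and remaining >= 1 * MB

def split_block(block, requested_size):
    # carve requested_size off the front; the leftover becomes a new free
    # Block linked right after it, so their addresses stay contiguous
    remaining = Block(ptr=block.ptr + requested_size,
                      size=block.size - requested_size,
                      prev=block, next=block.next)
    if block.next is not None:
        block.next.prev = remaining
    block.next = remaining
    block.size = requested_size
    block.allocated = True
    return remaining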
A segment of contiguous space composed of Blocks organized by a doubly linked list is shown in the figure below:

When a Block is released, the Allocator checks whether its prev and next pointers are null and, if not, whether those neighboring Blocks are in use; any free neighbor is merged with the released Block via try_merge_blocks (L1000). Because this check runs on every release, two adjacent free Blocks can never coexist, so only the immediate neighbors need to be examined. The check itself happens in the free_block function (L952). Furthermore, since only releasing a Block can make several free Blocks contiguous, fragmentation management only needs to happen when a Block is released.
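A hedged sketch of this merge-on-free logic, reusing the Block dataclass from the split sketch above (the real free_block / try_merge_blocks also move Blocks in and out of the BlockPool and update statistics):

def try_merge(dst, src):
    # fold a free, address-contiguous neighbor `src` into `dst`
    if src is None or src.allocated:
        return
    if src.ptr < dst.ptr:                # src is the previous neighbor
        dst.ptr = src.ptr
        dst.prev = src.prev
        if dst.prev is not None:
            dst.prev.next = dst
    else:                                # src is the next neighbor
        dst.next = src.next
        if dst.next is not None:
            dst.next.prev = dst
    dst.size += src.size

def free_block(block):
    block.allocated = False
    # each neighbor is merged only if it is free; since this runs on every
    # release, two free Blocks never remain adjacent
    try_merge(block, block.prev)
    try_merge(block, block.next)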
Regarding the threshold max_split_size_mb, one might intuitively expect it to mark the Blocks that are large enough to be worth splitting, but here it is the opposite: only Blocks smaller than the threshold can be split. My understanding is that PyTorch assumes most memory requests fall below some threshold, so Blocks of those sizes are split and defragmented in the conventional way; for larger Blocks, however, PyTorch considers the overhead of requesting them (time, risk of failure) to be significant, so they are better kept whole for future large requests and are not suitable for splitting. By default, the threshold max_split_size_mb is set to INT_MAX, meaning all Blocks can be split.
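If one wants to change the default, the threshold is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable mentioned earlier; it has to be set before the CUDA caching allocator is initialized, so exporting it in the shell before launching the process is safest. A minimal example (512 is an arbitrary illustrative value):

import os
# must happen before the allocator reads its config (i.e., before CUDA is used)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch
x = torch.empty(1024, device="cuda")   # Blocks larger than 512MB will no longer be split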
4. Summary
Analyzing the PyTorch memory management mechanism helps us find better memory optimization strategies. We are currently collaborating with the author of Dynamic Tensor Rematerialization (DTR), @RoundCornerKnightMarisa, to promote the industrial application of DTR on PyTorch; one goal is to address, during training, the memory fragmentation issue described in this article. A forthcoming article will introduce the problems encountered while implementing DTR, along with solutions and benchmarks, so stay tuned.