Introduction
CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture launched by NVIDIA that lets GPUs solve complex computational problems. Developers can write programs for the CUDA architecture in C, and these programs run with very high performance on CUDA-capable processors.
Editor | Heart of Autonomous Driving · Author | Yuppie@Zhihu
Link | https://zhuanlan.zhihu.com/p/595851188
Recently I have been working on operator optimization, and in discussions with others I found that writing operators in CUDA has a real barrier to entry. Although there are many excellent guides on Zhihu, and PyTorch provides its own tutorial (the link is at the end of this article), the explanation of each step and of the common pitfalls still seems to lack detail.
Drawing on my own painful experience when getting started, a simple and clear demo can greatly reduce the cost of getting up to speed. I will therefore use summing a single array (hereinafter sum_single) and adding two arrays (hereinafter sum_double) as examples to walk through the complete framework for implementing PyTorch operators in CUDA. For the full code, see [1]:
https://github.com/Yuppie898988/CudaDemo
Framework Structure
├── ops
│   ├── __init__.py
│   ├── ops_py
│   │   ├── __init__.py
│   │   └── sum.py
│   └── src
│       ├── reduce_sum
│       │   ├── sum.cpp
│       │   └── sum_cuda.cu
│       └── sum_two_arrays
│           ├── two_sum.cpp
│           └── two_sum_cuda.cu
├── README.md
├── setup.py
└── test_ops.py
The demo structure is as follows:
- ops/src/ contains the CUDA/C++ code
- setup.py is the configuration file for compiling the operators
- ops/ops_py/ contains the operator functions wrapped with PyTorch
- test_ops.py is the test script that calls the operators
CUDA/C++
To implement an operator, you need a .cu (CUDA) file for the kernel function and a .cpp (C++) file for the wrapper function, which calls PYBIND11_MODULE to encapsulate the operator.
Note: the CUDA file and the C++ file must not share the same file name, otherwise compilation will fail! (This is why the demo pairs sum.cpp with sum_cuda.cu.)
We will use src/sum_two_arrays/ as the example.
// src/sum_two_arrays/two_sum_cuda.cu
#include <cstdio>

#define THREADS_PER_BLOCK 256
#define WARP_SIZE 32
#define DIVUP(m, n) ((m + n - 1) / n)

// Each thread adds one pair of elements: c[i] = a[i] + b[i]
__global__ void two_sum_kernel(const float* a, const float* b, float* c, int n){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n){
        c[idx] = a[idx] + b[idx];
    }
}

void two_sum_launcher(const float* a, const float* b, float* c, int n){
    dim3 blockSize(DIVUP(n, THREADS_PER_BLOCK));  // enough blocks to cover all n elements
    dim3 threadSize(THREADS_PER_BLOCK);
    two_sum_kernel<<<blockSize, threadSize>>>(a, b, c, n);
}
The core is the two_sum_kernel kernel function, which implements the element-wise addition. The two_sum_launcher function below it allocates the thread blocks and launches the kernel. For example, with n = 1000 and 256 threads per block, DIVUP(1000, 256) = 4 blocks (1024 threads) are launched, and the if (idx < n) guard masks out the 24 extra threads.
// src/sum_two_arrays/two_sum.cpp
#include <torch/extension.h>
#include <torch/serialize/tensor.h>

#define CHECK_CUDA(x) \
    TORCH_CHECK(x.type().is_cuda(), #x, " must be a CUDA tensor ")
#define CHECK_CONTIGUOUS(x) \
    TORCH_CHECK(x.is_contiguous(), #x, " must be contiguous ")
#define CHECK_INPUT(x) \
    CHECK_CUDA(x); \
    CHECK_CONTIGUOUS(x)

// declared here, defined in two_sum_cuda.cu
void two_sum_launcher(const float* a, const float* b, float* c, int n);

void two_sum_gpu(at::Tensor a_tensor, at::Tensor b_tensor, at::Tensor c_tensor){
    CHECK_INPUT(a_tensor);
    CHECK_INPUT(b_tensor);
    CHECK_INPUT(c_tensor);

    const float* a = a_tensor.data_ptr<float>();
    const float* b = b_tensor.data_ptr<float>();
    float* c = c_tensor.data_ptr<float>();

    int n = a_tensor.size(0);
    two_sum_launcher(a, b, c, n);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &two_sum_gpu, "sum two arrays (CUDA)");
}
In the C++ file, the macros defined at the top check that the input tensors are on the GPU (CHECK_CUDA) and that their elements are stored contiguously in memory (CHECK_CONTIGUOUS). two_sum_launcher is a declaration of the function defined in the CUDA file.
The two_sum_gpu function is the interface to Python; its arguments are PyTorch Tensors. Here the Tensors are validated with the CHECK macros (optional) and raw data pointers are obtained with .data_ptr. For the use of Tensors in C++, see [2].
Finally, PYBIND11_MODULE encapsulates the whole operator so that the C++ function can be called from Python [3]. For other custom operators, only the three arguments of m.def() need to change:
- "forward": the method name of the operator. If the whole module is named sum_double, the operator can be called from Python as sum_double.forward
- &two_sum_gpu: the function to bind; change it according to the function you implement
- "sum two arrays (CUDA)": the operator description, shown when calling help(sum_double.forward) on the Python side
Some may wonder why the operator and the module are separated. If sum_double contains many different functions, multiple operators can be bound in one module: simply write multiple m.def() calls inside PYBIND11_MODULE, and the different operators are then called as sum_double.xxx. A minimal usage sketch of the compiled module follows.
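For illustration, here is a minimal sketch of calling the compiled sum_double module directly from Python, assuming it has already been built with setup.py as described in the next section; the in-place output argument follows the two_sum_gpu signature above:

import torch
import sum_double  # the compiled extension module named in setup.py

a = torch.randn(1024, device='cuda')
b = torch.randn(1024, device='cuda')
c = torch.zeros_like(a)           # output buffer, filled in place by the operator

sum_double.forward(a, b, c)       # module name + method name from m.def("forward", ...)
print(torch.allclose(c, a + b))   # True
help(sum_double.forward)          # prints the "sum two arrays (CUDA)" description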
setup.py Compilation Configuration
Create a setup.py file in the root directory of the project to configure the compilation; setuptools is used to package the operators [4].
from setuptools import find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='CudaDemo',
    packages=find_packages(),
    version='0.1.0',
    author='xxx',
    ext_modules=[
        CUDAExtension(
            'sum_single',  # operator (module) name
            ['./ops/src/reduce_sum/sum.cpp',
             './ops/src/reduce_sum/sum_cuda.cu',]
        ),
        CUDAExtension(
            'sum_double',
            ['./ops/src/sum_two_arrays/two_sum.cpp',
             './ops/src/sum_two_arrays/two_sum_cuda.cu',]
        ),
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)
The items that need to be modified in this file are:
- name: the package name
- version: the package version number
- author: the author's name
- ext_modules: the C/C++ extensions to compile, a list in which each element describes one module (as mentioned at the end of the CUDA/C++ section, a module can contain multiple operators)
CUDAExtension
In ext_modules, use CUDAExtension to specify the CUDA/C++ source files: the first argument is the name of the module, and the second is a list of all its source file paths.
The module name here and the operator method name defined in CUDA/C++ with m.def() jointly determine how the operator is called. For example, the module for adding two arrays is named sum_double and the operator method name is forward, so the operator is called from Python as sum_double.forward().
It is worth mentioning that packages is of type list[str] and lists the local packages to include. Here find_packages() is used to find all local packages. Strictly speaking we only need the files under ops/src/, so packages=['ops/src'] would also compile normally, but for convenience we simply use find_packages().
PyTorch Wrapping
To make the custom operator support forward and backward propagation properly, we wrap it by inheriting from torch.autograd.Function [5]. We again use sum_double as the example.
# ops/ops_py/sum.py
import torch
from torch.autograd import Function
import sum_double


class SumDouble(Function):

    @staticmethod
    def forward(ctx, array1, array2):
        """sum_double function forward.
        Args:
            array1 (torch.Tensor): [n,]
            array2 (torch.Tensor): [n,]

        Returns:
            ans (torch.Tensor): [n,]
        """
        array1 = array1.float()
        array2 = array2.float()
        ans = array1.new_zeros(array1.shape)
        sum_double.forward(array1.contiguous(), array2.contiguous(), ans)

        # ctx.mark_non_differentiable(ans)  # if the function does not need backpropagation
        return ans

    @staticmethod
    def backward(ctx, g_out):
        # return None, None  # if the function does not need backpropagation
        g_in1 = g_out.clone()
        g_in2 = g_out.clone()
        return g_in1, g_in2


sum_double_op = SumDouble.apply
The import sum_double at the top of the file imports the module under the name defined in setup.py.
A subclass of torch.autograd.Function must implement the forward and backward functions and declare them as static methods.
forward
The first half of forward simply passes the Tensors into the compiled interface. If an input tensor was produced by indexing earlier in the code, it is likely non-contiguous, so it is recommended to call .contiguous() before handing it to the operator.
If the operator does not need backward propagation, you can use ctx.mark_non_differentiable(ans) to mark the output of the function as non-differentiable [6], as sketched below.
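For illustration, a minimal sketch of a non-differentiable variant of the wrapper (the class name SumDoubleNoGrad is hypothetical; the kernel call is the same sum_double.forward as above):

class SumDoubleNoGrad(Function):

    @staticmethod
    def forward(ctx, array1, array2):
        ans = array1.new_zeros(array1.shape)
        sum_double.forward(array1.contiguous(), array2.contiguous(), ans)
        ctx.mark_non_differentiable(ans)  # the output will never require grad
        return ans

    @staticmethod
    def backward(ctx, g_out):
        # never reached for non-differentiable outputs; return one None per input
        return None, None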
backward
The inputs of backward correspond to the outputs of forward, and the outputs of backward correspond to the inputs of forward. Here, the backward input g_out corresponds to the forward output ans, and the backward outputs g_in1, g_in2 correspond to the forward inputs array1, array2.
If the operator does not need backward propagation, just return None, None. Otherwise, compute the gradient with respect to each input variable and return them in the same order as the inputs.
It is worth noting that if backward propagation needs information from forward, you can record it on ctx in forward and read it in backward. For example, when summing an array, the backward gradient is a vector with the same length as the original array, so the length can be recorded in forward with ctx.shape = array.shape[0] and read in backward with n = ctx.shape, as sketched below.
If Tensors need to be saved, it is recommended to store them with ctx.save_for_backward(x, y, z, ...) and retrieve them with x, y, z, ... = ctx.saved_tensors, rather than stashing them directly on ctx [7]. This prevents incorrect gradients and memory leaks, and enables saved-tensor hooks to be applied.
Note: save_for_backward() can only store tensors; scalars and other non-tensor values should be stored directly on ctx, as above. A small sketch follows.
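For illustration, a minimal sketch of the recommended pattern, using a hypothetical element-wise product as the operator:

class MulArrays(Function):

    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)   # the safe way to keep tensors for backward
        return x * y

    @staticmethod
    def backward(ctx, g_out):
        x, y = ctx.saved_tensors      # retrieved in the order they were saved
        return g_out * y, g_out * x   # d(x*y)/dx = y, d(x*y)/dy = x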
Finally, sum_double_op = SumDouble.apply gives the final callable form of the operator; a usage sketch follows.
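For illustration, continuing from ops/ops_py/sum.py above, the wrapped operator can be used like any other differentiable PyTorch function (the tensor size is arbitrary):

a = torch.randn(1024, device='cuda', requires_grad=True)
b = torch.randn(1024, device='cuda', requires_grad=True)

out = sum_double_op(a, b)   # forward runs the CUDA kernel
out.sum().backward()        # backward returns g_in1, g_in2 as defined above

print(a.grad)               # all ones, since d(a + b)/da = 1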
__init__.py
To call the wrapped PyTorch functions from outside, export them in ops/ops_py/__init__.py:
from .sum import sum_single_op, sum_double_op
__all__ = ['sum_single_op', 'sum_double_op']
and in ops/__init__.py:
from .ops_py import *
Build & Test
Make sure a PyTorch environment is installed, then run pip install -e . in the root directory of the demo.
Test the result with python test_ops.py; if everything works, the output should be:
Average time cost of sum_single is 2.8257 ms
Average time cost of sum_double is 0.1128 ms
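The actual test script is test_ops.py in the repo; a minimal sketch of what such a timing test might look like (array size and iteration count are arbitrary):

import time
import torch
from ops import sum_double_op

a = torch.randn(1 << 20, device='cuda')
b = torch.randn(1 << 20, device='cuda')

torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    c = sum_double_op(a, b)
torch.cuda.synchronize()   # wait for the kernels to finish before reading the clock
print(f'Average time cost of sum_double is {(time.time() - start) / 100 * 1000:.4f} ms')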
If compilation fails, it may be because nvcc is not on your PATH. Check whether a CUDA folder exists with ls /usr/local/. For example, if there is a folder called cuda-11.6, open ~/.bashrc and add the following lines at the end:
export PATH=$PATH:/usr/local/cuda-11.6/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.6/lib64
If the CUDA environment is not configured, you can refer to my previous article:
https://zhuanlan.zhihu.com/p/558605762
Other related reference materials can be found at [8][9][10].
References
[1] CudaDemo code repository: https://github.com/Yuppie898988/CudaDemo
[2] PyTorch C++ API: https://pytorch.org/cppdocs/api/library_root.html
[3] pybind11: https://pybind11.readthedocs.io/en/latest/basics.html
[4] setuptools: https://setuptools.pypa.io/en/latest/userguide/quickstart.html#
[5] autograd: https://pytorch.org/docs/stable/autograd.html
[6] mark_non_differentiable: https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.mark_non_differentiable.html#torch.autograd.function.FunctionCtx.mark_non_differentiable
[7] save_for_backward: https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.save_for_backward.html#torch.autograd.function.FunctionCtx.save_for_backward
[8] Detailed explanation of three ways to compile and call custom CUDA operators in PyTorch: https://zhuanlan.zhihu.com/p/358778742
[9] Official PyTorch tutorial: https://pytorch.org/tutorials/advanced/cpp_extension.html
[10] The C++ frontend: https://pytorch.org/cppdocs/frontend.html