Introduction
CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture launched by NVIDIA that lets GPUs solve complex computational problems. Developers can write programs for the CUDA architecture in C, and these programs run with very high performance on CUDA-capable processors.
Editor | Heart of Autonomous Driving · Author | Yuppie@Zhihu
Link | https://zhuanlan.zhihu.com/p/595851188
Recently I have been working on operator optimization, and in discussions with others I found that writing operators in CUDA has a real barrier to entry. Although there are many excellent guides on Zhihu, and PyTorch provides its own tutorial (the link is at the end of this article), the explanation of each step and of the common pitfalls still seems to lack detail.
Drawing on my own painful experience when getting started, a simple and clear demo can greatly reduce the cost of getting up to speed. I will therefore use summing a single array (hereinafter sum_single) and adding two arrays (hereinafter sum_double) as examples to walk through the complete framework for implementing PyTorch operators in CUDA. For the full code, see [1]:
https://github.com/Yuppie898988/CudaDemo
Framework Structure
├── ops
│   ├── __init__.py
│   ├── ops_py
│   │   ├── __init__.py
│   │   └── sum.py
│   └── src
│       ├── reduce_sum
│       │   ├── sum.cpp
│       │   └── sum_cuda.cu
│       └── sum_two_arrays
│           ├── two_sum.cpp
│           └── two_sum_cuda.cu
├── README.md
├── setup.py
└── test_ops.py
The demo structure is as follows:
- ops/src/ contains the CUDA/C++ code
- setup.py is the configuration file for compiling the operators
- ops/ops_py/ contains the operator functions wrapped with PyTorch
- test_ops.py is the test script that calls the operators
CUDA/C++
To implement an operator, you need a .cu (CUDA) file for the kernel function and a .cpp (C++) file for the wrapper function, which calls PYBIND11_MODULE to encapsulate the operator.
Note: the CUDA file and the C++ file must not share the same file name, otherwise compilation will fail! (This is why the demo pairs sum.cpp with sum_cuda.cu.)
We will use src/sum_two_arrays/ as the example.
// src/sum_two_arrays/two_sum_cuda.cu
#include <cstdio>

#define THREADS_PER_BLOCK 256
#define WARP_SIZE 32
#define DIVUP(m, n) ((m + n - 1) / n)

// Each thread adds one pair of elements: c[i] = a[i] + b[i]
__global__ void two_sum_kernel(const float* a, const float* b, float* c, int n){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n){
        c[idx] = a[idx] + b[idx];
    }
}

void two_sum_launcher(const float* a, const float* b, float* c, int n){
    dim3 blockSize(DIVUP(n, THREADS_PER_BLOCK));  // enough blocks to cover all n elements
    dim3 threadSize(THREADS_PER_BLOCK);
    two_sum_kernel<<<blockSize, threadSize>>>(a, b, c, n);
}
The core is the two_sum_kernel kernel function, which implements the element-wise addition. The two_sum_launcher function below it allocates the thread blocks and launches the kernel. For example, with n = 1000 and 256 threads per block, DIVUP(1000, 256) = 4 blocks (1024 threads) are launched, and the if (idx < n) guard masks out the 24 extra threads.
// src/sum_two_arrays/two_sum.cpp
#include <torch/extension.h>
#include <torch/serialize/tensor.h>

#define CHECK_CUDA(x) \
    TORCH_CHECK(x.type().is_cuda(), #x, " must be a CUDA tensor ")
#define CHECK_CONTIGUOUS(x) \
    TORCH_CHECK(x.is_contiguous(), #x, " must be contiguous ")
#define CHECK_INPUT(x) \
    CHECK_CUDA(x); \
    CHECK_CONTIGUOUS(x)

// declared here, defined in two_sum_cuda.cu
void two_sum_launcher(const float* a, const float* b, float* c, int n);

void two_sum_gpu(at::Tensor a_tensor, at::Tensor b_tensor, at::Tensor c_tensor){
    CHECK_INPUT(a_tensor);
    CHECK_INPUT(b_tensor);
    CHECK_INPUT(c_tensor);

    const float* a = a_tensor.data_ptr<float>();
    const float* b = b_tensor.data_ptr<float>();
    float* c = c_tensor.data_ptr<float>();

    int n = a_tensor.size(0);
    two_sum_launcher(a, b, c, n);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &two_sum_gpu, "sum two arrays (CUDA)");
}
In the C++ file, the macros defined at the top check that the input tensors are on the GPU (CHECK_CUDA) and that their elements are stored contiguously in memory (CHECK_CONTIGUOUS). two_sum_launcher is a declaration of the function defined in the CUDA file.
The two_sum_gpu function is the interface to Python; its arguments are PyTorch Tensors. Here the Tensors are validated with the CHECK macros (optional) and raw data pointers are obtained with .data_ptr. For the use of Tensors in C++, see [2].
Finally, PYBIND11_MODULE encapsulates the whole operator so that the C++ function can be called from Python [3]. For other custom operators, only the three arguments of m.def() need to change:
- "forward": the method name of the operator. If the whole module is named sum_double, the operator can be called from Python as sum_double.forward
- &two_sum_gpu: the function to bind; change it according to the function you implement
- "sum two arrays (CUDA)": the operator description, shown when calling help(sum_double.forward) on the Python side
Some may wonder why the operator and the module are separated. If sum_double contains many different functions, multiple operators can be bound in one module: simply write multiple m.def() calls inside PYBIND11_MODULE, and the different operators are then called as sum_double.xxx. A minimal usage sketch of the compiled module follows.
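For illustration, here is a minimal sketch of calling the compiled sum_double module directly from Python, assuming it has already been built with setup.py as described in the next section; the in-place output argument follows the two_sum_gpu signature above:

import torch
import sum_double  # the compiled extension module named in setup.py

a = torch.randn(1024, device='cuda')
b = torch.randn(1024, device='cuda')
c = torch.zeros_like(a)           # output buffer, filled in place by the operator

sum_double.forward(a, b, c)       # module name + method name from m.def("forward", ...)
print(torch.allclose(c, a + b))   # True
help(sum_double.forward)          # prints the "sum two arrays (CUDA)" description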
setup.py Compilation Configuration
Create a setup.py file in the root directory of the project to configure the compilation; setuptools is used to package the operators [4].
from setuptools import find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='CudaDemo',
    packages=find_packages(),
    version='0.1.0',
    author='xxx',
    ext_modules=[
        CUDAExtension(
            'sum_single',  # operator (module) name
            ['./ops/src/reduce_sum/sum.cpp',
             './ops/src/reduce_sum/sum_cuda.cu',]
        ),
        CUDAExtension(
            'sum_double',
            ['./ops/src/sum_two_arrays/two_sum.cpp',
             './ops/src/sum_two_arrays/two_sum_cuda.cu',]
        ),
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)
The items that need to be modified in this file are:
- name: the package name
- version: the package version number
- author: the author's name
- ext_modules: the C/C++ extensions to compile, a list in which each element describes one module (as mentioned at the end of the CUDA/C++ section, a module can contain multiple operators)
CUDAExtension
In ext_modules, use CUDAExtension to specify the CUDA/C++ source files: the first argument is the name of the module, and the second is a list of all its source file paths.
The module name here and the operator method name defined in CUDA/C++ with m.def() jointly determine how the operator is called. For example, the module for adding two arrays is named sum_double and the operator method name is forward, so the operator is called from Python as sum_double.forward().
It is worth mentioning that packages is of type list[str] and lists the local packages to include. Here find_packages() is used to find all local packages. Strictly speaking we only need the files under ops/src/, so packages=['ops/src'] would also compile normally, but for convenience we simply use find_packages().
PyTorch Wrapping
To make the custom operator support forward and backward propagation properly, we wrap it by inheriting from torch.autograd.Function [5]. We again use sum_double as the example.
# ops/ops_py/sum.py
import torch
from torch.autograd import Function
import sum_double


class SumDouble(Function):

    @staticmethod
    def forward(ctx, array1, array2):
        """sum_double function forward.
        Args:
            array1 (torch.Tensor): [n,]
            array2 (torch.Tensor): [n,]

        Returns:
            ans (torch.Tensor): [n,]
        """
        array1 = array1.float()
        array2 = array2.float()
        ans = array1.new_zeros(array1.shape)
        sum_double.forward(array1.contiguous(), array2.contiguous(), ans)

        # ctx.mark_non_differentiable(ans)  # if the function does not need backpropagation
        return ans

    @staticmethod
    def backward(ctx, g_out):
        # return None, None  # if the function does not need backpropagation
        g_in1 = g_out.clone()
        g_in2 = g_out.clone()
        return g_in1, g_in2


sum_double_op = SumDouble.apply
The import sum_double at the top of the file imports the module under the name defined in setup.py.
A subclass of torch.autograd.Function must implement the forward and backward functions and declare them as static methods.
forward
The first half of forward simply passes the Tensors into the compiled interface. If an input tensor was produced by indexing earlier in the code, it is likely non-contiguous, so it is recommended to call .contiguous() before handing it to the operator.
If the operator does not need backward propagation, you can use ctx.mark_non_differentiable(ans) to mark the output of the function as non-differentiable [6], as sketched below.
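For illustration, a minimal sketch of a non-differentiable variant of the wrapper (the class name SumDoubleNoGrad is hypothetical; the kernel call is the same sum_double.forward as above):

class SumDoubleNoGrad(Function):

    @staticmethod
    def forward(ctx, array1, array2):
        ans = array1.new_zeros(array1.shape)
        sum_double.forward(array1.contiguous(), array2.contiguous(), ans)
        ctx.mark_non_differentiable(ans)  # the output will never require grad
        return ans

    @staticmethod
    def backward(ctx, g_out):
        # never reached for non-differentiable outputs; return one None per input
        return None, None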
backward
The inputs of backward correspond to the outputs of forward, and the outputs of backward correspond to the inputs of forward. Here, the backward input g_out corresponds to the forward output ans, and the backward outputs g_in1, g_in2 correspond to the forward inputs array1, array2.
If the operator does not need backward propagation, just return None, None. Otherwise, compute the gradient with respect to each input variable and return them in the same order as the inputs.
It is worth noting that if backward propagation needs information from forward, you can record it on ctx in forward and read it in backward. For example, when summing an array, the backward gradient is a vector with the same length as the original array, so the length can be recorded in forward with ctx.shape = array.shape[0] and read in backward with n = ctx.shape, as sketched below.
If Tensors need to be saved, it is recommended to store them with ctx.save_for_backward(x, y, z, ...) and retrieve them with x, y, z, ... = ctx.saved_tensors, rather than stashing them directly on ctx [7]. This prevents incorrect gradients and memory leaks, and enables saved-tensor hooks to be applied.
Note: save_for_backward() can only store tensors; scalars and other non-tensor values should be stored directly on ctx, as above. A small sketch follows.
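For illustration, a minimal sketch of the recommended pattern, using a hypothetical element-wise product as the operator:

class MulArrays(Function):

    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)   # the safe way to keep tensors for backward
        return x * y

    @staticmethod
    def backward(ctx, g_out):
        x, y = ctx.saved_tensors      # retrieved in the order they were saved
        return g_out * y, g_out * x   # d(x*y)/dx = y, d(x*y)/dy = x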
Finally, sum_double_op = SumDouble.apply gives the final callable form of the operator; a usage sketch follows.
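For illustration, continuing from ops/ops_py/sum.py above, the wrapped operator can be used like any other differentiable PyTorch function (the tensor size is arbitrary):

a = torch.randn(1024, device='cuda', requires_grad=True)
b = torch.randn(1024, device='cuda', requires_grad=True)

out = sum_double_op(a, b)   # forward runs the CUDA kernel
out.sum().backward()        # backward returns g_in1, g_in2 as defined above

print(a.grad)               # all ones, since d(a + b)/da = 1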
__init__.py
To call the wrapped PyTorch functions from outside, export them in ops/ops_py/__init__.py:
from .sum import sum_single_op, sum_double_op
__all__ = ['sum_single_op', 'sum_double_op']
and in ops/__init__.py:
from .ops_py import *
Build & Test
Make sure a PyTorch environment is installed, then run pip install -e . in the root directory of the demo.
Test the result with python test_ops.py; if everything works, the output should be:
Average time cost of sum_single is 2.8257 ms
Average time cost of sum_double is 0.1128 ms
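The actual test script is test_ops.py in the repo; a minimal sketch of what such a timing test might look like (array size and iteration count are arbitrary):

import time
import torch
from ops import sum_double_op

a = torch.randn(1 << 20, device='cuda')
b = torch.randn(1 << 20, device='cuda')

torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    c = sum_double_op(a, b)
torch.cuda.synchronize()   # wait for the kernels to finish before reading the clock
print(f'Average time cost of sum_double is {(time.time() - start) / 100 * 1000:.4f} ms')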
If compilation fails, it may be because nvcc is not on your PATH. Check whether a CUDA folder exists with ls /usr/local/. For example, if there is a folder called cuda-11.6, open ~/.bashrc and add the following lines at the end:
export PATH=$PATH:/usr/local/cuda-11.6/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.6/lib64
If the CUDA environment is not configured, you can refer to my previous article:
https://zhuanlan.zhihu.com/p/558605762
Other related reference materials can be found at [8][9][10].
References
[1] CudaDemo code repository: https://github.com/Yuppie898988/CudaDemo
[2] PyTorch C++ API: https://pytorch.org/cppdocs/api/library_root.html
[3] pybind11: https://pybind11.readthedocs.io/en/latest/basics.html
[4] setuptools: https://setuptools.pypa.io/en/latest/userguide/quickstart.html#
[5] autograd: https://pytorch.org/docs/stable/autograd.html
[6] mark_non_differentiable: https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.mark_non_differentiable.html#torch.autograd.function.FunctionCtx.mark_non_differentiable
[7] save_for_backward: https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.save_for_backward.html#torch.autograd.function.FunctionCtx.save_for_backward
[8] Detailed explanation of three ways to compile and call custom CUDA operators in PyTorch: https://zhuanlan.zhihu.com/p/358778742
[9] Official PyTorch tutorial: https://pytorch.org/tutorials/advanced/cpp_extension.html
[10] The C++ frontend: https://pytorch.org/cppdocs/frontend.html