The project authors stated that they plan to improve the Kernl inference engine in several areas, including warmup speed, training support, multi-GPU support, quantization, and hardware support.
How powerful can one line of code be? Today we introduce the Kernl library, which lets users run PyTorch transformer models several times faster on a GPU with just one line of code, greatly accelerating model inference.
Specifically, with Kernl, BERT inference is up to 12 times faster than the Hugging Face baseline. This is achieved mainly by writing custom GPU kernels in OpenAI's new Triton language and integrating them through TorchDynamo. The project authors are from Lefebvre Sarrut.
GitHub address: https://github.com/ELS-RD/kernl/
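To make the "one line of code" claim concrete, here is a minimal usage sketch. The optimize_model entry point and its import path are assumed from the project's documentation and should be checked against the repository; the model name and input are arbitrary examples.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from kernl.model_optimization import optimize_model  # assumed import path; check the Kernl repo

# Load a regular Hugging Face model on the GPU in eval mode.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

# The "one line": replace supported operations with fused Triton kernels and enable CUDA graphs.
optimize_model(model)

# Inference then runs as usual; fp16 autocast is typically used for best throughput.
inputs = tokenizer("Kernl speeds up transformer inference.", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(**inputs)
```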
Below is a comparison of Kernl with other inference engines, where the numbers in brackets on the x-axis represent batch size and sequence length, and the y-axis represents inference acceleration.
The benchmark test was run on a 3090 RTX GPU and a 12-core Intel CPU.
From these results it can be seen that, for long input sequences, Kernl can be considered the fastest inference engine (the right half of the chart above), while for short input sequences it is close to NVIDIA's TensorRT (the left half). Additionally, the Kernl kernel code is very short and easy to understand and modify. The project even adds a Triton debugger and tools (based on Fx) to simplify kernel replacement, so there is no need to modify the PyTorch model source code.
The project author Michaël Benesty summarized this work by saying that the Kernl they released is a library for accelerating transformer inference: it is very fast, sometimes reaches SOTA performance, and works with most transformer architectures.
They also tested on T5, achieving a 6x speedup. Benesty stated that this is just the beginning.
1
Why Create Kernl?
At Lefebvre Sarrut, the project authors run several transformer models in production, some of which are latency-sensitive, mainly for search and recommendation systems. They also use OnnxRuntime and TensorRT and even created the transformer-deploy OSS library to share knowledge with the community.
Recently, the authors have been testing generative language models and trying to accelerate them, which turned out to be very difficult with traditional tools. In their view, ONNX is an interesting option: an open file format designed to store trained machine learning models, with broad hardware support.
However, when dealing with new LLM architectures, the Onnx ecosystem (mainly inference engines) has the following limitations:
- Exporting models with no control flow to Onnx is straightforward because export can rely on tracing, but dynamic behavior is harder to capture (see the export sketch after this list);
- Unlike PyTorch, ONNX Runtime/TensorRT does not yet have native support for tensor-parallel multi-GPU inference;
- TensorRT cannot manage two dynamic axes (batch size and sequence length) for transformer models with a single profile, so to serve inputs of different lengths, one model has to be built per batch size;
- Very large models are common, but Onnx (as a Protobuf file) has a file-size limit, so weights have to be stored outside the model file as a workaround.
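The export sketch below illustrates the tracing and dynamic-axes points above. The model name, axis names, and opset version are illustrative choices, not taken from the project.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("example", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    # Tracing freezes shapes unless dynamic axes are declared explicitly; engines
    # such as TensorRT then need optimization profiles covering both dynamic axes.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)
```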
A very frustrating fact is that new models are never accelerated; you have to wait for someone else to write custom CUDA kernels for that. Existing solutions are not bad; one of the main advantages of OnnxRuntime is its multi-hardware support, while TensorRT is known for being very fast.
So, the project authors wanted an optimizer as fast as TensorRT on Python/PyTorch, which is why they created Kernl.
2
How to Achieve This?
Memory bandwidth is often a bottleneck in deep learning, and to accelerate inference, reducing memory access is often a good strategy. For short input sequences, the bottleneck is usually related to CPU overhead, which must be eliminated. The project authors mainly utilized the following three techniques:
The first is OpenAI Triton, a language for writing GPU kernels in the style of CUDA but more productively (not to be confused with the NVIDIA Triton inference server). Kernl uses it to fuse several operations, chaining computations without writing intermediate results back to GPU memory. The authors rewrote attention (replaced by Flash Attention), linear layers and their activations, and Layernorm/Rmsnorm this way.
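To give a flavor of the kind of fusion Triton enables, here is a generic sketch in the style of the official Triton tutorials: a fused add + ReLU, not one of Kernl's actual attention or normalization kernels. The kernel name and block size are arbitrary.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # The add and the activation are fused: the intermediate sum never goes back
    # to GPU memory, saving a full read/write round trip.
    result = tl.maximum(x + y, 0.0)
    tl.store(out_ptr + offsets, result, mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
print(torch.allclose(fused_add_relu(x, y), torch.relu(x + y)))
```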
The second is CUDA graphs. During the warmup step, each launched kernel and its parameters are recorded; the captured graph is then replayed to reconstruct the entire inference process, removing most of the CPU launch overhead.
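Below is a minimal sketch of the underlying PyTorch mechanism (torch.cuda.CUDAGraph); the toy linear model and shapes are placeholders, and Kernl performs the equivalent capture for you.

```python
import torch

model = torch.nn.Linear(768, 768).cuda().eval()
static_input = torch.randn(8, 768, device="cuda")

# Warmup on a side stream so the capture only sees steady-state kernel launches.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture: every kernel launch and its arguments are recorded into the graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy fresh data into the fixed input buffer, then relaunch the whole
# recorded kernel sequence with a single CPU call.
static_input.copy_(torch.randn(8, 768, device="cuda"))
graph.replay()
print(static_output.sum().item())
```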
The last is TorchDynamo, a prototype introduced by Meta that helps deal with dynamic behavior. During the warmup step, it traces the model and produces an Fx graph (a static computation graph). The authors replace some operations in the Fx graph with their own kernels and then recompile it, all in Python.
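The mechanism behind this is a custom TorchDynamo backend that receives the traced Fx graph. The sketch below uses torch.compile with a custom backend (the current entry point for TorchDynamo) and only prints the graph where Kernl's actual passes would pattern-match nodes and swap in Triton kernels.

```python
import torch

def kernel_swapping_backend(gm: torch.fx.GraphModule, example_inputs):
    # Here one would walk gm.graph, match patterns such as attention or layernorm,
    # and rewrite the matched nodes to call custom kernels instead.
    for node in gm.graph.nodes:
        print(node.op, node.target)  # inspect what Dynamo captured
    gm.recompile()                   # regenerate Python code for the (possibly edited) graph
    return gm.forward                # callable used for subsequent runs

model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.GELU(),
).cuda().eval()

compiled_model = torch.compile(model, backend=kernel_swapping_backend)
x = torch.randn(4, 768, device="cuda")
with torch.no_grad():
    out = compiled_model(x)  # the first call triggers tracing and the backend
```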
In the future, the project roadmap will cover faster warmup, ragged inference (no loss of computation in padding), training support (long sequence support), multi-GPU support (multiple parallelization modes), quantization (PTQ), new batch Cutlass kernel testing, and enhanced hardware support.
For more details, please refer to the original project.