This work was mainly completed by me and Hu Jie, the author of SENet. I would also like to thank my two mentors at HKUST, Chen Qifeng and Zhang Tong, for their discussions and suggestions.
This article introduces our paper accepted at CVPR 2021, Involution: Inverting the Inherence of Convolution for Visual Recognition, and shares some of our understanding of network structure design (CNN and Transformer).
Summary
Our contributions can be summarized as follows:
(1) We propose a new neural network operator (op) called involution, which is lighter and more efficient than convolution and formally more concise than self-attention, improving both accuracy and efficiency across models for various visual tasks.
(2) Through the structural design of involution, we are able to understand the classical convolution operation and the recently popular self-attention operation from a unified perspective.
Paper link: https://arxiv.org/abs/2103.06255
Code and model link: https://github.com/d-li14/involution
[Motivation] The Antisymmetry with Convolution
This part is drawn mainly from Sections 2 and 3 of the paper.
The kernel of ordinary convolution enjoys two basic characteristics: spatial invariance (spatial-agnostic) and channel specificity (channel-specific); whereas involution is exactly the opposite, exhibiting channel invariance (channel-agnostic) and spatial specificity (spatial-specific).
Convolution
The size of the convolution kernel is expressed as $C_o \times C_i \times K \times K$, where $C_o$ and $C_i$ are the numbers of output and input channels, respectively, and $K$ represents the spatial kernel size. The spatial extent $H \times W$ of the feature map does not appear in the kernel size, indicating that the same kernel is shared across pixels, i.e., spatial invariance, while each channel has its own corresponding kernel, which is called channel specificity. The operation of convolution can be expressed as:

$$Y_{i,j,k} = \sum_{c=1}^{C_i} \sum_{(u,v) \in \Delta_K} W_{k,\,c,\,u+\lfloor K/2 \rfloor,\, v+\lfloor K/2 \rfloor} \, X_{i+u,\,j+v,\,c},$$

where $X$ is the input tensor, $Y$ is the output tensor, $W$ is the convolution kernel, and $\Delta_K$ is the set of offsets in the $K \times K$ neighborhood. Next, we will look at the two main characteristics of convolution:
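To make these two properties concrete, here is a small PyTorch sketch (shapes are illustrative, not from the paper) that writes convolution as unfold plus matrix multiplication: one weight tensor of shape $(C_o, C_i, K, K)$ is reused at every spatial position, while each output channel owns its own kernel.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; any values work as long as K is odd.
B, C_i, C_o, K, H, W = 2, 8, 16, 3, 10, 10
x = torch.randn(B, C_i, H, W)
weight = torch.randn(C_o, C_i, K, K)    # one kernel per output channel (channel-specific)

patches = F.unfold(x, K, padding=K // 2)   # B, C_i*K*K, H*W
out = weight.view(C_o, -1) @ patches       # the same weight applied at all H*W positions (spatial-agnostic)
out = out.view(B, C_o, H, W)

# Matches the built-in convolution.
assert torch.allclose(out, F.conv2d(x, weight, padding=K // 2), atol=1e-4)
```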
Spatial Invariance
On one hand, the advantages brought by spatial invariance include: 1. parameter sharing, without which the number of parameters would grow to $H \times W \times C_o \times C_i \times K \times K$; 2. translation equivariance, which can also be understood as producing similar responses to similar patterns appearing at different spatial locations. Its shortcoming is also evident: the extracted features are relatively uniform, and the kernel parameters cannot be flexibly adapted to different inputs.

On the other hand, because both the parameter count and the computational cost of convolution contain the factor $C_o \times C_i$, where the number of channels is often in the hundreds or even thousands, $K$ has to be kept small to limit the scale of parameters and computation. We have become accustomed to using $3 \times 3$ kernels since VGGNet, which limits convolution's ability to capture long-range relationships in a single operation and forces it to rely on stacking multiple $3 \times 3$ kernels; this builds up the receptive field less effectively than directly using larger kernels.
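As a quick back-of-the-envelope illustration (the numbers below are illustrative, not taken from the paper), giving up spatial sharing would blow up the parameter count of a single layer:

```python
# Parameter count of one convolution layer, with and without spatial weight sharing.
C_i = C_o = 256        # a typical channel width
K = 3                  # kernel size
H = W = 56             # feature-map resolution

shared = C_o * C_i * K * K      # standard convolution: one kernel reused at every pixel
per_pixel = H * W * shared      # hypothetical per-pixel kernels without sharing

print(f"{shared:,}")            # 589,824
print(f"{per_pixel:,}")         # 1,849,688,064 -- why convolution keeps spatial sharing
```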
Channel Specificity
Some previous studies on low-rank approximation suggest that there is redundancy in the channel dimension of convolution kernels, so the kernel size along the channel dimension could potentially be reduced without significantly hurting expressive capacity. Intuitively, we can lay out the kernels corresponding to all output channels as rows of a matrix of size $C_o \times (C_i \cdot K \cdot K)$; the rank of this matrix cannot exceed $\min(C_o,\, C_i K^2)$, and in practice its effective rank is often well below $C_o$, indicating that many kernels are approximately linearly dependent.
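Here is a hedged illustration of how one might probe this redundancy in practice; the weight below is a random stand-in (so its rank is essentially full), whereas an actually trained kernel typically shows a fast-decaying singular-value spectrum.

```python
import torch

# Lay the C_o kernels (each of size C_i*K*K) out as rows of a matrix and
# inspect its singular values to estimate the effective rank.
weight = torch.randn(256, 256, 3, 3)   # stand-in for a trained conv weight
mat = weight.flatten(1)                # shape: C_o x (C_i*K*K) = 256 x 2304
s = torch.linalg.svdvals(mat)

effective_rank = int((s > 1e-2 * s[0]).sum())
print(effective_rank)   # close to 256 for a random weight; often much lower for trained kernels
```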
Involution
Based on the above analysis, we propose involution, whose design is the opposite of convolution's characteristics: kernels are shared across the channel dimension, while spatially specific kernels are used for more flexible modeling in the spatial dimension. The size of the involution kernel is $H \times W \times K \times K \times G$, where $G \ll C$ indicates that all channels share only $G$ kernels. The operation of involution is expressed as:

$$Y_{i,j,k} = \sum_{(u,v) \in \Delta_K} \mathcal{H}_{i,\,j,\,u+\lfloor K/2 \rfloor,\, v+\lfloor K/2 \rfloor,\, \lceil kG/C \rceil} \, X_{i+u,\,j+v,\,k},$$

where $\mathcal{H}$ is the involution kernel.

In involution, we do not use a fixed weight matrix as learnable parameters the way convolution does; instead, we generate the involution kernel from the input feature map, so that the kernel automatically aligns with the input in the spatial dimension. Otherwise, if we trained weights on images of a fixed size (e.g., $224 \times 224$), those weights could not be transferred to downstream tasks with larger inputs (such as detection and segmentation). The general form of involution kernel generation is:

$$\mathcal{H}_{i,j} = \phi(X_{\Psi_{i,j}}),$$

where $\Psi_{i,j}$ is an index set of the neighborhood at coordinate $(i,j)$, so $X_{\Psi_{i,j}}$ represents a patch of the feature map containing $X_{i,j}$.

Various designs of the kernel generation function $\phi$ can be explored further. Starting from a simple and effective design, we adopt a bottleneck structure similar to SENet for our experiments: $\Psi_{i,j}$ is taken as the single-point set $\{(i,j)\}$, i.e., the kernel is generated from the single pixel $X_{i,j}$ at coordinate $(i,j)$, which yields one instantiation of involution kernel generation:

$$\mathcal{H}_{i,j} = \phi(X_{i,j}) = W_1\, \sigma(W_0\, X_{i,j}),$$

where $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{(K \cdot K \cdot G) \times \frac{C}{r}}$ are linear transformation matrices, $r$ is the channel reduction ratio, and $\sigma$ denotes the intermediate BN and ReLU.

Note that designing different kernel generation functions yields different instantiations of involution: more sophisticated designs may further uncover the potential of involution, and with a specific instantiation it can also be specialized into the form of self-attention (see the next section).

Under the simple instantiation above, we obtain the complete schematic of involution: the feature vector at a coordinate point of the input feature map is first transformed by $\phi$ (FC-BN-ReLU-FC) and reshaped (channel-to-space) into the shape of a $K \times K \times G$ kernel, giving the involution kernel for that coordinate; this kernel then performs a Multiply-Add with the features in the $K \times K$ neighborhood of that coordinate on the input feature map to produce the output feature map. The specific operation flow and tensor shape changes are illustrated in the schematic figure of the paper, where the $K \times K$ neighborhood around coordinate $(i,j)$ is indexed by $\Delta_K$. A simple pseudo-code implementation based on the PyTorch API is given after the comparison list below.

The number of parameters of involution is $\frac{C^2}{r} + \frac{C}{r} K^2 G$, and its computation splits into two parts, kernel generation and Multiply-Add (MAdd), roughly $HW\left(\frac{C^2}{r} + \frac{C}{r} K^2 G\right)$ plus $HW \cdot C K^2$ FLOPs, which is significantly lower than convolution's $C^2 K^2$ parameters and $HW \cdot C^2 K^2$ FLOPs (taking $C_i = C_o = C$).

Conversely, consider the comparison with the existing advantages of convolution:
Sharing kernels across channels (only G kernels) allows us to use larger spatial spans (increasing K), thus enhancing performance through spatial dimension design while maintaining efficiency through channel dimension design (see ablation in Tab. 6a, 6b). Even if weights are not shared across different spatial positions, it will not lead to a significant increase in the number of parameters and computational load.
Although we do not directly share kernel parameters at every spatial pixel, we share meta-parameters (meta-weights, i.e., the parameters of the kernel generation function) at a higher level, so knowledge can still be shared and transferred across different spatial positions. By contrast, if we simply relaxed convolution's constraint of spatial sharing and let every pixel learn its own kernel parameters freely, this kind of sharing and transfer would not be possible.
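As referenced above, here is a PyTorch-style sketch of the involution op under the single-pixel kernel-generation instantiation. It follows the description in this section; the default hyperparameters (kernel size, groups, reduction ratio) are illustrative, and the strided path assumes the spatial size is divisible by the stride.

```python
import torch
import torch.nn as nn


class Involution(nn.Module):
    """Sketch of involution with the FC-BN-ReLU-FC kernel generation described above."""

    def __init__(self, channels, kernel_size=7, groups=4, reduction=4, stride=1):
        super().__init__()
        assert channels % groups == 0
        self.K, self.G, self.s = kernel_size, groups, stride
        # kernel generation (W_0, sigma, W_1) implemented with 1x1 convolutions
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.down = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
        self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2, stride=stride)

    def forward(self, x):
        B, C, H, W = x.shape            # assumes H and W are divisible by the stride
        h, w = H // self.s, W // self.s
        # generate a K x K kernel per output pixel and per group
        kernel = self.span(self.reduce(self.down(x)))                  # B, K*K*G, h, w
        kernel = kernel.view(B, self.G, 1, self.K * self.K, h, w)
        # unfold the input into K x K neighborhoods
        patches = self.unfold(x).view(B, self.G, C // self.G, self.K * self.K, h, w)
        # Multiply-Add between each generated kernel and its neighborhood
        out = (kernel * patches).sum(dim=3)                            # B, G, C/G, h, w
        return out.view(B, C, h, w)


# Usage: shapes are preserved, and the parameter count stays small.
inv = Involution(channels=64, kernel_size=7, groups=4)
y = inv(torch.randn(2, 64, 32, 32))
print(y.shape)                                   # torch.Size([2, 64, 32, 32])
print(sum(p.numel() for p in inv.parameters()))  # far fewer than a 7x7 convolution's 64*64*49
```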
In summary, this design from convolution to involution is actually a reallocation of computational power at a micro-granularity (op level), where the essence of network design is the allocation of computational power, aimed at adjusting limited computational power to the positions where performance can be maximized. For example, NAS is an optimal configuration of computational power at a macro-granularity (network level) through automatic search.
[Discussion] Correlation with Self-Attention
This part is drawn mainly from Section 4.2 of the paper.
Self-Attention
We know that self-attention can be expressed as (omitting the position encoding part to simplify the expression):

$$Y_{i,j,k} = \sum_{(p,q) \in \Omega_{i,j}} \mathrm{softmax}_{(p,q)}\!\left( (Q K^{\top})^{h}_{i,j,p,q} \right) V_{p,q,k},$$

where $Q = X W^{Q}$, $K = X W^{K}$, $V = X W^{V}$ are the query, key, and value obtained from the input by linear transformations, $H$ is the number of heads in multi-head self-attention, and $h$ indexes the head. The subscript $(i,j,p,q)$ indicates the query-key matching between pixels $(i,j)$ and $(p,q)$, and $\Omega_{i,j}$ indicates the range of keys corresponding to query $(i,j)$, which may be a local patch (local self-attention) or the whole image (global self-attention).

We can further rewrite self-attention as

$$Y_{i,j,k} = \sum_{(p,q) \in \Omega_{i,j}} \mathcal{A}^{h}_{i,j,p,q} \, V_{p,q,k},$$

where $\mathcal{A}$ is the attention map. Comparing this with the expression for involution, we can easily find the similarities:
Different heads in self-attention correspond to different groups in involution (split in the channel dimension)
The attention map of each pixel in self-attention corresponds to each pixel’s kernel in involution
If the kernel generation function of involution is instantiated as

$$\mathcal{H}_{i,j} = \phi(X_{\Psi_{i,j}}) = \mathrm{softmax}\!\left( (X_{i,j} W^{Q}) (X_{\Psi_{i,j}} W^{K})^{\top} \right),$$

i.e., the kernel is produced by query-key matching over the neighborhood $\Psi_{i,j}$, then self-attention is also one particular instantiation of involution; involution is thus the more general form of expression.

Moreover, $W^{V}$ corresponds to the linear transformation applied before the attention matrix multiplication to obtain the value, and self-attention operations are generally followed by another linear transformation and a residual connection; this structure corresponds exactly to our use of involution to replace the $3 \times 3$ convolution in the ResNet bottleneck, where there are likewise two $1 \times 1$ convolutions for linear transformations before and after.

Here, let us also discuss the position encoding issue. Since self-attention is permutation-invariant, it requires position encoding to distinguish positional information, whereas in the involution unit instantiated in this paper, each element of the generated involution kernel is inherently ordered by position, so no additional positional information is needed. Moreover, some works that build backbones from pure self-attention (such as stand-alone self-attention and lambda networks) have observed that using only position encoding rather than query-key relations already achieves considerable performance, i.e., the attention map is built from $Q R^{\top}$ instead of $Q K^{\top}$ (where $R$ is the position encoding matrix). From our involution perspective, this is merely another kernel generation function instantiating involution.

Therefore, we argue that the essential effect of self-attention in backbone networks may be to capture long-range and self-adaptive interactions, in short, to use a large and dynamic kernel, and that it is not necessary to construct this kernel from query-key relations. On the other hand, since our involution kernel is generated from a single pixel, it is not well suited to being scaled up to cover the entire image, but applying it within a relatively large neighborhood (e.g., $K = 7$ or even larger) is entirely feasible. This also highlights that locality in CNN design remains a treasure, because even with global self-attention, the shallow layers of a network struggle to truly exploit complex global information.

Thus, the involution we adopt removes much of the complexity present in self-attention: the involution kernel is generated from the feature vector of a single pixel (rather than relying on pixel-to-pixel correspondences to build an attention map), and the positional information of pixels is implicitly encoded during kernel generation (so explicit position encoding is discarded), resulting in a very clean and efficient op.
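To make the correspondence concrete, here is a sketch (not from the paper's code) of local self-attention written in the involution form: relative to the Involution sketch above, the only change is the kernel generation function, which now uses query-key matching over the neighborhood; position encoding is omitted, as in the expressions above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttentionAsInvolution(nn.Module):
    """Local self-attention seen as involution with a query-key kernel generation."""

    def __init__(self, channels, kernel_size=7, heads=4):
        super().__init__()
        assert channels % heads == 0
        self.K, self.heads = kernel_size, heads
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2)

    def forward(self, x):
        B, C, H, W = x.shape
        d = C // self.heads                                        # channels per head
        q = self.q(x).view(B, self.heads, d, 1, H, W)
        k = self.unfold(self.k(x)).view(B, self.heads, d, self.K * self.K, H, W)
        v = self.unfold(self.v(x)).view(B, self.heads, d, self.K * self.K, H, W)
        # kernel generation = query-key matching over the K x K neighborhood
        attn = F.softmax((q * k).sum(dim=2, keepdim=True) / d ** 0.5, dim=3)
        # the aggregation step is the same Multiply-Add as in involution
        out = (attn * v).sum(dim=3)                                # B, heads, d, H, W
        return out.view(B, C, H, W)


# Usage: same interface and output shape as the Involution sketch above.
attn_op = LocalSelfAttentionAsInvolution(channels=64, kernel_size=7, heads=4)
print(attn_op(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```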
Vision Transformer
Having discussed self-attention, we must also mention the recently popular ViT-related works. Regarding backbone structures, we believe pure self-attention or involution models outperform CNNs, which in turn outperform ViT (transformers as decoders may have other merits, as in DETR, which we will not discuss here).

ViT has been widely discussed: its lower-level linear projection is essentially similar to convolution, while the higher levels use global self-attention to extract relations, so it can be abstracted as a hybrid model of convolution and self-attention. Proposing such a hybrid structure is reasonable: extract low-level information with convolution in the lower layers and model higher-order semantic relations with self-attention in the higher layers.

However, the convolution part in the lower layers of ViT is insufficient. Constrained by the explosive computational cost of self-attention, the input image is directly divided into 16×16 patches, so the input feature resolution is roughly equivalent to that of the second-to-last stage of a deep ResNet; the lower layers of the network therefore make poor use of fine-grained image information, and there is no feature-resolution reduction in the intermediate stages either.

Therefore, the structural design of the ViT proposed at ICLR'21 has some inherently unscientific aspects, and recent works improving ViT can generally be summarized as incorporating spatially finer self-attention operations (localizing or further subdividing patches, introducing convolutional characteristics), in some sense making ViT more like a pure self-attention/involution-based model.

In the end, whether it is convolution, self-attention, or the new involution, they are all combinations of message passing and feature aggregation; despite their outward differences, there is no need to view them in isolation.
Experimental Results
In general: (1) performance improves while parameters and computational load are reduced; (2) involution can replace convolution at various positions in different models, and generally speaking, the more parts are replaced, the better the model's cost-performance ratio.
ImageNet Image Classification
We replaced the convolution in the ResNet bottleneck block with involution to obtain a new family of backbone networks called RedNet, which outperforms ResNet and other SOTA models using self-attention as ops in terms of performance and efficiency.
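A sketch of how such a block might look, assuming the Involution module sketched earlier; channel widths and hyperparameters are illustrative, while the actual RedNet follows the ResNet bottleneck layout described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InvolutionBottleneck(nn.Module):
    """ResNet-style bottleneck with the 3x3 convolution replaced by involution."""

    def __init__(self, in_channels, mid_channels, out_channels, kernel_size=7, groups=4):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.inv = nn.Sequential(
            Involution(mid_channels, kernel_size, groups),   # the sketch defined earlier
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1, bias=False))

    def forward(self, x):
        return F.relu(self.conv2(self.inv(self.conv1(x))) + self.shortcut(x))


block = InvolutionBottleneck(256, 64, 256)
print(block(torch.randn(2, 256, 32, 32)).shape)   # torch.Size([2, 256, 32, 32])
```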
COCO Object Detection and Instance Segmentation
Fully involution-based detectors (where involution is used in the backbone, neck, and head) can reduce computational complexity by 40% while maintaining similar or improved performance.
Cityscapes Semantic Segmentation
It is worth mentioning that in the COCO detection task, the metrics for large objects improved the most (3%-4%), and in the Cityscapes segmentation task, the single-class IoU for large objects (such as walls, trucks, buses, etc.) also saw significant improvements (up to 10% or even more than 20%), which further verifies involution's significant advantage over convolution in dynamically modeling long-range spatial relationships.

Finally, this work leaves some open questions for further exploration:
Further exploration of the kernel generation function space in generalized involution;
Like deformable convolution, adding offset generation functions to further enhance the spatial modeling flexibility of this op;
Combining NAS technology to search for convolution-involution hybrid structures (original text Section 4.3);
Although we argued that self-attention is just one form of expression, we still hope that (self-)attention mechanisms can inspire better visual model designs; similarly, many good works in detection have benefited greatly from the DETR architecture.
We hope that 2021 will see more essential and diverse developments in backbone network structure design!

Author Biography
Li Duo is a second-year graduate student in the Department of Computer Science at the Hong Kong University of Science and Technology. He graduated from the Department of Automation at Tsinghua University, has published 10 papers at top international computer vision conferences such as ICCV, CVPR, and ECCV, has interned at Intel, NVIDIA, SenseTime, and ByteDance, and won the 2020 CCF-CV Academic Rising Star Award. Homepage: https://duoli.org