This work was mainly completed by me and Hu Jie, the author of SENet. I would also like to thank my two mentors at HKUST, Chen Qifeng and Zhang Tong, for their discussions and suggestions.
This article introduces our paper accepted at CVPR 2021, Involution: Inverting the Inherence of Convolution for Visual Recognition, and shares some of our understanding of network structure design (CNN and Transformer).
Summary
Our contributions can be summarized as follows:
(1) We propose a new neural network operator (op) called involution, which is lighter and more efficient than convolution and formally more concise than self-attention, improving both accuracy and efficiency across models for various visual tasks.
(2) Through the structural design of involution, we are able to understand the classical convolution operation and the recently popular self-attention operation from a unified perspective.
Paper link: https://arxiv.org/abs/2103.06255
Code and model link: https://github.com/d-li14/involution
[Motivation] The Antisymmetry with Convolution
This part is drawn mainly from Sections 2 and 3 of the paper.
The kernel of ordinary convolution enjoys two basic characteristics: spatial invariance (spatial-agnostic) and channel specificity (channel-specific); whereas involution is exactly the opposite, exhibiting channel invariance (channel-agnostic) and spatial specificity (spatial-specific).
Convolution
The size of the convolution kernel is expressed as $C_o \times C_i \times K \times K$, where $C_o$ and $C_i$ are the numbers of output and input channels, respectively, and $K$ represents the spatial kernel size. The spatial extent $H \times W$ of the feature map does not appear in the kernel size, indicating that the same kernel is shared across pixels, i.e., spatial invariance, while each channel has its own corresponding kernel, which is called channel specificity. The operation of convolution can be expressed as:

$$Y_{i,j,k} = \sum_{c=1}^{C_i} \sum_{(u,v) \in \Delta_K} W_{k,\,c,\,u+\lfloor K/2 \rfloor,\, v+\lfloor K/2 \rfloor} \, X_{i+u,\,j+v,\,c},$$

where $X$ is the input tensor, $Y$ is the output tensor, $W$ is the convolution kernel, and $\Delta_K$ is the set of offsets in the $K \times K$ neighborhood. Next, we will look at the two main characteristics of convolution:
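To make these two properties concrete, here is a small PyTorch sketch (shapes are illustrative, not from the paper) that writes convolution as unfold plus matrix multiplication: one weight tensor of shape $(C_o, C_i, K, K)$ is reused at every spatial position, while each output channel owns its own kernel.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; any values work as long as K is odd.
B, C_i, C_o, K, H, W = 2, 8, 16, 3, 10, 10
x = torch.randn(B, C_i, H, W)
weight = torch.randn(C_o, C_i, K, K)    # one kernel per output channel (channel-specific)

patches = F.unfold(x, K, padding=K // 2)   # B, C_i*K*K, H*W
out = weight.view(C_o, -1) @ patches       # the same weight applied at all H*W positions (spatial-agnostic)
out = out.view(B, C_o, H, W)

# Matches the built-in convolution.
assert torch.allclose(out, F.conv2d(x, weight, padding=K // 2), atol=1e-4)
```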
Spatial Invariance
On one hand, the advantages brought by spatial invariance include: 1. parameter sharing, without which the number of parameters would grow to $H \times W \times C_o \times C_i \times K \times K$; 2. translation equivariance, which can also be understood as producing similar responses to similar patterns appearing at different spatial locations. Its shortcoming is also evident: the extracted features are relatively uniform, and the kernel parameters cannot be flexibly adapted to different inputs.

On the other hand, because both the parameter count and the computational cost of convolution contain the factor $C_o \times C_i$, where the number of channels is often in the hundreds or even thousands, $K$ has to be kept small to limit the scale of parameters and computation. We have become accustomed to using $3 \times 3$ kernels since VGGNet, which limits convolution's ability to capture long-range relationships in a single operation and forces it to rely on stacking multiple $3 \times 3$ kernels; this builds up the receptive field less effectively than directly using larger kernels.
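As a quick back-of-the-envelope illustration (the numbers below are illustrative, not taken from the paper), giving up spatial sharing would blow up the parameter count of a single layer:

```python
# Parameter count of one convolution layer, with and without spatial weight sharing.
C_i = C_o = 256        # a typical channel width
K = 3                  # kernel size
H = W = 56             # feature-map resolution

shared = C_o * C_i * K * K      # standard convolution: one kernel reused at every pixel
per_pixel = H * W * shared      # hypothetical per-pixel kernels without sharing

print(f"{shared:,}")            # 589,824
print(f"{per_pixel:,}")         # 1,849,688,064 -- why convolution keeps spatial sharing
```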
Channel Specificity
Some previous studies on low-rank approximation suggest that there is redundancy in the channel dimension of convolution kernels, so the kernel size along the channel dimension could potentially be reduced without significantly hurting expressive capacity. Intuitively, we can lay out the kernels corresponding to all output channels as rows of a matrix of size $C_o \times (C_i \cdot K \cdot K)$; the rank of this matrix cannot exceed $\min(C_o,\, C_i K^2)$, and in practice its effective rank is often well below $C_o$, indicating that many kernels are approximately linearly dependent.
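Here is a hedged illustration of how one might probe this redundancy in practice; the weight below is a random stand-in (so its rank is essentially full), whereas an actually trained kernel typically shows a fast-decaying singular-value spectrum.

```python
import torch

# Lay the C_o kernels (each of size C_i*K*K) out as rows of a matrix and
# inspect its singular values to estimate the effective rank.
weight = torch.randn(256, 256, 3, 3)   # stand-in for a trained conv weight
mat = weight.flatten(1)                # shape: C_o x (C_i*K*K) = 256 x 2304
s = torch.linalg.svdvals(mat)

effective_rank = int((s > 1e-2 * s[0]).sum())
print(effective_rank)   # close to 256 for a random weight; often much lower for trained kernels
```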
Involution
Based on the above analysis, we propose involution, whose design is the opposite of convolution's characteristics: kernels are shared across the channel dimension, while spatially specific kernels are used for more flexible modeling in the spatial dimension. The size of the involution kernel is $H \times W \times K \times K \times G$, where $G \ll C$ indicates that all channels share only $G$ kernels. The operation of involution is expressed as:

$$Y_{i,j,k} = \sum_{(u,v) \in \Delta_K} \mathcal{H}_{i,\,j,\,u+\lfloor K/2 \rfloor,\, v+\lfloor K/2 \rfloor,\, \lceil kG/C \rceil} \, X_{i+u,\,j+v,\,k},$$

where $\mathcal{H}$ is the involution kernel.

In involution, we do not use a fixed weight matrix as learnable parameters the way convolution does; instead, we generate the involution kernel from the input feature map, so that the kernel automatically aligns with the input in the spatial dimension. Otherwise, if we trained weights on images of a fixed size (e.g., $224 \times 224$), those weights could not be transferred to downstream tasks with larger inputs (such as detection and segmentation). The general form of involution kernel generation is:

$$\mathcal{H}_{i,j} = \phi(X_{\Psi_{i,j}}),$$

where $\Psi_{i,j}$ is an index set of the neighborhood at coordinate $(i,j)$, so $X_{\Psi_{i,j}}$ represents a patch of the feature map containing $X_{i,j}$.

Various designs of the kernel generation function $\phi$ can be explored further. Starting from a simple and effective design, we adopt a bottleneck structure similar to SENet for our experiments: $\Psi_{i,j}$ is taken as the single-point set $\{(i,j)\}$, i.e., the kernel is generated from the single pixel $X_{i,j}$ at coordinate $(i,j)$, which yields one instantiation of involution kernel generation:

$$\mathcal{H}_{i,j} = \phi(X_{i,j}) = W_1\, \sigma(W_0\, X_{i,j}),$$

where $W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_1 \in \mathbb{R}^{(K \cdot K \cdot G) \times \frac{C}{r}}$ are linear transformation matrices, $r$ is the channel reduction ratio, and $\sigma$ denotes the intermediate BN and ReLU.

Note that designing different kernel generation functions yields different instantiations of involution: more sophisticated designs may further uncover the potential of involution, and with a specific instantiation it can also be specialized into the form of self-attention (see the next section).

Under the simple instantiation above, we obtain the complete schematic of involution: the feature vector at a coordinate point of the input feature map is first transformed by $\phi$ (FC-BN-ReLU-FC) and reshaped (channel-to-space) into the shape of a $K \times K \times G$ kernel, giving the involution kernel for that coordinate; this kernel then performs a Multiply-Add with the features in the $K \times K$ neighborhood of that coordinate on the input feature map to produce the output feature map. The specific operation flow and tensor shape changes are illustrated in the schematic figure of the paper, where the $K \times K$ neighborhood around coordinate $(i,j)$ is indexed by $\Delta_K$. A simple pseudo-code implementation based on the PyTorch API is given after the comparison list below.

The number of parameters of involution is $\frac{C^2}{r} + \frac{C}{r} K^2 G$, and its computation splits into two parts, kernel generation and Multiply-Add (MAdd), roughly $HW\left(\frac{C^2}{r} + \frac{C}{r} K^2 G\right)$ plus $HW \cdot C K^2$ FLOPs, which is significantly lower than convolution's $C^2 K^2$ parameters and $HW \cdot C^2 K^2$ FLOPs (taking $C_i = C_o = C$).

Conversely, consider the comparison with the existing advantages of convolution:
Sharing kernels across channels (only G kernels) allows us to use larger spatial spans (increasing K), thus enhancing performance through spatial dimension design while maintaining efficiency through channel dimension design (see ablation in Tab. 6a, 6b). Even if weights are not shared across different spatial positions, it will not lead to a significant increase in the number of parameters and computational load.
Although we do not directly share kernel parameters at every spatial pixel, we share meta-parameters (meta-weights, i.e., the parameters of the kernel generation function) at a higher level, so knowledge can still be shared and transferred across different spatial positions. By contrast, if we simply relaxed convolution's constraint of spatial sharing and let every pixel learn its own kernel parameters freely, this kind of sharing and transfer would not be possible.
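As referenced above, here is a PyTorch-style sketch of the involution op under the single-pixel kernel-generation instantiation. It follows the description in this section; the default hyperparameters (kernel size, groups, reduction ratio) are illustrative, and the strided path assumes the spatial size is divisible by the stride.

```python
import torch
import torch.nn as nn


class Involution(nn.Module):
    """Sketch of involution with the FC-BN-ReLU-FC kernel generation described above."""

    def __init__(self, channels, kernel_size=7, groups=4, reduction=4, stride=1):
        super().__init__()
        assert channels % groups == 0
        self.K, self.G, self.s = kernel_size, groups, stride
        # kernel generation (W_0, sigma, W_1) implemented with 1x1 convolutions
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.down = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
        self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2, stride=stride)

    def forward(self, x):
        B, C, H, W = x.shape            # assumes H and W are divisible by the stride
        h, w = H // self.s, W // self.s
        # generate a K x K kernel per output pixel and per group
        kernel = self.span(self.reduce(self.down(x)))                  # B, K*K*G, h, w
        kernel = kernel.view(B, self.G, 1, self.K * self.K, h, w)
        # unfold the input into K x K neighborhoods
        patches = self.unfold(x).view(B, self.G, C // self.G, self.K * self.K, h, w)
        # Multiply-Add between each generated kernel and its neighborhood
        out = (kernel * patches).sum(dim=3)                            # B, G, C/G, h, w
        return out.view(B, C, h, w)


# Usage: shapes are preserved, and the parameter count stays small.
inv = Involution(channels=64, kernel_size=7, groups=4)
y = inv(torch.randn(2, 64, 32, 32))
print(y.shape)                                   # torch.Size([2, 64, 32, 32])
print(sum(p.numel() for p in inv.parameters()))  # far fewer than a 7x7 convolution's 64*64*49
```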
In summary, this design from convolution to involution is actually a reallocation of computational power at a micro-granularity (op level), where the essence of network design is the allocation of computational power, aimed at adjusting limited computational power to the positions where performance can be maximized. For example, NAS is an optimal configuration of computational power at a macro-granularity (network level) through automatic search.
[Discussion] Correlation with Self-Attention
This part is drawn mainly from Section 4.2 of the paper.
Self-Attention
We know that self-attention can be expressed as (omitting the position encoding part to simplify the expression):

$$Y_{i,j,k} = \sum_{(p,q) \in \Omega_{i,j}} \mathrm{softmax}_{(p,q)}\!\left( (Q K^{\top})^{h}_{i,j,p,q} \right) V_{p,q,k},$$

where $Q = X W^{Q}$, $K = X W^{K}$, $V = X W^{V}$ are the query, key, and value obtained from the input by linear transformations, $H$ is the number of heads in multi-head self-attention, and $h$ indexes the head. The subscript $(i,j,p,q)$ indicates the query-key matching between pixels $(i,j)$ and $(p,q)$, and $\Omega_{i,j}$ indicates the range of keys corresponding to query $(i,j)$, which may be a local patch (local self-attention) or the whole image (global self-attention).

We can further rewrite self-attention as

$$Y_{i,j,k} = \sum_{(p,q) \in \Omega_{i,j}} \mathcal{A}^{h}_{i,j,p,q} \, V_{p,q,k},$$

where $\mathcal{A}$ is the attention map. Comparing this with the expression for involution, we can easily find the similarities:
Different heads in self-attention correspond to different groups in involution (split in the channel dimension)
The attention map of each pixel in self-attention corresponds to each pixel’s kernel in involution
If the kernel generation function of involution is instantiated as

$$\mathcal{H}_{i,j} = \phi(X_{\Psi_{i,j}}) = \mathrm{softmax}\!\left( (X_{i,j} W^{Q}) (X_{\Psi_{i,j}} W^{K})^{\top} \right),$$

i.e., the kernel is produced by query-key matching over the neighborhood $\Psi_{i,j}$, then self-attention is also one particular instantiation of involution; involution is thus the more general form of expression.

Moreover, $W^{V}$ corresponds to the linear transformation applied before the attention matrix multiplication to obtain the value, and self-attention operations are generally followed by another linear transformation and a residual connection; this structure corresponds exactly to our use of involution to replace the $3 \times 3$ convolution in the ResNet bottleneck, where there are likewise two $1 \times 1$ convolutions for linear transformations before and after.

Here, let us also discuss the position encoding issue. Since self-attention is permutation-invariant, it requires position encoding to distinguish positional information, whereas in the involution unit instantiated in this paper, each element of the generated involution kernel is inherently ordered by position, so no additional positional information is needed. Moreover, some works that build backbones from pure self-attention (such as stand-alone self-attention and lambda networks) have observed that using only position encoding rather than query-key relations already achieves considerable performance, i.e., the attention map is built from $Q R^{\top}$ instead of $Q K^{\top}$ (where $R$ is the position encoding matrix). From our involution perspective, this is merely another kernel generation function instantiating involution.

Therefore, we argue that the essential effect of self-attention in backbone networks may be to capture long-range and self-adaptive interactions, in short, to use a large and dynamic kernel, and that it is not necessary to construct this kernel from query-key relations. On the other hand, since our involution kernel is generated from a single pixel, it is not well suited to being scaled up to cover the entire image, but applying it within a relatively large neighborhood (e.g., $K = 7$ or even larger) is entirely feasible. This also highlights that locality in CNN design remains a treasure, because even with global self-attention, the shallow layers of a network struggle to truly exploit complex global information.

Thus, the involution we adopt removes much of the complexity present in self-attention: the involution kernel is generated from the feature vector of a single pixel (rather than relying on pixel-to-pixel correspondences to build an attention map), and the positional information of pixels is implicitly encoded during kernel generation (so explicit position encoding is discarded), resulting in a very clean and efficient op.
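To make the correspondence concrete, here is a sketch (not from the paper's code) of local self-attention written in the involution form: relative to the Involution sketch above, the only change is the kernel generation function, which now uses query-key matching over the neighborhood; position encoding is omitted, as in the expressions above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttentionAsInvolution(nn.Module):
    """Local self-attention seen as involution with a query-key kernel generation."""

    def __init__(self, channels, kernel_size=7, heads=4):
        super().__init__()
        assert channels % heads == 0
        self.K, self.heads = kernel_size, heads
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.unfold = nn.Unfold(kernel_size, padding=(kernel_size - 1) // 2)

    def forward(self, x):
        B, C, H, W = x.shape
        d = C // self.heads                                        # channels per head
        q = self.q(x).view(B, self.heads, d, 1, H, W)
        k = self.unfold(self.k(x)).view(B, self.heads, d, self.K * self.K, H, W)
        v = self.unfold(self.v(x)).view(B, self.heads, d, self.K * self.K, H, W)
        # kernel generation = query-key matching over the K x K neighborhood
        attn = F.softmax((q * k).sum(dim=2, keepdim=True) / d ** 0.5, dim=3)
        # the aggregation step is the same Multiply-Add as in involution
        out = (attn * v).sum(dim=3)                                # B, heads, d, H, W
        return out.view(B, C, H, W)


# Usage: same interface and output shape as the Involution sketch above.
attn_op = LocalSelfAttentionAsInvolution(channels=64, kernel_size=7, heads=4)
print(attn_op(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```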
Vision Transformer
Having discussed self-attention, we must also mention the recently popular ViT-related works. Regarding backbone structures, we believe pure self-attention or involution models outperform CNNs, which in turn outperform ViT (transformers as decoders may have other merits, as in DETR, which we will not discuss here).

ViT has been widely discussed: its lower-level linear projection is essentially similar to convolution, while the higher levels use global self-attention to extract relations, so it can be abstracted as a hybrid model of convolution and self-attention. Proposing such a hybrid structure is reasonable: extract low-level information with convolution in the lower layers and model higher-order semantic relations with self-attention in the higher layers.

However, the convolution part in the lower layers of ViT is insufficient. Constrained by the explosive computational cost of self-attention, the input image is directly divided into 16×16 patches, so the input feature resolution is roughly equivalent to that of the second-to-last stage of a deep ResNet; the lower layers of the network therefore make poor use of fine-grained image information, and there is no feature-resolution reduction in the intermediate stages either.

Therefore, the structural design of the ViT proposed at ICLR'21 has some inherently unscientific aspects, and recent works improving ViT can generally be summarized as incorporating spatially finer self-attention operations (localizing or further subdividing patches, introducing convolutional characteristics), in some sense making ViT more like a pure self-attention/involution-based model.

In the end, whether it is convolution, self-attention, or the new involution, they are all combinations of message passing and feature aggregation; despite their outward differences, there is no need to view them in isolation.
Experimental Results
In general: (1) performance improves while parameters and computational load are reduced; (2) involution can replace convolution at various positions in different models, and generally speaking, the more parts are replaced, the better the model's cost-performance ratio.
ImageNet Image Classification
We replaced the convolution in the ResNet bottleneck block with involution to obtain a new family of backbone networks called RedNet, which outperforms ResNet and other SOTA models using self-attention as ops in terms of performance and efficiency.
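A sketch of how such a block might look, assuming the Involution module sketched earlier; channel widths and hyperparameters are illustrative, while the actual RedNet follows the ResNet bottleneck layout described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InvolutionBottleneck(nn.Module):
    """ResNet-style bottleneck with the 3x3 convolution replaced by involution."""

    def __init__(self, in_channels, mid_channels, out_channels, kernel_size=7, groups=4):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.inv = nn.Sequential(
            Involution(mid_channels, kernel_size, groups),   # the sketch defined earlier
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1, bias=False))

    def forward(self, x):
        return F.relu(self.conv2(self.inv(self.conv1(x))) + self.shortcut(x))


block = InvolutionBottleneck(256, 64, 256)
print(block(torch.randn(2, 256, 32, 32)).shape)   # torch.Size([2, 256, 32, 32])
```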
COCO Object Detection and Instance Segmentation
Fully involution-based detectors (where involution is used in the backbone, neck, and head) can reduce computational complexity by 40% while maintaining similar or improved performance.
Cityscapes Semantic Segmentation
It is worth mentioning that in the COCO detection task, the metrics for large objects improved the most (3%-4%), and in the Cityscapes segmentation task, the single-class IoU for large objects (such as walls, trucks, buses, etc.) also saw significant improvements (up to 10% or even more than 20%), which further verifies involution's significant advantage over convolution in dynamically modeling long-range spatial relationships.

Finally, this work leaves some open questions for further exploration:
Further exploration of the kernel generation function space in generalized involution;
Like deformable convolution, adding offset generation functions to further enhance the spatial modeling flexibility of this op;
Combining NAS technology to search for convolution-involution hybrid structures (original text Section 4.3);
Although we argued that self-attention is just one form of expression, we still hope that (self-)attention mechanisms can inspire better visual model designs; similarly, many good works in detection have benefited greatly from the DETR architecture.
We hope that 2021 will see more essential and diverse developments in backbone network structure design!

Author Biography
Li Duo is a second-year graduate student in the Department of Computer Science at the Hong Kong University of Science and Technology. He graduated from the Department of Automation at Tsinghua University, has published 10 papers at top international computer vision conferences such as ICCV, CVPR, and ECCV, has interned at Intel, NVIDIA, SenseTime, and ByteDance, and won the 2020 CCF-CV Academic Rising Star Award. Homepage: https://duoli.org