GCNet: Integrating Non-Local and SENet Attention Mechanisms

Introduction: Earlier articles covered SENet and the Non-Local Neural Network (NLNet), both of which are effective attention modules. The authors of GCNet found that the attention maps in NLNet are almost identical for different query positions. Combining this observation with the design of SENet, they proposed the Global Context (GC) block for global context modeling, which achieves better results than SENet and NLNet on mainstream benchmarks.

The GCNet paper is titled "GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond". It comes from Tsinghua University and introduces the GC block, an attention module similar to the SE block and the Non-Local (NL) block. To overcome the excessive computation of the NL block, the authors first propose a Simplified NL block. Because this simplified block is structurally similar to the SE block, they then refine it with ideas from SE to obtain the GC block.

The SE block introduced in SENet uses global context for weight recalibration across different channels, adjusting channel dependencies. However, this method does not fully leverage global context information.
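The squeeze, excitation, and scale steps of an SE block can be sketched in a few lines. This is a minimal plain-Python illustration rather than the SENet implementation: the function name `se_recalibrate`, the toy weights, and the omission of biases are all assumptions made for readability.

```python
import math

def se_recalibrate(x, w1, w2):
    """x: list of C channels, each a flat list of H*W activations.
    w1: (C/r) x C weights, w2: C x (C/r) weights (biases omitted)."""
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in x]
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: every position in a channel is multiplied by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, x)]
    return scaled, gates

# Toy input: C=4 channels, H*W=2 positions, reduction ratio r=2 (made-up values).
x = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [5.0, 1.0]]
w1 = [[0.5, -0.5, 0.25, 0.0], [0.0, 0.5, 0.5, -0.25]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, 0.5]]
out, gates = se_recalibrate(x, w1, w2)
print(gates)
```

The global context here is a plain spatial average, and it only rescales channels. That is the limitation the paragraph above refers to.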

The goal of capturing long-range dependencies is to achieve global understanding of visual scenes, which is effective for many computer vision tasks such as image classification, video classification, object detection, and semantic segmentation. NLNet models long-range dependencies through the self-attention mechanism.

The authors experimented with NLNet: they selected 6 images from the COCO dataset and visualized the attention maps for different query points.

For different query points, the attention maps are almost identical, indicating that NLNet learns a query-independent dependency. Although NLNet aims to compute a specific global context for each position, the visualization results and experimental data indicate that the learned global context is largely independent of the query position.
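One way to put a number on "almost identical" is to average a distance between the attention maps of different query positions. The sketch below uses cosine distance as an illustrative choice. The helper names and the toy attention maps are assumptions, not values or code from the paper.

```python
import math

def cosine_distance(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_pairwise_distance(attention_maps):
    """attention_maps[i] is the flattened attention map for query position i."""
    n = len(attention_maps)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_distance(attention_maps[i], attention_maps[j])
            pairs += 1
    return total / pairs

# Toy maps over 4 key positions (made-up numbers, not results from the paper).
query_independent = [[0.1, 0.2, 0.6, 0.1]] * 3
query_specific = [[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]]
print(mean_pairwise_distance(query_independent))
print(mean_pairwise_distance(query_specific))
```

A value near zero means every query attends to the same places, which is the behavior reported for NLNet. A value near one means each query has its own attention pattern.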

Based on these findings, the authors remove the unnecessary per-query computation to lower the computational burden, and borrow from SENet's design. The resulting GCNet combines the advantages of both: it keeps NLNet's global context modeling capability while being as lightweight as SENet.

The authors first address the issues with NLNet by proposing a Simplified NL block, which significantly reduces the computational load.
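To get a rough feel for the saving, the sketch below counts attention weights only, not full FLOPs. It assumes a 128×128 feature map, the spatial size of the test tensor used later in this article, and the function name `attention_entries` is made up for illustration.

```python
def attention_entries(height, width):
    """Count attention weights for one feature map (illustrative only)."""
    n_p = height * width
    nl_entries = n_p * n_p   # NL block: one weight per (query, key) pair
    snl_entries = n_p        # Simplified NL: one shared weight per key position
    return n_p, nl_entries, snl_entries

n_p, nl, snl = attention_entries(128, 128)
print(n_p, nl, snl)
```

The pairwise map grows quadratically with the number of positions, while the shared map grows linearly, so the gap widens quickly on high-resolution feature maps.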

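The Non-Local block in NLNet can be represented as:

$$z_i = x_i + W_z \sum_{j=1}^{N_p} \frac{f(x_i, x_j)}{\mathcal{C}(x)} (W_v \cdot x_j)$$

The input feature map is defined as $x = \{x_i\}_{i=1}^{N_p}$, with $N_p$ being the number of positions (for an image, $N_p = H \cdot W$). $x$ and $z$ are the input and output of the NL block. $i$ is the index of the query position, and $j$ enumerates all possible positions. $f(x_i, x_j)$ represents the relationship between positions $i$ and $j$, and $\mathcal{C}(x)$ is a normalization factor. $W_z$ and $W_v$ are linear transformation matrices.

NLNet proposed four similarity computation models, which perform roughly the same. The authors build on Embedded Gaussian, for which the normalized pairwise weight $\omega_{ij} = f(x_i, x_j) / \mathcal{C}(x)$ can be expressed as:

$$\omega_{ij} = \frac{\exp(\langle W_q x_i, W_k x_j \rangle)}{\sum_{m} \exp(\langle W_q x_i, W_k x_m \rangle)}$$

The Simplified NL block instead computes a single global attention map shared by all query positions, which can be expressed as:

$$z_i = x_i + W_v \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} x_j$$

Here, $W_k$ and $W_v$ are both 1×1 convolutions; see the block diagrams in the paper for the implementation details.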
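The Simplified NL computation can be sketched in plain Python: a softmax over one score per position gives a single attention map, the attention-weighted sum of the position features gives a global context vector, and a linear transform of that vector is added to every position. The names `simplified_nl`, `w_k`, and `w_v` and the toy numbers are assumptions for illustration, and biases are omitted.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def simplified_nl(x, w_k, w_v):
    """x: list of N_p position vectors, each of length C.
    w_k: length-C vector (a 1x1 conv to one channel).
    w_v: C x C matrix (a 1x1 conv that keeps the channel count)."""
    channels = len(x[0])
    # One attention weight per position, shared by every query.
    alpha = softmax([sum(a * b for a, b in zip(w_k, x_j)) for x_j in x])
    # Global context: attention-weighted sum of the position vectors.
    context = [sum(a * x_j[c] for a, x_j in zip(alpha, x)) for c in range(channels)]
    # Transform the context once, then add the same vector at every position.
    delta = [sum(w * c for w, c in zip(row, context)) for row in w_v]
    z = [[xi + d for xi, d in zip(x_i, delta)] for x_i in x]
    return z, alpha

# Toy input: N_p=3 positions, C=2 channels (made-up values).
x = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
w_k = [0.5, -0.5]
w_v = [[1.0, 0.0], [0.5, 0.5]]
z, alpha = simplified_nl(x, w_k, w_v)
print(alpha)
```

Because the added term does not depend on the query position, it is computed once per image instead of once per position.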

Although the Simplified NL block reduces the computational load, it does not improve accuracy. The authors observed that it shares certain similarities with the SE block, and combined the two designs to propose the GC block.

The GC block uses the mechanism of the Simplified NL block for context modeling, which makes effective use of global context information, and borrows the bottleneck design of the SE block for the Transform phase.
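Written as an equation, the GC block with additive fusion can be sketched as follows. The symbol names are chosen here for illustration: $x_i$ and $z_i$ are the input and output at position $i$, $N_p$ is the number of positions, $W_k$ is the 1×1 convolution that produces the attention scores, $W_{v1}$ and $W_{v2}$ are the two 1×1 convolutions of the bottleneck, and $\mathrm{LN}$ is layer normalization. This mirrors the default channel_add path of the ContextBlock code in this article.

$$z_i = x_i + W_{v2}\,\mathrm{ReLU}\left(\mathrm{LN}\left(W_{v1} \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)}\, x_j\right)\right)$$

The inner sum is the context modeling step (attention pooling), the bottleneck around it is the transform step, and the outer addition is the fusion step.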

The GC block is positioned between every two stages in ResNet. Below is the official implementation of the GC block (modified based on mmdetection):

Code Implementation:

import torch
from torch import nn

class ContextBlock(nn.Module):
    def __init__(self, inplanes, ratio, pooling_type='att',
                 fusion_types=('channel_add',)):
        super(ContextBlock, self).__init__()
        valid_fusion_types = ['channel_add', 'channel_mul']

        assert pooling_type in ['avg', 'att']
        assert isinstance(fusion_types, (list, tuple))
        assert all([f in valid_fusion_types for f in fusion_types])
        assert len(fusion_types) > 0, 'at least one fusion should be used'

        self.inplanes = inplanes
        self.ratio = ratio
        self.planes = int(inplanes * ratio)
        self.pooling_type = pooling_type
        self.fusion_types = fusion_types

        if pooling_type == 'att':
            self.conv_mask = nn.Conv2d(inplanes, 1, kernel_size=1)
            self.softmax = nn.Softmax(dim=2)
        else:
            self.avg_pool = nn.AdaptiveAvgPool2d(1)
        if 'channel_add' in fusion_types:
            self.channel_add_conv = nn.Sequential(
                nn.Conv2d(self.inplanes, self.planes, kernel_size=1),
                nn.LayerNorm([self.planes, 1, 1]),
                nn.ReLU(inplace=True),  # yapf: disable
                nn.Conv2d(self.planes, self.inplanes, kernel_size=1))
        else:
            self.channel_add_conv = None
        if 'channel_mul' in fusion_types:
            self.channel_mul_conv = nn.Sequential(
                nn.Conv2d(self.inplanes, self.planes, kernel_size=1),
                nn.LayerNorm([self.planes, 1, 1]),
                nn.ReLU(inplace=True),  # yapf: disable
                nn.Conv2d(self.planes, self.inplanes, kernel_size=1))
        else:
            self.channel_mul_conv = None


    def spatial_pool(self, x):
        batch, channel, height, width = x.size()
        if self.pooling_type == 'att':
            input_x = x
            # [N, C, H * W]
            input_x = input_x.view(batch, channel, height * width)
            # [N, 1, C, H * W]
            input_x = input_x.unsqueeze(1)
            # [N, 1, H, W]
            context_mask = self.conv_mask(x)
            # [N, 1, H * W]
            context_mask = context_mask.view(batch, 1, height * width)
            # [N, 1, H * W]
            context_mask = self.softmax(context_mask)
            # [N, 1, H * W, 1]
            context_mask = context_mask.unsqueeze(-1)
            # [N, 1, C, 1]
            context = torch.matmul(input_x, context_mask)
            # [N, C, 1, 1]
            context = context.view(batch, channel, 1, 1)
        else:
            # [N, C, 1, 1]
            context = self.avg_pool(x)
        return context

    def forward(self, x):
        # [N, C, 1, 1]
        context = self.spatial_pool(x)
        out = x
        if self.channel_mul_conv is not None:
            # [N, C, 1, 1]
            channel_mul_term = torch.sigmoid(self.channel_mul_conv(context))
            out = out * channel_mul_term
        if self.channel_add_conv is not None:
            # [N, C, 1, 1]
            channel_add_term = self.channel_add_conv(context)
            out = out + channel_add_term
        return out

if __name__ == "__main__":
    in_tensor = torch.ones((12, 64, 128, 128))

    cb = ContextBlock(inplanes=64, ratio=1. / 16., pooling_type='att')
    
    out_tensor = cb(in_tensor)

    print(in_tensor.shape)
    print(out_tensor.shape)
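As a rough check on how light the block is, the learnable parameters of the configuration above can be counted by hand. This sketch assumes the default settings used in the test (attention pooling, channel_add fusion only, biases enabled on every convolution, and an affine LayerNorm). The function name `gc_block_param_count` is hypothetical.

```python
def gc_block_param_count(inplanes, ratio):
    """Parameter count for pooling_type='att', fusion_types=('channel_add',)."""
    planes = int(inplanes * ratio)
    conv_mask = inplanes * 1 + 1                 # 1x1 conv to one channel, plus bias
    reduce_conv = inplanes * planes + planes     # first 1x1 conv, plus bias
    layer_norm = 2 * planes                      # weight and bias over [planes, 1, 1]
    expand_conv = planes * inplanes + inplanes   # second 1x1 conv, plus bias
    return conv_mask + reduce_conv + layer_norm + expand_conv

print(gc_block_param_count(64, 1. / 16.))
```

The count is dominated by the two bottleneck convolutions, so it scales with 2 × inplanes × planes and shrinks as the ratio gets smaller.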

The module was tested. Note that if ratio × inplanes < 1, int(inplanes * ratio) truncates to 0 and the bottleneck would end up with zero channels, which causes problems because the number of channels cannot be less than 1.
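One possible guard is to clamp the bottleneck width to at least one channel. This is a suggestion rather than part of the official code, and the helper name `bottleneck_planes` and its `min_planes` parameter are hypothetical.

```python
def bottleneck_planes(inplanes, ratio, min_planes=1):
    """Bottleneck width that never drops below min_planes channels."""
    return max(min_planes, int(inplanes * ratio))

print(bottleneck_planes(64, 1. / 16.))  # unchanged when the product is >= 1
print(bottleneck_planes(8, 1. / 16.))   # clamped instead of truncating to 0
```

Inside ContextBlock this would replace the line that sets self.planes. Raising an error for such configurations is an equally reasonable choice.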

Experimental Section

The authors built on mmdetection, adding the GC block, and ran the following ablation experiments.

  • From the perspective of block design, Simplified NL performs almost identically to NL but with fewer parameters. Using the GC block at every stage improves performance by 2-3% over the baseline.
  • From the perspective of insertion position, the best results were achieved by adding the block after the add operation in the residual block.
  • From the perspective of different stages, applying the block in three stages yields the best results, outperforming the baseline by 1-3%.
  • In terms of bottleneck design, the authors tested variants using scaling, ReLU, LayerNorm, etc. The Simplified NL block with a 1×1 convolution as the transform yielded the best results, but its computational load is too high.

  • For scaling factor design: the best effect was found at ratio=1/4.

  • For pooling and feature fusion design: experiments combining average pooling and attention pooling with add and scale methods showed that attention pooling + add yielded the best results.
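The pooling and fusion options compared in the last bullet can be sketched in plain Python. Average pooling is the special case of attention pooling in which every position gets the same score. All function names and numbers below are illustrative assumptions, and the fusion helpers take the per-channel term directly, skipping the bottleneck transform for brevity.

```python
import math

def average_pool(x):
    """x: list of C channels, each a flat list of H*W activations."""
    return [sum(ch) / len(ch) for ch in x]

def attention_pool(x, logits):
    """logits: one score per position, e.g. the output of a 1x1 conv."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v for w, v in zip(weights, ch)) for ch in x]

def fuse_add(x, term):
    # Broadcast-add one value per channel to every position.
    return [[v + t for v in ch] for ch, t in zip(x, term)]

def fuse_scale(x, term):
    # Multiply every position by sigmoid(term) of its channel.
    return [[v / (1.0 + math.exp(-t)) for v in ch] for ch, t in zip(x, term)]

# Toy input: C=2 channels, H*W=4 positions (made-up values).
x = [[1.0, 2.0, 3.0, 6.0], [0.0, 4.0, 4.0, 0.0]]
uniform = attention_pool(x, [0.0, 0.0, 0.0, 0.0])
peaked = attention_pool(x, [0.0, 0.0, 0.0, 50.0])
print(average_pool(x), uniform, peaked)
```

With uniform scores the two pooling methods agree, while a strongly peaked score makes the context collapse onto a single position. Attention pooling lets the network learn where on that spectrum to sit.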

Experiments were conducted on the ImageNet dataset, achieving an improvement of about 1%.

On the Kinetics action recognition dataset, a similar improvement of about 1% was also achieved.

Conclusion: GCNet combines the advantages of SENet and Non Local, optimizing global context modeling capabilities while maintaining relatively low computational load. Detailed ablation experiments demonstrated its effectiveness in visual tasks such as object detection, image classification, and action recognition. This paper is worth reading multiple times.

References:

Paper link: https://arxiv.org/abs/1904.11492

Official implementation code: https://github.com/xvjiarui/GCNet

Core code in the article: https://github.com/pprp/SimpleCVReproduction/tree/master/attention/GCBlock
