This article is reproduced from Zhihu, with the author’s permission.
https://zhuanlan.zhihu.com/p/146130215
I had recently been looking at the self-attention in the DETR paper, and since attention mechanisms come up frequently in our lab meetings, I spent this week organizing the main attention mechanisms and reading the relevant source code to understand how they are implemented.
The attention mechanism can be regarded as a resource-allocation mechanism: resources that were originally distributed evenly are reallocated according to the importance of the attended objects, so that important units receive more and unimportant or low-quality units receive less. In the design of deep neural network architectures, the resource being allocated is essentially the weights.
Visual attention comes in several forms, but the core idea is the same: find the correlations within the existing data and highlight the important features. There are channel attention, pixel (spatial) attention, multi-level attention, and so on, and self-attention from NLP has also been introduced into vision.
References: http://papers.nips.cc/paper/7181-attention-is-all-you-need
Reference Material: https://zhuanlan.zhihu.com/p/48508221
GitHub: https://github.com/huggingface/transformers
Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Because the decoder is permutation-invariant with respect to position, positional information must be supplied explicitly: in DETR, each pixel carries not only its feature values but also its position, which is essential.

All encoders are structurally identical, but they do not share parameters. Each encoder layer can be decomposed into two sub-layers:

In the Transformer, these two sub-layers are Multi-Head Self-Attention and a Position-wise Feed-Forward Network (FFN).


Self-Attention
Self-Attention is the core of the Transformer. It can be understood as mapping a query and a set of key-value pairs derived from the input to an output: the output is a weighted sum of the values, with the weights computed by self-attention from the query and the corresponding keys.
The specific implementation details are as follows:
In self-attention, each word has three different vectors: the Query vector, the Key vector, and the Value vector, all of length 64. They are obtained by multiplying the embedding vector X by three different weight matrices, where the sizes of the three matrices are also the same, all being 512×64.

1) Convert each input word into an embedding vector;
2) Obtain the q, k, v vectors from the embedding vector;
3) Compute a score for each pair of vectors: score = q · k;
4) For gradient stability, the Transformer normalizes the score by dividing by sqrt(d_k);
5) Apply the softmax activation to the scores;
6) Multiply each value vector v by its softmax weight to obtain the weighted values;
7) Sum the weighted values to get the final output z.
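The seven steps above can be condensed into a few lines of PyTorch. The following is a minimal single-head sketch, not the DETR code: the sizes (4 tokens, d_model = 512, d_k = 64, matching the projections described above) and the random weight matrices Wq, Wk, Wv are illustrative assumptions only.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_k, seq_len = 512, 64, 4          # example sizes, matching the 512x64 projections above
X = torch.randn(seq_len, d_model)           # embedding vectors of the input words
Wq = torch.randn(d_model, d_k)              # the three projection matrices (random here)
Wk = torch.randn(d_model, d_k)
Wv = torch.randn(d_model, d_k)

q, k, v = X @ Wq, X @ Wk, X @ Wv            # step 2: obtain q, k, v
scores = q @ k.t() / d_k ** 0.5             # steps 3-4: score = q . k, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=-1)         # step 5: softmax over each row
z = weights @ v                             # steps 6-7: weighted sum of the values
print(z.shape)                              # (4, 64): one output vector per input word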






# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR Transformer class.

Copy-paste from torch.nn.Transformer with modifications:
    * positional encodings are passed in MHattention
    * extra LN at the end of encoder is removed
    * decoder returns a stack of activations from all decoding layers
"""
import copy
from typing import Optional, List

import torch
import torch.nn.functional as F
from torch import nn, Tensor


class Transformer(nn.Module):

    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False):
        super().__init__()

        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                          return_intermediate=return_intermediate_dec)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, mask, query_embed, pos_embed):
        # flatten NxCxHxW to HWxNxC
        bs, c, h, w = src.shape
        src = src.flatten(2).permute(2, 0, 1)
        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
        mask = mask.flatten(1)

        tgt = torch.zeros_like(query_embed)
        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)
        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)


class TransformerEncoder(nn.Module):

    def __init__(self, encoder_layer, num_layers, norm=None):
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm

    def forward(self, src,
                mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        output = src

        for layer in self.layers:
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos)

        if self.norm is not None:
            output = self.norm(output)

        return output


class TransformerDecoder(nn.Module):

    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt

        intermediate = []

        for layer in self.layers:
            output = layer(output, memory, tgt_mask=tgt_mask,
                           memory_mask=memory_mask,
                           tgt_key_padding_mask=tgt_key_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask,
                           pos=pos, query_pos=query_pos)
            if self.return_intermediate:
                intermediate.append(self.norm(output))

        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)

        if self.return_intermediate:
            return torch.stack(intermediate)

        return output


class TransformerEncoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward_post(self,
                     src,
                     src_mask: Optional[Tensor] = None,
                     src_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None):
        q = k = self.with_pos_embed(src, pos)
        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src

    def forward_pre(self, src,
                    src_mask: Optional[Tensor] = None,
                    src_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None):
        src2 = self.norm1(src)
        q = k = self.with_pos_embed(src2, pos)
        src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src2 = self.norm2(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
        src = src + self.dropout2(src2)
        return src

    def forward(self, src,
                src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
        return self.forward_post(src, src_mask, src_key_padding_mask, pos)


class TransformerDecoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward_post(self, tgt, memory,
                     tgt_mask: Optional[Tensor] = None,
                     memory_mask: Optional[Tensor] = None,
                     tgt_key_padding_mask: Optional[Tensor] = None,
                     memory_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None,
                     query_pos: Optional[Tensor] = None):
        q = k = self.with_pos_embed(tgt, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)
        return tgt

    def forward_pre(self, tgt, memory,
                    tgt_mask: Optional[Tensor] = None,
                    memory_mask: Optional[Tensor] = None,
                    tgt_key_padding_mask: Optional[Tensor] = None,
                    memory_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None,
                    query_pos: Optional[Tensor] = None):
        tgt2 = self.norm1(tgt)
        q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        return tgt

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(tgt, memory, tgt_mask, memory_mask,
                                    tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
        return self.forward_post(tgt, memory, tgt_mask, memory_mask,
                                 tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)


def _get_clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])


def build_transformer(args):
    return Transformer(
        d_model=args.hidden_dim,
        dropout=args.dropout,
        nhead=args.nheads,
        dim_feedforward=args.dim_feedforward,
        num_encoder_layers=args.enc_layers,
        num_decoder_layers=args.dec_layers,
        normalize_before=args.pre_norm,
        return_intermediate_dec=True,
    )


def _get_activation_fn(activation):
    """Return an activation function given a string"""
    if activation == "relu":
        return F.relu
    if activation == "gelu":
        return F.gelu
    if activation == "glu":
        return F.glu
    raise RuntimeError(F"activation should be relu/gelu, not {activation}.")
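As a quick sanity check of the interface above, the following hedged snippet runs the Transformer class on random tensors; the sizes used here (batch 2, 256 channels, a 25×34 feature map, 100 object queries) are illustrative assumptions.

import torch

model = Transformer(d_model=256, nhead=8, num_encoder_layers=2,
                    num_decoder_layers=2, return_intermediate_dec=True)
bs, c, h, w = 2, 256, 25, 34
src = torch.randn(bs, c, h, w)                  # backbone feature map
mask = torch.zeros(bs, h, w, dtype=torch.bool)  # True marks padded pixels; none here
query_embed = torch.randn(100, c)               # learned object queries
pos_embed = torch.randn(bs, c, h, w)            # positional encoding

hs, memory = model(src, mask, query_embed, pos_embed)
print(hs.shape)      # (num_decoder_layers, bs, 100, c) when return_intermediate_dec=True
print(memory.shape)  # (bs, c, h, w)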
Soft Attention
Soft attention assigns continuous weights in [0, 1] and focuses on regions or channels. It is deterministic: the attention weights are produced by the network itself after training, and because they are differentiable, gradients can be propagated through the network so that the weights are learned by ordinary forward and backward passes.
1. Spatial Domain Attention (Spatial Transformer Network)
Paper link: http://papers.nips.cc/paper/5854-spatial-transformer-networks
GitHub link: https://github.com/fxia22/stn.pytorch
Spatial attention can be understood as letting the neural network decide where to look. Through the attention mechanism, the spatial information in the original image is transformed into another space while the key information is preserved. Many existing methods use this kind of module; one I have encountered is AlphaPose.
The spatial transformer is in fact an implementation of the attention mechanism: once trained, it can identify the regions of the image that deserve attention, and it can also perform rotation and scaling, so that important local information can be extracted from the image by transformation.




The main focus is on learning the spatial transformation matrix.
class STN(Module):
    def __init__(self, layout='BHWD'):
        super(STN, self).__init__()
        if layout == 'BHWD':
            self.f = STNFunction()
        else:
            self.f = STNFunctionBCHW()

    def forward(self, input1, input2):
        return self.f(input1, input2)


class STNFunction(Function):
    def forward(self, input1, input2):
        self.input1 = input1
        self.input2 = input2
        self.device_c = ffi.new("int *")
        output = torch.zeros(input1.size()[0], input2.size()[1], input2.size()[2], input1.size()[3])
        # print('device %d' % torch.cuda.current_device())
        if input1.is_cuda:
            self.device = torch.cuda.current_device()
        else:
            self.device = -1
        self.device_c[0] = self.device
        if not input1.is_cuda:
            my_lib.BilinearSamplerBHWD_updateOutput(input1, input2, output)
        else:
            output = output.cuda(self.device)
            my_lib.BilinearSamplerBHWD_updateOutput_cuda(input1, input2, output, self.device_c)
        return output

    def backward(self, grad_output):
        grad_input1 = torch.zeros(self.input1.size())
        grad_input2 = torch.zeros(self.input2.size())
        # print('backward device %d' % self.device)
        if not grad_output.is_cuda:
            my_lib.BilinearSamplerBHWD_updateGradInput(self.input1, self.input2, grad_input1, grad_input2, grad_output)
        else:
            grad_input1 = grad_input1.cuda(self.device)
            grad_input2 = grad_input2.cuda(self.device)
            my_lib.BilinearSamplerBHWD_updateGradInput_cuda(self.input1, self.input2, grad_input1, grad_input2, grad_output, self.device_c)
        return grad_input1, grad_input2
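The repository code above depends on a compiled C/CUDA extension (ffi, my_lib), so it will not run on its own. For reference, here is a hedged sketch of the same idea using only the built-in torch.nn.functional.affine_grid and grid_sample; the localization network, its layer sizes, and the input shape are assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSTN(nn.Module):
    """Predict a 2x3 affine matrix, then resample the input with it."""
    def __init__(self, in_channels=1):
        super().__init__()
        # Tiny localization network (illustrative sizes only)
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 2 * 3),
        )
        # Initialize the last layer to the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                   # learned spatial transformation matrix
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # bilinear sampling, like BilinearSamplerBHWD above

x = torch.randn(4, 1, 28, 28)   # e.g. a batch of MNIST-sized images
print(SimpleSTN(1)(x).shape)    # (4, 1, 28, 28)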
2. Channel Attention (CA)
Channel attention can be understood as letting the neural network decide what to look at; a typical representative is SENet. Each convolutional layer has many kernels, and each kernel corresponds to a feature channel. Compared with spatial attention, channel attention allocates resources across convolution channels, at a coarser granularity.

Paper: Squeeze-and-Excitation Networks (https://arxiv.org/abs/1709.01507)
GitHub link: https://github.com/moskomule/senet.pytorch
Squeeze operation: Uses global spatial features of each channel as the representation of that channel, generating statistics for each channel using global average pooling.
Excitation operation: learn the dependencies between channels and rescale the different feature maps according to these dependencies to obtain the final output.
The overall structure is shown in the figure:



Global average pooling is performed on the input features to obtain a 1×1×Channel descriptor.
Two fully connected layers then form a bottleneck: the channel count is first compressed by the reduction ratio and then expanded back to the original number of channels.
Finally, a sigmoid generates per-channel attention weights between 0 and 1, which are used to rescale the original input features channel by channel.
SE-ResNet‘s SE-Block
class SEBasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None,
                 *, reduction=16):
        super(SEBasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes, 1)
        self.bn2 = nn.BatchNorm2d(planes)
        self.se = SELayer(planes, reduction)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.se(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y.expand_as(x)
ResNet's BasicBlock
class BasicBlock(nn.Module):
    def __init__(self, inplanes, planes, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        if inplanes != planes:
            self.downsample = nn.Sequential(nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False),
                                            nn.BatchNorm2d(planes))
        else:
            self.downsample = lambda x: x
        self.stride = stride

    def forward(self, x):
        residual = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out += residual
        out = self.relu(out)

        return out
The main difference between the two is the added SELayer; for details, please refer to the source code.
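A quick way to see what the SELayer does is to feed it a random feature map; the shapes below are arbitrary example values.

import torch

layer = SELayer(channel=64, reduction=16)   # as defined above
x = torch.randn(8, 64, 32, 32)              # (batch, channels, H, W), arbitrary sizes
y = layer(x)
print(y.shape)                              # (8, 64, 32, 32): same shape, channels rescaled
# Inside: global average pooling -> FC (64 -> 4) -> ReLU -> FC (4 -> 64) -> sigmoid -> channel-wise scaling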
3. Hybrid Domain Model (Integrating Spatial and Channel Attention)
(1) Paper: Residual Attention Network for Image Classification (CVPR 2017 Open Access Repository)
The attention mechanism in this paper is soft attention with a mask, but the mask borrows the idea of residual networks: it is not only applied based on the information of the current layer, but the information of the previous layer is also passed down, which prevents the loss of information caused by masking from hindering the stacking of network layers.
The innovation of this paper is residual attention learning: not only is the masked feature tensor used as input to the next layer, the unmasked feature tensor is passed on as well, so richer features are retained and the key features receive better attention. In addition, a three-stage attention module is used to construct the whole attention structure.
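Concretely, residual attention learning combines the mask branch M(x) and the trunk branch F(x) as H(x) = (1 + M(x)) * F(x), so the original features still pass through even where the mask is close to zero. A minimal sketch, with trunk and mask standing in for the two branch outputs (hypothetical tensors):

import torch

trunk = torch.randn(2, 256, 56, 56)   # trunk branch output F(x) (illustrative shape)
mask = torch.rand(2, 256, 56, 56)     # mask branch output M(x), already in [0, 1] via sigmoid
out = (1 + mask) * trunk              # residual attention: features survive even where mask ~ 0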

(2) Paper: Dual Attention Network for Scene Segmentation (CVPR 2019 Open Access Repository)

Paper: Non-local Neural Networks (CVPR 2018 Open Access Repository)
GitHub link: https://github.com/AlexHex7/Non-local_pytorch
Local here refers to the receptive field. For example, the receptive field of a single convolution is the size of its kernel, and we usually choose kernels such as 3×3 or 5×5, which only consider a local neighborhood, so convolution is a local operation; pooling is local too. Non-local, by contrast, means the receptive field can be very large rather than confined to a neighborhood; a fully connected layer is non-local and global.
However, fully connected layers bring a large number of parameters and complicate optimization. Stacking convolutional layers does enlarge the receptive field, but the receptive field of any particular layer's kernel on the original image is still limited; this is unavoidable for local operations.
Some tasks, however, may need more information from the whole image, such as attention. If global information can be introduced at certain layers, it resolves the problem that local operations cannot see the global context and provides richer information to the following layers.
The general non-local operation for neural networks is defined as

y_i = (1 / C(x)) * Σ_j f(x_i, x_j) g(x_j)

where i indexes an output position, j enumerates all positions, f computes the similarity between x_i and x_j, g computes a representation of the input at position j, and C(x) is a normalization factor.
If implemented using a for loop according to the above formula, it would definitely be very slow. Additionally, applying non-local layers on large input sizes would also be computationally expensive.
A solution for the latter is to only introduce non-local layers at higher semantic levels. We can also further reduce computation by adding pooling layers to the embedding results.

First, apply a linear mapping to the input feature map X (a 1×1 convolution that compresses the number of channels) to obtain the θ, ϕ, and g features.
By reshaping, the spatial dimensions of the three features are flattened (all dimensions except the channel dimension are merged), and a matrix multiplication then yields something like a covariance matrix (this step is the important one: it computes the self-correlation of the features, i.e. the relationship of every pixel in every frame to all other pixels in all frames).
Then apply a softmax to the self-correlation matrix, by column or by row depending on the layout of g, to obtain weights between 0 and 1; these are the self-attention coefficients we need.
Finally, multiply the attention coefficients back into the feature matrix g, expand the number of channels back, and add a residual connection to the original input feature map X.
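Putting the four steps above together, here is a hedged PyTorch sketch of an embedded-Gaussian non-local block for 2D feature maps (see the GitHub link above for the full implementation); the channel compression to in_channels // 2 is the common choice from the paper, while the class name and the shapes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or in_channels // 2   # channel compression
        self.theta = nn.Conv2d(in_channels, inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, inter_channels, kernel_size=1)
        self.g = nn.Conv2d(in_channels, inter_channels, kernel_size=1)
        self.out = nn.Conv2d(inter_channels, in_channels, kernel_size=1)  # expand channels back
        self.inter_channels = inter_channels

    def forward(self, x):
        b, c, h, w = x.size()
        theta = self.theta(x).view(b, self.inter_channels, -1).permute(0, 2, 1)  # (b, HW, c')
        phi = self.phi(x).view(b, self.inter_channels, -1)                       # (b, c', HW)
        g = self.g(x).view(b, self.inter_channels, -1).permute(0, 2, 1)          # (b, HW, c')

        energy = torch.bmm(theta, phi)              # (b, HW, HW): pairwise similarities
        attention = F.softmax(energy, dim=-1)       # self-attention coefficients in [0, 1]
        y = torch.bmm(attention, g)                 # weighted sum over all positions
        y = y.permute(0, 2, 1).view(b, self.inter_channels, h, w)
        return self.out(y) + x                      # residual connection to the input X

x = torch.randn(2, 64, 20, 20)
print(NonLocalBlock2D(64)(x).shape)   # (2, 64, 20, 20)

The subsampling trick mentioned above (pooling on the embedding results) can be added by inserting a max-pool after self.phi and self.g to reduce the number of key and value positions.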
5. Position Attention (Position-wise Attention)
Paper: CCNet: Criss-Cross Attention for Semantic Segmentation (ICCV 2019 Open Access Repository)
GitHub link: https://github.com/speedinghzl/CCNet
The highlight of this paper is the clever way it reduces computation. In DANet above, the attention map measures the similarity between every pixel and every other pixel, with a spatial complexity of (H×W)×(H×W). This paper adopts a criss-cross idea instead: each pixel is compared only with the pixels in its own row and column, and by repeating the operation the similarity between every pair of pixels is obtained indirectly, reducing the spatial complexity to (H×W)×(H+W-1).
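To make the complexity claim concrete, the following small sketch (with arbitrarily chosen shapes) computes, for every pixel, only the similarities along its own row and its own column using einsum; it illustrates the counting argument only and is not the CUDA kernel used in the code below.

import torch

B, C, H, W = 1, 64, 17, 23                       # arbitrary example sizes
q = torch.randn(B, C, H, W)                      # query features
k = torch.randn(B, C, H, W)                      # key features

# Similarities between each pixel (h, w) and every pixel in the same row: (B, H, W, W)
row_energy = torch.einsum("bchw,bchv->bhwv", q, k)
# Similarities between each pixel (h, w) and every pixel in the same column: (B, H, W, H)
col_energy = torch.einsum("bchw,bcuw->bhwu", q, k)

print(row_energy.shape, col_energy.shape)
# Per pixel this is W + H scores, i.e. H + W - 1 after removing the duplicated centre pixel,
# versus H * W scores per pixel for the full attention map used in DANet.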



def _check_contiguous(*args):
    if not all([mod is None or mod.is_contiguous() for mod in args]):
        raise ValueError("Non-contiguous input")


class CA_Weight(autograd.Function):
    @staticmethod
    def forward(ctx, t, f):
        # Save context
        n, c, h, w = t.size()
        size = (n, h + w - 1, h, w)
        weight = torch.zeros(size, dtype=t.dtype, layout=t.layout, device=t.device)

        _ext.ca_forward_cuda(t, f, weight)  # Output

        ctx.save_for_backward(t, f)

        return weight

    @staticmethod
    @once_differentiable
    def backward(ctx, dw):
        t, f = ctx.saved_tensors

        dt = torch.zeros_like(t)
        df = torch.zeros_like(f)

        _ext.ca_backward_cuda(dw.contiguous(), t, f, dt, df)

        _check_contiguous(dt, df)

        return dt, df


class CA_Map(autograd.Function):
    @staticmethod
    def forward(ctx, weight, g):
        # Save context
        out = torch.zeros_like(g)
        _ext.ca_map_forward_cuda(weight, g, out)  # Output

        ctx.save_for_backward(weight, g)

        return out

    @staticmethod
    @once_differentiable
    def backward(ctx, dout):
        weight, g = ctx.saved_tensors

        dw = torch.zeros_like(weight)
        dg = torch.zeros_like(g)

        _ext.ca_map_backward_cuda(dout.contiguous(), weight, g, dw, dg)

        _check_contiguous(dw, dg)

        return dw, dg


ca_weight = CA_Weight.apply
ca_map = CA_Map.apply


class CrissCrossAttention(nn.Module):
    """ Criss-Cross Attention Module"""
    def __init__(self, in_dim):
        super(CrissCrossAttention, self).__init__()
        self.chanel_in = in_dim

        self.query_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=1)

        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        proj_query = self.query_conv(x)
        proj_key = self.key_conv(x)
        proj_value = self.value_conv(x)

        energy = ca_weight(proj_query, proj_key)
        attention = F.softmax(energy, 1)
        out = ca_map(attention, proj_value)
        out = self.gamma * out + x

        return out


__all__ = ["CrissCrossAttention", "ca_weight", "ca_map"]
Hard Attention
Hard attention is a 0/1 problem: a point either receives attention or it does not. It focuses on specific points, and any point in the image may attract attention. Hard attention is a stochastic prediction process that emphasizes dynamic change, and it is non-differentiable, so training usually relies on reinforcement learning.