Understanding Capsule Neural Networks


From: Blog Garden | Author: CZiFan
Background
Geoffrey Hinton is one of the pioneers of deep learning and a co-inventor of classic neural-network training algorithms such as backpropagation. He and his team proposed a novel neural network built from structures called capsules, together with a dynamic routing algorithm for training capsule networks.
Research Problem
Traditional CNNs have flaws (explained in detail below). To address these shortcomings, Hinton proposed a network that is more effective for image processing: the capsule network. It keeps the advantages of CNNs while also modeling information that CNNs discard, such as the relative positions and orientations of parts, thereby improving recognition performance.
Research Motivation
The Flaws of CNNs
CNNs focus on detecting important features in image pixels. Consider a simple face-detection task: a face is composed of an oval face outline, two eyes, a nose, and a mouth. Under the CNN paradigm, the mere presence of these parts produces a strong activation, so the spatial relationships among them matter far less than they should.
As shown in the figure below, the right image is not a face, yet it contains all the parts a face needs, so a CNN is likely to classify it as a face based on those parts alone, producing an incorrect result.
[Figure: left, a normal face; right, the same facial parts rearranged into something that is not a face but contains every feature a CNN looks for]
Revisiting how CNNs work: high-level features are a weighted sum of low-level features. The activations of the previous layer are multiplied by the weights of the next layer's neurons and summed, then passed through a nonlinear activation function. In such an architecture, the spatial relationships between high-level and low-level features become blurred (I believe some relationship survives, but it is not well exploited). CNNs mitigate this by using max pooling layers or strided convolutions to enlarge the receptive field of subsequent convolutional kernels (I believe max pooling, in any case, loses information, sometimes even critical information).
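To make that information loss concrete, here is a minimal sketch (my own illustration, not from the original post): two inputs whose features sit at different positions inside each pooling window produce exactly the same pooled output, so the positional information is unrecoverable downstream.

```python
import numpy as np

# 2x2 max pooling keeps the strongest activation in each window
# but discards *where* inside the window it occurred.
def max_pool_2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[0, 9, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 9, 0]])
b = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 9],
              [0, 0, 0, 0]])

# Different spatial layouts of the same features pool to identical outputs.
print(max_pool_2x2(a))  # [[9 0] [0 9]]
print(max_pool_2x2(b))  # [[9 0] [0 9]]
```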
Inverse Graphics
Computer graphics constructs visual images based on hierarchical representations of geometric data, considering the relative positions of objects, with the geometric relationships and orientations between objects represented in matrices. Specific software accepts these representations as input and transforms them into images on the screen (rendering).
Inspired by this, Hinton believes that what the brain does is the opposite of rendering, called inverse graphics. From the visual information received by the eyes, the brain parses a hierarchical representation of the world it exists in and attempts to match learned patterns with relationships stored in the brain, thus achieving recognition. Note that the object representations in the brain do not depend on the viewing angle.
Therefore, what we need to consider is how to model these hierarchical relationships inside a neural network. In computer graphics, the relationships between objects in a three-dimensional scene are represented by pose, which essentially combines translation and rotation. Hinton proposed that preserving the hierarchical pose relationships between object parts is crucial for correctly classifying and recognizing objects.

Capsule networks encode the relative relationships between objects, represented numerically as a 4D pose matrix. Once the model has pose information, it can easily understand that what it sees is merely a different angle of something it has seen before. As shown in the figure below, the human eye easily recognizes the Statue of Liberty from different angles; plain CNNs struggle with this, whereas capsule networks that incorporate pose information can recognize the Statue of Liberty from different angles as well.
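For readers unfamiliar with poses: in graphics a pose is conventionally a 4×4 homogeneous matrix combining a rotation and a translation. A minimal sketch (my own illustration; the angle and translation values are arbitrary):

```python
import numpy as np

# A pose packs a 3x3 rotation R and a 3D translation t into one 4x4 matrix.
theta = np.pi / 6  # rotate 30 degrees about the z-axis (hypothetical values)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
t = np.array([2.0, 0.0, 5.0])

pose = np.eye(4)
pose[:3, :3] = R
pose[:3, 3] = t

# Applying the pose to a point in homogeneous coordinates yields the same
# point expressed under a new viewpoint.
p = np.array([1.0, 0.0, 0.0, 1.0])
print(pose @ p)
```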
[Figure: the Statue of Liberty photographed from several different angles]
Advantages of Capsule Networks
  • Since capsule networks incorporate pose information, they can achieve good representations from only a small amount of data, a significant improvement over CNNs. For example, to learn to recognize handwritten digits the human brain needs a few dozen examples, hundreds at most, while a CNN needs tens of thousands of examples to train well, which is clearly too brute-force!
  • They are closer to the human brain’s way of thinking, better modeling the hierarchical relationships of internal knowledge representations in neural networks. The intuition behind capsules is very simple and elegant.
Disadvantages of Capsule Networks
  • Current implementations of capsule networks are significantly slower than other modern deep learning models (I think this is due to the coupling-coefficient updates and the stacked convolutions), so improving training efficiency remains a major challenge.
Research Content
What is a Capsule
The following excerpt from Hinton et al.'s "Transforming Auto-encoders" describes the capsule concept.
Artificial neural networks should not pursue viewpoint invariance in “neuronal” activity (using a single scalar output to summarize the activity of a local pool of repeated feature detectors), but should use local “capsules” that perform some quite complex internal computations on their inputs and then encapsulate the results of these computations into a small vector containing rich information. Each capsule learns to recognize a visual entity implicitly defined within a limited observation condition and deformation range and outputs the probability of the entity’s existence within that limited range along with a set of “instance parameters”, which may include the precise pose, lighting conditions, and deformation information relative to that implicitly defined typical version of the visual entity. When a capsule operates correctly, the probability of the visual entity’s existence has local invariance—when the entity moves along the appearance manifold within the limited range covered by the capsule, the probability does not change. However, the instance parameters are “covariant”—as the observation conditions change and the entity moves along the appearance manifold, the instance parameters will change accordingly, as they represent the intrinsic coordinates of the entity on the appearance manifold.
In simpler terms, it can be understood as:
  • Artificial neurons output a single scalar. Convolutional networks use convolutional kernels to stack the results calculated by the same kernel across different regions of a two-dimensional matrix to form the output of the convolutional layer.
  • Viewpoint invariance is achieved through max pooling: max pooling scans regions of a two-dimensional matrix and selects the largest value in each region, giving the activity invariance we want (i.e., if we slightly adjust the input, the output stays the same). In other words, if we slightly shift the object we want to detect in the input image, the model can still detect it.
  • Pooling layers lose valuable information and do not consider the relative spatial relationships between encoded features, thus we should use capsules. All the important information regarding the states of features detected by capsules will be encapsulated in vector form (neurons are scalars).
The comparison between capsules and artificial neurons is as follows:
[Figure: comparison table: a capsule receives and outputs vectors (affine transform of inputs, weighted sum, squash nonlinearity), while a traditional neuron receives and outputs scalars (weighted sum, scalar activation function)]
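To make the capsule side of that comparison concrete, here is a sketch of the standard capsule operations from Sabour et al. (2017): affine transform, weighted sum, and the squash nonlinearity. The variable names and toy shapes are mine:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash nonlinearity: keeps a vector's direction, maps its length to (0, 1)."""
    norm_sq = np.sum(s ** 2)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

# One step of a higher-level capsule j receiving inputs from 3 lower-level capsules i:
rng = np.random.default_rng(0)
u = rng.normal(size=(3, 8))          # outputs of 3 lower-level capsules (8D each)
W = rng.normal(size=(3, 16, 8))      # per-pair affine transforms W_ij
c = np.array([0.2, 0.5, 0.3])        # coupling coefficients (sum to 1, set by routing)

u_hat = np.einsum('iab,ib->ia', W, u)  # prediction vectors u_hat_{j|i} = W_ij u_i
s = (c[:, None] * u_hat).sum(axis=0)   # weighted sum s_j
v = squash(s)                          # output vector v_j
print(np.linalg.norm(v))               # its length encodes existence probability, < 1
```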

Dynamic Routing Algorithm

A lower-level capsule i needs to decide how to send its output vector to each higher-level capsule j. It does so through a scalar weight c_ij: the output vector is multiplied by this weight before being sent to the higher-level capsule as input. Regarding the weights c_ij, the following should be known:
  • All weights are non-negative scalars
  • For each lower-level capsule i, the weights c_ij sum to 1
  • For each lower-level capsule i, the number of weights equals the number of higher-level capsules
  • These weights are determined by an iterative dynamic routing algorithm
Lower-level capsules send their output to the higher-level capsules that "agree" with it. The pseudo-code for the algorithm is shown below (a runnable sketch follows the figure):
[Figure: pseudo-code of the dynamic routing algorithm]
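For reference, here is a runnable NumPy reconstruction of the routing procedure (based on Procedure 1 in Sabour et al., 2017; variable names are mine, and the shapes match the MNIST CapsNet described later: 1152 eight-dimensional primary capsules routed to 10 sixteen-dimensional digit capsules):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors, shape (num_lower, num_higher, dim_higher)."""
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))            # routing logits, init to 0
    for _ in range(num_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)         # softmax over j for each i
        s = (c[:, :, None] * u_hat).sum(axis=0)      # weighted sum per higher capsule
        v = squash(s)                                # (num_higher, dim_higher)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)    # agreement update: b_ij += u_hat . v_j
    return v

u_hat = np.random.default_rng(0).normal(size=(1152, 10, 16))
v = dynamic_routing(u_hat, num_iters=3)
print(v.shape)  # (10, 16)
```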
The weight updates can be intuitively understood using the following diagram.
[Figure: routing by agreement: an orange prediction vector from one lower-level capsule compared with the purple output vectors v1 and v2 of two higher-level capsules]
Here the outputs of two higher-level capsules are the purple vectors v1 and v2, the orange vectors are the predictions û_1|1 and û_2|1 from one lower-level capsule, and the black vectors are inputs from other lower-level capsules. On the left, the purple output v1 and the orange input û_1|1 point in opposite directions, so they are not similar and their dot product is negative; when the routing coefficients are updated, c_11 decreases. On the right, the purple output v2 and the orange input û_2|1 point in the same direction, so they are similar, and the routing coefficient c_12 increases. Repeating this process over all higher-level capsules and all their inputs yields a set of routing coefficients that best matches the outputs of lower-level capsules with the outputs of higher-level capsules.
How many routing iterations should be used? The paper tested a range of values on the MNIST and CIFAR datasets and reached the following conclusions:
  • More iterations often lead to overfitting
  • In practice, it is recommended to use 3 iterations
Overall Framework
CapsNet consists of two parts: an encoder and a decoder. The first three layers are the encoder, and the last three layers are the decoder:
  • First layer: Convolutional layer
  • Second layer: PrimaryCaps (Primary Capsule) layer
  • Third layer: DigitCaps (Digit Capsule) layer
  • Fourth layer: First fully connected layer
  • Fifth layer: Second fully connected layer
  • Sixth layer: Third fully connected layer
Encoder
[Figure: the encoder: convolutional layer, PrimaryCaps layer, and DigitCaps layer]
The encoder takes a 28×28 MNIST digit image as input and encodes it into a 16-dimensional vector of instance parameters.
Convolutional Layer
  • Input: 28×28 image (grayscale)
  • Output: 20×20×256 tensor
  • Convolutional kernels: 256 kernels of size 9×9×1 with a stride of 1
  • Activation function: ReLU
PrimaryCaps Layer (32 Capsules)
  • Input: 20×20×256 tensor
  • Output: 6×6×8×32 tensor (32 capsule channels, each a 6×6 grid of 8-dimensional capsules, 1152 capsules in total)
  • Convolutional kernels: 8 kernels of size 9×9×256 per capsule channel, with a stride of 2 (the shape check below verifies these sizes)
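A quick sanity check of the stated sizes, using the usual convolution output formula (my own illustration):

```python
# output size = floor((input - kernel) / stride) + 1
def conv_out(size, kernel, stride):
    return (size - kernel) // stride + 1

print(conv_out(28, 9, 1))  # 20 -> the 20x20x256 output of the first conv layer
print(conv_out(20, 9, 2))  # 6  -> the 6x6 spatial grid of the PrimaryCaps layer
```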
DigitCaps Layer (10 Capsules)
  • Input: 6×6×8×32 tensor
  • Output: 16×10 matrix (one 16-dimensional vector per digit class)
Each of the 6×6×32 = 1152 eight-dimensional primary capsules is multiplied by its own 8×16 weight matrix W_ij to produce a prediction vector for each of the 10 digit capsules; the dynamic routing algorithm described above then determines the coupling coefficients.
Loss Function
[Figure: the margin loss formula]
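The figure showed the paper's margin loss, reconstructed here for reference (the constants m⁺ = 0.9, m⁻ = 0.1, and λ = 0.5 are the defaults from Sabour et al., 2017):

```latex
L_k = T_k \,\max(0,\; m^{+} - \lVert \mathbf{v}_k \rVert)^2
    + \lambda\,(1 - T_k)\,\max(0,\; \lVert \mathbf{v}_k \rVert - m^{-})^2
```

Here T_k = 1 if and only if digit class k is present, and ||v_k|| is the length of DigitCap k's output vector, i.e., its existence probability. The total loss sums L_k over all 10 digit capsules and adds the down-weighted reconstruction loss described in the decoder section.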
Decoder
[Figure: the decoder: three fully connected layers that reconstruct the input image from the 16-dimensional DigitCap vector]
The decoder receives the 16-dimensional vector of the correct DigitCap and learns to decode it back into a digit image (note that during training only the correct DigitCap's vector is used; the incorrect DigitCaps are ignored). The decoder acts as a regularizer: it takes the correct DigitCap's output as input and reconstructs a 28×28-pixel image, with a loss given by the Euclidean distance between the reconstructed image and the input image. The closer the reconstruction is to the input, the better; examples of reconstructed images are shown below.
[Figure: examples of reconstructed digits alongside the original inputs]
First Fully Connected Layer
  • Input: 16×10 matrix (the DigitCaps output with all but the correct capsule masked out, flattened to 160 values)
  • Output: 512-dimensional vector
Second Fully Connected Layer
  • Input: 512-dimensional vector
  • Output: 1024-dimensional vector
Third Fully Connected Layer
  • Input: 1024-dimensional vector
  • Output: 784-dimensional vector (reshaped to a 28×28 image)
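Putting the three layers together, here is a forward-pass sketch of the decoder (my own illustration: the weights are random placeholders just to show the shapes, and the masked class index is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

# Hypothetical randomly initialized weights for FC 160 -> 512 -> 1024 -> 784:
W1, b1 = rng.normal(scale=0.01, size=(512, 160)),  np.zeros(512)
W2, b2 = rng.normal(scale=0.01, size=(1024, 512)), np.zeros(1024)
W3, b3 = rng.normal(scale=0.01, size=(784, 1024)), np.zeros(784)

digit_caps = rng.normal(size=(10, 16))   # DigitCaps output (10 capsules, 16D each)
mask = np.zeros((10, 1)); mask[3] = 1.0  # keep only the correct capsule (say, class 3)
x = (digit_caps * mask).reshape(-1)      # flatten to a 160-vector

h = relu(W1 @ x + b1)
h = relu(W2 @ h + b2)
reconstruction = sigmoid(W3 @ h + b3)    # 784 pixel intensities in (0, 1)
print(reconstruction.reshape(28, 28).shape)  # a 28x28 reconstructed image
```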