CNN Pruning Using LSTM Concepts: Peking University’s Gate Decorator

Selected from arXiv

Author: Zhonghui You et al.

Translated by Machine Heart

Contributors: Siyuan, Yiming

Prune with the basic idea behind the LSTM's gating mechanism? Let the model decide for itself which convolution kernels can be discarded.

Recall that when we first learned about LSTMs, we found that they use a gating mechanism to remember important information and forget what is unimportant. Many later machine learning methods have been influenced by this gating mechanism, including the Highway Network and the GRU. Researchers from Peking University are no exception: they bring the gating mechanism into CNN pruning, letting the model itself decide which filters are less important and can therefore be removed.

In fact, pruning filters is one of the most effective ways to accelerate and compress convolutional neural networks. In this paper, the Peking University researchers propose a global filter pruning algorithm called the "Gate Decorator". The algorithm modifies a standard CNN module by multiplying its output, channel by channel, with a scale factor (the gate). When a scale factor is set to zero, the effect is the same as removing the corresponding filter.

The researchers use a Taylor expansion to estimate the change in the loss function caused by setting a scale factor to zero, and use this estimate as a global importance score for the filters. The least important filters are then removed. After pruning, all scale factors are merged back into the original modules, so no special operations or architectures are needed. In addition, to improve pruning accuracy, the researchers propose an iterative pruning framework called Tick-Tock.
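To make the scoring step concrete, here is a minimal PyTorch sketch (the paper's reference implementation is also PyTorch, but this helper is our own illustration, not the authors' code). Here `phi` stands for a per-channel gate parameter, and the score is the first-order Taylor estimate of how much the loss would change if a gate were set to zero.

```python
import torch

def taylor_importance(phi: torch.Tensor) -> torch.Tensor:
    """Score each filter by the estimated loss change if its gate were zeroed.

    A first-order Taylor expansion around the current gate value gives
    |L(phi_c = 0) - L(phi_c)| ≈ |phi_c * dL/dphi_c|, so the score only needs
    the gate value and its gradient from an ordinary backward pass.
    """
    assert phi.grad is not None, "run loss.backward() before scoring"
    return (phi.detach() * phi.grad.detach()).abs()
```

Because the scores of every gate in the network live on the same scale, they can be ranked together, which is what makes this a global rather than layer-by-layer pruning criterion.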


Figure 1: Illustration of filter pruning. The i-th layer has 4 filters (channels). If one is removed, its feature map disappears, and the input to layer i+1 becomes 3 channels.

Extensive experiments demonstrate the effectiveness of the proposed method. For example, the researchers achieve the state-of-the-art pruning ratio on ResNet-56, cutting floating-point operations (FLOPs) by 70% without a noticeable drop in accuracy. On ResNet-50 trained on ImageNet, they reduce FLOPs by 40% while exceeding the baseline model's top-1 accuracy by 0.31%. The study uses a variety of datasets, including CIFAR-10, CIFAR-100, CUB-200, ImageNet ILSVRC-12, and PASCAL VOC 2011.

The paper makes two main contributions: the Gate Decorator algorithm, which addresses the global filter importance ranking (GFIR) problem, and the Tick-Tock pruning framework, which improves pruning accuracy.

Specifically, the researchers show how to apply the Gate Decorator to the batch normalization operation, calling the resulting module Gated Batch Normalization (GBN). Given a pre-trained model, they convert its batch normalization modules into GBN before pruning; after pruning, they revert the GBN modules back to batch normalization. This way, no special operations or architectures need to be introduced into the model.
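The "revert GBN back to batch normalization" step can be done purely on the parameters, because a per-channel gate folds into BN's affine transform. The sketch below assumes the gate multiplies the whole BN output (consistent with "multiplying the output with a scale factor in the channel direction"); the helper name and signature are ours, not the official API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_gate_into_bn(bn: nn.BatchNorm2d, phi: torch.Tensor,
                      keep: torch.Tensor) -> nn.BatchNorm2d:
    """Merge a per-channel gate back into a plain BatchNorm2d after pruning.

    If z_out = phi * (gamma * z_hat + beta), the gate is absorbed by setting
    gamma' = phi * gamma and beta' = phi * beta, so no extra operation is left
    at inference time. `keep` is a boolean mask over the surviving channels.
    """
    folded = nn.BatchNorm2d(int(keep.sum()))
    folded.weight.copy_((phi * bn.weight)[keep])
    folded.bias.copy_((phi * bn.bias)[keep])
    folded.running_mean.copy_(bn.running_mean[keep])
    folded.running_var.copy_(bn.running_var[keep])
    return folded
```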


  • Paper link: https://arxiv.org/abs/1909.08174

  • Implementation link: https://github.com/youzhonghui/gate-decorator-pruning

How to Implement Gated Pruning

So how exactly does the gating mechanism solve global filter importance ranking? The researchers first apply the Gate Decorator to batch normalization, then use an iterative pruning framework called Tick-Tock to obtain better pruning accuracy, and finally employ a group pruning technique to handle constrained pruning problems, such as pruning networks with residual connections.

These are, in brief, the three steps of gated pruning; each is introduced below, and more detail can be found in the original paper.

Gated Batch Normalization

The researchers apply the Gate Decorator to batch normalization and call the resulting module Gated Batch Normalization (GBN). GBN is defined by Equation 7 of the paper; it differs from standard batch normalization only in the gate vector φ, a vector with c components, where c is the number of channels of z_in.
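A minimal PyTorch sketch of such a module, assuming the gates simply rescale each output channel of a standard batch normalization layer (the class name and exact form are our illustration of Equation 7's description, not a copy of the official repository):

```python
import torch
import torch.nn as nn

class GatedBatchNorm2d(nn.Module):
    """Batch normalization whose output is scaled channel-wise by gates phi."""

    def __init__(self, num_channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels)
        # One gate per channel, initialized to 1 so the module starts out
        # behaving exactly like the batch normalization it replaces.
        self.phi = nn.Parameter(torch.ones(num_channels))

    def forward(self, z_in: torch.Tensor) -> torch.Tensor:
        # Reshape phi to (1, C, 1, 1) so it broadcasts over batch, height, width.
        return self.bn(z_in) * self.phi.view(1, -1, 1, 1)
```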


If an element of the gate vector φ is zero, the corresponding channel is pruned. Moreover, for networks that do not use batch normalization, the Gate Decorator can be applied directly to the convolution operation to achieve gated pruning.

Tick-Tock Pruning Framework

The researchers also introduce an iterative pruning framework, which they call Tick-Tock, to improve pruning accuracy. The Tick phase runs on a subset of the training data with the convolution kernels frozen. The Tock phase uses all of the training data and adds a sparsity constraint on φ to the loss function.
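Under the assumptions above, the framework can be sketched as a loop in which several cheap Tick steps are followed by one Tock fine-tune. The callables passed in (`tick_scores`, `prune_lowest`, `tock_finetune`) and the Tock loss below are hypothetical stand-ins for the steps the paper describes, not the authors' code; the defaults `ticks_per_tock=10` and `prune_ratio=0.002` mirror the ResNet settings mentioned later in this article.

```python
from typing import Callable, Dict

import torch

def tock_loss(task_loss: torch.Tensor, gates: list, lam: float = 1e-3) -> torch.Tensor:
    """Tock-phase objective: task loss plus an L1 sparsity penalty on all gates.

    `lam` is an illustrative penalty weight, not the value used in the paper.
    """
    return task_loss + lam * sum(phi.abs().sum() for phi in gates)

def tick_tock(model,
              tick_scores: Callable[[object], Dict],
              prune_lowest: Callable[[object, Dict, float], None],
              tock_finetune: Callable[[object], None],
              rounds: int,
              ticks_per_tock: int = 10,
              prune_ratio: float = 0.002) -> None:
    """Schematic Tick-Tock loop.

    tick_scores   -- one epoch on a training subset with only the gates and the
                     final linear layer trainable; returns Taylor scores for
                     every filter in the network.
    prune_lowest  -- removes the globally lowest-scoring fraction of filters.
    tock_finetune -- fine-tunes on the full training set with the sparsity
                     penalty above, recovering from the error the Ticks add.
    """
    for _ in range(rounds):
        for _ in range(ticks_per_tock):
            scores = tick_scores(model)
            prune_lowest(model, scores, prune_ratio)
        tock_finetune(model)
```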


Figure 2: Illustration of the Tick-Tock pruning framework.

The Tick phase mainly pursues three goals: accelerating the pruning process, computing the importance score Θ of each filter, and reducing the internal covariate shift caused by earlier pruning. In the Tick phase, the researchers train for one epoch on a subset of the training data, allowing only the gates φ and the final linear layer to be updated, which greatly reduces the risk of overfitting on the small subset. After training, all filters are ranked by their importance scores Θ and the least important ones are removed.

The Tick phase can be repeated T times before a Tock phase. The Tock phase fine-tunes the network to reduce the overall error introduced by pruning. It differs from ordinary fine-tuning in two main ways: ordinary fine-tuning trains for more epochs than Tock, and it does not add a sparsity constraint to the loss function.

Group Pruning: Addressing Constrained Pruning Issues

ResNet and its variants contain residual connections, which add the feature maps produced by two residual blocks element-wise. If filters are pruned independently in each layer, the feature maps meeting at a residual connection may become misaligned. This can be viewed as a constrained pruning problem: pruning must keep those feature maps aligned. To solve the misalignment problem, the authors propose group pruning: GBN modules connected purely through residual connections are assigned to the same group. A pure residual connection is one whose side branch contains no convolutional layer, as shown in Figure 3.
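Each group then behaves as a single "virtual GBN" (detailed after Figure 3 below): its member GBNs share one pruning pattern, and a channel's group score is the sum of its members' scores for that channel. A minimal sketch of that aggregation, using a hypothetical helper name:

```python
import torch

def virtual_gbn_scores(member_scores: list) -> torch.Tensor:
    """Importance scores of a virtual GBN formed by one pruning group.

    Every member GBN in the group shares the same channel count and pruning
    pattern, so the group's score for channel c is the sum of its members'
    scores at channel c.
    """
    return torch.stack(member_scores, dim=0).sum(dim=0)
```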


Figure 3: Group pruning demonstration. GBNs of the same color belong to the same group.

Each group can be viewed as a virtual GBN in which all member convolutions share the same pruning pattern; within a group, a filter's importance score is the sum of the scores of its member convolutions.

Experimental Setup and Datasets

Datasets

The researchers used a variety of datasets: CIFAR-10, CIFAR-100, CUB-200, ImageNet ILSVRC-12, and PASCAL VOC 2011. CIFAR-10 contains 50K training images and 10K test images. CIFAR-100 is similar to CIFAR-10 but has 100 categories with 600 images per category. CUB-200 contains nearly 6,000 training images and about 5,700 test images covering 200 bird species. ImageNet ILSVRC-12 has 1.28 million training images and 50K validation images covering 1,000 categories. The researchers also used the PASCAL VOC 2011 segmentation dataset and its extension SBD, which has 20 categories, 8,498 training images, and 2,857 test images.

Pruned Models

The researchers pruned three network architectures: VGGNet, ResNet, and FCN. All networks were trained with SGD, with weight decay and momentum set to 10^-4 and 0.9, respectively. The networks were trained on different amounts of data with different batch sizes, together with standard data augmentation techniques. During pruning, 0.2% of the filters were removed from ResNet in each Tick phase, and 1% from VGG and FCN; a Tock phase was performed after every 10 Tick phases.

Pruning Results


Table 1: Performance of pruned ResNet-56 trained on CIFAR-10. The baseline accuracy is 93.1%.


Table 2: Performance of pruned ResNet-50 trained on ImageNet. P.Top-1 and P.Top-5 denote the pruned model's single center-crop top-1 and top-5 accuracy on the validation set. [Top-1]↓ and [Top-5]↓ denote the accuracy drop of the pruned model relative to the baseline model. Global indicates whether the method is a global filter pruning algorithm.


Figure 4: Pruning results of VGG-16-M on the CUB-200 dataset.

The baseline model in Figure 5 is VGG-16-M, which reaches 73.19% test accuracy on CIFAR-100. In the "shrunk" version, the number of channels in every convolutional layer is halved, which reduces the FLOPs to 1/4 of the baseline; trained from scratch, it loses 1.98% in test accuracy. The "pruned" version is obtained with the Tick-Tock framework and loses only 1.3% in test accuracy. If the "pruned" architecture is instead trained from scratch, it reaches 71.02% accuracy, a drop of 2.17%. Importantly, the "pruned" model has only 1/3 as many parameters as the "shrunk" model.


Figure 5: Comparison of the accuracy and channel counts of the two networks, which have the same FLOPs.
