
1 Basic Design Framework of CNN
- Convolution layers extract spatial information;
- Pooling layers reduce the resolution of images or feature maps, decreasing computational load and extracting more semantic information;
- Fully-connected (FC) layers regress the targets.
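As a minimal illustration of these three roles, here is a small PyTorch sketch; the layer sizes and the ten-class output are arbitrary choices of mine, not taken from any particular paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the three roles above; all layer sizes are illustrative only.
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: extracts spatial information
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),                 # pooling: halves resolution, cheaper + more semantic
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),                     # collapse the feature map before the FC layer
    nn.Flatten(),
    nn.Linear(32, 10),                           # fully-connected layer: regresses the target (e.g. 10 classes)
)

logits = tiny_cnn(torch.randn(1, 3, 32, 32))     # -> shape (1, 10)
```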
The space of possible network architectures is effectively infinite, so even exploring it with neural architecture search (NAS) is daunting. Looking at recent developments, common CNN architectures can be divided into three parts:
- Stem: processes the input image with a small number of convolutions and adjusts the resolution.
- Body: the main part of the network, divided into multiple stages; each stage usually begins with a resolution-reducing operation and consists of one or more repeated building blocks (such as residual bottlenecks).
- Head: uses the features extracted by the stem and body to make predictions for the target task.
Not all CNNs follow this framework; it is merely a common pattern.
Building blocks are also a commonly used term, referring to small network combinations that are repeatedly used, such as the residual block and residual bottleneck block in ResNet, or the depthwise convolution block and inverted bottleneck block in MobileNet.
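To make the stem/body/head template and the notion of a repeated building block concrete, here is a minimal PyTorch sketch; the class names, stage widths, and depths are my own illustrative choices, not from any specific paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One building block: a pair of 3x3 convolutions with a skip connection (ResNet-style)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class StemBodyHeadNet(nn.Module):
    """Sketch of the stem / body / head template; widths and depths are arbitrary."""
    def __init__(self, stage_widths=(64, 128, 256, 512), blocks_per_stage=2, num_classes=1000):
        super().__init__()
        # Stem: a few convolutions that reduce the input resolution.
        self.stem = nn.Sequential(
            nn.Conv2d(3, stage_widths[0], 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(stage_widths[0]), nn.ReLU(inplace=True),
        )
        # Body: stages, each starting with a stride-2 convolution, then repeated building blocks.
        stages, in_ch = [], stage_widths[0]
        for w in stage_widths:
            layers = [nn.Conv2d(in_ch, w, 3, stride=2, padding=1, bias=False),
                      nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
            layers += [ResidualBlock(w) for _ in range(blocks_per_stage)]
            stages.append(nn.Sequential(*layers))
            in_ch = w
        self.body = nn.Sequential(*stages)
        # Head: pool the final feature map and predict the target task.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(stage_widths[-1], num_classes))
    def forward(self, x):
        return self.head(self.body(self.stem(x)))

logits = StemBodyHeadNet()(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)
```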
In addition to the above, there are many other details that can be adjusted in designing convolutional neural networks: designs for shortcuts (like FPN), adjustments to receptive fields, etc. However, to avoid lengthy complexity, these will not be discussed in this article.
2 Scaling an Existing Structure
- Depth D: the number of stacked building blocks or convolution layers from input to output.
- Width W: the width of the feature map output by a building block or convolution layer (the number of channels or filters).
- Resolution R: the height and width of the feature map tensor output by a building block or convolution layer.
Regarding depth, we believe that deeper networks can capture more complex features and bring better generalization ability. However, overly deep networks can still be difficult to train due to gradient vanishing, even with the use of skip connections and batch normalization.
In terms of width, generally, wider networks can capture more fine-grained information and are easier to train. However, wide and shallow networks struggle to capture complex features.
In terms of resolution, high resolution undoubtedly provides more detailed information; in most papers it is a reliable way to improve performance. The obvious downsides are the computational load and, for localization tasks, the need to adjust the receptive field accordingly.
Based on experiments that scale each dimension individually, the authors of EfficientNet argue that depth, width, and resolution should be scaled together. However, for a given computational budget, how to determine the ratio among the three is an open question.
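As a concrete illustration, below is a minimal sketch of the compound scaling rule proposed in the EfficientNet paper: a single coefficient phi scales all three dimensions together through fixed per-dimension constants. The constants follow my reading of the paper; the rounding and the baseline values in the example are my own.

```python
# Compound scaling sketch: one coefficient phi scales depth, width, and resolution together.
# ALPHA, BETA, GAMMA are the per-dimension constants reported in the EfficientNet paper,
# found by a small grid search under alpha * beta^2 * gamma^2 ~= 2; rounding is illustrative.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(base_depth, base_width, base_resolution, phi):
    depth = int(round(base_depth * ALPHA ** phi))             # more stacked blocks
    width = int(round(base_width * BETA ** phi))              # more channels per block
    resolution = int(round(base_resolution * GAMMA ** phi))   # larger input images
    return depth, width, resolution

# Example: scaling a hypothetical baseline (18 blocks, 64 channels, 224x224 input) by phi = 3.
print(compound_scale(base_depth=18, base_width=64, base_resolution=224, phi=3))
```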
NAS typically breaks down into three components:
- Search space: defines the basic elements of the network that can be chosen (such as convolution or batch normalization) and what can be adjusted (such as kernel size or number of filters).
- Search strategy: defines how the space is searched; reinforcement learning is one of the common methods.
- Performance estimation: defines how the quality of an architecture is evaluated, e.g. by accuracy or computational cost.
EfficientNet is arguably the strongest network architecture for mobile devices (running on CPU) as of 2020. It adopts the search space of MnasNet but replaces the latency estimate of computational resources with FLOPs, using a hyperparameter w to control the trade-off between accuracy and FLOPs: for a given network m and target computational budget T, the search optimizes Accuracy(m) × [FLOPs(m)/T]^w.
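The objective above can be written as a one-line reward function. In the sketch below, the default value of w is only an illustrative choice of mine; what matters is that a negative exponent penalizes models whose FLOPs exceed the target T.

```python
def search_reward(accuracy, flops, target_flops, w=-0.07):
    """Accuracy(m) * [FLOPs(m) / T]^w from the text; w is a trade-off hyperparameter.

    A negative w penalizes models whose FLOPs exceed the target T.
    The default -0.07 is an illustrative value, not taken from this article.
    """
    return accuracy * (flops / target_flops) ** w

# Example with made-up numbers: a model at 78% accuracy using 450M FLOPs against a 400M target.
print(search_reward(accuracy=0.78, flops=450e6, target_flops=400e6))
```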
Generally, the process of manually designing a network is as follows: start with a rough network concept, hypothesize an adjustable range (design space, termed search space in NAS), and conduct numerous experiments within this space to find directions that yield positive benefits. At this point, we continue experiments based on this direction, which is equivalent to converging to a smaller design space. This iterative process is like moving from A to B to C as shown in the diagram below.
Some may feel that neural networks are black boxes, and many insights are forced explanations for publishing papers. However, without these insights, I would truly not know how to modify a network.
RegNet is a family of networks discovered by analyzing error empirical distribution functions (EDFs) in the paper “Designing Network Design Spaces.” Below, I share the authors’ process of searching the RegNet design space, giving insight into how a network architecture is designed.
AnyNetX inherits much of the spirit of classic CNNs, serving as a template similar to ResNet: the network is divided into stem, body, and head, and the search focuses on the body. The body is divided into four stages, starting from an input resolution of 224×224, with each stage halving the resolution at its beginning. For controllable parameters, each stage has four degrees of freedom: block count (d), block width (w), bottleneck ratio (b), and group convolution width (g), totaling 16 degrees of freedom.
- Sharing the bottleneck ratio and the group convolution width across stages gives nearly identical results to setting them per stage, suggesting that sharing these parameters does not affect outcomes.
- Having the width and block count (depth) increase from stage to stage significantly improves performance, indicating that this is a good design direction.
After this simplification, the AnyNetX design space is left with six degrees of freedom, which is RegNetX. The six dimensions are: network depth d, bottleneck ratio b, group convolution width g, the target slope of width growth w_a, the width growth interval multiplier w_m, and the initial width w_0.
The width growth interval multiplier (w_m) and the target slope of width growth (w_a) may sound convoluted; I have translated them based on their meanings, and even the original text can be confusing. In practice they mean the following: the target slope is a fixed value defining how fast each block’s width should grow on average, while the interval multiplier is a convenient multiplier used to snap those widths onto a few discrete levels, so that block widths do not fluctuate erratically and the repeated blocks within each stage share the same width.
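To make this concrete, here is a sketch of how w_0, w_a, and w_m generate per-block widths, following my reading of the RegNet parameterization; the quantization details (e.g. rounding widths to multiples of 8) are my assumption about the usual convention.

```python
import numpy as np

def regnet_widths(depth, w_0, w_a, w_m, quantum=8):
    """Generate per-block widths from (w_0, w_a, w_m) as described above.

    depth: total number of blocks; w_0: initial width; w_a: target growth slope;
    w_m: the interval multiplier that snaps widths onto a few discrete levels.
    Rounding widths to multiples of `quantum` is an assumed convention.
    """
    # 1. Linear growth target: block j should have width roughly w_0 + w_a * j.
    u = w_0 + w_a * np.arange(depth)
    # 2. Snap each target onto the grid w_0 * w_m^s by rounding the exponent s.
    s = np.round(np.log(u / w_0) / np.log(w_m))
    widths = w_0 * np.power(w_m, s)
    # 3. Round to a hardware-friendly multiple; consecutive equal widths form a stage.
    return (np.round(widths / quantum) * quantum).astype(int)

# Example with made-up values: 13 blocks, w_0 = 24, w_a = 36, w_m = 2.5.
print(regnet_widths(depth=13, w_0=24, w_a=36.0, w_m=2.5))
```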
By analyzing some popular network architectures, the paper further restricts the design space of RegNetX: bottleneck ratio b = 1, block depth 12 ≤ d ≤ 28, and width growth interval multiplier w_m ≥ 2. To further optimize model inference latency, RegNetX imposes additional restrictions on parameter count and activation count: assuming the number of FLOPs is f, restrict the activation count to #act ≤ 6.5·sqrt(f) and the parameter count to #param ≤ 3.0 + 5.5·f.
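These restrictions can be written as a simple filter. The sketch below just encodes the two inequalities from the paragraph above; the units (f in GFLOPs, parameters and activations in millions) are my assumption about how the paper reports these quantities.

```python
import math

def satisfies_regnetx_constraints(flops, params, activations):
    """Check the RegNetX complexity restrictions quoted above.

    Assumed units: flops (f) in GFLOPs, params and activations in millions.
    """
    act_ok = activations <= 6.5 * math.sqrt(flops)   # #act <= 6.5 * sqrt(f)
    param_ok = params <= 3.0 + 5.5 * flops           # #param <= 3.0 + 5.5 * f
    return act_ok and param_ok

# Example with made-up numbers: a 1.6 GFLOP model with 9M parameters and 8M activations.
print(satisfies_regnetx_constraints(flops=1.6, params=9.0, activations=8.0))
```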
I believe that many of the current settings in NAS (such as search space) still rely on human wisdom, which means NAS must first borrow from many handcrafted insights. On the other hand, humans also attempt to gain insights from architectures discovered by NAS. Thus, NAS and handcrafted design can currently be said to complement each other.
The advantage of NAS lies in finding the most advantageous architecture for a task, but this strong task-specificity also brings uncertainty about generalization (could it produce architectures that overfit the task?). This is why the EfficientNet paper not only demonstrates the power of its architecture on ImageNet but also spends additional effort showcasing its transferability to other classification datasets. However, even if transfer to other datasets works, transferring to other tasks may be another challenge.
In fact, EfficientNet has already demonstrated strong capabilities in other visual tasks, whether in object detection or semantic segmentation, although this evaluation is based on the premise of