Dynamic Neural Networks: Key Challenges and Solutions

Originally from Zhiyuan Community
[Column: Key Issues] In recent years, we have witnessed increasingly powerful neural network models, such as AlexNet, VGG, GoogLeNet, ResNet, DenseNet, and the recently popular Transformer.
The workflow of these neural networks can be summarized as follows: 1) design a fixed network architecture and initialize its parameters; 2) training phase: optimize the parameters on the training set; 3) inference phase: fix the architecture and parameters, and run forward propagation on test samples to obtain predictions.
Under this paradigm, once training is complete, the same network architecture and parameters are used for inference on every input sample at test time, which limits the model’s representation ability, inference efficiency, and interpretability.
A very intuitive example is shown below. For typical images of “horses” or “owls”, a small network may be sufficient for correct recognition; for atypical images of the same classes, however, a larger network may be required.
[Figure: “easy” versus “hard” samples of the same classes, requiring networks of different sizes]
Another example is shown below. When recognizing an image containing a “cat”, increasing the resolution does improve accuracy, but it also brings a rapid increase in computational load. One naturally wonders whether the resolution of input samples can be reduced to save computation without significantly hurting accuracy.
[Figure: recognition accuracy and computational cost at different input resolutions]
Summarizing this series of requirements for “adaptive inference” leads us to the research area of “dynamic neural networks”.
Unlike static networks, the essence of dynamic networks is that they can dynamically adjust their structure/parameters when processing different test samples, thereby demonstrating remarkable advantages in inference efficiency, expressive power, and adaptability.
Written by: Han Yizeng
Edited by: Jia Wei
Content Directory
1. How Do Neural Networks Move?
1) Sample Adaptive Dynamic Networks
Dynamic Structure
Dynamic Parameters
2) Spatial Adaptive Dynamic Networks
Pixel Level
Region Level
Resolution Level
3) Temporal Adaptive Dynamic Networks
2. Six Major Open Questions
1) Structure Design
2) Applicability under More Diverse Tasks
3) The Gap between Practical Efficiency and Theory
4) Robustness
5) Interpretability
6) Dynamic Network Theory

01

How Do Neural Networks Move?

The most classic approach is to construct a dynamic ensemble of multiple models in a serial or parallel manner, and then adaptively activate one of the models based on the input sample.
In fact, there is already a large body of research around this idea, which can be roughly divided into several categories: “sample adaptive dynamic networks”, “spatial adaptive dynamic networks”, and “temporal adaptive dynamic networks”.
1) Sample Adaptive Dynamic Networks
Sample adaptive dynamic networks perform adaptive computation for each input sample. Based on the main dynamic changes of the network, they can be divided into two major categories: dynamic structure and dynamic parameters.
The former is mainly aimed at allocating fewer computational resources when processing “simple” samples to enhance computational efficiency; the latter is primarily to enhance the model’s expressive power with as little additional computation as possible.
Dynamic Structure
Dynamic structure has two dimensions: depth and width.
Depth dynamics refer to changes in the number of layers in the network. Since almost all networks are composed of stacked layers, a relatively natural way to implement dynamic structure is to selectively execute different layers of the network for different samples.
There are two approaches to this: early exit mechanisms and skip-layer mechanisms.
The so-called “early exit” mechanism places exit points at intermediate layers of the model and adaptively decides, based on the outputs at these intermediate exits, whether to terminate inference for a given sample. In effect, early exiting skips the computation of all layers after a certain classifier.
[Figure: Two basic implementation strategies for the “early exit” mechanism]
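To make this concrete, below is a minimal PyTorch-style sketch of confidence-based early exiting (not taken from any specific method in the article; the `EarlyExitNet` module, the toy blocks, and the 0.9 threshold are illustrative assumptions): inference stops at the first intermediate classifier whose softmax confidence is high enough.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    """Toy multi-exit network: a stack of blocks, each followed by an exit classifier."""
    def __init__(self, dim=64, num_blocks=4, num_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_blocks)]
        )
        self.exits = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_blocks)]
        )

    def forward(self, x, threshold=0.9):
        # At test time, stop at the first exit whose confidence exceeds the threshold.
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            logits = exit_head(x)
            conf = F.softmax(logits, dim=-1).max(dim=-1).values
            if not self.training and bool((conf > threshold).all()):
                return logits  # early exit: all later blocks are skipped
        return logits          # deepest exit

net = EarlyExitNet().eval()
pred = net(torch.randn(1, 64), threshold=0.9)
```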
The skip-layer mechanism is relatively flexible, as it adaptively decides whether to execute each intermediate layer of the network for each input sample.
[Figure: Several implementation methods for dynamic layer skipping]
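A layer-skipping block can be sketched in the same spirit. The following toy `GatedBlock` (an illustrative assumption, not a specific published design) attaches a lightweight gate to a residual block and, at inference, skips the block whenever the gate decides to.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Residual block that a lightweight gate can choose to skip per sample."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, 1)   # scalar "execute or skip" score per sample

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))           # (batch, 1), soft gate for training
        if not self.training:
            g = (g > 0.5).float()                 # hard decision at inference
        # Note: real savings require running self.body only for samples with g == 1
        # (e.g., via index_select); this sketch stays dense for brevity.
        return x + g * self.body(x)

block = GatedBlock().eval()
y = block(torch.randn(4, 64))
```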
Along the “width” dimension of the network, dynamic methods can be grouped at different granularities: dynamic activation of neurons, dynamic channel pruning, and mixture of experts (MoE).
Dynamic activation of neurons is easy to understand. As shown in the figure below, it usually controls the activation of neurons in linear layers through low-rank decomposition and other methods.
[Figure: Dynamic activation of neurons]
In CNNs, unlike static pruning methods that permanently remove certain “unimportant” channels, dynamic channel pruning adaptively activates different convolutional channels for each sample, improving computational efficiency while preserving model capacity.
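As a rough illustration, the sketch below (the `ChannelGatedConv` module is an illustrative assumption, not a particular published method) computes per-sample channel gates from a global descriptor of the input and masks the corresponding convolution outputs.

```python
import torch
import torch.nn as nn

class ChannelGatedConv(nn.Module):
    """Conv layer whose output channels are switched on/off per input sample."""
    def __init__(self, in_ch=16, out_ch=32):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # Tiny gating head: global pooling -> per-channel keep probability.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, out_ch), nn.Sigmoid())

    def forward(self, x):
        keep = self.gate(x)                       # (batch, out_ch) in [0, 1]
        if not self.training:
            keep = (keep > 0.5).float()           # hard decisions at inference
        y = self.conv(x)
        return y * keep[:, :, None, None]         # masked channels contribute nothing

layer = ChannelGatedConv().eval()
out = layer(torch.randn(2, 16, 8, 8))
```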
The mixture of experts system establishes multiple “experts” through a parallel structure. These experts can be complete models or network modules, and then dynamically weight the outputs of these “experts” to obtain the final prediction result.
[Figure: MoE structure]
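A soft mixture of experts can be sketched as follows; the `SoftMoE` module and its sizes are illustrative assumptions, and real efficiency gains usually require sparse (top-k) routing so that only a few experts are actually executed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoE(nn.Module):
    """Mixture of experts: a router produces weights that combine parallel experts."""
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):
        weights = F.softmax(self.router(x), dim=-1)                  # (batch, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # weighted combination

moe = SoftMoE()
y = moe(torch.randn(2, 64))
```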
In addition to dynamic adaptation in the width and depth of the network, another type of work first establishes a super network (SuperNet) with multiple forward paths, and then uses certain strategies to dynamically route different input samples.
[Figure: Dynamic routing in super networks]
Dynamic Parameters
Networks with dynamic structures often require special structural designs, training strategies, or adjustments of hyperparameters. Another category of work keeps the network structure unchanged during inference and adaptively adjusts (part of) the model parameters according to the input samples to enhance the model’s expressive power.
Let the inference process of a static network be represented as

$y = \mathcal{F}(x; \Theta)$,

where $\Theta$ denotes the fixed parameters. The output of a network with dynamic parameters can then be represented as

$y = \mathcal{F}(x; \hat{\Theta}) = \mathcal{F}(x; \mathcal{W}(x, \Theta))$,

where $\mathcal{W}(\cdot)$ is the operation that generates the dynamic parameters from the input.
Overall, research on dynamic parameters can be divided into three major categories: dynamic adjustment of parameters, parameter prediction, and attention-based dynamic features.
Dynamic Adjustment of Parameters
The core ideas of dynamic adjustment of parameters are twofold: 1) Reweighting the parameters of the backbone network using attention during the testing phase to enhance the network’s expressive power; 2) Adaptively adjusting the shape of convolution kernels to allow the network to have a dynamic receptive field.
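The first idea can be sketched, in the spirit of attention over a set of candidate kernels, as follows (the `AttentionOverKernels` module is an illustrative assumption, not the exact formulation of any specific paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOverKernels(nn.Module):
    """Holds K candidate conv kernels and mixes them with input-dependent attention."""
    def __init__(self, in_ch=16, out_ch=32, k=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.01)
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, num_kernels), nn.Softmax(dim=-1))
        self.k = k

    def forward(self, x):
        # One attention vector per sample; for simplicity, process samples one by one.
        outs = []
        for xi, ai in zip(x.split(1), self.attn(x).split(1)):
            w = (ai.view(-1, 1, 1, 1, 1) * self.weight).sum(dim=0)  # mixed kernel
            outs.append(F.conv2d(xi, w, padding=self.k // 2))
        return torch.cat(outs, dim=0)

layer = AttentionOverKernels()
y = layer(torch.randn(2, 16, 8, 8))
```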
Parameter Prediction
Parameter prediction is more direct than dynamic adjustment, as it directly predicts the (partial) parameters of the network from the input.
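A minimal sketch of this idea, assuming a toy hypernetwork (`PredictedLinear` is an illustrative name) that maps the input to the weight matrix of a per-sample linear layer:

```python
import torch
import torch.nn as nn

class PredictedLinear(nn.Module):
    """Linear layer whose weight matrix is predicted from the input by a small hypernetwork."""
    def __init__(self, in_dim=32, out_dim=16, ctx_dim=8):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Context encoder + weight generator (the "hypernetwork").
        self.encode = nn.Sequential(nn.Linear(in_dim, ctx_dim), nn.ReLU())
        self.generate = nn.Linear(ctx_dim, out_dim * in_dim)

    def forward(self, x):
        w = self.generate(self.encode(x))                 # (batch, out_dim * in_dim)
        w = w.view(-1, self.out_dim, self.in_dim)         # one weight matrix per sample
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1)  # per-sample linear transform

layer = PredictedLinear()
y = layer(torch.randn(4, 32))   # -> (4, 16)
```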
The main effect of dynamic parameter methods is to generate more dynamic and diverse features, thereby enhancing the model’s expressive power. To achieve this goal, an equivalent solution is to dynamically weight the features directly.
Attention-Based Dynamic Features
For a linear transformation $y = Wx$, weighting its output with a channel attention vector $\alpha$ yields a result equivalent to first weighting the parameters and then performing the transformation, i.e.,

$\alpha \odot (Wx) = (\mathrm{diag}(\alpha)\,W)\,x.$
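This identity is easy to verify numerically; the short check below is illustrative and not part of the original article.

```python
import torch

# Scaling the i-th output of Wx by alpha_i equals scaling the i-th row of W first.
W = torch.randn(5, 3)
x = torch.randn(3)
alpha = torch.rand(5)

out_features_weighted = alpha * (W @ x)          # attention applied to the output
params_weighted = (torch.diag(alpha) @ W) @ x    # attention folded into the parameters
assert torch.allclose(out_features_weighted, params_weighted, atol=1e-6)
```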
In addition to channels, attention can also be used to adaptively weight different spatial locations of features. These two attention paradigms (channel, spatial) can be combined in various forms to further enhance the model’s expressive power.
These dynamic feature methods typically reweight features before nonlinear activation functions (e.g., ReLU). Recently, some works have directly designed dynamic activation functions to replace traditional static activation functions in networks, which can also significantly improve the model’s expressive power.
2) Spatial Adaptive Dynamic Networks
In visual tasks, existing studies have shown that different spatial locations in the input play different roles in the final predictions of CNNs. In other words, to make an accurate prediction, it may only be necessary to adaptively process a portion of the spatial locations in the input, rather than performing computations of the same amount across all positions of the input image.
Other studies have shown that using lower resolution for input images can already lead to good accuracy for the network. Therefore, the traditional approach of using the same resolution representation for all input images in CNNs results in inevitable redundant computations.
To address this, spatial adaptive dynamic networks can be designed to perform adaptive inference from a spatial perspective on image inputs. Depending on the granularity of dynamic operations, spatial adaptive networks can be divided into three levels: pixel level, region level, and resolution level.
Pixel Level
Pixel-level dynamic networks, as the name suggests, perform adaptive computations for each spatial position of the input feature map. For this problem, dynamic structure and dynamic parameter approaches can also be applied.
[Figure: Pixel-level dynamic networks]
Pixel-level dynamic structure networks call different network modules for different pixel points, thereby avoiding redundant computations in areas unrelated to the task, such as the background.
On the parameter side, the dynamic parameter methods described above can essentially all be applied at the pixel level as well.
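A rough sketch of the structural idea (the `PixelGatedBlock` module is an illustrative assumption): a cheap gate produces a spatial mask, and the expensive branch contributes only where the mask is on. The masking here is dense, so an actual speedup would additionally require sparse convolution kernels.

```python
import torch
import torch.nn as nn

class PixelGatedBlock(nn.Module):
    """Applies an expensive branch only at spatial positions selected by a cheap gate."""
    def __init__(self, ch=16):
        super().__init__()
        self.cheap = nn.Conv2d(ch, ch, 1)
        self.expensive = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(ch, ch, 3, padding=1))
        self.gate = nn.Conv2d(ch, 1, 1)   # one "process this pixel?" score per location

    def forward(self, x):
        mask = (torch.sigmoid(self.gate(x)) > 0.5).float()   # (batch, 1, H, W)
        return self.cheap(x) + mask * self.expensive(x)

block = PixelGatedBlock()
y = block(torch.randn(1, 16, 32, 32))
```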
Region Level
The sparse sampling operation in pixel-level dynamic networks often makes it hard to realize the theoretical acceleration in practice. Region-level dynamic networks instead perform adaptive computation on whole regions selected from the original input.
Specifically, region-level dynamic networks can be divided into two types. The first type learns a set of transformation parameters (e.g., affine, projection) based on the input image and performs parameterized transformations on the original image (part of the area), thereby enhancing the model’s robustness to image distortions or enlarging task-relevant areas in the image to improve recognition accuracy.
The second type uses a spatial hard attention mechanism to adaptively select image blocks containing important targets from the input, cropping these blocks for recognition.
The specific process can be as follows:
[Figure: Adaptive inference in region-level dynamic networks]
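A toy version of the second type might look like the following (the `GlanceAndCrop` module, the 8x8 glance, and the 64-pixel patch are illustrative assumptions); note that hard cropping is not differentiable, so training such models typically relies on reinforcement learning or a differentiable relaxation.

```python
import torch
import torch.nn as nn

class GlanceAndCrop(nn.Module):
    """Glance at a downsampled image, predict a crop location, then classify the crop."""
    def __init__(self, num_classes=10, patch=64):
        super().__init__()
        self.patch = patch
        self.glance = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                    nn.Linear(3 * 8 * 8, 2), nn.Sigmoid())  # (cx, cy) in [0,1]
        self.classify = nn.Sequential(nn.Flatten(), nn.Linear(3 * patch * patch, num_classes))

    def forward(self, img):
        b, _, h, w = img.shape
        centers = self.glance(img)                      # normalized crop positions
        crops = []
        for i in range(b):
            cx = int(centers[i, 0] * (w - self.patch))
            cy = int(centers[i, 1] * (h - self.patch))
            crops.append(img[i:i+1, :, cy:cy+self.patch, cx:cx+self.patch])
        return self.classify(torch.cat(crops, dim=0))   # recognition runs on the crops only

model = GlanceAndCrop()
logits = model(torch.randn(2, 3, 224, 224))
```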
Resolution Level
The above works typically require dividing an input image into different regions and performing adaptive computations on each region. The sparse sampling/cropping operation involved often affects the efficiency of the model during actual operation.
Resolution-level dynamic networks, on the other hand, process each input sample as a whole, but to reduce the redundant computations caused by high-resolution representations for “simple” samples, they adopt dynamic resolutions for data representation across different input images.
Dynamic resolution can be achieved through two main approaches: dynamic scaling factors and multi-scale architectures.
[Figure: Dynamic changes of a multi-scale architecture for different samples]
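A sketch of the dynamic-scaling-factor idea, under the assumption of a toy scale predictor (`DynamicResolutionNet` is an illustrative name) that picks one of a few candidate resolutions before running the backbone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicResolutionNet(nn.Module):
    """Chooses an input resolution with a tiny predictor, then runs the backbone."""
    def __init__(self, num_classes=10, scales=(0.5, 0.75, 1.0)):
        super().__init__()
        self.scales = scales
        self.predict_scale = nn.Sequential(nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                           nn.Linear(3 * 4 * 4, len(scales)))
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(32, num_classes))

    def forward(self, img):
        # Pick the scale with the highest score (one decision per batch for simplicity).
        idx = self.predict_scale(img).mean(dim=0).argmax().item()
        img = F.interpolate(img, scale_factor=self.scales[idx],
                            mode='bilinear', align_corners=False)
        return self.backbone(img)

model = DynamicResolutionNet()
logits = model(torch.randn(2, 3, 224, 224))
```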
3) Temporal Adaptive Dynamic Networks
Temporal data (e.g., text, video) has significant redundancy along the time dimension. Therefore, designing a dynamic network that performs adaptive computations for data at different time positions will make the network itself more efficient.
Overall, temporal adaptive dynamic networks can reduce redundant computations from two aspects: 1) Allocating less computation to certain “unimportant” positions of the input; 2) Performing computations only at a portion of sampled time positions.
The processing flow of a conventional static RNN over time-series data of length T is to iteratively update its hidden state, i.e.,

$h_t = \mathcal{F}(h_{t-1}, x_t), \quad t = 1, 2, \ldots, T.$
However, since the contribution (importance) of inputs at different time points varies for the task, applying the same complexity of operations to inputs at each time point leads to unavoidable redundant computations. Therefore, dynamic RNNs can be designed in various forms to adaptively decide whether to allocate computation based on inputs at different time points, or what level of complexity to use for the computations.
[Figure: Several inference modes of temporal adaptive dynamic networks]
The first mode is dynamic updates of the hidden state. Considering the varying importance of input data at different time points, the hidden state of the RNN can be updated using adaptive computations at each time step. For example, one can skip updates of the hidden state, perform rough updates, or use multi-scale architectures for selective updates.
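The hidden-state-skipping variant can be sketched as follows (the `SkipUpdateRNN` module is an illustrative assumption): a learned gate decides per time step whether to run the GRU cell update or simply carry the previous hidden state forward.

```python
import torch
import torch.nn as nn

class SkipUpdateRNN(nn.Module):
    """GRU whose hidden state is updated only when a learned gate deems the step important."""
    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        self.cell = nn.GRUCell(in_dim, hidden)
        self.skip_gate = nn.Linear(in_dim + hidden, 1)
        self.hidden = hidden

    def forward(self, x):                      # x: (batch, T, in_dim)
        b, T, _ = x.shape
        h = x.new_zeros(b, self.hidden)
        for t in range(T):
            xt = x[:, t]
            update = torch.sigmoid(self.skip_gate(torch.cat([xt, h], dim=-1)))
            if not self.training:
                update = (update > 0.5).float()                     # hard skip at inference
            h = update * self.cell(xt, h) + (1 - update) * h        # skipped steps keep h
        return h

rnn = SkipUpdateRNN().eval()
h = rnn(torch.randn(2, 10, 32))
```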
However, this approach still requires all time samples to be input before making adaptive computation decisions. In many scenarios, it is sufficient to solve the required problem based solely on the beginning of the time series data. For instance, humans can gain a general understanding of an article by reading only its abstract. Therefore, an “early exit” mechanism can be used to terminate the “reading” process early at some intermediate time.
In the “early exit” mechanism, the network can only decide whether to terminate the computation; it cannot decide which positions of the input data to “look at”. By additionally employing the aforementioned “skipping” mechanism, the network can also jump over a certain number of time steps entirely.

02

Open Questions

Dynamic neural networks have clear advantages. In particular, with the pressing need to deploy deep learning models on all kinds of mobile devices, the demand for smaller models that consume less computation while still guaranteeing sufficient accuracy is pushing research on dynamic neural networks forward.
Nevertheless, the field is still in its infancy, and many open questions and research directions are worth exploring. We believe there are six major issues:
1. Dynamic Network Structure Design. Currently, most work on network architecture focuses on designing static networks, and most dynamic network designs are also built on existing classical static architectures, selectively executing different components within them. This may be only a suboptimal solution for dynamic networks. There is therefore room to further improve performance and computational efficiency by designing structures specifically for adaptive computation.
2. Applicability Under More Diverse Tasks. Most dynamic networks are currently designed only for classification and are difficult to apply directly to other visual tasks such as object detection and semantic segmentation. The challenge is that these tasks lack a simple criterion for judging sample difficulty, since a single sample may contain multiple targets/pixels of varying complexity. Some existing approaches, such as spatial adaptive networks, have already been applied to tasks beyond classification. However, designing a unified, simple, and elegant dynamic network that can directly serve as a backbone for other tasks remains a significant challenge.
3. The Gap Between Practical Efficiency and Theory. Most existing deep learning hardware and software libraries are developed for static models and do not yet support dynamic models well, so the practical acceleration of dynamic models may lag behind their theoretical gains. Designing hardware-friendly dynamic networks is therefore a valuable and challenging topic. Another interesting direction is to further optimize computing hardware and software platforms so that the theoretical efficiency gains of dynamic networks can be better realized.
4. Robustness. Recent work has shown that dynamic models can offer new angles on the robustness of deep networks. Moreover, conventional adversarial attacks aim to reduce model accuracy, whereas for dynamic networks an attack can target accuracy and efficiency simultaneously. The robustness of dynamic networks is thus an intriguing and under-researched topic.
5. Interpretability. Dynamic networks inherit the “black box” nature of deep neural networks, which motivates research on explaining their working mechanisms. Notably, the adaptive inference mechanisms of dynamic networks, such as spatial/temporal adaptivity, align with the human visual system. Furthermore, for a given sample it is convenient to analyze which parts of a dynamic network need to be activated to make a prediction. We hope these properties can inspire new work on the interpretability of deep learning.
6. Dynamic Network Theory. This includes: 1) Optimal decision problems. Decision making is an essential operation in the inference process of most dynamic networks, yet existing methods (based on confidence, policy networks, or gating functions) lack theoretical guarantees and are not necessarily optimal; designing decision functions with theoretical guarantees is a valuable direction for future research. 2) Generalization. In dynamic models, different sub-networks are activated by different test samples, so these sub-networks face a distribution shift between training and testing; new theory on the generalization of dynamic networks will therefore be an interesting topic.

THE END
