Abstract: This issue discusses the technological innovations in neural network compression and acceleration at different levels, mainly involving three aspects: (1) neural network compression and its hardware-software co-design methods; (2) neural network compression software frameworks and compilation optimization techniques; (3) automation of neural network compression and support for new architectures. It aims to help readers understand the cutting-edge intersections in the fields of deep learning and architecture, and further envision the vast prospects of intelligent computing in the Internet of Things (IoT) systems and their applications.
Keywords: Deep Learning, Neural Network Compression, Hardware-Software Co-Design
The intelligent Internet of Things plays a pivotal role in the construction of new infrastructure, sets it apart from previous infrastructure projects, and is a new driving force for the future development of the digital economy. In 2009, during a visit to Wuxi, then Premier Wen Jiabao proposed building the “Perception China” center, opening the era in which China developed the Internet of Things as an information-technology wave and a new economic engine. Over the past decade, the Internet of Things has moved through the stages of conceptual emergence, leadership by demonstration applications, major technological breakthroughs, and gradual industrial maturation, and has become an important driving force of a new round of technological revolution and industrial transformation worldwide. The emergence and development of 5G will help establish a more IoT-friendly ecosystem, promote the deep integration and iterative evolution of next-generation information technology with urban modernization, and facilitate the construction and development of smart cities in China.
The Internet of Things essentially extends the Internet to intelligent terminals. Unlike earlier Internet environments, however, the large number of intelligent terminals in the IoT generates massive volumes of low-value-density data, which must be processed intelligently by complex neural networks represented by deep learning. In this increasingly complex IoT system, data, algorithms (models), and computing power are evolving jointly toward a more reasonable distribution across cloud data centers, edge service nodes, and intelligent terminals. Moreover, a consensus appears to be emerging: only intelligent models that truly bring intelligence to edge and IoT devices can define the future of computing. Domain-Specific Architecture (DSA) has now entered a golden age of development, and successful optimization solutions for deep learning model compression and hardware-software co-design are likely to drive the production of billions of mobile devices and trillions of IoT devices, ultimately generating enormous commercial and social value.
Currently, artificial intelligence technologies represented by deep learning are developing rapidly, benefiting mainly from the virtuous cycle formed by data, algorithms (models), and computing power. In particular, neural networks, as an algorithmic framework with broad applicability and a unified, clear practical path, have created a demand-driven, full-stack trend of rapid development. However, increasingly complex real-world applications such as medical imaging and natural language processing often require substantial storage, computing power, and energy to train deep neural network models. The trained models are relatively large, with complex and irregular network structures and parameter counts that can reach hundreds of millions, making them difficult to deploy and run efficiently in IoT systems through hardware-software co-design alone. Neural network compression aims to eliminate redundant parameters, reduce intermediate results, and lower structural complexity while keeping the loss of model inference accuracy low. Common compression methods include quantization, pruning, knowledge distillation, and matrix decomposition; they help improve the real-time performance and quality of deployed deep learning applications and have a significant impact on intelligent processors such as Huawei Ascend and Cambricon.
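To make the idea of compression concrete, the following is a minimal illustrative sketch (not drawn from any article in this issue) of the simplest of the methods just listed, symmetric post-training quantization: each 32-bit floating-point weight is replaced by an 8-bit integer plus one shared scale factor, cutting storage roughly fourfold at the cost of a small, bounded rounding error.

```python
def quantize_int8(weights, num_bits=8):
    """Symmetric per-tensor quantization: w ≈ scale * q, with q an integer in [-127, 127]."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = max(abs(w) for w in weights) / qmax         # one scale shared by the tensor
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights for inference."""
    return [qi * scale for qi in q]

# Toy FP32 weights standing in for one layer of a trained model.
weights = [0.42, -1.27, 0.08, 0.95, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now needs 8 bits instead of 32; round-to-nearest bounds the error by scale/2.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(max_err <= scale / 2 + 1e-9)  # True
```

Production toolchains refine this sketch with per-channel scales, zero-point offsets for asymmetric ranges, and calibration data for activations, but the storage-versus-accuracy trade-off is the same.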
In the future, we need to further consider how to apply new technologies and architectures to promote the sustainable development of neural networks towards lightweight and high-performance directions. Therefore, this issue discusses the technological innovations in neural network compression and acceleration at different levels, mainly involving three aspects: (1) neural network compression and its hardware-software co-design methods; (2) neural network compression software frameworks and compilation optimization techniques; (3) automation of neural network compression and support for new architectures.
The article “Hardware-Software Co-Optimization for Efficient Deployment of Neural Networks” written by Professor Wang Yu from Tsinghua University leads this issue. It analyzes the three main factors affecting the efficient deployment of neural networks—workload, peak performance, and computational efficiency—from multiple perspectives: algorithms, systems, and practices, focusing on four main application scenarios for neural network acceleration: sensors and self-powered devices, smartphones and IoT, autonomous driving and smart cities, and cloud data centers. It discusses various optimization methods for neural network compression and acceleration, which have important reference value for both academia and industry.
Researcher Chen Yao from the Advanced Digital Sciences Center in Singapore wrote the article “Opportunities and Challenges of Hardware-Software Co-Design for Edge AI,” which defines a new paradigm for hardware-software co-design for edge AI—Neural Architecture and Implementation Search (NAIS). It discusses bidirectional DNN/accelerator co-design and differentiable DNN/accelerator synchronous search, and presents problems and challenges that must be addressed to advance this method.
Assistant Professor Wang Yanzhi from Northeastern University wrote the article “AI Application Realization on Ordinary Mobile Phones Without Dedicated Hardware Acceleration,” which explores the software optimization potential for deep learning applications on edge and IoT devices. It discusses the compression-compilation co-optimization method and its supported inference acceleration framework, achieving a “hand-in-hand” design of deep learning model compression and its executable code compilation, which helps improve the real-time performance and quality of deep learning application deployment.
Associate Researcher Li Wei from the Institute of Computing Technology, Chinese Academy of Sciences, wrote the article “Neural Network Accelerators and Their Software Programming Systems,” which mainly discusses the DianNao series of neural network accelerators proposed and designed by the Cambricon team, as well as the neural network programming, compilation, and optimization methods that adapt to this infrastructure, achieving the goal of using a relatively unified set of operators and algorithm frameworks to handle various data and tasks.
Associate Professor Jiang Li from Shanghai Jiao Tong University wrote the article “Evolution of Specialized Architectures and Compression Technologies for Deep Neural Networks,” which summarizes the development context in four aspects: design abstraction level, toolchain, compression objectives, and dynamic compression, focusing on the open issue of sparsity in deep neural networks, discussing compression technologies for deep neural networks and their co-design with specialized chip architectures, and providing an evolutionary roadmap for deep neural network processors.
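The sparsity that this article highlights is commonly exposed by pruning. As an illustrative sketch only (the article's own techniques are not reproduced here), unstructured magnitude pruning zeroes out the smallest-magnitude weights, and the resulting sparse tensor is what specialized architectures then exploit by skipping zero operands:

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning).

    Note: ties at the threshold may zero slightly more than the requested fraction.
    """
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k > 0 else 0.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Toy weight vector; half of the entries survive, the rest become exact zeros
# that a sparsity-aware accelerator can skip entirely.
w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(magnitude_prune(w, sparsity=0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

In practice pruning is interleaved with fine-tuning to recover accuracy, and structured variants (removing whole channels or filters) trade some compression ratio for hardware-friendlier regularity.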
Associate Professor Lu Ye from Nankai University wrote the article “Challenges and Opportunities of Automation in Deep Neural Network Compression,” which addresses issues such as the variety of model structures, vast search space, and difficulties in designing search methods in the automation of deep neural network compression. It uses explorations in automated quantization of deep neural networks as an example to discuss the progress and challenges of compression automation derived from AutoML (automated machine learning).
Associate Professor Sun Guangyu from Peking University wrote the article “Design and Challenges of Memory-Compute Integrated Neural Network Accelerator Architecture Based on New Memory Devices,” which focuses on the increasingly prominent memory wall problem faced by data-intensive applications such as deep neural networks. It details the principles of memory-compute integrated architectures supported by these new memory devices, such as ReRAM, as well as neural network mapping methods, and presents the challenges and solutions faced by this new neural network acceleration method.
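The core principle behind such memory-compute integrated designs can be sketched briefly (an illustrative model, not taken from the article): a ReRAM crossbar stores each weight as a cell conductance G, applies input activations as voltages V, and Kirchhoff's current law sums I = G·V along each row, so an entire matrix-vector product happens in one analog step inside the memory array.

```python
def crossbar_mvm(conductances, voltages):
    """Digital model of the analog multiply-accumulate performed by a ReRAM crossbar.

    Each row current is the dot product of that row's conductances with the
    column voltages: I[i] = sum_j G[i][j] * V[j].
    """
    return [sum(g * v for g, v in zip(row, voltages)) for row in conductances]

# Toy 2x3 weight matrix mapped to conductances; 3 input activations as voltages.
G = [[0.5, 0.1, 0.2],
     [0.3, 0.4, 0.0]]
V = [1.0, 2.0, 0.5]

print([round(i, 6) for i in crossbar_mvm(G, V)])  # [0.8, 1.1]
```

Real designs must also handle negative weights (differential cell pairs), limited conductance precision, and analog-to-digital conversion overheads, which is where the mapping methods and challenges discussed in the article come in.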
Since the release of the “New Generation Artificial Intelligence Development Plan” in 2017, artificial intelligence has become an important national development strategy in China. Deep learning is one of the main research directions in artificial intelligence: different types of deep neural networks are built around the characteristics of data domains such as video (images), speech, and text, and are deployed in scenarios across the terminals, edges, and cloud centers of IoT systems. By general consensus, the pursuit of higher accuracy in deep learning applications drives deep neural networks toward more layers and larger parameter counts, while the pursuit of higher efficiency makes the compression of deep neural networks and the design of dedicated chip architectures ever more urgent. Fortunately, the sparsity and redundancy of deep neural networks provide the necessary conditions for solutions that satisfy both demands. With the rapid advance of hardware-software co-design technologies such as Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs), neural network accelerators based on key techniques such as mixed precision, dynamic training, and in-memory computing have been designed one after another, and their software ecosystems have developed rapidly, ultimately forming a heterogeneous, distributed intelligent computing system that is “computing everywhere, anytime.” We hope this issue helps readers understand the cutting-edge intersections of deep learning and computer architecture, and envision the vast prospects of intelligent computing in IoT systems and their applications.
Special Statement: The China Computer Federation (CCF) owns all copyrights of the contents published in the “Communications of the China Computer Federation” (CCCF). Without the permission of CCF, no text or photos from this publication may be reproduced; otherwise, it will be considered infringement. CCF will pursue legal responsibility for infringement.
Author Introduction

Li Tao
CCF Outstanding Member, Director, CCCF Special Editor. Professor at Nankai University, Deputy Director of the Party Committee Network Information Office (Big Data Management Center). Main research directions include heterogeneous computing, machine learning, intelligent IoT, blockchain, etc.