Training BERT and ResNet on Smartphones: 35% Energy Reduction

This article is sourced from Machine Heart.

The researchers view edge training as an optimization problem, which lets them discover the optimal schedule that minimizes energy consumption under a given memory budget.

Currently, deep learning models are widely deployed for inference on edge devices such as smartphones and embedded platforms. However, training predominantly occurs on large cloud servers equipped with high-throughput accelerators like GPUs. Centralized cloud training requires sensitive data such as photos and keystrokes to be transmitted from edge devices to the cloud, sacrificing user privacy and incurring additional data transfer costs.
[Image: Twitter @Shishir Patil]
Therefore, to let users personalize their models without sacrificing privacy, on-device training methods such as federated learning perform training updates locally instead of centralizing data in the cloud. These methods have been deployed in Google's Gboard keyboard for personalized keyboard suggestions and on iPhones to improve automatic speech recognition. However, current on-device training methods cannot train modern architectures and large models. Training larger models on edge devices is impractical mainly because limited device memory cannot hold the activations needed for backpropagation: a single training iteration of ResNet-50 requires more than 200 times the memory of inference.
Previously proposed strategies for reducing memory usage during cloud training include paging to auxiliary memory and re-implementation (i.e., rematerialization: recomputing activations instead of storing them). However, these methods significantly increase overall energy consumption: the data transfers incurred by paging typically require more energy than recomputing the data would, and as the memory budget shrinks, the energy consumed by re-implementation grows at a rate of O(n^2).
In a recent paper, researchers from UC Berkeley demonstrated that paging and re-implementation are highly complementary. By re-implementing simple operations while paging the results of complex operations out to auxiliary storage such as flash or an SD card, they can effectively expand memory capacity at minimal energy cost. Combining the two methods, the researchers also showed that it is feasible to train models like BERT on mobile-class edge devices. By treating edge training as an optimization problem, they discovered the optimal schedule that minimizes energy consumption under a given memory budget.
  • Paper link: https://arxiv.org/pdf/2207.07697.pdf
  • Project homepage: https://poet.cs.berkeley.edu/
  • GitHub link: https://github.com/shishirpatil/poet
The researchers proposed POET (Private Optimal Energy Training), an algorithm for energy-optimal training of modern neural networks on memory-constrained edge devices; its architecture is shown in Figure 1. Because caching all activation tensors for backpropagation is prohibitively expensive, POET optimizes the paging and re-implementation of activations, reducing peak memory consumption by up to 2x. The researchers reformulate the edge training problem as an integer linear program (ILP) and find that a solver can produce optimal solutions within ten minutes.
Figure 1: POET optimizes the training of SOTA machine learning models on edge devices.
For models deployed on real-world edge devices, training happens when the device is idle and has spare compute cycles; for example, Google Gboard schedules model updates while the phone is charging. POET therefore also incorporates a strict training deadline: given a memory limit and a number of training epochs, the schedules it generates also meet the specified training deadline. In addition, the researchers developed a comprehensive cost model for POET and proved that it is mathematically sound (i.e., it introduces no approximations) and applicable to existing off-the-shelf architectures.
Lead author Shishir Patil stated in a demonstration video that the POET algorithm can train any memory-hungry SOTA model on commodity edge devices such as smartphones. The team is also the first to demonstrate training SOTA machine learning models such as BERT and ResNet on smartphones and ARM Cortex-M devices.
Integrated Paging and Re-Implementation
Re-implementation and paging are two techniques for reducing the memory consumption of large SOTA ML models. With re-implementation, an activation tensor is deleted as soon as it is no longer needed, most commonly during forward propagation, freeing valuable memory for the activations of later layers. When a deleted tensor is needed again, it is recomputed from the other activations in its lineage. Paging, also known as offloading, is a complementary memory-reduction technique: activation tensors that are not immediately needed are moved from main memory to secondary storage such as flash or an SD card, and paged back in when they are needed again.
Figure 2 shows the execution timeline of an eight-layer neural network. Along the X-axis, each unit corresponds to one of the network's eight layers; the Y-axis represents logical time steps within an epoch. An occupied (colored) cell indicates the operation executed at the corresponding time step (a forward/backward computation, a re-implementation, or a paging operation).
For example, we can see that the activation of L1 is computed at the first time step (T1), and the activations of L2 and L3 are computed at T2 and T3, respectively. Suppose layers L2 and L3 are memory-intensive but computationally cheap operations, such as nonlinearities (tanh, ReLU, etc.); then re-implementation is the best choice. We can delete their activations to free memory ({T3, L2}, {T4, L3}) and re-implement them when they are needed again during backpropagation ({T14, L3}, {T16, L2}).
Now suppose layers L5 and L6 are computationally intensive operations, such as convolutions or dense matrix multiplications. For such operations, re-implementation would increase run time and energy, making it suboptimal. For these layers it is better to page the activation tensors out to auxiliary storage ({T6, L5}, {T7, L6}) and page them back in when they are needed ({T10, L6}, {T11, L5}).
A major advantage of paging is that, depending on how busy the memory bus is, it can be pipelined to hide latency. Modern systems have DMA (direct memory access), so activation tensors can be moved between auxiliary storage and main memory while the compute engine runs in parallel; for example, at time step T7, L6 can be paged out while L7 is being computed. Re-implementation, by contrast, is compute-bound and cannot be overlapped, so it lengthens the run time: time step T14 must be spent recomputing L3, delaying the rest of backpropagation.
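To make the two primitives concrete, here is a minimal PyTorch sketch (illustrative only: POET targets edge runtimes rather than PyTorch, and the cheap/costly layer split below is a toy assumption). Gradient checkpointing stands in for re-implementation, and PyTorch's save_on_cpu hook stands in for paging activations out of fast memory.

```python
# Minimal sketch: re-implementation vs. paging of activations (illustrative).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

cheap = nn.Sequential(nn.Linear(256, 256), nn.ReLU())  # memory-heavy, compute-light
costly = nn.Linear(256, 256)                           # compute-heavy

x = torch.randn(32, 256, requires_grad=True)

# Re-implementation: do not keep the cheap block's activations;
# recompute them during the backward pass instead.
h = checkpoint(cheap, x, use_reentrant=False)

# "Paging" analogue: activations saved for the costly layer are moved out of
# accelerator memory after the forward pass and brought back for backward.
with torch.autograd.graph.save_on_cpu():
    y = costly(h)

y.sum().backward()
print(x.grad.shape)  # gradients flow through both memory-saving mechanisms
```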
POET
This research proposes POET, a graph-level compiler for deep neural networks that rewrites the training DAG of large models to fit the memory constraints of edge devices while maintaining high energy efficiency.
POET is hardware-aware: it first traces the execution of forward and backward propagation and records the memory allocation requests, run time, and energy consumption of every operation. This fine-grained profiling is performed only once per workload for a given hardware platform, is automated and inexpensive, and provides POET with an accurate cost model. POET then generates a mixed-integer linear program (MILP) that can be solved efficiently.
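As a rough illustration of this kind of per-operation profiling, the sketch below uses torch.profiler to record per-operator run time and memory for one forward and backward pass on a host machine. Energy figures would have to come from an external power monitor on the target device (torch.profiler does not measure energy), and the ResNet-18 workload is just an example.

```python
# Per-operator profiling sketch: run time and memory per op for one
# forward + backward pass (energy must be measured with external tooling).
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.resnet18()
x = torch.randn(1, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    loss = model(x).sum()
    loss.backward()

# The per-op statistics are the raw ingredients of a POET-style cost model.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```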
The POET optimizer searches for effective re-implementation and paging schedules that minimize end-to-end energy consumption subject to the memory constraint. It then uses the resulting schedule to generate a new DAG to execute on the edge device.
While the MILP is solved on commodity hardware rather than on the edge device, the schedule sent to the edge device is only a few hundred bytes, making it highly memory-efficient.
Re-implementation is most effective for operations that are computationally cheap but memory-intensive, whereas paging is better suited to computationally intensive operations, for which re-implementation would incur significant energy overhead. POET considers re-implementation and paging together, in a single integrated search space.
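As a toy illustration of what such an integrated MILP search space can look like, the sketch below uses the PuLP solver with made-up energy costs, tensor sizes, and a heavily simplified dependency constraint; the variable names and constraints are assumptions for illustration and are far coarser than the formulation in the paper.

```python
# Toy MILP in the spirit of POET's integrated search space (not the paper's
# exact formulation): choose, per time step, which activations to recompute
# and which to page in/out, minimizing energy under a RAM budget.
import pulp

T, N = 8, 4                        # time steps and graph nodes (toy sizes)
compute_e = [1.0, 0.2, 5.0, 1.5]   # hypothetical per-node recompute energy
page_e = [0.8, 0.8, 0.8, 0.8]      # hypothetical per-node page-in/out energy
mem = [4, 6, 2, 3]                 # hypothetical activation sizes
budget = 8                         # RAM budget

prob = pulp.LpProblem("poet_like_schedule", pulp.LpMinimize)
R = pulp.LpVariable.dicts("recompute", (range(T), range(N)), cat="Binary")
Min = pulp.LpVariable.dicts("page_in", (range(T), range(N)), cat="Binary")
Mout = pulp.LpVariable.dicts("page_out", (range(T), range(N)), cat="Binary")
S = pulp.LpVariable.dicts("in_ram", (range(T), range(N)), cat="Binary")

# Objective: total compute energy plus paging energy.
prob += pulp.lpSum(
    compute_e[k] * R[t][k] + page_e[k] * (Min[t][k] + Mout[t][k])
    for t in range(T) for k in range(N)
)

# RAM budget must hold at every time step.
for t in range(T):
    prob += pulp.lpSum(mem[k] * S[t][k] for k in range(N)) <= budget

# A tensor is resident only if it was resident before, was just recomputed,
# or was just paged back in; paging out evicts it.
for t in range(T):
    for k in range(N):
        prev = S[t - 1][k] if t > 0 else 0
        prob += S[t][k] <= prev + R[t][k] + Min[t][k]
        prob += S[t][k] <= 1 - Mout[t][k]

# Stand-in for real forward/backward dependencies: every activation must be
# materialized (recomputed) at least once during the schedule.
# (POET's real model also enforces a training-deadline constraint; omitted here.)
for k in range(N):
    prob += pulp.lpSum(R[t][k] for t in range(T)) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], "energy =", pulp.value(prob.objective))
```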
The methods in this paper are scalable to complex, real-world architectures, as illustrated by the POET optimizer algorithm.
This research introduces a new objective function that minimizes the combined energy consumption of computation, page-in, and page-out:

minimize  Σ_{t=1..T} Σ_{k=1..N} [ Φ_compute(k)·R_{t,k} + Φ_pagein(k)·M_in_{t,k} + Φ_pageout(k)·M_out_{t,k} ]

where Φ_compute(k), Φ_pagein(k), and Φ_pageout(k) are the energy consumed by node k during computation, page-in, and page-out, respectively, and the binary variables indicate whether node k is re-implemented (R_{t,k}), paged in (M_in_{t,k}), or paged out (M_out_{t,k}) at time step t. From the solution, POET outputs the DAG schedule: which nodes are re-implemented and which are paged in or out at each time step.
Experimental Results
In evaluating POET, the researchers aimed to answer three key questions. First, how much energy can POET reduce across different models and platforms? Second, how does POET benefit from the hybrid paging and re-implementation strategy? Lastly, how does POET adapt to different runtime budgets?
The researchers list four hardware devices in Table 2: the ARM Cortex-M0 based MKR1000, the ARM Cortex-M4F nRF52840, the Cortex-A72 based Raspberry Pi 4B+, and the Nvidia Jetson TX2. POET is fully hardware-aware and relies on fine-grained profiling of each platform.
Figure 3 shows the energy consumption for a single training epoch, with each column corresponding to different hardware platforms. The researchers found that POET generated energy-efficient optimal schedules (Y-axis) across all platforms while reducing peak memory consumption (X-axis) and meeting time budgets.
In Figure 5, the researchers benchmarked POET against Capuchin while training ResNet-18 on the A72. As the RAM budget decreased, Capuchin consumed 73% to 141% more energy than a full-memory baseline, whereas POET's additional energy consumption stayed below 1%. This trend holds across all architectures and platforms tested.
In Table 3, the study benchmarked POET against POFO while training ResNet-18 on Nvidia’s Jetson TX2. The researchers found that POET identified an integrated re-implementation and paging schedule that reduced peak memory consumption by 8.3% and improved throughput by 13%. This demonstrates the advantage of POET’s MILP solver, which optimizes over a larger search space. While POFO only supports linear models, POET can generalize to nonlinear models, as shown in Figure 3.
Figure 4 highlights the benefit of POET's integrated strategy under different time budgets: for each runtime budget, it plots the total energy consumption.

P.S. You are welcome to attend the Global Edge Computing Conference on August 6 in Shenzhen, where you can see the best edge computing software and hardware solutions in China. Leading domestic edge computing players, including Volcano Engine Edge Computing, Lenovo Group, Wangsu Technology, EMQ, and Wangxin Technology, have been invited to join the discussion.

