Training BERT and ResNet on Smartphones: 35% Energy Reduction

The researchers frame edge training as an optimization problem, which lets them discover the schedule that minimizes energy consumption under a given memory budget.

Currently, deep learning models are widely deployed on edge devices such as smartphones and embedded platforms for inference. Training, however, is still conducted primarily on large cloud servers equipped with high-throughput accelerators such as GPUs. Centralized cloud training requires sensitive data, such as photos and keystrokes, to be transmitted from edge devices to the cloud, sacrificing user privacy and incurring additional data-movement costs.
Figure Caption: Twitter @Shishir Patil
Therefore, to let users personalize their models without sacrificing privacy, on-device training methods such as federated learning avoid aggregating data in the cloud and instead compute training updates locally. These methods have been deployed in Google’s Gboard keyboard to personalize suggestions and are also used by iPhones to improve automatic speech recognition. However, current on-device training methods cannot train modern architectures and large models. Training larger models on edge devices is infeasible mainly because limited device memory cannot hold the activations needed for backpropagation: a single training iteration of ResNet-50 requires over 200 times the memory of inference.
Previous works proposed strategies such as paging to auxiliary memory and rematerialization (recomputing activations) to reduce the memory usage of cloud training. However, these methods significantly increase overall energy consumption. The data transfers associated with paging often cost more energy than computing the data in the first place, and as the memory budget shrinks, the energy overhead of rematerialization grows on the order of O(n^2).
In a recent paper from UC Berkeley, several researchers showed that paging and rematerialization are highly complementary. By rematerializing cheap operations while paging the results of expensive operations to auxiliary storage such as flash memory or an SD card, they were able to expand effective memory capacity with minimal energy cost. By combining the two techniques, the researchers also demonstrated that models like BERT can be trained on mobile-grade edge devices. Viewing edge training as an optimization problem, they discovered the optimal schedule that minimizes energy consumption under a given memory budget.
  • Paper link: https://arxiv.org/pdf/2207.07697.pdf
  • Project homepage: https://poet.cs.berkeley.edu/
  • GitHub link: https://github.com/shishirpatil/poet
The researchers proposed POET (Private Optimal Energy Training), an algorithm for energy-optimal training of modern neural networks on memory-constrained edge devices, whose architecture is shown below (Figure 1). Since caching all activation tensors for backpropagation is prohibitively expensive, POET optimizes which activations to page and which to rematerialize, reducing peak memory consumption by up to a factor of two. The researchers cast the edge-training problem as a mixed-integer linear program (MILP) and found that a solver can produce optimal solutions within 10 minutes.
Figure Caption: POET optimizes the training of SOTA machine learning models on edge devices.
For models deployed on real-world edge devices, training happens when the device is idle and has spare compute cycles; for example, Google Gboard schedules model updates while the phone is charging. POET therefore also supports hard training deadlines: given the memory limit and the number of training epochs, the schedules it produces are guaranteed to finish within the specified time budget. In addition, the researchers built a comprehensive cost model for POET and proved that it is mathematically sound (i.e., exact rather than approximate) and applicable to existing architectures out of the box.
Lead author Shishir Patil stated in a demonstration video that the POET algorithm can train any memory-hungry SOTA model on commercial edge devices such as smartphones. The team is also the first to demonstrate training SOTA machine learning models such as BERT and ResNet on smartphones and ARM Cortex-M devices.
Integrating Paging and Rematerialization
Rematerialization and paging are two techniques for reducing the memory consumption of large SOTA ML models. With rematerialization, an activation tensor is deleted as soon as it is no longer needed, most commonly during forward propagation, freeing valuable memory for the activations of subsequent layers. When the deleted tensor is needed again, it is recomputed from related activations according to its lineage. Paging, also known as offloading, is a complementary memory-reduction technique: activation tensors that are not immediately needed are moved from main memory to secondary storage, such as flash memory or an SD card, and paged back in when required.
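To make the distinction concrete, here is a minimal sketch in plain Python/NumPy. The class names and helpers are purely illustrative (they are not part of POET or any library): one wrapper drops a tensor and recomputes it from its lineage, the other writes it to secondary storage and reads it back.

```python
import os
import numpy as np

class RematerializedActivation:
    """Drop the tensor now; recompute it later from its lineage."""
    def __init__(self, lineage_fn, *parent_tensors):
        self.lineage_fn = lineage_fn   # e.g. the ReLU that produced this activation
        self.parents = parent_tensors  # activations it was computed from
    def restore(self):
        return self.lineage_fn(*self.parents)   # pay compute energy again

class PagedActivation:
    """Move the tensor to secondary storage (flash / SD card); page it back in."""
    def __init__(self, tensor, path):
        self.path = path
        np.save(path, tensor)                   # page-out: pay transfer energy
    def restore(self):
        return np.load(self.path + ".npy")      # page-in: pay transfer energy again

# Cheap, memory-heavy op (ReLU): rematerialize its output.
x = np.random.randn(1024, 1024).astype(np.float32)
relu_out = RematerializedActivation(lambda t: np.maximum(t, 0), x)

# Expensive op (dense matmul): page its output instead of recomputing it.
w = np.random.randn(1024, 1024).astype(np.float32)
paged_out = PagedActivation(x @ w, "act_L5")

# Later, during backpropagation, both can be restored on demand.
_ = relu_out.restore()
_ = paged_out.restore()
os.remove("act_L5.npy")
```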
Figure 2 shows the execution timeline of an eight-layer neural network. Along the X-axis, each unit corresponds to one layer of the network (8 layers in total, L1 through L8). The Y-axis represents logical time steps within an epoch. Occupied (colored) cells indicate the operation executed at the corresponding time step: forward/backward computation, rematerialization, or paging.
For example, the activation of L1 is computed at the first time step (T1), and the activations of L2 and L3 are computed at T2 and T3, respectively. Assuming layers L2 and L3 are memory-intensive but computationally cheap operations, such as nonlinearities (tanh, ReLU, etc.), rematerialization is the better choice: their activations can be deleted to free memory ({T3, L2}, {T4, L3}) and recomputed when they are needed again during backpropagation ({T14, L3}, {T16, L2}).
Now suppose layers L5 and L6 are compute-intensive operations such as convolutions or dense matrix multiplications. For such operations, rematerialization would increase both runtime and energy, making it suboptimal. For these layers, it is better to page the activation tensors out to auxiliary storage ({T6, L5}, {T7, L6}) and page them back in when needed ({T10, L6}, {T11, L5}).
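The per-tensor trade-off in this example can be captured by a simple greedy rule, shown below purely for intuition. POET makes these decisions jointly for all tensors via its MILP; the energy numbers and layer list here are invented for illustration.

```python
def cheaper_strategy(compute_energy_mj, tensor_bytes, transfer_mj_per_mb):
    """Return 'rematerialize' or 'page' for a single activation tensor."""
    # Paging pays the transfer cost twice: once out, once back in.
    page_energy_mj = 2 * (tensor_bytes / 1e6) * transfer_mj_per_mb
    return "rematerialize" if compute_energy_mj < page_energy_mj else "page"

layers = [
    # (name, energy to recompute the activation [mJ], activation size [bytes])
    ("L2_relu",  0.05, 4_000_000),   # cheap nonlinearity, large tensor
    ("L5_conv", 12.00, 2_000_000),   # expensive convolution
]

for name, e_compute, size in layers:
    print(name, "->", cheaper_strategy(e_compute, size, transfer_mj_per_mb=1.5))
# L2_relu -> rematerialize   (recomputing costs far less than moving 4 MB twice)
# L5_conv -> page            (recomputing costs more than moving 2 MB twice)
```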
A major advantage of paging is that it can be pipelined according to the occupancy of the memory bus to hide latency: modern systems have DMA (Direct Memory Access) engines that can move activation tensors between auxiliary storage and main memory while the compute engine runs in parallel. For example, at time step T7, L6 can be paged out while L7 is being computed. Rematerialization, by contrast, is compute-intensive and cannot be overlapped, so it increases runtime: time step T14 must be dedicated to recomputing L3, delaying the rest of the backward pass.
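The overlap idea can be sketched in a few lines of Python. This is only a toy analogy (threads and sleeps standing in for DMA transfers and kernels, not POET code): a background transfer brings an activation back while the main thread keeps computing.

```python
import threading
import time

def page_in(name, results):
    time.sleep(0.05)              # stand-in for a DMA transfer from flash/SD card
    results[name] = "tensor"

def compute(name):
    time.sleep(0.05)              # stand-in for the current layer's kernel
    return f"{name} done"

prefetched = {}
t = threading.Thread(target=page_in, args=("L6_activation", prefetched))
t.start()                         # start paging L6 back in ...
print(compute("L7"))              # ... while L7's computation proceeds
t.join()                          # by the time L6 is needed, it is already in memory
print("prefetched:", list(prefetched))
```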
POET
This research proposed POET, a graph-level compiler for deep neural networks that rewrites the training DAG of large models to fit the memory constraints of edge devices while maintaining high energy efficiency.
POET is hardware-aware: it first traces the execution of forward and backward propagation, recording memory allocation requests, runtimes, and the memory and energy consumption of each operation. This fine-grained profiling is performed only once per workload on a given hardware platform, is automated and inexpensive, and gives POET an accurate cost model. POET then generates a mixed-integer linear programming (MILP) formulation that can be solved efficiently.
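As a rough illustration of what a one-off, per-operation profiling pass might look like, here is a toy sketch in plain Python. It measures only runtime and Python-level peak memory; the paper's profiler measures real operator kernels on the target device, and energy would come from an external power measurement that is omitted here.

```python
import time
import tracemalloc

def profile_op(op_fn, *args, repeats=10):
    """Record average runtime and peak (Python-level) memory for one operation."""
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(repeats):
        out = op_fn(*args)
    runtime_s = (time.perf_counter() - start) / repeats
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"runtime_s": runtime_s, "peak_bytes": peak_bytes, "output": out}

# Example: profile a toy "layer" implemented as a plain Python function.
stats = profile_op(lambda xs: [v * v for v in xs], list(range(100_000)))
print(stats["runtime_s"], stats["peak_bytes"])
```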
The POET optimizer searches for effective rematerialization and paging schedules that minimize end-to-end energy consumption under the memory constraint. From the resulting schedule it then generates a new DAG to execute on the edge device.
While the MILP is solved on commercial hardware, the schedule sent to edge devices is only a few hundred bytes, making it highly memory-efficient.
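To see why such a schedule stays tiny, consider a naive encoding in which each (time step, node, action) decision takes one byte per field. This is a hypothetical format for illustration only; the paper does not describe POET's actual serialization.

```python
import struct

# One byte each for time step, node id, and action keeps a schedule covering
# hundreds of decisions within a few hundred bytes.
COMPUTE, REMAT, PAGE_OUT, PAGE_IN = 0, 1, 2, 3

def pack_schedule(decisions):
    """decisions: list of (timestep, node_id, action) triples, each value 0-255."""
    return b"".join(struct.pack("BBB", t, k, a) for t, k, a in decisions)

schedule = pack_schedule([
    (1, 1, COMPUTE), (3, 2, REMAT), (6, 5, PAGE_OUT), (11, 5, PAGE_IN),
])
print(len(schedule), "bytes")   # 12 bytes for four decisions
```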
Rematerialization is most effective for operations that are computationally cheap but memory-intensive, whereas paging is better suited to compute-intensive operations, for which rematerialization would incur a significant energy overhead. POET considers rematerialization and paging jointly, in a single integrated search space.
The methods in this paper are scalable to complex, real-world architectures, and the POET optimizer algorithm is as follows.
Figure Caption: The POET optimizer algorithm.
The research introduces a new objective function that minimizes the combined energy consumption of computation, page-in, and page-out. The objective combining paging and rematerialization energy is:
minimize Σ_{t=1..T} Σ_{k=1..N} [ Φ_compute(k)·R_{t,k} + Φ_pagein(k)·M_in_{t,k} + Φ_pageout(k)·M_out_{t,k} ]
where Φ_compute, Φ_pagein, and Φ_pageout represent the energy consumed by node k during computation, page-in, and page-out, respectively, and R_{t,k}, M_in_{t,k}, and M_out_{t,k} are binary decision variables indicating whether node k is recomputed, paged in, or paged out at time step t.
POET outputs a DAG schedule that specifies which nodes k are rematerialized and which are paged in or paged out at each time step.
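For readers who want to see the shape of this objective in code, here is a toy MILP with the same structure, written with the open-source PuLP solver interface. This is an illustrative assumption, not the authors' implementation: the cost numbers are random, the memory-budget, dependency, and deadline constraints are omitted, and the single constraint included is only there to make the toy problem non-trivial.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value
import random

T, K = 16, 8          # time steps, graph nodes (toy sizes)
random.seed(0)
phi_compute = {k: random.uniform(1, 10) for k in range(K)}   # energy per recompute
phi_pagein  = {k: random.uniform(0.5, 5) for k in range(K)}  # energy per page-in
phi_pageout = {k: random.uniform(0.5, 5) for k in range(K)}  # energy per page-out

prob = LpProblem("poet_toy_objective", LpMinimize)
R     = {(t, k): LpVariable(f"R_{t}_{k}",    cat="Binary") for t in range(T) for k in range(K)}
M_in  = {(t, k): LpVariable(f"Min_{t}_{k}",  cat="Binary") for t in range(T) for k in range(K)}
M_out = {(t, k): LpVariable(f"Mout_{t}_{k}", cat="Binary") for t in range(T) for k in range(K)}

# Combined energy of recomputation, page-in, and page-out over all (t, k) pairs,
# mirroring the structure of the objective above.
prob += lpSum(
    phi_compute[k] * R[t, k] + phi_pagein[k] * M_in[t, k] + phi_pageout[k] * M_out[t, k]
    for t in range(T) for k in range(K)
)

# Placeholder constraint so the toy problem is not trivially all-zeros:
# every node must be (re)computed at least once during the epoch.
for k in range(K):
    prob += lpSum(R[t, k] for t in range(T)) >= 1

prob.solve()
print("toy optimal energy:", value(prob.objective))
```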
Experimental Results
In evaluating POET, the researchers aimed to answer three key questions. First, how much energy does POET save across different models and platforms? Second, how does POET benefit from its hybrid paging-and-rematerialization strategy? Finally, how does POET adapt to different runtime budgets?
The researchers evaluated four different hardware devices, listed in Table 2: the ARM Cortex-M0 (MKR1000), ARM Cortex-M4F (nRF52840), ARM Cortex-A72 (Raspberry Pi 4B+), and Nvidia Jetson TX2. POET is fully hardware-aware and relies on fine-grained profiling.
Figure 3 shows the energy consumption per training epoch, with each column corresponding to a different hardware platform. The researchers found that POET generated energy-optimal schedules across all platforms (Y-axis), while reducing peak memory consumption (X-axis) and meeting time budgets.
In Figure 5, the researchers benchmarked POET and Capuchin while training ResNet-18 on the A72. As the RAM budget decreased, Capuchin consumed 73% to 141% more energy than a baseline with full memory, whereas POET's additional energy consumption stayed below 1%. This trend holds across all architectures and platforms tested.
In Table 3, the study benchmarked POET against POFO while training ResNet-18 on Nvidia’s Jetson TX2. The researchers found that POET identified an integrated rematerialization-and-paging schedule that reduced peak memory consumption by 8.3% and increased throughput by 13%. This showcases the advantage of POET’s MILP solver, which optimizes over a larger search space. Moreover, while POFO supports only linear models, POET extends to nonlinear models, as shown in Figure 3.
Figure 4 highlights the benefits of POET’s integrated strategy under different time constraints, plotting the total energy consumption for each runtime budget.
