Distributed TensorFlow Training with Amazon SageMaker

Machine Heart Reprint

Source: AWS Official Blog
Author: Ajay Vohra

TensorFlow is an open-source machine learning (ML) library widely used to develop large deep neural networks (DNNs). Such DNNs require distributed training, using multiple GPUs across multiple hosts. Amazon SageMaker is a managed service that simplifies the ML workflow, from labeling data with active learning, through hyperparameter optimization, distributed model training, and monitoring of training progress, to deploying trained models as auto-scaling RESTful services and centrally managing concurrent ML experiments.
This article will focus on using Amazon SageMaker for distributed TensorFlow training.

Concept Overview

Although many of the distributed training concepts in this article apply to many types of TensorFlow models, the focus here is on distributed TensorFlow training of the Mask R-CNN model on the Common Objects in Context (COCO) 2017 dataset.

Model

The Mask R-CNN model is used for object instance segmentation, where the model generates pixel-level masks (Sigmoid binary classification) and bounding boxes annotated with object classes (SoftMax classification) to depict each object instance in an image. Some common use cases of Mask R-CNN include perception for autonomous vehicles, surface defect detection, and geospatial image analysis.
There are three key reasons for choosing the Mask R-CNN model:
  1. Distributed data parallel training of Mask R-CNN on large datasets can increase image throughput through the training pipeline and reduce training time.

  2. The Mask R-CNN model has many open-source TensorFlow implementations. This article uses the Tensorpack Mask/Faster-RCNN implementation as the primary example, but also recommends the highly optimized AWS Samples Mask R-CNN implementation.

  3. The Mask R-CNN model is evaluated as a large object detection model in MLPerf results.

The following image is a schematic diagram of the Mask R-CNN deep neural network architecture.

Synchronized Allreduce of Gradients in Distributed Training

The main challenge of distributed DNN training is that the gradients computed during backpropagation on all of the GPUs must be Allreduced (averaged) in a synchronization step before they are applied to update the model weights on the GPUs across the different nodes.
The synchronization Allreduce algorithm needs to be efficient; otherwise, any training speedup gained from distributed data parallel training can be negated by the inefficiency of the synchronization Allreduce step.
For the synchronization Allreduce algorithm to be efficient, it must overcome three main challenges:
  • The algorithm needs to scale with the increasing number of nodes and GPUs in the distributed training cluster.

  • The algorithm needs to leverage the topology of high-speed GPU-to-GPU interconnects within a single node.
  • The algorithm needs to effectively interleave computations on GPUs with communications to other GPUs by efficiently batching communications.
Uber’s open-source library Horovod overcomes these three main challenges by:
  • Providing an efficient synchronization Allreduce algorithm that scales with the number of GPUs and nodes.

  • Utilizing Nvidia Collective Communications Library (NCCL) communication primitives, which leverage knowledge of Nvidia GPU topology.

  • Including Tensor Fusion, which efficiently interleaves communication with computation by batching Allreduce data communications.

Many ML frameworks, including TensorFlow, support Horovod. TensorFlow distribution strategies also use NCCL and offer an alternative to Horovod for distributed TensorFlow training. This article uses Horovod.
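To make the Horovod pattern concrete, here is a minimal sketch (a toy TF1-style graph standing in for Mask R-CNN, not the Tensorpack training code itself) showing the pieces described above: initializing Horovod, pinning each process to the GPU matching its local rank, wrapping the optimizer in hvd.DistributedOptimizer so gradients are Allreduce-averaged, and broadcasting the initial weights from rank 0.

```python
# A minimal Horovod + TensorFlow sketch (TF1-style graph API) illustrating the
# pattern described above; the toy regression model stands in for Mask R-CNN.
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

tf.compat.v1.disable_eager_execution()
hvd.init()  # reads global rank, local rank, and world size from the MPI launch

# Pin this process to the GPU that matches its local rank on the node.
config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy regression graph standing in for the real model.
x = tf.compat.v1.placeholder(tf.float32, [None, 1])
y = tf.compat.v1.placeholder(tf.float32, [None, 1])
w = tf.compat.v1.get_variable('w', [1, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Scale the learning rate by the number of workers (common practice) and wrap
# the optimizer so gradients are Allreduce-averaged across all processes.
opt = hvd.DistributedOptimizer(
    tf.compat.v1.train.GradientDescentOptimizer(0.001 * hvd.size()))
train_op = opt.minimize(loss)

# Broadcast the initial weights from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.compat.v1.train.MonitoredTrainingSession(hooks=hooks,
                                                 config=config) as sess:
    for _ in range(100):
        xs = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: 2.0 * xs})
```

You would launch one such process per GPU with mpirun; Horovod then keeps the model replicas synchronized through the Allreduce step.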
Training large DNNs such as Mask R-CNN places high memory demands on each GPU, so that you can push one or more high-resolution images through the training pipeline. It also requires high-speed GPU-to-GPU interconnects and high-speed networking between machines so that the synchronization Allreduce of gradients can be done efficiently. The Amazon SageMaker ml.p3.16xlarge and ml.p3dn.24xlarge instance types meet all of these requirements. For more information, see Amazon SageMaker ML instance types. They offer eight Nvidia Tesla V100 GPUs, 128–256 GB of GPU memory, 25–100 Gbps networking, and high-speed Nvidia NVLink GPU-to-GPU interconnects, making them well suited for distributed TensorFlow training on Amazon SageMaker.

Message Passing Interface

The next challenge in distributed TensorFlow training is to place the training algorithm processes sensibly across multiple nodes and to associate each process with a unique global rank. The Message Passing Interface (MPI) is a widely used collective communication protocol for parallel computing and is very useful for managing a group of training algorithm worker processes across multiple nodes.
MPI is used to distribute the training algorithm processes across multiple nodes and to associate each algorithm process with a unique global and local rank. Horovod uses the local rank to logically pin each algorithm process on a given node to a specific GPU; this pinning is required for the synchronization Allreduce of gradients.
The main MPI concept to understand in this article is that MPI uses mpirun on the master node to launch concurrent processes across multiple nodes. The master node manages the lifecycle of the distributed training processes running across multiple nodes. To use MPI with Amazon SageMaker for distributed training, you must integrate MPI with Amazon SageMaker’s native distributed training capabilities.
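To make global and local ranks concrete, here is a small self-contained sketch using mpi4py (used here purely for illustration; the article's code relies on Horovod rather than mpi4py directly). Each process launched by mpirun reports its global rank, the world size, and a per-node local rank read from an Open MPI environment variable.

```python
# Illustrative only: a tiny mpi4py script showing the global/local rank
# concepts that Horovod builds on (mpi4py is not part of the article's code).
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
global_rank = comm.Get_rank()   # unique across all processes on all hosts
world_size = comm.Get_size()    # total number of processes

# Open MPI exposes the per-node rank via this variable; other MPI
# implementations use different variables, and Horovod's hvd.local_rank()
# serves the same purpose when you use Horovod.
local_rank = int(os.environ.get('OMPI_COMM_WORLD_LOCAL_RANK', 0))

print(f'global rank {global_rank}/{world_size}, local rank {local_rank} '
      f'on host {MPI.Get_processor_name()}')
```

With Open MPI, you could run this across two 8-GPU hosts with something like mpirun -np 16 -H algo-1:8,algo-2:8 python rank_demo.py (the script name is hypothetical).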

Integrating MPI with Amazon SageMaker Distributed Training

To understand how to integrate MPI and Amazon SageMaker distributed training, you need to have a reasonable understanding of the following concepts:
  • Amazon SageMaker requires that both the training algorithm and framework be packaged in a Docker image.

  • The Docker image must be enabled for Amazon SageMaker training. Using the Amazon SageMaker containers library simplifies this enablement, because it helps you create a Docker image that is enabled for Amazon SageMaker training.
  • You need to provide an entry point script (usually a Python script) in the Amazon SageMaker training image to act as an intermediary between Amazon SageMaker and your algorithm code.
  • To start training on the specified hosts, Amazon SageMaker runs a Docker container from the training image and then invokes the entry point script, passing it information such as hyperparameters and the input data location through entry point environment variables.
  • The entry point script then uses the information passed to it in the entry point environment variables to start the algorithm program with the correct arguments, and it polls the running algorithm processes.
  • If an algorithm process exits, the entry point script exits with the exit code of that algorithm process. Amazon SageMaker uses this exit code to determine whether the training job succeeded.
  • The entry point script redirects the stdout and stderr of the algorithm process to its own stdout. In turn, Amazon SageMaker captures the stdout of the entry point script and sends it to Amazon CloudWatch Logs. Amazon SageMaker parses the stdout for the algorithm metrics defined in the training job and sends the metrics to Amazon CloudWatch metrics.
  • When Amazon SageMaker starts a training job that requests multiple training instances, it creates a set of hosts and logically names each host algo-k, where k is the global rank of the host. For example, if a training job requests four training instances, Amazon SageMaker names the hosts algo-1, algo-2, algo-3, and algo-4. The hosts can connect to each other over the network using these hostnames.
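As a rough illustration of the last few points, the sketch below shows the kind of information an entry point script can read from the environment variables set by a SageMaker training container built with the Amazon SageMaker containers library; the exact set of variables depends on your container and input channels, and the channel name 'train' is an assumption.

```python
# Sketch: reading the information Amazon SageMaker passes to the entry point
# script through environment variables (variable names as set by the SageMaker
# containers library; the 'train' channel name is an assumption).
import json
import os

hosts = json.loads(os.environ['SM_HOSTS'])            # e.g. ["algo-1", "algo-2"]
current_host = os.environ['SM_CURRENT_HOST']          # e.g. "algo-1"
hyperparameters = json.loads(os.environ.get('SM_HPS', '{}'))
train_data_dir = os.environ.get('SM_CHANNEL_TRAIN')   # input data location
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')

print(f'{current_host} is one of {len(hosts)} hosts; '
      f'hyperparameters: {hyperparameters}; training data in {train_data_dir}')
```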
If the distributed training uses MPI, you need the mpirun command, running on the master node (host), to control the lifecycle of all algorithm processes distributed across multiple nodes (algo-1 through algo-n, where n is the number of training instances requested in your Amazon SageMaker training job). However, Amazon SageMaker is not aware of MPI, or of any other parallel processing framework you might use to distribute algorithm processes across multiple nodes. Amazon SageMaker invokes the entry point script in the Docker container running on each node. This means the entry point script needs to know the global rank of its host and execute different logic depending on whether it is invoked on the master node or on one of the non-master nodes.
Specifically, for MPI, the entry point script called on the master node needs to run the mpirun command to start the algorithm processes on all nodes in the current Amazon SageMaker training job. When called by Amazon SageMaker on any non-master node, the same entry point script periodically checks whether the algorithm processes managed remotely by mpirun from the master node are still running and exits if they are not.
The master node in MPI is a logical concept: it depends on the entry point script designating one host as the master node among all hosts in the current training job. This designation must be done in a distributed manner. One simple approach is to designate algo-1 as the master node and all other hosts as non-master nodes. Because Amazon SageMaker provides each node with its logical hostname in the entry point environment variables, a node can easily determine whether it is the master node or a non-master node.
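A simplified sketch of that master/non-master control flow is shown below. It only illustrates the logic just described and is not a copy of the repository's train.py; the algorithm script name, the GPU count, and the mpirun options are placeholders.

```python
# Simplified illustration of the master/non-master entry point logic described
# above; the script name, GPU count, and mpirun options are placeholders.
import json
import os
import subprocess
import sys
import time

hosts = json.loads(os.environ['SM_HOSTS'])
current_host = os.environ['SM_CURRENT_HOST']
gpus_per_host = 8  # ml.p3.16xlarge and ml.p3dn.24xlarge have 8 GPUs each

if current_host == sorted(hosts)[0]:  # designate algo-1 as the master node
    host_list = ','.join(f'{h}:{gpus_per_host}' for h in hosts)
    # Launch one algorithm process per GPU on every host via Open MPI.
    cmd = ['mpirun', '-np', str(len(hosts) * gpus_per_host),
           '-H', host_list, '--allow-run-as-root',
           'python', '/opt/ml/code/train_algo.py']
    # Exit with the algorithm's exit code so SageMaker can judge success.
    sys.exit(subprocess.call(cmd))
else:
    # Non-master hosts are driven remotely by mpirun from the master node.
    # Wait for the algorithm processes to appear, then exit once they are gone.
    seen = False
    while True:
        running = subprocess.run(['pgrep', '-f', 'train_algo.py'],
                                 capture_output=True).returncode == 0
        seen = seen or running
        if seen and not running:
            break
        time.sleep(30)
```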
The train.py included in the accompanying GitHub repository and packaged in the Tensorpack Mask/Faster-RCNN algorithm Docker image follows the logic outlined in this section.
With such a conceptual understanding, you can proceed to the step-by-step tutorial on how to run distributed TensorFlow training for Mask R-CNN using Amazon SageMaker.

Solution Overview

This tutorial has the following key steps:
  1. Use AWS CloudFormation automation scripts to create a private Amazon VPC and an Amazon SageMaker notebook instance network attached to this private VPC.

  2. Start distributed training jobs from the Amazon SageMaker notebook instance attached to the private VPC. You can use Amazon S3, Amazon EFS, or Amazon FSx as the data source for the training data pipeline.

Prerequisites

The following prerequisites must be met:
  1. Create and activate an AWS account or use an existing AWS account.

  2. Manage your Amazon SageMaker instance limits. You need at least two ml.p3dn.24xlarge or two ml.p3.16xlarge instances; a service limit of four of each is recommended. Remember that service limits are specific to each AWS region. This article uses us-west-2.
  3. Clone the GitHub repository of this article and execute the steps in this article. All paths in this article are relative to the root directory of the GitHub repository.
  4. Use any AWS region that supports Amazon SageMaker, EFS, and Amazon FSx. This article uses us-west-2.
  5. Create a new S3 bucket or choose an existing one.

Create an Amazon SageMaker Notebook Instance Attached to VPC

The first step is to run the AWS CloudFormation automation script to create an Amazon SageMaker notebook instance attached to a private VPC. To run this script, you need IAM user permissions that align with network administrator roles. If you do not have such permissions, you may need to seek assistance from a network administrator to run the AWS CloudFormation automation script in this tutorial. For more information, see AWS managed policies for job functions.
Use the AWS CloudFormation template cfn-sm.yaml to create an AWS CloudFormation stack that will create a notebook instance attached to a private VPC. You can create the AWS CloudFormation stack using the cfn-sm.yaml in the AWS CloudFormation service console, or you can customize the variables in the stack-sm.sh script and run that script from anywhere you have the AWS CLI installed.
To use the AWS CLI method, perform the following steps:
  1. Install and configure the AWS CLI.

  2. In stack-sm.sh, set AWS_REGION and S3_BUCKET to your AWS region and your S3 bucket, respectively. You will need these two variables.

  3. Alternatively, if you want to use an existing EFS file system, you need to set the EFS_ID variable. If your EFS_ID is left blank, a new EFS file system will be created. If you choose to use an existing EFS file system, ensure that the existing file system has no existing mount targets. For more information, see managing Amazon EFS file systems.

  4. You can also specify GIT_URL to add a GitHub repository to the Amazon SageMaker notebook instance. If it is a GitHub repository, you can specify GIT_USER and GIT_TOKEN variables.

  5. Run the customized stack-sm.sh script to create an AWS CloudFormation stack using the AWS CLI.

Save the AWS CloudFormation script summary output for later use. You can also view the output at the bottom of the AWS CloudFormation stack output tab in the AWS Management Console.
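If you prefer to drive AWS CloudFormation from Python rather than through stack-sm.sh and the AWS CLI, a rough boto3 equivalent might look like the following sketch; the stack name and parameter keys are placeholders, so check cfn-sm.yaml for the parameter names it actually defines.

```python
# Hedged sketch: creating the CloudFormation stack with boto3 instead of the
# stack-sm.sh wrapper.  The stack name and parameter keys are placeholders.
import boto3

cfn = boto3.client('cloudformation', region_name='us-west-2')

with open('cfn-sm.yaml') as f:
    template_body = f.read()

stack_name = 'mask-rcnn-sagemaker'
cfn.create_stack(
    StackName=stack_name,
    TemplateBody=template_body,
    Capabilities=['CAPABILITY_NAMED_IAM'],   # required if the template creates IAM resources
    Parameters=[
        # Placeholder key -- use the parameter names defined in cfn-sm.yaml.
        {'ParameterKey': 'S3Bucket', 'ParameterValue': 'your-s3-bucket'},
    ],
)

# Block until the VPC, EFS, and notebook instance resources are created.
cfn.get_waiter('stack_create_complete').wait(StackName=stack_name)
```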

Start Amazon SageMaker Training Job

In the Amazon SageMaker console, open the notebook instance you created. In this notebook instance, there are three Jupyter notebooks available for training Mask R-CNN:
  • Mask R-CNN notebook using S3 bucket as data source: mask-rcnn-s3.ipynb.

  • Mask R-CNN notebook using EFS file system as data source: mask-rcnn-efs.ipynb.
  • Mask R-CNN notebook using Amazon FSx Lustre file system as data source: mask-rcnn-fsx.ipynb.
For the Mask R-CNN model and COCO 2017 dataset chosen in this article, the training time performance of all three data source options is roughly similar (though not exactly the same). Each data source has a different cost structure. Here are their differences in terms of setting up the training data pipeline:
  • For the S3 data source, it will take about 20 minutes to copy the COCO 2017 dataset from your S3 bucket to the storage volume attached to each training instance each time you start a training job.

  • For the EFS data source, it will take about 46 minutes to copy the COCO 2017 dataset from your S3 bucket to your EFS file system. You only need to copy this data once. During training, data will be input from the shared EFS file system mounted on all training instances via the network interface.

  • For Amazon FSx, it will take about 10 minutes to create a new Amazon FSx for Lustre file system and import the COCO 2017 dataset from your S3 bucket into it. You only need to do this once. During training, data will be input from the shared Amazon FSx for Lustre file system mounted on all training instances via the network interface.
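As a point of reference for the EFS and Amazon FSx options above, the SageMaker Python SDK lets you pass a file system as a training channel. The sketch below is illustrative only; the file system IDs and directory paths are placeholders.

```python
# Illustrative sketch: specifying an EFS or FSx for Lustre file system as the
# training data channel (file system IDs and paths are placeholders).
from sagemaker.inputs import FileSystemInput

efs_train = FileSystemInput(file_system_id='fs-0123456789abcdef0',
                            file_system_type='EFS',
                            directory_path='/mask-rcnn/coco2017',
                            file_system_access_mode='ro')

fsx_train = FileSystemInput(file_system_id='fs-0fedcba9876543210',
                            file_system_type='FSxLustre',
                            directory_path='/fsx/mask-rcnn/coco2017',
                            file_system_access_mode='ro')
```

Either object can be passed to the estimator's fit() call in place of an S3 URI, as in the launch sketch at the end of this section.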

If you are unsure which data source option is best for you, you can first try using S3, and if the training data download time at the start of each training job is unacceptable, then explore and choose EFS or Amazon FSx. Do not make assumptions about the training time performance of any data source. Training time performance depends on many factors; the best practice is to experiment and measure.
In all three cases, the logs and model checkpoints produced during training are written to the storage volume attached to each training instance and then uploaded to your S3 bucket at the end of training. Logs are also streamed to Amazon CloudWatch Logs during training, so you can inspect them while training is in progress. System and algorithm training metrics are streamed to Amazon CloudWatch metrics during training, and you can visualize them in the Amazon SageMaker service console.
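To tie these pieces together, the following hedged sketch shows how a distributed training job might be launched from one of the notebooks with version 2 of the SageMaker Python SDK (version 1 uses train_instance_count and train_instance_type instead). The image URI, hyperparameters, and metric definition are placeholders; the repository notebooks define the real ones.

```python
# Hedged sketch: launching the distributed training job with version 2 of the
# SageMaker Python SDK.  The image URI, hyperparameters, and metric definition
# are placeholders; the repository notebooks define the real ones.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

estimator = Estimator(
    image_uri='<account-id>.dkr.ecr.us-west-2.amazonaws.com/mask-rcnn:latest',
    role=role,
    instance_count=2,                    # number of algo-k training hosts
    instance_type='ml.p3.16xlarge',      # 8 Tesla V100 GPUs per host
    output_path=f's3://{bucket}/mask-rcnn/output',
    hyperparameters={'epochs': 24},      # placeholder hyperparameter
    metric_definitions=[                 # parsed from stdout into CloudWatch
        {'Name': 'train_loss', 'Regex': 'loss:\\s*([0-9\\.]+)'},
    ],
    sagemaker_session=session,
)

# The 'train' channel can be an S3 URI, as here, or one of the FileSystemInput
# objects shown earlier for EFS and FSx for Lustre.
estimator.fit({'train': f's3://{bucket}/coco2017'})
```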

Training Results

The following images show example results for the two algorithms after training for 24 epochs on the COCO 2017 dataset.
Below you can see example results for the Tensorpack Mask/Faster-RCNN algorithm. The charts can be grouped into three categories:
  1. Mean Average Precision (mAP) plots for bounding-box (bbox) predictions at various Intersection over Union (IoU) thresholds, as well as for small, medium, and large object sizes

  2. Mean Average Precision (mAP) plots for object instance segmentation (segm) predictions at various Intersection over Union (IoU) thresholds, as well as for small, medium, and large object sizes

  3. Other metrics related to training loss or label accuracy

You can see the example results of the optimized AWS Samples Mask R-CNN algorithm below. The aggregated mAP metrics displayed are nearly identical to those of the previous algorithm, though the convergence progress varies.

Conclusion

Amazon SageMaker provides a simplified, Docker-based platform for distributed TensorFlow training, allowing you to focus on your ML algorithm without being distracted by ancillary concerns such as the availability and scalability of the underlying infrastructure or the management of concurrent experiments. Once model training is complete, you can use Amazon SageMaker's integrated model deployment capabilities to create an auto-scaling RESTful service endpoint for your model and start testing it. For more information, see Deploying Models on Amazon SageMaker Managed Services. Once the model is ready, you can seamlessly deploy the model's RESTful service into production.
About the Author

Ajay Vohra is a Principal Solutions Architect specializing in perception machine learning for autonomous vehicle development. Before joining Amazon, Ajay worked on large-scale parallel grid computing for financial risk modeling and on automating application platform engineering in on-premises data centers.

Amazon SageMaker 1,000 Yuan Gift Package

Enterprise developers can now claim a free service credit worth 1,000 Yuan to get started with Amazon SageMaker easily. Beyond the content covered in this article, there are more application examples waiting for you to try.


