This article is reposted from Computer Vision Alliance.

tf.distribute.Strategy is a TensorFlow API used to distribute training across multiple GPUs, multiple machines, or TPUs. With this API, you can distribute existing models and training code with minimal code changes.
tf.distribute.Strategy has been designed with these key goals in mind:
- Easy to use and supports multiple user segments, including researchers, ML engineers, etc.
- Provides good performance out of the box.
- Easy to switch between strategies.
tf.distribute.Strategy can be used with high-level APIs like Keras, and can also be used to distribute custom training loops (and, in general, any computation done with TensorFlow).
In TensorFlow 2.0, you can execute your programs eagerly, or in a graph using tf.function. tf.distribute.Strategy intends to support both of these modes of execution. Although we discuss training for most of this guide, this API can also be used for distributing evaluation and prediction on different platforms.
You can use tf.distribute.Strategy with very few changes to your code, because we have changed the underlying components of TensorFlow to become strategy-aware. This includes variables, layers, models, optimizers, metrics, summaries, and checkpoints.
In this guide, we outline various types of strategies and how to use them in different scenarios.
tf.distribute.Strategy aims to cover a number of use cases along different axes. Some of these combinations are currently supported and others will be added in the future. Some of these axes are:
- Synchronous vs asynchronous training: These are two common ways of distributing training with data parallelism. In sync training, all workers train over different slices of input data in sync and aggregate gradients at each step. In async training, all workers train independently over the input data and update variables asynchronously. Typically sync training is supported via all-reduce, and async training via a parameter server architecture.
- Hardware platform: You may want to scale your training onto multiple GPUs on one machine, or multiple machines in a network (with 0 or more GPUs each), or on Cloud TPUs.
In order to support these use cases, there are six strategies available. In the next section we explain which of these are supported in which scenarios in TF 2.0 at this time. Here is a quick overview:
Training API | MirroredStrategy | TPUStrategy | MultiWorkerMirroredStrategy | CentralStorageStrategy | ParameterServerStrategy | OneDeviceStrategy |
---|---|---|---|---|---|---|
Keras API | Supported | Experimental support | Experimental support | Experimental support | Support planned post 2.0 | Supported |
Custom training loop | Experimental support | Experimental support | Support planned post 2.0 | Support planned post 2.0 | No support yet | Supported |
Estimator API | Limited Support | Not supported | Limited Support | Limited Support | Limited Support | Limited Support |
tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates.
Efficient all-reduce algorithms are used to communicate the variable updates across the devices. All-reduce aggregates tensors across all the devices by adding them up, and makes them available on each device. It’s a fused algorithm that is very efficient and can reduce the overhead of synchronization significantly. There are many all-reduce algorithms and implementations available, depending on the type of communication available between devices. By default, it uses NVIDIA NCCL as the all-reduce implementation. You can choose from a few other options we provide, or write your own.
Here is the simplest way of creating MirroredStrategy:
mirrored_strategy = tf.distribute.MirroredStrategy()
This will create a MirroredStrategy instance which will use all the GPUs that are visible to TensorFlow, and use NCCL as the cross-device communication.
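Variables created inside the strategy's scope then become mirrored variables. A minimal sketch (the exact printed representation varies across TensorFlow versions):
import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # Variables created under the scope are replicated on each GPU and
    # kept in sync by applying identical updates on every replica.
    v = tf.Variable(1.0)
print(v)  # a MirroredVariable wrapping one component variable per replica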
If you wish to use only some of the GPUs on your machine, you can do so like this:
mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:1,/job:localhost/replica:0/task:0/device:GPU:0
If you wish to override the cross-device communication, you can do so using the cross_device_ops argument by supplying an instance of tf.distribute.CrossDeviceOps. Currently, tf.distribute.HierarchicalCopyAllReduce and tf.distribute.ReductionToOneDevice are two options other than tf.distribute.NcclAllReduce, which is the default.
mirrored_strategy = tf.distribute.MirroredStrategy(
cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
tf.distribute.experimental.CentralStorageStrategy does synchronous training as well. Variables are not mirrored; instead, they are placed on the CPU and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations will be placed on that GPU.
Create an instance of CentralStorageStrategy by:
central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()
INFO:tensorflow:ParameterServerStrategy with compute_devices = ('/device:GPU:0',), variable_device = '/device:GPU:0'
This will create a CentralStorageStrategy instance which will use all visible GPUs and the CPU. Updates to variables on replicas will be aggregated before being applied to the variables.
tf.distribute.experimental.MultiWorkerMirroredStrategy is very similar to MirroredStrategy. It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to MirroredStrategy, it creates copies of all variables in the model on each device across all workers.
It uses CollectiveOps as the multi-worker all-reduce communication method used to keep variables in sync. A collective op is a single op in the TensorFlow graph which can automatically choose an all-reduce algorithm in the TensorFlow runtime according to hardware, network topology and tensor sizes.
It also implements additional performance optimizations. For example, it includes a static optimization that converts multiple all-reductions on small tensors into fewer all-reductions on larger tensors. In addition, we are designing it to have a plugin architecture, so that in the future you will be able to plug in algorithms that are better tuned for your hardware. Note that collective ops also implement other collective operations such as broadcast and all-gather.
Here is the simplest way of creating MultiWorkerMirroredStrategy:
multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Single-worker CollectiveAllReduceStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.AUTO
MultiWorkerMirroredStrategy currently allows you to choose between two different implementations of collective ops. CollectiveCommunication.RING implements ring-based collectives using gRPC as the communication layer. CollectiveCommunication.NCCL uses NVIDIA's NCCL to implement collectives. CollectiveCommunication.AUTO defers the choice to the runtime. The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster. You can specify them in the following way:
multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
tf.distribute.experimental.CollectiveCommunication.NCCL)
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Single-worker CollectiveAllReduceStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.NCCL
One of the key differences to get multi-worker training going, as compared to multi-GPU training, is the multi-worker setup. The TF_CONFIG environment variable is the standard way in TensorFlow to specify the cluster configuration to each worker that is part of the cluster. Learn more about setting up TF_CONFIG.
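As an illustration, a minimal sketch of a TF_CONFIG for a two-worker cluster, with placeholder host addresses, might be set like this:
import json
import os

# Hypothetical two-worker cluster; replace the addresses with your own hosts.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:12345", "host2:23456"]
    },
    "task": {"type": "worker", "index": 0}  # this process is worker 0
})
Each worker in the cluster runs the same program with its own task index.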
tf.distribute.experimental.TPUStrategy lets you run your TensorFlow training on Tensor Processing Units (TPUs). TPUs are Google's specialized ASICs designed to dramatically accelerate machine learning workloads. They are available on Google Colab, the TensorFlow Research Cloud and Cloud TPU.
In terms of distributed training architecture, TPUStrategy is the same as MirroredStrategy: it implements synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used in TPUStrategy.
Here is how you would instantiate TPUStrategy:
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
tpu=tpu_address)
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
tpu_strategy = tf.distribute.experimental.TPUStrategy(cluster_resolver)
The TPUClusterResolver instance helps locate the TPUs. In Colab, you don't need to specify any arguments to it.
If you want to use this for Cloud TPUs:
- You must specify the name of your TPU resource in the tpu argument.
- You must initialize the TPU system explicitly at the start of the program. This is required before TPUs can be used for computation. Initializing the TPU system also wipes out the TPU memory, so it's important to complete this step first in order to avoid losing state.
tf.distribute.experimental.ParameterServerStrategy supports parameter server training on multiple machines. In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers.
In terms of code, it looks similar to other strategies:
ps_strategy = tf.distribute.experimental.ParameterServerStrategy()
For multi-worker training, TF_CONFIG needs to specify the configuration of parameter servers and workers in your cluster, which you can read more about in "TF_CONFIG" below.
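Compared with the multi-worker sketch above, the cluster spec additionally lists a ps job. A hypothetical example:
import json
import os

# Hypothetical cluster: two workers plus one parameter server.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:12345", "host2:23456"],
        "ps": ["host3:34567"]
    },
    "task": {"type": "worker", "index": 0}  # role of the current process
})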
tf.distribute.OneDeviceStrategy runs on a single device. This strategy will place any variables created in its scope on the specified device. Input distributed through this strategy will be prefetched to the specified device. Moreover, any functions called via strategy.experimental_run_v2 will also be placed on the specified device.
You can use this strategy to test your code before switching to other strategies which actually distribute work to multiple devices/machines.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
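Building on this, here is a minimal sketch of placing a variable in the strategy's scope and running a step function through it. The step function and its constant input are made up for illustration, and the sketch assumes a TensorFlow version where experimental_run_v2 is available and a GPU at /gpu:0:
import tensorflow as tf

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

with strategy.scope():
    # Variables created in the scope are placed on /gpu:0.
    w = tf.Variable(2.0)

@tf.function
def step(x):
    # The computation is also placed on /gpu:0.
    return w * x

result = strategy.experimental_run_v2(step, args=(tf.constant(3.0),))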
So far we've talked about the different strategies available and how you can instantiate them. In the next few sections, we will talk about the different ways in which you can use them to distribute your training. We will show short code snippets in this guide and link off to full tutorials which you can run end to end.
Using tf.distribute.Strategy with Keras
We've integrated tf.distribute.Strategy into tf.keras, which is TensorFlow's implementation of the Keras API specification. tf.keras is a high-level API to build and train models. By integrating into the tf.keras backend, we've made it seamless for you to distribute training written in the Keras training framework.
Here’s what you need to change in your code:
- Create an instance of the appropriate tf.distribute.Strategy.
- Move the creation and compiling of the Keras model inside strategy.scope.
We support all types of Keras models – sequential, functional and subclassed.
Here is a snippet of code to do this for a very simple Keras model with one dense layer:
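A minimal sketch along those lines, using MirroredStrategy and assuming a mean-squared-error regression setup:
import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # The model must be created and compiled inside the strategy's scope
    # so that its variables become mirrored variables.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss='mse', optimizer='sgd')
After this, model.fit and model.evaluate can be called as usual, and the work is distributed across the available GPUs.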