Best Practices and Code Templates for PyTorch


Author: Igor Susmelj

Translation: ronghuaiyang

Introduction

Since PyTorch 1.0, more and more people have chosen PyTorch. Today we introduce a GitHub project in which the author summarizes a very useful set of best practices based on hands-on engineering experience with PyTorch. It covers many aspects of working with PyTorch, and you will gain a lot from it!

Best Practices and Code Templates for PyTorch

This is not the official style guide for PyTorch. This article summarizes the best practices from over a year of experience using the PyTorch framework for deep learning. Please note that most of the experiences we share come from the perspective of research and startups.

This is an open project, and other collaborators are welcome to edit and improve the documentation.

The document has three main parts. First, it briefly reviews the best practices in Python, then introduces some tips and advice for using PyTorch. Finally, we share insights and experiences from using other frameworks that often help us improve our workflow.

We Recommend Using Python 3.6+

Based on our experience, we recommend using Python 3.6+ because of the following features, which make for concise code (see the short sketch after the list):

  • Typing support from Python 3.6

  • f-string support from Python 3.6
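
For example, a minimal sketch using both features:

from typing import List


def normalize(values: List[float], factor: float = 255.0) -> List[float]:
    # type hints document the expected input and output types
    return [v / factor for v in values]


pixel = 127.0
print(f"normalized: {normalize([pixel])[0]:.3f}")  # f-strings for readable formatting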

Review of Python Style Guide

We try to follow Google's style guide for Python; please refer to Google's comprehensive Python style guide for the full set of rules.

We provide a summary of the most commonly used rules:

Naming Conventions

Type                  Convention           Example
Packages & Modules    lower_with_under     from prefetch_generator import BackgroundGenerator
Classes               CapWords             class DataLoader
Constants             CAPS_WITH_UNDER      BATCH_SIZE = 16
Instances             lower_with_under     dataset = Dataset
Methods & Functions   lower_with_under()   def visualize_tensor()
Variables             lower_with_under     background_color = 'Blue'

IDEs

Code Editors

Generally, we recommend using an IDE such as Visual Studio Code or PyCharm. VS Code provides syntax highlighting and autocompletion in a relatively lightweight editor, while PyCharm offers many advanced features for working with remote clusters.

Jupyter Notebook vs Python Scripts

In general, we recommend using Jupyter notebooks for initial exploration/trying new models and code.

If you want to train models on larger datasets, you should use Python scripts, as reproducibility is more important with larger datasets.

Our Recommended Workflow:

  1. Start with Jupyter Notebook

  2. Explore data and models

  3. Build classes/methods in notebook cells

  4. Move code to Python scripts

  5. Train/deploy on server

Jupyter Notebook                                     Python Scripts
+ Exploration                                        + Running longer jobs without interruption
+ Debugging                                          + Easy to track changes with git
- Can become a huge file                             - Debugging mostly means rerunning the whole script
- Can be interrupted (don't use for long training)
- Prone to errors and becoming a mess

Libraries

Commonly used libraries:

Name                 Description                                                Used for
torch                Base framework for working with neural networks           Creating tensors, networks and training them using backprop
torchvision          Datasets, models and image transformations for vision     Data preprocessing, augmentation, postprocessing
Pillow (PIL)         Python Imaging Library                                    Loading images and storing them
Numpy                Package for scientific computing with Python              Data preprocessing & postprocessing
prefetch_generator   Library for background processing                         Loading the next batch in the background during computation
tqdm                 Progress bar                                               Showing progress during training of each epoch
torchsummary         Keras-style summary for PyTorch                           Displaying the network, its parameters and sizes at each layer
tensorboardX         TensorBoard without TensorFlow                            Logging experiments and showing them in TensorBoard

File Structure

Do not put all layers and models in the same file. Best practice is to put the final network in a separate file (network.py) and keep layers, losses, and operations in their own files (layers.py, loss.py, ops.py). The completed model (composed of one or more networks) is then referenced in a file named after it (e.g., yolov3.py, DCGAN.py).

The main routine and the respective training and testing scripts should only import from the file with the model's name.
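
As a rough sketch, a project following these conventions might look like this (the script names train.py and test.py are illustrative):

layers.py    # custom layers
loss.py      # custom loss functions
ops.py       # custom operations
network.py   # the network built from layers and ops
yolov3.py    # the completed model, composed of one or more networks
train.py     # training script, imports only from yolov3.py
test.py      # testing script, imports only from yolov3.py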

Building Neural Networks with PyTorch

We recommend breaking the network down into smaller reusable pieces. A network is an nn.Module that consists of operations or other nn.Modules as building blocks. Loss functions are also nn.Modules, so they can be integrated directly into the network.

Classes that inherit from nn.Module must have a forward method to implement the forward pass of the respective layers or operations.

An nn.Module can be applied to input data with self.net(input), which uses the module's __call__() method to feed the input through it.

output = self.net(input)

A Simple Network in PyTorch

For a simple single-input single-output network, use the following pattern:

class ConvBlock(nn.Module):
    def __init__(self):
        super(ConvBlock, self).__init__()
        block = [nn.Conv2d(...)]
        block += [nn.ReLU()]
        block += [nn.BatchNorm2d(...)]
        self.block = nn.Sequential(*block)


    def forward(self, x):
        return self.block(x)


class SimpleNetwork(nn.Module):
    def __init__(self, num_resnet_blocks=6):
        super(SimpleNetwork, self).__init__()
        # here we add the individual layers
        layers = [ConvBlock(...)]
        for i in range(num_resnet_blocks):
            layers += [ResBlock(...)]
        self.net = nn.Sequential(*layers)


    def forward(self, x):
        return self.net(x)

Note the following points:

  • We reuse simple, recurring building blocks such as ConvBlock, which consists of the same recurring pattern (convolution, activation, normalization), and put them into a separate nn.Module

  • We build a list of the required layers and finally convert them into a model using nn.Sequential(). We use the * operator before the list object to unpack it.

  • In the forward pass, we simply run the input through the model

Using Networks with Skip Connections in PyTorch

class ResnetBlock(nn.Module):
    def __init__(self, dim, padding_type, norm_layer, use_dropout, use_bias):
        super(ResnetBlock, self).__init__()
        self.conv_block = self.build_conv_block(...)


    def build_conv_block(self, ...):
        conv_block = []


        conv_block += [nn.Conv2d(...),
                       norm_layer(...),
                       nn.ReLU()]
        if use_dropout:
            conv_block += [nn.Dropout(...)]


        conv_block += [nn.Conv2d(...),
                       norm_layer(...)]


        return nn.Sequential(*conv_block)


    def forward(self, x):
        out = x + self.conv_block(x)
        return out

Here, a ResNet block with a skip connection is implemented. PyTorch allows dynamic operations during the forward pass.

Using Networks with Multiple Outputs in PyTorch

For a network that requires multiple outputs, such as computing a perceptual loss with a pre-trained VGG network, we use the following pattern:

import torch
from torchvision import models


class Vgg19(torch.nn.Module):
  def __init__(self, requires_grad=False):
    super(Vgg19, self).__init__()
    vgg_pretrained_features = models.vgg19(pretrained=True).features
    self.slice1 = torch.nn.Sequential()
    self.slice2 = torch.nn.Sequential()
    self.slice3 = torch.nn.Sequential()


    for x in range(7):
        self.slice1.add_module(str(x), vgg_pretrained_features[x])
    for x in range(7, 21):
        self.slice2.add_module(str(x), vgg_pretrained_features[x])
    for x in range(21, 30):
        self.slice3.add_module(str(x), vgg_pretrained_features[x])
    if not requires_grad:
        for param in self.parameters():
            param.requires_grad = False


  def forward(self, x):
    h_relu1 = self.slice1(x)
    h_relu2 = self.slice2(h_relu1)        
    h_relu3 = self.slice3(h_relu2)        
    out = [h_relu1, h_relu2, h_relu3]
    return out

Note the following points:

  • We use the pre-trained model provided by torchvision.

  • We divide the network into three parts. Each slice consists of layers from the pre-trained model.

  • We freeze the network by setting requires_grad = False

  • The forward pass returns a list containing the outputs of the three slices

Custom Loss

Even though PyTorch has many standard loss functions, sometimes you may need to create your own. To do this, create a separate file losses.py and extend the nn.Module class to build your custom loss function:

class CustomLoss(torch.nn.Module):

    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, x, y):
        loss = torch.mean((x - y)**2)
        return loss
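
A custom loss defined this way is used like any built-in criterion; a minimal sketch (prediction and target are placeholder tensors):

criterion = CustomLoss()
prediction = torch.randn(4, 10, requires_grad=True)  # placeholder network output
target = torch.randn(4, 10)                          # placeholder labels
loss = criterion(prediction, target)  # invokes forward() via __call__
loss.backward()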

Recommended Code Structure for Training Models

Note that we use the following pattern:

  • We use BackgroundGenerator from prefetch_generator to load the next batch of data in the background during computation

  • We use tqdm to monitor training progress and display computational efficiency. This helps us identify bottlenecks in the data loading pipeline.

# import statements
import argparse
import os
import time

import numpy as np
import torch
import torch.nn as nn
from torch.utils import data
from torchvision import datasets, transforms
from tqdm import tqdm
from prefetch_generator import BackgroundGenerator
from tensorboardX import SummaryWriter
...


# set flags / seeds
torch.backends.cudnn.benchmark = True
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)
...


# Start with main code
if __name__ == '__main__':
    # argparse for additional flags for experiment
    parser = argparse.ArgumentParser(description="Train a network for ...")
    ...
    opt = parser.parse_args() 


    # add code for datasets (we always use train and validation/ test set)
    data_transforms = transforms.Compose([
        transforms.Resize((opt.img_size, opt.img_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])


    train_dataset = datasets.ImageFolder(
        root=os.path.join(opt.path_to_data, "train"),
        transform=data_transforms)
    train_data_loader = data.DataLoader(train_dataset, ...)


    test_dataset = datasets.ImageFolder(
        root=os.path.join(opt.path_to_data, "test"),
        transform=data_transforms)
    test_data_loader = data.DataLoader(test_dataset, ...)
    ...


    # instantiate network (which has been imported from *networks.py*)
    net = MyNetwork(...)
    ...


    # create losses (criterion in pytorch)
    criterion_L1 = torch.nn.L1Loss()
    ...


    # if running on GPU and we want to use cuda move model there
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        net = net.cuda()
        ...


    # create optimizers
    optim = torch.optim.Adam(net.parameters(), lr=opt.lr)
    ...


    # load checkpoint if needed/ wanted
    start_n_iter = 0
    start_epoch = 0
    if opt.resume:
        ckpt = load_checkpoint(opt.path_to_checkpoint) # custom method for loading last checkpoint
        net.load_state_dict(ckpt['net'])
        start_epoch = ckpt['epoch']
        start_n_iter = ckpt['n_iter']
        optim.load_state_dict(ckpt['optim'])
        print("last checkpoint restored")
        ...


    # if we want to run experiment on multiple GPUs we move the models there
    net = torch.nn.DataParallel(net)
    ...


    # typically we use tensorboardX to keep track of experiments
    writer = SummaryWriter(...)


    # now we start the main loop
    n_iter = start_n_iter
    for epoch in range(start_epoch, opt.epochs):
        # set models to train mode
        net.train()
        ...


        # use prefetch_generator and tqdm for iterating through data
        pbar = tqdm(enumerate(BackgroundGenerator(train_data_loader, ...)),
                    total=len(train_data_loader))
        start_time = time.time()


        # for loop going through dataset
        for i, data in pbar:
            # data preparation
            img, label = data
            if use_cuda:
                img = img.cuda()
                label = label.cuda()
            ...


            # it is good practice to track preparation time and computation
            # time per iteration via tqdm to spot issues in your dataloader
            prepare_time = time.time() - start_time


            # forward and backward pass
            optim.zero_grad()
            ...
            loss.backward()
            optim.step()
            ...


            # update tensorboardX
            writer.add_scalar(..., n_iter)
            ...


            # compute computation time and *compute_efficiency*
            process_time = time.time() - start_time - prepare_time
            compute_efficiency = process_time / (process_time + prepare_time)
            pbar.set_description(
                'Compute efficiency: {:.2f}, epoch: {}/{}'.format(
                    compute_efficiency, epoch, opt.epochs))
            start_time = time.time()


        # maybe do a test pass every x epochs
        if epoch % x == x-1:
            # bring models to evaluation mode
            net.eval()
            ...
            #do some tests
            pbar = tqdm(enumerate(BackgroundGenerator(test_data_loader, ...)),
                    total=len(test_data_loader)) 
            for i, data in pbar:
                ...


            # save checkpoint if needed
            ...

Training with Multiple GPUs in PyTorch

There are two modes for training with multiple GPUs in PyTorch.

From our experience, both approaches work. However, the first results in nicer and less code. The second seems to have a slight performance advantage due to less communication between the GPUs.

Splitting Each Network’s Batch

The most common approach is simply to split the batches of all networks across the individual GPUs.

Thus, a model that runs with a batch size of 64 on one GPU runs on two GPUs with a batch size of 32 on each. This can be done automatically by wrapping the model in nn.DataParallel(model).
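
A minimal sketch of this wrapping (the device IDs are an example for a two-GPU machine):

net = MyNetwork(...)
net = torch.nn.DataParallel(net, device_ids=[0, 1]).cuda()
# a batch of 64 is split into two batches of 32, one per GPU,
# and the outputs are gathered again on the first device
output = net(input_batch)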

Packaging All Networks into a Super Network and Splitting Input Batches

This pattern is less commonly used. Nvidia's pix2pixHD implementation is a repository that implements this approach.

Dos and Don’ts

Avoid Using Numpy Code in nn.Module’s Forward Method

Numpy runs on the CPU and is slower than torch code. Since torch was developed with a philosophy similar to numpy's, most numpy functions are already supported by PyTorch.
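
For example, clipping a tensor with numpy in the forward pass would force a GPU-to-CPU round trip, while the torch equivalent stays on the GPU (a hedged sketch):

def forward(self, x):
    # avoid: np.clip(x.cpu().numpy(), 0, 1) moves data to the CPU and breaks autograd
    # prefer: the equivalent torch op runs on the GPU and stays in the graph
    return torch.clamp(x, 0, 1)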

Separate DataLoader from Main Code

The data-loading pipeline should be independent of your main training code. PyTorch uses background workers to load data more efficiently without interfering with the main training process.
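
Concretely, the separation comes from giving the DataLoader its own worker processes; a minimal sketch (the argument values are examples):

train_data_loader = data.DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,    # load batches in background worker processes
    pin_memory=True)  # speeds up host-to-GPU transfers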

Do Not Log Results in Every Iteration

Usually we train our models for thousands of iterations, so logging losses and other results every n steps is enough to reduce the overhead. Saving intermediate results as images, in particular, can be very time-consuming during training.
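
A minimal sketch of throttled logging inside the training loop (log_every is a hypothetical interval):

log_every = 100  # hypothetical logging interval in iterations
if n_iter % log_every == 0:
    writer.add_scalar('train/loss', loss.item(), n_iter)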

Use Command Line Arguments

Using command line arguments to set parameters during code execution (e.g., batch size, learning rate, etc.) is very convenient. A simple way to track experimental parameters is to print the dictionary received from parse_args:

...
# saves arguments to config.txt file
opt = parser.parse_args()
with open("config.txt", "w") as f:
    f.write(opt.__str__())
...

Use .detach() to Release Tensors from the Graph if Possible

PyTorch tracks all operations involving tensors for automatic differentiation. Using .detach() prevents recording unnecessary operations.
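
For example, when collecting outputs for later visualization, detach them so their computation graphs are not kept alive (history and output are placeholders):

history = []  # hypothetical buffer of outputs for visualization
# history.append(output) would keep each output's graph alive;
# the detached copy shares the data but tracks no gradients
history.append(output.detach())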

Use .item() to Print Scalar Data

You can print tensors directly, but it is recommended to use tensor.detach() or tensor.item(). In earlier versions (PyTorch < 0.4) you had to use .data to access a Variable's tensor.
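
For example (loss is a placeholder scalar tensor):

# .item() converts a single-element tensor into a plain Python number
print(f"current loss: {loss.item():.4f}")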

Call the nn.Module Instead of Using forward Directly

The following two calls are not equivalent: self.net(input) goes through nn.Module's __call__ method, which runs registered hooks in addition to forward, so you should always call the module rather than invoking forward directly:

output = self.net.forward(input)
# they are not equal!
output = self.net(input)

FAQ

  1. How to make experiments reproducible?

We recommend setting the following seeds at the beginning of the code:

np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)
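
Note that cuDNN may still select non-deterministic algorithms; if exact reproducibility matters, you may additionally want to set:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False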

  2. How to further improve training and inference speed?

On Nvidia GPUs, you can add the following line at the beginning of your code. This will allow the cuda backend to optimize your graph on the first execution. However, be aware that if you change the size of the input/output tensors, the graph will be optimized each time a change occurs. This may lead to very slow execution and out-of-memory errors. Set this flag only when the input and output always have the same shape. Generally, this will lead to about a 20% improvement.

torch.backends.cudnn.benchmark = True

  3. What value should compute efficiency reach when using the tqdm + prefetch_generator pattern?

This depends on the machine, the preprocessing pipeline, and the network size. Using an SSD with a 1080Ti GPU, we see a compute efficiency close to 1.0, which is the ideal scenario. For shallow (small) networks or a slow hard drive, this number may drop to around 0.1-0.2, depending on your setup.

  4. How to have a batch size > 1 even if I don't have enough memory?

In PyTorch we can easily implement virtual batch sizes: we simply keep the optimizer from updating the parameters on every iteration and accumulate the gradients over batch_size iterations, as shown below.

...
# in the main loop
out = net(input)
loss = criterion(out, label)
# we just call backward to accumulate gradients but don't perform an optimizer step here
loss.backward()
total_loss += loss.item() / batch_size
if n_iter % batch_size == batch_size-1:
    # here we perform the optimization step using a virtual batch size
    optim.step()
    optim.zero_grad()
    print('Total loss: ', total_loss)
    total_loss = 0.0
...

  5. How to adjust the learning rate during training?

We can read and update the learning rate directly through the instantiated optimizer's param_groups, as shown below:

...
for param_group in optim.param_groups:
    old_lr = param_group['lr']
    new_lr = old_lr * 0.1
    param_group['lr'] = new_lr
    print('Updated lr from {} to {}'.format(old_lr, new_lr))
...
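
Alternatively, PyTorch provides ready-made schedulers in torch.optim.lr_scheduler; a minimal sketch (step size and decay factor are arbitrary examples):

scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=30, gamma=0.1)
for epoch in range(opt.epochs):
    ...  # train for one epoch
    scheduler.step()  # multiplies the lr by gamma every step_size epochs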

  6. How to use a pre-trained model as a loss (without backpropagation) during training?

If you want to use a pre-trained model, like VGG, to compute loss but not train it (for example, in style-transfer/GANs/Auto-encoders for perceptual loss), you can use the following pattern:

...
# instantiate the model
pretrained_VGG = VGG19(...)


# disable gradients (prevent training)
for p in pretrained_VGG.parameters():  # reset requires_grad
    p.requires_grad = False
...
# you don't have to use the no_grad() namespace but can just run the model
# no gradients will be computed for the VGG model
out_real = pretrained_VGG(input_a)
out_fake = pretrained_VGG(input_b)
loss = any_criterion(out_real, out_fake)
...

  7. Why use .train() and .eval() during training?

These methods set layers such as BatchNorm2d or Dropout2d from training to inference mode. Every module that inherits from nn.Module has an attribute called training; .eval() and .train() simply set this attribute to False/True. For details on how this is implemented, refer to the nn.Module code in PyTorch.

  8. My model uses a lot of memory during inference. How do I run inference correctly in PyTorch?

Make sure no gradients are computed and stored during execution. You can simply use the following pattern to ensure this:

net.eval()  # also switch layers such as BatchNorm or Dropout to inference mode
with torch.no_grad():
    # run model here; no computation graph is built, saving memory
    out_tensor = net(in_tensor)

  9. How to fine-tune a pre-trained model?

In PyTorch, you can freeze layers. This will prevent them from being updated during the optimization step.

# you can freeze whole modules by disabling gradients for their parameters
for p in pretrained_VGG.parameters():
    p.requires_grad = False
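
A common fine-tuning sketch is to freeze everything except a freshly replaced output layer and give only that layer to the optimizer (num_classes is a placeholder):

from torchvision import models

model = models.resnet18(pretrained=True)
for p in model.parameters():  # freeze the whole backbone
    p.requires_grad = False
# the replaced head has requires_grad=True by default
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
# optimize only the parameters of the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)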

  10. When to use Variable(...)?

Since PyTorch 0.4, Variable and Tensor have merged, and we no longer need to explicitly construct Variable objects.

  11. Is PyTorch faster in C++ than in Python?

The C++ version is about 10% faster.

  12. Can TorchScript / JIT speed up the code?

Todo…

  13. Does using cudnn.benchmark=True make PyTorch code faster?

Based on our experience, you can achieve about a 20% speedup. However, the first time you run the model, it takes a considerable amount of time to build the optimized graph. In some cases (loops in the forward pass, variable input shapes, if/else in the forward, etc.), this flag may lead to out of memory or other errors.

  14. How to train using multiple GPUs?

Todo…

  15. How does .detach() work in PyTorch?

It releases a tensor from the computation graph. A good illustration is given here: http://www.bnikolic.co.uk/blog/pytorch-detach.html
