Common Pitfalls in PyTorch

Author: Yu Zhenbo

https://zhuanlan.zhihu.com/p/77952356

This article is reposted with the author's permission and may not be reproduced without authorization.

I recently started using PyTorch and have already run into quite a few pitfalls. I am recording them here because I suspect they are problems many people will hit. Thanks to the friends who helped me troubleshoot; I will keep updating this list and hope to run into fewer issues going forward.

As a veteran TensorFlow user with a slight case of obsessive-compulsive disorder, I have leveled up from version 0.6 to 1.2 and then to 1.10, living through several iterations of TensorFlow. I have to say that tf.data.dataset + tfrecord is far more efficient than PyTorch's DataLoader.

TensorFlow has a very useful queue mechanism, tf.input_producer + tfrecord. However, input_producer has a bug: it cannot shuffle each epoch separately, only the data as a whole, which means we cannot follow the normal training process (train for several epochs, validate for one epoch, and finally pick the best validation result for testing). I reported this issue to the official team, and their response was that the bug could not be fixed at the moment, but that the upcoming tf1.2 release would introduce a new data-processing API, tf.contrib.data.dataset (merged into tf.data.dataset in 1.3), which would solve it cleanly, and that tf.input_producer would be deprecated in tf2.0. As soon as tf1.2 was released I upgraded immediately, and then spent roughly two weeks working through the quirks of tf.data.dataset. (The new API is actually not very powerful and has several limitations, which I will not go into here.)

——————————————————————————

It seems I have digressed; back to PyTorch. I find it a little awkward that PyTorch has no data storage format or data-reading pipeline of its own. The DataLoader feels much like TensorFlow's feed mechanism and brings no particular speed or performance advantage.

Let me summarize the pitfalls I encountered:

1. No efficient data storage. TensorFlow has tfrecord and Caffe has lmdb; reading images with cv2.imread during training is a waste of time. Special thanks to Xiaozhi. @Zhi Tiancheng

Solution:

I found a pretty good GitHub link:

https://github.com/Lyken17/Efficient-PyTorch

It mainly covers data storage options such as lmdb, h5py, pth, and n5.

Personally, I find h5 fairly fast for data retrieval, but if you need concurrent (multi-threaded or multi-process) reads and writes you should avoid h5, because getting that to work is quite troublesome (see the h5py MPI documentation below).

http://docs.h5py.org/en/stable/mpi.html

Here is code for writing and reading h5 data. The main things to note are that strings must be encoded on write and decoded on read, and that it is safest to use create_dataset; assigning directly may cause errors when reading back:

# Write:
import os
import h5py
import numpy as np

# Strings must be encoded to bytes before being stored in HDF5.
imagenametotal_.append(os.path.join('images', imagenametotal).encode())
with h5py.File(outfile, 'w') as f:
    f.create_dataset('imagename', data=imagenametotal_)
    f['part'] = parts_
    f['S'] = Ss_
    f['image'] = cvimgs

# Read:
with h5py.File(outfile, 'r') as f:
    # Decode bytes back into Python strings.
    imagename = [x.decode() for x in f['imagename']]
    kp2ds = np.array(f['part'])
    kp3ds = np.array(f['S'])
    cvimgs = np.array(f['image'])
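Building on the snippet above, here is a minimal sketch (the dataset keys follow the example above; everything else is an assumption) of how such an h5 file could be wrapped in a Dataset for the DataLoader. Opening the file lazily inside __getitem__ is one common way to sidestep the multi-worker trouble mentioned earlier, because each worker process then gets its own file handle.

import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Hypothetical wrapper around the h5 file written above."""
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.h5 = None
        # Read only the length eagerly; keep the file closed until needed.
        with h5py.File(h5_path, 'r') as f:
            self.length = len(f['image'])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open lazily so each DataLoader worker holds its own handle.
        if self.h5 is None:
            self.h5 = h5py.File(self.h5_path, 'r')
        img = np.array(self.h5['image'][idx])
        kp2d = np.array(self.h5['part'][idx])
        return torch.from_numpy(img), torch.from_numpy(kp2d)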

2. GPU imbalance. Special thanks to Senior Zhang Hang. @Zhang Hang

This is a common problem: with multi-GPU training, the first GPU ends up occupying noticeably more memory than the others.

Senior Zhang Hang (张航) open-sourced a tool that balances the GPUs: PyTorch-Encoding.

https://github.com/zhanghang1989/PyTorch-Encoding

The usage is quite convenient, as shown below:

# balanced_parallel is presumably a local copy of PyTorch-Encoding's parallel module.
from balanced_parallel import DataParallelModel, DataParallelCriterion
model = DataParallelModel(model, device_ids=gpus).cuda()
# Wrap the loss as well, so it is evaluated on each GPU instead of only on GPU 0.
criterion = DataParallelCriterion(loss_fn(), device_ids=gpus).cuda()

There are actually two points to note. First, at test time you need to manually gather the per-GPU outputs, as shown below:

from torch.nn.parallel.scatter_gather import gather
# Collect the list of per-GPU outputs onto device 0.
preds = gather(preds, 0)

Second, when the loss function has multiple components, for example loss = loss1 + loss2 + loss3, you need to wrap the components in a single class and sum them inside its forward method, for example:
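A minimal sketch of such a wrapper (the component losses here are placeholders; substitute your own):

import torch.nn as nn

class TotalLoss(nn.Module):
    def __init__(self):
        super(TotalLoss, self).__init__()
        # Hypothetical components standing in for loss1/loss2/loss3.
        self.loss1 = nn.MSELoss()
        self.loss2 = nn.L1Loss()
        self.loss3 = nn.SmoothL1Loss()

    def forward(self, pred, target):
        # Sum inside forward so the wrapped criterion returns a single scalar.
        return (self.loss1(pred, target)
                + self.loss2(pred, target)
                + self.loss3(pred, target))

The instance can then be wrapped just like the criterion above, e.g. criterion = DataParallelCriterion(TotalLoss(), device_ids=gpus).cuda().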

Additionally, we can use DistributedDataParallel to avoid the GPU imbalance problem altogether.

The usage is as follows (note: this approach does not seem to work well together with h5 data):

import torch
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

# One process per GPU; NCCL is the recommended backend for GPU training.
torch.distributed.init_process_group(backend="nccl")
# Bind each process to its own GPU (on a single machine the global rank
# can serve as the local GPU index).
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
# Move the model to its GPU before wrapping it.
model.to(device)
model = DistributedDataParallel(model, device_ids=[local_rank],
                                output_device=local_rank)
# Give the original DataLoader a DistributedSampler so each process
# sees a different shard of the data.
train_loader = torch.utils.data.DataLoader(
        train_dataset,
        sampler=DistributedSampler(train_dataset)
    )
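A usage note (based on how distributed training was launched at the time of writing): a script written this way is normally started with PyTorch's launcher, python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py, which spawns one process per GPU and sets the environment variables that init_process_group with the NCCL backend relies on.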

3. Low GPU utilization + wasted GPU memory.

Common configurations:

1. Add this at the beginning of the main function (this trades a little extra memory, and reproducibility, for speed: cuDNN benchmarks the available convolution algorithms and picks the fastest one for your input sizes):

import torch
from torch.backends import cudnn

cudnn.benchmark = True  # autotune convolution algorithms (best with fixed input sizes)
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.enabled = True

2. Call this before each epoch during training (it periodically releases cached GPU memory, not the model itself; the effect does not feel very noticeable):

torch.cuda.empty_cache()

3. Explicitly delete variables you no longer need (in the same spirit; for some operations the effect is quite noticeable):

del xxx  # xxx stands for the tensor/variable that is no longer needed

4. Implement __len__ in your Dataset so the DataLoader knows its length (this can prevent intermittent stalls in the DataLoader):

def __len__(self):
    return self.images.shape[0]

5. Enable pinned memory (and worker processes) in the DataLoader so that data loading overlaps with model training, slightly improving GPU utilization:

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    pin_memory=True,   # page-locked host memory speeds up host-to-GPU copies
    num_workers=4,     # background workers prepare batches while the model trains
)

6. Network design is crucial. Avoid defining modules or variables you never use: in PyTorch, __init__ and forward are separate, so a layer created in __init__ still allocates its parameters even if forward never touches it, as sketched below.
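A minimal sketch of this point (the module names are illustrative):

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.backbone = nn.Linear(1024, 1024)
        # Defined but never called in forward: its weights are still created,
        # registered, and moved to the GPU along with the rest of the model.
        self.unused = nn.Linear(4096, 4096)

    def forward(self, x):
        return self.backbone(x)

net = Net()
print(sum(p.numel() for p in net.parameters()))  # count includes the unused layer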

7. Lastly, here is a picture that still troubles me:

[Figure: training log showing that the first iteration of every epoch takes far longer than the rest]

As you can see, the first iteration at the start of every epoch takes a long time. PyTorch handles this poorly; data feeding is not a continuous, dynamically allocated process. I have also seen what looks like a reliable solution:

Feeding the GPU in deep learning

However, after looking at the code, it might require refactoring the DataLoader. I noticed in the comments that there may still be issues, and I am a bit lazy, so I haven’t tackled it yet. I plan to address it later when I have time.
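For reference, one common pattern in this spirit (a sketch only, not the code from the linked article) is a small prefetcher that copies the next batch to the GPU on a side CUDA stream while the current batch is being processed:

import torch

class DataPrefetcher(object):
    """Copy the next batch to the GPU on a side CUDA stream while the
    current batch is still being processed."""
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input, self.next_target = None, None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking copies work best with pin_memory=True in the DataLoader
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make sure the copy issued on the side stream has finished.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch_input, batch_target = self.next_input, self.next_target
        self.preload()
        return batch_input, batch_target

# Usage inside the training loop:
# prefetcher = DataPrefetcher(train_loader)
# inputs, targets = prefetcher.next()   # loop until inputs is None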
