I recently started using PyTorch and have run into quite a few pitfalls. I'm recording them here because I suspect they are common issues many people face, and I'd also like to thank the friends who helped me troubleshoot them. I hope to hit fewer of these in the future.

As a seasoned TensorFlow user with mild obsessive-compulsive tendencies, I upgraded from version 0.6 to 1.2 and then to 1.10, living through several TensorFlow version changes. I have to say that tf.data.dataset + tfrecord is much more efficient than DataLoader.

TensorFlow has a very handy queue mechanism, tf.input_producer + tfrecord. However, input_producer has a bug: it cannot shuffle each epoch separately, only the dataset as a whole, which means you cannot follow a normal training procedure (train for several epochs, validate for one epoch, then pick the best validation result for testing). I raised an issue with the TensorFlow team; their reply was that the bug could not be fixed at the time, but that the upcoming tf1.2 would ship a new data-processing API, tf.contrib.data.dataset (merged into tf.data.dataset in tf1.3), which would solve it, and that tf.input_producer would be removed in tf2.0. As soon as tf1.2 was released I upgraded, and then spent about two weeks stepping on the pitfalls of tf.data.dataset. (The new API is not very powerful and has many limitations, which I won't go into here.)

I seem to have digressed. Back to PyTorch: I find it somewhat embarrassing that PyTorch has no data-storage format or data-reading pipeline of its own. DataLoader feels much like TensorFlow's feed mechanism and brings no speed or performance improvement.

Here is a summary of the pitfalls I encountered:

1. No efficient data storage; cv.imread is inefficient during network training

Solution: I found a decent GitHub repo, https://github.com/Lyken17/Efficient-PyTorch, which mainly discusses data-storage options such as lmdb, h5py, pth, n5, and others. My own experience is that h5 is relatively fast for data access, but if you need multi-threaded reads/writes it is best to avoid h5, since multi-threaded h5 access seems quite troublesome: http://docs.h5py.org/en/stable/mpi.html

Here is some code for reading and writing h5 data (note that strings must be encoded on write and decoded on read; it is best to use create_dataset, as assigning directly may cause errors when reading):
Write:

# strings must be encoded before being written to h5
imagenametotal_.append(os.path.join('images', imagenametotal).encode())
with h5py.File(outfile, 'w') as f:
    f.create_dataset('imagename', data=imagenametotal_)
    f['part'] = parts_
    f['S'] = Ss_
    f['image'] = cvimgs

Read:

with h5py.File(outfile, 'r') as f:
    imagename = [x.decode() for x in f['imagename']]
    kp2ds = np.array(f['part'])
    kp3ds = np.array(f['S'])
    cvimgs = np.array(f['image'])
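As a companion to the h5 snippets, here is a minimal sketch of reading images from an LMDB inside a Dataset, in the spirit of the Efficient-PyTorch repo linked above. The class name, key scheme, and file layout are my own assumptions, not the author's code:

import cv2
import lmdb
import numpy as np
from torch.utils.data import Dataset

class LMDBImageDataset(Dataset):
    # Minimal sketch: images stored as encoded JPEG/PNG bytes under keys b'0', b'1', ...
    def __init__(self, lmdb_path):
        self.lmdb_path = lmdb_path
        self.env = None
        # Open once just to count the entries; the environment is reopened
        # lazily in each DataLoader worker process inside __getitem__.
        env = lmdb.open(lmdb_path, readonly=True, lock=False)
        with env.begin(write=False) as txn:
            self.length = txn.stat()['entries']
        env.close()

    def _init_env(self):
        # readonly + lock=False + readahead=False is the usual setup for
        # reading with multiple DataLoader workers
        self.env = lmdb.open(self.lmdb_path, readonly=True, lock=False, readahead=False)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.env is None:
            self._init_env()
        with self.env.begin(write=False) as txn:
            buf = txn.get(str(idx).encode())
        # decode the stored bytes back into an HxWx3 BGR image
        return cv2.imdecode(np.frombuffer(buf, np.uint8), cv2.IMREAD_COLOR)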
2. GPU imbalance

Senior classmate Hang Zhang proposed an open-source GPU load-balancing tool, PyTorch-Encoding. It is quite convenient to use, as shown below:
from balanced_parallel import DataParallelModel, DataParallelCriterion

model = DataParallelModel(model, device_ids=gpus).cuda()
# wrap the loss as well so it is computed on each GPU instead of being gathered to GPU 0
criterion = DataParallelCriterion(loss_fn(), device_ids=gpus).cuda()
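As a usage sketch (not the author's code; it assumes train_loader and optimizer already exist, along with the wrapped model and criterion from the snippet above), a training step then looks roughly like this:

# Hypothetical training step with the wrappers above already applied
for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    outputs = model(inputs)             # DataParallelModel returns one output chunk per GPU
    loss = criterion(outputs, targets)  # DataParallelCriterion computes the loss on each GPU
    loss.backward()
    optimizer.step()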
There are actually two things to note here. First, at test time you need to manually gather the predictions from the GPUs, as shown in the code below:
from torch.nn.parallel.scatter_gather import gather

preds = gather(preds, 0)
Second, when the loss function consists of multiple components, for example loss = loss1 + loss2 + loss3, you need to write these three losses into a single class and sum them in its forward method (a minimal sketch follows the DistributedDataParallel snippet below).

Alternatively, we can use DistributedDataParallel to solve the GPU imbalance problem. Usage is as follows (note: this approach does not seem to work together with h5 data):
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

torch.distributed.init_process_group(backend="nccl")

# Configure the GPU for each process
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

# Move the model to its GPU before wrapping
model.to(device)
model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

# Add a distributed sampler to the original DataLoader
train_loader = torch.utils.data.DataLoader(
    train_dataset, sampler=DistributedSampler(train_dataset)
)
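For the multi-part loss from the second point above, here is a minimal sketch of wrapping the components in one module; the individual loss terms and argument names are placeholders, not the author's code:

import torch.nn as nn

class TotalLoss(nn.Module):
    # Combine several loss terms inside forward so the parallel criterion
    # wrapper only ever sees a single loss module.
    def __init__(self):
        super().__init__()
        self.loss1 = nn.CrossEntropyLoss()  # placeholder component
        self.loss2 = nn.MSELoss()           # placeholder component

    def forward(self, pred_cls, target_cls, pred_reg, target_reg):
        return self.loss1(pred_cls, target_cls) + self.loss2(pred_reg, target_reg)

When using DistributedDataParallel as above, the script is typically launched with something like python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py, so that each process gets its own rank.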
3. Low GPU utilization and wasted GPU memory

Common configurations:

1. Add the following at the beginning of the main function (this sacrifices a bit of memory in exchange for improved model accuracy):
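The exact snippet is not shown here; one commonly used "top of main" setting of this kind (my assumption, not necessarily what the author used) is to enable cuDNN autotuning, which uses a little extra memory to pick faster convolution algorithms:

import torch

# Let cuDNN benchmark candidate convolution algorithms and cache the fastest
# one per input shape; costs some memory, helps most when input sizes are fixed.
torch.backends.cudnn.benchmark = True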
6. The design of the network matters a great deal, and do not initialize variables you will not use: PyTorch separates initialization from the forward pass, so anything you define gets initialized even if it is never used.

7. Finally, here is a plot (not reproduced here) that still troubles me: at the beginning of every epoch, the first iteration takes a very long time. PyTorch handles this poorly; it is not a dynamic allocation process. I also saw a seemingly reliable solution proposed by @风车车:
Feeding GPUs in Deep Learning
https://zhuanlan.zhihu.com/p/77633542
However, looking at the code, it may require restructuring the DataLoader, and judging from the comments there seem to be some issues. I have been a bit lazy and have not tackled this yet, but I plan to when I have time.

That is all for now; I welcome everyone to add pitfalls they have run into. I am still a PyTorch beginner.

Update: by the way, I want to complain about the DALI mentioned above; it has many limitations, and handling some of the trickier data preprocessing with it is quite difficult.

8. Apex mixed-precision training

It turns out Apex is not as magical as the official site claims: it only reduces memory usage, not training time (12 GB of memory can drop to about 8 GB, which is quite significant, but speed drops by about one third, which seems counterproductive). Even after compiling the extensions, the speedup is limited. I will leave this one as an open pitfall; if anyone can solve it, please feel free to message me, and if it can be resolved I will write it up carefully.
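For reference, standard Apex amp usage looks roughly like the sketch below (based on the documented apex API, not on the author's training code; it assumes model, optimizer, criterion, and train_loader already exist, and opt_level "O1" is just an example):

from apex import amp

# Wrap an existing model/optimizer pair; "O1" means mixed precision with
# automatic casting of whitelisted ops to fp16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    # Scale the loss so small fp16 gradients do not underflow before backward()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()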