Training Larger Models on GPU with Gradient Checkpointing in PyTorch
Source: Deephub Imba This article is approximately 3200 words long and is recommended to be read in 7 minutes. This article will introduce gradient checkpointing, a technique that allows you to train larger models on the GPU at the cost of increased training time. We will implement it in PyTorch and train a classifier model. … Read more