There is no need to load the entire dataset in the constructor (__init__). If your dataset is large (thousands of image files, for instance), loading it all at once would not be memory-efficient; it is recommended to load the data on demand, that is, whenever __getitem__ is called (see the sketch after the list below).
• __getitem__(self, index): It allows the dataset to be indexed so that it can work like a list (dataset[i]); it must return the tuple (features, label) corresponding to the requested data point. It can either return the corresponding slice of a pre-loaded dataset or load the data on demand, as mentioned above.
• __len__(self): It should simply return the size of the entire dataset, so that whenever the dataset is sampled, the indices are limited to its actual size.
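As a minimal sketch of that on-demand approach, the hypothetical class below keeps only file paths in memory and reads a single image from disk each time __getitem__ is called. The names LazyImageDataset, image_dir, and labels, the .png extension, and the use of the PIL and NumPy libraries are assumptions for illustration only; they are not part of the book's code.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    def __init__(self, image_dir, labels):
        # Store lightweight references (file paths), not the images themselves
        self.image_paths = sorted(Path(image_dir).glob('*.png'))
        self.labels = labels  # assumed to be a tensor aligned with the paths

    def __getitem__(self, index):
        # Load a single image from disk only when it is requested
        image = Image.open(self.image_paths[index])
        x = torch.from_numpy(np.array(image, dtype=np.float32)) / 255.
        return (x, self.labels[index])

    def __len__(self):
        return len(self.image_paths)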
Build a simple custom dataset that requires two tensors as parameters: one for features and one for labels. For any given index, the dataset class will return the corresponding slice of each tensor. The code is as follows:
class CustomDataset(Dataset):
    def __init__(self, x_tensor, y_tensor):
        self.x = x_tensor
        self.y = y_tensor

    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.x)
# Wait, is this a CPU tensor? Why? Where is .to(device)?
x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.from_numpy(y_train).float()
train_data = CustomDataset(x_train_tensor, y_train_tensor)
print(train_data[0])
(tensor([0.7713]), tensor([2.4745]))

TensorDataset
Again, you might wonder, “Why go to the trouble of wrapping a couple of tensors in a class?” Once again, you have a point… If a dataset is nothing more than a couple of tensors, you can use PyTorch’s TensorDataset class, which works almost exactly like the custom dataset above.
Our custom dataset class may now seem a bit contrived, but we will reuse this structure in later chapters. For now, enjoy the simplicity of the TensorDataset class.
train_data = TensorDataset(x_train_tensor, y_train_tensor)
print(train_data[0])
(tensor([0.7713]), tensor([2.4745]))
Important Note: In most cases, you should set shuffle=True for your training set to improve the performance of gradient descent. However, there are some exceptions, such as time series problems, where shuffling can actually lead to data leakage.
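For instance, assuming train_data held a time series (it does not in this chapter), you would leave shuffle at its default of False so that the mini-batches preserve the temporal order; seq_loader is a hypothetical name used only for this sketch:
# Hypothetical time-series case: keep the temporal order of the data points
seq_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=False)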
The functionality of DataLoader goes far beyond what meets the eye… For instance, it can also be used with samplers to obtain mini-batches that compensate for imbalanced classes. There is too much to handle right now, but we will get to it eventually.
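Samplers are not covered here, but as a hedged preview, the sketch below uses PyTorch's WeightedRandomSampler to over-sample the rarer class. The made-up classification labels y_class and features x_fake are assumptions for illustration only (this chapter's data is a regression problem), and note that sampler is mutually exclusive with shuffle=True.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Made-up imbalanced classification data: 70 points of class 0, 10 of class 1
x_fake = torch.randn(80, 1)
y_class = torch.tensor([0] * 70 + [1] * 10)
clf_data = TensorDataset(x_fake, y_class)

# Give each data point a weight inversely proportional to its class frequency
_, counts = y_class.unique(return_counts=True)
sample_weights = (1.0 / counts.float())[y_class]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)

# Pass the sampler instead of shuffle=True
balanced_loader = DataLoader(dataset=clf_data, batch_size=16, sampler=sampler)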
Our loader will behave like an iterator, so we can loop through it and get different mini-batches each time.
“How do I choose the mini-batch size?”
For mini-batch sizes, powers of 2 are often used, such as 16, 32, 64, or 128, with 32 seeming to be the choice of most, including Yann LeCun.
Some more complex models might use larger sizes, although the size is usually limited by hardware (i.e., the actual number of data points that can be loaded into memory).
In our example, there are only 80 training points, so I chose a mini-batch size of 16 to conveniently split the training set into 5 mini-batches.
train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True)
next(iter(train_loader))
[tensor([[0.1196],[0.1395],...[0.8155],[0.5979]]), tensor([[1.3214],[1.3051],...[2.6606],[2.0407]])]
If you call list(train_loader), you will get a list of 5 elements, that is, all 5 mini-batches; you could then take the first element of that list to obtain a single mini-batch, as shown above. However, doing so defeats the purpose of DataLoader being an iterable, which is to iterate over its elements (mini-batches, in this case) one at a time.
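To make the contrast concrete, here is a small illustrative loop over the train_loader defined above; it consumes one mini-batch per step instead of materializing the whole list (the shapes in the comment assume the 16-point mini-batches used in this chapter):
# Each iteration yields a single mini-batch of features and labels
for i, (x_batch, y_batch) in enumerate(train_loader):
    print(i, x_batch.shape, y_batch.shape)  # e.g. 0 torch.Size([16, 1]) torch.Size([16, 1])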
To learn more, check out RealPython’s materials on iterables and iterators. How does this change the code so far? Let’s take a look!
First, we need to add the Dataset and DataLoader elements to the data preparation section of the code. Also, note that the tensors have not yet been sent to the device. The code is as follows:
%%writefile data_preparation/v1.py
# Data is in Numpy arrays
# But needs to be converted to PyTorch tensors
x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.from_numpy(y_train).float()
# Build Dataset
train_data = TensorDataset(x_train_tensor, y_train_tensor) ①
# Build DataLoader
train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True) ②
%run -i data_preparation/v1.py
%run -i model_configuration/v1.py
%%writefile model_training/v2.py
# Define the number of epochs
n_epochs = 1000
losses = []
# For each epoch…
for epoch in range(n_epochs):
    # Inner loop
    mini_batch_losses = []  ④
    for x_batch, y_batch in train_loader:  ①
        # The dataset "exists" in the CPU, so do the mini-batches
        # Therefore, these mini-batches need to be sent to the device
        x_batch = x_batch.to(device)  ②
        y_batch = y_batch.to(device)  ②

        # Perform a training step
        # and return the corresponding loss for this mini-batch
        mini_batch_loss = train_step(x_batch, y_batch)  ③
        mini_batch_losses.append(mini_batch_loss)  ④

    # Calculate the average loss of all mini-batches - this is the loss for the epoch
    loss = np.mean(mini_batch_losses)  ⑤

    losses.append(loss)
① Inner loop for mini-batch.
② Send a mini-batch to the device.
③ Perform a training step.
④ Track the loss within each mini-batch.
⑤ Average the mini-batch losses to get the epoch loss.
Run—Model Training V2
%run -i model_training/v2.py
For larger datasets, loading the data on demand in the Dataset's __getitem__ method (as CPU tensors) and then sending all the data points that belong to the same mini-batch to your GPU (device) at once, as in the training loop above, is a good way to make the best use of your GPU's memory.
Additionally, if you have multiple GPUs to train your model, it is better to keep your dataset “device-independent” and assign batches to different GPUs during training.
# Check the model parameters
print(model.state_dict())
Output
OrderedDict([('0.weight', tensor([[1.9684]], device='cuda:0')), ('0.bias', tensor([1.0235], device='cuda:0'))])
Did you get slightly different values? Try running the entire pipeline again:
Complete Pipeline
%run -i data_preparation/v1.py
%run -i model_configuration/v1.py
%run -i model_training/v2.py
Mini-batch Inner Loop
The inner loop depends on the following 3 elements:
• The device to which the data is sent.
• The data loader from which mini-batches are extracted.
• A step function that returns the corresponding loss.
Taking these elements as inputs and using them to execute the inner loop will yield the following function:
def mini_batch(device, data_loader, step):
    mini_batch_losses = []
    for x_batch, y_batch in data_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        mini_batch_loss = step(x_batch, y_batch)
        mini_batch_losses.append(mini_batch_loss)

    loss = np.mean(mini_batch_losses)
    return loss
In the previous section, we realized that, because of the mini-batch inner loop, we now perform five times as many updates (calls to the train_step function) per epoch: 80 training points divided into mini-batches of 16 gives 5 updates per epoch instead of 1. Previously, 1,000 epochs meant 1,000 updates; now, only 200 epochs are needed to perform the same 1,000 updates.
Run—Data Preparation V1, Model Configuration V1
%run -i data_preparation/v1.py
%run -i model_configuration/v1.py
%%writefile model_training/v3.py
# Define the number of epochs
n_epochs = 200
losses = []
for epoch in range(n_epochs):
    # Inner loop
    loss = mini_batch(device, train_loader, train_step)  ①
    losses.append(loss)
%run -i model_training/v3.py
Check the model state:
# Check the model parameters
print(model.state_dict())
Output
OrderedDict([('0.weight', tensor([[1.9687]], device='cuda:0')), ('0.bias', tensor([1.0236], device='cuda:0'))])
%%writefile data_preparation/v2.py
torch.manual_seed(13)
# Build tensors from numpy arrays before splitting
x_tensor = torch.from_numpy(x).float() ①
y_tensor = torch.from_numpy(y).float() ①
# Build a dataset containing all data points
dataset = TensorDataset(x_tensor, y_tensor)
# Perform the split
ratio = .8
n_total = len(dataset)
n_train = int(n_total * ratio)
n_val = n_total - n_train
train_data, val_data = random_split(dataset, [n_train, n_val]) ②
# Build loaders for each set
train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True)
val_loader = DataLoader(dataset=val_data, batch_size=16) ③
① Generate tensors from the complete dataset (before splitting).
② Perform train-validation split in PyTorch.
③ Create a data loader for the validation set.
Run—Data Preparation V2
%run -i data_preparation/v2.py
The above content is excerpted from PyTorch Deep Learning Guide: Programming Basics Volume I
Author: [Brazil] Daniel Voigt Godoy
PyTorch Deep Learning Guide
[Brazil] Daniel Voigt Godoy, translated by Zhao Chunjiang
The PyTorch Deep Learning Guide series systematically explains the important concepts, algorithms, and models of deep learning, focusing on how PyTorch implements them. The work is divided into three volumes: Volume I, Programming Basics; Volume II, Computer Vision; and Volume III, Sequences and Natural Language Processing.