Machine Heart Release
Author: zasdfgbnm
This article introduces a practical tool for PyTorch code called TorchSnooper. The author is the creator of TorchSnooper and also one of the PyTorch developers.
GitHub project address: https://github.com/zasdfgbnm/TorchSnooper
Many of you may have run into this kind of trouble: you run your own PyTorch code, and PyTorch complains that the data types do not match, say that it requires a double tensor but you provided a float, or that it requires a CUDA tensor but you provided a CPU tensor. For example:
RuntimeError: Expected object of scalar type Double but got scalar type Float
Debugging such problems can be very troublesome because you don’t know where the problem started. For example, you might have created a CPU tensor using torch.zeros on the third line of your code, and then this tensor underwent several operations, all performed on the CPU without any errors, until the tenth line when it needed to operate with a CUDA tensor passed in as input, at which point the error occurred. To debug such errors, sometimes you have to manually write print statements line by line, which is very cumbersome.
Or, you might expect a certain operation on a tensor to give a certain result, but PyTorch reports an error saying the tensor shapes do not match, or reports no error at all while the final output shape is not what we expected. In such cases we often don't know where things started to deviate from our expectations, and we again end up inserting a bunch of print statements to find the cause.
TorchSnooper is a tool designed to solve this problem. The installation of TorchSnooper is very simple; you just need to execute the standard Python package installation command:
pip install torchsnooper
After installation, you only need to decorate the function you want to debug with @torchsnooper.snoop(). When this function is executed, it will automatically print the shape, data type, device, and gradient requirement of every tensor on each executed line.
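In general, the usage pattern looks like this (the function name and body below are just placeholders for illustration):
import torch
import torchsnooper

@torchsnooper.snoop()
def my_function(x):
    # every line executed inside this function is traced, and the
    # shape, dtype, and device of each tensor are printed
    y = x * 2
    return y

my_function(torch.ones(3))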
Let’s illustrate how to use it with two examples.
Example 1
For example, we wrote a very simple function:
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y
This is how we use this function:
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)
The code above seems fine, but when we run it, it throws an error:
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #2 'mask'
Where is the problem? Let’s snoop! Decorate the myfunc function with @torchsnooper.snoop():
import torch
import torchsnooper
@torchsnooper.snoop()
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)
Then, running our script, we see the following output:
Starting var:.. mask = tensor<(6,), int64, cuda:0>
Starting var:.. x = tensor<(3,), float32, cuda:0>
21:41:42.941668 call 5 def myfunc(mask, x):
21:41:42.941834 line 6 y = torch.zeros(6)
New var:....... y = tensor<(6,), float32, cpu>
21:41:42.943443 line 7 y.masked_scatter_(mask, x)
21:41:42.944404 exception 7 y.masked_scatter_(mask, x)
Combining this with our error message, we mainly look at the device of each variable in the output to find out where things first ended up on the CPU. We notice this line:
New var:....... y = tensor<(6,), float32, cpu>
This line tells us directly that we created a new variable y and assigned a CPU tensor to it. It corresponds to the code y = torch.zeros(6). So we realize that when using torch.zeros, if we do not specify a device manually, the tensor is created on the CPU by default. We change this line to y = torch.zeros(6, device='cuda'), and this issue is fixed.
Although this issue is fixed, our problem is not completely resolved. Running the modified code still throws an error, but now the error has changed to:
RuntimeError: Expected object of scalar type Byte but got scalar type Long for argument #2 'mask'
Alright, this time the error is about the data type. This error message is quite informative; we can roughly tell that our mask’s data type is wrong. Looking again at the output from TorchSnooper, we notice:
Starting var:.. mask = tensor<(6,), int64, cuda:0>
Indeed, our mask’s type is int64, but it should be uint8. We modify the definition of mask:
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)
And now it runs.
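Putting both fixes together, the complete corrected example (a sketch combining the device and dtype changes described above) looks like this:
import torch

def myfunc(mask, x):
    # create y on the GPU so it matches the device of mask and x
    y = torch.zeros(6, device='cuda')
    y.masked_scatter_(mask, x)
    return y

# masked_scatter_ expects a uint8 mask here; positions where mask is 1
# are filled with values from source, in order
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)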
Example 2
This time we want to build a simple linear model:
model = torch.nn.Linear(2, 1)
We want to fit a plane y = x1 + 2 * x2 + 3, so we create such a dataset:
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])
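As a quick sanity check (an aside, not part of the original code), the targets we wrote down do lie exactly on that plane:
import torch

x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
# evaluate y = x1 + 2 * x2 + 3 for every row of x
expected = x[:, 0] + 2 * x[:, 1] + 3
print(expected)  # tensor([3., 5., 4., 6.]), matching y above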
We use the most common SGD optimizer for optimization, and the complete code is as follows:
import torch
model = torch.nn.Linear(2, 1)
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    optimizer.zero_grad()
    pred = model(x)
    squared_diff = (y - pred) ** 2
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()
However, as the code runs we find that the loss does not decrease below about 1.5. This is very abnormal, because the data we constructed all lie exactly on the plane we want to fit, so the loss should decrease all the way to 0.
At first glance, it is unclear where the problem lies. With a try-it-and-see attitude, we snoop a bit. In this example we did not define a custom function, but we can activate TorchSnooper with a with statement instead. We place the training loop inside the with statement, and the code becomes:
import torch
import torchsnooper
model = torch.nn.Linear(2, 1)
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
with torchsnooper.snoop():
    for _ in range(10):
        optimizer.zero_grad()
        pred = model(x)
        squared_diff = (y - pred) ** 2
        loss = squared_diff.mean()
        print(loss.item())
        loss.backward()
        optimizer.step()
Running the program, we see a long list of output. Browsing through it carefully, we notice:
New var:....... model = Linear(in_features=2, out_features=1, bias=True)
New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... optimizer = SGD (Parameter Group 0 dampening: 0 lr: 0....omentum: 0 nesterov: False weight_decay: 0)
02:38:02.016826 line 12 for _ in range(10):
New var:....... _ = 0
02:38:02.017025 line 13 optimizer.zero_grad()
02:38:02.017156 line 14 pred = model(x)
New var:....... pred = tensor<(4, 1), float32, cpu, grad>
02:38:02.018100 line 15 squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4, 4), float32, cpu, grad>
02:38:02.018397 line 16 loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
02:38:02.018674 line 17 print(loss.item())
02:38:02.018852 line 18 loss.backward()
26.979290008544922
02:38:02.057349 line 19 optimizer.step()
By carefully observing the shapes of the tensors here, we can easily find that y has the shape (4,), while pred has the shape (4, 1). When they are subtracted, due to broadcasting, the shape of squared_diff becomes (4, 4).
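To see this broadcasting behavior in isolation, here is a small sketch (standalone, not part of the training code):
import torch

y = torch.zeros(4)        # shape (4,)
pred = torch.zeros(4, 1)  # shape (4, 1)

# (4,) is broadcast against (4, 1), so the difference has shape (4, 4)
diff = y - pred
print(diff.shape)  # torch.Size([4, 4])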
This is certainly not the result we want. Fixing this problem is also simple; we just change the definition of pred to pred = model(x).squeeze(). Now, looking at the modified code’s output from TorchSnooper:
New var:....... model = Linear(in_features=2, out_features=1, bias=True)
New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... optimizer = SGD (Parameter Group 0 dampening: 0 lr: 0....omentum: 0 nesterov: False weight_decay: 0)
02:46:23.545042 line 12 for _ in range(10):
New var:....... _ = 0
02:46:23.545285 line 13 optimizer.zero_grad()
02:46:23.545421 line 14 pred = model(x).squeeze()
New var:....... pred = tensor<(4,), float32, cpu, grad>
02:46:23.546362 line 15 squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4,), float32, cpu, grad>
02:46:23.546645 line 16 loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
02:46:23.546939 line 17 print(loss.item())
02:46:23.547133 line 18 loss.backward()
02:46:23.591090 line 19 optimizer.step()
Now the results look normal. After testing, the loss can now decrease to very close to 0. Mission accomplished.
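For reference, the complete corrected script, with the .squeeze() fix applied and the snooping removed, looks like this:
import torch

model = torch.nn.Linear(2, 1)
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    optimizer.zero_grad()
    pred = model(x).squeeze()       # (4, 1) -> (4,), so it matches y
    squared_diff = (y - pred) ** 2  # elementwise, shape (4,)
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()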