Comprehensive Insights into PyTorch Internals from a Core Developer

Source: Edward Z. Yang's blog (ezyang.com)


This article is approximately 9,000 words; the recommended reading time is 15+ minutes. It is a long-form written version of a talk about the internal mechanisms of PyTorch.

Edward Z. Yang, a PhD student at Stanford University and a research engineer at Facebook AI Research, is one of the core developers of the PyTorch open-source project. He gave a talk on the internal mechanisms of PyTorch at the PyTorch NYC meetup on May 14, and this article is a long written version of that talk.


Hello everyone! Today I want to talk about the internal mechanisms of PyTorch.

This talk is intended for those who have used PyTorch and are eager to contribute to it but have been deterred by its vast C++ codebase. It’s no lie: the PyTorch codebase can indeed be daunting at times.

The purpose of this talk is to provide you with a roadmap: to explain the basic conceptual structure of a “tensor library that supports automatic differentiation” and to provide you with some tools and techniques to help you navigate the codebase. I assume you have written some PyTorch before but may not have delved deeply into how machine learning software libraries are built.


This talk is divided into two parts: in the first part, I will provide a comprehensive introduction to the various concepts of tensor libraries. I will first discuss the tensor data types that you know and love and detail what this data type can provide, which will help us better understand its true implementation.

If you are an advanced PyTorch user, you may already be familiar with most of this material. We will also discuss the three "extension point" concepts (layout, device, and dtype), which will guide how we think about extending the tensor class. During the live presentation at the PyTorch NYC meetup, I skipped the slides about autograd, but I will explain them a bit here.

The second part covers the nitty-gritty details involved in actually writing PyTorch code. I will tell you how to navigate the autograd code, which code truly matters and which is legacy, and introduce all the cool tools PyTorch provides for writing kernels.

Concepts

Tensor

A tensor is the core data structure in PyTorch. You may already have a good understanding of what a tensor intuitively represents: a tensor is an n-dimensional data structure that contains some scalar type (like floating-point numbers and integers). We can think of a tensor as being made up of some data, along with some metadata that describes the size, the type of elements it contains (dtype), and the device where the tensor is located (CPU memory? CUDA memory?).


There is also a piece of metadata you may not be as familiar with: stride. Stride is actually one of the most elegant features of PyTorch, so it’s worth discussing it a bit more.


A tensor is a mathematical concept. But to represent it in our computers, we need to define some physical representation for it. The most common representation is to store each element of the tensor contiguously in memory (which is also the source of the term "contiguous"), writing each row out to memory one after the other. In this example, I specified that the tensor contains 32-bit integers, so each integer occupies a physical address, with each address 4 bytes apart from the next. To remember the actual dimensions of the tensor, we also need to record the sizes as additional metadata.

So how does this image relate to stride?


Suppose I want to read the element at the logical position tensor[1, 0]. How do I translate this logical position into a position in physical memory? Strides let us do this: to find the physical position of any element, I multiply each index by the stride of its dimension and sum them all up. On the slide, I drew the first dimension in blue and the second in red so that you can follow the indices and strides in this calculation. Performing this sum, I get 2 (zero-indexed); and indeed, the number 3 lives two slots past the start of the contiguous array.
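To make this arithmetic concrete, here is a small C++ sketch using the libtorch API (the 2×2 tensor of 32-bit integers mirrors the example above; this is illustrative, not PyTorch-internal code):

```cpp
#include <torch/torch.h>
#include <cassert>

int main() {
  // A 2x2 tensor of 32-bit integers, [[1, 2], [3, 4]], stored contiguously.
  auto t = torch::arange(1, 5, torch::kInt).reshape({2, 2});

  // For this contiguous layout: sizes == [2, 2], strides == [2, 1].
  assert(t.stride(0) == 2 && t.stride(1) == 1);

  // Physical offset of logical element [1, 0]: 1 * stride(0) + 0 * stride(1) = 2.
  int64_t offset = 1 * t.stride(0) + 0 * t.stride(1);
  const auto* data = t.data_ptr<int32_t>();
  assert(data[offset] == 3);  // the number 3 lives two slots past the start
  return 0;
}
```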

(Later I will also discuss TensorAccessor, which is a convenience class for handling index calculations. When you use TensorAccessor, you no longer operate on raw pointers; these calculations have been hidden from you.)

Strides are the fundamental basis of how PyTorch provides views to users. For example, suppose I want to extract a tensor representing the second row of the tensor above:


With advanced indexing support, I can simply write tensor[1, :] to obtain this row. Importantly, when I do this no new tensor data is created; instead, a tensor that is a different view onto the same underlying data is returned. This means that if I edit the data through this view, the change is reflected in the original tensor.

In this case, understanding how to do this isn't too difficult: 3 and 4 live in contiguous memory, and we just need to record an offset saying that the data of this (logical) tensor starts 2 positions past the top of the storage.

(Every tensor records an offset, but most of the time it’s zero, and in those cases, I omit it in my diagrams.)
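Here is a small C++ sketch of the same idea (using the same assumed 2×2 int tensor as before): the row view shares the base tensor's memory and simply records an offset of 2.

```cpp
#include <torch/torch.h>
#include <cassert>

int main() {
  auto t = torch::arange(1, 5, torch::kInt).reshape({2, 2});  // [[1, 2], [3, 4]]

  // tensor[1, :] in Python corresponds to select(/*dim=*/0, /*index=*/1) here.
  auto row = t.select(0, 1);          // a view of [3, 4]; no data is copied
  assert(row.storage_offset() == 2);  // the view starts 2 elements into the storage
  assert(row.data_ptr<int32_t>() == t.data_ptr<int32_t>() + 2);

  row[0].fill_(30);                       // editing through the view...
  assert(t[1][0].item<int32_t>() == 30);  // ...is visible in the original tensor
  return 0;
}
```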

Question during the talk: if I take a view of part of a tensor, how do I free the memory of the underlying tensor?

Answer: You have to make a copy of that view, thus breaking its connection to the original physical memory; there isn't much else you can do. By the way, if you wrote Java a long time ago, taking a substring of a string had a similar issue, because by default no copy was made, so the substring kept the (possibly very large) original string alive. This was fixed in Java 7u6.

If I want to take the first column, it gets even more interesting:


When we look at physical memory, we can see that the elements of this column are not contiguous: there is a gap of one element between each pair. This is where strides shine: instead of specifying a stride of 1 between one element and the next, we set it to 2, skipping over one element each time. (By the way, this is why it's called "stride": if we think of the index as walking along the layout, the stride specifies how many positions forward we step each time.)
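The column view can be expressed the same way; a short C++ sketch (again illustrative, with the same assumed 2×2 int tensor):

```cpp
#include <torch/torch.h>
#include <cassert>

int main() {
  auto t = torch::arange(1, 5, torch::kInt).reshape({2, 2});  // [[1, 2], [3, 4]]

  // tensor[:, 0] in Python corresponds to select(/*dim=*/1, /*index=*/0) here.
  auto col = t.select(1, 0);      // a view of [1, 3]
  assert(col.size(0) == 2);
  assert(col.stride(0) == 2);     // step over one element to reach the next row
  assert(!col.is_contiguous());   // the elements are no longer adjacent in memory
  return 0;
}
```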

The stride representation can actually express all kinds of interesting views of tensors; if you want to explore the possibilities, see https://ezyang.github.io/stride-visualizer/index.html

Now let's take a step back and think about how we would implement this functionality (after all, this is a talk about internals). If we can have views into a tensor, it means we must decouple the notion of a tensor (the user-facing concept you know and love) from the actual physical data that holds the tensor's contents (called "storage"):


There may be multiple tensors sharing the same storage. The storage defines the dtype and physical size of the tensor, while each tensor also records the size, stride, and offset, which defines the logical interpretation of the physical memory.

One thing to note: there will always be a tensor-storage pair, even in the “simple” case where storage is not actually needed (for example, when just allocating a contiguous tensor with torch.zeros(2, 2)).

By the way, we are interested in making this picture no longer the case: instead of having a separate storage concept, we would simply define a view to be a tensor backed by a base tensor. This is a bit more complex, but it has the advantage that contiguous tensors get a much more direct representation, without the indirection introduced by storage. Such a change would make PyTorch's internal representation a bit more like Numpy's.

We have introduced some data layouts of tensors (some may say that if you correctly understand data representation, everything else will fall into place naturally). But it is still necessary to briefly discuss how operations on tensors are implemented. At the most abstract level, when you call torch.mm, two dispatches occur:


The first dispatch is based on the device type and tensor layout: for example, whether it is a CPU tensor or a CUDA tensor, whether it is a tensor with strides or a sparse tensor. This dispatch is dynamic: it is a virtual function call (the subject of the second half of this talk is where this virtual function call occurs).

It should be reasonable to have a dispatch here: the implementation of CPU matrix multiplication is very different from that of CUDA. The reason for this dynamic dispatch is that these kernels may reside in different libraries (like libcaffe2.so or libcaffe2_gpu.so), and thus you have no choice: if you want to get into a library you don’t directly depend on, you must reach there through dynamic dispatch.

The second dispatch is a dispatch on the dtype involved. This dispatch is just a simple switch statement over whatever dtypes the kernel chooses to support. The reason for dispatching here is also sensible: CPU code (or CUDA code) that implements multiplication on float is different from the code used for int, which means you need a separate kernel for each dtype.

If you want to understand how operators are called in PyTorch, this may be the most important knowledge you should have in mind. We will return to this when we delve deeper into the code later.


Since we have talked about tensors, I would also like to take some time to discuss tensor extensions. After all, besides dense CPU float tensors, there are many other types of tensors, such as XLA tensors, quantized tensors, and MKL-DNN tensors; and for a tensor library, there is one more thing to consider: how to accommodate these extensions?


Our current extension model provides four extension points on tensors. First, there are three parameters that together determine what kind of tensor it is:

  • device: describes the physical memory where the tensor's data actually lives, such as on a CPU, an NVIDIA GPU (cuda), an AMD GPU (hip), or a TPU (xla). The distinguishing characteristic of a device is that it has its own allocator, which cannot be used for any other device.

  • layout: describes how we logically interpret the physical memory. The most common layout is the strided tensor, but sparse tensors have a different layout involving a pair of tensors, one for indices and one for data; MKL-DNN tensors can have an even more exotic layout, such as a blocked layout, which cannot be represented using strides alone.

  • dtype: describes the type of data actually stored in each element of the tensor, which can be a floating-point number, an integer, or a quantized integer.

If you want to add an extension to PyTorch tensors, you should think about which of these parameters you want to extend. The Cartesian product of these parameters defines all the possible tensors you can get. Now, not all of these combinations have kernels (who has kernels for sparse quantized tensors on FPGA?), but in principle, such combinations can make sense, so we should at least support expressing them.
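To get a feel for how the three parameters are chosen independently, here is a hedged sketch of constructing a few combinations through the C++ frontend (the CUDA line is commented out because it assumes a CUDA build and a visible GPU):

```cpp
#include <torch/torch.h>

int main() {
  // Defaults: strided layout, CPU device, 32-bit float dtype.
  auto a = torch::zeros({2, 2});

  // Same device and layout, different dtype.
  auto b = torch::zeros({2, 2}, torch::dtype(torch::kDouble));

  // Different layout: a sparse (COO) copy of the same logical data.
  auto c = a.to_sparse();

  // Different device (requires a CUDA build and a GPU):
  // auto d = torch::zeros({2, 2}, torch::dtype(torch::kHalf).device(torch::kCUDA));
  return 0;
}
```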

There is one last way to add "extensions" to tensor functionality: write a wrapper class around PyTorch tensors that implements your object type. This may sound obvious, but sometimes people reach for extending those three parameters when all they need is a wrapper class. A prominent advantage of wrapper classes is that they can be developed entirely out of tree, without touching the original types at all.

When should you write a tensor wrapper instead of extending PyTorch itself? The key indicator is whether you need to pass this tensor through the autograd backward pass process. For example, this indicator tells us that sparse tensors should be a true tensor extension, not just a Python object containing an index and value tensor: when optimizing on networks involving embeddings, we want to generate sparse gradients for embeddings.


Our concept of extensions also affects the data layout of the tensor itself. One thing we really want from our tensor structure is a fixed layout: we don't want fundamental operations, which are extremely common (like "What is the size of this tensor?"), to require virtual dispatch.

So when you look at the actual layout of a tensor (defined by the TensorImpl structure), you will see a common prefix of fields that we consider anything "tensor-like" to have; then some fields that only really apply to strided tensors but are important enough that we keep them in the main structure; and then, on a per-tensor basis, custom fields can live in a suffix. Sparse tensors, for example, store their indices and values in this suffix.
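As a purely illustrative sketch (these structs are hypothetical; the real definitions live in c10/core/TensorImpl.h and the sparse implementation under ATen), the idea looks roughly like this:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified sketch of the layout idea described above.
struct MyTensorImpl {
  // Common prefix: fields that anything "tensor-like" has.
  int16_t device_type_;
  int16_t dtype_;
  void*   storage_;  // stands in for the refcounted storage pointer

  // Fields that only really make sense for strided tensors, but are common
  // enough that they stay in the main structure anyway.
  std::vector<int64_t> sizes_;
  std::vector<int64_t> strides_;
  int64_t storage_offset_ = 0;

  virtual ~MyTensorImpl() = default;
};

// Per-tensor-type custom fields live in the "suffix", i.e. in a subclass.
struct MySparseTensorImpl : MyTensorImpl {
  MyTensorImpl* indices_ = nullptr;  // sparse tensors keep an indices tensor...
  MyTensorImpl* values_  = nullptr;  // ...and a values tensor here
};

int main() { MySparseTensorImpl s; (void)s; return 0; }
```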

Automatic Differentiation (Autograd)

I have explained tensors, but if PyTorch only had this trick, it would just be a clone of Numpy. The distinguishing feature of PyTorch is its provision of automatic differentiation for tensors from the very beginning (now we also have cool features like TorchScript, but back then it was just this!).

What does automatic differentiation do? It is the machinery that takes the forward code of your neural network and fills in the missing code that actually computes your network's gradients.


There is a lot going on in the forward and backward code shown on the slide, so let's break it down:

  • First, direct your attention to the red and blue variables on the slide. PyTorch implements reverse-mode automatic differentiation, which means we effectively "walk backward" through the forward computation to compute gradients efficiently. You can see this in the variable names: at the bottom of the red section we compute the loss; then the first thing the blue part of the program does is compute grad_loss. The loss was computed from next_h2, which lets us compute grad_next_h2. Technically, the variables we prefix with grad_ are not actually gradients; they are really Jacobians left-multiplied by a vector, but in PyTorch we just call them grad, and basically everyone knows what that means.

  • Second, the structure of the code stays the same, but the behavior does not: every line of the forward is replaced with a different computation representing the derivative of the forward operation. For example, the tanh operation is translated into a tanh_backward operation (these two lines are connected by a gray line on the left of the slide). The inputs and outputs of the forward and backward operations are swapped: if the forward operation produced next_h2, the backward operation takes grad_next_h2 as an input. (A small sketch follows this list.)
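To make the forward/backward pairing concrete, here is a small C++ sketch (z and grad_next_h2 follow the naming on the slide, but their values here are made up): the backward line just applies the derivative of tanh to the incoming grad.

```cpp
#include <torch/torch.h>
#include <cassert>

int main() {
  auto z = torch::randn({4});
  auto grad_next_h2 = torch::ones({4});  // pretend this flowed in from later operations

  // Forward line:   next_h2 = tanh(z)
  auto next_h2 = torch::tanh(z);

  // Backward line (what tanh_backward computes):  grad_z = grad_next_h2 * (1 - tanh(z)^2)
  auto grad_z = grad_next_h2 * (1 - next_h2 * next_h2);

  // Cross-check the hand-written derivative against autograd itself.
  auto z2 = z.clone().requires_grad_(true);
  torch::tanh(z2).sum().backward();
  assert(torch::allclose(z2.grad(), grad_z));
  return 0;
}
```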

The whole point of autograd is to execute the computation described by this diagram without ever actually generating this source code. PyTorch's autograd does not perform source-to-source transformations (although the PyTorch JIT does know how to do symbolic differentiation).


To accomplish this, we need to store more metadata when performing operations on tensors. Let's adjust our picture of the tensor data structure: now there is not just a tensor pointing to a storage, but also a variable wrapping the tensor, which stores additional information (AutogradMeta) needed when the user calls loss.backward() in their PyTorch script.
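A tiny C++ sketch of the user-visible effect of that extra metadata: when a tensor carries autograd information (requires_grad), operations on it record what they did, and backward() can later fill in gradients.

```cpp
#include <torch/torch.h>
#include <cassert>

int main() {
  auto x = torch::ones({2, 2}, torch::requires_grad());
  auto loss = (x * x).sum();

  assert(loss.requires_grad());  // the result remembers how it was computed
  loss.backward();               // walk backward through the recorded graph
  assert(torch::allclose(x.grad(), 2 * x));  // d/dx of sum(x*x) is 2x
  return 0;
}
```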

This part of the talk will soon be outdated: Will Feng is working on merging Variable and Tensor in C++, following the simpler merge that has already happened in PyTorch's frontend: https://github.com/pytorch/pytorch/issues/13638

We also need to update the diagram above regarding dispatch:


Before we can dispatch to the CPU or CUDA implementation, there is another dispatch on variables, which is responsible for unwrapping the variable, calling the underlying implementation (green), and then rewrapping the result into a variable and recording the necessary autograd metadata for the backward pass.

Some implementations do not unwrap; they simply call other variable implementations, so you may spend a while wandering around the variable universe. However, once you unwrap and enter the non-variable tensor universe, that's the end of the line: you never go back to variables (except by returning from your function).

In my NYC meetup talk, I skipped the next seven slides. The textual introduction to them will take some time.


Engineering Development

Having discussed the concepts, let’s take a look at the code.

Finding Your Path

PyTorch has a large number of folders, with a very detailed description of them in the CONTRIBUTING.md document, but in reality, you only need to know 4 directories:


  • First, torch/ contains the actual Python modules you are familiar with: the ones you import and use. This is Python code and it is easy to hack on (just make a change and see what happens). However, lurking not far below the surface is...

  • torch/csrc/: implements the C++ code that you might refer to as the PyTorch frontend. To put it more descriptively, it implements the binding code that converts between Python and C++; it also contains some quite important parts of PyTorch, such as the autograd engine and the JIT compiler. It also contains C++ frontend code.

  • aten/: short for “A Tensor Library” (named by Zachary DeVito), is a C++ library that implements tensor operations. If you check where some kernel code is located, it’s likely in ATen. ATen is divided into two operator areas: “native” operators (the modern C++ implementation of operators) and “legacy” operators (TH, THC, THNN, THCUNN), which are legacy C implementations. The legacy operators are the bad part; if possible, please don’t spend too much time on them.

  • c10/: a pun on "Caffe2" and "ATen", contains the core abstractions of PyTorch, including the actual implementation of the tensor and storage data structures.

That is a lot of places to look for code; we should probably simplify the directory structure, but for now that's how it is. If you want to work on operators, you will spend most of your time in aten.

Let’s see how this code is separated in practice:


What happens when you call a function like torch.add? If you remember our discussion about dispatch, you should have these basics in mind:

  • We have to convert from the Python realm to the C++ realm (Python argument parsing).

  • We handle variable dispatch (VariableType; by the way, "Type" here has no special relation to programming-language types; it's just a gadget for performing dispatch).

  • We handle device type/layout dispatch (Type).

  • We have the actual kernel, which is either a modern native function or a traditional TH function.

Each step corresponds to some code. Let’s cut through this jungle.


Our initial landing point in the C++ code is the C implementation of a Python function, which is exposed on the Python side as something like torch._C.VariableFunctions.add. THPVariable_add is the implementation of one such function.

One important thing about this code: it is automatically generated. If you search the GitHub repository, you won't find it, because you must actually build PyTorch to see it. Another important point: you don't need to understand this code in depth; just skim it to get a general idea of what it does.

I have highlighted the most important parts in blue: you can see that a PythonArgParser class is used to extract C++ objects out of the Python args and kwargs; we then call a dispatch_add function (inlined in red); this releases the global interpreter lock and then calls a plain old method on the C++ Tensor itself. On the way back, we rewrap the returned Tensor into a PyObject.

(There is an error in this slide: I should have explained the variable dispatch code. I haven't fixed it here yet. Some magic happens, and then...)


When we call the add method on the Tensor class, no virtual dispatch occurs yet. Instead, I have an inline method that calls an inline method that calls a virtual method on the “Type” object. This method is a real virtual method (this is why I said Type is just a “small tool” that allows you to implement dynamic dispatch).

In this specific case, this virtual call dispatches to the add implementation on a class named TypeDefault. This just happens because we have the same add implementation for all device types (CPU and CUDA); if we happen to have different implementations, we might end up with something like CPUFloatType::add. It is this implementation of the virtual method that allows us to finally get to the actual kernel code.

Also, I hope this slide will be outdated soon: Roy Li is working on replacing Type dispatch with another mechanism that will better support PyTorch on mobile.

It is worth emphasizing again that until we reach the kernel, all this code is automatically generated.


It is a long and winding road, but once you can find your way around it, I recommend jumping straight to the kernels.

Writing Kernels

PyTorch provides a wealth of useful tools for those looking to write kernels. In this section, we will explore some of them. But first, what does it take to write a kernel?


Generally, we consider a kernel in PyTorch to consist of the following parts:

  • First, some metadata about the kernel we need to write, which can assist code generation and allow you to get all the bindings with Python without writing a single line of code.

  • Once you reach the kernel, you have gone through device type/layout dispatch. The first thing you need to write is error checking to ensure the input tensors have the correct dimensions. (Error checking is really important! Don’t skimp on it!)

  • Next, we generally have to allocate the result tensor that we will write to.

  • It's time to write the kernel proper. At this point you should perform the second, dtype dispatch, jumping to a kernel specialized for each dtype it operates on. (You don't want to do this too early, or you will uselessly duplicate code that looks identical for every dtype.)

  • Most high-performance kernels require some form of parallelization to utilize multi-CPU systems. (CUDA kernels are “implicitly” parallelized because their programming model is based on large-scale parallelization.)

  • Finally, you need to read the data and perform the computations you want!

In the following slides, I will introduce tools in PyTorch that can help you accomplish these steps.


To fully leverage PyTorch’s code generation capabilities, you need to write a schema for your operator. This schema can provide mypy-style types for your function and control whether to generate bindings for methods or functions on tensors. You can also tell the schema which implementation of your operator should be called for a given device-layout combination.

For more information on this format, see:

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/README.md


You might also need to define a derivative for your operation in derivatives.yaml.


Error checking can be done through a low-level API or a high-level API. The low-level API is just a macro, TORCH_CHECK, which takes a boolean and then an arbitrary number of arguments that make up the error message, rendered if the boolean is not true.

This macro has a great feature: you can mix strings with non-string data; each item is formatted using their operator<< implementation, and most important data types in PyTorch have operator<< implementations.
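A hedged sketch of the low-level API in use (my_op and the message are made up; TORCH_CHECK itself is real):

```cpp
#include <ATen/ATen.h>

void my_op_check(const at::Tensor& self, const at::Tensor& other) {
  // If the condition is false, this throws an error whose message is built by
  // streaming every remaining argument through operator<<.
  TORCH_CHECK(self.sizes() == other.sizes(),
              "my_op: expected tensors of the same size, but got ",
              self.sizes(), " and ", other.sizes());
}
```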

The high-level API allows you to avoid repeatedly writing duplicate error messages. It works by first wrapping each tensor as TensorArg, which contains information about where the tensor came from (like its parameter name). It then provides some pre-packaged functions to check various properties; for example, checkDim() tests whether the tensor’s dimension is a fixed value. If not, the function provides a user-friendly error message based on the TensorArg metadata.
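And a sketch of the high-level API (my_matrix_op is a made-up name; TensorArg and the check* helpers live in ATen/TensorUtils.h):

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorUtils.h>

void my_matrix_op_check(const at::Tensor& self, const at::Tensor& weight) {
  // Wrap each tensor together with its argument name and position, so error
  // messages can point at the offending argument.
  at::TensorArg self_arg{self, "self", 1};
  at::TensorArg weight_arg{weight, "weight", 2};

  // Pre-packaged property checks; on failure they build a friendly message
  // from the TensorArg metadata.
  at::checkDim("my_matrix_op", self_arg, 2);
  at::checkDim("my_matrix_op", weight_arg, 2);
  at::checkSameType("my_matrix_op", self_arg, weight_arg);
}
```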


When writing operators in PyTorch, one important thing is that you often need to register three operators: abs_out (which operates on a pre-allocated output, implementing the out= keyword parameter), abs_ (which operates in-place), and abs (which is just an ordinary old function version of the operator).

Most of the time, abs_out is the real workhorse, while abs and abs_ are just weak wrappers around abs_out; but sometimes it is also possible to write dedicated implementations for each case.
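Here is a hypothetical sketch of how the trio usually fits together (the my_abs* names are made up and this is not PyTorch's real abs; the sketch only handles contiguous float CPU tensors):

```cpp
#include <ATen/ATen.h>
#include <cmath>

// The out= variant does the real work, writing into a caller-provided tensor.
at::Tensor& my_abs_out(at::Tensor& result, const at::Tensor& self) {
  TORCH_CHECK(self.is_contiguous() && self.scalar_type() == at::kFloat,
              "my_abs: this sketch only handles contiguous float tensors");
  result.resize_(self.sizes());
  const float* in = self.data_ptr<float>();
  float* out = result.data_ptr<float>();
  for (int64_t i = 0; i < self.numel(); ++i) {
    out[i] = std::fabs(in[i]);
  }
  return result;
}

// The functional variant allocates a fresh result and delegates to the out= variant.
at::Tensor my_abs(const at::Tensor& self) {
  at::Tensor result = at::empty({0}, self.options());
  my_abs_out(result, self);
  return result;
}

// The in-place variant reuses the input as the output buffer.
at::Tensor& my_abs_(at::Tensor& self) {
  return my_abs_out(self, self);
}
```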


To perform the dtype dispatch, use the AT_DISPATCH_ALL_TYPES macro. It takes the dtype of the tensor you want to dispatch on, plus a lambda that will be specialized for each dtype the macro can dispatch to. Usually, this lambda just calls a templated helper function.

This macro does not just "perform the dispatch"; it also decides which dtypes your kernel will support. As such, the macro comes in quite a few variants that let you pick a different subset of dtypes to generate specializations for. Most of the time you just need AT_DISPATCH_ALL_TYPES, but keep an eye out for situations where you need to dispatch over other types.
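A minimal sketch of the macro in use (fill_one is a made-up kernel; it assumes a contiguous CPU tensor):

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>

void fill_one(at::Tensor& self) {
  TORCH_CHECK(self.is_contiguous(), "fill_one: expected a contiguous tensor");
  AT_DISPATCH_ALL_TYPES(self.scalar_type(), "fill_one", [&] {
    // Inside the lambda, scalar_t is the concrete C++ type of the dispatched dtype.
    scalar_t* data = self.data_ptr<scalar_t>();
    for (int64_t i = 0; i < self.numel(); ++i) {
      data[i] = static_cast<scalar_t>(1);
    }
  });
}
```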


On the CPU, you usually need to parallelize your code. In the past, this was often done by directly adding OpenMP pragmas in your code.
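These days ATen also offers at::parallel_for as a wrapper over the underlying threading backend; a hedged sketch of both styles (add_one_ is a made-up kernel that assumes a contiguous float CPU tensor):

```cpp
#include <ATen/ATen.h>
#include <ATen/Parallel.h>

void add_one_(at::Tensor& self) {
  TORCH_CHECK(self.is_contiguous() && self.scalar_type() == at::kFloat,
              "add_one_: expected a contiguous float tensor");
  float* data = self.data_ptr<float>();
  const int64_t n = self.numel();

  // Old style: a raw OpenMP pragma written directly into the kernel.
  // #pragma omp parallel for
  // for (int64_t i = 0; i < n; ++i) data[i] += 1.0f;

  // Wrapper style: chunks of [0, n) are handed to the lambda on worker threads.
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] += 1.0f;
    }
  });
}
```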


Sometimes, you really need to access the data. PyTorch provides quite a few options for this.

  • If you just want to read a value at a specific position, use TensorAccessor. A tensor accessor is like a tensor, but it hard-codes the tensor's dimensionality and dtype as template parameters. When you retrieve an accessor, such as x.accessor<float, 3>(), we do a runtime test to ensure the tensor really has that format; after that, every access is unchecked. Tensor accessors handle strides correctly, so you should prefer them over raw pointer access (which, unfortunately, many legacy kernels use). There is also PackedTensorAccessor, which is designed for passing an accessor through a CUDA launch so that you can use it from inside your CUDA kernel. (One thing to watch out for: TensorAccessor defaults to 64-bit indexing, which is much slower than 32-bit indexing in CUDA!) A short sketch of accessor usage follows this list.

  • If you are writing some operator with very conventional element access like point-wise operations, then using much higher-level abstractions is better, like TensorIterator. This helper class can automatically handle broadcasting and type promotion, making it quite useful.

  • To get real speed on the CPU, you may want to write your kernel with vectorized CPU instructions. We have helper functions for that too! The Vec256 class represents a vector of scalars and provides methods that perform vectorized operations on all of them at once. Helpers like binary_kernel_vec then let you easily run vectorized operations and finish off anything that doesn't divide nicely into vector instructions with plain old instructions. This infrastructure also compiles your kernel multiple times under different instruction sets, tests at runtime which instructions your CPU supports, and uses the best kernel for that situation.
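As promised in the TensorAccessor bullet above, here is a short sketch (sum2d is a made-up helper) of the pattern: one runtime check up front, then stride-aware element access.

```cpp
#include <ATen/ATen.h>

// Made-up helper: sum a 2-D float CPU tensor, handling strides correctly.
float sum2d(const at::Tensor& t) {
  TORCH_CHECK(t.dim() == 2 && t.scalar_type() == at::kFloat,
              "sum2d: expected a 2-D float tensor");
  // Checks dtype and dimensionality once; the element accesses below are unchecked.
  auto a = t.accessor<float, 2>();
  float total = 0;
  for (int64_t i = 0; i < a.size(0); ++i) {
    for (int64_t j = 0; j < a.size(1); ++j) {
      total += a[i][j];  // works even if t is a non-contiguous view
    }
  }
  return total;
}
```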


Many kernels in PyTorch are still written in the legacy TH style. (By the way, TH stands for TorcH. It's a nice acronym, but unfortunately a bit tainted; if you see TH in a name, assume it is legacy.) What does legacy TH style mean?

  • It is written in C style, with little or no use of C++.

  • It is manually refcounted (you must manually call THTensor_free to decrease the refcount when you are done with a tensor).

  • It lives in the generic/ directory, which means we actually compile the file many times, each time with a different #define scalar_t.

This kind of code is quite crazy, and we hate reviewing it, so please don’t add it. If you want to write code but know little about kernel writing, one useful thing you can do is to port some TH functions to ATen.

Workflow Efficiency


Finally, I want to talk about workflow efficiency on PyTorch. If the vast C++ codebase of PyTorch is the first roadblock to people contributing to PyTorch, then your workflow efficiency is the second. If you want to develop C++ with Python habits, it can be quite challenging: recompiling PyTorch takes a lot of time, and it also takes a long time to know whether your changes are effective.

How to work efficiently could be a talk of its own, but this slide summarizes some of the most common anti-patterns I have seen from people who complain that "developing PyTorch is hard":

  • If you edit a header, especially one included by many source files (especially when included by CUDA files), you can expect long rebuild times. Try to edit cpp files only, and be cautious when editing headers!

  • Our CI is a very good zero-setup way to test whether modifications are effective. But you may need to wait an hour or two for feedback. If you are making a change that will require a lot of experimentation, take some time to set up a local development environment. Similarly, if you encounter difficult debug issues on a specific CI configuration, set it up locally. You can download and run the Docker image locally: https://github.com/pytorch/ossci-job-dsl

  • The contribution guide explains how to set up ccache: https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#use-ccache; this is highly recommended because it can save you a lot of recompilation when you edit headers. It also helps cover for bugs in our build system that cause files to be recompiled when they shouldn't be.

  • Finally, we have a lot of C++ code. Building on a beefy server with plenty of CPU and RAM makes for a much more pleasant experience. I particularly advise against doing CUDA builds on a laptop: building CUDA is very, very slow, and laptops usually lack the horsepower to finish it quickly.

Get Involved!


And that concludes our whirlwind tour of the PyTorch kernel! A lot has been omitted; but I hope the descriptions and explanations here can at least help you digest a significant portion of its codebase.

What's next? What contributions can you make? Our issue tracker is a good place to start: https://github.com/pytorch/pytorch/issues

Since this year, we have been categorizing and triaging issues; issues marked “triaged” indicate that at least one PyTorch developer has looked into it and performed an initial assessment. You can use these labels to find what we think are high-priority issues or look at issues for specific modules (like autograd), and find what we consider to be minor issues. (Warning: We are sometimes wrong!)

Even if you don’t want to start writing code right away, there are still many other useful tasks worth doing, such as improving documentation (I love merging documentation PRs; they are great), helping us reproduce bug reports from other users, and helping us discuss RFCs on the issue tracker. Without our open-source contributors, PyTorch would not be where it is today; we hope you can join us!

Original article address:

http://blog.ezyang.com/2019/05/pytorch-internals/

Editor: Yu Tengkai

Proofreader: Lin Yilin
