
One line of code, twice the training speed! PyTorch 2.0 makes a surprise debut, enthusiastically shared by LeCun.


Now, with just one line of code, PyTorch 2.0 can speed up training of Transformer models by 1.5x to 2x!

On December 2, PyTorch 2.0 was officially announced!

This update not only pushes PyTorch's performance to new heights, but also adds support for dynamic shapes and distributed training.

In addition, the 2.0 series will move parts of the PyTorch code from C++ back to Python.

PyTorch 2.0 is currently in the beta phase, and the first stable version is expected to be available in early March 2023.

PyTorch 2.x: faster, more Pythonic!

Over the past few years, PyTorch has iterated from version 1.0 to the most recent 1.13, and has moved to the newly established PyTorch Foundation, part of the Linux Foundation.

The challenge with the current version of PyTorch is that eager mode struggles to keep up with ever-growing GPU bandwidth and increasingly extreme model architectures.

The birth of PyTorch 2.0 will fundamentally change and improve the way PyTorch runs at the compiler level.

It is well known that the "Py" in PyTorch comes from Python, the open-source programming language widely used in data science.

However, PyTorch's code is not written entirely in Python; a substantial part of it is implemented in C++.

In the upcoming 2.x series, the PyTorch project team plans to move code related to torch.nn back into Python.

In addition, because the new features of PyTorch 2.0 are fully additive (and optional), 2.0 is 100% backward compatible.

That is, the code base is the same, the API is the same, and the model is written in the same way.

More technical support

Behind PyTorch 2.0 are four new technologies: TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor.

TorchDynamo

It safely captures PyTorch programs using Python frame evaluation hooks, a major innovation resulting from the team's five years of R&D on graph capture.

AOTAutograd

It overloads PyTorch's autograd engine as a tracing autodiff for generating ahead-of-time backward traces.

PrimTorch

It canonicalizes the roughly 2,000 PyTorch operators down to a closed set of about 250 primitive operators that developers can target to build a complete PyTorch backend. This greatly lowers the barrier to writing PyTorch features or backends.

TorchInductor

A deep learning compiler that generates fast code for multiple accelerators and backends. For NVIDIA GPUs, it uses OpenAI Triton as a key building block.

It is worth noting that TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor are all written in Python and support dynamic shapes.

Faster training

With the introduction of the new compiled mode torch.compile, PyTorch 2.0 can speed up model training with a single line of code.

There's no trick here: just call torch.compile(), and that's it (a fuller usage sketch follows the benchmark list below):

opt_module = torch.compile(module)

To validate these technologies, the team carefully built a benchmark suite covering tasks such as image classification, object detection, and image generation, as well as various NLP tasks such as language modeling, question answering, sequence classification, recommendation systems, and reinforcement learning. These benchmarks fall into three categories:

46 models from HuggingFace Transformers

61 models from TIMM: a collection of state-of-the-art PyTorch image models maintained by Ross Wightman

56 models from TorchBench: a curated set of popular codebases from GitHub
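To make the one-liner above concrete, here is a minimal usage sketch. The toy model, sizes, and variable names are hypothetical and for illustration only; the API call itself, torch.compile(module), is the one described in the article.

```python
import torch
import torch.nn as nn

# A small toy model, used purely for illustration (not from the article).
module = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# The single line from the article: wrap the module with torch.compile.
# The returned object keeps the same API as the original module,
# so existing training and inference code does not need to change.
opt_module = torch.compile(module)

x = torch.randn(32, 64)
out = opt_module(x)   # the first call triggers compilation; later calls reuse it
print(out.shape)      # torch.Size([32, 10])
```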

The test results show that on these 163 open-source models spanning vision, NLP, and other domains, training speed improved by 38% to 76%.

In addition to the comparison on the NVIDIA A100 GPU, the team also benchmarked some popular open-source PyTorch models and measured significant speedups ranging from 30% to 2x.

Developer Sylvain Gugger said: "With just one line of code, PyTorch 2.0 delivers a 1.5x to 2x speedup when training Transformer models. This is the most exciting thing since mixed-precision training was introduced!"

Technical overview

The PyTorch compiler can be broken down into three parts:

Graph acquisition

Graph lowering

Graph compilation

Among these, graph acquisition is the harder challenge when building a PyTorch compiler.

TorchDynamo

Earlier this year, the team began working on TorchDynamo, an approach that uses the CPython Frame Evaluation API introduced in PEP 523.

To this end, the team took a data-driven approach to verify TorchDynamo's effectiveness at graph capture, using more than 7,000 GitHub projects written in PyTorch as a validation set.

The results show that TorchDynamo performs graph capture correctly and safely 99% of the time, with negligible overhead.
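As a rough illustration of what TorchDynamo's graph capture produces, the sketch below passes a custom backend to torch.compile. The backend here is hypothetical debugging code, not part of PyTorch; it simply prints the FX graph TorchDynamo captured and then runs it unmodified.

```python
import torch

# Hypothetical debug backend: TorchDynamo hands it the captured graph
# (a torch.fx.GraphModule) plus example inputs. Returning gm.forward means
# "run the captured graph as-is", with no further compilation.
def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    print(gm.graph)    # show the operations TorchDynamo captured
    return gm.forward

def f(x, y):
    return torch.sin(x) + torch.cos(y)

compiled_f = torch.compile(f, backend=inspect_backend)
out = compiled_f(torch.randn(8), torch.randn(8))
```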

TorchInductor

For the backend of PyTorch 2.0's new compiler, the team drew inspiration from how users write high-performance custom kernels: increasingly, by using the Triton language.

TorchInductor uses a Pythonic define-by-run loop-level IR to automatically map PyTorch models to generated Triton code on GPUs and C++/OpenMP code on CPUs.

TorchInductor's core loop-level IR contains only about 50 operators, and it is implemented in Python, which makes it easy to extend.
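A minimal sketch of handing a function to TorchInductor, which is the default backend of torch.compile (naming it explicitly here is only for clarity). The function and sizes are hypothetical; on a GPU the fused kernel is emitted as Triton code, on a CPU as C++/OpenMP.

```python
import torch

def scaled_gelu(x, scale):
    # a simple pointwise computation (hypothetical example)
    return torch.nn.functional.gelu(x) * scale

# "inductor" is the default torch.compile backend; TorchInductor lowers the
# function into its loop-level IR and generates Triton code on GPU or
# C++/OpenMP code on CPU.
compiled = torch.compile(scaled_gelu, backend="inductor")
y = compiled(torch.randn(1024), 2.0)
```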

AOTAutograd

To speed up training, AOTAutograd needs to capture not only user-level code, but also backpropagation.

AOTAutograd leverages PyTorch's torch_dispatch extension mechanism to trace through the autograd engine and capture backpropagation "ahead of time", which in turn allows TorchInductor to accelerate both the forward and backward passes.
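The sketch below shows why this matters in a training loop: once the module is compiled, the loss.backward() call also runs through code that AOTAutograd traced ahead of time, not just the forward pass. The toy model, data, and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

model = torch.compile(nn.Linear(32, 1))   # toy model, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 32)
target = torch.randn(16, 1)

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()     # the backward pass was captured "in advance" by AOTAutograd
optimizer.step()
optimizer.zero_grad()
```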

PrimTorch

PyTorch has more than 1,200 operators, and if you count the various overloads of each, the number exceeds 2,000. As a result, writing a backend or cross-cutting functionality becomes a draining task.

In the PrimTorch project, the team defined two smaller and more stable operator sets (a conceptual decomposition sketch follows the list):

Prim ops: about 250 operators, aimed at compilers. They are low-level enough that they just need to be fused back together to achieve good performance.

ATen ops: about 750 canonical operators, suitable for exporting as-is. They fit backends that already integrate at the ATen level, or backends without compilation to recover performance from a lower-level operator set such as Prim ops.
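To illustrate the idea of decomposing a higher-level operator into a handful of primitives, here is a hand-written, conceptual sketch. It is not the actual PrimTorch decomposition table, just an example of how an operator like SiLU can be rewritten in terms of simpler elementwise operations.

```python
import torch

def silu_decomposed(x: torch.Tensor) -> torch.Tensor:
    # silu(x) = x * sigmoid(x); sigmoid itself reduces to exp, add, and divide
    return x * (1.0 / (1.0 + torch.exp(-x)))

x = torch.randn(4)
print(torch.allclose(silu_decomposed(x), torch.nn.functional.silu(x), atol=1e-6))
```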

When examining what it takes to support the generality of PyTorch code, one key requirement is support for dynamic shapes: allowing models to accept tensors of different sizes without triggering recompilation every time the shape changes.

When dynamic shapes are not supported, a common workaround is to pad inputs up to the nearest power of two. However, as the team's measurements show, this incurs significant performance overhead as well as markedly longer compilation times.

Now, with support for dynamic shapes, PyTorch 2.0 achieves up to 40% higher performance than eager mode.
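A minimal sketch of the dynamic-shape support, assuming the dynamic=True flag of torch.compile: the same compiled module accepts batches of different sizes without padding them to a power of two or recompiling for every new shape. The model and batch sizes are hypothetical.

```python
import torch
import torch.nn as nn

# dynamic=True asks torch.compile to generate code that works across input
# sizes instead of specializing (and recompiling) for each one.
model = torch.compile(nn.Linear(64, 10), dynamic=True)

for batch in (3, 7, 29):               # varying batch sizes, chosen arbitrarily
    out = model(torch.randn(batch, 64))
    print(batch, out.shape)
```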

Finally, on the PyTorch 2.x roadmap, the team hopes to push compiled mode further in terms of both performance and scalability.

Reference:

https://pytorch.org/get-started/pytorch-2.0/

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era); editor: sleepy.
