CTOnews.com reported on March 19 that the stable release of PyTorch 2.0 is now available. Compared with the 1.x series, 2.0 is a sweeping change. Its biggest improvement is the new torch.compile API, whose compiler is much faster than the on-the-fly code generation of the previous "eager mode" and further improves performance.
Official website: https://pytorch.org/
GitHub address: https://github.com/pytorch/pytorch/releases
Updates in the new version include a stable release of Accelerated Transformers (formerly known as Better Transformers). Beta features include torch.compile as the main API of PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, and the functorch API in the torch.func module, along with other Beta / Prototype improvements covering a range of inference, performance, and training optimizations on GPU and CPU.
For a comprehensive introduction and technical overview of torch.compile, please see the 2.0 getting started page.
In addition to PyTorch 2.0, the development team also released a series of Beta updates to the PyTorch domain libraries, including in-tree libraries and standalone libraries such as TorchAudio, TorchVision, and TorchText. In addition, TorchX has moved to a community-supported model.
Summary:
torch.compile is the main API of PyTorch 2.0; it wraps your model and returns a compiled model. This is a purely additive (and optional) feature, so PyTorch 2.0 is by definition 100% backward compatible.
TorchInductor, the compiler backend underpinning torch.compile, relies on the OpenAI Triton deep-learning compiler on NVIDIA and AMD GPUs to generate performant code while hiding low-level hardware details. Kernels generated by OpenAI Triton achieve performance comparable to hand-written kernels and specialized CUDA libraries such as cuBLAS.
Accelerated Transformers introduces high-performance support for training and inference, using a custom kernel architecture for scaled dot-product attention (SDPA). The API integrates with torch.compile(), and model developers can also use the scaled dot-product attention kernels directly by calling the new scaled_dot_product_attention() operator.
The Metal Performance Shaders (MPS) backend provides GPU-accelerated PyTorch training on the Mac platform and adds support for the top 60 most-used operators, bringing coverage to more than 300 operators.
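As a rough illustration of what using the MPS backend looks like (a minimal sketch, not taken from the release notes; the toy model and shapes are assumptions), the "mps" device is selected like any other PyTorch device string, with a CPU fallback when MPS is unavailable:

```python
import torch

# Minimal sketch: pick the MPS device on Apple-silicon Macs, fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(16, 4).to(device)   # toy model for illustration
x = torch.randn(8, 16, device=device)       # batch of 8 random inputs
print(model(x).shape)                       # torch.Size([8, 4])
```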
Amazon AWS has optimized PyTorch CPU inference on AWS Graviton3. Compared with previous releases, PyTorch 2.0 improves inference performance on Graviton, including improvements for ResNet-50 and BERT.
Other new prototype features and methods across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor.
A full list of public features in 2.0, 1.13, and 1.12 is available on the PyTorch website.
Stable features
PyTorch 2.0 includes a new high-performance implementation of the PyTorch Transformer API. Formerly known as the "Better Transformer API", it has been renamed "Accelerated PyTorch 2 Transformers".
The development team said they hope to make training and deploying state-of-the-art Transformer models affordable across the industry. The new release introduces high-performance support for training and inference, using a custom kernel architecture for scaled dot-product attention (SDPA).
Similar to the fastpath architecture, the custom kernels are fully integrated into the PyTorch Transformer API, so using the Transformer and MultiheadAttention APIs enables users to (see the sketch after this list):
See significant speed-ups.
Support more use cases, including models that use cross-attention, Transformer decoders, and training as well as inference.
Continue to use fastpath inference for fixed- and variable-sequence-length Transformer encoder and self-attention use cases.
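As a rough illustration of the point above (a minimal sketch, not from the article; the layer sizes are assumptions), the stock nn.TransformerEncoderLayer / nn.MultiheadAttention modules pick up the accelerated kernels transparently, with no code changes required:

```python
import torch
import torch.nn as nn

# Minimal sketch: the standard Transformer modules use the accelerated
# SDPA kernels automatically; all sizes below are illustrative only.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6).eval()

x = torch.randn(4, 128, 512)   # (batch, sequence, embedding)
with torch.inference_mode():   # inference mode enables the fastpath
    out = encoder(x)
print(out.shape)               # torch.Size([4, 128, 512])
```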
To take full advantage of different hardware models and Transformer use cases, PyTorch 2.0 supports multiple SDPA custom kernels and selects the highest-performance kernel for a given model and hardware type. In addition to the existing Transformer API, model developers can use the scaled dot-product attention kernels directly by calling the new scaled_dot_product_attention() operator.
To benefit from the additional acceleration of PT2 compilation (for inference or training), preprocess the model with model = torch.compile(model).
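Putting the two pieces together, a minimal sketch (the module, shapes, and names below are illustrative assumptions, not from the article) calls torch.nn.functional.scaled_dot_product_attention directly inside a custom attention block and then wraps the whole model with torch.compile:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAttention(nn.Module):
    """Illustrative module that calls the new SDPA operator directly."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim), the layout SDPA expects
        q, k, v = (z.view(b, t, self.heads, d // self.heads).transpose(1, 2)
                   for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)  # fused SDPA kernel
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

model = TinyAttention(dim=256, heads=8)
model = torch.compile(model)   # opt-in wrapping; returns a compiled model
x = torch.randn(2, 64, 256)
print(model(x).shape)          # torch.Size([2, 64, 256])
```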
By combining the custom kernels with torch.compile(), Accelerated PyTorch 2 Transformers delivers a significant speed-up when training Transformer models, especially large language models.
▲ Scaled dot-product attention used together with custom kernels and torch.compile delivers a significant speed-up when training large language models, here nanoGPT (figure above).
The official data shows that PyTorch 2.0 with compilation runs much faster than 1.x.
This data comes from benchmark testing of 163 open-source models conducted by the PyTorch Foundation on NVIDIA A100 GPUs using PyTorch 2.0, covering tasks such as image classification, object detection, image generation, and various NLP tasks.
These benchmarks fall into three categories: TIMM, TorchBench, and HuggingFace Transformers.
According to the PyTorch Foundation, the new compiler delivers a 21% speed-up in Float32 precision mode and a 51% speed-up in automatic mixed precision (AMP) mode. Of these 163 models, torch.compile works correctly on 93% of them.
It is worth mentioning that the speed-ups officially measured on desktop-class GPUs (such as the NVIDIA 3090) are lower than on server-class GPUs (such as the A100). So far, PyTorch 2.0's default backend, TorchInductor, supports CPUs and NVIDIA Volta and Ampere GPUs; other GPUs, XPUs, and older NVIDIA GPUs are not yet supported.