How to improve the inference speed of BERT

Many readers are not familiar with how to improve the inference speed of BERT, so the editor has summarized the topic below with detailed content and clear steps; it should be a useful reference. I hope you get something out of reading this article.
Overview
Microsoft has just open-sourced breakthrough optimizations for Transformer inference, greatly improving inference speed on both CPU and GPU.
BERT is one of the most popular deep learning models for natural language processing. Because it requires so much computation, running BERT for large-scale inference can be prohibitively expensive, and sometimes impossible under strict latency constraints. Recently, Bing shared how it improved BERT inference on GPU for its real-time service needs, serving more than one million BERT inferences per second within Bing's latency limits. We are pleased to announce that Microsoft has open-sourced enhanced versions of these optimizations in ONNX Runtime and extended them to both GPU and CPU.
With ONNX Runtime, AI developers can now easily deploy high-performance inference for large transformer models on CPU and GPU hardware, using the same technology Microsoft uses to serve its own customers.
Natural language processing in Bing
To provide customers with the most relevant results, Bing uses state-of-the-art natural language processing (NLP) techniques to better understand user queries, web pages, and other documents. A key component of NLP is the language representation model, such as BERT, RoBERTa, or MT-DNN. Bing has developed and fine-tuned its own language representation models for tasks such as web search, question answering, and image descriptions.
However, using large transformer networks in a real-time production environment presents latency and cost challenges, because running a 12-layer or 24-layer BERT for every query is computationally expensive. As announced last November, we first used knowledge distillation to condense the larger model into a three-layer BERT model with no significant loss of accuracy, significantly reducing the computational cost. However, the distilled three-layer BERT model still had a serving latency of 77 ms, which remains very expensive when running over millions of queries and documents per second. For further optimization, the whole model was reimplemented with C++ APIs to take full advantage of the GPU architecture, achieving an 800x throughput improvement compared with CPU.
Once these optimizations were successfully used in Bing production, there was more to do. Since these large transformer networks are useful for many NLP tasks beyond web search, we needed an easy way to share this work with others. The existing solution required every model developer to reimplement the model with our C++ library, which is very time-consuming. To further democratize transformer inference and let others benefit from these improvements, we optimized them further, extended them to CPU, and open-sourced them in ONNX Runtime.
Accelerating BERT inference by up to 17x with ONNX Runtime
ONNX Runtime is a high-performance inference engine for machine learning models. It is compatible with PyTorch, TensorFlow, and many other frameworks and tools that support the ONNX standard. ONNX Runtime is designed with an open and extensible architecture that easily optimizes and accelerates inference by leveraging built-in graph optimizations and various hardware acceleration capabilities across CPU, GPU, and edge devices. ONNX Runtime can easily plug into your technology stack, because it runs on Linux, Windows, Mac, and Android, and provides convenient APIs for Python, C#, C++, C, and Java.
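For readers who want to try this, here is a minimal sketch of loading an ONNX model and running it with the ONNX Runtime Python API. The file name "model.onnx", the input name "input_ids", and the vocabulary size are illustrative assumptions, not details taken from this article.

import numpy as np
import onnxruntime as ort

# Create an inference session on CPU; switch providers to use GPU.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Build a dummy batch of token ids matching the expected input shape (batch 1, sequence 128).
dummy_input = np.random.randint(0, 30522, size=(1, 128), dtype=np.int64)

# Run the model; None means "return all outputs".
outputs = session.run(None, {"input_ids": dummy_input})
print(outputs[0].shape)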
A transformer model like BERT is composed of many operators. Graph optimization, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations, is an essential technique built into ONNX Runtime. Since the BERT model consists mainly of stacked transformer cells, we optimize each cell by fusing the key subgraphs of several elementary operators, including the Self-Attention, LayerNormalization, and Gelu layers, into single kernels for both CPU and GPU. This greatly reduces memory copying between the large number of elementary computations.
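As a concrete illustration, the built-in graph optimizations (including the node fusions described above) can be controlled through the session options of the Python API. A small sketch follows; the model paths are placeholders.

import onnxruntime as ort

sess_options = ort.SessionOptions()
# ORT_ENABLE_ALL turns on all built-in graph optimizations, including the
# extended operator fusions and layout optimizations.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally save the optimized graph to disk to inspect which nodes were fused.
sess_options.optimized_model_filepath = "bert_optimized.onnx"

session = ort.InferenceSession("bert.onnx", sess_options,
                               providers=["CPUExecutionProvider"])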
In addition, in the CPU implementation of Self-Attention, the columns of the Q, K, and V matrices are partitioned according to the number of attention heads. With this optimization, we can significantly increase parallelism and make full use of the available CPU cores. Moreover, the transpose operation that follows the full connection of Q, K, and V can be folded into the GEMM computation, further reducing the cost.
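To make the idea concrete, here is a conceptual NumPy sketch (not ONNX Runtime's actual implementation) of partitioning the columns of Q, K, and V by attention head so that each head's attention can be computed independently, and hence in parallel across CPU cores.

import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    seq_len, hidden = x.shape
    head_dim = hidden // num_heads
    # Full-width GEMMs produce Q, K, V for all heads at once.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    out = np.empty_like(q)
    # Each contiguous slice of columns corresponds to one attention head.
    for h in range(num_heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        scores = q[:, s] @ k[:, s].T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, s] = weights @ v[:, s]
    return out

# Example with BERT-base dimensions: 128 tokens, hidden size 768, 12 heads.
x = np.random.rand(128, 768).astype(np.float32)
w = lambda: np.random.rand(768, 768).astype(np.float32)
print(multi_head_attention(x, w(), w(), w(), 12).shape)  # (128, 768)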
With these optimizations, ONNX Runtime performs inference on BERT-SQuAD with sequence length 128 and batch size 1 on an Azure Standard NC6s_v3 VM (V100 GPU):
12-layer fp16 BERT-SQuAD: 1.7 ms
24-layer fp16 BERT-SQuAD: 4.0 ms
For the 3-layer fp32 BERT with sequence length 128, ONNX Runtime delivered a 17x speedup on CPU and more than a 3x speedup on GPU. A rough sketch of how such latency numbers can be measured is shown below.
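This latency-measurement sketch is an assumption about methodology, not the benchmark script used by Microsoft; the model file and input names depend on how the BERT-SQuAD model was exported.

import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("bert-squad.onnx", providers=["CPUExecutionProvider"])
batch, seq_len = 1, 128
feeds = {
    "input_ids": np.random.randint(0, 30522, (batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch, seq_len), dtype=np.int64),
}

session.run(None, feeds)  # warm-up run, excluded from timing
runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, feeds)
print(f"average latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")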
Using ONNX Runtime inference at global scale
With the latest BERT optimizations available in ONNX Runtime, Bing transitioned its transformer inference codebase to the jointly developed ONNX Runtime. Not only does ONNX Runtime handle inference for large transformer networks at the scale of Bing's traffic, the new optimizations also reduce Bing's latency. In addition, Bing found ONNX Runtime easier to use, and it cut the time needed to reuse the optimizations for new scenarios from days to hours.
Beyond Bing, ONNX Runtime is deployed in dozens of Microsoft products and services, including Office, Windows, Cognitive Services, Skype, Bing Ads, and Power BI. ONNX Runtime is used for a wide variety of models covering computer vision, speech, language processing, forecasting, and more. Compared with previous inference solutions, teams have achieved up to 18x better performance on the same hardware.
Getting started with BERT acceleration
You can take advantage of the same acceleration Microsoft uses in your own products, whether you are targeting the cloud or the intelligent edge, and whether you are using CPUs or GPUs. To get started:
1. Train a model with, or load a pre-trained model from, a popular framework such as PyTorch or TensorFlow.
2. Prepare the model for optimized inference by exporting it from PyTorch, or converting it from TensorFlow/Keras, to the ONNX format.
3. Use ONNX Runtime for high-performance inference across multiple platforms and hardware.

That is the content of this article on "how to improve the inference speed of BERT". I hope the content shared by the editor is helpful to you; if you want to know more, please follow the industry information channel. A minimal end-to-end sketch of the three steps above is shown below.
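This sketch assumes the Hugging Face transformers package is installed; the model name, file paths, opset version, and input names are illustrative choices rather than recommendations from the article.

import numpy as np
import torch
import onnxruntime as ort
from transformers import BertModel, BertTokenizer

# 1. Load a pre-trained model from PyTorch (torchscript=True makes it return tuples,
#    which simplifies tracing for export).
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("How fast is BERT inference?", return_tensors="pt",
                   padding="max_length", max_length=128)

# 2. Export the model to the ONNX format.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]),
    "bert-base.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"},
                  "token_type_ids": {0: "batch"}},
    opset_version=14,
)

# 3. Run high-performance inference with ONNX Runtime.
session = ort.InferenceSession("bert-base.onnx", providers=["CPUExecutionProvider"])
ort_outputs = session.run(None, {k: v.numpy() for k, v in inputs.items()})
print(ort_outputs[0].shape)  # (1, 128, 768) for bert-base

Once exported, the graph benefits from the built-in fusions described earlier, and the same session code can run on GPU simply by switching the execution provider.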