This article explains how to implement an efficient Softmax CUDA kernel. The content is quite detailed, and interested readers should find it a useful reference.
Softmax is one of the most common operations in deep learning models. In deep learning classification tasks, the final classifier of the network is often a combination of Softmax + CrossEntropy:
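Written out (a standard formulation, not copied from the original figure), the Softmax of a row of logits x and the CrossEntropy loss against a one-hot target t are:

\mathrm{softmax}(x)_i = \frac{e^{x_i - \max_j x_j}}{\sum_j e^{x_j - \max_j x_j}}, \qquad \mathrm{CrossEntropy}(y, t) = -\sum_i t_i \log y_i

Subtracting the row maximum keeps exp from overflowing, which is exactly why the naive implementation discussed below needs a ReduceMax and a BroadcastSub step before the Exp step.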
Although the math simplifies when Softmax and CrossEntropy are fused, there are still many scenarios where the Softmax Op is used on its own. For example, Softmax is applied separately in the attention layer of every Encoder layer in BERT to obtain the attention probability distribution, and it is also used separately in the multi-head attention of GPT-2, among others.
Every compute operator in a deep learning framework is ultimately translated into CUDA kernel functions on the GPU, and Softmax is no exception. Softmax is used in most networks, so the efficiency of its CUDA kernel implementation directly affects the final training speed of many networks. So how do we implement an efficient Softmax CUDA Kernel? This article introduces the Softmax CUDA Kernel optimization techniques used in OneFlow and compares them with the Softmax operation in cuDNN. The results show that after OneFlow's deep optimization, Softmax's utilization of memory bandwidth approaches the theoretical upper limit, far higher than the cuDNN implementation.
GPU Basics and CUDA Performance Optimization Principles:
For an introduction to GPU basics and the principles and goals of CUDA performance optimization, please refer to the previous article:
https://zhuanlan.zhihu.com/p/271740706
It briefly introduces the hardware structure and execution principle of GPU:
Kernel: a CUDA kernel function, the basic unit in which a computing task is described for the GPU. Each Kernel is executed in parallel by many threads on the GPU according to its launch configuration parameters. The GPU achieves high computing efficiency because thousands of cores (threads) can execute simultaneously, far exceeding the CPU (see the minimal kernel sketch after this list).
GPU threads are logically organized into Thread, Block and Grid; in hardware, they run on cores and are scheduled in warps.
GPU memory is divided into three levels: Global memory, Shared memory and Local memory.
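To make the Kernel / Block / Grid terminology above concrete, here is a minimal sketch (hypothetical names, not code from OneFlow or cuDNN) of an elementwise CUDA kernel and its launch configuration:

#include <cuda_runtime.h>

// Each thread handles one element; blockIdx/blockDim/threadIdx come from the launch configuration.
__global__ void ScaleKernel(const float* x, float* y, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index within the Grid
  if (i < n) {
    y[i] = alpha * x[i];
  }
}

void LaunchScale(const float* d_x, float* d_y, float alpha, int n) {
  int threads_per_block = 256;                                    // threads in one Block
  int blocks = (n + threads_per_block - 1) / threads_per_block;   // Blocks in the Grid
  ScaleKernel<<<blocks, threads_per_block>>>(d_x, d_y, alpha, n);
}

A kernel like this does almost no arithmetic per element, so its execution time is dominated by reading x and writing y from Global memory, which is exactly the bandwidth-bound situation discussed next.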
The GPU provides two main resources: compute resources and memory bandwidth resources. If we make the best possible use of both, and the demand on them cannot be reduced any further, then performance has been pushed to its limit and execution time is minimized. In deep learning training, GPU compute resources are in most cases already fully utilized, so the optimization goal of a GPU CUDA Kernel is usually to make the best possible use of memory bandwidth resources.
How to evaluate whether a CUDA Kernel is fully utilizing memory bandwidth resources?
For memory bandwidth resources, "full utilization" means that the Kernel's effective memory read/write bandwidth reaches the upper limit of the device's memory bandwidth, where the device memory bandwidth can be measured by running the bandwidthTest sample shipped with CUDA. A Kernel's effective memory bandwidth is evaluated from the amount of data the Kernel reads and writes and the Kernel's execution time:
Effective memory bandwidth of the current Kernel = (data read + data written) / execution time
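As a hypothetical worked example (the numbers are made up for illustration): a Kernel that reads 1 GB of input and writes 1 GB of output in 2.5 ms has an effective memory bandwidth of (1 GB + 1 GB) / 2.5 ms = 800 GB/s; on a device whose bandwidthTest reports about 850 GB/s, that Kernel is already using roughly 94% of the available bandwidth.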
Naive Softmax implementation:
Before introducing the optimization techniques, let's look at the theoretical maximum bandwidth of an unoptimized Softmax Kernel. In one of the simplest Softmax implementations, the overall computation is completed by calling several basic CUDA kernel functions in sequence (listed below):
Assuming the input data size is D with shape = (num_rows, num_cols), i.e. D = num_rows * num_cols, this most naive implementation accesses Global memory multiple times, where:
ReduceMax = D + num_rows (read is D, write is num_rows)
BroadcastSub = 2 * D + num_rows (read is D + num_rows, write is D)
Exp = 2 * D (read and write are both D)
ReduceSum = D + num_rows (read is D, write is num_rows)
BroadcastDiv = 2 * D + num_rows (read is D + num_rows, write is D)
In total, 8 * D + 4 * num_rows of Global memory traffic is required. Since num_rows is negligible compared to D, the naive version of the Softmax CUDA Kernel needs to access at least 8 * D of memory, i.e.:
Effective memory bandwidth of the naive Softmax Kernel < theoretical bandwidth / 4, because the naive chain of kernels moves roughly 8 * D of data through Global memory while only 2 * D (reading the input once and writing the output once) counts as the effective read/write data of the Softmax operation.
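To make the accounting above concrete, the sketch below (with assumed kernel signatures and a one-thread-per-row reduction, purely for illustration; this is not the OneFlow or cuDNN code) shows the naive approach in which each step is a separate kernel and therefore must read its inputs from and write its outputs back to Global memory:

#include <cfloat>

// Shapes: in/out are (num_rows, num_cols), D = num_rows * num_cols;
// row_max/row_sum hold one value per row.

__global__ void ReduceMaxKernel(const float* in, float* row_max, int num_rows, int num_cols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row (illustrative only)
  if (row < num_rows) {
    float m = -FLT_MAX;
    for (int c = 0; c < num_cols; ++c) { m = fmaxf(m, in[row * num_cols + c]); }
    row_max[row] = m;                               // reads D, writes num_rows
  }
}

__global__ void BroadcastSubKernel(const float* in, const float* row_max, float* out,
                                   int num_rows, int num_cols) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per element
  if (i < num_rows * num_cols) {
    out[i] = in[i] - row_max[i / num_cols];         // reads D + num_rows, writes D
  }
}

__global__ void ExpKernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = expf(in[i]); }              // reads D, writes D
}

__global__ void ReduceSumKernel(const float* in, float* row_sum, int num_rows, int num_cols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float s = 0.0f;
    for (int c = 0; c < num_cols; ++c) { s += in[row * num_cols + c]; }
    row_sum[row] = s;                               // reads D, writes num_rows
  }
}

__global__ void BroadcastDivKernel(const float* in, const float* row_sum, float* out,
                                   int num_rows, int num_cols) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_rows * num_cols) {
    out[i] = in[i] / row_sum[i / num_cols];         // reads D + num_rows, writes D
  }
}

Summing the per-kernel comments reproduces the 8 * D + 4 * num_rows figure above. A fused Softmax kernel only needs to read the input once and write the output once, about 2 * D in total, which is where the factor-of-4 headroom over the naive chain comes from.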