
Blazing speed: a phone runs Stable Diffusion and renders an image in 12 seconds, as Google's diffusion-model acceleration breaks records.


Shulou (Shulou.com) 11/24 report --

An image generated on a phone in just 12 seconds, with one tap? Google's latest research has done it.

This is no boast: Google has actually achieved it.

In the latest study, Google researchers applied four GPU-level optimizations and successfully ran Stable Diffusion 1.4 on a Samsung phone.

Image generation took just 11.5 seconds, and, importantly, memory usage dropped significantly.

As the paper's title puts it: Speed Is All You Need.

Paper address: https://arxiv.org/abs/2304.11267

Google's new approach is general-purpose: it improves any diffusion model rather than targeting a specific device.

Experiments show overall image-generation time falling by 52% on the Samsung S23 Ultra and by 33% on the iPhone 14 Pro Max.

This brings the day when phones run generative AI models locally that much closer.

From an RTX 3080 to a mobile phone

Currently, a key consideration for integrating a large diffusion model into any app is where the model will run.

Deploying the model on consumer devices brings lower serving costs, better scalability, offline availability, and improved user privacy.

In 2022, the newly released first version of Stable Diffusion could at first only be run, slowly, on an RTX 3080.

Stable Diffusion has more than 1 billion parameters and DALL-E has 12 billion; as diffusion models develop, parameter counts will only keep growing.

Because devices have limited compute and memory resources, running these models poses many challenges at run time.

Without careful design, running them on-device can mean high output latency, driven by the iterative denoising, as well as excessive memory consumption.

Previous work has deployed Stable Diffusion on devices, but only on specific devices or chipsets.

In response, Google researchers present a series of optimizations for large diffusion models that achieve the fastest inference latency reported to date on GPU-equipped mobile devices.

Without INT8 quantization, Stable Diffusion 1.4 runs 20 iterations for a 512x512 image with an inference latency under 12 seconds.

How exactly is this achieved?

GPU-aware optimization

In this paper, the researchers focus on the task of using a large diffusion model to generate images from text descriptions.

Although some of the discussion covers optimizations aimed at Stable Diffusion's specific architecture, these optimizations extend easily to other large diffusion models.

The researchers note that when running inference from a text prompt, the process applies an additional condition that guides the reverse diffusion toward the desired text description.

Specifically, the main components of Stable Diffusion are: the text embedder (Text Embedder), noise generation (Noise Generation), the denoising neural network (Denoising Neural Network, a.k.a. UNet), and the image decoder (Image Decoder).

As shown in the following figure:

Schematic diagram of main components and their interactions in Stable Diffusion

Each of these components, and the relationships among them, is introduced below.

Text embedder:

A CLIP model encodes the text prompt y into a high-dimensional embedding vector $\tau_\theta(y)$ that captures the prompt's semantics. This embedding is fed into the denoising neural network to guide the reverse-diffusion process.
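To make the role of $\tau_\theta(y)$ concrete, here is a minimal sketch of the text-embedding step using Hugging Face's CLIP text encoder (the encoder family Stable Diffusion 1.x uses); the model name and shapes are illustrative, and this is not Google's on-device implementation.

```python
# Sketch of the text-embedding step; illustrates tau_theta(y), not Google's code.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photo of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")
with torch.no_grad():
    # Shape (1, 77, 768): one embedding per token, later consumed by the UNet.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
```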

Noise generation:

Random noise z is sampled in the latent space and serves as the starting point of the reverse-diffusion process.
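As a sketch, sampling that starting latent is a single line; the 4-channel, 1/8-resolution shape below matches Stable Diffusion 1.4's latent space for a 512x512 image.

```python
import torch

# Starting point of reverse diffusion: random latent noise z.
# SD 1.4 uses a 4-channel latent at 1/8 the image resolution,
# so a 512x512 image corresponds to a 64x64 latent.
z = torch.randn(1, 4, 64, 64)
```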

Denoising neural network:

The network approximates the conditional distribution $p(z \mid y)$ with a conditional denoising autoencoder $\epsilon_\theta(z_t, t, \tau_\theta(y))$, using the UNet architecture at each iteration t.

At each iteration, a cross-attention mechanism operates on the latent representation and the text embedding to predict a denoised version of z.
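Below is a minimal single-head sketch of that cross-attention step, assuming flattened latent features as queries and text embeddings as keys and values; the projection weights and all dimensions are hypothetical placeholders, not the model's actual parameters.

```python
import torch
import torch.nn.functional as F

def cross_attention(latent_feats, text_embeds, wq, wk, wv):
    """Single-head cross-attention sketch: latent queries attend to text tokens.

    latent_feats: (n, d_model) flattened spatial features of the latent z
    text_embeds:  (m, d_text)  CLIP token embeddings tau_theta(y)
    """
    q = latent_feats @ wq                       # (n, d) queries from the latent
    k = text_embeds @ wk                        # (m, d) keys from the text
    v = text_embeds @ wv                        # (m, d) values from the text
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)   # (n, m) scaled dot products
    attn = F.softmax(scores, dim=-1)            # normalize over text tokens
    return attn @ v                             # (n, d) text-conditioned features

# Hypothetical dimensions, for illustration only.
n, m, d_model, d_text, d = 4096, 77, 320, 768, 64
out = cross_attention(torch.randn(n, d_model), torch.randn(m, d_text),
                      torch.randn(d_model, d), torch.randn(d_text, d),
                      torch.randn(d_text, d))
```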

Image decoder:

The reverse-diffusion process takes place in the latent space. Once it completes, the image decoder D reconstructs the RGB image from the latent vector.
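Putting the four components together, the same data flow can be reproduced on a desktop GPU with the open-source diffusers library; this is a rough sketch of the pipeline only, since the paper's contribution is the mobile GPU kernels, not this Python API.

```python
# End-to-end sketch with the diffusers library; mirrors the data flow described
# above (text embed -> noise -> iterative UNet denoising -> image decode).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

# 20 denoising iterations at 512x512, matching the paper's benchmark setting.
image = pipe("a photo of an astronaut riding a horse",
             num_inference_steps=20, height=512, width=512).images[0]
image.save("out.png")
```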

The researchers use group normalization (Group Normalization, GN) throughout the UNet architecture.

This normalization technique works by dividing the channels of the feature map into smaller groups and normalizing each group independently, which frees GN from any dependence on batch size and makes it suit a wide range of batch sizes and network structures.

Applying formula ①, each feature value $x_i$ is normalized using the mean $\mu_g$ and variance $\sigma_g^2$ of the group it belongs to:

$\hat{x}_i = \frac{x_i - \mu_g}{\sqrt{\sigma_g^2 + \epsilon}}$ (formula ①)

Rather than performing all the reshaping, mean, variance, and normalization operations above one after another, the researchers designed a dedicated GPU shader that executes all of them in a single GPU command, with no intermediate steps.
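For reference, here is a plain-PyTorch implementation of formula ① that spells out the reshape, mean, variance, and normalization steps the fused shader collapses into one GPU command; it is purely illustrative.

```python
import torch

def group_norm(x, num_groups, eps=1e-5):
    """Reference GroupNorm per formula ①; the paper fuses these steps
    into a single GPU shader instead of running them one by one."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)  # split channels into groups
    mu = g.mean(dim=(2, 3, 4), keepdim=True)             # per-group mean
    var = g.var(dim=(2, 3, 4), keepdim=True, unbiased=False)  # per-group variance
    g = (g - mu) / torch.sqrt(var + eps)                 # normalize (formula ①)
    return g.reshape(n, c, h, w)

# Matches PyTorch's built-in GroupNorm (without affine parameters).
x = torch.randn(2, 32, 8, 8)
ref = torch.nn.functional.group_norm(x, num_groups=8)
assert torch.allclose(group_norm(x, 8), ref, atol=1e-5)
```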

Next, consider the Gaussian Error Linear Unit (GELU).

An activation function used ubiquitously throughout the model, GELU involves many numerical computations, such as multiplication, addition, and the Gaussian error function, as shown in formula ②:

$\mathrm{GELU}(x) = \frac{x}{2}\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)$ (formula ②)

The researchers created a dedicated shader that fuses these numerical computations with their accompanying split and multiplication operations, so that all of them execute in a single draw call.
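For reference, the erf-based GELU of formula ② fits in a couple of lines of plain PyTorch; this shows the math being fused, not the shader itself.

```python
import torch

def gelu_exact(x):
    # Formula ②: GELU(x) = (x / 2) * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + torch.erf(x / 2.0 ** 0.5))

x = torch.randn(4)
assert torch.allclose(gelu_exact(x), torch.nn.functional.gelu(x), atol=1e-6)
```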

The text-to-image transformer in Stable Diffusion helps model the conditional distribution $p(z \mid \tau_\theta(y))$, which is essential for text-to-image generation.

However, self- and cross-attention mechanisms struggle with long sequences, because their time and memory complexity is quadratic in sequence length. In this paper, the researchers introduce two optimizations to relieve these computational bottlenecks.

One is Partially Fused Softmax, the other is FlashAttention.

Here we take softmax as the example.

Figure: the optimized softmax implementation in the attention module. Above the dotted line is the original implementation, which applies softmax directly to the matrix; below the dotted line is the revised module (the portion in red).
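To sketch the idea behind partially fused softmax: the expensive parts are two row-wise reductions, the maximum and the sum of shifted exponentials, which the paper computes in a small fused reduction step before the elementwise normalization. The plain-PyTorch version below shows the same math, not the GPU shader.

```python
import torch

def stable_softmax(scores):
    """Row-wise softmax with the max-subtraction trick.

    The two reductions (row max, row sum of exponentials) are the parts
    the paper fuses into a compact GPU kernel; the elementwise divide
    then runs alongside the following matrix operation.
    """
    m = scores.max(dim=-1, keepdim=True).values   # reduction 1: row max
    e = torch.exp(scores - m)                     # shifted exponentials
    s = e.sum(dim=-1, keepdim=True)               # reduction 2: row sum
    return e / s                                  # elementwise normalize

A = torch.randn(64, 77)  # e.g. 64 latent positions attending to 77 text tokens
assert torch.allclose(stable_softmax(A), torch.softmax(A, dim=-1), atol=1e-6)
```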

In short, the researchers propose a set of optimizations that achieve record-breaking latency figures when running large diffusion models on a variety of devices.

These improvements broaden the models' applicability and lift the overall user experience across a wide range of devices.

Done in 12 seconds

To evaluate the improved model, the researchers ran a set of benchmarks on the Samsung S23 Ultra (Adreno 740) and the iPhone 14 Pro Max (A16).

As the denoising neural network, the UNet is the component with the heaviest compute requirements.

The researchers report the latency of a single UNet iteration, measured in milliseconds, at an image resolution of 512x512.

They also record, in megabytes, the run-time memory used by intermediate tensors in a "Tensor" column and the memory allocated for the model weights in a "Weight" column.

Note that the memory manager trims the memory footprint by reusing the buffers of intermediate tensors.

In the table, the first row is the baseline: an implementation using the researchers' internal OpenCL kernels from the public GitHub repository, with no optimizations applied.

Rows 2-5 enable the optimizations one at a time:

Opt. Softmax: partially fused softmax with optimized softmax reduction steps

S-GN/GELU: dedicated kernels for group normalization and GELU

FlashAttn.: the FlashAttention implementation

Winograd (All): Winograd convolution

The experimental results show latency falling step by step as each optimization is switched on.

Compared with the baseline, overall latency dropped markedly on both devices: by 52.2% on the Samsung S23 Ultra and by 32.9% on the iPhone 14 Pro Max.

The researchers also measured end-to-end text-to-image latency on the Samsung S23 Ultra.

Running 20 denoising iterations to generate a 512x512 image took under 12 seconds, an industry-leading result.

Clearly, running generative AI models locally on a phone, with no data connection or cloud server required, opens up many possibilities.

Google's latest research offers an entirely new solution.

Reference:

https://arxiv.org/abs/2304.11267

https://www.reddit.com/r/MachineLearning/comments/12zclus/d_google_researchers_achieve_performance/

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
