
Jay Alammar releases a new work: an ultra-high-quality Illustrated Stable Diffusion that lets you thoroughly understand how "image generation" works


The principles of Stable Diffusion, explained so that even beginners can understand them!

Do you still remember The Illustrated Transformer that went viral across the internet?

Recently, the well-known blogger Jay Alammar published an illustrated guide to the wildly popular Stable Diffusion model on his blog, walking you through the principles of the image generation model from scratch, complete with a highly detailed video explanation.

Article link: https://jalammar.github.io/illustrated-stable-diffusion/

Video link: https://www.youtube.com/watch?v=MXmacOUJUaw

The Illustrated Stable Diffusion

AI models have recently demonstrated image generation capabilities far beyond people's expectations, creating images with stunning visual effects directly from text descriptions. The mechanism behind this may seem mysterious and magical, but it is already changing the way humans create art.

The release of Stable Diffusion is a milestone in the development of AI image generation: it effectively puts a usable, high-performance model in the hands of the public. Not only is the image quality very high and the generation speed fast, but the resource and memory requirements are also low.

I'm sure anyone who has tried AI image generation will want to know exactly how it works. This article will unravel the mystery of how Stable Diffusion works.

Functionally, Stable Diffusion covers two main use cases: 1) its core function generates an image from a text prompt alone (text2img); 2) it can also be used to modify an existing image according to a text description (i.e., input text + image).
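
For readers who want to try both modes right away, here is a minimal, hedged sketch using the Hugging Face diffusers library; the pipeline classes, checkpoint name, and input file are assumptions of this example rather than something specified in the article.

```python
# Minimal sketch of the two Stable Diffusion modes via diffusers (assumed setup).
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
from PIL import Image

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint name

# 1) text2img: generate an image from a text prompt alone
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("an astronaut riding a horse", num_inference_steps=50).images[0]

# 2) img2img: modify an existing image according to a text description
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
init = Image.open("sketch.png").convert("RGB").resize((512, 512))  # assumed input file
edited = img2img(prompt="a detailed oil painting", image=init, strength=0.6).images[0]
```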

Illustrations are used below to help explain the components of Stable Diffusion, how they interact, and the meaning of image generation options and parameters.

The components of Stable Diffusion

Stable Diffusion is a system made up of multiple components and models rather than a single monolithic model.

Looking inside the model as a whole, we find that it contains a text-understanding component that translates the text into a numeric representation capturing its semantic information.

For now we are still analyzing the model at a high level, with more details to come later, but we can already say that this text encoder is a special Transformer language model (specifically, the text encoder of the CLIP model).

Its input is a text string, and its output is a list of numbers representing each word/token in the text; that is, each token is transformed into a vector.
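
To make this concrete, here is a small sketch of the text encoder's input and output using the transformers library; the model name and prompt are assumptions of this example, and the resulting shapes match the ones listed later in the article (77 tokens, 768 dimensions each).

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize to a fixed length of 77 tokens (padding/truncating as needed)
tokens = tokenizer("a photograph of an astronaut riding a horse",
                   padding="max_length", max_length=77, return_tensors="pt")

# Each token becomes a 768-dimensional vector: shape (1, 77, 768)
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768])
```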

This information is then fed to the image generator, which itself contains multiple components.

The image generator mainly consists of two phases:

1. The image information creator is Stable Diffusion's secret sauce, and it is where many of its performance gains over earlier models are realized.

This component runs for multiple steps to generate image information; steps is also a parameter in Stable Diffusion interfaces and libraries, usually defaulting to 50 or 100.

The image information creator works entirely in the image information space (also called the latent space), which makes it faster than diffusion models that work in pixel space; technically, the component consists of a UNet neural network and a scheduling algorithm.

The word "diffusion" describes what happens inside this component: the information is processed step by step until the next component (the image decoder) can produce a high-quality image from it.

2. The image decoder paints a picture from the information produced by the image information creator, and it only needs to run once at the end of the process to generate the final pixel image.

As you can see, Stable Diffusion consists of three main components, each with its own neural network (a shape-checking sketch follows this list):

1) ClipText, used for text encoding.

Input: text

Output: 77 token embedding vectors, each containing 768 dimensions

2) UNet + Scheduler, which gradually processes/diffuses information in the information (latent) space.

Input: text embeddings and an initial multi-dimensional array made up of noise (a structured list of numbers, also called a tensor).

Output: a processed information array

3) Autoencoder decoder, which uses the processed information matrix to paint the final image.

Input: the processed information matrix, with dimensions (4, 64, 64)

Output: the resulting image, with dimensions (3, 512, 512), i.e., (red/green/blue, width, height)
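
As referenced above, here is a small shape-checking sketch built from the corresponding diffusers components; the checkpoint name is an assumption, and random tensors stand in for the real text embeddings and latents (the real text embeddings would come from the ClipText encoder shown earlier).

```python
import torch
from diffusers import UNet2DConditionModel, AutoencoderKL

repo = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")

text_emb = torch.randn(1, 77, 768)    # stand-in for the ClipText output
latents = torch.randn(1, 4, 64, 64)   # initial noise in the latent space

with torch.no_grad():
    # The UNet's input and output both live in the (4, 64, 64) latent space
    noise_pred = unet(latents, timestep=torch.tensor(10),
                      encoder_hidden_states=text_emb).sample
    print(noise_pred.shape)           # torch.Size([1, 4, 64, 64])

    # The autoencoder's decoder maps (4, 64, 64) latents to a (3, 512, 512) image
    image = vae.decode(latents / vae.config.scaling_factor).sample
    print(image.shape)                # torch.Size([1, 3, 512, 512])
```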

What is diffusion?

Diffusion is the process that takes place inside the pink "image information creator" component in the image below. Given the token embeddings that represent the input text and a random initial image information matrix (also called the latents), the process produces an information matrix that the image decoder then uses to paint the final image.

This process happens step by step, with each step adding more relevant information.

To get a more intuitive feel for the process, we can inspect the random latents matrix partway through and see how it translates into visual noise, where the visual inspection is performed via the image decoder.

The diffusion process consists of multiple steps, each of which operates on the input latents matrix and produces another latents matrix that better matches the input text and the visual information the model picked up from the images it was trained on.

Visualizing these latents shows how this information accumulates at each step.
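
As a hedged sketch of such a step-by-step loop, including decoding intermediate latents for visual inspection, the following uses diffusers building blocks; the checkpoint, scheduler choice, and random stand-in text embeddings are this example's assumptions.

```python
import torch
from diffusers import UNet2DConditionModel, AutoencoderKL, PNDMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
scheduler = PNDMScheduler.from_pretrained(repo, subfolder="scheduler")

text_emb = torch.randn(1, 77, 768)   # stand-in for real ClipText embeddings
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

scheduler.set_timesteps(50)
with torch.no_grad():
    for i, t in enumerate(scheduler.timesteps):
        # Each step predicts noise conditioned on the text and removes part of it
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

        if i % 10 == 0:
            # "Visual inspection": decode the intermediate latents to pixels
            preview = vae.decode(latents / vae.config.scaling_factor).sample
```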

Watching the whole process unfold from nothing is quite exciting.

The transition between steps 2 and 4 looks particularly interesting, as if the outline of the picture emerges from the noise.

How diffusion works

The core idea of generating images with a diffusion model relies on the fact that we already have powerful computer vision models: given a large enough dataset, these models can learn arbitrarily complex operations.

Suppose we take an image and generate some noise to add to it; the noised image can then be treated as a training example.

Using the same operation, a large number of training samples can be generated to train the core components of the image generation model.

The example above shows a few possible noise levels, from the original image (level 0, no noise) to full noise (level 4), which makes it easy to control how much noise is added to an image.

We can therefore spread this process over dozens of steps and generate dozens of training samples for each image in the dataset.

With this dataset we can train a high-performing noise predictor; each training step works much like that of any other model. When run in a particular configuration, the noise predictor can generate images.
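
A simplified sketch of what one such training step might look like is shown below, using diffusers utilities; the toy unconditional UNet, the random stand-in images, and the hyperparameters are assumptions of this example, not the actual Stable Diffusion training code.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# A toy, unconditional noise predictor in pixel space (text conditioning and
# latents have not been introduced at this point in the article).
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

clean_images = torch.randn(8, 3, 64, 64)    # stand-in for a real image batch
noise = torch.randn_like(clean_images)      # the noise we will add
timesteps = torch.randint(0, 1000, (8,))    # a random noise level per image

# Forward process: produce the "training example" by noising the clean image
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)

# The noise predictor is trained to recover the noise that was added
noise_pred = model(noisy_images, timesteps).sample
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
```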

The trained noise predictor can denoise an image to which noise has been added, because it can predict the noise that was added.

Because the sampled noise is predictable, subtracting the predicted noise from the image yields an image that is closer to the images the model was trained on.

The resulting image is not some exact original image but rather follows a distribution, i.e., the pixel arrangements of the world: the sky is usually blue, people have two eyes, cats have pointed ears, and so on. The specific style of the generated images depends entirely on the training dataset.

Stable Diffusion is not the only model that generates images through denoising; DALL-E 2 and Google's Imagen do as well.

It is important to note that the diffusion process described so far does not use any text data to generate images. If we deployed this model as is, it could generate good-looking images, but the user would have no way to control what is generated.

The next section describes how conditioning text is merged into the process to control what kind of image the model generates.

Speedup: diffusion on compressed (latent) data

To speed up image generation, Stable Diffusion runs the diffusion process not on the pixel image itself but on a compressed version of the image; the paper calls this "Departure to Latent Space".

The compression, as well as the later decompression and painting of the image, is done by an autoencoder: its encoder compresses the image into the latent space, and its decoder reconstructs the image from the compressed information.

The forward diffusion process is performed on the compressed latents: the slices of noise are applied to the latents rather than to the pixel image, so the noise predictor is actually trained to predict noise in the compressed representation (the latent space).

The forward process, which uses the autoencoder's encoder, is how the data for training the noise predictor is generated. Once training is complete, images can be generated by running the reverse process followed by the autoencoder's decoder.
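
The following sketch illustrates the encode-then-decode round trip through the autoencoder's latent space; the checkpoint name and the random stand-in image are assumptions of this example.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5",
                                    subfolder="vae")  # assumed checkpoint

image = torch.randn(1, 3, 512, 512)  # stand-in for a real image scaled to [-1, 1]

with torch.no_grad():
    # Encoder: compress the pixel image into a (4, 64, 64) latent
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

    # Forward diffusion and noise prediction would operate on `latents` here,
    # not on the pixel image.

    # Decoder: reconstruct a (3, 512, 512) image from the latent
    recon = vae.decode(latents / vae.config.scaling_factor).sample
```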

The forward and reverse processes are shown below; the diagram also includes a conditioning component describing the text prompt for the image the model should generate.

Text encoder: a Transformer language model

The language-understanding component is a Transformer language model that converts the input text prompt into token embedding vectors. The released Stable Diffusion model uses ClipText (a GPT-based model), while this article uses BERT for ease of explanation.

Experiments in the Imagen paper show that using a larger language model improves image quality more than using a larger image generation component.

Early Stable Diffusion models used the pre-trained ClipText model released by OpenAI, but Stable Diffusion V2 switched to the newly released, larger CLIP variant OpenCLIP.

How is CLIP trained?

CLIP is trained on images and their captions; the dataset contains about 400 million images and descriptions.

The dataset was collected from images crawled from the internet together with their corresponding "alt" tag text.

CLIP is a combination of an image encoder and a text encoder. Its training process can be simplified as: take an image and its text description, and encode each with the respective encoder.

The two resulting embeddings are then compared using cosine similarity; at the start of training, even when the text description and the image match, their similarity is likely to be very low.

As the model is updated, the embeddings produced by the two encoders for matching images and texts gradually become more similar.

By repeating this process over the whole dataset with large batch sizes, the encoders eventually produce embeddings in which an image of a dog is similar to the sentence "a picture of a dog".

As in word2vec, the training process also needs negative examples: mismatched images and captions to which the model must assign low similarity scores.
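
A simplified sketch of this contrastive objective, with matching pairs on the diagonal of the similarity matrix and everything else acting as negatives, might look like the following; the function name, batch size, and embedding dimension are this example's own choices, not CLIP's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Simplified symmetric contrastive loss over a batch of (image, caption) pairs."""
    # Cosine similarity = dot product of L2-normalized embeddings
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix

    targets = torch.arange(logits.size(0))            # i-th image matches i-th caption
    loss_i2t = F.cross_entropy(logits, targets)       # images -> captions
    loss_t2i = F.cross_entropy(logits.t(), targets)   # captions -> images
    return (loss_i2t + loss_t2i) / 2

# Toy usage with stand-in encoder outputs (batch of 8, embedding dimension 512)
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```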

To make text conditioning part of the image generation process, the noise predictor's input must be adjusted to include the text.

All operations take place in the latent space, including the encoded text, the input image, and the predicted noise.

To better understand how the text tokens are used in the UNet, we first need to look at the UNet model itself.

Layers of the UNet noise predictor (without text)

A diffusion UNet that does not use text has the following inputs and outputs:

Inside the model, you can see:

1. The layers in the Unet model are mainly used to transform latents.

2. Each layer operates on the output of the previous layer.

3. Some outputs are fed (via residual connections) into later processing stages of the network.

4. The timestep is converted into a timestep embedding vector, which the layers can use.
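
As an illustration of point 4, here is a hedged sketch of a sinusoidal timestep embedding of the kind commonly used in diffusion UNets; the dimension and function name are assumptions of this example.

```python
import math
import torch

def timestep_embedding(timesteps: torch.Tensor, dim: int = 320) -> torch.Tensor:
    """Sinusoidal timestep embedding: each integer step becomes a `dim`-dimensional
    vector of sines and cosines at different frequencies, which the UNet layers can use."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps.float()[:, None] * freqs[None, :]            # (batch, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (batch, dim)

emb = timestep_embedding(torch.tensor([0, 10, 999]))
print(emb.shape)  # torch.Size([3, 320])
```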

Layers of the UNet noise predictor (with text)

The system described above now needs to be retrofitted to accept text.

The main modification is adding support for text input (the technical term is text conditioning), namely adding an attention layer between the ResNet blocks.

Note that the ResNet blocks do not see the text directly; instead, the attention layers merge the text representation into the latents, so the next ResNet block can make use of the incorporated text information.
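
To illustrate the idea (not the actual Stable Diffusion code), here is a hedged sketch of a block in which an attention layer merges text embeddings into the latents before handing them back, via a residual connection, to whatever follows; the class name and layer sizes are this example's assumptions.

```python
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """Illustrative block: cross-attention merges text information into the latents."""

    def __init__(self, channels: int = 320, text_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # latents: (batch, channels, h, w), text_emb: (batch, 77, text_dim)
        b, c, h, w = latents.shape
        x = self.norm(latents).flatten(2).transpose(1, 2)  # (batch, h*w, channels)
        # Queries come from the latents, keys/values from the text embeddings,
        # so the spatial features can "look at" the prompt.
        attended, _ = self.attn(query=x, key=text_emb, value=text_emb)
        return latents + attended.transpose(1, 2).reshape(b, c, h, w)  # residual merge

block = TextConditionedBlock()
out = block(torch.randn(1, 320, 64, 64), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 320, 64, 64])
```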

Reference:

https://jalammar.github.io/illustrated-stable-diffusion/

https://www.reddit.com/r/MachineLearning/comments/10dfex7/d_the_illustrated_stable_diffusion_video/

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)
