The technical report for Stable Diffusion XL 0.9, the model known as the "open-source Midjourney," has just been released.
Report address: https://github.com/Stability-AI/generative-models/blob/main/assets/sdxl_report.pdf
Since its public preview launched in April, Stable Diffusion XL has won over many users and earned the nickname "open-source Midjourney."
It handles tricky details such as hands and rendered text well, and, most importantly, it does so without requiring extremely long prompts.
On top of that, unlike Midjourney, which requires a paid subscription, SDXL 0.9 can be tried for free.
Interestingly, in the final appendix, the research team thanked ChatGPT for help with the writing.
"small victory" Midjourney so, compared with Midjourney, how capable is SDXL?
In the report, the researchers randomly selected five prompts from each category and, for each prompt, generated four 1024 × 1024 images with both Midjourney (V5.1, seed set to 2) and SDXL.
The images were then submitted to AWS GroundTruth workers, who voted on which images better followed the prompts.
Overall, SDXL is slightly better than Midjourney at following prompts.
Feedback from 17,153 users covered all the "categories" and "challenges" in the PartiPrompts (P2) benchmark.
Notably, SDXL was preferred over Midjourney V5.1 in 54.9% of cases.
Preliminary tests suggest that the recently released Midjourney V5.2 has regressed in prompt comprehension, but the tedious process of generating images for many prompts has slowed more extensive testing.
Each prompt in the P2 benchmark is organized by category and challenge, each of which focuses on a different difficulty in the generation process.
The comparison results for each category (figure 10) and each challenge (figure 11) in the P2 benchmark are shown below.
In 4 of 6 categories SDXL outperformed Midjourney, and in 7 of 10 challenges there was either no significant difference between the two models or SDXL came out ahead.
You can also try to guess which of the following images were generated by SDXL and which by Midjourney.
(the answer will be revealed below)
SDXL: the strongest open-source AI
Last year, Stable Diffusion, hailed as the strongest open-source model, lit the torch of generative image AI worldwide.
Unlike OpenAI's DALL-E, Stable Diffusion lets people run text-to-image generation on consumer graphics cards.
Stable Diffusion is a latent text-to-image diffusion model (DM) that is now widely used.
Recent studies on reconstructing images from functional magnetic resonance imaging (fMRI) brain signals, as well as on music generation, all build on diffusion models.
Stability AI, the startup behind the hit tool, launched an improved version of Stable Diffusion, SDXL, in April this year.
According to user studies, SDXL consistently outperforms all previous versions of Stable Diffusion, such as SD 1.5 and SD 2.1.
In the report, the researchers describe the design choices that led to this performance improvement, including:
1) a UNet backbone three times larger than in previous Stable Diffusion models;
2) two simple yet effective additional conditioning techniques that require no extra supervision;
3) a separate diffusion-based refinement model that denoises the latents produced by SDXL to improve the visual quality of samples.
Improving Stable Diffusion
The researchers improved the Stable Diffusion architecture. The changes are modular and can be used alone or together to extend any model.
Although the following strategies were developed as extensions of latent diffusion models, the report notes that most of them also apply to their pixel-space counterparts.
DMs have proven to be powerful generative models for image synthesis, and the convolutional UNet has become the dominant architecture for diffusion-based image synthesis.
As DMs evolved, so did the underlying architecture: from adding self-attention and improved upsampling layers, to cross-attention for text-to-image synthesis, to purely Transformer-based architectures.
Continuing this trend in their improvements to Stable Diffusion, the researchers shifted most of the Transformer computation to the lower-resolution features of the UNet.
Specifically, compared with the original SD architecture, they use a heterogeneous distribution of Transformer blocks within the UNet.
For efficiency, the Transformer blocks are omitted at the highest feature level, 2 and 10 blocks are used at the lower levels, and the lowest level (8× downsampling) is removed from the UNet entirely, as shown in the figure below.
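As a rough illustration, the heterogeneous layout described above could be written down as a UNet configuration along these lines; this is a sketch with assumed field names, not the report's actual config format, and only the block counts come from the report.

```python
# A sketch of the heterogeneous Transformer-block layout described above,
# expressed as a hypothetical UNet configuration.
sdxl_unet_sketch = {
    # Transformer blocks per feature level, highest resolution first;
    # 0 means the blocks are omitted at that level, and the former
    # 8x-downsampled level is dropped from the UNet entirely.
    "transformer_depth": [0, 2, 10],
}
```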
(The report includes a table comparing the configurations of SDXL and earlier Stable Diffusion models.)
The researchers also chose a more powerful pretrained text encoder for text conditioning.
Specifically, OpenCLIP ViT-bigG is used together with CLIP ViT-L, and the penultimate text-encoder outputs are concatenated along the channel axis.
In addition to conditioning the model on the text input via cross-attention layers, following prior work, the researchers also condition the model on the pooled text embedding of the OpenCLIP model.
Together, these changes result in a UNet with 2.6B parameters and text encoders with a total of 817M parameters.
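To make the text conditioning concrete, here is a minimal sketch of concatenating the penultimate hidden states of the two encoders along the channel axis. The encoder objects are hypothetical stand-ins with a HuggingFace-style interface; only the widths (768 and 1280) follow the models named above.

```python
import torch

def encode_prompt(clip_vit_l, openclip_bigg, token_ids):
    # penultimate-layer hidden states, shape (batch, seq_len, width)
    h_l = clip_vit_l(token_ids, output_hidden_states=True).hidden_states[-2]     # width 768
    h_g = openclip_bigg(token_ids, output_hidden_states=True).hidden_states[-2]  # width 1280
    # concatenate along the channel axis -> (batch, seq_len, 2048),
    # used as cross-attention context
    context = torch.cat([h_l, h_g], dim=-1)
    # the pooled embedding of the OpenCLIP model is an extra condition
    pooled = openclip_bigg(token_ids).pooler_output  # (batch, 1280)
    return context, pooled
```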
A major drawback of training a latent diffusion model (LDM) is that, because of its two-stage architecture, training requires a minimum image size.
There are two main ways to deal with this: either discard all training images below a certain minimum resolution (SD 1.4/1.5 discarded everything below 512 pixels), or upscale images that are too small.
However, the first approach can discard a large amount of training data and hurt performance. The researchers visualized the SDXL pretraining dataset:
for this particular dataset, discarding all samples below 256 × 256 pixels would throw away 39% of the data.
The second approach, on the other hand, usually introduces upscaling artifacts, which can leak into the final model outputs and cause blurry samples.
Instead, the researchers propose conditioning the UNet model on the original image resolution, which is trivially available during training.
In particular, the original height and width of the image are provided to the model as an additional condition c_size = (h_original, w_original).
Each component is embedded independently using a Fourier feature encoding, and the encodings are concatenated into a single vector that the team feeds into the model by adding it to the timestep embedding.
At inference time, a user can set the desired apparent resolution of the image via this size conditioning. Evidently, the model has learned to associate the condition with resolution-dependent image properties.
As shown in the figure, the researchers drew four samples from SDXL with the same random seed but different size conditioning; image quality clearly improves when conditioning on a larger image size.
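A minimal sketch of this size conditioning, assuming a hypothetical projection layer and a simple Fourier feature encoding (the exact encoding parameters are not given in the text above and are assumptions):

```python
import math
import torch

def fourier_features(x, num_freqs=8):
    # x: (batch,) scalar condition, e.g. original height in pixels
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)
    angles = x[:, None].float() * freqs[None, :] * math.pi
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (batch, 2*num_freqs)

def add_size_conditioning(h_orig, w_orig, timestep_emb, proj):
    # embed each component of c_size independently, then concatenate
    c_size = torch.cat([fourier_features(h_orig), fourier_features(w_orig)], dim=-1)
    # `proj` is a hypothetical linear layer mapping the concatenated
    # vector to the width of the timestep embedding
    return timestep_emb + proj(c_size)
```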
Below, SDXL's outputs are compared with those of previous SD versions. For each prompt, the researchers drew 3 random samples using 50 steps of the DDIM sampler with a classifier-free guidance scale (cfg-scale) of 8.0.
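For reference, cfg-scale is the usual classifier-free guidance scale; a one-line sketch of how it combines the conditional and unconditional noise predictions (standard practice, not specific to this report):

```python
def cfg_noise_prediction(eps_uncond, eps_cond, guidance_scale=8.0):
    # classifier-free guidance: push the prediction away from the
    # unconditional output and toward the text-conditioned one
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```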
In previous SD models, synthesized images could be incorrectly cropped, as in the cat heads generated by SD 1.5 and SD 2.1 in the example on the left.
The comparisons below make it clear that SDXL has largely solved this problem.
Such a significant improvement was achieved because the researchers came up with another simple and effective conditioning method:
during data loading, crop coordinates c_top and c_left (integers specifying the number of pixels cropped from the top-left corner along the height and width axes, respectively) are sampled uniformly and fed into the model as conditioning parameters via Fourier feature embeddings, analogous to the size conditioning described above.
The concatenated embedding c_crop = (c_top, c_left) is then used as an additional conditioning parameter.
The research team stresses that this technique is not limited to LDMs, and that crop conditioning and size conditioning can easily be combined.
In that case, the feature embeddings are concatenated along the channel dimension before being added to the UNet's timestep embedding.
As shown in the figure, tuning c_crop = (c_top, c_left) at inference time successfully simulates the effect of cropping.
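A minimal sketch of how such crop coordinates could be sampled during data loading and exposed as conditioning inputs (helper names are hypothetical):

```python
import torch

def sample_crop(image, crop_h, crop_w):
    # image: (..., H, W); uniformly sample the top-left corner
    c_top = torch.randint(0, image.shape[-2] - crop_h + 1, (1,)).item()
    c_left = torch.randint(0, image.shape[-1] - crop_w + 1, (1,)).item()
    patch = image[..., c_top:c_top + crop_h, c_left:c_left + crop_w]
    # c_crop = (c_top, c_left) is Fourier-embedded like c_size above
    return patch, (c_top, c_left)

# At inference, conditioning on c_crop = (0, 0) requests an
# "uncropped", object-centered composition.
```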
Inspired by the above techniques, the researchers also fine-tuned the model to handle multiple aspect ratios simultaneously: the training data is divided into buckets of different aspect ratios, keeping the pixel count as close to 1024 × 1024 as possible while varying height and width in multiples of 64, as in the sketch below.
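A minimal sketch of generating such buckets; the search range and tolerance are assumptions, and only the multiple-of-64 constraint and the ~1024 × 1024 pixel budget come from the report:

```python
def make_buckets(target_pixels=1024 * 1024, step=64, tolerance=0.15,
                 lo=512, hi=2048):
    # enumerate (height, width) pairs in multiples of 64 whose pixel
    # count stays close to the 1024x1024 budget
    buckets = []
    for h in range(lo, hi + 1, step):
        for w in range(lo, hi + 1, step):
            if abs(h * w - target_pixels) / target_pixels <= tolerance:
                buckets.append((h, w))
    return buckets

# each training image is then assigned to the bucket whose aspect
# ratio best matches its own
```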
Improving the autoencoder
Although most of the semantic composition is done by the LDM, the researchers found they could improve local, high-frequency details in generated images by improving the autoencoder.
To this end, they trained the same autoencoder architecture as the original SD with a much larger batch size (256 vs. 9) and additionally tracked the weights with an exponential moving average.
The resulting autoencoder outperforms the original model on all evaluated reconstruction metrics.
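Exponential-moving-average weight tracking is a standard trick; a minimal sketch (the decay value is an assumption, not taken from the report):

```python
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.9999):
    # after each optimizer step, blend the live weights into a slow
    # copy; the EMA copy is what gets evaluated and shipped
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```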
The birth of SDXL
The researchers trained the final SDXL model in a multi-stage process. SDXL uses the autoencoder above and a discrete-time diffusion schedule with 1,000 steps.
First, a base model was pretrained on an internal dataset (whose height and width distribution is visualized in the report) for 600,000 optimization steps at a resolution of 256 × 256 with a batch size of 2048, using the size and crop conditioning described above.
The researchers then continued training on 512 × 512 images for another 200,000 optimization steps, and finally trained the model on multiple aspect ratios in the region of 1024 × 1024 pixels, combined with an offset-noise level of 0.05.
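Summarized as a config, the staged schedule above might look like this; field names are illustrative, while the numbers are from the report:

```python
training_stages = [
    {"resolution": (256, 256), "steps": 600_000, "batch_size": 2048,
     "conditioning": ["size", "crop"]},            # base pretraining
    {"resolution": (512, 512), "steps": 200_000},  # continued training
    {"resolution": "multi-aspect, ~1024x1024 px",  # bucketed aspect ratios
     "offset_noise": 0.05},
]
```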
Refinement stage
The researchers found that the resulting model sometimes produces locally low-quality samples, as shown in the figure below.
To improve sample quality, they trained a separate LDM in the same latent space that specializes in high-quality, high-resolution data, and applied the noising-denoising process of SDEdit to the base model's samples.
At inference time, the researchers generate latents from the base SDXL and, using the same text input, diffuse and denoise them directly in latent space with the refinement model.
A 1024 × 1024 sample from SDXL, zoomed in, without (left) and with (right) the refinement model: notably, this step improves the sample quality of backgrounds and faces, and it is optional.
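A minimal sketch of this SDEdit-style refinement step: partially re-noise the base model's latents, then denoise them with the refiner. `refiner` and `scheduler` are hypothetical stand-ins, and the `strength` value is an assumption:

```python
import torch

@torch.no_grad()
def refine(base_latents, refiner, scheduler, text_emb, strength=0.3):
    # choose how far back into the diffusion process to re-noise;
    # scheduler.timesteps is assumed to run from high noise to low
    start = max(1, int(len(scheduler.timesteps) * strength))
    t0 = scheduler.timesteps[-start]
    latents = scheduler.add_noise(base_latents,
                                  torch.randn_like(base_latents), t0)
    # denoise from t0 back to 0 with the refinement model, reusing
    # the same text embedding as the base model
    for t in scheduler.timesteps[-start:]:
        noise_pred = refiner(latents, t, text_emb)
        latents = scheduler.step(noise_pred, t, latents)
    return latents
```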
Limitations
Although SDXL's performance has improved greatly, the model still has obvious limitations.
First, SDXL is still not good at handling complex structures such as human hands.
The researchers speculate that this is because hands and other complex objects vary greatly across images, making it difficult for the model to learn their true 3D shape.
Second, the images generated by SDXL still fall short of photographic realism.
Subtle details, such as low-light effects or texture variations, may be missing or inaccurate in the generated images.
In addition, when an image contains multiple objects or subjects, the model can exhibit so-called "concept bleeding," in which distinct visual elements unintentionally merge or overlap.
For example, the orange sunglasses in the image below are the result of concept bleeding from the "orange sweater."
In figure 8, penguins that were supposed to wear "blue hats" and "red gloves" end up wearing "blue gloves" and "red hats" in the generated image.
Meanwhile, text rendering, a long-standing weakness of text-to-image models, remains a major problem.
As shown in figure 8, the text generated by the model may contain random characters or fail to match the given prompt.
As for future work, the researchers say that further improvements to the model will focus on the following areas:
Single-stage generation
Currently, the team uses an additional refinement model to generate the best SDXL samples in a two-stage manner. This requires loading two large models into memory, which hurts accessibility and sampling speed.
Text synthesis
The model's scale and the larger text encoder (OpenCLIP ViT-bigG) help improve text rendering; introducing a byte-level tokenizer or scaling the model up further may improve the quality of text synthesis even more.
Architecture
During the exploration phase, the team tried Transformer-based architectures such as UViT and DiT, but saw no significant improvement. However, the team still believes that, with more careful hyperparameter studies, scaling up Transformer-based architectures will eventually pay off.
Distillation
Although the model is significantly improved over the original Stable Diffusion, the cost is higher inference cost (both video memory and sampling speed). Future work will therefore focus on reducing the computation required for inference and increasing sampling speed, for example via guidance distillation, knowledge distillation, and progressive distillation.
At present, the report is only available on GitHub; Stability AI's CEO said it will be uploaded to arXiv soon.
References:
https://twitter.com/emostaque/status/1676315243478220800?s=46&t=iBppoR0Tk6jtBDcof0HHgg
https://github.com/Stability-AI/generative-models/blob/main/assets/sdxl_report.pdf
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).