Here comes Midjourney's rival! Google's customization master StyleDrop needs only a single picture as a reference, and no matter how complex the artistic style, it can reproduce it.
As soon as Google's StyleDrop came out, it went viral across the Internet.
Given Van Gogh's The Starry Night, the AI takes on the role of the master himself and, after grasping this abstract style, produces numerous paintings in the same vein.
Given a cartoon-style reference, the objects it draws come out much cuter.
It can even control details precisely and design logos in the original style.
The charm of StyleDrop is that it needs only a single picture as a reference: no matter how complex the artistic style, it can be deconstructed and reproduced.
Netizens say it is the kind of AI tool that will put designers out of work.
This viral piece of research, StyleDrop, is the latest work from Google's research team.
Paper address: https://arxiv.org/pdf/2306.00983.pdf
Now, with tools like StyleDrop, you can not only paint more controllably, but also accomplish delicate tasks that were previously unimaginable, such as drawing a logo.
Even Nvidia scientists call it a "phenomenal" achievement.
According to the paper's first author, StyleDrop, this "customization" master, was inspired by Eyedropper (the color-picker tool).
Likewise, StyleDrop lets you quickly and effortlessly "pick" a style from a single reference image (or a small number of them) and generate images in that style.
A sloth can have 18 styles:
A panda has 24 styles:
StyleDrop handles a child's watercolor painting perfectly, even reproducing the creases in the paper.
It has to be said: this is impressively strong.
StyleDrop can also take English letter designs in different styles as references:
Here are letters in the same Van Gogh style.
And line drawings: a line drawing is a highly abstract rendering of an image, demanding very high compositional coherence from the generator, which made it hard to pull off in the past.
The brush strokes of the cheese's shadow in the original image are carried over to the objects in every generated image.
Creations referencing the Android logo:
In addition, by combining StyleDrop with DreamBooth, the researchers extended its capability to customize not only style but also content.
For example, still in Van Gogh's style, it produces paintings of a little corgi in the same manner:
Here's another one: this corgi looks like the Sphinx beside the Egyptian pyramids.
How does it work? StyleDrop is built on Muse and consists of two key parts:
one is parameter-efficient fine-tuning of the generative vision Transformer, and the other is iterative training with feedback.
The researchers then synthesize images from the two fine-tuned models.
Muse is a state-of-the-art text-to-image synthesis model based on a masked generative image Transformer. It contains two synthesis modules, one for base image generation (256×256) and one for super-resolution (512×512 or 1024×1024).
Each module consists of a text encoder T, a Transformer G, a sampler S, an image encoder E, and a decoder D.
T maps a text prompt t ∈ T to a continuous embedding space E; G processes a text embedding e ∈ E to generate the logits l ∈ L of the visual token sequence; S draws the visual token sequence v ∈ V from the logits by iterative decoding, which runs several steps of Transformer inference conditioned on the text embedding e and the visual tokens decoded in preceding steps.
Finally, D maps the discrete token sequence back to pixel space I. In summary, given a text prompt t, an image I is synthesized as follows:
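Putting the pieces together, a plausible sketch of this composition (our reconstruction from the definitions above, not the paper's exact formula) is:

I = D(S(e, G)), with e = T(t),

that is, the prompt is encoded, the sampler iteratively decodes visual tokens using the Transformer's logits, and the decoder maps those tokens back to pixels.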
Figure 2 shows a simplified Muse Transformer layer architecture, partially modified to support parameter-efficient fine-tuning (PEFT) with adapters.
An L-layer Transformer processes the visual token sequence (shown in green) conditioned on the text embedding e; the learned parameters θ are used to construct the adapter-tuning weights.
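As a rough illustration of the adapter idea (the dimensions, activation, and placement below are assumptions for illustration, not the paper's implementation), a small learnable module added to a frozen Transformer layer might look like this:

```python
# Minimal sketch of a bottleneck adapter added to a frozen Transformer layer.
# Only the idea of a small set of learnable weights (theta) on top of a frozen
# backbone comes from the text above; everything else is illustrative.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter adds a small learned correction
        # to the frozen layer's output.
        return x + self.up(self.act(self.down(x)))

# During style tuning, only the adapter parameters (theta) are optimized:
# theta = [p for m in adapters for p in m.parameters()]
```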
To train θ, in many cases only images may be given as a style reference, with no text prompt attached.
The researchers therefore need to attach text prompts manually. They propose a simple, templated way of constructing them: a description of the content followed by a phrase describing the style.
For example, in Table 1, "cat" describes the object and "watercolor painting" is appended as the style description.
Including descriptions of both content and style in the text prompt is important, because it helps disentangle content from style, which is the researchers' main goal.
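A minimal sketch of this templated prompt construction (the exact wording of the style phrase is an assumption; only the "content description + style descriptor" structure comes from the text):

```python
# Sketch of the content + style prompt template described above. The phrase
# "in watercolor painting style" is illustrative, not the paper's exact template.
def build_prompt(content: str, style_descriptor: str = "in watercolor painting style") -> str:
    return f"A {content} {style_descriptor}"

build_prompt("cat")    # training prompt for the style reference image
build_prompt("sloth")  # inference prompt: new content, same style
```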
Figure 3 shows iterative training with feedback.
When training on a single style reference image (orange box), some of the images generated by StyleDrop carry content lifted from the style reference (red boxes: note the house in the background, similar to the one in the style image).
Other images (blue boxes) separate style from content better. Iteratively training StyleDrop on the good samples (blue boxes) yields a better balance between style and text fidelity (green box).
Here the researchers also used two methods:
- CLIP score. This method measures the alignment between image and text, so the quality of a generated image can be evaluated by its CLIP score (that is, the cosine similarity between the visual and textual CLIP embeddings).
The researchers can then select the generated images with the highest CLIP scores. They call this method iterative training with CLIP feedback (CF); a code sketch of this selection step appears after this list.
In experiments, the researchers found that using CLIP scores to evaluate the quality of synthesized images is an effective way to improve recall (that is, text fidelity) without losing much style fidelity.
On the other hand, CLIP scores may not be fully aligned with human intent, and they cannot capture subtle stylistic attributes.
- Human feedback (HF). This is a more direct way to inject the user's intent into the quality assessment of synthesized images.
In reinforcement-learning-based fine-tuning of LLMs, HF has proved powerful and effective.
HF can be used to compensate for the fact that CLIP scores cannot capture subtle style attributes.
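As a rough sketch of the CF selection step (the CLIP checkpoint and the number of retained images are assumptions, not the paper's settings), scoring and top-k selection might look like this:

```python
# Sketch of CLIP-feedback (CF) selection: score each synthesized image by its
# CLIP image-text cosine similarity and keep the best ones for the next
# training round.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_score(image, prompt: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

def select_for_next_round(images, prompt: str, k: int = 10):
    """Keep the k best-aligned images as the next round's training data."""
    scored = sorted(images, key=lambda im: clip_text_score(im, prompt), reverse=True)
    return scored[:k]
```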
At present, a large body of work focuses on personalizing text-to-image diffusion models so that they can synthesize images in a variety of personal styles.
The researchers showed how to combine DreamBooth and StyleDrop in a simple way to personalize both style and content.
This is accomplished by sampling from two modified generative distributions, guided respectively by θ_s for style and θ_c for content, the adapter parameters trained independently on the style and content reference images.
Unlike existing off-the-shelf approaches, the team's method does not require jointly training learnable parameters on multiple concepts, which yields greater compositional power, because the pre-trained adapters are each trained on a single subject or style.
The researchers' overall sampling process follows the iterative decoding of Equation (1), but the logits are sampled differently at each decoding step.
Let t be the text prompt and c the text prompt without the style descriptor. At step k, the logits are computed as follows:
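A plausible sketch of this step (our reconstruction, chosen to match the behavior described next rather than the paper's exact formula) is a linear blend of the two adapters' logits:

l_k = (1 − γ) · G(v_k, T(t); θ_s) + γ · G(v_k, T(c); θ_c)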
Here γ balances StyleDrop and DreamBooth: if γ is 0, we get StyleDrop; if γ is 1, we get DreamBooth.
By setting γ appropriately, we can obtain a well-balanced image.
So far there has been no extensive research on style tuning of text-to-image generation models.
As a result, the researchers proposed a new experimental protocol:
- Data collection
The researchers collected dozens of images in different styles, ranging from watercolor and oil painting, flat illustration, and 3D rendering to sculptures in different materials.
- Model configuration
The researchers used adapter tuning on Muse-based StyleDrop. In all experiments, the adapter weights were updated for 1000 steps using the Adam optimizer with a learning rate of 0.00003 (a minimal sketch of this configuration appears after this list). Unless otherwise noted, StyleDrop refers to the second-round model, trained on more than 10 synthesized images with human feedback.
- Evaluation
The study's quantitative evaluation is based on CLIP, measuring style consistency and text alignment. In addition, the researchers conducted user preference studies to assess style consistency and text alignment.
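A rough sketch of the stated tuning configuration (the loss function and data pipeline are placeholders, since Muse and its adapters are not publicly released; only the optimizer, learning rate, and step count come from the text):

```python
# Sketch of the stated adapter-tuning configuration: Adam, lr = 3e-5, 1000 steps.
# compute_loss and the batch iterator stand in for Muse's masked visual-token
# prediction objective on the style reference, which is not publicly available.
import itertools
import torch

def tune_adapter(adapter_params, batches, compute_loss, steps=1000, lr=3e-5):
    optimizer = torch.optim.Adam(adapter_params, lr=lr)
    data = itertools.cycle(batches)        # a single style image yields few batches
    for _ in range(steps):
        loss = compute_loss(next(data))    # masked-token prediction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adapter_params
```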
As shown in the figure, the researchers collected images in 18 different styles and show StyleDrop's results on them.
As you can see, StyleDrop captures the nuances of texture, shading, and structure across various styles, offering far better control over style than before.
For comparison, the researchers also present the results of DreamBooth on Imagen, the LoRA implementation of DreamBooth on Stable Diffusion, and Textual Inversion.
The specific results are shown in the table: human scores (top) and CLIP scores (bottom) as evaluation metrics for image-text alignment (Text) and visual style alignment (Style).
Qualitative comparison of (a) DreamBooth, (b) StyleDrop, and (c) DreamBooth + StyleDrop:
Here the researchers applied the two CLIP-based metrics mentioned above: the text score and the style score.
For the text score, they measured the cosine similarity between the image and text embeddings. For the style score, they measured the cosine similarity between the embeddings of the style reference and the synthesized image.
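A corresponding sketch of the style score, reusing the CLIP model and processor loaded in the earlier sketch (again an illustration, not the paper's code):

```python
# Sketch of the style score: cosine similarity between the CLIP image embeddings
# of the style reference and a synthesized image (model/processor as loaded above).
def clip_style_score(style_reference, synthesized) -> float:
    inputs = processor(images=[style_reference, synthesized], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```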
The researchers generated a total of 1,520 images for 190 text prompts. Although higher final scores are hoped for, these metrics are not perfect.
Iterative training (IT) improves text scores, which is in line with the researchers' goals.
However, as a trade-off, their style scores drop compared with the first-round model, because they are trained on synthesized images whose style may drift due to selection bias.
The style score of DreamBooth on Imagen is lower than that of StyleDrop (0.644 versus 0.694 for the HF variant).
The researchers note that the style score of DreamBooth on Imagen did not increase significantly (0.569 → 0.644), whereas the increase for StyleDrop on Muse was more pronounced (0.556 → 0.694).
Their analysis is that style fine-tuning on Muse is more effective than on Imagen.
In addition, in terms of fine-grained control, StyleDrop captures subtle style differences, such as color shifts, layering, or sharp angles.
Netizens commented that with StyleDrop, designers would work ten times faster; their productivity would simply take off.
One day in AI is like ten years in the human world: AIGC is developing at the speed of light, a truly blinding speed of light!
Tools merely follow the trend; those who ought to be replaced have already been replaced.
For making logos, this tool is much easier to use than Midjourney.
Reference:
https://styledrop.github.io/
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).