2025-01-15 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)11/24 Report--
[New Zhiyuan Guide] Google has released MediaPipe Diffusion, a low-cost, controllable text-to-image plug-in model that speeds up generation on mobile devices by 20x and runs up to 100x faster on a V100 GPU.
In recent years, diffusion models have achieved great success in text-to-image generation, delivering higher image quality, better inference performance, and a wider range of creative possibilities.
However, relying on text alone to control generation often fails to produce the desired result: attributes such as a specific character pose or facial expression are difficult to specify in words.
Recently, Google released MediaPipe Diffusion plug-ins, a low-cost solution for controllable text-to-image generation that runs on mobile devices and supports existing pre-trained diffusion models and their low-rank adaptation (LoRA) variants.
Background
Image generation with a diffusion model can be viewed as an iterative denoising process: starting from a noise image, at each step the model gradually removes noise to reveal an image of the target concept. Conditioning this process on a text prompt greatly improves the quality of the generated image.
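The iterative denoising loop described above can be sketched in a few lines. This is a toy illustration only: a made-up linear "denoiser" stands in for a real diffusion U-Net, and all names and shapes here are assumptions, not the actual model.

```python
import numpy as np

def denoise_step(x, target, step_size=0.1):
    # A real model predicts noise from (x, t, text_embedding); here we
    # fake the prediction as the residual toward a known target image.
    predicted_noise = x - target
    return x - step_size * predicted_noise

rng = np.random.default_rng(0)
target = np.zeros((8, 8))        # stand-in for the "clean" image
x = rng.normal(size=(8, 8))      # start from pure Gaussian noise

errors = []
for t in range(50):              # 50 diffusion steps, as in the server setup
    x = denoise_step(x, target)
    errors.append(np.abs(x - target).mean())

# Each step removes part of the noise, so the error shrinks monotonically.
```

The point of the sketch is the loop structure: every step refines the previous step's output, which is why conditioning signals injected into the loop can steer the whole trajectory.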
In text-to-image generation, the text embedding is connected to the image generation model through cross-attention layers. However, some information is still difficult to convey in a text prompt, such as the position and pose of an object.
To solve this problem, researchers add an auxiliary model to the diffusion model that injects control information extracted from a conditional image.
Common methods for controllable text-to-image generation include:
1. Plug-and-Play uses denoising diffusion implicit model (DDIM) inversion to derive the initial noise input from the input image by reversing the generation process, and then uses a copy of the diffusion model (860 million parameters for Stable Diffusion 1.5) to encode the condition from the input image.
Plug-and-Play extracts spatial features with self-attention from the copied model and injects them into the text-to-image diffusion process.
2. ControlNet creates a trainable copy of the diffusion model's encoder and connects it to the decoder layers through zero-initialized convolution layers that encode the conditioning information.
3. T2I Adapter is a smaller network (77 million parameters) that achieves a similar effect for controllable generation; it takes only the conditional image as input, and its output is shared across all diffusion iterations.
However, the T2I adapter model is not designed for portable mobile devices.
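ControlNet's zero-initialized connection (method 2 above) can be illustrated in a few lines: because the connecting convolution starts with all-zero weights and bias, the control branch contributes nothing at initialization, leaving the pre-trained model's behavior intact until training moves the parameters. This is a minimal 1x1-convolution sketch with assumed shapes, not ControlNet's actual code.

```python
import numpy as np

def zero_conv(x, weight, bias):
    # 1x1 convolution as a matrix product.
    # x: (H, W, C_in), weight: (C_in, C_out), bias: (C_out,)
    return x @ weight + bias

rng = np.random.default_rng(2)
x = rng.normal(size=(16, 16, 8))   # feature map from the trainable encoder copy

w = np.zeros((8, 4))               # zero-initialized weights
b = np.zeros(4)                    # zero-initialized bias

out = zero_conv(x, w, b)
# out is all zeros: the control branch is a no-op until w and b are trained.
```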
The MediaPipe Diffusion plug-in
To make conditional generation efficient, customizable, and scalable, the researchers designed the MediaPipe diffusion plug-in as a separate network that is:
1. Pluggable: easily connected to a pre-trained base model
2. Trained from scratch: uses no pre-trained weights from the base model
3. Portable: runs alongside the base model on a mobile device, with negligible inference cost compared to the original model
Comparison of Plug-and-Play, ControlNet, T2I Adapter, and the MediaPipe diffusion plug-in (*specific numbers vary with the chosen model). In short, the MediaPipe diffusion plug-in is a portable, on-device model for text-to-image generation. It extracts multi-scale features from a conditional image and adds them to the encoder of the diffusion model at the corresponding levels. When connected to a text-to-image diffusion model, the plug-in supplies an additional conditioning signal to image generation.
The plug-in network is a lightweight model with only 6 million parameters. It uses depthwise convolutions and inverted bottlenecks from MobileNetV2 to achieve fast inference on mobile devices.
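To see why inverted bottlenecks keep the parameter count low, compare a standard 3x3 convolution at the expanded width with MobileNetV2's expand / depthwise / project pattern. The layer sizes below are illustrative, not the plug-in's actual configuration.

```python
# Parameter counts (weights only, biases ignored) for one block.
def standard_conv_params(c_in, c_out, k=3):
    # A dense k x k convolution mixes every input channel into every output.
    return c_in * c_out * k * k

def inverted_bottleneck_params(c_in, c_out, expand=6, k=3):
    c_mid = c_in * expand
    expand_1x1 = c_in * c_mid     # 1x1 conv expanding the channels
    depthwise = c_mid * k * k     # k x k depthwise conv: one filter per channel
    project_1x1 = c_mid * c_out   # 1x1 conv projecting back down
    return expand_1x1 + depthwise + project_1x1

wide = standard_conv_params(384, 384)    # dense conv at the expanded width
ib = inverted_bottleneck_params(64, 64)  # same expanded width (64 * 6 = 384)

# The inverted bottleneck reaches the same representational width for a
# fraction of the parameters.
ratio = wide / ib
```

The depthwise layer is what makes the difference: it filters each channel independently instead of densely mixing all channel pairs.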
The MediaPipe diffusion plug-in is a separate network whose output can be plugged into a pre-trained text-to-image generation model; the extracted features are applied to the corresponding downsampling layers (blue) of the diffusion model. Unlike ControlNet, the plug-in injects the same control features into all diffusion iterations, so it only needs to run once per generated image, saving computation.
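The run-once design can be sketched as follows: the plug-in's features are computed a single time and the same tensors are reused at every denoising step, whereas a ControlNet-style branch would be re-evaluated each step. The helper functions here are stand-ins, not the real networks.

```python
import numpy as np

plugin_calls = 0

def plugin_features(condition_image):
    # Stand-in for the 6M-parameter plug-in network.
    global plugin_calls
    plugin_calls += 1
    return 0.5 * condition_image

def denoise_step(x, control):
    # Stand-in for one diffusion step conditioned on the control features.
    return 0.9 * x + 0.1 * control

rng = np.random.default_rng(0)
cond = rng.normal(size=(8, 8))
control = plugin_features(cond)   # computed ONCE, outside the loop

x = rng.normal(size=(8, 8))
for _ in range(20):               # 20 steps, as in the mobile setup
    x = denoise_step(x, control)  # same control tensor reused every step
```

Counting calls makes the contrast concrete: the plug-in runs once regardless of the number of diffusion steps.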
As the example below shows, the control takes effect at every diffusion step, so the generation process can be steered even in early iterations; more iterations improve the image's alignment with the text prompt and add more detail.
Demonstration of the generation process using MediaPipe diffusion plug-ins.
In this work, the researchers built plug-ins for diffusion-based text-to-image generation using MediaPipe face landmarks, MediaPipe holistic landmarks, depth maps, and Canny edges.
For each task, about 100,000 images were selected from a very large image-text dataset; the control signals were computed with the corresponding MediaPipe solution, and the plug-ins were trained with captions refined by PaLI.
Face Landmark
The MediaPipe Face Landmarker task computes 478 landmarks (with attention) on a human face.
The researchers used MediaPipe's drawing utilities to render faces, drawing the face contour, mouth, eyes, eyebrows, and irises in different colors.
The following examples show samples randomly generated from face meshes and prompts; both ControlNet and the plug-in can control text-to-image generation under the given conditions.
Face-landmark plug-in for text-to-image generation, compared with ControlNet.
Holistic Landmark
The MediaPipe Holistic Landmarker task computes landmarks for body pose, hands, and the face mesh; various stylized images can be generated by conditioning on the holistic features.
Holistic-landmark plug-in for text-to-image generation.
Depth
Depth plug-in for text-to-image generation.
Canny Edge
Canny-edge plug-in for text-to-image generation.
Evaluation
The researchers ran a quantitative evaluation of the face-landmark plug-in to demonstrate the model's performance. The evaluation set contained 5,000 human images, and performance was measured with Fréchet Inception Distance (FID) and CLIP scores.
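CLIP score, one of the metrics used here, is essentially the cosine similarity between CLIP's image and text embeddings; a higher score means the image matches the prompt better. A toy sketch with made-up 3-dimensional embeddings (real CLIP embeddings have hundreds of dimensions):

```python
import numpy as np

def clip_score(image_emb, text_emb):
    # Cosine similarity between the two embeddings.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

text = np.array([1.0, 0.0, 0.0])            # embedding of the prompt
well_aligned = np.array([0.9, 0.1, 0.0])    # image that matches the prompt
poorly_aligned = np.array([0.1, 0.9, 0.3])  # image that does not

good = clip_score(well_aligned, text)
bad = clip_score(poorly_aligned, text)
```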
The base model is the pre-trained text-to-image diffusion model Stable Diffusion v1.5.
Quantitative comparison of FID, CLIP score, and inference time.
Judging by the FID and CLIP scores, samples generated with ControlNet and the MediaPipe diffusion plug-in are of much better quality than those of the base model.
Unlike ControlNet, the plug-in model runs only once per generated image rather than at every denoising step, so inference time increases by only 2.6%.
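A back-of-envelope comparison makes the saving concrete. All cost numbers below are arbitrary illustrative units; only the 50-step count and the ~2.6% overhead come from the text, and the ControlNet per-step cost is an assumption.

```python
steps = 50              # diffusion steps used on the server in this work
base_step = 100.0       # cost of one denoising step (arbitrary units)
controlnet_step = 45.0  # assumed per-step cost of a ControlNet branch
plugin_once = 130.0     # one-off plug-in cost, sized to match the ~2.6% overhead

total_base = steps * base_step                       # base model alone
total_controlnet = total_base + steps * controlnet_step  # branch runs every step
total_plugin = total_base + plugin_once                  # plug-in runs once

overhead_plugin = (total_plugin - total_base) / total_base          # 0.026
overhead_controlnet = (total_controlnet - total_base) / total_base  # much larger
```

Because the plug-in cost is paid once rather than per step, its relative overhead also shrinks as the step count grows.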
The researchers measured the performance of the three models on a server (Nvidia V100 GPU) and on a mobile device (Galaxy S23). On the server, all three models were run with 50 diffusion steps; on mobile, 20 diffusion steps were run using the MediaPipe image generation app.
Compared with ControlNet, the MediaPipe plug-in shows a clear advantage in inference efficiency while maintaining sample quality.
Inference time (ms) of the plug-in on different mobile devices.
In this work, the researchers proposed MediaPipe Diffusion, a conditional text-to-image generation plug-in that runs on mobile devices. It injects features extracted from a conditional image into a diffusion model to control the image generation process.
The portable plug-in can be connected to pre-trained diffusion models running on servers or on-device; running text-to-image generation and the plug-in fully on-device enables more flexible applications of generative AI.
Reference:
https://ai.googleblog.com/2023/06/on-device-diffusion-plugins-for.html