Stability AI dropped two blockbusters in one day: the first open-source RLHF-trained large language model in history, and the pixel-level image model DeepFloyd IF. The open-source community is ecstatic!
Recently, the company behind the famous Stable Diffusion has completed two major releases in a row.
First, Stability AI released the world's first RLHF-based open source LLM chatbot, Stable Vicuna.
StableVicuna is based on Vicuna-13B and is the first large-scale open-source chatbot trained using human feedback.
After hands-on testing, some netizens declared StableVicuna the current king of 13B LLMs!
In response, the founder of 1X commented that this can be seen as the second milestone since the launch of ChatGPT.
In addition, Stability AI released the open-source model DeepFloyd IF, a text-to-image cascaded pixel diffusion model that is extremely powerful and can skillfully integrate text into images.
The revolutionary significance of this model is that it solves two long-standing problems in text-to-image generation: correctly rendering text and correctly understanding spatial relationships!
In keeping with the tradition of open source, DeepFloyd IF will be fully open source in the future.
Stability AI is indeed a well-deserved leader in the open-source world.
StableVicuna is the world's first open source RLHF LLM chatbot, released by Stability AI!
A YouTuber tested StableVicuna, and it beat its predecessor Vicuna in every test.
The YouTuber excitedly declared: StableVicuna is the most powerful 13B LLM at present, and it deserves to be called the king of LLM models!
StableVicuna is based on the Vicuna-13B model implementation and is a further instruction fine-tuned and RLHF trained version of Vicuna-13B.
Vicuna-13B is an instruction fine-tuning model of LLaMA-13B.
The following benchmark tests show how StableVicuna compares to open-source chatbots of similar size in terms of overall performance.
StableVicuna can do basic math problems.
It can write code.
It can also explain grammar to you.
Stability AI's decision to build such an open-source chatbot was, of course, also influenced by the frenzy of ChatGPT alternatives triggered by the earlier leak of the LLaMA weights.
From Character.ai's chatbot last spring to ChatGPT and Bard, these assistants have sparked strong interest in open-source alternatives.
The success of these chat models is largely due to two training paradigms: instruction fine-tuning and reinforcement learning with human feedback (RLHF).
Developers have been working hard to build open-source frameworks to help train these models, such as trlX, trl, DeepSpeed Chat, and ColossalAI, but until now there has been no open-source model that applies both instruction tuning and RLHF.
Most models are fine-tuned without RLHF because of the complexity of the process.
More recently, Open Assistant, Anthropic, and Stanford have all started making RLHF datasets available to the public.
Stability AI combined these datasets with RLHF training via trlX to obtain the first large-scale model in history trained with both instruction tuning and RLHF: StableVicuna.
To achieve the robust performance of StableVicuna, the researchers used Vicuna as a base model and followed a typical three-stage RLHF pipeline.
Vicuna is based on the 13-billion-parameter LLaMA model and was fine-tuned using Alpaca.
They mixed three datasets and trained the Vicuna base model with supervised fine-tuning (SFT); a rough sketch of this mixing step follows the list below:
OpenAssistant Conversations Dataset (OASST1), a human-generated, human-annotated assistant-style conversation corpus containing 161,443 messages distributed across 66,497 conversation trees in 35 different languages;
GPT4All Prompt Generations, a dataset of 437,605 prompts and responses generated by GPT-3.5 Turbo;
Alpaca, a dataset generated by OpenAI's text-davinci-003 engine, contains 52,000 instructions and demonstrations.
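To make that mixing step concrete, here is a minimal sketch using the Hugging Face `datasets` library. The hub IDs, field names, and prompt format are assumptions for illustration, not the actual StableVicuna training code:

```python
from datasets import load_dataset, concatenate_datasets

# Public hub copies assumed to correspond to the three corpora listed above.
oasst1  = load_dataset("OpenAssistant/oasst1", split="train")
gpt4all = load_dataset("nomic-ai/gpt4all_prompt_generations", split="train")
alpaca  = load_dataset("tatsu-lab/alpaca", split="train")

def alpaca_to_text(ex):
    # Alpaca rows carry instruction / optional input / output fields.
    prompt = ex["instruction"] + ("\n" + ex["input"] if ex["input"] else "")
    return {"text": f"### Human: {prompt}\n### Assistant: {ex['output']}"}

def gpt4all_to_text(ex):
    return {"text": f"### Human: {ex['prompt']}\n### Assistant: {ex['response']}"}

def oasst_to_text(ex):
    # OASST1 stores whole message trees; a real pipeline would reassemble the
    # conversation threads. For brevity we keep single messages only.
    return {"text": ex["text"]}

sft_mix = concatenate_datasets([
    alpaca.map(alpaca_to_text, remove_columns=alpaca.column_names),
    gpt4all.map(gpt4all_to_text, remove_columns=gpt4all.column_names),
    oasst1.map(oasst_to_text, remove_columns=oasst1.column_names),
]).shuffle(seed=42)

# `sft_mix` can then be tokenized and fed to any causal-LM supervised fine-tuning loop.
```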
To train the reward model, the researchers used trlX, initializing it from the SFT model and training it on the following RLHF preference datasets (a toy version of the pairwise reward objective is sketched after the list):
OpenAssistant Conversations Dataset (OASST1), which contains 7213 preference samples;
Anthropic HH-RLHF, a preference dataset on AI-assistant helpfulness and harmlessness, containing 160,800 human preference labels;
Stanford Human Preferences (SHP), a dataset containing 348,718 human collective preferences for a variety of different answers, including 18 different subject areas ranging from cooking to philosophy.
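As a rough illustration of what the reward model learns from such preference pairs, here is a toy sketch of the pairwise (Bradley-Terry style) objective in PyTorch. The checkpoint path is a placeholder and this is not the trlX training code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

sft_path = "path/to/sft-checkpoint"   # placeholder: the SFT model from the previous stage
tokenizer = AutoTokenizer.from_pretrained(sft_path)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

# A single scalar head on top of the language model acts as the reward model.
reward_model = AutoModelForSequenceClassification.from_pretrained(sft_path, num_labels=1)

def pairwise_reward_loss(chosen_texts, rejected_texts):
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)      # scalar reward per sample
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    # Preferred completions should score higher than rejected ones.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```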
Finally, the researchers used trlX to run RLHF training on the SFT model with Proximal Policy Optimization (PPO) reinforcement learning, and StableVicuna was born!
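The article names trlX for this stage; purely to illustrate the shape of the PPO loop, here is a sketch using the sibling Hugging Face `trl` library's classic PPOTrainer API. Paths, hyperparameters, and the placeholder reward are assumptions, and exact signatures vary between trl versions:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

sft_path = "path/to/sft-checkpoint"                 # placeholder: stage-1 SFT model
tokenizer = AutoTokenizer.from_pretrained(sft_path)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)       # trainable policy + value head
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)   # frozen reference for the KL penalty

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), policy, ref_policy, tokenizer)

prompt = "### Human: What is 7 * 8?\n### Assistant:"
query = tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0)

# Sample a response from the current policy, then strip the prompt tokens.
full = ppo_trainer.generate(query, max_new_tokens=32, do_sample=True).squeeze(0)
response = full[query.shape[0]:]

# In the real pipeline the reward comes from the preference-trained reward model;
# here a constant stands in for reward_model(prompt, response).
reward = torch.tensor(1.0)

# One PPO update over the sampled (query, response, reward) triple.
stats = ppo_trainer.step([query], [response], [reward])
```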
According to Stability AI, Stable Vicuna will be further developed and will soon be available on Discord.
Stability AI also plans to give Stable Vicuna a chat interface, which is currently under development.
The demo is already available on Hugging Face, where developers can download the model weights as deltas against the original LLaMA model.
However, if you want to use StableVicuna, you also need access to the original LLaMA model.
Once you have both the delta weights and the LLaMA weights, you can combine them using the scripts provided in the GitHub repository to obtain StableVicuna-13B. Note, however, that commercial use is not allowed.
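Conceptually, the delta weights are just the difference between the fine-tuned weights and the base LLaMA weights, so recombining them is an element-wise addition. Below is a simplified sketch; the official repository ships a proper script that also handles tokenizer and embedding-size differences, and the hub ID here is the published delta repo, but treat the code as illustrative:

```python
import torch
from transformers import AutoModelForCausalLM

# Base LLaMA-13B weights, obtained separately under Meta's license (path is a placeholder).
base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b", torch_dtype=torch.float16)
# Published delta weights (fine-tuned minus base).
delta = AutoModelForCausalLM.from_pretrained("CarperAI/stable-vicuna-13b-delta", torch_dtype=torch.float16)

# Adding the deltas back onto the base parameters recovers StableVicuna-13B.
with torch.no_grad():
    base_sd = base.state_dict()
    for name, delta_param in delta.state_dict().items():
        base_sd[name] += delta_param      # assumes matching shapes; the real script pads embeddings

base.save_pretrained("stable-vicuna-13b")
```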
DeepFloyd IF
At the same time, Stability AI made another big move.
Would you believe it? The long-standing problem of AI being unable to correctly render text in images has been solved (basically).
That's right: the "perfect" sign below was generated by DeepFloyd IF, the open-source image generation model newly launched by Stability AI.
In addition to this, DeepFloyd IF is able to generate correct spatial relationships.
As soon as the model was released, netizens went wild:
prompt: Robot holding a neon sign that says "I can spell".
However, DeepFloyd IF has a high probability of error for words that are not explicitly stated in the prompt.
prompt:A neon sign of an American motel at night with the sign javilop
Official presentation
Incidentally, in terms of hardware requirements: to get the maximum 1024 x 1024 pixel output the model supports, 24 GB of VRAM is recommended; if you only need 256 x 256 pixels, 16 GB of VRAM is enough.
Yes, an RTX 3060 with 16 GB can run it.
Code implementation: gist.github.com/Stella2211/ab17625d63aa03e38d82ddc8c1aae151
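For reference, here is a sketch of running the cascade with Hugging Face `diffusers`, following the publicly documented DeepFloyd IF example. The model IDs are the public hub releases, but treat the exact arguments as illustrative rather than authoritative:

```python
import torch
from diffusers import DiffusionPipeline

# Stage 1: base pixel diffusion model, text -> 64x64 image.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()   # offloading helps fit the pipeline into ~16-24 GB of VRAM

# Stage 2: super-resolution, 64x64 -> 256x256 (reuses the stage-1 text embeddings).
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16)
stage_2.enable_model_cpu_offload()

# Stage 3: 256x256 -> 1024x1024 with the Stable Diffusion x4 upscaler,
# since the dedicated IF-III stage had not been released.
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
stage_3.enable_model_cpu_offload()

prompt = 'robot holding a neon sign that says "I can spell"'
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds, output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds, output_type="pt").images
image = stage_3(prompt=prompt, image=image, noise_level=100).images[0]
image.save("i_can_spell.png")
```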
Open-source version of Google Imagen
In May 2022, Google released its own image generation model, Imagen.
According to official demos, Imagen not only beats OpenAI's strongest DALL-E 2 in quality, but more importantly, it generates text correctly.
To date, no open-source model has been able to achieve this functionality consistently.
Like other generative AI models, Imagen relies on a frozen text encoder: the text prompt is first converted into embeddings, which a diffusion model then decodes into an image. But instead of using CLIP, Imagen uses the large T5-XXL language model.
DeepFloyd IF, the model Stability AI has now launched, reproduces precisely this architecture.
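To show what the frozen text encoder contributes, here is a small sketch of encoding a prompt with a public T5-XXL checkpoint via `transformers`. The model ID, sequence length, and usage are illustrative, not the exact Imagen or DeepFloyd IF loading code:

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl", torch_dtype=torch.float16)
encoder.requires_grad_(False)   # frozen: only the diffusion UNet is trained

tokens = tokenizer('robot holding a neon sign that says "I can spell"',
                   return_tensors="pt", padding="max_length", max_length=77, truncation=True)
with torch.no_grad():
    text_embeds = encoder(input_ids=tokens.input_ids).last_hidden_state   # (1, 77, hidden_dim)

# The diffusion UNet consumes these embeddings through cross-attention and
# decodes them, step by step, into an image.
```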
In testing, DeepFloyd IF achieved a zero-shot FID score of 6.66 on the COCO dataset, directly surpassing Google's Imagen and a host of competitors (including Stability's own Stable Diffusion).
Specifically, DeepFloyd IF is a modular, cascaded pixel diffusion model.
Modularity:
DeepFloyd IF consists of several neural modules (neural networks that can solve independent tasks) that work together in an architecture.
Cascade:
DeepFloyd IF achieves high-resolution output by cascading multiple models: first generating a low-resolution sample, then upsampling it through successive super-resolution models, resulting in a high-resolution image.
Diffusion:
Both DeepFloyd IF's base model and its super-resolution models are diffusion models: random noise is progressively injected into the data over the steps of a Markov chain, and the process is then reversed to generate new samples from the noise (a toy sketch of the forward noising step follows below).
Pixels:
DeepFloyd IF works in pixel space: diffusion is applied directly to pixels, unlike latent diffusion models such as Stable Diffusion, which operate on latent representations.
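As a toy illustration of the "diffusion in pixel space" idea, the closed-form forward (noising) step of the Markov chain can be written in a few lines of PyTorch; the linear noise schedule and image size here are arbitrary choices for demonstration:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # per-step noise schedule (illustrative)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of (1 - beta_t)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, without stepping through the whole chain."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64) * 2 - 1                 # a fake 64x64 RGB image in [-1, 1], pure pixel space
x_half = q_sample(x0, torch.tensor(499))              # partially noised image halfway through the chain

# The trained model learns the reverse process: starting from pure noise x_T ~ N(0, I),
# it denoises step by step back to an image.
```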
The generation process runs in three stages:
Stage 1:
The base diffusion model converts the text prompt into a 64x64 image. DeepFloyd's team has trained three versions of the base model, each with a different parameter count: IF-I 400M, IF-I 900M, and IF-I 4.3B.
Stage 2:
To "zoom in" on the image, the team applied two text conditional super-resolution models (Efficient U-Net) to the output of the base model. One of them zooms in a 64x64 image to a 256x256 image. Similarly, there are several versions of this model: IF-II 400M and IF-II 1.2B.
Stage 3:
A second super-resolution diffusion model is applied to produce vivid 1024 x 1024 images. The final third-stage model, IF-III, has 700M parameters.
It's worth noting that the team hasn't officially released the third-stage model yet, but the modular nature of DeepFloyd IF allows other upsampling models to be used instead, such as the Stable Diffusion x4 Upscaler.
The team says this work demonstrates the potential of larger UNet architectures in the first phase of the cascade diffusion model, thus showing a promising future for text-to-image synthesis.
DeepFloyd IF is trained on a custom high-quality LAION-A dataset containing 1 billion (image, text) pairs.
LAION-A is a subset of the English portion of the LAION-5B dataset, deduplicated by similarity hashing, with additional cleaning and modification of the original data. DeepFloyd's custom filters were used to remove watermarks, NSFW material, and other inappropriate content.
Currently, licensing of the DeepFloyd IF model is limited to research for non-commercial purposes, and after feedback collection is complete, DeepFloyd and the Stability AI team will release a completely free commercial version.
References:
https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot
https://stability.ai/blog/deepfloyd-if-text-to-image-model
This article comes from Weixin Official Accounts: Xinzhiyuan (ID: AI_era)