
No longer an illiterate painter: Google swapped the "text encoder", a small change that lets image generation models learn to "spell"


Image generation models have finally learned to spell words, and the secret turns out to be character-level features?

Over the past year, with the release of image generation models such as DALL-E 2 and Stable Diffusion, the images produced by text-to-image models have improved by leaps and bounds in resolution, quality, and text fidelity, greatly advancing downstream application scenarios; everyone has become an AI painter.

However, related research shows that current generation models still have a major defect: they cannot reliably render visual text in images.

Studies have shown that DALL-E 2 is very unstable at generating coherent text characters in images, while the newly released Stable Diffusion model explicitly lists "unable to render readable text" as a known limitation.

Misspelled examples: (1) California: All Dreams Welcome; (2) Canada: For Glowing Hearts; (3) Colorado: It's Our Nature; (4) St. Louis: All Within Reach.

Recently, Google Research released a new paper that attempts to understand and improve the ability of image generation models to render high-quality visual text.

Link to paper: https://arxiv.org/abs/2212.10562

The researchers argue that the main cause of the text-rendering defects in current text-to-image models is the lack of character-level input features.

To quantify the effect of input features on model generation, the researchers designed a series of controlled experiments comparing text encoders with and without character-level input features (character-aware vs. character-blind).

In the text-only domain, they found that character-aware models achieve significant performance gains on a new spelling task (WikiSpell).

Transferring this insight to the visual domain, the researchers trained a set of image generation models, and the experimental results show that character-aware models outperform character-blind ones on a series of new text rendering tasks (the DrawText benchmark).

Character-aware models also set a new state of the art in visual spelling: despite being trained on far fewer examples, their accuracy on uncommon words is more than 30 percentage points higher than that of competing models.

Character-aware models

Language models can be divided into character-aware models, which have direct access to the characters of their text input, and character-blind models, which do not.

Many early neural language models operated directly on characters, without using multi-character tokens.

Later models gradually shifted to vocabulary-based tokenization; some, such as ELMo, remained character-aware, while others, such as BERT, abandoned character features in favor of more efficient pre-training.

Today, most widely used language models are character-blind: they rely on data-driven subword segmentation algorithms, such as byte pair encoding (BPE), to produce the subword pieces that make up their vocabulary.

Although these methods can fall back to character-level representations for unusual sequences, they are designed to compress common character sequences into indivisible units.
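
To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers package, its sentencepiece dependency, and the public t5-small checkpoint, none of which are named in the paper) of how a subword vocabulary treats frequent versus rare words:

```python
# Sketch: a frequent word maps to a single vocabulary piece, while a rare
# string falls back to smaller pieces. T5 uses a SentencePiece vocabulary,
# similar in spirit to BPE.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")

print(tok.tokenize("language"))  # e.g. ['▁language'] -- one indivisible piece
print(tok.tokenize("zyzzyva"))   # e.g. ['▁z', 'y', 'z', 'z', 'y', 'va'] -- fallback
```

From a piece like ['▁language'] alone, an encoder gets no direct signal about which letters the word contains, which is exactly the deficiency the paper investigates.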

The main goal of the paper is to understand and improve the ability of image generation models to render high-quality visual text.

To this end, the researchers first studied the spelling ability of current text encoders in isolation. The results show that although character-blind text encoders are very popular, they receive no direct signal about the character-level composition of their input, which limits their spelling ability.

The researchers tested the spelling ability of text encoders across different sizes, architectures, input representations, languages, and tuning methods.

The paper documents for the first time the surprising ability of character-blind models to induce strong spelling knowledge (accuracy > 99%) through web pre-training. However, the experiments show that this ability does not generalize well beyond English and is only attained at scales above 100B parameters, making it infeasible for most application scenarios.

Character-aware text encoders, on the other hand, achieve strong spelling ability at much smaller scales.

Applying these findings to image generation, the researchers trained a series of character-aware text-to-image models and showed that they significantly outperform character-blind models on both existing and new text rendering evaluations.

However, for purely character-level models, the gain in text rendering comes at a cost: image-text alignment degrades on prompts that do not involve visual text.

To alleviate this problem, the researchers suggest combining character-level and token-level input representations to achieve the best performance.
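
As a rough illustration of this hybrid idea (a sketch only, not the paper's exact architecture; all dimensions below are hypothetical), one can hand the encoder both views of the prompt by concatenating a token-level and a character-level embedding sequence:

```python
import torch
import torch.nn as nn

class HybridTextEmbedding(nn.Module):
    """Concatenate token-level and character-level embeddings along the
    sequence axis so a downstream encoder sees both views of the prompt."""
    def __init__(self, vocab_size=32000, n_chars=256, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # subword tokens
        self.char_emb = nn.Embedding(n_chars, d_model)     # bytes/characters

    def forward(self, token_ids, char_ids):
        t = self.tok_emb(token_ids)        # (batch, n_tokens, d_model)
        c = self.char_emb(char_ids)        # (batch, n_chars, d_model)
        return torch.cat([t, c], dim=1)    # (batch, n_tokens + n_chars, d_model)

emb = HybridTextEmbedding()
out = emb(torch.randint(0, 32000, (2, 8)), torch.randint(0, 256, (2, 40)))
print(out.shape)  # torch.Size([2, 48, 512])
```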

The WikiSpell benchmark

Because text-to-image generation models rely on a text encoder to produce representations for decoding, the researchers first sampled words from Wiktionary to create the WikiSpell benchmark, and then used it to probe text encoders on a text-only spelling task.

For each sample in WikiSpell, the input to the model is a single word, and the expected output is its spelling, produced by inserting spaces between each Unicode character.
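
In code, the expected output is trivial to construct; a one-line sketch:

```python
def spelling_target(word: str) -> str:
    # Insert a space between each Unicode character (code point) of the word.
    return " ".join(word)

print(spelling_target("cat"))  # -> "c a t"
```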

Since the paper is interested in the relationship between a word's frequency and the model's ability to spell it, the researchers divided the Wiktionary words into five non-overlapping buckets according to their frequency in the mC4 corpus: the top 1% most frequent words, the 1-10% bucket, the 10-20% bucket, the 20-30% bucket, and the bottom 50% (including words that never appear in the corpus).

Then 1,000 words were drawn uniformly from each bucket to create a test set (and a similar development set).

Finally, a training set of 10,000 words was built from two parts: 5,000 sampled uniformly from the bottom-50% bucket (the least common words), and another 5,000 sampled in proportion to their frequency in mC4 (so that this half of the training set is biased toward frequent words).

Any word selected for the development or test set was excluded from the training set, so the evaluation is always on held-out words.
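
A rough sketch of this construction, with hypothetical inputs standing in for the Wiktionary word list and mC4 frequency counts (each bucket is assumed large enough to sample from):

```python
import random

def build_wikispell_split(words, freq, seed=0):
    """words: list of Wiktionary entries; freq: dict word -> mC4 count."""
    rng = random.Random(seed)
    ranked = sorted(words, key=lambda w: freq.get(w, 0), reverse=True)
    n = len(ranked)
    buckets = [                            # five non-overlapping buckets
        ranked[: n // 100],                # top 1% most frequent
        ranked[n // 100 : n // 10],        # 1-10%
        ranked[n // 10 : n // 5],          # 10-20%
        ranked[n // 5 : int(n * 0.3)],     # 20-30%
        ranked[n // 2 :],                  # bottom 50% (incl. unseen words)
    ]
    test = [w for b in buckets for w in rng.sample(b, 1000)]
    held_out = set(test)
    # 5,000 uniform from the bottom-50% bucket ...
    train = rng.sample([w for w in buckets[-1] if w not in held_out], 5000)
    # ... plus 5,000 in proportion to mC4 frequency (with replacement, for
    # simplicity), biasing this half toward frequent words.
    pool = [w for w in ranked if w not in held_out]
    train += rng.choices(pool, weights=[freq.get(w, 0) + 1 for w in pool], k=5000)
    return train, test
```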

In addition to English, the researchers evaluated six other languages (Arabic, Chinese, Finnish, Korean, Russian, and Thai), selected to cover a variety of features that affect a model's ability to learn spelling. The dataset construction process above was repeated for each language.

Text generation experiments

The researchers used the WikiSpell benchmark to evaluate a range of pre-trained text-only models at different scales: T5, a character-blind encoder-decoder model pre-trained on English data; mT5, similar to T5 but pre-trained on more than 100 languages; ByT5, a character-aware version of mT5 that operates directly on UTF-8 byte sequences; and PaLM, a much larger decoder-only model pre-trained mainly on English.

In both the English-only and multilingual results, the character-blind models T5 and mT5 perform much worse on the bucket containing the top-1% most frequent words.

This result seems counterintuitive, since models usually perform best on examples that occur frequently in the data. But because of the way subword vocabularies are trained, frequent words are usually represented as a single atomic token (or a handful of tokens), and that is the case here: in the top-1% English bucket, 87% of words are represented by T5's vocabulary as a single subword token.

The lower spelling accuracy therefore indicates that the T5 encoder does not retain enough spelling information about the subwords in its vocabulary.
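
That statistic is easy to reproduce in spirit; a sketch (again assuming the transformers package, with a placeholder word list standing in for the actual top-1% bucket):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")

def single_token_fraction(words):
    # Fraction of words that the vocabulary covers with exactly one piece.
    return sum(len(tok.tokenize(w)) == 1 for w in words) / len(words)

top_bucket_words = ["the", "language", "house"]  # placeholder; paper reports ~87%
print(single_token_fraction(top_bucket_words))
```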

Second, for character-blind models, scale is an important factor in spelling ability: both T5 and mT5 improve as they grow, but even at XXL scale these models do not show particularly strong spelling.

Near-perfect spelling only emerges when a character-blind model reaches PaLM's scale: the 540B-parameter PaLM achieves > 99% accuracy across all frequency buckets in English, even though it sees only 20 examples in its prompt (whereas T5 is fine-tuned on 1,000 examples).
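
A sketch of what such a few-shot spelling prompt might look like (the format is illustrative; the paper's exact prompt wording is not reproduced here):

```python
def make_spelling_prompt(demo_words, query):
    """Build an in-context spelling prompt: demos first, then the query word."""
    lines = [f"{w}: {' '.join(w)}" for w in demo_words]
    lines.append(f"{query}:")
    return "\n".join(lines)

demos = ["cat", "house", "river"]  # the paper uses 20 in-context examples
print(make_spelling_prompt(demos, "giraffe"))
# cat: c a t
# house: h o u s e
# river: r i v e r
# giraffe:
```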

PaLM performs much worse in other languages, however, likely because it has far less pre-training data for them.

The ByT5 experiments show that character-aware models have much stronger spelling ability: ByT5 at Base and Large sizes lags only slightly behind the XL and XXL sizes (still staying at 90% or above), and a word's frequency appears to have little effect on ByT5's spelling ability.

ByT5's spelling performance far exceeds that of (m)T5: it is comparable to PaLM's English performance even though PaLM has more than 100× as many parameters, and it beats PaLM in the other languages.

This suggests that the ByT5 encoder retains a considerable amount of character-level information, which can be retrieved from the frozen encoder parameters as the decoding task requires.

The DrawText benchmark

From the COCO dataset released in 2014 to the DrawBench benchmark of 2022, and from FID and CLIP score to human-preference ratings, how to evaluate text-to-image models has long been an important research topic.

However, text rendering and spelling evaluation have been largely absent from this line of work.

To fill this gap, the researchers propose a new benchmark, DrawText, designed to comprehensively measure the text rendering quality of text-to-image models.

The DrawText benchmark consists of two parts that measure different dimensions of model capability:

1) DrawText Spell, evaluated by rendering words drawn from a large collection of English words.

The researchers drew 100 words from each of the English WikiSpell frequency buckets and inserted them into a standard template, constructing 500 prompts in total.

For each prompt, four images are sampled from the candidate model and evaluated with human ratings and metrics based on optical character recognition (OCR).
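
A minimal sketch of an OCR-based check, using pytesseract as a stand-in (the paper's exact OCR pipeline may differ; the file names below are hypothetical):

```python
import re
import pytesseract          # pip install pytesseract (needs the tesseract binary)
from PIL import Image

def ocr_word_correct(image_path: str, target: str) -> bool:
    """True if the target word appears, spelled exactly, in the OCR output."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return target.lower() in re.findall(r"[a-z']+", text.lower())

# Score the four samples drawn for one prompt.
paths = [f"sample_{i}.png" for i in range(4)]
correct = sum(ocr_word_correct(p, "welcome") for p in paths)
print(f"{correct}/4 samples spelled the word correctly")
```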

2) DrawText Creative, which evaluates text rendering in creative visual settings.

Visual text is not limited to everyday scenes such as street signs; text can appear in many forms, such as scribbles, paintings, carvings, sculptures, and so on.

If image generation models support flexible and accurate text rendering, designers will be able to use them to develop creative fonts, logos, layouts, and more.

To test how well image generation models support these use cases, the researchers worked with a professional graphic designer to build 175 prompts requiring text to be rendered in a range of creative styles and settings.

Many of the prompts are beyond the capabilities of current models; even the most advanced models produce misspelled, dropped, or repeated words.

Image generation experiments

The results show that, among the nine image generation models compared, the character-aware models (ByT5 and Concat) outperform the others in accuracy on the DrawText Spell benchmark regardless of model size, especially on uncommon words.

Imagen-AR demonstrates the benefit of avoiding cropping, but despite training 6.6 times longer, it still performs worse than the character-aware models.

Another clear difference between models is whether they consistently misspell a given word across multiple samples.

The results show that, no matter how many samples are drawn, the T5 model misspells many words, which the researchers take as evidence that the text encoder lacks character knowledge.

By contrast, the ByT5 model makes only sporadic errors.

This observation can be quantified by measuring how often a model is consistently correct (4 of 4) or consistently wrong (0 of 4) across all four image samples for a word.

The contrast is stark, especially on common words (the top 1%): the ByT5 model is never consistently wrong, while the T5 model is consistently wrong on 10% or more of words.
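
Computing these persistence rates is straightforward; a sketch over hypothetical per-word OCR outcomes:

```python
def persistence_rates(results):
    """results: dict mapping word -> list of four booleans (one per sample).
    Returns (consistently-correct rate, consistently-wrong rate)."""
    n = len(results)
    always_right = sum(all(r) for r in results.values()) / n
    always_wrong = sum(not any(r) for r in results.values()) / n
    return always_right, always_wrong

demo = {"welcome": [True] * 4,
        "nature":  [True, False, True, True],
        "glowing": [False] * 4}
print(persistence_rates(demo))  # (0.333..., 0.333...)
```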

Reference:

https://arxiv.org/abs/2212.10562

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era); editor: LRS.
