Even the worst human-written corpus is better than AI-generated text.
With the popularity of GPT-4, Stable Diffusion, and Midjourney, more and more people are bringing generative AI into their work and daily lives.
Some have even begun trying to train AI on data generated by AI itself. Is this the legendary "data perpetual motion machine"?
However, researchers from Oxford, Cambridge, Imperial College London, and other institutions have found that heavy use of AI-generated content in training causes model collapse, leaving irreversible defects.
That is, over time the model forgets the true underlying data distribution, and even under nearly ideal conditions of long-term learning this outcome is unavoidable.
The researchers therefore urge that, if we want large models trained at scale to retain their advantage, text written by humans must be taken seriously.
Paper address: https://arxiv.org/abs/2305.17493v2
The problem, however, is that what you think of as "human data" may not actually have been written by humans.
According to a new study from the Swiss Federal Institute of Technology in Lausanne (EPFL), an estimated one-third or more of such "human" data is actually generated with the help of AI.
Training data is all rubbish
There is no doubt that today's large language models have become remarkably capable; GPT-4, for example, can generate text that is indistinguishable from human writing in some scenarios.
An important reason behind this is that most of their training data comes from decades of human communication on the Internet.
If future language models continue to rely on data crawled from the web, they will inevitably pull their own generated text back into the training set.
Based on this, the researchers predict that by the time GPT reaches its n-th generation, the model will suffer severe collapse.
In a situation where scraping LLM-generated content is unavoidable, preparing genuinely human-produced data for model training becomes especially important.
Amazon's well-known crowdsourcing platform Mechanical Turk (MTurk) has been a source of side income for many people since its launch in 2005.
On it, researchers can post all kinds of small human intelligence tasks, such as labeling images or filling out surveys.
These tasks are often beyond what computers and algorithms can handle, and MTurk has become the go-to choice for researchers and companies with limited budgets.
Even Bezos jokingly called MTurk's crowd workers "artificial artificial intelligence".
Beyond MTurk, crowdsourcing platforms such as Prolific have become central to the work of researchers and industry practitioners, providing ways to create, label, and summarize all kinds of data for surveys and experiments.
However, the EPFL study found that nearly half of this key source of "human" data is actually created by annotators using AI.
Paper address: https://arxiv.org/abs/2306.07899v1
Model collapse
The "model collapse" mentioned at the start refers to a degenerative process that can affect many generations of models once too much AI-generated data is fed into training.
That is, each new generation of models is trained on data contaminated by the output of the previous generation, and so develops a distorted perception of the real world.
Such collapse can also amplify discrimination along gender, racial, or other sensitive attributes: over time, a model may learn to generate only one race in its responses while "forgetting" that others exist.
Moreover, beyond large language models, model collapse also occurs in variational autoencoders (VAEs) and Gaussian mixture models.
Importantly, model collapse is different from catastrophic forgetting: the model does not forget previously learned data; rather, it begins to mistake its own erroneous outputs for reality and reinforces its belief in those errors.
For example, suppose a model is trained on a dataset of 100 cat images containing 10 blue-furred cats and 90 yellow-furred cats.
The model concludes that yellow cats are more common and tends to render blue cats as more yellowish than they really are, so when asked to generate new data it may return something like green cats.
Over successive rounds of training, the blue trait is gradually eroded, drifting from blue to green and finally to yellow. This gradual distortion and loss of minority features is model collapse.
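A toy simulation can make the cat example concrete. The sketch below (purely illustrative, not taken from either paper) repeatedly fits a categorical distribution to its own samples; once the minority "blue" class fails to appear in one generation's sample, it is gone for good.

```python
import random
from collections import Counter

random.seed(0)

# Generation 0: the real data, 90 yellow cats and 10 blue cats.
data = ["yellow"] * 90 + ["blue"] * 10

for generation in range(1, 11):
    counts = Counter(data)
    total = sum(counts.values())
    # "Train" the model: estimate the class probabilities from the data.
    probs = {color: counts[color] / total for color in ("yellow", "blue")}
    # The next generation is trained only on data the model itself generates.
    data = random.choices(["yellow", "blue"],
                          weights=[probs["yellow"], probs["blue"]],
                          k=100)
    print(f"generation {generation}: blue share = {probs['blue']:.2f}")
```

Run for enough generations, the blue share fluctuates and sooner or later hits zero, after which no later generation can ever recover it.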
Specifically, model collapse unfolds in two stages:
1. Early model collapse: the model begins to lose information about the tails of the distribution.
2. Late model collapse: the model entangles the different modes of the original distribution and converges to a distribution that bears little resemblance to the original, often with very small variance.
The researchers also identified two main causes of model collapse.
More often than not, the two combine into a cascade, where individual inaccuracies compound and drive up the overall error.
1. Statistical approximation error
At every resampling step, events with non-zero probability can be lost simply because the sample is finite, producing a statistical approximation error. This error vanishes as the number of samples tends to infinity, and it is the primary cause of model collapse.
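A minimal sketch of this effect (an illustrative toy, not the paper's experimental setup): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. With a small sample size, the estimated standard deviation performs a random walk and typically shrinks over many generations, thinning out the tails first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: the "real" data, a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(1, 51):
    # "Train" the model: estimate mean and standard deviation from the data.
    mu, sigma = data.mean(), data.std()
    # The next generation sees only samples drawn from the fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

The exact trajectory depends on the random seed, but because each fit is based on only 50 samples, the information lost at each step accumulates rather than averaging out.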
2. Functional approximation error
This error arises mainly because the function approximator in the model is not expressive enough, or occasionally because it is so expressive that it extends beyond the support of the original distribution.
Neural networks are universal function approximators only in the limit; in practice this assumption does not always hold, and in particular a neural network can assign non-zero likelihood outside the support of the original distribution.
For example, if we try to fit a mixture of two Gaussians with a single Gaussian, the model error is unavoidable even if the model has perfect information about the data distribution.
Note that in the absence of statistical error, functional approximation error occurs only at the first generation: once the new distribution lies within what the function approximator can represent, it stays exactly the same across all subsequent generations.
A model's strong approximation ability is thus a double-edged sword: its expressiveness can cancel out statistical noise and fit the true distribution better, but it can just as easily compound that noise.
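The single-Gaussian example above can be checked in a few lines of NumPy (again purely illustrative, not from the paper). Fitting one Gaussian to an equal mixture of two well-separated Gaussians by maximum likelihood puts much of the model's probability mass in the gap between the two modes, a region the real data almost never occupies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Real data: an equal mixture of N(-3, 1) and N(+3, 1).
component = rng.integers(0, 2, size=10_000)
data = np.where(component == 0,
                rng.normal(-3.0, 1.0, size=10_000),
                rng.normal(+3.0, 1.0, size=10_000))

# A single-Gaussian "model" can only match the overall mean and variance.
mu, sigma = data.mean(), data.std()
print(f"fitted single Gaussian: mu={mu:+.2f}, sigma={sigma:.2f}")

# Compare how much probability mass falls near zero, between the two modes.
near_zero_real = np.mean(np.abs(data) < 1.0)
model_samples = rng.normal(mu, sigma, size=10_000)
near_zero_model = np.mean(np.abs(model_samples) < 1.0)
print(f"mass within |x| < 1: real = {near_zero_real:.2f}, "
      f"model = {near_zero_model:.2f}")
```

Here the real data places only about 2% of its mass near zero, while the fitted single Gaussian places roughly a quarter of its mass there, and this mismatch persists no matter how much data the model sees.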
Co-author Ilia Shumailov put it this way: "Errors in generated data accumulate, ultimately forcing models that learn from generated data to misperceive reality even further. And model collapse happens so quickly that models rapidly forget most of the original data they first learned from."
The good news is that the researchers found ways to avoid model collapse.
The first is to keep a high-quality copy of the original, fully or nominally human-generated dataset, avoid mixing it with AI-generated data, and periodically retrain the model on it, or even retrain from scratch.
The second way to avoid degraded response quality and to reduce errors and repetition in AI models is to reintroduce new, clean, human-generated datasets into training.
To prevent collapse, developers also need to ensure that minority groups in the original data remain fairly represented in subsequent datasets.
The data must be carefully backed up and must cover all possible edge cases; when evaluating model performance, one must account for the data the model will actually have to process, including the least trusted data.
When retraining, both old and new data should be included; this raises the cost of training, but it helps, at least to some extent, to mitigate collapse.
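The snippet below is a rough, hypothetical sketch of such a curation step (the function name, arguments, and ratios are made up for illustration and are not prescribed by the paper): a preserved human-written corpus is mixed with newly collected data at a chosen ratio before retraining.

```python
import random

def build_retraining_set(pristine_human_data, newly_crawled_data,
                         human_fraction=0.5, seed=0):
    """Mix a preserved human-written corpus with newly collected data.

    human_fraction is the share of the final set that should come from
    the pristine human copy; the right value is a policy choice, not a
    known rule.
    """
    assert 0.0 < human_fraction < 1.0
    rng = random.Random(seed)
    # How many human examples are needed to reach the requested mix ratio.
    n_human = int(len(newly_crawled_data) * human_fraction
                  / (1.0 - human_fraction))
    n_human = min(n_human, len(pristine_human_data))
    mixed = (rng.sample(list(pristine_human_data), n_human)
             + list(newly_crawled_data))
    rng.shuffle(mixed)
    return mixed
```

For example, build_retraining_set(human_corpus, new_crawl, human_fraction=0.3) would yield a training set in which roughly 30% of the examples come from the preserved human data, provided the pristine copy is large enough.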
However, these approaches require content producers or AI companies to adopt some large-scale labeling mechanism to distinguish AI-generated content from human-generated content.
There are already some off-the-shelf detectors, such as GPTZero, OpenAI Detector, and Writer, which work well on simple text.
On certain kinds of text, however, these tools fall short: of 10 ChatGPT-synthesized summaries in the EPFL study, GPTZero detected only 6.
The EPFL researchers therefore fine-tuned their own model to detect LLM use, and found that ChatGPT was the most commonly used LLM at the time of writing.
To build this detector, they trained a task-specific "synthetic vs. real" classifier on answers from the original human-written study together with data synthesized using ChatGPT.
The classifier was then used to estimate how prevalent synthetic answers were in the re-run task.
Specifically, the researchers first trained the synthetic-vs-real classifier for the task using MTurk responses genuinely written by humans and responses generated by an LLM.
Second, they applied the classifier to real MTurk responses (where crowd workers may or may not have relied on an LLM) to estimate how widespread LLM use was.
Finally, they validated the results afterwards by comparing keystroke data against the MTurk responses.
Experiments showed that the classifier identified AI-generated text with up to 99% accuracy.
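The paper's classifier is a fine-tuned model; as a much simpler stand-in (an assumption for illustration, not the authors' implementation), the same two-step procedure can be sketched with scikit-learn using TF-IDF features and logistic regression.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_synthetic_vs_real(human_texts, synthetic_texts):
    """Step 1: train on responses whose origin is known.

    human_texts are answers written by humans in the original study;
    synthetic_texts are answers generated with an LLM for the same prompts.
    """
    texts = list(human_texts) + list(synthetic_texts)
    labels = [0] * len(human_texts) + [1] * len(synthetic_texts)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf

def estimate_llm_share(clf, new_responses):
    """Step 2: estimate what share of fresh crowd responses look synthetic."""
    predictions = clf.predict(new_responses)
    return sum(predictions) / len(predictions)
```

A simple bag-of-words model like this would not reach the reported 99% accuracy, but it shows the shape of the procedure: learn the distinction on labeled examples, then apply it to responses of unknown origin.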
In addition, the keystroke data confirmed the classifier's results:
- Summaries typed entirely inside the MTurk text box (and therefore unlikely to be synthetic) were classified as real.
- Among pasted summaries, there was a clear difference between extractive summaries and those produced with an LLM.
Notably, the AI-generated text usually bore little resemblance to the original summaries, indicating that the models were generating new text rather than copying and pasting parts of the original content.
"Human data" is very important. now, there is a widespread concern that LLM will shape the human "information ecosystem", that is, most of the information available online is generated by LLM.
LLMs trained on synthetically generated data suffer a marked drop in performance; as Ilia Shumailov puts it, the models develop "dementia".
The problem will only get worse: as LLMs grow more popular, crowd workers are making ever wider use of tools such as ChatGPT.
For the human content creators themselves, that is good news, since they earn money while boosting their productivity.
But saving LLMs from the brink of collapse requires genuinely "human" data.
1. Human data remains crucial in science.
2. Training models on synthetic data can entrench bias and ideology.
3. As models become more popular and more capable / multimodal, adoption will only increase.
In general, raw human-generated data represents the world more faithfully; it may contain some low-quality, low-probability data, but generative models tend to overfit to popular data and misrepresent the lower-probability parts of the distribution.
In a future full of generative AI tools and their output, human-generated content may therefore be even more valuable than it is today, especially as a source of raw training data for AI.
References:
https://arxiv.org/abs/2306.07899v1
https://arxiv.org/abs/2305.17493v2
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).