Shulou(Shulou.com)11/24 Report--
2022 can fairly be called the breakout year of generative AI. Recently, Philip S. Yu's team published a comprehensive survey of AIGC that traces its development from GAN to ChatGPT.
The year 2022 was undoubtedly the singularity of the generative AI explosion.
Since 2021, generative AI has appeared in Gartner's Hype Cycle for Artificial Intelligence for two consecutive years and is regarded as an important AI technology trend for the future.
Against this backdrop, Philip S. Yu's team recently published a comprehensive survey of AIGC covering the history of its development from GAN to ChatGPT.
Paper address: https://arxiv.org/pdf/2303.04226.pdf
This article excerpts and introduces some of the paper's contents.
Has the singularity arrived?
In recent years, artificial intelligence generated content (AIGC, also known as generative AI) has attracted widespread attention well beyond the computer science community.
Society at large has begun to take a keen interest in the content generation products built by large technology companies, such as ChatGPT and DALL-E-2.
AIGC refers to content produced with generative artificial intelligence (GAI) techniques, which can automatically create large volumes of content in a short time.
ChatGPT is a conversational AI system developed by OpenAI that can effectively understand and respond to human language in a meaningful way.
In addition, DALL-E-2 is another state-of-the-art GAI model developed by OpenAI that creates unique high-quality images from text descriptions in minutes.
[Figure: an example of AIGC in image generation]
Technically speaking, AIGC refers to using GAI, guided by a given instruction, to generate content that satisfies that instruction. The generation process usually involves two steps: extracting intent information from the instruction, and generating content according to the extracted intent.
However, as previous studies have shown, this two-step GAI paradigm is not entirely novel.
Compared with earlier work, the core of recent AIGC advances lies in training more sophisticated generative models on larger datasets, using larger foundation model architectures, and having access to far greater computing resources.
For example, GPT-3 keeps the same main framework as GPT-2, but its pre-training data grows from WebText (38 GB) to a filtered CommonCrawl (570 GB), and its model size grows from 1.5B to 175B parameters.
Therefore, GPT-3 has better generalization ability than GPT-2 in various tasks.
In addition to the benefits of increased data volume and computing power, researchers are also exploring ways to combine new technologies with GAI algorithms.
For example, ChatGPT uses reinforcement learning from human feedback (RLHF) to determine the most appropriate response to a given instruction, improving the model's reliability and accuracy over time. This approach allows ChatGPT to better understand human preferences in long conversations.
Meanwhile, in computer vision (CV), Stable Diffusion, released by Stability AI in 2022, has also achieved great success in image generation.
Unlike previous methods, generative diffusion models can produce high-resolution images by controlling the trade-off between exploration and exploitation, striking a balance between diversity in the generated images and similarity to the training data.
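As a rough illustration of how such diffusion models generate images, the following minimal sketch shows a DDPM-style reverse (denoising) loop; eps_model, the noise schedule, and all shapes are hypothetical placeholders rather than Stable Diffusion's actual implementation (which additionally works in a latent space and conditions on text).

import torch

# Minimal DDPM-style sampling sketch. `eps_model` is a hypothetical network
# that predicts the noise added at step t; it is not the real Stable Diffusion model.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    x = torch.randn(shape)                        # start from pure Gaussian noise
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, t)                     # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t]) + torch.sqrt(betas[t]) * z
    return x                                      # a generated image tensor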
By combining these advances, models have made significant progress on AIGC tasks and have been adopted in industries as varied as art, advertising, and education.
In the near future, AIGC will continue to be an important research field of machine learning.
Generally speaking, GAI models can be divided into two types: unimodal models and multimodal models.
It is therefore important to review past research comprehensively and identify the open problems in this field. This survey is the first to focus on the core technologies and applications of AIGC, and the first to summarize GAI comprehensively from the perspectives of both techniques and applications.
Previous surveys approached GAI from particular perspectives, including natural language generation, image generation, and generation in multimodal machine learning. However, these works covered only specific parts of AIGC.
In this survey, we first review the basic techniques commonly used in AIGC. We then provide a comprehensive summary of advanced GAI algorithms, covering both unimodal and multimodal generation. In addition, the paper examines the applications and potential challenges of AIGC.
Finally, future directions for the field are highlighted. In summary, the main contributions of this paper are as follows:
To the best of our knowledge, we are the first to provide a formal definition and a comprehensive survey of AIGC and the AI-enhanced generation process.
We review the history and foundational techniques of AIGC, and comprehensively analyze the latest progress in GAI tasks and models from the perspectives of unimodal and multimodal generation.
This paper discusses the main challenges and future research trends of AIGC.
The history of generative AI
Generative models have a long history in artificial intelligence, dating back to the development of hidden Markov models (HMMs) and Gaussian mixture models (GMMs) in the 1950s.
These models generate sequential data such as speech and time series. However, it was not until the emergence of deep learning that the performance of generative models improved significantly.
Early deep generative models in different areas typically had little overlap with one another.
[Figure: the development history of generative AI in CV, NLP, and VL]
In NLP, the traditional way to generate a sentence is to learn a word distribution with an N-gram language model and then search for the best sequence. However, this method does not adapt well to long sentences.
To address this problem, recurrent neural networks (RNNs) were later introduced for language modeling, allowing relatively long dependencies to be modeled.
This was followed by the development of long short-term memory (LSTM) and gated recurrent units (GRU), which use gating mechanisms to control memory during training. These methods can handle roughly 200 tokens per sample, a marked improvement over N-gram language models.
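As a small illustration of this gated recurrent approach, the sketch below runs a PyTorch LSTM language-model head over a sequence of roughly 200 tokens; the vocabulary size, dimensions, and random input are hypothetical placeholders, not values from the paper.

import torch
import torch.nn as nn

# Illustrative LSTM language model over a ~200-token sample (placeholder sizes).
vocab_size, embed_dim, hidden_dim, seq_len = 10000, 128, 256, 200
tokens = torch.randint(0, vocab_size, (1, seq_len))        # one sample of 200 token ids

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)    # gates control what the memory keeps
head = nn.Linear(hidden_dim, vocab_size)

hidden_states, _ = lstm(embed(tokens))
next_token_logits = head(hidden_states)                    # next-token prediction at each position
print(next_token_logits.shape)                             # torch.Size([1, 200, 10000])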
Meanwhile, in CV, before deep-learning-based methods emerged, traditional image generation algorithms relied on techniques such as texture synthesis and texture mapping.
These algorithms are based on hand-designed features and have limited ability to generate complex and diverse images.
In 2014, generative adversarial networks (GANs) were first proposed, a milestone in the field of artificial intelligence thanks to their impressive results in a variety of applications.
Variational autoencoders (VAEs) and other methods, such as generative diffusion models, were subsequently developed to give finer-grained control over the image generation process and to produce high-quality images.
Generative models in different areas developed along different paths, but eventually converged on a common intersection: the Transformer architecture.
Introduced for NLP tasks by Vaswani et al. in 2017 and later applied to CV, the Transformer became the dominant backbone for many generative models across fields.
In NLP, many well-known large language models, such as BERT and GPT, use the Transformer architecture as their main building block, which offers advantages over earlier building blocks such as LSTM and GRU.
In CV, Vision Transformer (ViT) and Swin Transformer later took this concept further, combining the Transformer architecture with visual components so that it could be applied to image-based downstream tasks.
Beyond the improvements the Transformer brought to individual modalities, this convergence also allowed models from different fields to be fused for multimodal tasks.
One example of a multimodal model is CLIP, a joint vision-language model that combines the Transformer architecture with visual components and can be trained on vast amounts of text and image data.
Because it combines visual and linguistic knowledge during pre-training, CLIP can also serve as an image encoder in multimodal prompting for generation. In short, the emergence of Transformer-based models has reshaped generative AI and made large-scale training possible.
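The sketch below illustrates the contrastive objective behind CLIP-style joint vision-language training: matching image and text embeddings are pulled together while mismatched pairs are pushed apart. The encoders are omitted and the embeddings are random placeholders; this is a sketch of the idea, not OpenAI's actual implementation.

import torch
import torch.nn.functional as F

# CLIP-style symmetric contrastive loss over a batch of paired image/text embeddings.
def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # cosine similarities, scaled
    targets = torch.arange(logits.size(0))               # matching pairs lie on the diagonal
    # Cross-entropy in both directions: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))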
In recent years, researchers have also begun to introduce new technologies based on these models.
For example, in NLP, people often prefer few-shot prompting, which includes a handful of examples selected from the dataset in the prompt, to help the model better understand the task requirements.
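A few-shot prompt might look like the following sketch (the task and examples are invented for illustration, not taken from the paper):

# A hypothetical few-shot prompt: a short task description followed by a few
# labeled examples, ending with the new input the model should complete.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day. -> Positive
Review: The screen cracked within a week. -> Negative
Review: Setup was quick and painless. ->"""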
In vision-language research, modality-specific models are often combined with self-supervised contrastive learning objectives to provide more powerful representations.
In the future, as AIGC grows in importance, more and more techniques will be introduced, giving the field great vitality.
Basics of AIGC
In this section, we introduce the foundation models commonly used in AIGC.
Foundation models: Transformer
The Transformer is the backbone architecture of many state-of-the-art models, such as GPT-3, DALL-E-2, Codex, and Gopher.
It was first proposed to address the limitations of traditional models, such as RNNs, in dealing with variable length sequences and context awareness.
The Transformer architecture is built mainly on a self-attention mechanism, which allows the model to attend to different parts of the input sequence.
Transformer consists of an encoder and a decoder. The encoder receives the input sequence and generates a hidden representation, while the decoder receives the hidden representation and generates an output sequence.
Each encoder and decoder layer consists of a multi-head attention module and a feed-forward neural network. Multi-head attention is the core component of the Transformer; it learns to assign different weights to tokens according to their relevance.
This information routing method enables the model to better handle long-term dependencies, thus improving performance in a wide range of NLP tasks.
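A minimal sketch of the scaled dot-product self-attention underlying this routing is shown below (single head, random weights; a real multi-head layer runs several such heads in parallel and concatenates their outputs):

import torch
import torch.nn.functional as F

# Scaled dot-product self-attention for one head (illustrative sizes).
def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # relevance of each token to every other
    weights = F.softmax(scores, dim=-1)                       # attention weights sum to 1 per token
    return weights @ v                                        # route information between positions

seq_len, d_model = 10, 64
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                        # shape: (10, 64)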
Another advantage of the Transformer is that its architecture is highly parallelizable and lets data outweigh inductive biases. This makes the Transformer well suited to large-scale pre-training, allowing Transformer-based models to adapt to different downstream tasks.
Since the introduction of the Transformer architecture, pre-trained language models have become the mainstream choice in natural language processing thanks to their parallelism and learning capacity.
Generally speaking, Transformer-based pre-trained language models fall into two categories according to their training task: autoregressive language models and masked language models.
Given a sentence consisting of multiple tokens, masked language models such as BERT and RoBERTa predict the probability of the masked tokens given the surrounding context.
The most prominent masked language model is BERT, whose pre-training includes masked language modeling and a next-sentence prediction task. RoBERTa uses the same architecture as BERT and improves performance by increasing the amount of pre-training data and adopting more challenging pre-training objectives.
XLNet also builds on BERT, incorporating permutation operations that change the prediction order in each training iteration so that the model can learn more information across tokens.
Autoregressive language models, such as GPT-3 and OPT, model the probability of each token given the previous tokens, making them left-to-right language models. Unlike masked language models, autoregressive language models are better suited to generative tasks.
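The contrast between the two pre-training objectives can be sketched as follows (the tokens are hypothetical word pieces, not the output of a real tokenizer):

# Masked language modeling (BERT-style): mask a token and predict it from both sides.
masked_input  = ["the", "[MASK]", "generates", "text"]
masked_target = "model"            # p("model" | "the", [MASK], "generates", "text")

# Autoregressive language modeling (GPT-style): predict each token from its left context only.
prefix      = ["the", "model", "generates"]
next_target = "text"               # p("text" | "the", "model", "generates")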
Reinforcement learning from human feedback
Although trained on large amounts of data, AIGC may not always produce output consistent with the user's intent.
To make AIGC output better aligned with human preferences, reinforcement learning from human feedback (RLHF) has been applied to fine-tune models in a variety of applications, such as Sparrow, InstructGPT, and ChatGPT.
In general, the RLHF pipeline consists of three steps: pre-training, reward learning, and fine-tuning with reinforcement learning.
In recent years, hardware technology has made remarkable progress, facilitating the training of large models.
In the past, training a large neural network on CPUs could take days or even weeks. With the growth in computing power, however, this process has been accelerated by several orders of magnitude.
For example, NVIDIA's A100 GPU is 7 times faster than the V100 and 11 times faster than the T4 for BERT-Large inference.
In addition, Google's Tensor Processing Units (TPUs), designed specifically for deep learning, provide even higher computing performance than the A100 GPU.
The accelerated progress of computing power significantly improves the efficiency of artificial intelligence model training, which provides a new possibility for the development of large-scale complex models.
Another major improvement is distributed training.
In traditional machine learning, training is usually done on a single machine using a single processor. This method can be well applied to small datasets and models, but it becomes impractical when dealing with large datasets and complex models.
In distributed training, the training tasks are distributed to multiple processors or machines, which greatly improves the training speed of the model.
Some companies have also released frameworks that simplify distributed training for deep learning stacks. These frameworks provide tools and APIs that let developers spread training tasks across multiple processors or machines without managing the underlying infrastructure.
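As one concrete example of such a framework (the article does not name a specific one), the sketch below uses PyTorch's DistributedDataParallel; the tiny model and random data are placeholders, and the script would be launched with torchrun so each process handles its own share of the batches.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal data-parallel training sketch: every process trains the same model on
# its own batches, and gradients are averaged across processes automatically.
def main():
    torch.distributed.init_process_group(backend="gloo")   # "nccl" on GPU clusters
    model = DDP(nn.Linear(32, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                                     # gradient all-reduce happens here
        optimizer.step()

    torch.distributed.destroy_process_group()

if __name__ == "__main__":
    main()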
Cloud computing also plays a vital role in training large models. In the past, models were often trained locally. Now, as cloud computing services such as AWS and Azure provide access to powerful computing resources, deep learning researchers and practitioners can create large GPU or TPU clusters for large model training as needed.
In general, these advances have made it possible to develop more complex and accurate models, opening up new possibilities in all areas of artificial intelligence research and application.
About the author: Philip S. Yu is a computer scientist, an ACM/IEEE Fellow, and a Distinguished Professor in the Department of Computer Science at the University of Illinois at Chicago (UIC).
He has made remarkable achievements in the theory and practice of big data mining and management. Addressing the challenges of big data in scale, velocity, and variety, he has proposed effective, cutting-edge solutions in data mining and management methods and techniques, with breakthrough contributions especially in fusing diverse data and in mining data streams, frequent patterns, subspaces, and graphs.
He has also made pioneering contributions to parallel and distributed database processing technology, applying them to the IBM S/390 Parallel Sysplex system and helping transform traditional IBM mainframes into parallel microprocessor architectures.
Reference:
https://arxiv.org/pdf/2303.04226.pdf
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).