How to use a variational autoencoder (VAE) to generate anime characters


Many newcomers are unclear about how to use a variational autoencoder (VAE) to generate anime characters. To help with that, the following article explains the topic in detail; readers who need it are welcome to follow along, and I hope you gain something from it.

Variational autoencoders (VAE) and generative adversarial networks (GAN) are often compared with each other, and the former is applied to image generation far less widely than the latter. Does a VAE only produce meaningful output on the MNIST dataset? In this article, the author uses a VAE to automatically generate anime character avatars, with good results.

Above: a sample of anime images generated by the variational autoencoder.

In the field of image generation, people like to compare the variational autoencoder (VAE) with the generative adversarial network (GAN). The consensus is that a VAE is easier to train and makes explicit distributional assumptions (Gaussian) for both the latent representation and the observations, while a GAN captures the distribution of the observations better and makes no assumptions about it. As a result, everyone believes that only GAN can create clear and vivid pictures. While this may be true in theory, because GAN captures correlations between pixels, not many people have actually tried training a VAE on images larger than the 28 × 28 MNIST images.

There are plenty of variational autoencoder (VAE) implementations on the MNIST dataset, but few people try anything different on other datasets. Is this because the original variational autoencoder paper only uses MNIST as an example?

MythBusters!

Now, let's do a "MythBusters" practice to see how unsatisfactory the VAE image generator is. For example, the following images.

A blurry sample from a VAE.

Let's start by finding some GAN baselines for comparison. I searched Google for "GAN applications" and found a very interesting GitHub repository that collects GAN applications: https://github.com/nashory/gans-awesome-applications

Why is "GAN application" all right? Well, it's hard to find a GAN application that isn't generated by images, isn't it? In order to make this practice more exciting, we will try to generate models to output some anime images this time!

First, let's see how well a GAN model can accomplish this task. The following two groups of pictures come from two anime-image-generation projects that many people have starred and built upon:

1) https://github.com/jayleicn/animeGAN

2) https://github.com/tdrussell/IllustrationGAN

Not bad, is it? I like the colors; they are very close to real illustrations.

Although there is some ghosting in these pictures, they look even better. I guess the trick is to enlarge the images and only look at the faces.

These results show how impressive GAN's performance is, which puts a lot of pressure on me.

Uh... Should we continue?

Where do you get the data?

Unfortunately, there is no standard anime image dataset available on the Internet. But that doesn't stop people like me from looking. After browsing through some GitHub repositories, I picked up some hints:

A Japanese website called Getchu has a large number of anime pictures.

You need some tool to download the pictures, but you will have to find one yourself (offering one here might be illegal).

There are many pre-trained anime face detectors (U-net / RCNN based models, or cascades such as lbpcascade_animeface) that let you crop the faces out as 64 × 64 images.
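
For illustration only, here is a minimal face-cropping sketch using OpenCV with the lbpcascade_animeface cascade. This is not the author's actual pipeline; the file names, function name, and detection parameters are my own assumptions.

# Hypothetical sketch: crop anime faces to 64x64 using OpenCV and the
# lbpcascade_animeface.xml cascade file (downloaded separately); paths are placeholders.
import cv2

cascade = cv2.CascadeClassifier('lbpcascade_animeface.xml')

def extract_faces(image_path, out_size=64):
    img = cv2.imread(image_path)
    # The cascade works on an equalized grayscale image
    gray = cv2.equalizeHist(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(24, 24))
    crops = []
    for (x, y, w, h) in faces:
        # Crop each detected face and resize it to the target resolution
        crops.append(cv2.resize(img[y:y + h, x:x + w], (out_size, out_size)))
    return crops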

Variational autoencoder (VAE)

This article assumes you have already read some posts about variational autoencoders. If you haven't, I recommend the following articles:

Intuitively Understanding Variational Autoencoders (https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf)

Tutorial-What is a variational autoencoder? (https://jaan.io/what-is-variational-autoencoder-vae-tutorial/)

Introducing Variational Autoencoders (in Prose and Code) (http://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html)

Compare two generation models in TensorFlow: VAE and GAN

So, once you know what a VAE is and how to implement it, the next question is: "Is knowing the objective function and the implementation enough to train a variational autoencoder?" I think the answer is yes, but it is not as simple as usually described, for example where the objective function comes from and what the KL divergence component does here. In this post, I will try to explain the secrets behind VAE.

Variational inference is a technique for inferring complex distributions in probabilistic graphical models (PGM). Intuitively, if you cannot work with a complex distribution directly, you can use a simple distribution such as a Gaussian to approximate its upper or lower bound. For example, the figure below shows how a Gaussian approximation can be used to find a local optimum.

Picture from: https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html

Please ignore the EM (expectation-maximization) algorithm in the title. It is a classic optimization method for probabilistic graphical models that updates the variational lower bound, but in deep learning you will use stochastic gradient descent (SGD) instead.

KL divergence is another very important concept used in probabilistic graphical models. It measures the difference between two distributions. It is not a distance metric, because KL[Q||P] is not equal to KL[P||Q]. The following slide shows this difference.

Picture from: https://www.slideshare.net/Sabhaology/variational-inference

Obviously, KL[Q||P] does not allow P = 0 where Q > 0. In other words, when minimizing KL[Q||P], you want Q to capture some modes of the distribution P, but you accept the risk of ignoring other modes. Conversely, KL[P||Q] does not allow Q = 0 where P > 0. In other words, when minimizing KL[P||Q], you want Q to cover the entire distribution P, even if it has to spread probability mass between the modes of P.
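
The slide itself is not reproduced here; for reference, the two directions are defined as

$$\mathrm{KL}[Q\,\|\,P] = \int Q(z)\,\log\frac{Q(z)}{P(z)}\,dz, \qquad \mathrm{KL}[P\,\|\,Q] = \int P(z)\,\log\frac{P(z)}{Q(z)}\,dz.$$

The first blows up wherever Q > 0 but P = 0 (mode-seeking behavior), and the second blows up wherever P > 0 but Q = 0 (mass-covering behavior), which is exactly the asymmetry described above.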

So far, we have intuitively understood two facts:

"variation" is roughly an approximation of the upper or lower bound.

"KL" measures the difference between two divisions.

Now let's go back and see how the objective function of VAE was obtained.

This is my derivation of the VAE objective. Although it may look different from what you see in the paper, I think it is the easiest derivation to understand.

Given some images as training data, we want to fit parameters (theta) that represent the training data as well as possible. Formally, we want to fit the model by maximizing the joint probability of the observations. That gives the log-likelihood on the left-hand side of the derivation sketched below.
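
The figure with the original derivation is not reproduced here; in standard notation, the chain of reasoning that the following questions refer to is roughly:

$$
\begin{aligned}
\log P_\theta(x) &= \log \int P_\theta(x \mid z)\, P(z)\, dz
 = \log \mathbb{E}_{Q_\phi(z \mid x)}\!\left[\frac{P_\theta(x \mid z)\, P(z)}{Q_\phi(z \mid x)}\right] \\
&\ge \mathbb{E}_{Q_\phi(z \mid x)}\!\left[\log \frac{P_\theta(x \mid z)\, P(z)}{Q_\phi(z \mid x)}\right]
 = \mathbb{E}_{Q_\phi(z \mid x)}\big[\log P_\theta(x \mid z)\big] - \mathrm{KL}\big[Q_\phi(z \mid x)\,\|\,P(z)\big] \\
&\approx \frac{1}{L}\sum_{l=1}^{L} \log P_\theta\big(x \mid z^{(l)}\big) - \mathrm{KL}\big[Q_\phi(z \mid x)\,\|\,P(z)\big],
 \qquad z^{(l)} \sim Q_\phi(z \mid x)
\end{aligned}
$$

The questions below walk through this chain: the latent variable z enters through the marginalization in the first line, the inequality comes from Jensen's inequality, and the last line is a Monte Carlo approximation of the expectation.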

Where does "z" come from?

z is the latent representation behind the observations (images). Intuitively, we assume that some mysterious painters create the images x in the dataset, and we call them z. Moreover, z is uncertain: sometimes painter No. 1 created a picture, and sometimes painter No. 2 did. We only know that every painter has particular preferences for the pictures they draw.

Where did the greater than or equal sign come from?

Jensen's inequality is shown below. Note that log is a concave function, so in our case the direction of the inequality is flipped.

Picture from Youtube: https://www.youtube.com/watch?v=10xgmpG_uTs
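
In symbols (the standard statement, independent of the slide above): for the concave function log and any positive random variable X,

$$\log \mathbb{E}[X] \;\ge\; \mathbb{E}[\log X],$$

which is what turns the marginal log-likelihood above into a lower bound.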

Why take the approximation in the last line?

We cannot integrate over the infinitely many possible values of z, so we use a numerical approximation: we sample from the distribution to approximate the expectation.

What is the P(x|z) distribution?

In the variational autoencoder, we assume it is a Gaussian distribution. This is why you end up optimizing a mean squared error (MSE) when training a VAE.

The function f is the decoder! Oh, and there should be a square on the norm.
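
Written out (my reconstruction of the formula this caption refers to), the negative Gaussian log-likelihood of one image is

$$-\log P(x \mid z) = \frac{\|x - f_\theta(z)\|^2}{2\sigma^2} + D \log \sigma + \mathrm{const},$$

where f_θ is the decoder, σ is the observation standard deviation and D is the data dimension. With σ fixed, this is just a scaled mean squared error, which is (up to constant factors) what the snippet below computes.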

@staticmethod
def _gaussian_log_likelihood(targets, mean, std):
    # Negative Gaussian log-likelihood: squared error scaled by the observation std
    se = 0.5 * tf.reduce_sum(tf.square(targets - mean)) / (2 * tf.square(std)) + tf.log(std)
    return se

@staticmethod
def _bernoulli_log_likelihood(targets, outputs, eps=1e-8):
    # Negative Bernoulli log-likelihood (binary cross-entropy); eps avoids log(0)
    log_like = -tf.reduce_sum(targets * tf.log(outputs + eps)
                              + (1. - targets) * tf.log((1. - outputs) + eps))
    return log_like

The assumptions for P(x|z): Gaussian and Bernoulli distributions. The code computes the negative log-likelihood, because in deep learning we minimize a loss rather than explicitly maximize a likelihood.

The reason you see so many sigmoid outputs on GitHub is that for binary images like MNIST, we assume the distribution is a Bernoulli distribution.
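
Under the Bernoulli assumption, the negative log-likelihood is the familiar binary cross-entropy:

$$-\log P(x \mid z) = -\sum_i \big[x_i \log f_\theta(z)_i + (1 - x_i)\log\big(1 - f_\theta(z)_i\big)\big],$$

which is exactly what _bernoulli_log_likelihood above computes (the small eps only guards against log(0)).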

What about the P(z|x) distribution, or rather its approximation Q(z|x)?

It is assumed to be Gaussian as well. This is why the KL divergence term in the implementation has a closed-form (analytical) solution. Not following? Don't worry, you can read this answer: https://stats.stackexchange.com/questions/318184/kl-loss-with-a-unit-gaussian
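
For a diagonal Gaussian Q(z|x) = N(μ, σ²) and a standard normal prior, the KL term has the closed form

$$\mathrm{KL}\big[N(\mu, \sigma^2)\,\|\,N(0, I)\big] = \frac{1}{2}\sum_i \big(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\big),$$

which, with log_var = log σ², is exactly what the snippet below computes.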

@staticmethod
def _kl_diagnormal_stdnormal(mu, log_var):
    # Closed-form KL divergence between N(mu, var) and the standard normal N(0, I)
    var = tf.exp(log_var)
    kl = 0.5 * tf.reduce_sum(tf.square(mu) + var - 1. - log_var)
    return kl

The closed-form expression for the KL divergence, written in Python.

How does this equation become an autoencoder?

There are two sets of parameters in the equation. The parameters theta model the distribution P(x|z), which decodes z into the image x. The variational parameters phi model the distribution Q(z|x), which encodes x into the latent representation z.

A self-made schematic of the variational autoencoder. The green and blue parts are differentiable, and the amber part represents non-differentiable white noise. Everyone uses the famous cat pictures, so here I use a dog. I don't know where I got this lovely dog picture; if you know, please tell me so I can credit the original site properly.

The corresponding TensorFlow computation graph:

def _build_graph(self):
    with tf.variable_scope('vae'):
        self.x = tf.placeholder(tf.float32, shape=[None, self._observation_dim])

        with tf.variable_scope('encoder'):
            encoded = self._encode(self.x, self._latent_dim)

        with tf.variable_scope('latent'):
            # Split the encoder output into the mean and log-variance of Q(z|x)
            self.mean = encoded[:, :self._latent_dim]
            logvar = encoded[:, self._latent_dim:]
            stddev = tf.sqrt(tf.exp(logvar))
            epsilon = tf.random_normal([self._batch_size, self._latent_dim])
            # Reparameterization trick: z = mean + std * epsilon
            self.z = self.mean + stddev * epsilon

        with tf.variable_scope('decoder'):
            decoded = self._decode(self.z, self._observation_dim)
            self.obs_mean = decoded
            if self._observation_distribution == 'Gaussian':
                obs_epsilon = tf.random_normal([self._batch_size,
                                                self._observation_dim])
                self.sample = self.obs_mean + self._observation_std * obs_epsilon
            else:
                # Bernoulli: e.g. tf.distributions.Bernoulli
                self.sample = Bernoulli(probs=self.obs_mean).sample()

The meaning of the two components of the VAE objective function

Minimizing the KL term pulls the approximate posterior Q(z|x) toward N(0, I), the standard normal distribution. We want to generate images by sampling from a standard normal distribution, so it is best to make the latent distribution as close to the standard normal as possible.

Minimizing the reconstruction loss: create images that are as vivid / realistic as possible, i.e. minimize the error between the real image and the generated image.

It's easy to see that balancing these two parts is critical in order for VAE to work well.

If we completely ignore the KL term, the variational autoencoder degenerates into a standard autoencoder, which removes the stochastic part of the objective. The VAE then cannot generate new images; it only memorizes and reproduces the training data (or produces pure noise, because there is no encoded image at that location in latent space!). If you are lucky, the best you can hope for is something like kernel principal component analysis!

If we completely ignore the reconstruction term, the latent distribution collapses to the standard normal distribution, so whatever the input is, you always get similar output.

A degenerate case in GAN. The same can happen with VAE. Picture from: http://yusuke-ujitoko.hatenablog.com/entry/2017/05/30/011900

Now we understand:

We want the VAE to generate plausible images, but we don't want it to simply reproduce the training data.

We want to sample from the standard normal distribution, but we don't want to see the same image over and over again. We want the model to produce very different images.

So how do we balance them? We make the standard deviation of the observation a hyperparameter.

with tf.variable_scope('loss'):
    with tf.variable_scope('kl-divergence'):
        kl = self._kl_diagnormal_stdnormal(self.mean, logvar)

    if self._observation_distribution == 'Gaussian':
        with tf.variable_scope('gaussian'):
            # self._observation_std is a hyperparameter
            reconst = self._gaussian_log_likelihood(self.x,
                                                    self.obs_mean,
                                                    self._observation_std)
    else:
        with tf.variable_scope('bernoulli'):
            reconst = self._bernoulli_log_likelihood(self.x, self.obs_mean)

    # Total loss: KL term plus negative log-likelihood, averaged over the batch
    self._loss = (kl + reconst) / self._batch_size

I often see people weighting the KL term, as in 0.001 × KL + Reconstruction_Loss, which is incorrect! By the way, could that be why so many people only run VAE on MNIST datasets?
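
For reference, with the Gaussian observation model the per-example objective that the loss code above minimizes is (up to additive constants)

$$\mathcal{L} = \mathrm{KL}\big[Q_\phi(z \mid x)\,\|\,N(0, I)\big] + \frac{\|x - f_\theta(z)\|^2}{2\sigma^2},$$

so the observation standard deviation σ is the principled knob for trading the KL term off against the reconstruction term, rather than an arbitrary multiplier in front of the KL.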

What else should you watch out for? Model complexity is the other key factor alongside the loss function. If the decoder is too complex, even a weak loss cannot prevent it from overfitting, and as a result the latent distribution is ignored. If the decoder is too simple, the model cannot decode the latent representation properly and ends up capturing only rough contours, as we showed before.

Finally, if everything we've done above is right, it's time to look at the power of VAE.

Success!

Well, I admit, small pictures are unconvincing.

Zoom in a little bit.
