Many newcomers are not very clear about the loss functions commonly used in deep learning. To help with that, this article explains them in detail; readers who need it can follow along, and hopefully everyone takes something away.
Do you still remember how the BP algorithm updates the parameters w and b? When we give the network an input, it is multiplied by the initial value of w and passed through the activation function to produce an output. Subtracting the output from the label gives a difference, and back propagation is carried out according to that difference. This difference is generally called the loss, and the loss function is a function of the loss: loss function = F(loss), i.e. F. Let's also talk about a similar concept, the cost function. Note that this is not the cost function from economics.
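As a minimal sketch of this forward-then-backward flow (the single sigmoid neuron, squared loss, learning rate, and toy values below are my own assumptions for illustration, not anything prescribed in this article):

    import numpy as np

    # A single neuron: output = sigmoid(w * x + b)
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x, y = 0.5, 1.0   # one training sample (input, label), made-up values
    w, b = 0.1, 0.0   # initial parameters
    lr = 0.5          # learning rate

    for step in range(3):
        z = w * x + b
        a = sigmoid(z)               # forward pass
        loss = 0.5 * (a - y) ** 2    # the "difference" turned into a loss value
        # backward pass: chain rule gives the gradients of the loss w.r.t. w and b
        dz = (a - y) * a * (1 - a)
        w -= lr * dz * x
        b -= lr * dz
        print(f"step {step}: loss = {loss:.4f}")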
The first point is that the loss function is defined somewhat differently in machine learning and in deep learning; today we are talking about the loss functions commonly used in deep learning. So what is a loss function? As the name implies, "loss" is the sense that something is missing, and what is missing is the loss. Put more formally, the loss function measures the difference between the predicted value and the true value. One term is "loss function", and there is another, "cost function"; both are often translated the same way. I always thought they were the same concept, but after checking some material I found there are differences. First, let's look at how they are defined in Bengio's "Deep Learning":
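In its usual form, the definition in the book reads roughly:

    J(θ) = E_{(x, y) ∼ p̂_data} L( f(x; θ), y )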
Here J(θ) is called the cost function and L(·) is called the loss function. The cost function is described as an average over the training set, while the loss function is the per-example loss. How should we understand this? Think about it: when we train a model, do we finish in one pass? Certainly not; only after many epochs, that is, many rounds of back propagation, do we finally obtain the model parameters. So the loss function, as I understand it, is a local concept relative to the whole training set. In the formula, f(·) represents the output of the model for input x, and y stands for the target output, i.e. the label or true value.
There is another way to understand it: the loss function is for a single training sample, while the cost function is for the samples as a whole. The concrete form depends on whether the task is regression or classification. Roughly speaking, for a classification-style comparison with predicted value y1 and actual value y, the loss function is y - y1, and the cost function is the mean of that loss over n samples. For a regression problem, the loss function is (y - y1)^2, i.e. numpy.square(y - y1), and the cost function is the average (1/n) * Σ (y - y1)^2, which is known as the mean square error (MSE).
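A small sketch of this distinction (the toy labels and predictions below are invented purely for the example):

    import numpy as np

    y_true = np.array([1.0, 0.0, 2.0])   # labels for three toy samples
    y_pred = np.array([0.8, 0.3, 1.5])   # model predictions

    per_sample_loss = np.square(y_true - y_pred)   # "loss function": one value per sample
    cost = per_sample_loss.mean()                  # "cost function": average over the set (MSE)

    print(per_sample_loss)   # [0.04 0.09 0.25]
    print(cost)              # ~0.1267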
In machine learning there is yet another way to understand the loss function and the cost function. Have you heard of structural risk and empirical risk? If not, it doesn't matter; let me briefly explain their relationship:
Structural risk = empirical risk + penalty term (i.e. regularization term)
What does this mean? If we went into it today, too much would be involved. Interested readers can look at the support vector machine (SVM) algorithm; see also the symbols written out below. I have a soft spot for SVM, having studied it for a long time, and I will talk about it in more detail later. I suggest first reading a Chinese paper: "On Statistical Learning Theory and Support Vector Machines", written in 2000 by Zhang Xuegong of Tsinghua University. It is quite classic and worth reading several times. What I want to say here is that structural risk is generally what people call the cost function, and empirical risk is the loss function. The penalty term just mentioned is generally not used much in deep learning, although adding a penalty term to the loss function is a handy way to pad out a paper. Embarrassing, I know.
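Written out in common notation (my own reconstruction in standard symbols, not a formula from the paper cited above), the relationship is roughly:

    J(θ) = (1/N) Σ_i L( f(x_i; θ), y_i ) + λ · Ω(θ)

where the first term is the empirical risk (the average loss over the N training samples) and λ · Ω(θ) is the penalty, or regularization, term.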
Before introducing specific loss functions, let's talk about what a loss function is for, or why deep learning needs one at all. It certainly does: for now there is no way around it. Take classification as an example. The task of a classification problem is to correctly assign the data in a given sample set to their categories. Note the word "correctly": if you end up separating the data, but the points grouped together are not of the same class, the work is useless. Since we need to distinguish correctly, the prediction should be very close to the original value, and the way we measure this closeness is the loss function. So once we have a loss function, the goal is to make its value as small as possible, that is:
min f(·)
where f stands for the loss function. The classification problem is thus turned into an optimization problem, and optimization methods in mathematics are a very mature and active area, so the problem becomes much simpler.
Okay, let's get to today's topic and introduce two loss functions commonly used in deep learning: one is the mean squared error loss function, and the other is the cross entropy loss function.
1. Mean squared loss function
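In one common textbook form (reconstructed here; the exact notation may differ), the mean squared error loss over n training samples is:

    C(w, b) = (1/2n) Σ_x || y(x) − a(x) ||²,   with a(x) = σ(w·x + b)

where σ is the activation function, y(x) is the label for input x, and a(x) is the network output.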
The σ here is the activation function we talked about in the last article; of course, it does not matter which activation function it is. In BP we propagate the loss difference backwards and update w and b. So how is this difference computed? Yes, we differentiate the loss function with respect to w and b and compute their gradients. Below is a figure that has been used before. In particular, I want to point out how this derivative is calculated: there is a big pitfall here, because it is not quite the same as the usual derivative of a scalar function. It is a matrix derivative, also called a vector derivative. Do take a look at Ref. 1; you must read it, otherwise it is difficult to fully understand this part.
The term in the figure that is being differentiated, E, is the loss function, and E is a function of w and b.
The mean squared error is relatively simple: take the difference and square it, and that's basically it. Here is a training tip: when using MSE as the loss function, it is best not to use activation functions like sigmoid or tanh. I remember that in the article on activation functions, one point was left unclear: the saturation of the activation function, and how to understand it. From a mathematical point of view, when x in the sigmoid function tends to positive or negative infinity, the function value approaches 1 or 0; that is, once the argument is beyond a certain range, the function becomes very flat and the slope becomes very small, even effectively 0. I hand-drew the function image, and it looks roughly like that. (Well, ugly.)
When the slope is very small, the derivative is very small, and BP relies on that derivative when back-propagating to update the parameters:

new parameter = old parameter − learning rate × gradient

With a near-zero gradient, the parameters will barely change. That gives an approximate picture of what saturation means.
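A small numerical sketch of this effect (the target and pre-activation values below are made up for illustration): with MSE and a sigmoid output, the gradient contains a factor σ'(z), which collapses once the neuron saturates, even when the prediction is badly wrong.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y = 0.0   # target
    for z in [0.0, 2.0, 5.0, 10.0]:        # pre-activation values, increasingly saturated
        a = sigmoid(z)
        sigma_prime = a * (1 - a)          # derivative of sigmoid
        grad = (a - y) * sigma_prime       # dLoss/dz for the MSE loss 0.5*(a-y)^2
        print(f"z={z:5.1f}  sigma'(z)={sigma_prime:.6f}  gradient={grad:.6f}")

For z = 10 the prediction is almost maximally wrong (output near 1, target 0), yet the gradient is on the order of 1e-5, so learning nearly stops.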
2. Cross entropy loss function
To understand the cross entropy loss function, we need to know what cross entropy is. Cross entropy builds on the concept of entropy, and entropy has to do with the amount of information. Besides cross entropy, are there other kinds of entropy? Yes, for example relative entropy. Let me go through these briefly.
2.1 Amount of information
Put simply, in one sentence: the amount of information of an event A indicates how surprising its occurrence is to people. If the surprise is large, event A carries a lot of information, and vice versa. Generally we use probability to represent how likely event A is: the larger the probability, the smaller the amount of information; conversely, the smaller the probability, the larger the amount of information. In the formula, p(x0) represents the probability; the logarithm is a monotonically increasing function, and adding a negative sign makes it monotonically decreasing, so the larger the argument, the smaller the function value.
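In its standard form, the amount of information (self-information) of an outcome x0 is:

    I(x0) = −log p(x0)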
2.2 Entropy
In fact, the concept of entropy is not unfamiliar; I seem to remember it from secondary-school chemistry. In chemistry, entropy indicates the degree of disorder of a system: the more disordered the system, the larger the entropy. In chemistry we often purify substances, and after purification the entropy becomes smaller. Same idea. Mathematically, for an event A (a random variable with distribution p), its entropy is defined as:
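In standard form:

    H(A) = E[ −log p(x) ] = − Σ_x p(x) log p(x)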
Where E stands for mathematical expectation.
2.3 Relative entropy
Relative entropy is also called KL divergence (Kullback-Leibler divergence), or KL distance. It has become well known because, in the last couple of years, Generative Adversarial Networks (GAN) have been very popular, and in his paper Goodfellow uses KL divergence to measure the distance between two distributions; another related measure is the JS divergence. These are all ways of measuring how far apart the distributions of two random variables are, and of course there are others; interested readers can look at Ref. 2. For the distributions A and B of two random variables, relative entropy is defined as:
KL(A‖B) = E_A[ log( A(x) / B(x) ) ]
2.4 Cross Entropy
Cross entropy is closely related to relative entropy and is defined as:

Cross Entropy(A, B) = KL(A‖B) + H(A)

H(A) represents the entropy of event A.
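A quick numerical check of this relationship (the two toy distributions below are invented for the example):

    import numpy as np

    A = np.array([0.7, 0.2, 0.1])   # distribution of A
    B = np.array([0.5, 0.3, 0.2])   # distribution of B

    entropy_A     = -np.sum(A * np.log(A))       # H(A)
    kl_AB         =  np.sum(A * np.log(A / B))   # KL(A || B)
    cross_entropy = -np.sum(A * np.log(B))       # CrossEntropy(A, B)

    print(cross_entropy, entropy_A + kl_AB)      # the two values coincide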
2.5 Cross Entropy loss function
In deep learning, the cross-entropy loss function over N samples (where N represents the sample size) is defined as:
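In the common binary (sigmoid-output) form, which is presumably what is meant here:

    C = −(1/N) Σ_x [ y · ln a + (1 − y) · ln(1 − a) ],   where a = σ(w·x + b)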
Then we take the derivative with respect to w and b:
[try deriving it yourself]
After differentiating, you can see that the result no longer contains the derivative of the activation function. In this way, the saturation problem of the activation function is neatly avoided.
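For reference, differentiating the binary cross-entropy above with a sigmoid output gives (a standard result, reconstructed rather than quoted from this article):

    ∂C/∂w_j = (1/N) Σ_x x_j · ( σ(z) − y ),   ∂C/∂b = (1/N) Σ_x ( σ(z) − y )

The σ'(z) factor cancels out, so the update is driven purely by the error σ(z) − y, no matter how saturated the neuron is.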