Table of contents:
Preface
Neural networks
Perceptron model
Multilayer neural networks
Activation functions
Logistic function
Tanh function
ReLU function
Loss functions and output units
Selection of the loss function
Mean squared error loss function
Cross-entropy loss function
Selection of the output unit
Linear unit
Sigmoid unit
Softmax unit
References
I. Preface
Starting with this chapter, we formally introduce neural network models and learn how to implement deep learning algorithms with TensorFlow. Artificial neural networks are inspired, to a certain extent, by biology and attempt to simulate a biological nervous system through a particular topological structure; they are the main connectionist model (the three major schools of artificial intelligence being symbolism, connectionism, and behaviorism). In this chapter we start from the simplest neural network, the perceptron model, and first see what kinds of problems the perceptron (a single-layer neural network) can solve, as well as its limitations. To overcome the limitations of single-layer networks we must extend to multilayer networks, and around multilayer networks we further introduce activation functions and the backpropagation algorithm. This chapter is the foundation of deep learning and is very important for understanding the following chapters.
The concept of deep learning grew out of research on artificial neural networks. Early perceptron models could only solve simple linear classification problems; it was later found that linearly inseparable problems such as XOR could be solved by adding layers to the network. This kind of multilayer neural network is also called a multilayer perceptron. Multilayer perceptrons are trained with the backpropagation (BP) algorithm [1], but BP suffers from slow convergence and a tendency to get stuck in local optima, so it could not train multilayer perceptrons well. In addition, the activation functions used at the time suffered from vanishing gradients, and the development of artificial neural networks almost came to a standstill. Scholars explored many improvements for training multilayer networks; in 2006, Hinton and others proposed an unsupervised, greedy layer-by-layer training algorithm based on the deep belief network (DBN), which made the problem look solvable and started the wave of deep learning.
This chapter consists of five parts. The first part introduces the basic structure of neural networks, from the basic perceptron model to the multilayer network structure; the second part introduces the activation functions commonly used in neural networks; the third part covers the choice of loss function and output unit; the fourth part introduces an important piece of neural network fundamentals, the backpropagation algorithm; finally, we use TensorFlow to build a simple multilayer neural network that recognizes the handwritten digits in MNIST.
II. Neural networks
1. Perceptron model
The perceptron (Perceptron) is the simplest artificial neural network; it can also be called a single-layer neural network, as shown in Figure 1. The perceptron was proposed by Frank Rosenblatt in 1957. Its structure is very simple: the input is a vector of real values, and the output takes only two values, 1 or -1. It is a binary linear classification model.
Figure 1 Perceptron model
As shown in Figure 1, the perceptron first computes a weighted sum of the input vector, giving an intermediate value z:

Formula 1: z = w_1*x_1 + w_2*x_2 + ... + w_n*x_n + b

The final output is then obtained by passing z through an activation function, which here is the sign function:

Formula 2: y = sign(z) = 1 if z >= 0, otherwise -1

The term b in Formula 1 can be regarded as a threshold (we usually call it the bias term). When the weighted sum of the input vector exceeds this threshold (that is, when their sum z is greater than zero), the perceptron outputs 1; otherwise it outputs -1.
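To make this concrete, here is a minimal NumPy sketch of the perceptron forward pass (the weights, bias, and inputs are made-up illustrative values, not taken from the article):

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Perceptron forward pass: weighted sum plus bias, then the sign function."""
    z = np.dot(w, x) + b          # Formula 1: weighted sum plus bias
    return 1 if z >= 0 else -1    # Formula 2: sign activation

# Illustrative weights for a 2-input perceptron (hypothetical values)
w = np.array([0.5, 0.5])
b = -0.7
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_forward(np.array(x), w, b))
# Output is 1 only for (1, 1): these weights implement logical AND,
# one of the linearly separable problems discussed below.
```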
2. Multilayer neural networks
The perceptron can only solve linearly separable problems. Take logical operations as an example:
Figure 2 Logical operations
The perceptron can handle logical AND and OR, but it cannot solve XOR, because the results of the XOR operation cannot be separated by a single straight line. To solve linearly inseparable problems we need to introduce multilayer neural networks; in theory, a multilayer neural network can fit any function (relevant references are available in the GitHub project accompanying this book).
Compared with a single-layer network, a multilayer neural network has at least one hidden layer in addition to the input and output layers. Figure 3 shows a two-layer neural network with one hidden layer.
Fig. 3 two-layer neural network
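To make the XOR discussion concrete, here is a small NumPy sketch of a two-layer network with hand-picked (illustrative) weights; the hidden layer computes OR and NAND, and their combination yields XOR, which no single-layer perceptron can represent:

```python
import numpy as np

def step(z):
    """Threshold activation: 1 if z >= 0, else 0."""
    return (z >= 0).astype(int)

# Hand-picked weights (illustrative, not from the article).
# Hidden layer: one neuron computes OR(x1, x2), the other NAND(x1, x2).
W1 = np.array([[1.0, 1.0],     # OR neuron
               [-1.0, -1.0]])  # NAND neuron
b1 = np.array([-0.5, 1.5])
# Output layer: AND of the two hidden neurons, which is exactly XOR.
W2 = np.array([1.0, 1.0])
b2 = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x) + b1)   # hidden layer
    y = step(W2 @ h + b2)             # output layer
    print(x, int(y))
# Prints 0, 1, 1, 0 -- the XOR function, impossible for a single-layer perceptron.
```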
To compare single-layer and multilayer neural networks more intuitively, we demonstrate two examples with TensorFlow Playground, a visual deep learning demonstration platform launched by Google: http://playground.tensorflow.org/.
Let's first look at a linearly separable example, as shown in Figure 4. The right side of the figure visualizes the data, which can be separated by a straight line. As the figure shows, we use a single-layer network: two neurons in the input layer, a single neuron in the output layer, and a linear activation function.
Figure 4 TensorFlow Playground example: linearly separable data
We click the start training button, and the final classification result is shown in figure 5:
Figure 5 TensorFlow Playground example: linearly separable data
In the example above we used a single-layer neural network to solve a linearly separable binary classification problem. Next, let's look at a linearly inseparable example, as shown in Figure 6:
Figure 6 TensorFlow Playground example: linearly inseparable data
In this example we use a set of linearly inseparable data. To classify it, we use a neural network with one hidden layer of four neurons and the nonlinear ReLU activation function. To classify linearly inseparable data we must introduce a nonlinearity, that is, a nonlinear activation function. The next section introduces some commonly used activation functions.
The final classification result is shown in figure 7.
Figure 7 TensorFlow Playground example: linearly inseparable data
Interested readers can try a linear activation function to see what happens, try other datasets, or increase the number of layers and neurons to see how these choices affect the model.
III. Activation functions
To solve nonlinear classification or regression problems, the activation function must be nonlinear. In addition, because we train the model with gradient-based methods, the activation function must also be differentiable.
1. Logistic function
The Logistic function (also known as the sigmoid function) is defined as f(x) = 1 / (1 + e^(-x)); its expression and graph are shown in Figure 8:
Figure 8 Logistic function expression and graph
The Logistic function is monotonically increasing over its domain, and its range is (0, 1); the closer to either end, the flatter the function becomes. Because the Logistic function is simple and easy to use, early neural networks often used it as the activation function, but because of several shortcomings it is now rarely chosen. One drawback is that it saturates easily: the graph shows that the gradient of the Logistic function is pronounced only near the origin, while the function is very flat at both ends. This causes the vanishing gradient problem when parameters are updated with the backpropagation algorithm, and the problem becomes more severe as the number of layers increases.
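The saturation effect can be seen numerically. Below is a small NumPy sketch of the Logistic function and its derivative, evaluated at a few arbitrary sample points:

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: 1 / (1 + e^(-x)), range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.6f}")
# The gradient peaks at 0.25 for x = 0 and is already about 0.000045 at x = 10:
# this saturation is what leads to vanishing gradients in deep networks.
```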
2. Tanh function
The Tanh function (hyperbolic tangent activation function) is defined as tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)); its expression and graph are shown in Figure 9:
Figure 9 Tanh function expression and graph
The Tanh function looks very much like a stretched version of the Logistic function, with a range of (-1, 1). In practice the Tanh function usually performs better than the Logistic function, but it also saturates over most of its domain.
3. ReLU function
The ReLU function (rectified linear unit), defined as ReLU(x) = max(0, x), is currently the most popular and most frequently used activation function. Its expression and graph are shown in Figure 10:
Figure 10 ReLU function expression and graph
The ReLU activation converges much faster than the Logistic and Tanh functions. To the left of the y-axis (x < 0) its value is always zero, which makes the network sparse, reducing the dependencies between parameters and alleviating overfitting; to the right of the y-axis (x > 0) its derivative is the constant 1, so there is no vanishing-gradient problem. However, ReLU also has shortcomings: the enforced sparsity can alleviate overfitting, but it may also mask too many features and prevent the model from learning effective ones.
Besides the three activation functions described above there are many others, including several improved versions of ReLU, but in practice ReLU still works well. Activation functions remain an active research direction; interested readers can find more material, including references, in the GitHub project accompanying this book.
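For comparison, here is a minimal sketch of the Tanh and ReLU functions together with their derivatives, again evaluated at made-up sample points; it shows Tanh saturating at both ends while the ReLU gradient stays at 1 for positive inputs:

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent, range (-1, 1); saturates like the sigmoid."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: 0 for x < 0, 1 for x > 0."""
    return (x > 0).astype(float)

xs = np.array([-10.0, -1.0, 0.5, 10.0])
print("tanh grad:", tanh_grad(xs))   # nearly 0 at both ends -> saturation
print("relu grad:", relu_grad(xs))   # exactly 0 on the left, 1 on the right
```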
IV. Loss functions and output units
The loss function (Loss Function), also known as the cost function (Cost Function), is an important part of neural network design. It measures the error between the model's predictions and the true labels; training a deep learning model is the process of minimizing the loss function with gradient-based methods. The choice of loss function is also closely related to the choice of output unit.
1. Selection of loss function
1.1 Mean squared error loss function
Mean squared error (Mean Squared Error, MSE) is a commonly used loss function. We use the distance between the predicted value and the true value (the error) to measure the quality of the model, and to keep this measure consistent regardless of sign we usually use the square of the distance. In deep learning we train the parameters with gradient-based methods: at each step we feed one batch of data into the model, obtain predictions for that batch, and then use the distance between the predictions and the true values to update the network parameters. The mean squared error loss takes the mean of the squared errors over the batch as the final error. The formula for the mean squared error is as follows:
Formula 3: MSE = (1/m) * Σ_{i=1..m} (y_i - ŷ_i)²

In the formula above, y_i is the true value of the i-th sample, ŷ_i is the model's prediction, and m is the number of samples in the batch. To simplify the derivative calculation, we usually multiply the mean squared error by 1/2 and use that as the final loss function:

Formula 4: L = (1/2m) * Σ_{i=1..m} (y_i - ŷ_i)²
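A minimal NumPy sketch of both forms of the loss, using a small batch of invented regression targets and predictions:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: average of squared differences over the batch (Formula 3)."""
    return np.mean((y_true - y_pred) ** 2)

def half_mse_loss(y_true, y_pred):
    """MSE multiplied by 1/2 (Formula 4), which cancels the factor 2 when differentiating."""
    return 0.5 * np.mean((y_true - y_pred) ** 2)

# Illustrative batch of regression targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse_loss(y_true, y_pred))       # 0.375
print(half_mse_loss(y_true, y_pred))  # 0.1875
```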
1.2 Cross-entropy loss function
The cross-entropy (Cross Entropy) loss function uses the cross entropy between the true labels of the training data and the model's predictions as the loss, and it is more popular than the mean squared error loss. If we used a quadratic function such as the mean squared error as the cost function, then when updating the parameters of the neural network the error term would include the partial derivative of the activation function. As introduced in the section on activation functions, functions such as the Logistic saturate easily, which makes the parameter updates slow or even impossible. The derivative of the cross-entropy loss does not involve the derivative of the activation function, so this problem is avoided. Cross entropy is defined as follows:
Formula 5: H(p, q) = -Σ_x p(x) * log q(x)

In the formula above, p is the true distribution of the sample data and q is the distribution predicted by the model. Taking binary classification as an example, the cross-entropy loss takes the following form:

Formula 6: L = -(1/m) * Σ_{i=1..m} [ y_i * log ŷ_i + (1 - y_i) * log(1 - ŷ_i) ]

In the equation above, y_i is the true label and ŷ_i is the predicted value. For multi-class problems, we compute the cross entropy of the prediction for each class and sum over the classes.
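Below is a small NumPy sketch of the binary and multi-class cross-entropy computations; the labels, predictions, and the eps clipping constant are illustrative choices, not values from the article:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy (Formula 6), averaged over the batch; eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy: sum over classes, mean over the batch.
    y_true is one-hot, y_pred holds per-class probabilities (e.g. from softmax)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Illustrative labels and predictions
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.8, 0.1]])))
```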
2. Selection of the output unit
2.1 Linear unit
Linear output units are often used in regression problems. When the output layer uses linear units, it receives the output h from the previous layer and produces a vector ŷ = Wᵀh + b. One advantage of linear units is that they do not saturate, so they are well suited to gradient-based optimization algorithms.
2.2 Sigmoid unit
The Sigmoid output unit is often used in binary classification problems. On top of a linear unit, the Sigmoid unit adds a threshold function that constrains the output to the interval (0, 1), so it can be interpreted as a probability. The Sigmoid output unit is defined as:

Formula 7: ŷ = σ(wᵀh + b)

In the expression above, σ denotes the Sigmoid function, whose mathematical form was introduced earlier in the section on the Logistic function.
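A tiny sketch of a Sigmoid output unit; the hidden activations h, weights w, and bias b below are invented values:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: constrains the output to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_output_unit(h, w, b):
    """Formula 7: a linear transform of the previous layer's output h,
    squashed by the sigmoid so it can be read as a class probability."""
    return sigmoid(np.dot(w, h) + b)

# Invented values: previous layer output h, output-layer weights w and bias b
h = np.array([0.2, -0.4, 1.3])
w = np.array([0.7, -0.1, 0.5])
b = 0.05
p = sigmoid_output_unit(h, w, b)
print(p)  # probability that the sample belongs to the positive class
```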
2.3 Softmax Unit
The Softmax output unit is suitable for multi-class classification problems and can be regarded as an extension of the Sigmoid unit. The output of a Sigmoid unit can be interpreted as the probability that the model assigns the sample to a particular class, whereas Softmax outputs multiple values, one for each class of the classification problem. The Softmax function has the following form:
Formula 8: softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
We use a simple diagram to explain the role of the Softmax function, as shown in Figure 11. Suppose the original output layer produces the values z_1, z_2, and z_3; after adding the Softmax layer, the final outputs become:

Formula 9: y_1 = e^{z_1} / (e^{z_1} + e^{z_2} + e^{z_3})

Formula 10: y_2 = e^{z_2} / (e^{z_1} + e^{z_2} + e^{z_3})

Formula 11: y_3 = e^{z_3} / (e^{z_1} + e^{z_2} + e^{z_3})

The values y_1, y_2, and y_3 in the formulas above can be regarded as the classifier's predictions, and the size of each value represents the probability that the classifier assigns the sample to that class.
Fig. 11 Softmax output unit
It should be noted that the input and output dimensions of the Softmax layer are the same, and if they are inconsistent, you can solve the problem by adding a fully connected layer in front of the Softmax layer.
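A short NumPy sketch of the Softmax computation follows; the logits are arbitrary example values, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    """Softmax (Formula 8): exponentiate each logit and normalize so the outputs sum to 1."""
    z = z - np.max(z)              # numerical stability; result is unchanged
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Illustrative logits z1, z2, z3 from an output layer with three classes
z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y)          # approx. [0.659 0.242 0.099]
print(y.sum())    # 1.0 -- each y_i is the predicted probability of class i
```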
Next, the fourth part introduces an important piece of neural network fundamentals, the backpropagation algorithm, and the fifth part uses TensorFlow to build a simple multilayer neural network to recognize the MNIST handwritten digits.
V. References
1. "Parallel Distributed processing" Rumelhart & McCelland. 1986
Original link: https://mp.weixin.qq.com/s/hYxM9VAW_9j6jOEWycY8Rg