What is the principle of Python multilayer perceptron?

This article explains the principle of the Python multilayer perceptron. The approach introduced here is simple, fast, and practical; interested readers are welcome to follow along.

Contents

Hidden layer

From linear to nonlinear

Activation function

ReLU function

Sigmoid function

Tanh function

Hidden layer

We described the affine transformation earlier: a linear transformation plus a bias term. First, recall the model structure of softmax regression described earlier. That model maps our inputs directly to the outputs through a single affine transformation, followed by a softmax operation. If our labels really were related to our input data by an affine transformation, then this approach would be sufficient. However, the linearity of an affine transformation is a strong assumption.

Our data may have a representation that accounts for the interactions among our features. A linear model built on top of such a representation might be appropriate, but we do not know how to compute that representation by hand. With deep neural networks, we use the observed data to jointly learn both the hidden-layer representation and a linear predictor that acts on that representation.

We can overcome the limitations of linear models by adding one or more hidden layers to the network, allowing it to handle more general kinds of functional relationships. The easiest way to do this is to stack many fully connected layers on top of one another, each layer feeding into the layer above it until the final output is produced. We can think of the first L−1 layers as a representation and the last layer as a linear predictor. This architecture is commonly called a multilayer perceptron, often abbreviated as MLP. Below, we describe such a multilayer perceptron.

This multilayer perceptron has 4 inputs, 3 outputs, and a hidden layer containing 5 hidden units. The input layer does not involve any computation, so producing output with this network requires only the computations of the hidden layer and the output layer; therefore, the number of layers of this multilayer perceptron is 2. Note that both layers are fully connected: every input affects every neuron in the hidden layer, and every neuron in the hidden layer affects every neuron in the output layer.
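To make the architecture concrete, here is a minimal sketch (not from the original text) of this exact shape in torch: 4 inputs, a hidden layer of 5 units, and 3 outputs. The ReLU activation between the layers anticipates the discussion of activation functions below.

import torch
from torch import nn

# An MLP with 4 inputs, one fully connected hidden layer of 5 units, and 3 outputs.
net = nn.Sequential(nn.Linear(4, 5), nn.ReLU(), nn.Linear(5, 3))
X = torch.randn(2, 4)    # a mini-batch of 2 samples with 4 features each
print(net(X).shape)      # torch.Size([2, 3])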

However, the parameter cost of a multilayer perceptron built from fully connected layers can be prohibitively high; even without changing the input or output sizes, there is a tradeoff between parameter savings and model effectiveness.
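As a rough illustration of why this matters (the layer widths below are hypothetical), a fully connected layer with n_in inputs and n_out outputs has n_in * n_out weights plus n_out biases, so the count grows quickly with width:

# Parameter count of a single fully connected layer (hypothetical sizes).
def fc_params(n_in, n_out):
    return n_in * n_out + n_out

print(fc_params(4, 5))         # 25, the hidden layer of the small MLP above
print(fc_params(1024, 1024))   # 1049600, a single wide hidden layer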

From linear to nonlinear

Note that after adding hidden layers, the model now needs to track and update additional parameters.

But what do we gain from this? Surprisingly, in the model defined above we gain nothing. The hidden units above are given by an affine function of the inputs, and the outputs (before the softmax operation) are just an affine function of the hidden units. An affine function of an affine function is itself an affine function, and our earlier linear model was already able to represent any affine function.
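We can check this claim numerically. The following is a small sketch (layer sizes arbitrary) showing that two stacked affine layers with no nonlinearity in between are equivalent to a single affine map whose weight is the product of the two weight matrices:

import torch
from torch import nn

torch.manual_seed(0)
l1, l2 = nn.Linear(4, 5), nn.Linear(5, 3)   # two affine layers, no activation in between
X = torch.randn(2, 4)

# Collapse the composition into one affine map: W = W1^T W2^T, b = b1 W2^T + b2
W = l1.weight.T @ l2.weight.T
b = l1.bias @ l2.weight.T + l2.bias
print(torch.allclose(l2(l1(X)), X @ W + b, atol=1e-6))   # True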

Since each row of X corresponds to one sample in the mini-batch, for notational convenience we define the nonlinear function σ to act on its input row-wise, i.e., one sample at a time. We used the softmax notation in the same way earlier to denote a row-wise operation. In this section, however, the activation functions we apply to hidden layers are usually not merely row-wise but element-wise. This means that after computing the linear part of each layer, we can compute each activation value without looking at the values taken by the other hidden units. This is true for most activation functions.
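To make this concrete, here is a minimal sketch (random weights, arbitrary sizes) of the one-hidden-layer computation H = σ(XW1 + b1), O = HW2 + b2, with the nonlinearity σ (here ReLU) applied element-wise to the hidden layer:

import torch

torch.manual_seed(0)
X = torch.randn(2, 4)                        # mini-batch of 2 samples, one per row
W1, b1 = torch.randn(4, 5), torch.zeros(5)
W2, b2 = torch.randn(5, 3), torch.zeros(3)

H = torch.relu(X @ W1 + b1)   # activation applied element-wise to the hidden layer
O = H @ W2 + b2               # outputs, before the softmax operation
print(H.shape, O.shape)       # torch.Size([2, 5]) torch.Size([2, 3])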

Activation function

The activation function determines whether a neuron should be activated by calculating a weighted sum and adding a bias. They are differentiable operations that convert input signals into outputs. Most activation functions are nonlinear. Since activation functions are the foundation of deep learning, the following is a brief description of some common activation functions.

import torch
from d2l import torch as d2l

ReLU function

The most popular choice is the rectified linear unit (ReLU), because it is simple to implement and performs well across a wide range of prediction tasks. ReLU provides a very simple nonlinear transformation: given an element x, the ReLU function is defined as the maximum of that element and 0, i.e. ReLU(x) = max(x, 0).

In layman's terms, the ReLU function keeps only positive elements and discards all negative elements by setting the corresponding activation value to 0. To get a feel for it, we can plot the function. As shown in the figure below, the activation function is piecewise linear.

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))

Note that the ReLU function is not differentiable when the input is exactly zero. In that case we default to the left-hand derivative, i.e., the derivative is 0 when the input is 0. We can get away with this because the input may never actually be exactly zero. As the old adage goes, "if subtle boundary conditions matter, we are probably doing mathematics rather than engineering," and that wisdom applies here. Next we plot the derivative of the ReLU function.

y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

The reason for using ReLU is that its derivatives are particularly well behaved: they either vanish or simply let the argument pass through. This makes optimization better behaved, and ReLU mitigates the vanishing gradient problem that plagued earlier neural networks.

Note that the ReLU function has many variants, including the parameterized ReLU (pReLU) function. This variant adds a linear term to ReLU, so that some information still gets through even when the argument is negative: pReLU(x) = max(0, x) + α min(0, x).
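As a sketch of this variant (the value of α below is only illustrative; in torch.nn.PReLU it is learned during training), pReLU can be written directly from the formula or taken from the built-in module:

import torch
from torch import nn

x = torch.arange(-8.0, 8.0, 0.1)
alpha = 0.25   # illustrative value for the slope on the negative side

y_manual = torch.clamp(x, min=0) + alpha * torch.clamp(x, max=0)   # max(0, x) + alpha * min(0, x)
y_module = nn.PReLU(init=0.25)(x)                                  # same formula with a learnable alpha
print(torch.allclose(y_manual, y_module))   # True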

Sigmoid function

In the earliest neural networks, scientists were interested in modeling biological neurons that either "fire" or "do not fire." Thus pioneers of the field, such as McCulloch and Pitts, the inventors of the artificial neuron, focused on threshold units from the very beginning. A threshold unit takes the value 0 when its input is below some threshold and the value 1 when its input exceeds the threshold.

As attention gradually shifted to gradient-based learning, the sigmoid function became a natural choice because it is a smooth, differentiable approximation to a threshold unit: sigmoid(x) = 1 / (1 + exp(-x)), which squashes any real input into the interval (0, 1). Sigmoid is still widely used as the activation function on output units when we want to interpret the output as the probability of a binary classification problem (sigmoid can be viewed as a special case of softmax). However, sigmoid is now rarely used in hidden layers; most of the time it has been replaced by ReLU, which is simpler and easier to train.
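Following the same plotting pattern as the ReLU example above, a self-contained sketch of the sigmoid curve looks like this:

import torch
from d2l import torch as d2l

# sigmoid(x) = 1 / (1 + exp(-x)) squashes its inputs into the interval (0, 1)
x = torch.arange(-8.0, 8.0, 0.1)
y = torch.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))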

Tanh function

Similar to the sigmoid function, the tanh (hyperbolic tangent) function squashes its inputs into the interval (-1, 1). The tanh function is given by tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x)).

Next we plot the tanh function. Note that as the input nears zero, the tanh function approaches a linear transformation. The shape of the function is similar to that of the sigmoid function, except that the tanh function is point-symmetric about the origin of the coordinate system.
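Again following the plotting pattern used above, a self-contained sketch:

import torch
from d2l import torch as d2l

# tanh(x) squashes its inputs into the interval (-1, 1) and is point-symmetric about the origin
x = torch.arange(-8.0, 8.0, 0.1)
y = torch.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))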

At this point, I believe everyone has a deeper understanding of "What is the principle of Python multilayer perceptron". Go ahead and try it out in practice, and keep learning!
