2025-01-14 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report --
Brief history
Convolutional neural networks
Over the past decade, progress in the field of computer vision has been unprecedented. Machines can now recognize an image, or a frame in a video, with higher accuracy (98%) than humans (97%). This breakthrough owes much to the study of the human brain.
Neuroscientists experimenting with cats found that similar parts of an image caused similar parts of a cat's brain to become active. In other words, when a cat looks at a circle, area alpha of its brain is activated; when it looks at a square, area beta is activated. They concluded that the animal brain contains neural regions that respond to specific features of an image; that is, animals perceive their environment through a hierarchy of neurons in the brain, and each image passes through a kind of feature extractor before reaching deeper regions of the brain.
Inspired by how the brain works, mathematicians set out to create a system that simulates how different groups of neurons fire at different aspects of an image and communicate with one another to form a larger picture.
Feature extractor
They formalized the idea of an activated neuron group as a multidimensional matrix that acts as a detector for a specific set of features, given a specific input. Such a feature detector is also called a filter or kernel. Each filter is used to detect something specific in the image, such as a filter for detecting edges. The learned features are then passed through another set of filters designed to detect higher-level features, such as eyes, a nose, and so on.
Edge detection by convolving an image with a Laplacian filter
Mathematically, we convolve a given input image (represented as a matrix of pixel intensities) with a filter to produce a so-called feature map. That feature map then serves as the input to another layer of filters.
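As a minimal sketch in plain NumPy (the toy image below is an illustrative choice, not from the article), the sliding dot product might look like this. Note the kernel is applied without flipping, as most deep-learning frameworks do; for a symmetric filter such as the Laplacian the result is identical either way:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2D convolution: slide the kernel over the image and
    take a dot product at each position to build the feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Laplacian filter: responds where pixel intensity changes, i.e. at edges
laplacian = np.array([[0.,  1., 0.],
                      [1., -4., 1.],
                      [0.,  1., 0.]])

# Toy image: dark left half, bright right half -> one vertical edge
image = np.zeros((5, 6))
image[:, 3:] = 1.0

feature_map = convolve2d(image, laplacian)
# Nonzero responses appear only in the columns around the edge
```

Flat regions of the image produce zero response; only the transition between dark and bright columns activates the filter.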
Why convolution?
Convolution is the process by which the network tries to label an input signal by reference to what it has learned in the past. If the input signal resembles the cat images seen before, the "cat" reference signal is convolved, or mixed, with the input signal, and the resulting output signal is passed on to the next layer. (Here, the input signal is a three-dimensional representation of the input image in terms of RGB pixel intensities, and the "cat" reference signal is the kernel learned for recognizing cats.)
Convolution of an image with a filter
A useful property of convolution is that it is translation invariant. Each convolution filter represents a specific feature set, such as eyes or ears, and the CNN algorithm learns to compose these feature sets into a reference result, such as "cat". The strength of the output signal does not depend on where a feature is located, only on whether it is present. As a result, a cat can sit in different positions and the CNN can still recognize it.
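This position independence can be illustrated with a made-up 2x2 "feature" pattern and a filter that matches it exactly (a sketch in plain NumPy; the pattern and image sizes are arbitrary choices):

```python
import numpy as np

def convolve2d(image, kernel):
    # Sliding dot product over the image (no kernel flip)
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A made-up 2x2 "feature"; the filter is the pattern itself
pattern = np.array([[1., 0.],
                    [0., 1.]])

def image_with_feature_at(row, col, size=(6, 6)):
    img = np.zeros(size)
    img[row:row + 2, col:col + 2] = pattern
    return img

# The same feature placed at two different locations
resp_a = convolve2d(image_with_feature_at(0, 0), pattern)
resp_b = convolve2d(image_with_feature_at(3, 4), pattern)

# The peak response strength is identical; only its position moves
peak_a, peak_b = resp_a.max(), resp_b.max()
```

Both feature maps reach the same maximum activation; only the location of that maximum differs, which is exactly why the cat can sit anywhere in the frame.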
Pooling
By following the trail of the brain's biological functions, we can build the mathematical instruments needed for feature extraction. However, once we account for the complexity of the geometry to be tracked and the total number of levels and features to analyze, we realize we do not have enough memory to hold all the data, and the computing power required grows exponentially with the number of features. To escape this predicament, we turn to a technique called pooling. Its core idea is simple:
If a region contains a strongly expressed feature, we can avoid searching that region for other features.
Demonstration of max pooling
This not only saves unnecessary memory and computing power, but also helps to eliminate noise in the image.
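Max pooling, the variant shown above, can be sketched in a few lines of NumPy (the 4x4 feature map below is an illustrative example):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep only the strongest activation in each window,
    shrinking the map and discarding low-level noise."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [0., 2., 9., 8.],
                 [1., 0., 7., 4.]])

pooled = max_pool(fmap)   # the 4x4 map shrinks to 2x2
```

Each 2x2 window collapses to its single strongest value, quartering the memory needed for the next layer.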
Fully connected layer
We've done well so far, but what use is a network that only ends up detecting sets of features in an image? We need a way for the network to classify a given image. This is where a traditional neural network comes in: we can add a fully connected layer that maps the features detected by the previous layers onto the set of classification labels we have. The last layer assigns a probability to each class in the output, and based on these output probabilities we can finally classify the image.
Fully connected layer
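A minimal sketch of that final step, with made-up sizes (8 flattened pooled features, 3 classes) and random untrained weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical sizes: 8 flattened pooled features, 3 class labels
features = rng.normal(size=8)        # output of the last pooling layer, flattened
W = rng.normal(size=(3, 8)) * 0.1    # fully connected weights (learned in training)
b = np.zeros(3)                      # biases

logits = W @ features + b
probs = softmax(logits)              # one probability per class
predicted_class = int(np.argmax(probs))
```

The image is assigned to whichever class receives the highest probability.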
The final framework
All that remains is to organize and merge these learned concepts into what we call a convolutional neural network, or CNN. In essence, a CNN consists of a series of convolution and pooling layers, combined selectively to generate feature maps, which are then fed into fully connected layers to produce class probabilities. When the network returns a wrong output, the error is fed back so the network trains itself to produce accurate results.
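The pieces above can be strung together into a toy untrained forward pass (all sizes, filters, and weights below are illustrative random values, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    out = np.zeros((fmap.shape[0] // size, fmap.shape[1] // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One conv layer (two 3x3 filters + ReLU), one pooling layer,
# one fully connected layer over 3 classes
image = rng.normal(size=(8, 8))
filters = rng.normal(size=(2, 3, 3))

fmaps = [np.maximum(convolve2d(image, f), 0.0) for f in filters]  # conv + ReLU
pooled = [max_pool(m) for m in fmaps]                             # 6x6 -> 3x3
flat = np.concatenate([p.ravel() for p in pooled])                # 2 * 9 = 18 features

W = rng.normal(size=(3, flat.size)) * 0.1
probs = softmax(W @ flat)   # class probabilities; training would adjust
                            # the filters and W based on the output error
```

Real networks stack many such conv/pool stages and learn the filters and weights by backpropagating the output error, but the data flow is the same.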
Now, let's take a closer look at how a CNN works from a functional perspective.
Convolutional neural networks
Convolution layer
The convolution layer is the main building block of a CNN. Each such layer consists of a set of independent filters, each of which looks for a different feature set in the given image.
Convolution operation
Mathematically, we take a filter of fixed size, slide it over the complete image, and take the dot product between the filter and the input image patch at each position. The result of each dot product is a scalar that goes into the final feature map. We then slide the filter to the right, repeat the operation, and add the result to the feature map. After convolving the complete image with the filter, we end up with a feature map representing a distinct feature set, which is used as the input to the next layer.
Stride
The amount by which the filter moves is the stride. In the picture above, the filter slides with a stride of 1. That may not always be what we want. Neighboring pixels are highly correlated (especially in the lowest layers), so it can make sense to reduce the size of the output by using an appropriate stride. However, a large stride can cause heavy information loss, so we must choose the stride carefully.
Stride of 2
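The effect of stride on output size follows the standard formula floor((n + 2p - k) / s) + 1, where n is the input size, k the filter size, s the stride, and p the padding. A quick sketch (the 224-pixel width is an illustrative choice, not from the article):

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * pad - k) // stride + 1

# A 3x3 filter over a hypothetical 224-pixel-wide input:
width_s1 = conv_output_size(224, 3, stride=1)  # 222: almost no shrinkage
width_s2 = conv_output_size(224, 3, stride=2)  # 111: output roughly halved
```

Doubling the stride roughly halves each spatial dimension, quartering the work for the next layer at the cost of skipped positions.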
Padding
Padding with a single layer of zeros
One unwelcome side effect of convolution is that the feature map keeps shrinking as we continue to convolve, which may not be what we want, because shrinking also means loss of information. To understand why this happens, note the difference in the number of filter positions that cover a middle cell versus a corner cell: the middle cells contribute to far more outputs than the edge cells do. To preserve the useful information in the early layers, we can surround the given matrix with a border of zeros, known as zero padding.
Parameter sharing
Why use a CNN when we already have good deep neural networks? Interestingly, if we used a plain deep neural network for image classification, the number of parameters in each layer would be thousands of times that of a CNN, because a CNN's filters share their parameters across every position in the image.
Parameter sharing in a CNN
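A back-of-the-envelope comparison makes the gap concrete (the 224x224 RGB input and 3 output channels are illustrative numbers, not from the article):

```python
# Hypothetical layer: 224x224 RGB input, 3 output channels, 3x3 filters
h, w, c_in, c_out, k = 224, 224, 3, 3, 3

# Fully connected: every input value connects to every output unit
fc_params = (h * w * c_in) * (h * w * c_out)

# Convolutional: one shared k x k x c_in filter per output channel, plus a bias
conv_params = c_out * (k * k * c_in + 1)   # = 84
```

The fully connected version needs tens of billions of weights for a single layer, while the convolutional version needs 84, because the same small filter is reused at every spatial position.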