
What Are the Inception and GoogLeNet Structures?


What are the Inception and GoogLeNet structures? This article gives a detailed analysis and answer to that question, hoping to help readers who want to solve this problem find a simpler, easier way.

GoogLeNet is the champion model of ILSVRC 2014. Rather than inheriting everything from LeNet and AlexNet the way VGG did, GoogLeNet made a bolder attempt at a new network structure: although it is 22 layers deep, its parameter count is only about 1/12 of AlexNet's.

The GoogLeNet paper points out that the safest way to obtain a high-quality model is to increase the depth of the model, or its width. In general, however, deeper and wider networks have the following problems:

Too many parameters make overfitting easy, and if the training data is limited this problem is even more prominent; the larger the network, the greater the computational cost, making it harder to apply; and the deeper the network, the more easily the vanishing gradient problem appears.

In short, larger networks tend to overfit and increase the computational effort.

The GoogLeNet solution

Convert fully connected layers, and even general convolutions, into sparse connections.

To keep the neural network structure sparse while still exploiting the high computational performance of dense matrices, GoogLeNet proposes a modular structure called Inception. According to a large body of literature, clustering sparse matrices into dense submatrices improves computational performance. (I do not understand this part very well; it comes from searching around. The key point is that GoogLeNet proposed the modular Inception structure, and even today, in 2020, this module still plays a huge role.)

1 Inception

It is a king-within-a-king structure... no, a network-within-a-network structure: even an individual node is itself a small network. With Inception, the width and depth of the network can both be expanded, which leads to an improvement in performance. Explanation:

Convolution kernels of different sizes mean receptive fields of different sizes, and the final concatenation means fusing features at different scales. The kernel sizes 1, 3 and 5 are chosen mainly for convenience: after setting the convolution stride to 1, you only need to set pad to 0, 1 and 2 respectively for all branches to produce features of the same spatial dimensions after convolution, which can then be concatenated directly.

The paper notes that many sources report pooling to be very effective, so a pooling branch is introduced as well. The deeper the network goes, the more abstract the features become and the larger the receptive field each feature needs, so the proportion of 3x3 and 5x5 convolutions increases with the number of layers (see later).

Even so, 5x5 convolution kernels still bring a huge amount of computation. For this reason, the paper draws on NIN (Network in Network) and uses 1x1 convolution kernels to reduce dimensionality. How exactly can a 1x1 convolution kernel reduce dimensionality? Suppose we have a 3x3x10 feature map (spatial size 3x3, plus a channel dimension, which I like to call thickness, of 10), and we want to reduce it to 5 channels. Take one 1x1x10 convolution kernel; convolving with it yields a 3x3x1 feature map. With 5 such kernels, i.e. convolving five times, the result is a 3x3x5 feature map. Of course, the same 1x1 convolution can equally be used to increase the dimension.
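To make this concrete, here is a minimal sketch of 1x1 convolution as channel-wise dimensionality reduction, assuming PyTorch (the article names no framework):

```python
# A 3x3 feature map with 10 channels is reduced to 5 channels;
# the spatial size is untouched.
import torch
import torch.nn as nn

x = torch.randn(1, 10, 3, 3)  # (batch, channels=10, height=3, width=3)

reduce = nn.Conv2d(in_channels=10, out_channels=5, kernel_size=1)
print(reduce(x).shape)        # torch.Size([1, 5, 3, 3])

# The same operator increases dimensionality when out_channels > in_channels.
expand = nn.Conv2d(in_channels=5, out_channels=10, kernel_size=1)
print(expand(reduce(x)).shape)  # torch.Size([1, 10, 3, 3])
```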

I found a beautiful diagram of the GoogLeNet architecture online, along with a second one (figures not reproduced here).

The GoogLeNet network structure is broken down and analyzed as follows:

0. Input

The original input image is 224x224x3, preprocessed with zero-mean normalization (subtracting the training-set mean from each pixel of the image).
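As a small illustration, here is a hedged sketch of that zero-mean preprocessing, with stand-in random data in place of the real training set:

```python
# Subtract the per-pixel mean computed over the training set from each image.
import numpy as np

train_images = np.random.rand(100, 224, 224, 3).astype(np.float32)  # placeholder images
mean_image = train_images.mean(axis=0)  # per-pixel mean, shape (224, 224, 3)
centered = train_images - mean_image    # zero-centered input fed to the network
```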

1. The first layer (convolution layer)

Using a 7x7 convolution kernel (sliding stride 2, padding 3) with 64 output channels, the output is 112x112x64, followed by a ReLU operation after the convolution.

After 3x3 max pooling (stride 2), the output size is ((112 - 3 + 1) / 2) + 1 = 56, i.e. 56x56x64, followed by another ReLU operation.

2. The second layer (convolution layer)

Using a 3x3 convolution kernel (sliding stride 1, padding 1) with 192 output channels, the output is 56x56x192, followed by a ReLU operation after the convolution.

After 3x3 max pooling (stride 2), the output size is ((56 - 3 + 1) / 2) + 1 = 28, i.e. 28x28x192, followed by another ReLU operation.
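Putting the first two layers together, here is a sketch of this stem, again assuming PyTorch; ceil_mode=True in the pooling layers reproduces the 56 and 28 output sizes computed above:

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # 224 -> 112
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),   # 112 -> 56
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),  # 56 -> 56
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),   # 56 -> 28
)

print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 192, 28, 28])
```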

3a. The third layer (Inception 3a)

It is divided into four branches and processed by convolution kernels of different scales.

(1) 64 1x1 convolution kernels, followed by ReLU; output 28x28x64

(2) 96 1x1 convolution kernels as dimensionality reduction before the 3x3 convolution, giving 28x28x96; after ReLU, 128 3x3 convolutions (padding 1) output 28x28x128

(3) 16 1x1 convolution kernels as dimensionality reduction before the 5x5 convolution, giving 28x28x16; after ReLU, 32 5x5 convolutions (padding 2) output 28x28x32

(4) a pooling branch: 3x3 max pooling (stride 1, padding 1) outputs 28x28x192, then 32 1x1 convolutions output 28x28x32

Concatenate the four results along the channel (third) dimension: 64+128+32+32=256, so the final output is 28x28x256.
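Here is a minimal sketch of this four-branch module in PyTorch (the class and argument names are my own). All branches keep the 28x28 spatial size, so their outputs can be concatenated along the channel dimension; instantiated with the 3a channel counts above, the module reproduces the 28x28x256 output:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch, c1x1, c3x3red, c3x3, c5x5red, c5x5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(  # plain 1x1 convolution
            nn.Conv2d(in_ch, c1x1, 1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(  # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, c3x3red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3x3red, c3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(  # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, c5x5red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5x5red, c5x5, 5, padding=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(  # 3x3 pooling, then 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

inception_3a = Inception(192, 64, 96, 128, 16, 32, 32)  # channel counts from the text
x = torch.randn(1, 192, 28, 28)
print(inception_3a(x).shape)  # torch.Size([1, 256, 28, 28]): 64+128+32+32 channels
```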

3b. The third layer (Inception 3b)

(1) 128 1x1 convolution kernels, followed by ReLU; output 28x28x128

(2) 128 1x1 convolution kernels as dimensionality reduction before the 3x3 convolution, giving 28x28x128; after ReLU, 192 3x3 convolutions (padding 1) output 28x28x192

(3) 32 1x1 convolution kernels as dimensionality reduction before the 5x5 convolution, giving 28x28x32; after ReLU, 96 5x5 convolutions (padding 2) output 28x28x96

(4) a pooling branch: 3x3 max pooling (stride 1, padding 1) outputs 28x28x256, then 64 1x1 convolutions output 28x28x64

Concatenate the four results along the channel dimension: 128+192+96+64=480, so the final output is 28x28x480.
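Continuing the sketch above, chaining 3a into 3b with these channel counts reproduces the 28x28x480 output:

```python
# Reuses the Inception class and the input x from the previous sketch.
inception_3b = Inception(256, 128, 128, 192, 32, 96, 64)
print(inception_3b(inception_3a(x)).shape)  # torch.Size([1, 480, 28, 28])
```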

The fourth layer (4a, 4b, 4c, 4d, 4e) and the fifth layer (5a, 5b) follow the same pattern as 3a and 3b, and will not be repeated here.

At this point, one thing should be clear from the beautiful picture: GoogLeNet is composed of nine Inception modules in total (2 + 5 + 2 across the third, fourth and fifth stages). Each Inception module counts as two layers deep, and adding the first three convolutional layers and the FC layer before the output gives the 22 layers in total. Notice also that an extra output is taken after every three Inception modules, so this network has three outputs altogether. What is going on?

These are auxiliary classifiers, which GoogLeNet uses in addition to the output of the last layer. Because the features at intermediate layers may already be discriminative enough for good classification, GoogLeNet takes outputs from intermediate layers and adds them to the final classification result with a small weight. In effect this is a disguised form of model fusion; at the same time, it feeds extra gradient signal into the network during backpropagation and plays a certain regularization role.
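To illustrate the training-time role of the auxiliary classifiers: the GoogLeNet paper adds their losses to the main loss with a discount weight of 0.3. The helper below is my own sketch; the three logits tensors and the target labels are assumed to come from a forward pass of the full network.

```python
# A hedged sketch of combining the three outputs during training; at
# inference time the auxiliary classifiers are simply discarded.
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, target, aux_weight=0.3):
    main = F.cross_entropy(main_logits, target)
    aux1 = F.cross_entropy(aux1_logits, target)
    aux2 = F.cross_entropy(aux2_logits, target)
    return main + aux_weight * (aux1 + aux2)
```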

Finally, I would like to ask two more questions:

Does using 1x1 convolutions for feature compression affect the final result? Answer: no. The author's explanation is that if you want to grow the feature thickness from 128 to 256, you could directly use a 3x3 convolution for feature extraction; but if you first compress to 64 channels with a 1x1 convolution and then expand the 64-channel features to 256 with a 3x3 convolution, the subsequent accuracy is unaffected while the number of operations drops, as the arithmetic below shows.
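A quick back-of-the-envelope check of this claim in plain Python, counting multiply-accumulate operations for both options on a 28x28 feature map (the layer sizes are taken from the answer above):

```python
# Multiply-accumulate count of a convolution: H * W * out_ch * in_ch * k * k.
H = W = 28

def conv_macs(in_ch, out_ch, k):
    return H * W * out_ch * in_ch * k * k

direct = conv_macs(128, 256, 3)                              # ~231.2M MACs
bottleneck = conv_macs(128, 64, 1) + conv_macs(64, 256, 3)   # ~122.0M MACs
print(direct, bottleneck, round(bottleneck / direct, 2))     # the bottleneck roughly halves the work
```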

Why does Inception convolve at multiple scales and then re-aggregate? Answer: intuitively, convolving simultaneously at multiple scales extracts features at different scales, which makes the final classification judgment more accurate. In addition, there is the principle that a sparse matrix can be decomposed into dense submatrices to speed up convergence. (I do not fully understand this second benefit. My personal understanding is that by extracting features at multiple scales and aggregating them, the network can find strongly correlated features and avoid wasting computation on weakly correlated ones.)

That covers the question of what the Inception and GoogLeNet structures are. I hope the content above is of some help; if you still have unresolved doubts, you can follow the industry information channel to learn more related knowledge.
