
How to calculate the GPU memory occupied by models and intermediate variables in deep learning


This article mainly introduces how to calculate the GPU memory occupied by a model and its intermediate variables. It is quite detailed and has real reference value, so interested readers should read on!

Preface

torch.FatalError: cuda runtime error (2): out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58

This must be one of the errors that every "alchemist" (as deep learning practitioners jokingly call themselves) least wants to see.

OUT OF MEMORY: the GPU memory obviously cannot hold that many model weights and intermediate variables, so the program crashes. What can be done? In fact there are many remedies: freeing intermediate variables promptly, optimizing the code, reducing the batch size, and so on can all reduce the risk of a GPU memory overflow; the sketch below illustrates the first of these.
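A minimal sketch of freeing intermediate variables promptly, assuming PyTorch and a CUDA device (the tensor names here are only for illustration):

import torch

x = torch.randn(256, 3, 100, 100, device="cuda")
y = x * 2                    # an intermediate result held in GPU memory
del x, y                     # drop references to intermediates as soon as possible
torch.cuda.empty_cache()     # return the allocator's cached blocks to the driver

Note that empty_cache() cannot free tensors that are still referenced; dropping the references first is what actually releases the memory.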

But what this article discusses is the foundation of all the optimizations above: how to calculate the GPU memory we use. Once we know how to calculate the size of the model we design and the memory occupied by the intermediate variables, we can manage our own GPU memory with confidence.

How to calculate

First of all, we should know the basic units of data size:

1 GB = 1000 MB

1 MB = 1000 KB

1 KB = 1000 B

1 B = 8 bit

Well, I'm sure someone will ask why it's 1000 instead of 1024. There's not much to debate here: both conventions are correct, they are just used in slightly different scenarios. The calculations in this article follow the 1000-based standard above.

Then let's talk about how much space the data types we normally use occupy, taking PyTorch's official data types as an example (the data formats of all deep learning frameworks follow the same conventions).

Of PyTorch's dtype table we only need the left-hand column; in normal training we mostly use just these two types:

float32, single-precision floating point

int32, 32-bit integer

Generally speaking, an 8-bit integer variable occupies 1 B, i.e. 8 bits, and a 32-bit float occupies 4 B, i.e. 32 bits. Double-precision floats (double) and long integers (long), however, are generally not used in normal training.

Ps: consumer-grade graphics cards are optimized for single-precision computation, while server-grade cards are optimized for double-precision computation.
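These per-element sizes can be checked with Tensor.element_size(); a quick sketch in PyTorch:

import torch

for dtype in (torch.int8, torch.int32, torch.float32, torch.float64):
    t = torch.tensor([], dtype=dtype)
    print(dtype, t.element_size(), "bytes per element")
# int8 -> 1, int32 -> 4, float32 -> 4, float64 -> 8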

In other words, suppose there is an RGB three-channel true-color image, 500 x 500 in size, stored as single-precision floats. The GPU memory this image occupies is 500 x 500 x 3 x 4 B = 3 MB.

On the other hand, a FloatTensor of shape (256, 3, 100, 100), i.e. (N, C, H, W), occupies 256 x 3 x 100 x 100 x 4 B ≈ 31 MB.
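Both figures are easy to verify in PyTorch; a sketch:

import torch

img = torch.zeros(500, 500, 3, dtype=torch.float32)
print(img.nelement() * img.element_size() / 1000**2, "MB")      # 3.0 MB

batch = torch.zeros(256, 3, 100, 100, dtype=torch.float32)      # (N, C, H, W)
print(batch.nelement() * batch.element_size() / 1000**2, "MB")  # 30.72 MB, roughly 31 MB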

Not much, is it? Don't worry, the fun has just begun.

Where does the GPU memory go?

It seems that an image (3 x 256 x 256) and a convolution layer's output (256 x 100 x 100) take up little space, so why do we still burn through so much GPU memory? The reason is simple: what occupies most of the GPU memory is not our input image, but the intermediate variables inside the neural network and the large intermediate state generated when using an optimizer.

First of all, let's roughly calculate the GPU memory that VGG16 needs to occupy.

Usually, the GPU memory occupied by a model has two parts:

Parameters of the model itself (params)

Intermediate variables generated by model calculation (memory)

The figure comes from cs231n and shows a typical sequential net, flowing smoothly from top to bottom. The input is a 224 x 224 x 3 three-channel image. Note that the figure lists the image's memory as 150K rather than 150K x 4: it counts elements (224 x 224 x 3 ≈ 150K), as if each value were 8-bit, so for 32-bit floats the final result must be multiplied by 4.

The memory numbers on the left represent the space occupied by the input image and the intermediate convolution outputs it produces. As we all know, these successive convolution layers are the "thinking" process of the deep neural network:

The image goes from 3 channels to 64 -> 128 -> 256 -> 512 channels. These are the convolution outputs, and they are what mainly occupies our GPU memory.

And the params column on the right gives the weights of the neural network. The first convolution layer uses 3 x 3 kernels, the input image has 3 channels, and the output has 64 channels, so the space occupied by the first convolution layer's weights is clearly (3 x 3 x 3) x 64 values.
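The params side is easy to reproduce in code; a sketch using torchvision's vgg16 (assumed here to match the cs231n network):

import torch
from torchvision import models

model = models.vgg16()
n_params = sum(p.numel() for p in model.parameters())
print(n_params)                                  # about 138 million parameters
print(n_params * 4 / 1000**3, "GB as float32")   # roughly 0.55 GB for the weights alone
print(model.features[0].weight.shape)            # torch.Size([64, 3, 3, 3]): (3 x 3 x 3) x 64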

Another thing to note is that the intermediate variables roughly double during backward!

For example, consider a computation graph where the input x passes through an intermediate result z to produce the final variable L:

During backward we need the intermediate values saved in forward. The output is L and the input is x; when we ask for the gradient of L with respect to x during backward, the chain rule has to pass through the z between L and x: dL/dx = (dL/dz) × (dz/dx).

The intermediate value needed for dz/dx must of course be kept around for this calculation, so roughly speaking, the intermediate variables during backward occupy about twice as much space as during forward alone!
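A small sketch of this in PyTorch autograd (assuming a CUDA device; sigmoid saves its output z for the backward pass):

import torch

x = torch.randn(1000, 1000, device="cuda", requires_grad=True)   # ~4 MB
z = torch.sigmoid(x)          # ~4 MB more: autograd saves z to compute dz/dx later
L = z.sum()
print(torch.cuda.memory_allocated() / 1000**2, "MB before backward")
L.backward()                  # uses the saved z, since dz/dx = z * (1 - z)
print(x.grad.nelement() * x.grad.element_size() / 1000**2, "MB more for the gradient")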

Optimizer and momentum

Note that the optimizer also takes up GPU memory!

Why? Look at this formula, the typical update rule of SGD with momentum:

v = ρ × v + gradient
W = W − lr × v

When the weights W are updated, a saved intermediate variable (the momentum buffer v) is produced for every parameter, which means that during optimization the GPU memory occupied by the model's params roughly doubles.

Of course, this is just the SGD optimizer; more complex optimizers whose calculations need more intermediate state occupy even more memory (Adam, for example, keeps two extra buffers per parameter).
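The extra state can be inspected directly in the optimizer; a sketch with SGD plus momentum:

import torch

w = torch.randn(1000, 1000, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.01, momentum=0.9)
w.sum().backward()
opt.step()                                   # the first step creates the momentum buffer v
buf = opt.state[w]["momentum_buffer"]        # one w-sized tensor of extra state
print(buf.nelement() * buf.element_size() / 1000**2, "MB of optimizer state")
# Adam keeps two such buffers per parameter (exp_avg and exp_avg_sq).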

Which layers in the model occupy GPU memory?

Layers that have parameters occupy GPU memory for those parameters. Our usual convolution layers do occupy it, while the frequently used activation layer ReLU, having no parameters, occupies none.

The layers that occupy GPU memory are generally:

Convolution layers, usually Conv2d

Fully connected layers, i.e. Linear

BatchNorm layers

Embedding layers

On the other hand, the layers that occupy no GPU memory are:

The activation layer ReLU mentioned just now

Pooling layers

Dropout layers

The specific calculation methods (bias terms omitted):

Conv2d(Cin, Cout, K): number of parameters: Cin × Cout × K × K

Linear(M -> N): number of parameters: M × N

BatchNorm(N): number of parameters: 2 × N

Embedding(N, W): number of parameters: N × W
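These formulas can be checked against PyTorch directly; a sketch (bias disabled where the formula omits it):

import torch.nn as nn

layers = [
    nn.Conv2d(3, 64, kernel_size=3, bias=False),   # 3 x 64 x 3 x 3 = 1728
    nn.Linear(1024, 512, bias=False),              # 1024 x 512 = 524288
    nn.BatchNorm2d(64),                            # 2 x 64 = 128 (weight + bias)
    nn.Embedding(10000, 300),                      # 10000 x 300 = 3000000
]
for layer in layers:
    print(type(layer).__name__, sum(p.numel() for p in layer.parameters()))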

Extra GPU memory

To sum up, the GPU memory occupied during training falls into the following categories:

Parameters in the model (convolution layers and other layers with parameters)

Intermediate variables generated during the model's forward computation (the inputs and outputs of each layer as the input image passes through)

Additional intermediate variables generated during backward

Additional model-sized state generated by the optimizer during optimization

In practice, the reason we occupy somewhat more GPU memory than the theoretical calculation is mostly some additional overhead inside the deep learning framework; still, the theoretical figure from the formulas above will not differ much from the actual usage.
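To compare theory with reality, PyTorch exposes its allocator statistics; a sketch assuming a CUDA device:

import torch

x = torch.randn(256, 3, 100, 100, device="cuda")              # theory: ~31 MB
print(torch.cuda.memory_allocated() / 1000**2, "MB held by tensors")
print(torch.cuda.memory_reserved() / 1000**2, "MB reserved by the caching allocator")
# The gap between the two numbers is part of the framework overhead mentioned above.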

That is all the content of the article "How to calculate the GPU memory occupied by models and intermediate variables in deep learning". Thank you for reading, and I hope the content shared here helps you!
