Deep Learning GPU Purchase Guide: Which Graphics Card Is Worthy of My Alchemy Furnace?


Recently, Tim Dettmers, a well-known benchmarking blogger who received PhD offers from Stanford, UCL, CMU, and NYU and is now a PhD student at the University of Washington, published an in-depth assessment of GPUs for deep learning on his own website: which cards are the kings of performance and cost-effectiveness?

It is well known that for deep learning and neural network workloads it is better to use a GPU than a CPU: on neural networks, even a relatively low-end GPU will outperform a CPU.

Deep learning is computationally demanding, and to a large extent your choice of GPU will fundamentally determine your deep learning experience.

But choosing the right GPU is itself a headache.

How do you avoid pitfalls, and how do you make a cost-effective choice?

Drawing on his own experience, Dettmers wrote a long article on what kind of GPU the field of deep learning actually needs, and he closes with his recommended GPUs for DL.

Tim Dettmers' research focuses on representation learning and hardware-aware deep learning, and the website he runs is well known in both the deep learning and computer hardware communities.

All the GPUs Dettmers recommends in this article are from Nvidia; he evidently believes that AMD is not yet a serious contender for machine learning.

The original link is posted below:

https://timdettmers.com/2023/01/16/which-gpu-for-deep-learning/#GPU_Deep_Learning_Performance_per_Dollar

Advantages and disadvantages of the RTX 40 and 30 series

Compared with the Turing-architecture RTX 20 series, the new Ampere-architecture RTX 30 series brought more advantages, such as sparse network training and inference. Other features, such as the new data types, should be seen more as ease-of-use features, since they deliver the same performance boost as on Turing but without any extra programming effort.

The Ada-architecture RTX 40 series brings further advances, such as the Tensor Memory Accelerator (TMA) and 8-bit floating point (FP8), discussed below. Compared to the RTX 30, the RTX 40 series has similar power and temperature issues, and the melting power-connector problem on the RTX 40 can easily be avoided by connecting the power cable properly.

Sparse network training

Ampere allows fine-grained structured sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of four elements each. Now imagine that two of these four elements are zero. Figure 1 shows what this looks like.

Figure 1: the sparsity structure supported by the sparse matrix multiplication feature of Ampere GPUs. When you multiply this sparse weight matrix with some dense inputs, Ampere's sparse tensor core feature automatically compresses the sparse matrix into a dense representation that is half the size, as shown in Figure 2.

After compression, the densely compressed matrix tile is fed into the tensor core, which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup, since the bandwidth requirements during matrix multiplication from shared memory are halved.
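To make the 2:4 pattern concrete, here is a small PyTorch sketch (my own illustration, not Nvidia's actual hardware path or the cuSPARSELt API) that prunes a weight matrix so every group of four consecutive elements keeps only its two largest-magnitude entries:

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude entries in every group of four
    consecutive weights (the 2:4 fine-grained structured sparsity that
    Ampere's sparse tensor cores can accelerate)."""
    w = weight.reshape(-1, 4)                 # groups of 4 elements
    keep = w.abs().topk(2, dim=1).indices     # the 2 largest |w| per group
    mask = torch.zeros_like(w, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (w * mask).reshape(weight.shape)

w = torch.randn(8, 8)
w_sparse = prune_2_to_4(w)
# every group of 4 now has at most 2 nonzero entries
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```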

Figure 2: the sparse matrix is compressed into a dense representation before the matrix multiplication. I worked on sparse network training in my research, and I also wrote a blog post about sparse training. One criticism of my work was: "You reduce the FLOPS the network needs, but that doesn't produce speedups, because GPUs cannot do fast sparse matrix multiplication."

With the sparse matrix multiplication capability added to Tensor Cores, my algorithm, and other sparse training algorithms, now actually provide up to a 2x speedup during training.

The sparse training algorithm I developed has three stages: (1) determine the importance of each layer; (2) remove the smallest, least important weights; (3) grow new weights proportional to the importance of each layer (a rough sketch of these stages follows below).
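A schematic PyTorch sketch of those three stages is below. The layer-importance criterion (mean gradient magnitude of surviving weights) and the regrowth initialization are assumed stand-ins for illustration, not Dettmers' actual sparse-momentum algorithm:

```python
import torch

def sparse_redistribution_step(layers, prune_frac=0.2):
    """One prune-and-regrow step; assumes .weight.grad is populated."""
    # (1) score each layer's importance (illustrative criterion: mean
    #     gradient magnitude over its surviving, nonzero weights)
    importance = torch.tensor(
        [l.weight.grad.abs()[l.weight != 0].mean() for l in layers])
    importance /= importance.sum()

    total_pruned = 0
    for layer in layers:
        w = layer.weight.data
        alive = w != 0
        n_prune = int(prune_frac * alive.sum())
        if n_prune == 0:
            continue
        # (2) remove the n_prune smallest-magnitude surviving weights
        vals = w.abs().masked_fill(~alive, float("inf")).flatten()
        drop = vals.topk(n_prune, largest=False).indices
        w.view(-1)[drop] = 0.0
        total_pruned += n_prune

    # (3) grow new weights in each layer proportional to its importance
    for layer, imp in zip(layers, importance):
        n_grow = int(total_pruned * imp)
        dead = (layer.weight.data == 0).flatten().nonzero().squeeze(-1)
        pick = dead[torch.randperm(len(dead))[:n_grow]]
        # re-enable with a tiny random init (assumed choice)
        layer.weight.data.view(-1)[pick] = 1e-3 * torch.randn(len(pick))
```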

Although this feature is still experimental and training sparse networks is not yet common, having it on your GPU means you are ready for the future of sparse training.

Low-precision computation

In my work, I have previously shown that new data types can improve stability during low-precision backpropagation.

Figure 4: low-precision deep learning with 8-bit data types. Deep learning training benefits from highly specialized data types. Currently, the biggest problem with doing stable backpropagation in 16-bit floating point (FP16) is that the plain FP16 data type only supports numbers in the range [-65504, 65504]. If your gradients slide across this range, they explode into NaN values.

To prevent this during FP16 training, we usually perform loss scaling, multiplying the loss by a small number before backpropagating to keep the gradients from exploding.
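In PyTorch, for example, this machinery is packaged in torch.cuda.amp. Here is a minimal sketch of an FP16 training step; the model, criterion, optimizer, and loader are assumed to exist:

```python
import torch

# outside [-65504, 65504], FP16 overflows:
torch.tensor(70000.0, dtype=torch.float16)  # -> inf

scaler = torch.cuda.amp.GradScaler()   # maintains the loss-scale factor

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # backprop through the scaled loss
    scaler.step(optimizer)             # unscales grads; skips step on inf/NaN
    scaler.update()                    # adapts the scale factor over time
```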

The BrainFloat 16 format (BF16) uses more bits for the exponent, so its range of representable numbers is the same as FP32's. BF16 has lower precision, i.e. fewer significand digits, but gradient precision is not that important for learning.

So what BF16 gives you is that you no longer need any loss scaling, and you do not have to worry about gradients exploding. We should therefore see a gain in training stability with the BF16 format, at the cost of a slight loss of precision.
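That trade-off is easy to verify from the numeric limits of each format, for example via torch.finfo:

```python
import torch

fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)
print(fp16.max)  # 65504.0      -- narrow range, gradients can overflow
print(bf16.max)  # ~3.39e38     -- same dynamic range as FP32
print(fp16.eps)  # ~9.77e-04    -- finer significand precision
print(bf16.eps)  # 7.8125e-03   -- coarser precision, fine for gradients
```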

What does this mean for you? With BF16 precision, training may be more stable than with FP16 precision while providing the same speedups. With TF32 precision, you get stability close to FP32 while providing speedups close to FP16.

The good news is that to use these data types, you can simply replace FP32 with TF32 and FP16 with BF16, with no code changes required.
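In PyTorch, for instance, the switch amounts to a couple of flags plus an autocast context. This is a sketch that assumes a model, criterion, and inputs already exist:

```python
import torch

# let FP32 matmuls/convolutions run on tensor cores as TF32
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# run the forward pass in BF16 instead of FP16: no GradScaler needed,
# since BF16 has the same dynamic range as FP32
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = criterion(model(inputs), targets)
loss.backward()
```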

In general, though, these new data types can be seen as lazy data types, in the sense that you could have gotten all their benefits with the old data types plus some extra programming effort (proper loss scaling, initialization, normalization, using Apex).

As such, these data types do not provide extra speed; they improve the ease of using low precision for training.

Fan design and GPU temperature

While the new fan design of the RTX 30 series performs very well at cooling the GPU, different fan designs on non-founders-edition cards can be more problematic.

If your GPU heats up beyond 80°C, it will throttle itself and slow down its compute speed and power. The solution is to use PCIe extenders to create space between GPUs.

Spreading GPUs out with PCIe extenders is very effective for cooling, and other PhD students at the University of Washington and I have used this setup with great success. It does not look pretty, but it keeps your GPUs cool!

The system below has been running for four years without any problems. You can do the same if you do not have enough space to fit all your GPUs in the PCIe slots.

Figure 5: a 4-GPU system with PCIe extenders. It looks like a mess, but it dissipates heat very efficiently.

An elegant solution to the power problem

It is possible to set a power limit on your GPUs: you can programmatically cap an RTX 3090 at 300W instead of its standard 350W. Across a 4-GPU system, that saves 200W, which might be just enough to build a 4x RTX 3090 system with a 1600W PSU.

It also helps keep the GPUs cool. Setting a power limit thus addresses the two main problems of a 4x RTX 3080 or 4x RTX 3090 setup, cooling and power, at once. For a 4x setup you still need GPUs with effective fans, but it solves the power problem.
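One way to set such a limit programmatically is nvidia-smi (for example, sudo nvidia-smi -i 0 -pl 300) or its NVML bindings. Here is a sketch using the pynvml package; changing limits requires root, and the 300W cap is the example value from above:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # current limit (milliwatts) and temperature, for reference
    watts = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"GPU {i}: {watts:.0f} W limit, {temp} C")
    # cap each card at 300 W (e.g. RTX 3090: 350 W -> 300 W)
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 300_000)
pynvml.nvmlShutdown()
```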

Figure 6: reducing the power limit has a slight cooling effect. Lowering the power limit of an RTX 2080 Ti by 50-60W makes temperatures drop slightly and fans run quieter. You may ask, "Won't this slow down the GPU?" Yes, it will, but the question is by how much.

I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits, measuring the time for 500 mini-batches of BERT Large during inference (excluding the softmax layer). BERT Large inference was chosen because it puts the most stress on the GPU.
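A minimal version of such a timing loop might look as follows; the bert-large-uncased checkpoint, batch shape, and warm-up count are assumptions for illustration, not the exact setup behind Figure 7:

```python
import time
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-large-uncased").cuda().eval()
batch = torch.randint(0, 30000, (8, 512), device="cuda")  # assumed shape

with torch.no_grad():
    for _ in range(10):            # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(500):           # 500 mini-batches, as in the text
        model(batch)
    torch.cuda.synchronize()       # wait for all GPU kernels to finish
print(f"{(time.time() - start) / 500 * 1000:.1f} ms per batch")
```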

Figure 7: measured slowdown at a given power limit on an RTX 2080 Ti. As we can see, setting the power limit does not seriously affect performance: reducing the limit by 50W cuts performance by only 7%.

There was a misconception that the RTX 4090 power cables melt because they were bent too much. In reality, that was the cause for only 0.1% of users; the main problem was cables that were not plugged in correctly.

So using an RTX 4090 is perfectly safe if you follow the installation instructions below.

1. If you use old cables or an old GPU, make sure the contacts are free of debris and dust.

2. Use the power connector and push it into the socket until you hear a click. This is the most important part.

3. Test the fit by wiggling the power cable left and right. The cable should not move.

4. Visually check the contact with the socket: there should be no gap between the cable and the socket.

8-bit floating-point support in the H100 and RTX 40 series

Support for 8-bit floating point (FP8) is a huge advantage of the RTX 40 series and H100 GPUs.

With 8-bit inputs, you can load the data for matrix multiplication twice as fast and store twice as many matrix elements in your caches, which on the Ada and Hopper architectures are very large. With FP8 tensor cores, an RTX 4090 delivers 0.66 PFLOPS of compute.

That is more FLOPS than the entire world's fastest supercomputer in 2007. Four RTX 4090s with FP8 compute rival the world's fastest supercomputer of 2010.

As you can see, the best 8-bit baselines fail to deliver good zero-shot performance. The method I developed, LLM.int8(), can do Int8 matrix multiplication with results equal to the 16-bit baseline.

But Int8 was already supported by the RTX 30 / A100 / Ampere generation of GPUs, so why is FP8 such a big upgrade in the RTX 40? The FP8 data type is much more stable than Int8 and is easy to use in things like layer norms or nonlinear functions, which are difficult to do with integer data types.

This will make its use in training and inference very straightforward. I think that will make FP8 training and inference relatively common within a few months.

Below you can see one of the main results on float vs. integer data types from this paper: bit for bit, the FP4 data type preserves more information than Int4 and thus improves mean LLM zero-shot accuracy across four tasks.
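The brittleness of integer formats is easy to feel with a plain absmax Int8 round-trip (the standard baseline, not LLM.int8() itself): a single outlier forces a coarse quantization step onto every other value.

```python
import torch

def int8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Absmax Int8 quantization: scale into [-127, 127], round, rescale."""
    scale = 127.0 / x.abs().max()
    return torch.round(x * scale).clamp(-127, 127) / scale

x = torch.randn(1024)
print((x - int8_roundtrip(x)).abs().mean())  # small round-trip error

x[0] = 60.0                                  # one outlier feature
print((x - int8_roundtrip(x)).abs().mean())  # error grows ~an order of magnitude
```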

GPU deep learning performance ranking

First, look at the raw GPU performance ranking in the chart below to see who comes out on top.

We can see a huge gap between the 8-bit performance of the H100 GPUs and older cards that are only optimized for 16-bit performance.

The chart shows raw relative GPU performance. For example, for 8-bit inference, an RTX 4090 has about 0.33x the performance of an H100 SXM.

In other words, an H100 SXM is three times faster than an RTX 4090 at 8-bit inference.

For this data, he did not model 8-bit compute for the older GPUs.

That is because 8-bit inference and training are far more effective on Ada/Hopper GPUs, and the Tensor Memory Accelerator (TMA) saves a lot of registers, which are very precious in 8-bit matrix multiplication.

Ada/Hopper also have FP8 support, which makes 8-bit training much more effective; on Hopper/Ada, 8-bit training performance is likely 3-4x that of 16-bit training.

For older GPUs, Int8 inference performance is close to their 16-bit inference performance.

The question, then, is how much compute each dollar buys, for those who cannot simply pay for raw GPU performance.

For those on a tighter budget, the next chart is his performance-per-dollar ranking, based on the price and performance statistics of each GPU, which reflects each card's value for money.

Choosing a GPU that can handle your deep learning tasks and fits your budget comes down to the following steps:

First, determine how much GPU memory you need (at least 12GB for image generation, at least 24GB for working with transformers).

Then choose between 8-bit and 16-bit; going with 16-bit is recommended, since 8-bit still struggles with complex coding tasks.

Finally, based on the metrics in the chart above, find the GPU with the highest relative performance per dollar (a toy version of this ranking step is sketched below).
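As an illustration of that last step, the sketch below ranks a few GPUs by performance per dollar. The performance, price, and memory numbers are made-up placeholders, so substitute the real values from the charts:

```python
# Hypothetical (made-up) relative-performance and price numbers,
# purely to illustrate the ranking step.
gpus = {
    "RTX 4070 Ti": {"perf": 0.42, "price": 850,  "vram_gb": 12},
    "RTX 3080":    {"perf": 0.38, "price": 750,  "vram_gb": 10},
    "RTX 4090":    {"perf": 1.00, "price": 1800, "vram_gb": 24},
}

min_vram = 12  # step 1: memory requirement (e.g. image generation)
ranked = sorted(
    (name for name in gpus if gpus[name]["vram_gb"] >= min_vram),
    key=lambda name: gpus[name]["perf"] / gpus[name]["price"],
    reverse=True,
)
for name in ranked:  # step 3: best performance per dollar first
    print(name, gpus[name]["perf"] / gpus[name]["price"])
```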

We can see that the RTX 4070 Ti is the most cost-effective for 8-bit and 16-bit inference, while the RTX 3080 is the most cost-effective for 16-bit training.

Although these GPUs are the most cost-effective, their memory is a weakness: 10GB or 12GB may not cover every need.

But they may be ideal GPUs for beginners just getting into deep learning.

Several of these GPUs are also a great fit for Kaggle competitions, where how you work matters more than model size, so many of these smaller GPUs can deliver good results there.

Kaggle bills itself as the world's largest platform for data scientists; it is full of experts and friendly to newcomers.

The best GPUs for academic research and server operation appear to be the A6000 Ada GPUs.

At the same time, the H100 SXM is also very cost-effective, with large memory and very strong performance.

Speaking from personal experience, if I were building a small cluster for a company or academic lab, I would recommend 66-80% A6000 GPUs and 20-33% H100 SXM GPUs.

After all these general recommendations, it is finally time for the specific GPU picks.

Tim Dettmers made a dedicated "GPU purchase flow chart": if your budget is sufficient, go for the higher-end configurations; if not, consult the cost-effective choices.

First, one point to emphasize: whichever GPU you choose, make sure its memory meets your needs. To figure that out, ask yourself a few questions:

What do I want to do with the GPU? Compete on Kaggle, learn deep learning, do CV/NLP research, or play with small projects?

If you have a sufficient budget, you can review the benchmarks above and pick the best GPU for you.

You can also estimate the GPU memory you need by running your problem on vast.ai or Lambda Cloud for a while to see whether it meets your needs.
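On a rented cloud GPU, peak memory is easy to read off after a few representative training steps. A PyTorch sketch (model, criterion, and a batch are assumed):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few representative forward/backward steps here ...
loss = criterion(model(inputs), targets)   # assumed objects
loss.backward()
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gb:.1f} GiB")
```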

If you only need a GPU occasionally (a few hours every few days) and you do not need to download and process large datasets, vast.ai or Lambda Cloud work well.

However, if you use the GPU every day for a month with high utilization (12 hours a day), cloud GPUs are usually not a good choice.

References:

https://timdettmers.com/2023/01/16/which-gpu-for-deep-learning/#more-6

https://timdettmers.com/

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era); editor: Joey David.
