What is the biggest reason why GPU cannot completely replace CPU?

I've been writing CUDA lately and it has nearly driven me crazy, so let me get some of it off my chest. I'll use NVIDIA GPUs as the example throughout.

Compared with a CPU, a GPU has several characteristics:

Compute resources are abundant.

The die area devoted to control logic is relatively small.

Memory bandwidth is high. Discrete GPUs currently use GDDR5 with wide bus widths; mainstream discrete-GPU memory bandwidth is roughly ten times that of a CPU (about 200 GB/s vs. 20 GB/s).

Memory latency is also high. Where a CPU uses multi-level caches to hide latency, a GPU uses multithreading to hide it.

Register resources are abundant: an SM has 64K 32-bit registers, and a single thread can use up to 255 of them.

Therefore, a GPU is only suited to tasks with few branches, large volumes of data, and low sensitivity to latency.

First, look at the structure of one SM (streaming multiprocessor) of a GTX 1080 (compute capability 6.1).

[Figure: SM block diagram (CC 6.0), from NVIDIA]

As you can see, an SM contains 4 warps, and each warp contains 32 CUDA cores [1]. So, is a warp equivalent to a 32-core CPU?

1. GPU is not suited to handling large numbers of branches

As noted above, the GPU's control units occupy relatively little area. To economize on control logic, the 32 CUDA cores in a warp must execute the same instruction at all times. In other words, the PCs (program counters) of all CUDA cores within a warp are always synchronized [2], though their memory access addresses may differ and each core has its own independent register set. This scheme is called SIMT (Single Instruction, Multiple Threads).

At this point you might ask: if this warp always executes the same instruction, what happens when the code branches?

That's a good question. In fact, the CUDA cores in a warp don't always all execute the same instruction, and they don't have to.

[Figure: an example of warp divergence (http://15418.courses.cs.cmu.edu/spring2013/article/11)]

This results in warp divergence (see the figure above). In the extreme case where every core's instruction flow differs, only one core in the warp may be doing useful work at a time, cutting efficiency to 1/32.
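To make divergence concrete, here is a minimal sketch (the kernel names and the even/odd condition are my own illustration, not from the original article). In the first kernel, even and odd lanes of the same warp take different paths, so the warp runs both paths serially with half its lanes masked off each time; the second kernel makes the condition uniform within each warp, so no divergence occurs:

```cuda
// Divergent: lanes of one warp disagree on the branch, so the warp
// executes both sides one after the other with lanes masked off.
__global__ void branch_divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = 100.0f;   // even lanes run; odd lanes sit idle
    else
        out[tid] = 200.0f;   // odd lanes run; even lanes sit idle
}

// Uniform: the condition is constant across each 32-thread warp
// (warpSize is a built-in device variable), so every warp takes
// exactly one path and nothing is serialized.
__global__ void branch_uniform(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0)
        out[tid] = 100.0f;
    else
        out[tid] = 200.0f;
}
```

Both kernels write an alternating pattern of values; only the granularity of the alternation differs, yet the second one avoids the serialization entirely.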

2. GPU requires strict data alignment

Although a GPU has so many cores and its bandwidth looks enormous, memory accesses by a warp are in fact grouped: only a contiguous, aligned 128-byte segment can be read in a single transaction [3] (which happens to be warp size 32 × 4 bytes).

[Figure: a warp accessing one contiguous, aligned 128-byte segment]

The figure above shows the most efficient pattern. If the accesses are completely scattered, efficiency can drop to 1/32, as shown in the figure below.

[Figure: a warp making completely scattered accesses]
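The contrast is easy to see in code. Below is a hedged sketch (kernel names are illustrative): in the first kernel the 32 reads of a warp fall into one 128-byte transaction, while in the second a large stride spreads them over up to 32 separate transactions for the same amount of useful data:

```cuda
// Coalesced: thread k of a warp reads element k, so consecutive
// threads touch consecutive 4-byte floats, and one aligned 128-byte
// transaction serves the whole warp.
__global__ void copy_coalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: with stride >= 32, each thread of a warp lands in a
// different 128-byte segment, so the hardware issues up to 32
// transactions and effective bandwidth falls toward 1/32.
__global__ void copy_strided(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```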

Moreover, the caching strategy of an NVIDIA GPU differs from a CPU's: there is no temporal locality.

DIFFERENCE BETWEEN CPU L1 CACHE AND GPU L1 CACHE

The CPU L1 cache is optimized for both spatial and temporal locality. The GPU L1 cache is designed for spatial but not temporal locality. Frequent access to a cached L1 memory location does not increase the probability that the data will stay in cache.

-- "Professional CUDA Programming"

You may ask again: a CPU cache line is 64 bytes, only half the GPU's 128. What's the difference? Well, on a CPU each core has its own L1 cache, whereas on a GPU two warps share one L1 [4]. And a warp cannot execute until the data for every core in it is ready.

Of course, with such stringent memory access requirements, a straightforward C = A + B poses no problem. Real-world access patterns are rarely so well aligned, though, so NVIDIA has put in a great deal of effort, providing caches, Shared Memory, Constant Cache, and other components so that programmers can still access memory efficiently.
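Shared Memory is the most common tool for repairing an otherwise uncoalesced pattern. Here is a hedged sketch of the classic tiled matrix transpose (my own illustration, not from the original article; it assumes n is a multiple of 32 and a 32×32 thread block): a naive transpose makes either its reads or its writes strided, but staging a tile in Shared Memory lets both global accesses stay coalesced:

```cuda
#define TILE 32

// Launch with dim3 block(TILE, TILE) and dim3 grid(n/TILE, n/TILE).
__global__ void transpose_smem(const float *in, float *out, int n)
{
    // The +1 column of padding sidesteps shared-memory bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

    __syncthreads();                                  // tile is full

    x = blockIdx.y * TILE + threadIdx.x;              // swapped block
    y = blockIdx.x * TILE + threadIdx.y;              // coordinates
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

The transposition itself happens inside the fast on-chip tile, so neither the global read nor the global write ever straddles 128-byte segments unnecessarily.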

3. GPU memory access latency is high

Memory access latency is closely tied to the alignment issue of the previous section, but it deserves its own discussion.

You may also have noticed that an SM (CC 6.1) can have up to 1024 threads in flight at once, yet an SM contains only four warps totaling 4 × 32 = 128 CUDA cores. Clearly, an SM can host far more threads than it has CUDA cores. Why is that?

Let's look at typical GPU memory access latencies (these figures, from "Professional CUDA C Programming", may be a little dated):

10-20 cycles for arithmetic operations

400-800 cycles for global memory accesses

One global memory access costs as much as roughly 40 arithmetic operations! Yet the GPU's memory bandwidth is actually very high. So how do we keep the CUDA cores as fully loaded as possible? This is where SIMT comes in.

No problem: if one warp (here meaning a group of 32 threads; the term conflates the scheduling unit with the hardware unit) is waiting for its data to arrive, we simply execute another group of 32 threads. The latency itself is still large, but both the CUDA cores and the memory stay fully utilized.
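In practice, hiding latency this way just means launching far more threads than there are CUDA cores, so the scheduler always has an eligible warp to run. A minimal sketch (sizes are illustrative; data initialization is omitted for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // while this warp waits on its loads,
                                 // the SM runs other resident warps
}

int main()
{
    int n = 1 << 24;                       // ~16M elements
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // 131072 blocks of 128 threads: far more resident warps per SM
    // than CUDA cores, which is what hides the 400-800 cycle latency.
    int block = 128;
    int grid  = (n + block - 1) / block;
    saxpy<<<grid, block>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```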

GPU thread switching is different from CPU thread switching. Switching threads on a CPU requires saving the context, storing all the registers out to main memory. But recall from the beginning that an SM has as many as 64K registers (note: 64K registers, not 64 KB; each register is 4 bytes), while each thread can use at most 255 of them. Notice anything?

256 × 4 × 32 = 32K. In other words, even at maximum register usage, the threads actually executing occupy only half of the register file. What are the spare registers for?

In fact, GPU thread switching merely switches the active register group; the latency is extremely low, with essentially zero cost. And since threads rarely use anywhere near 255 registers, a CUDA core can in practice hop among roughly eight threads at any moment, executing whichever one has its data ready [5]. This is where the GPU beats the CPU, and it is precisely how the GPU makes up for its latency disadvantage.
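How many threads can actually stay resident is capped by, among other things, per-thread register usage (reported by nvcc -Xptxas -v). The CUDA runtime can tell you the resulting limit; a small sketch (the kernel is just a stand-in):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *p)
{
    p[threadIdx.x] += 1.0f;
}

int main()
{
    int block = 128;
    int numBlocks = 0;

    // Ask how many blocks of this kernel fit on one SM at once;
    // registers per thread are one of the limiting resources.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, my_kernel, block, /*dynamicSMemSize=*/0);

    printf("resident blocks per SM: %d (%d threads)\n",
           numBlocks, numBlocks * block);
    return 0;
}
```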

All in all, GPU memory accesses still need to be aligned and the latency is still large, but the maximum throughput (the amount of data processed over a longer window of time, in the right scenario) is far higher than a CPU's.

[note 1] LD/ST units are the load/store units that access device memory; the SFU is the special function unit for transcendental functions.

[note 2] The Volta architecture is a major update: it now gives each thread its own independent PC.

[note 3] Reads that go through the L1 cache use 128-byte units; L1 caching can also be turned off, in which case the unit is 32 bytes. Write operations can use 32-, 64-, or 128-byte units. This article refers to Global Memory accesses.

[note 4] The cache architecture of NVIDIA GPUs has changed considerably over recent generations; analyze the specific architecture you are targeting.

[note 5] In fact, thread switching happens at warp granularity.
