
Hierarchical structure of memory


Storage technology

When we buy a computer, we pay attention to the performance of the memory, the processor, the hard disk, and other components. We want the memory to be as large as possible, and we want the hard disk to be solid state.

Have you ever spent the better part of a day writing a document, only to have the computer accidentally power off and hours of work need rewriting? Have you ever wondered why that information is lost when the power goes off, yet the files already saved on the hard drive are not?

The information that gets lost must have something to do with electricity; otherwise it would not vanish the moment the power is cut. Memory is exactly such a component, known more formally as random access memory.

There are two types of random access memory (RAM): static and dynamic. Static RAM (SRAM) stores each bit in a bistable memory cell. What does bistable mean? The cell has exactly two stable states; other states exist, but even a slight disturbance will immediately push the cell into one of the two stable states.

Dynamic RAM (DRAM) stores each bit as charge on a capacitor. Anyone who has studied physics knows that a capacitor leaks, so a DRAM cell loses its charge (its information) within roughly 10 to 100 ms. But remember that a computer's running time is measured in nanoseconds: the clock cycle of a 1 GHz processor is 1 ns, and today's processors run well above 1 GHz, so a millisecond is an eternity compared with a nanosecond. As long as the cells are refreshed in time, the computer never has to worry about losing this information.
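To get a feel for the scale gap, here is a minimal sketch that turns the figures above into cycle counts; the 1 GHz clock and the 10 ms retention window are the numbers from the text, everything else is illustrative:

#include <stdio.h>

int main(void) {
    double clock_hz     = 1e9;   /* 1 GHz processor, as in the text   */
    double retention_ms = 10.0;  /* pessimistic DRAM retention window */

    double cycle_ns = 1e9 / clock_hz;                 /* -> 1 ns per cycle */
    double cycles   = retention_ms * 1e-3 * clock_hz; /* -> 10,000,000     */

    /* Roughly ten million cycles pass before a refresh is needed. */
    printf("cycle time: %.0f ns, cycles before refresh: %.0f\n",
           cycle_ns, cycles);
    return 0;
}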

DRAM chips are packaged into memory modules. One level larger than memory is the disk. We already summarized hard drives in an earlier article, so rather than repeat that, let's go straight to locality.

Locality

Locality comes in two distinct forms: temporal locality and spatial locality. In a program with good temporal locality, a memory location that has been referenced once is likely to be referenced again many times in the near future. Similarly, in a program with good spatial locality, once a memory location is referenced, the program is likely to reference a nearby memory location in the near future.

Don't underestimate locality: a program with good locality runs faster than one with poor locality, and understanding it is part of the road to becoming a senior programmer. Even keeping local copies of frequently used files from a network drive is an everyday use of temporal locality.

The following code is very simple. Look at the vector v: its elements are read one after another, in the order they are stored in memory, so the function has good spatial locality with respect to v. But each element is accessed only once, so its temporal locality with respect to v is poor. In fact, for each variable in the loop body, this function has either good spatial locality (v) or good temporal locality (sum).

int sumvec(int v[N]) {
    int i, sum = 0;

    for (i = 0; i < N; i++) {
        sum += v[i];
    }
    return sum;
}

A pattern like the one above, which visits the elements one after another, is called a stride-1 reference pattern. In general, as the stride increases, spatial locality decreases.

Locality applies not only to data references but also to instruction fetches. In a for loop, for example, the instructions in the loop body are executed in sequence, and they are executed many times, so the loop has good spatial locality and good temporal locality.

Cache

The access times of different storage technologies vary enormously, yet what we want is storage that is both fast and large, and no single technology delivers both. To make programs run faster, computer designers insert caches between the levels: a cache sits between the CPU and main memory, main memory in turn serves as a cache for the disk, and the local disk serves as a cache for web servers. Visit a web page several times and you will notice some network requests come back with status code 304, meaning the response was read from the local cache.

As shown in the figure below, a cache is usually organized in the following form: the computer fetches an instruction or a datum from a specific address, and that address is split into fields that map directly onto the cache. The detailed walkthrough below should make this easier to understand.

A direct-mapped cache has exactly one line per set. The process by which the cache determines whether a request hits, and then extracts the requested word, breaks down into three steps: set selection, line matching, and word extraction.

For example, when the CPU executes an instruction that reads a word w from memory, the cache first extracts the s set-index bits from the middle of w's address and maps them to the corresponding set; it then uses the t tag bits to determine whether a copy of w is stored in that set; finally, it uses the b block-offset bits to determine where the desired word begins inside the block.

Read the figure above together with the table below; stare at them side by side for a while and the mapping should suddenly click.
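A minimal C sketch of this address split may also help. The field widths below (b = 2, s = 3) are illustrative choices, not taken from the figure:

#include <stdint.h>
#include <stdio.h>

#define B 2   /* b block-offset bits -> 4-byte blocks (illustrative) */
#define S 3   /* s set-index bits    -> 8 sets        (illustrative) */

int main(void) {
    uint32_t addr = 0x1234;

    uint32_t offset = addr & ((1u << B) - 1);         /* low b bits    */
    uint32_t set    = (addr >> B) & ((1u << S) - 1);  /* middle s bits */
    uint32_t tag    = addr >> (B + S);                /* the rest      */

    printf("addr 0x%x -> tag 0x%x, set %u, offset %u\n",
           addr, tag, set, offset);
    return 0;
}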

A direct-mapped cache suffers conflict misses precisely because each set has only one line. A set-associative cache relaxes this restriction: each set holds more than one cache line, so after set selection you must search all the lines in the selected set for a match.
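A hedged sketch of that line-matching step might look like this (the cache_line struct and the function name match_line are illustrative, not from the article):

#include <stddef.h>

typedef struct {
    int      valid;  /* does this line hold meaningful data?   */
    unsigned tag;    /* tag bits of the cached block's address */
} cache_line;

/* Search the E lines of the selected set; return the matching line
   on a hit, or NULL on a miss. */
cache_line *match_line(cache_line set[], int E, unsigned tag) {
    for (int i = 0; i < E; i++) {
        if (set[i].valid && set[i].tag == tag) {
            return &set[i];
        }
    }
    return NULL;
}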

Of course, we can keep increasing the number of lines per set until all the cache lines sit in a single set: this is the fully associative cache, which has exactly one set. Because there is only one set, the address is divided without a set-index field, as shown in the following figure.

Write cache-friendly code

float dotprod(float x[8], float y[8]) {
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

This function is concise: it computes the dot product of two vectors. With respect to x and y it has good spatial locality, yet with a direct-mapped cache its hit rate can still be poor.

As you can see from the table, every reference to x and y causes a conflict miss, because we thrash back and forth between the blocks of x and the blocks of y; that is, the cache repeatedly loads and evicts the same cache blocks.
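To make the thrashing concrete, here is a back-of-the-envelope sketch. It assumes x starts at byte 0 with y immediately after it at byte 32, and a tiny cache with 2 sets of 16-byte blocks; these numbers are illustrative, not taken from the table:

#include <stdio.h>

int main(void) {
    unsigned block = 16, sets = 2;   /* illustrative cache geometry */

    for (unsigned i = 0; i < 8; i++) {
        unsigned xa = i * 4;         /* address of x[i] (4-byte floats) */
        unsigned ya = 32 + i * 4;    /* address of y[i], right after x  */
        printf("x[%u] -> set %u, y[%u] -> set %u\n",
               i, (xa / block) % sets, i, (ya / block) % sets);
    }
    return 0;
}

Under these assumptions, every x[i] and its partner y[i] land in the same set, so each access evicts the block the other just loaded.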

A tiny change greatly improves the hit rate, which makes the program run faster: change float x[8] to float x[12]. The padding shifts where y's blocks fall, so x[i] and y[i] no longer map to the same set; the index mapping after the change, shown below, is very cache-friendly.
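One way to read that change, assuming x and y are defined back to back in memory: the padding must happen where the arrays are defined, since padding x by four floats moves y from byte 32 to byte 48 and onto different cache sets. A sketch:

float x[12];   /* padded: was float x[8]                        */
float y[8];    /* now starts 48 bytes after x instead of 32     */

float dotprod(void) {
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++) {
        sum += x[i] * y[i];   /* x[i] and y[i] no longer share a set */
    }
    return sum;
}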

Finally, let's look at a multi-dimensional array. The function sums all of the array's elements, and it can be written in two different ways.

// the first way: traverse row by row

int sumarrayrows(int a[M][N]) {
    int i, j, sum = 0;

    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            sum += a[i][j];
        }
    }
    return sum;
}

// the second way: traverse column by column

int sumarraycols(int a[M][N]) {
    int i, j, sum = 0;

    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            sum += a[i][j];
        }
    }
    return sum;
}

From the programming language's point of view, the two versions do the same thing: they sum all the elements of the array. But deeper analysis shows that the first version runs faster than the second. If, as in the tables below, the array is 4 x 8 and a cache block holds four ints, the second version never gets a single cache hit, while the first gets 24 hits out of 32 accesses: C stores arrays in row-major order, so the first version reads memory sequentially, missing once per block and then hitting on the next three elements, while the second version jumps a whole row ahead on every access. The cache hit patterns of the first and second versions are shown below (bold indicates a miss).
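The effect is easy to observe for yourself. A rough benchmark sketch (the array size and the use of clock() for timing are illustrative choices):

#include <stdio.h>
#include <time.h>

#define M 4096
#define N 4096

static int a[M][N];   /* large enough that the array cannot stay cached */

int main(void) {
    long sum = 0;

    clock_t t0 = clock();
    for (int i = 0; i < M; i++)          /* first way: stride-1 scan   */
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    clock_t t1 = clock();
    for (int j = 0; j < N; j++)          /* second way: stride-N jumps */
        for (int i = 0; i < M; i++)
            sum += a[i][j];

    clock_t t2 = clock();
    printf("row-major %.3fs, column-major %.3fs (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}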
