What is Cache Memory in Linux

What is cache memory in Linux? This article analyzes the question and gives answers, in the hope of helping readers who want to understand the topic find a simple, practical explanation.

Why do we need cache memory

Before we think about what a cache is, let's start with a more basic question: how does our program run? A program runs in RAM, and RAM is what we usually call DDR (DDR3, DDR4, and so on); we refer to it as main memory. When we want to run a process, we first load the executable from a flash device (for example, eMMC or UFS) into main memory and then start executing it. Inside the CPU there is a set of general-purpose registers. If the CPU needs to add 1 to a variable at address A, the operation generally breaks down into the following three steps:

The CPU reads the data at address A from main memory into the internal general-purpose register x0 (one of the general-purpose registers of the ARM64 architecture).

The CPU adds 1 to general-purpose register x0.

The CPU writes the value of general-purpose register x0 back to address A in main memory.

We can express this process as follows:
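A minimal C sketch of the same three steps (illustrative only; a compiler targeting ARM64 would typically emit roughly an ldr, an add, and an str for this):

```c
#include <stdint.h>

/* Sketch of the three steps; on ARM64 the compiler would use a
 * general-purpose register such as x0 for the intermediate value. */
void increment(volatile uint64_t *a)
{
    uint64_t x0 = *a;   /* 1. load the value at address A from main memory */
    x0 = x0 + 1;        /* 2. add 1 in a general-purpose register          */
    *a = x0;            /* 3. store the register value back to main memory */
}
```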

In reality, however, there is a huge gap between the speed of the CPU's general-purpose registers and the speed of main memory. The relationship between the two is roughly as follows:

A CPU register access generally takes less than 1 ns, while a main memory access generally takes about 65 ns, a difference of nearly a hundred times. Of the three steps above, steps 1 and 3 are therefore very slow: whenever the CPU loads from or stores to main memory, it has to wait out that 65 ns because main memory limits the speed. If we could make main memory faster, overall system performance would improve greatly.

Today's DDR devices commonly hold several GB, which is a large capacity. If we built main memory of the same capacity out of faster materials, the cost would rise sharply; it is hard to increase both the speed and the capacity of main memory while keeping the cost low. So we compromise: we build a storage device that is very fast but very small, so the cost stays modest. This storage device is called cache memory.

In hardware, the cache sits between the CPU and main memory and acts as a cache of main memory data. When the CPU tries to load or store data at an address, it first checks whether that address's data is already held in the cache. If it is, the data is fetched from the cache and returned to the CPU directly. With a cache present, the flow of the program example above becomes the following:

Direct data transfers between the CPU and main memory are replaced by direct transfers between the CPU and the cache; the cache, in turn, is responsible for exchanging data with main memory.

Multi-level cache memory

The speed of the cache also affects system performance to some extent.

In general, a cache access can complete in about 1 ns, almost as fast as a CPU register. But is that enough to satisfy our pursuit of performance? No. When the data we want is not in the cache, we still have to wait a long time for it to be loaded from main memory. To improve performance further, multi-level caches were introduced.

The cache described so far is called the L1 cache (first-level cache). We attach an L2 cache behind the L1 cache, and an L3 cache between the L2 cache and main memory. The higher the level, the slower the speed and the larger the capacity, but even the slowest level is still much faster than main memory. The relationship between the speeds of the different levels is as follows:

After three levels of cache buffering, the speed gap between each cache level and main memory narrows step by step. In a real system, how are the cache levels connected to one another in hardware? The hardware block diagram of the cache levels in the Cortex-A53 architecture looks like this:

In the Cortex-A53 architecture, the L1 cache is split into a separate instruction cache (ICache) and data cache (DCache). The L1 cache is private to each CPU: every CPU has its own. All CPUs in a cluster share one L2 cache; the L2 cache does not distinguish between instructions and data and can hold both. The L3 cache is shared among all clusters and is connected to main memory through a bus.

Cooperation between the cache levels

First, two terms: hit and miss. When the data the CPU wants to access is already in the cache, that is a "hit"; otherwise it is a "miss". How do multiple cache levels work together? Assume for now that the system has only two levels of cache.

When the CPU tries to load data from an address, it first queries the L1 cache. On a hit, the data is returned to the CPU. If the L1 cache misses, the lookup continues in the L2 cache. On an L2 hit, the data is returned to both the L1 cache and the CPU. If the L2 cache also misses, the data unfortunately has to be loaded from main memory and is then returned to the L2 cache, the L1 cache, and the CPU. This kind of multi-level behavior is called an inclusive cache.

With an inclusive cache, the data for a given address may be present in several cache levels at once. The counterpart of the inclusive cache is the exclusive cache, which guarantees that the data for an address resides in at most one level; that is, data at any address cannot be cached in both the L1 and the L2 cache at the same time.

Direct-mapped cache (Direct mapped cache)

Let's continue with some cache-related terminology. The capacity of the cache is called the cache size; it is the maximum amount of data the cache can hold. We divide the cache into many equal-sized blocks; each block is called a cache line, and its size is the cache line size.

For example, take a 64-byte cache. If we divide those 64 bytes into 64 equal blocks, each cache line is 1 byte and there are 64 cache lines. If we divide the 64 bytes into 8 equal blocks, each cache line is 8 bytes and there are 8 cache lines. In current hardware designs, the cache line size is generally between 4 and 128 bytes. Why is there no 1-byte cache line? The reason will be discussed later.

One point to note here: the cache line is the smallest unit of data transfer between the cache and main memory. What does that mean? When the CPU tries to load a single byte and the cache misses, the cache controller loads an entire cache line's worth of data from main memory into the cache in one go. For example, with a cache line size of 8 bytes, even if the CPU reads only one byte, the cache fills the whole line by loading 8 bytes from main memory after the miss. Why? We will see later.

For the discussion that follows, assume a 64-byte cache with a cache line size of 8 bytes. We can think of this cache as an array with eight elements, each 8 bytes in size, as in the figure below.

Now consider the CPU reading one byte from address 0x0654. How does the cache controller determine whether this data hits in the cache? The cache size is negligible compared to main memory, so the cache can only hold a tiny fraction of main memory's data. How, then, do we look up data in a cache of limited size based on an address? The approach current hardware takes is to hash the address (which you can think of as taking the address modulo the number of lines). Let's see how that is done.

We have eight cache lines, each 8 bytes in size. The lower 3 bits of the address (the blue part of the address in the figure above) select a byte within the 8-byte line; this group of bits is called the offset. Similarly, to cover all eight cache lines we need another 3 bits (the yellow part of the address in the figure above) to select a line; these bits are called the index. Now, if two different addresses have exactly the same bits 3-5, hardware hashing maps both of them to the same cache line. So when we find a cache line by index, the data stored there might belong to the address we are accessing, but it might just as well belong to some other address. To tell the two apart, we introduce a tag array, whose entries correspond one-to-one with the entries of the data array.

Each cache line has a unique tag. The tag holds the remaining part of the address: the full address width minus the bits used by the index and the offset (the green part of the address in the figure above). Tag, index, and offset together uniquely determine an address. So when we locate a cache line using the index bits of the address, we read out the tag stored for that cache line and compare it with the tag part of the address. If they are equal, the cache hits; if not, the cache line holds data for some other address, and the access is a miss.
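To make the bit layout concrete, here is a small sketch (my own illustration, not hardware logic) that splits the example address 0x0654 for this 64-byte direct-mapped cache with 8-byte lines; the tag works out to 0x19, the value compared in the next paragraph:

```c
#include <stdint.h>
#include <stdio.h>

/* 64-byte direct-mapped cache, 8-byte cache line:
 * offset = bits [2:0], index = bits [5:3], tag = the remaining upper bits. */
#define OFFSET_BITS 3
#define INDEX_BITS  3

int main(void)
{
    uint64_t addr   = 0x0654;
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* For 0x0654 this prints: offset=4 index=2 tag=0x19 */
    printf("offset=%llu index=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}
```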

In the figure above, the stored tag value is 0x19, which equals the tag part of the address, so this access hits. The introduction of the tag also answers our earlier question: why doesn't hardware use a 1-byte cache line? Because it would raise the hardware cost: where eight bytes previously needed a single tag, they would now need eight tags, which consumes a lot of extra storage.

The figure also shows a valid bit next to the tag, which indicates whether the data in the cache line is valid (for example, 1 for valid, 0 for invalid). When the system starts up, the data in the cache is invalid because nothing has been cached yet. The cache controller uses the valid bit to confirm whether the current cache line's data can be used, so before the tag comparison described above, it also checks the valid bit. Comparing tags only makes sense if the line is valid; if it is invalid, the access is treated as a miss directly.

In the example above, the cache size is 64 bytes and the cache line size is 8 bytes, so offset, index, and tag use 3 bits, 3 bits, and 42 bits respectively (assuming a 48-bit address width). Now consider another example: a 512-byte cache with a 64-byte cache line size. Using the same address partitioning, offset, index, and tag use 6 bits, 3 bits, and 39 bits respectively, as shown in the following figure.

Advantages and disadvantages of the direct-mapped cache

A direct-mapped cache is simpler in hardware design and therefore cheaper. Based on how a direct-mapped cache works, we can draw how main memory addresses 0x00-0x88 map onto the cache.

We can see that the data at addresses 0x00-0x3f covers the entire cache, and the data at addresses 0x40-0x7f likewise covers the entire cache. Now consider a question: what happens if a program accesses the data at addresses 0x00, 0x40, and 0x80 in turn?

First, note that the index parts of addresses 0x00, 0x40, and 0x80 are identical, so all three addresses map to the same cache line. When we access address 0x00, the cache misses and the data is loaded from main memory into cache line 0. When we then access address 0x40, it still indexes to cache line 0; because that line currently holds the data for address 0x00, the cache misses again, and the data at address 0x40 is loaded from main memory into line 0. By the same token, continuing on to address 0x80 misses yet again.

This is equivalent to reading from main memory on every access, so the cache brings no performance benefit here: each time address 0x40 is accessed, the data cached for address 0x00 is evicted and replaced. This phenomenon is called cache thrashing. To alleviate it, we introduce set-associative caches. Let's first study how the simplest variant, the two-way set-associative cache, works.
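To see the conflict numerically, here is a quick sketch (same 64-byte cache / 8-byte line assumptions) showing that all three addresses hash to the same index:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 64-byte direct-mapped cache, 8-byte line: index = bits [5:3]. */
    uint64_t addrs[] = { 0x00, 0x40, 0x80 };
    for (int i = 0; i < 3; i++) {
        uint64_t index = (addrs[i] >> 3) & 0x7;
        /* All three print index 0: they compete for the same cache line. */
        printf("addr 0x%02llx -> index %llu\n",
               (unsigned long long)addrs[i], (unsigned long long)index);
    }
    return 0;
}
```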

Two-way set-associative cache (Two-way set associative cache)

We still assume a 64-byte cache with an 8-byte cache line size. What is a "way"? We divide the cache into several equal parts; each part is one way. A two-way set-associative cache therefore divides the cache into two equal parts of 32 bytes each, as shown in the following figure.

The cache is divided into two ways, each containing four cache lines. All cache lines with the same index form a set; in the figure above, a set contains two cache lines, and there are four sets in total. Again suppose we read one byte from address 0x0654. Because the cache line size is 8 bytes, the offset still needs 3 bits, just as in the direct-mapped cache. The difference is the index: in the two-way set-associative cache, the index needs only 2 bits, because each way contains only 4 cache lines.

In this example the index selects set 2 (counting from 0), and that set contains two cache lines, one in way 0 and one in way 1. For this reason the index is also called the set index. The hardware first finds the set by index, then compares the tags of all the cache lines in that set with the tag part of the address; if any of them matches, it is a hit.
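A minimal lookup sketch of this "find the set, then compare tags in every way" step (my own illustration, assuming the 64-byte, two-way cache above with four sets of 8-byte lines):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS      2
#define SETS      4   /* 64 bytes / 8-byte line / 2 ways */
#define LINE_BITS 3   /* 8-byte line -> 3 offset bits    */
#define SET_BITS  2   /* 4 sets      -> 2 index bits     */

struct line { bool valid; uint64_t tag; uint8_t data[8]; };
static struct line cache[SETS][WAYS];

/* Returns the way that hits for this address, or -1 on a miss. */
static int lookup(uint64_t addr)
{
    uint64_t set = (addr >> LINE_BITS) & (SETS - 1);
    uint64_t tag = addr >> (LINE_BITS + SET_BITS);
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return w;   /* valid line with a matching tag -> hit */
    return -1;          /* no way in the set matches      -> miss */
}

int main(void)
{
    return lookup(0x0654) < 0 ? 0 : 1;  /* empty cache: this lookup misses */
}
```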

So the biggest difference between a two-way set-associative cache and a direct-mapped cache is that the data for a given address can go into either of two cache lines, whereas in a direct-mapped cache it can only go into one. What does that buy us?

Advantages and disadvantages of the two-way set-associative cache

The hardware cost of a two-way set-associative cache is higher than that of a direct-mapped cache, because each lookup must compare the tags of multiple cache lines (some hardware compares them in parallel to speed up the lookup, which further increases design complexity).

Why do we still want set-associative caches? Because they reduce the likelihood of cache thrashing. How? Based on how a two-way set-associative cache works, we can draw how main memory addresses 0x00-0x4f map onto the cache.

Return to the question from the direct-mapped cache section: what happens if a program accesses the data at addresses 0x00, 0x40, and 0x80 in turn? Now the data at address 0x00 can be loaded into way 1 and the data at 0x40 into way 0, which avoids the conflict seen with the direct-mapped cache to some extent: in the two-way set-associative cache, the data at both 0x00 and 0x40 can stay cached at the same time. And if the cache were 4-way set-associative, a subsequent access to 0x80 could be cached as well.

So, for a given cache size, a set-associative cache performs no worse than a direct-mapped cache in the worst case, performs better in most cases, and reduces the frequency of cache thrashing. In a sense, the direct-mapped cache is a special case of the set-associative cache in which each set contains exactly one cache line, so a direct-mapped cache can also be called a one-way set-associative cache.

Fully associative cache (Full associative cache)

Since set associativity helps so much, what if all the cache lines belonged to a single set? Wouldn't performance be even better? Yes, and such a cache is called a fully associative cache. Again take a 64-byte cache as an example.

Because all the cache lines are in one set, the address needs no set index part: there is only one set to choose from, so effectively there is no choice to make. We compare the tag part of the address against the tags of all the cache lines (in hardware this can be done in parallel or serially); whichever tag matches identifies the cache line that hits. In a fully associative cache, data at any address can therefore be cached in any cache line, which minimizes the frequency of cache thrashing, but the hardware cost is also the highest.
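For comparison, a sketch of a fully associative lookup under the same 64-byte / 8-byte-line assumptions: there is no set index at all, only a tag comparison against every line:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINES     8   /* 64-byte cache / 8-byte line, all lines in one set */
#define LINE_BITS 3

struct line { bool valid; uint64_t tag; uint8_t data[8]; };
static struct line cache[LINES];

/* Returns the line that hits for this address, or -1 on a miss. */
static int lookup(uint64_t addr)
{
    uint64_t tag = addr >> LINE_BITS;   /* no set index part in the address */
    for (int i = 0; i < LINES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return i;                   /* any line may hold any address */
    return -1;
}

int main(void)
{
    return lookup(0x0654) < 0 ? 0 : 1;  /* empty cache: this lookup misses */
}
```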

A four-way set-associative cache example

Consider a 32 KB, 4-way set-associative cache with a cache line size of 32 bytes. Think about the following questions:

1) How many sets are there? 2) Assuming the address width is 48 bits, how many bits do the offset, index, and tag occupy?

There are 4 ways in total, so each way is 8 KB. The cache line size is 32 bytes, so there are 256 sets (8 KB / 32 bytes). Because the cache line size is 32 bytes, the offset needs 5 bits; with 256 sets, the index needs 8 bits; the rest is the tag, which occupies 35 bits. This cache can be drawn as in the following figure.
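The same arithmetic can be checked with a short sketch (a helper of my own, not part of any real tool); for a 48-bit address it reproduces the 256 sets and the 5/8/35 bit split:

```c
#include <stdio.h>

/* log2 for exact powers of two */
static int log2u(unsigned long v) { int b = 0; while (v > 1) { v >>= 1; b++; } return b; }

int main(void)
{
    unsigned long cache_size = 32 * 1024;  /* 32 KB  */
    unsigned long ways       = 4;
    unsigned long line_size  = 32;         /* bytes  */
    int           addr_bits  = 48;

    unsigned long sets = cache_size / ways / line_size;
    int offset_bits    = log2u(line_size);
    int index_bits     = log2u(sets);
    int tag_bits       = addr_bits - index_bits - offset_bits;

    /* Prints: sets=256 offset=5 index=8 tag=35 */
    printf("sets=%lu offset=%d index=%d tag=%d\n",
           sets, offset_bits, index_bits, tag_bits);
    return 0;
}
```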

Cache allocation policy

The cache allocation policy determines under what circumstances a cache line should be allocated for data. It covers two cases: reads and writes.

Read allocation:

When the CPU reads data and the cache misses, a cache line is allocated and the data is read into it from main memory. By default, caches support read allocation.

Write allocation:

The write allocation policy only comes into play when a CPU write misses in the cache. Without write allocation, the write simply updates main memory and that is the end of it. With write allocation, the data is first loaded from main memory into a cache line (effectively a read allocation performed first) and then the data in the cache line is updated.
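A toy sketch of the two write-miss behaviors (a deliberate simplification of my own, with a single cache line over a 64-byte "main memory"; real controllers are far more involved):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t memory[64];
static struct { bool valid; uint64_t tag; uint8_t data[8]; } line;

static void write_byte(uint64_t addr, uint8_t val, bool write_allocate)
{
    uint64_t tag = addr >> 3, offset = addr & 0x7;
    bool hit = line.valid && line.tag == tag;

    if (!hit && !write_allocate) {          /* write miss, no allocation: memory only */
        memory[addr] = val;
        return;
    }
    if (!hit) {                             /* write miss + write allocate: fill line first */
        memcpy(line.data, &memory[tag << 3], 8);
        line.valid = true;
        line.tag   = tag;
    }
    line.data[offset] = val;                /* then update the data in the cache line */
}

int main(void)
{
    write_byte(0x2a, 0xAB, true);   /* miss + write allocate: line is filled, then updated */
    write_byte(0x31, 0xCD, false);  /* miss + no allocate: only memory[0x31] is updated    */
    printf("line tag=0x%llx byte=0x%02x mem[0x31]=0x%02x\n",
           (unsigned long long)line.tag, line.data[0x2a & 0x7], memory[0x31]);
    return 0;
}
```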

Cache update policy

The cache update policy determines how a write should update the data when the cache hits. There are two cache update policies: write-through and write-back.

Write-through:

When the CPU executes a store instruction and the cache hits, we update the data in the cache and also update the data in main memory, so the cache and main memory always hold the same data.

Write-back:

When the CPU executes a store instruction and the cache hits, we update only the data in the cache. Each cache line has a bit that records whether its data has been modified, called the dirty bit (look back at the earlier figure: the D next to each cache line is the dirty bit), and the write sets it. Main memory is updated only when the cache line is evicted (replaced) or when an explicit clean operation is performed. So the data in main memory may be stale, with the modified data sitting in the cache line.
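A toy write-back sketch along the same lines (again a simplification of my own): a write hit only touches the line and its dirty bit, and main memory is updated when the dirty line is evicted or cleaned:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t memory[64];
static struct { bool valid, dirty; uint64_t tag; uint8_t data[8]; } line;

int main(void)
{
    /* Assume the line currently caches addresses 0x28-0x2f (tag 5). */
    line.valid = true;
    line.tag   = 5;

    /* Write hit under write-back: update the line and set the dirty bit;
     * main memory is not touched yet. */
    line.data[0x2a & 0x7] = 0x99;
    line.dirty = true;

    /* Eviction (or an explicit clean): write the whole dirty line back
     * to main memory, then clear the dirty bit. */
    if (line.valid && line.dirty) {
        memcpy(&memory[line.tag << 3], line.data, 8);
        line.dirty = false;
    }
    printf("mem[0x2a]=0x%02x\n", memory[0x2a]);  /* 0x99 only after eviction */
    return 0;
}
```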

This also explains why the cache line is the smallest unit of data transfer between the cache controller and main memory: there is only one dirty bit per cache line, and that single bit describes the modified state of the entire line.

Example

Suppose we have a 64-byte direct-mapped cache with an 8-byte cache line size, using write allocation and write-back. When the CPU reads one byte from address 0x2a, how does the data in the cache change? Assume the current cache state is as shown in the following figure.

The index locates the corresponding cache line; its valid bit is set, but the stored tag does not equal the tag part of the address, so the access is a miss. We now need to load the 8 bytes starting at address 0x28 into that cache line. However, the cache line's dirty bit is set, so its current contents cannot simply be discarded: because of the write-back policy, we must first write the data 0x11223344 held in the cache back to address 0x0128 (an address reconstructed from the stored tag value and the line's index). This process is shown in the following figure.

Once the write-back completes, we load the eight bytes starting at address 0x28 from main memory into the cache line and clear the dirty bit, then use the offset to pick out the byte 0x52 and return it to the CPU.
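The numbers in this walk-through can be reproduced with a short sketch; the victim line's stored tag is taken as 0x4 here, which is what the 0x0128 write-back address implies for index 5:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 64-byte direct-mapped cache, 8-byte line: 3 offset bits, 3 index bits. */
    uint64_t addr   = 0x2a;
    uint64_t offset = addr & 0x7;                     /* 2                         */
    uint64_t index  = (addr >> 3) & 0x7;              /* 5                         */
    uint64_t fill   = addr & ~0x7ULL;                 /* 0x28: line to load        */

    uint64_t old_tag = 0x4;                           /* tag stored in the victim  */
    uint64_t wb_addr = (old_tag << 6) | (index << 3); /* 0x128: write-back address */

    printf("offset=%llu index=%llu fill=0x%llx writeback=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)index,
           (unsigned long long)fill, (unsigned long long)wb_addr);
    return 0;
}
```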

That is all we have to share about cache memory in Linux. I hope the content above has been helpful and that you have learned something new from it. If you found this article worthwhile, please share it so more people can see it.
