
Bus Locking and Cache Coherence

2025-03-01 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--


These are two concepts at the operating-system level. With the advent of the multi-core era, concurrent operation has become the norm, so the system must provide mechanisms and primitives that guarantee the atomicity of basic operations. For example, the processor must guarantee that reading or writing a single byte is atomic. How is that implemented? Through two mechanisms: bus locking and cache coherence.

We know that communication between the CPU and physical memory is much slower than the CPU's processing speed, so the CPU has its own internal cache: according to certain rules, data in memory is read into the cache to speed up frequent accesses. If a PC had only one CPU with a single internal cache, every process and thread would see the same cached values and there would be no problem. But servers today are usually multi-CPU, and more commonly each CPU has multiple cores, with each core maintaining its own cache. Under multithreaded concurrency those caches can become inconsistent, which leads to serious problems.

Take i++ as an example, with the initial value of i being 0. At the start, each core's cache holds the value 0 for i. When the first core executes i++, the value in its cache becomes 1. Even if that value is written back to main memory immediately, the second core's cached value of i is still 0; when it executes i++ and writes back, it overwrites the first core's update, so the final result is 1 instead of 2.
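The lost update described above can be sketched as a deterministic Python simulation (the function and variable names are hypothetical, chosen only to mirror the interleaving in the text):

```python
# Sketch of the lost-update interleaving: two cores each read a cached
# copy of i = 0, increment locally, then write back to main memory.
def run_without_coherence():
    memory = {"i": 0}
    cache_core1 = memory["i"]    # core 1 reads i into its cache -> 0
    cache_core2 = memory["i"]    # core 2 reads i into its cache -> 0
    cache_core1 += 1             # core 1 executes i++ on its copy
    memory["i"] = cache_core1    # core 1 writes back -> memory holds 1
    cache_core2 += 1             # core 2 still sees the stale 0
    memory["i"] = cache_core2    # core 2's write-back overwrites core 1's
    return memory["i"]

print(run_without_coherence())   # 1, not the expected 2
```

Both increments happen, but one update is lost because each core worked from a stale private copy.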

So how is this problem solved? The operating system provides a bus locking mechanism. The front-side bus (also called the CPU bus) is the backbone connecting the CPU and the chipset; it carries communication between the CPU and all external components, including the cache, memory, and the north bridge. Its control bus sends control signals to each component, its address bus specifies which component to access, and its data bus transfers data bidirectionally. When CPU1 asserts a LOCK# signal on the bus, other processors cannot operate on caches that hold the shared variable's memory address; that is, they are blocked, so the locking processor has exclusive use of that shared memory.
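The effect of the LOCK# signal can be modeled in Python by a mutex standing in for the bus lock (a sketch, not how hardware actually implements it): while one thread holds the "bus", no other thread can perform its read-modify-write.

```python
import threading

memory = {"i": 0}
bus_lock = threading.Lock()      # stands in for the LOCK# bus signal

def locked_increment():
    # While the "bus" is locked, no other thread may touch memory,
    # so read + modify + write back behave as one atomic unit.
    with bus_lock:
        value = memory["i"]
        memory["i"] = value + 1

threads = [threading.Thread(target=locked_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(memory["i"])               # 2: both increments survive
```

With the lock, the interleaving that lost an update in the previous example is impossible.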

But we only need the operation on this one shared variable to be atomic, while a bus lock locks down all communication between the CPUs and memory: during locking, other processors cannot touch data at any other memory address either, so the cost is high. Later CPUs therefore provide a cache coherence mechanism; on Intel processors this optimization (cache locking) is available starting with the P6 family.

Overall, the cache coherence mechanism works like this: when a CPU modifies data in its cache, it notifies the other CPUs to discard their cached copies or re-read the data from main memory.

The following describes in detail the MESI protocol, which is widely used in Intel processors.

The MESI protocol is named after the states a cache line can be in (a cache line is the basic unit of cache data, usually 64 bytes on Intel CPUs); the full names are Modified, Exclusive, Shared, and Invalid. The protocol requires two state bits to be maintained per cache line, so each line is in one of the four states M, E, S, and I. The states mean the following:

M: modified. Data in this state is cached only in this CPU and in no other, and it has been modified relative to the value in memory, i.e. it has not yet been written back to memory.

E: exclusive. Data in this state is cached only in this CPU and has not been modified, i.e. it is consistent with memory.

S: shared. Data in this state is cached in multiple CPUs and is consistent with memory.

I: invalid. This CPU's copy of the cache line is no longer valid.

First, here are the snooping obligations the protocol places on each cache line:

A cache line in the M state must at all times snoop every attempt to read the line's main-memory address; if one occurs, the data in the cache line must be written back to main memory before the operation proceeds.

A cache line in the S state must always snoop requests that invalidate the line or claim exclusive access to it; if one occurs, it must set the line's state to I.

A cache line in the E state must always snoop other attempts to read the line's main-memory address; if one occurs, the line's state must be set to S.
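The three snooping rules above can be written as a small transition function (a sketch; the M-to-S transition on a remote read is an assumption consistent with the usual MESI description, since the text only says the dirty line must be written back):

```python
# The three snooping rules as a transition function.
# States: M (Modified), E (Exclusive), S (Shared), I (Invalid).
def snoop(state, event):
    """Return (new_state, writeback_needed) when another CPU's bus
    operation touches this cache line's address."""
    if state == "M" and event == "remote_read":
        return "S", True    # write dirty data back to main memory first
    if state == "E" and event == "remote_read":
        return "S", False   # line becomes shared; memory already matches
    if state == "S" and event == "invalidate":
        return "I", False   # e.g. another CPU issued an RFO
    return state, False     # no rule applies: state unchanged

print(snoop("M", "remote_read"))   # ('S', True)
```

Only the M state ever forces a write-back, because only M holds data newer than main memory.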

When a CPU needs to read data: if its cache line's state is I, it must read from memory and change the state to S; if the state is not I, it can read the cached value directly, but before doing so it must wait for the snoop results from the other CPUs. If another CPU also caches the data and its state is M, that CPU must first write its cache back to memory before the read can proceed.

When a CPU needs to write data, it can proceed only if its cache line is in the M or E state; otherwise it must issue a special RFO instruction (Read For Ownership, a bus transaction) telling the other CPUs to invalidate their copies (set them to I), which carries a relatively high performance cost. After the write completes, the line's state is set to M.
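The read and write rules above can be combined into a toy two-core model (a deterministic sketch under simplifying assumptions: one cache line, no real concurrency, and the class and method names are hypothetical):

```python
# Toy two-core model of the MESI read/write rules for a single cache line.
class Core:
    def __init__(self, system):
        self.system = system
        self.state, self.value = "I", None

    def read(self):
        if self.state == "I":
            self.system.snoop_read(requester=self)   # others: M->S (writeback), E->S
            self.value = self.system.memory
            self.state = "S" if self.system.other_has_copy(self) else "E"
        return self.value

    def write(self, value):
        if self.state not in ("M", "E"):
            self.system.rfo(requester=self)          # invalidate other copies
        self.state, self.value = "M", value

class System:
    def __init__(self):
        self.memory = 0
        self.cores = [Core(self), Core(self)]

    def snoop_read(self, requester):
        for c in self.cores:
            if c is not requester and c.state in ("M", "E"):
                if c.state == "M":
                    self.memory = c.value            # dirty line written back first
                c.state = "S"

    def other_has_copy(self, asker):
        return any(c is not asker and c.state != "I" for c in self.cores)

    def rfo(self, requester):
        self.snoop_read(requester)                   # flush any dirty copy first
        for c in self.cores:
            if c is not requester:
                c.state = "I"

sys_ = System()
c0, c1 = sys_.cores
c0.read()                    # c0: I -> E (sole copy)
c1.read()                    # c1: I -> S, and c0 snoops the read: E -> S
c0.write(42)                 # c0 not in M/E, so RFO: c1 -> I, then c0 -> M
print(c1.read())             # 42: c0's dirty line is written back, c0: M -> S
print(c0.state, c1.state)    # S S
```

The trace shows each rule firing in turn: exclusive fill, demotion to shared on a second reader, RFO on write, and write-back when the invalidated core re-reads.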

Therefore, if a variable is modified by only one thread over some period, that CPU can work entirely in its own cache without any bus transactions. But if exclusive ownership of the cache line keeps bouncing between CPUs, the constant RFO transactions will hurt concurrent performance. Frequent contention for a line is not simply a matter of more threads making it more likely; it depends on how the CPUs coordinate. This resembles the way multithreading does not always improve efficiency: the cost of thread suspension and scheduling can exceed the cost of the task itself. The same holds across CPUs here: if work is scheduled poorly between CPUs, the cost of the RFO transactions can exceed that of the tasks. Of course, this is not something the programmer normally needs to handle; the system makes its own judgment about the memory addresses involved, which is beyond the scope of this article.

Cache coherence cannot be used in every case. If the data being operated on cannot be cached inside the CPU, or the operation spans multiple cache lines (so the line state cannot be tracked), the processor falls back to bus locking. In addition, CPUs that do not support cache locking, such as the Intel486 and Pentium and earlier processors, can only use bus locking.
