This article shares the origin and implementation of the memory barrier in computers. It is quite practical, so it is offered here as a reference; read on and take a look.
01 CPU cache
If you are wondering why an article on memory barriers starts with the CPU cache, keep reading.
Students who have studied computer organization will have heard the term clock cycle. What is a clock cycle? Put simply, it is the time the CPU takes to complete one basic action. Anyone who knows a little about hardware also knows the first number to check when judging a CPU: how many GHz it runs at. There is a fixed conversion between GHz and the clock cycle; interested students can study it on their own. To be clear: not knowing this conversion does not affect anything that follows, as long as you have a basic notion of what a clock cycle is.
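For the curious, the conversion is simply that frequency and cycle time are reciprocals. A minimal sketch in C, assuming a hypothetical 3.0 GHz core:

```c
#include <stdio.h>

int main(void) {
    /* Frequency and cycle time are reciprocals: a core running at
     * f GHz completes f billion basic actions per second. */
    double freq_ghz = 3.0;             /* hypothetical example value */
    double cycle_ns = 1.0 / freq_ghz;  /* 1 / (cycles per ns) = ns per cycle */
    printf("At %.1f GHz, one clock cycle lasts about %.3f ns\n",
           freq_ghz, cycle_ns);
    return 0;
}
```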
A long time ago there was no cache in the CPU at all; the CPU read and wrote memory directly. So why was a cache added later? Because there is a huge gap between the speed at which the CPU runs and the speed at which memory can be read and written, and the waiting during every memory access wasted a great deal of CPU computing power. Even with the current DDR4 generation, writing data to memory takes about 107 CPU clock cycles; in other words, the CPU is roughly 107 times faster than a memory write. If the CPU needs only one clock cycle to issue the write, isn't making it wait another 106 clock cycles for the write to finish a waste of its computing power? So how do we solve this? The same way we solve a read/write bottleneck when we run into one with MySQL at work: add a cache.
Take today's mainstream CPU architecture as an example. The CPU mainly uses three levels of cache:
L1 and L2 are per-core (local) caches; each core has its own. If your machine has 4 cores, it has four L1 caches and four L2 caches.
The L3 cache is shared by all cores. No matter how many cores your CPU has, there is only one L3 in the CPU.
A typical L1 cache is 64KB, split into a 32KB instruction cache plus a 32KB data cache; L2 is 256KB and L3 is 2MB. These numbers are not absolute, but current Intel CPUs are basically designed this way.
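If you want to check the numbers on your own machine, a minimal sketch follows. Note that the _SC_LEVEL* names are a Linux/glibc extension, so this is not portable, and sysconf may return -1 or 0 where a value is unknown:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Each sysconf call asks the C library for one cache parameter. */
    printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 line size:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```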
Here is an additional piece of knowledge: the cache line (cache-line). The cache line is the smallest unit of data stored in the CPU cache, typically 64 bytes in size. There is far more to say about this area than this article has room for, so it stops here; interested students can study it on their own, though without some background in computer hardware it may be hard to follow.
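One practical consequence of the 64-byte line is false sharing: two counters that land on the same line force two cores to invalidate each other's copy on every write. A minimal sketch, assuming a 64-byte line, GCC/Clang attribute syntax, and POSIX threads (compile with -pthread), pads each counter onto its own line:

```c
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64  /* assumed line size, as described above */

/* Padding and alignment force each counter onto its own cache line,
 * so the two threads never fight over the same line. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counters[2];

static void *worker(void *arg) {
    int idx = *(int *)arg;
    for (long i = 0; i < 10000000L; i++)
        counters[idx].value++;          /* each thread touches its own line */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    pthread_create(&t[0], NULL, worker, &ids[0]);
    pthread_create(&t[1], NULL, worker, &ids[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```

Removing the pad field puts both counters on one line; the program still computes the same result, only slower, which is exactly the cache-line effect.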
As philosophy teaches, every solution has two sides. Adding a cache does effectively improve the CPU's execution efficiency, but data consistency between the CPU caches, and between cache and memory, becomes a problem that must be considered and solved. Moreover, the cost of keeping these two layers consistent must be lower than the CPU computing power that was wasted before the cache existed; otherwise the scheme is a pseudo-solution: it sounds high-end but solves nothing.
02 cache coherence
The MESI protocol was born precisely to keep the data in each CPU core's cache consistent with memory. If you have not heard of it, a quick Baidu search will give you the basics; it is relatively simple. Two points are worth expanding on here:
1. Why add a buffer between the CPU's execution units and the L1 cache? The way the CPU keeps each core's cache consistent with memory is a little like TCP's three-way handshake: when CPU0 modifies a piece of data it must broadcast the change to the other CPUs, and at that point CPU0 blocks, waiting for the other CPUs to update the state in their own caches; only after they have done so and replied with acknowledgements does CPU0 return to the running state. The blocked time is very short, but in the CPU's world it is very long. To make sure this blocked time is put to full use, a buffer was added: pre-read information is stored there so that, once unblocked, the CPU can take requests straight out of the buffer and process them.
2. The implementation idea of the MESI protocol: when CPU0 modifies a piece of data, it broadcasts the change to the other CPUs. A CPU whose cache does not hold that data simply discards the broadcast; a CPU whose cache does hold it marks the corresponding cache line invalid after hearing the broadcast, so that the next time it reads the data and finds the line invalid, it fetches it from memory. Have you spotted the problem? Whenever data is modified, the other CPUs have to go all the way to memory for it, so why not let the CPU caches share data with each other? Then the next read could fetch the line straight from CPU0's cache, which performs better. Modern CPUs do implement this idea; the corresponding protocols are AMD's MOESI and Intel's MESIF. Interested students can study them.
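To make the state changes concrete, here is a toy sketch of the four MESI states of a single cache line as seen from one core. It is a deliberately simplified model, not how real hardware is organized; for instance, a read miss may actually land in Exclusive rather than Shared when no other core holds the line:

```c
#include <stdio.h>

/* The four MESI states of one cache line, from one core's viewpoint. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

/* Events this cache can observe for that line. */
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

static mesi_state next_state(mesi_state s, mesi_event e) {
    switch (e) {
    case LOCAL_WRITE:  return MODIFIED;               /* we own the dirty copy now */
    case LOCAL_READ:   return s == INVALID ? SHARED   /* miss: refetch (simplified) */
                                           : s;       /* hit: state unchanged */
    case REMOTE_READ:  return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    case REMOTE_WRITE: return INVALID;                /* the broadcast invalidates us */
    }
    return s;
}

int main(void) {
    mesi_state s = INVALID;
    s = next_state(s, LOCAL_READ);    /* INVALID -> SHARED                      */
    s = next_state(s, REMOTE_WRITE);  /* SHARED  -> INVALID: another core wrote */
    printf("line is %s\n", s == INVALID ? "INVALID" : "still valid");
    return 0;
}
```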
03 the origin of the memory barrier
At present there are two mainstream strategies for how the CPU performs writes:
1. Write back: when the CPU writes data toward memory, it first puts the data into the store buffer, and only at some later point does it flush the store buffer into memory; the two operations are asynchronous (a toy model is sketched after this list). In a multithreaded environment this is acceptable in some situations and unacceptable in others. To give programmers the ability to make the write complete synchronously when the business requires it, the memory barrier was designed; memory barriers are discussed in more detail later.
2. Write through: when the CPU writes data, the write is applied to the cache and pushed through to memory synchronously; the write does not complete until memory has been updated.
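Here is the promised toy model of a store buffer: purely conceptual, nothing like real hardware layout. The core appends a write and keeps running; the entries are drained into memory later, asynchronously:

```c
#define SB_CAP 8  /* capacity of this toy buffer */

struct sb_entry { unsigned long addr; long value; };

struct store_buffer {
    struct sb_entry entries[SB_CAP];
    int head, tail, count;
};

/* Core side: record the write and return immediately, no waiting. */
static int sb_put(struct store_buffer *sb, unsigned long addr, long value) {
    if (sb->count == SB_CAP)
        return -1;                       /* buffer full: the core must stall */
    sb->entries[sb->tail] = (struct sb_entry){ addr, value };
    sb->tail = (sb->tail + 1) % SB_CAP;
    sb->count++;
    return 0;
}

/* Drain side: runs "later", asynchronously with the core. */
static void sb_flush(struct store_buffer *sb, long memory[]) {
    while (sb->count > 0) {
        struct sb_entry e = sb->entries[sb->head];
        sb->head = (sb->head + 1) % SB_CAP;
        sb->count--;
        memory[e.addr] = e.value;        /* only now is the write visible */
    }
}

int main(void) {
    static long memory[16];
    struct store_buffer sb = {0};
    sb_put(&sb, 3, 42);     /* the core "writes" and moves on at once */
    /* ... the core keeps executing other instructions here ... */
    sb_flush(&sb, memory);  /* the write reaches memory only at this point */
    return memory[3] == 42 ? 0 : 1;
}
```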
At present most CPUs adopt the write-back strategy. Some students may ask: why? Because in most cases the small delay caused by the CPU completing the memory write asynchronously is acceptable, and the delay is extremely short. Only in a few special cases, such as strictly guaranteeing memory visibility in a multithreaded environment, must the CPU's write appear to the outside world to complete synchronously, and that has to be achieved with the memory barrier the CPU provides. If strategy 2, write through, were adopted across the board, every memory write would have to wait for the data to be flushed into memory, which would seriously hurt the CPU's execution efficiency.
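A hedged illustration of what write back makes observable is the classic store-buffer litmus test: on x86, each thread's store can still be sitting in its core's store buffer when the following load executes, so both loads can read 0. A sketch assuming POSIX threads (compile with -pthread); the reordering is probabilistic and also depends on the compiler not rearranging the accesses, so many iterations may be needed to catch it:

```c
#include <pthread.h>
#include <stdio.h>

volatile int x, y, r1, r2;

static void *t0(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
static void *t1(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

int main(void) {
    for (int i = 0; i < 100000; i++) {
        x = y = r1 = r2 = 0;
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0) {   /* neither store was visible to the other */
            printf("store-buffer reordering observed at iteration %d\n", i);
            break;
        }
    }
    return 0;
}
```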
04 the implementation idea of the memory barrier
Why insert a barrier at all? In essence, because the business level cannot accept even the tiny delay between the two asynchronous steps of writing the store buffer and flushing it to memory; in other words, the requirement on memory visibility is extremely high.
What exactly is a memory barrier? A memory barrier is not a concrete thing; it is just an abstract concept, like OOP. If that does not help, picture it as a wall: the instructions on either side of the wall cannot be reordered across it by the CPU, and the reads and writes on either side of the wall must be carried out in order.
On x86, the CPU provides three assembly instructions that serialize reads and writes to guarantee their order:
SFENCE: all write operations before the instruction must complete before any write operation after it
LFENCE: all read operations before the instruction must complete before any read operation after it
MFENCE: all read and write operations before the instruction must complete before any read or write operation after it
What does serialize mean? You can picture the CPU putting the read and write requests into a queue and executing them in first-in-first-out order. What does it mean for a read to be complete? That the CPU has performed the read and loaded the value into a register. What does it mean for a write to be complete? That the CPU has performed the write and the data has been flushed into memory.
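To close the loop with the earlier litmus test, here is a sketch of the fenced version, using the `_mm_mfence()` intrinsic from `<emmintrin.h>` (it emits the MFENCE instruction; SSE2 is assumed) instead of raw assembly. With the full barrier between the store and the load, the store must leave the store buffer before the load runs, so the 0/0 outcome disappears:

```c
#include <emmintrin.h>   /* _mm_mfence(): emits MFENCE */
#include <stddef.h>      /* NULL */

volatile int x, y, r1, r2;

/* The same two threads as the litmus test above, now fenced;
 * drop these into that driver loop in place of the originals. */
static void *t0(void *arg) {
    (void)arg;
    x = 1;
    _mm_mfence();   /* drain the store of x before reading y */
    r1 = y;
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    y = 1;
    _mm_mfence();   /* drain the store of y before reading x */
    r2 = x;
    return NULL;
}
```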
Thank you for reading! That is all on the origin and implementation of the memory barrier in computers. I hope the content above has been of some help and lets you learn a bit more. If you think the article is good, share it so more people can see it!