2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article introduces how to understand the C++11 memory model. Many people run into these problems in real-world code, so read carefully and I hope you come away with something useful!
A brief historical review
Initially, processor vendors did not publish public specifications for their memory models; nevertheless, weakly ordered processors worked with memory correctly according to some internal set of rules. Personally, I think the vendors simply wanted to keep the freedom to introduce new strategies in the future (why tie architecture development to a published specification?). But trouble arrived anyway: clock frequencies stalled at a few gigahertz, which was enough to make the vendors uneasy. They introduced multi-core processors, and that in turn triggered a surge in multithreaded programming.
Operating system developers were the first to be alarmed: they had to support multicore CPUs, yet there were no published ordering rules for the weakly ordered architectures. Standards committees then got involved, and as programs became ever more parallel, standardized language memory models emerged, providing guarantees for multithreaded execution; and now we also have published processor memory model rules. In the end, almost all modern processor architectures have open memory model specifications.
C++ has always been known as a high-level language for writing low-level code, and its memory model could not break with that tradition: it had to give programmers maximum flexibility. After analyzing the memory models of languages such as Java, the internals of typical synchronization primitives, and existing lock-free algorithms, the standard's developers introduced three memory models:
Sequential consistency model
Acquire/release semantics model
Relaxed memory ordering model
All of these memory models are defined by a single C++ enumeration, std::memory_order, which contains the following six constants:
memory_order_seq_cst: the sequential consistency model
memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_consume: the model based on acquire/release semantics
memory_order_relaxed: the relaxed memory ordering model
Before looking at these models, note how a program specifies which one to use. Atomicity itself was covered in the previous article, and the operations described there are no different from those defined in C++11; both follow the same convention: memory_order is passed as a parameter of the atomic operation, rather than being issued as a standalone barrier. There are two reasons for this:
Semantic: ordering (a memory barrier) only has meaning in relation to the atomic operation the program performs. The subtlety of barriers on reads/writes is that a barrier need not be a separate instruction at all; it may be a flag on the read/write instruction itself, and whether the barrier takes effect before or after the read/write depends on the architecture.
Practical: Intel Itanium is a special and telling architecture. Its memory ordering mode is built into the read, write, and RMW instructions themselves as an optional instruction tag: acquire, release, or relaxed. There are no separate acquire/release barrier instructions in the architecture, only a heavyweight full memory barrier instruction.
The following are the real atomic operations; each specialization of the std::atomic<T> class must provide at least the following methods:
void store( T, memory_order = memory_order_seq_cst );
T load( memory_order = memory_order_seq_cst ) const;
T exchange( T, memory_order = memory_order_seq_cst );
bool compare_exchange_weak( T&, T, memory_order = memory_order_seq_cst );
bool compare_exchange_strong( T&, T, memory_order = memory_order_seq_cst );
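As a quick sanity check, here is a minimal sketch exercising this interface with the default (sequentially consistent) ordering; the function name atomic_interface_demo is mine, not from the standard:

```cpp
#include <atomic>

// Exercise the core std::atomic<T> interface with the default
// (sequentially consistent) memory order.
int atomic_interface_demo()
{
    std::atomic<int> a(0);

    a.store(5);                 // seq_cst store
    int v = a.load();           // seq_cst load: v == 5

    int old = a.exchange(7);    // old == 5, a now holds 7

    int expected = 7;
    // strong CAS: succeeds because a == expected, so a becomes 9
    bool ok = a.compare_exchange_strong(expected, 9);

    return v + old + (ok ? a.load() : 0);   // 5 + 5 + 9 == 19
}
```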
Independent memory barrier
Of course, C++11 also provides two standalone memory barrier functions:
void atomic_thread_fence( memory_order );
void atomic_signal_fence( memory_order );
atomic_thread_fence can be used as a standalone read/write barrier, although this style is considered somewhat old-fashioned; note that the memory_order values cannot express a pure read barrier (load/load) or a pure write barrier (store/store). atomic_signal_fence is intended for signal handlers; as a rule it generates no code at all and acts only as a compiler barrier. (Translator's note: "fence" is perhaps the more fitting term here.)
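A minimal sketch of atomic_thread_fence acting as a standalone barrier pair: a release fence publishes a plain write, and an acquire fence on the reader side orders the subsequent read. The names produce/wait_and_read and the globals are illustrative only:

```cpp
#include <atomic>

std::atomic<bool> g_ready(false);
int g_payload = 0;

// Producer: write the data, then raise the flag behind a release fence.
void produce()
{
    g_payload = 42;                                        // plain write
    std::atomic_thread_fence(std::memory_order_release);   // keeps the write above
    g_ready.store(true, std::memory_order_relaxed);
}

// Consumer: poll the flag, then read the data behind an acquire fence.
int wait_and_read()
{
    while (!g_ready.load(std::memory_order_relaxed))
        ;                                                  // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);   // keeps the read below
    return g_payload;                                      // sees 42
}
```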
As you can see, the default memory model in C++11 is the sequential consistency model, which is exactly what we discuss next; but first, a few words about compiler barriers.
Compiler barrier
Who rearranges the code we write? Processors can rearrange code, and so can compilers. Many heuristics and optimizations are built on the assumption of single-threaded execution, and it is quite hard for the compiler to see that your code is multithreaded. So it needs a hint: a barrier. Such a barrier tells the compiler, "do not move code from before the barrier past it, and vice versa." A compiler barrier generates no code of its own.
For MS Visual C++ the compiler barrier is the pseudo-function _ReadWriteBarrier(). (The name is unfortunate: it suggests a read/write memory barrier, that is, a heavyweight memory barrier, which it is not.) For GCC and Clang it is the familiar __asm__ __volatile__ ( "" ::: "memory" ) construct.
It is also worth noting that any assembly insertion __asm__ __volatile__ (...) acts as a barrier for GCC and Clang: the compiler may not drop or rearrange code across it. The C++ memory_order constants influence the processor and, at the same time, act as compiler barriers, restricting code rearrangement (that is, optimization) as required. So there is no need to set explicit compiler barriers, provided the compiler fully supports the new standard.
Sequential consistency model
Suppose we have implemented a lock-free stack, compiled it, and tested it. We get a core file and ask what went wrong. We start hunting for the root cause, walking through the lock-free stack implementation line by line (no debugger will help us here), mentally simulating multithreaded execution and asking questions like:
"If thread 1 executes line K while thread 2 executes line N, what fatal thing can make the program fail?" Perhaps you find such a root cause and fix it, but the lock-free stack still crashes. Why?
(Translator's note: a core file, also called a core dump, is produced when a process receives certain signals and terminates: the operating system writes the contents of the process address space and other status information to a file for debugging.)
In fact, by mentally comparing how different lines of the program interleave under concurrent execution, we are assuming sequential consistency. It is a strict memory model that guarantees the processor executes instructions in exactly the order given by the program itself. For example, consider the following code:
// Thread 1
atomic<int> a, b;
a.store( 5 );
int vb = b.load();
// Thread 2
atomic<int> x, y;
int vx = x.load();
y.store( 42 );
Any interleaving of the two threads is allowed by the sequential consistency model, but swapping a.store/b.load or x.load/y.store within a thread is not. Note that I did not set an explicit memory_order argument on the load/store calls; I rely on the default value.
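To make the guarantee concrete, here is the classic store-load test under the default (seq_cst) ordering: sequential consistency forbids the outcome r1 == 0 && r2 == 0, because a single total order of the four operations must exist, and in it at least one store precedes the other thread's load. This is my own sketch, not code from the article:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X(0), Y(0);
int r1 = 0, r2 = 0;   // each written by exactly one thread

void thread1() { X.store(1); r1 = Y.load(); }   // seq_cst by default
void thread2() { Y.store(1); r2 = X.load(); }

// Returns true iff the forbidden outcome (r1 == 0 && r2 == 0)
// did NOT occur; under seq_cst this is guaranteed.
bool run_once()
{
    X = 0; Y = 0;
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    return !(r1 == 0 && r2 == 0);
}
```

With weaker orderings (relaxed, or even acquire/release), the forbidden outcome can actually appear on x86 and ARM hardware.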
The same requirement extends to the compiler: operations above a memory_order_seq_cst barrier must not be moved below it, and operations below a seq_cst barrier must not be moved above it.
The sequential consistency model is close to how the human brain reasons, but it has a rather fatal flaw: it is far too restrictive for modern processors. It leads to the heaviest memory barriers and largely suppresses speculative execution, which is why the new C++ standard settled on the following trade-offs:
The sequential consistency model is the default for atomic operations, because it is strict and easy to understand.
At the same time, C++ introduces weaker barriers to accommodate the capabilities of modern architectures.
The model based on acquire/release semantics serves as a good complement to the sequential consistency model.
Acquire/release semantics
As the name suggests, this semantics relates to acquiring and releasing a resource. Indeed: acquiring a resource means reading it from memory into a register, and releasing it means writing it back from the register to memory.
load memory, register
membar #LoadLoad | #LoadStore ; // acquire-barrier
// operations inside the acquire/release section
...
membar #LoadStore | #StoreStore ; // release-barrier
store register, memory
As you can see, no heavyweight barrier like #StoreLoad is used here. The acquire barrier and the release barrier are each half-barriers: the acquire barrier prevents subsequent loads and stores from being reordered before it, and the release barrier prevents preceding loads and stores from being reordered after it. All of this applies both to compilers and to processors, with the acquire and release barriers bracketing all the code in between. Operations from before the acquire barrier may be moved (by the processor or the compiler) down into the acquire/release section, and operations from after the release barrier may be moved up into it, but operations inside the section never leak out.
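The classic message-passing pattern illustrates these half-barriers directly: the release store keeps the data write above it, and the acquire load keeps the data read below it. A small sketch (the names writer/reader/run_pair are mine):

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> g_flag(false);
int g_data = 0;

void writer()
{
    g_data = 100;                                    // plain write, done first
    g_flag.store(true, std::memory_order_release);   // the write above cannot sink below this
}

int reader()
{
    while (!g_flag.load(std::memory_order_acquire))  // pairs with the release store
        ;                                            // spin until published
    return g_data;   // the release/acquire pair guarantees we observe 100
}

int run_pair()
{
    std::thread w(writer);
    int v = reader();
    w.join();
    return v;
}
```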
A spin lock is probably the simplest example of acquire/release semantics in action.
Lock-free and spin locks
You may find it strange to see a locking algorithm as an example in a series of articles on lock-free algorithms. Let me explain.
I am not a lock-free purist, although pure lock-free (and especially wait-free) algorithms do make me very happy, and I am even happier when I manage to build one. But I am a pragmatist: whatever works well is good. If using a lock brings benefits, that is good too. A spin lock can beat a full-blown mutex when it protects a very small piece of code, just a few assembly instructions. Spin locks are also an inexhaustible source of various optimizations.
The simplest acquire/release-based spin lock implementation looks something like this. (C++ purists will insist that std::atomic_flag is the proper type for a spin lock, but I prefer to build spin locks on an atomic integer variable, not even a boolean one; for the purposes of this article it reads more clearly.)
class spin_lock
{
    atomic<unsigned int> m_spin;
public:
    spin_lock(): m_spin(0) {}
    ~spin_lock() { assert( m_spin.load( memory_order_relaxed ) == 0 ); }

    void lock()
    {
        unsigned int nCur;
        do { nCur = 0; }
        while ( !m_spin.compare_exchange_weak( nCur, 1, memory_order_acquire ));
    }
    void unlock()
    {
        m_spin.store( 0, memory_order_release );
    }
};
One thing that may puzzle you in this code: if the CAS fails, the compare_exchange method modifies its first argument, which is passed by reference. That is why we have to use a do-while loop with a non-empty body.
Acquire semantics is used in the lock method and release semantics in the unlock method (by the way, acquire/release semantics originated in synchronization primitives: the standard's developers carefully analyzed implementations of various synchronization primitives, and that analysis led to the acquire/release model). As noted earlier, the barriers in this example prevent code from leaking out of the lock/unlock section, which is exactly what we need.
The atomic variable m_spin guarantees that no one else can take the lock while m_spin == 1, which is also exactly what we need!
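As a usage sketch, here the spin lock (re-declared inline so the example is self-contained, with the destructor assert omitted) protects a plain counter incremented from two threads; the count comes out exact because each increment is fenced inside the critical section by the acquire/release pair:

```cpp
#include <atomic>
#include <thread>

// Self-contained copy of the spin lock above.
class spin_lock {
    std::atomic<unsigned int> m_spin;
public:
    spin_lock() : m_spin(0) {}
    void lock() {
        unsigned int nCur;
        do { nCur = 0; }
        while (!m_spin.compare_exchange_weak(nCur, 1, std::memory_order_acquire));
    }
    void unlock() { m_spin.store(0, std::memory_order_release); }
};

spin_lock g_lock;
int g_counter = 0;   // a plain int: protected by g_lock, so no data race

int count_to(int iters_per_thread)
{
    auto work = [iters_per_thread] {
        for (int i = 0; i < iters_per_thread; ++i) {
            g_lock.lock();
            ++g_counter;        // stays inside the acquire/release section
            g_lock.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    return g_counter;
}
```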
You can see that compare_exchange_weak is used in the algorithm. What is it?
Weak and Strong CAS
As you may remember, processor architectures usually fall into one of two camps: they implement either an atomic CAS primitive or an LL/SC pair (load-linked/store-conditional). An LL/SC pair can be used to build an atomic CAS, but the pair itself is not atomic, for many reasons. One of them is that the code running between LL and SC can be interrupted by the operating system: for example, the OS may decide to preempt the current thread at that point, and after the thread resumes, the store-conditional fails. The CAS returns false, yet the reason for the failure is not the data itself but an external event: the thread was interrupted.
This is exactly what prompted the standard's developers to add two compare_exchange primitives: weak and strong, named compare_exchange_weak and compare_exchange_strong respectively. The weak version may fail spuriously, returning false even when the current value of the variable equals the expected value. In other words, a weak CAS may break CAS semantics and return false where it should have returned true. The strong CAS, by contrast, strictly follows CAS semantics; of course, that comes at a price.
When should you use the weak CAS and when the strong one? My rule of thumb is this: if the CAS sits in a loop (the basic CAS usage pattern) and the loop body is lightweight and simple (no thousands of operations per iteration), I use compare_exchange_weak. Otherwise, I use compare_exchange_strong.
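A typical weak-CAS retry loop in the spirit of that rule: a lightweight read-modify-write where a spurious failure only costs one extra iteration. fetch_and_double is a made-up helper for illustration, not a standard function:

```cpp
#include <atomic>

// Atomically double the value of `v`, returning the old value.
int fetch_and_double(std::atomic<int>& v)
{
    int cur = v.load();
    // On failure, compare_exchange_weak reloads `cur` with the current
    // value of `v`, so the loop body can stay empty; a spurious failure
    // merely costs one extra iteration of this cheap loop.
    while (!v.compare_exchange_weak(cur, cur * 2))
        ;
    return cur;   // the value observed just before the successful doubling
}
```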
Memory order for acquire/release semantics
As mentioned above, the memory_order constants defined for acquire/release semantics are:
memory_order_acquire
memory_order_consume
memory_order_release
memory_order_acq_rel
For reads (load), memory_order_acquire and memory_order_consume are available. For writes (store), only memory_order_release may be used. memory_order_acq_rel is applicable only to RMW operations such as compare_exchange, exchange, and fetch_xxx. In fact, an atomic RMW primitive can carry acquire semantics (memory_order_acquire), release semantics (memory_order_release), or both at once (memory_order_acq_rel).
These constants determine the semantics of an RMW operation because an RMW primitive performs an atomic read and write simultaneously: semantically, an RMW operation can behave as an acquire-load, as a release-store, or as both.
Which semantics to give an RMW operation can only be determined from the algorithm itself. In lock-free algorithms, patterns somewhat similar to the spin lock are quite common: first acquire a resource, do some work (for example, compute a new value), and finally release the resource with its new value. If the resource acquisition is performed by an RMW operation (usually CAS), that operation will most likely carry acquire semantics; if the new value is stored by an RMW primitive, that one will most likely carry release semantics. I say "most likely" deliberately: you have to analyze the details of the algorithm to understand which semantics each RMW operation requires.
If an RMW primitive stands alone, outside an acquire/release pattern, three semantic variants are possible:
memory_order_seq_cst: the RMW operation is the heart of the algorithm, and any migration of loads or stores across it, up or down, would break the algorithm.
memory_order_acq_rel: similar to memory_order_seq_cst, but the RMW operation sits inside an acquire/release section.
memory_order_relaxed: the RMW operation may be moved up or down without breaking anything (for example, because it already sits inside an acquire/release section).
These guidelines should be understood as rough principles for choosing the semantics of RMW primitives; after that, every algorithm still requires its own detailed analysis.
Consume semantics
This is a standalone, weaker variant of acquire semantics: read-consume semantics. It was introduced "courtesy of" the DEC Alpha processor. The Alpha architecture differs greatly from other modern architectures in that it can break data dependencies. The following code is an example:
struct foo {
    int x;
    int y;
};
atomic<foo *> pFoo;

foo * p = pFoo.load( memory_order_relaxed );
int x = p->x;
Alpha is able to reorder the read of p->x with the read of p itself (don't ask me how that is possible; it is one of Alpha's peculiarities, and I have never worked with Alpha, so I cannot vouch for the details). To prevent this reordering, consume semantics was introduced for the pattern of atomically reading a pointer to a structure followed by reading the structure's fields. For our pFoo pointer it looks like this:
foo * p = pFoo.load( memory_order_consume );
int x = p->x;
Consume semantics sits between relaxed reads and acquire reads; on most architectures it maps onto a plain relaxed read.
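A small sketch of the pattern under discussion: the pointer is published with a release store and read with memory_order_consume, which orders only the reads that are data-dependent on the loaded pointer (p->x depends on p), which is exactly what the Alpha case needs. Note that in practice most compilers promote consume to acquire; the names publish/read_x are mine:

```cpp
#include <atomic>

struct foo { int x; int y; };

std::atomic<foo*> pFoo(nullptr);

// Publisher: fill the struct, then publish the pointer with release.
void publish(foo* f)
{
    f->x = 1;
    f->y = 2;
    pFoo.store(f, std::memory_order_release);
}

// Reader: the consume load orders the dependent read p->x after
// the load of p itself; on most architectures this costs nothing.
int read_x()
{
    foo* p = pFoo.load(std::memory_order_consume);
    return p ? p->x : -1;
}
```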
More about CAS
I have shown the two CAS interfaces, weak and strong, but there are more than two CAS variants: each also has a form with an extra memory_order parameter:
bool compare_exchange_weak( T&, T, memory_order successOrder, memory_order failedOrder );
bool compare_exchange_strong( T&, T, memory_order successOrder, memory_order failedOrder );
But what is the failedOrder parameter for?
Remember that CAS is an RMW primitive: it performs an atomic read even when it fails. If the CAS fails, the failedOrder parameter determines the semantics of that read; the same orderings as for a plain read are supported. In practice, specifying "semantics for failure" is rarely needed, though of course it depends on the algorithm.
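As an illustration, here is a Treiber-style lock-free stack push using both orders: release on success (publishing the node) and relaxed on failure (the reloaded head value needs no ordering yet). This is a sketch of the common pattern, not code from the article:

```cpp
#include <atomic>

struct node { int value; node* next; };

// Lock-free push onto a Treiber-style stack head.
void push(std::atomic<node*>& head, node* n)
{
    n->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,   // successOrder: publish the node
                                       std::memory_order_relaxed))  // failedOrder: plain re-read of head
        ;   // on failure, n->next has been refreshed with the current head
}
```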
Relaxed semantics
Finally, the third atomicity model. Relaxed semantics applies to all atomic primitives (load, store, and all RMW operations) and imposes almost no restrictions, so it allows the processor to rearrange instructions to the greatest extent; that is its biggest advantage. Why "almost"?
First, the standard still guarantees the atomicity of relaxed operations: even a relaxed operation is atomic, with no partial effects.
Second, speculative writes are forbidden for relaxed atomic stores.
On some weakly ordered architectures these requirements impose real constraints on relaxed atomic operations. For example, on Intel Itanium a relaxed atomic load is implemented with ld.acq (acquire-read; do not confuse Itanium's acquire with C++'s acquire).
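A sketch of the canonical legitimate use of relaxed semantics: an event counter whose increments need atomicity (no lost updates) but no ordering, because the total is only read after all threads have joined. The names are mine:

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> g_hits(0);

// Each fetch_add is atomic even though memory_order_relaxed imposes
// no ordering against other memory operations; the join() at the end
// provides all the synchronization the final read needs.
long count_events(int nthreads, int per_thread)
{
    g_hits.store(0, std::memory_order_relaxed);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([per_thread] {
            for (int i = 0; i < per_thread; ++i)
                g_hits.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool)
        th.join();
    return g_hits.load(std::memory_order_relaxed);
}
```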
A requiem for Itanium
I have mentioned Intel Itanium many times in this article, which may make me look like a fan of the architecture; in fact, the architecture is slowly fading away, and no, I am not a fan. The Itanium VLIW architecture differs from others in the construction of its instruction set: memory ordering is specified by suffixes on load, store, and RMW instructions. You will not find that in modern architectures, and those acquire and release suffixes make me suspect that C++11 borrowed the terms from Itanium.
Itanium, or a descendant of it, was supposed to be our future, until AMD introduced AMD64, the 64-bit extension of x86; at that time Intel was unhurriedly developing its own 64-bit architecture. Some details hidden in that architecture suggest that a desktop Itanium was originally intended for us, and the ports of the Microsoft Windows operating system and the Visual C++ compiler to Itanium indirectly confirm it. AMD clearly ruined Intel's plans, and Intel had to catch up and graft 64-bit onto x86. In the end, Itanium stayed in the server segment and withered, because it could no longer get proper development resources.
Still, Itanium's VLIW instruction set is interesting and in some ways a breakthrough: the work modern processors do on the fly (building execution blocks, reordering operations) was pushed into the Itanium compilers. But the compilers never managed the task and could not produce well-optimized code. As a result, Itanium's performance repeatedly hit rock bottom; Itanium is a future we failed to reach.
But who knows, maybe it is too early to write Itanium's requiem?
Readers familiar with the C++11 standard will surely ask: "Where are the relations that actually define the semantics of atomic operations: happens-before, synchronizes-with, and the rest?" My answer: in the standard.
Anthony Williams describes them in detail in chapter 5 of his book C++ Concurrency in Action, with many worked examples.
The standard's developers faced an important task in formulating the rules of the C++ memory model: these rules do not describe where memory barriers go, they define the guarantees of communication between threads.
The result is a concise specification of the C++ memory model.
Unfortunately, in practice these relations are very hard to apply: for both simple and complex lock-free algorithms, a large number of variables must be considered to verify the correctness of each memory_order.
That is why the default model is the sequential consistency model, which requires no special memory_order arguments on atomic operations. As noted earlier, that model comes at the cost of speed, and applying the weaker models (acquire/release or relaxed) requires verification of the algorithm.
A supplementary note: after further reading, I find the conclusion above not quite accurate. The sequential consistency model by itself guarantees nothing either; even with its help you can make a mess of your code. So whatever the memory model, verification of a lock-free algorithm is necessary; it is just especially necessary under the weak models.
Verification of lock-free algorithm
The first verification method I know of is the relacy library by Dmitriy Vyukov. Unfortunately, this approach requires building a special model: as a first step, a simplified model of the lock-free algorithm is built on top of relacy and debugged (why simplified? because when modeling you usually have to carefully strip away everything unrelated to the algorithm itself); only then do you write the production algorithm. This approach suits researchers developing lock-free algorithms and data structures, and that is indeed who it serves.
But going through two steps is usually hard; perhaps out of natural laziness, people want something that works right away.
I suspect the author of relacy is aware of this drawback (no irony intended: it is a groundbreaking project in this small field). One remedy is a verification method built into the standard library itself, so that no separate model is needed; this looks a bit like the safe iterators concept in STL.
More recently, a new tool, ThreadSanitizer, developed by Dmitriy and his Google colleagues, can detect data races in programs; it is therefore very useful for catching illegal reorderings of atomic operations. Importantly, the tool is built not into STL but at a lower level, into the compilers themselves (Clang 3.2, GCC 4.8).
Using ThreadSanitizer is very simple: compile the program with the appropriate flags, run the unit tests, and then study the rich log it produces. I apply this tool to my libcds library to make sure libcds has no such problems.
"I don't quite understand"-critical criteria
I will dare to criticize the C++ standard on one point: I simply do not understand why the semantics is passed as a run-time parameter of atomic operations. Logically, a template parameter would be the right tool:
template <typename T>
class atomic {
    template <memory_order Order>
    T load() const;
    template <memory_order Order>
    void store( T val );
    template <memory_order Order>
    bool compare_exchange_weak( T& expected, T desired );
    // and so forth, and so on
};
Let me explain why I believe this would be more correct.
As mentioned more than once, the semantics of an atomic operation affects not only the processor but also the compiler: the semantics is an optimization (half-)barrier for the compiler. Moreover, the compiler should check that an atomic operation is given appropriate semantics (for example, reject release semantics applied to a read). So the semantics must be known at compile time, yet in code like the following, I find it hard to imagine how the compiler could manage that:
extern std::memory_order currentOrder;
std::atomic<int> atomicInt;
atomicInt.store( 42, currentOrder );
Formally, this code does not violate the C++11 standard, so the only things the compiler can do are:
Report an error; but why should the atomic interface be allowed to provoke an error here?
Apply sequentially consistent semantics, which in short is "not too bad"; but then the variable currentOrder is silently ignored, and the program runs into exactly the problems we wanted to avoid.
Generate a switch/case over all possible values of currentOrder; but then instead of one or two assembly instructions we get a pile of inefficient code, and the validity problem is still unsolved: you could call a release read or an acquire write.
The template approach has none of these defects: in a template function, the ordering is a compile-time constant from the memory_order enumeration. Admittedly, calling the atomic operations becomes a bit cumbersome:
std::atomic<int> atomicInt;
atomicInt.store<std::memory_order_relaxed>( 42 );
// or even like that:
atomicInt.template store<std::memory_order_relaxed>( 42 );
However, this verbosity is offset by the fact that the operation's semantics is fixed and checkable at compile time. The only explanation I see for the non-template interface is compatibility with the C language: besides the std::atomic<T> class, the C++11 standard also introduces C-style atomicity functions such as atomic_load and atomic_store.
That concludes "Understanding the C++11 memory model". Thank you for reading.