Today I will share with you some knowledge about atomicity and atomic primitives in C++ lock-free data structures. The content is detailed and the logic is clear; most people still know too little about this topic, so I share this article for your reference. I hope you get something out of reading it. Let's take a look.
Atomic operations can be divided into two groups: simple reads and writes, and atomic read-modify-write (RMW) operations. An atomic operation can be thought of as indivisible: it either happens completely or does not happen at all; we never observe any intermediate stage of its execution, and there are no partial effects. Even simple reads and writes are not necessarily atomic: for example, a read of data that is not memory-aligned is not atomic. On x86-based computers, such a read leads to an internal exception that forces the processor to read the data in several parts; on other architectures, such as Sparc and Intel Itanium, it leads to a segmentation fault, which can in principle be intercepted and handled, but certainly not atomically. On modern processors, atomicity of reads and writes is guaranteed only for aligned fundamental types (integers and pointers), and modern compilers guarantee correct alignment of volatile fundamental types. If you want a 4-8 byte structure to be read and written atomically, you must take care of proper alignment yourself with compiler directives, and each compiler has its own way of specifying data and type alignment. By the way, the libcds library provides a set of helper types and macros that hide the compiler dependence when you declare aligned data.
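As an illustration, here is a minimal C++11 sketch (my example, not from the original article) of how alignment interacts with atomicity; Pair32 is a hypothetical 8-byte type, and alignas plays the role of the compiler-specific alignment directives mentioned above:

#include <atomic>
#include <cstdint>
#include <iostream>

// Hypothetical 8-byte payload; alignas(8) requests the natural alignment
// that atomic 8-byte reads/writes require on common 64-bit architectures.
struct alignas(8) Pair32 {
    uint32_t lo;
    uint32_t hi;
};

int main() {
    std::atomic<Pair32> a( Pair32{ 1, 2 } );
    // is_lock_free() reports whether the platform really provides an
    // atomic instruction for this size and alignment (typically true on x86-64).
    std::cout << "lock-free: " << a.is_lock_free() << std::endl;
    Pair32 p = a.load();              // atomic read of both words at once
    a.store( Pair32{ p.hi, p.lo } );  // atomic write of both words
}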
Compare-and-swap
Even with the best will in the world, it is very hard to design a lock-free container algorithm that uses only reads and writes (I know of no such data structure for an arbitrary number of threads). That is why processor architects introduced RMW operations, which atomically perform a read of an aligned memory cell followed by a write to it: compare-and-swap (CAS), fetch-and-add (FAA), test-and-set (TAS), and so on. In academic circles, compare-and-swap (CAS) is considered the most fundamental of these operations. The pseudocode is as follows:
bool CAS( int * pAddr, int nExpected, int nNew )
atomically {
    if ( *pAddr == nExpected ) {
        *pAddr = nNew;
        return true;
    }
    else
        return false;
}
Literally: if the value at address pAddr equals the expected nExpected, assign nNew to the variable and return true; otherwise return false and leave the value of the variable unchanged. The whole operation is atomic and indivisible and produces no visible partial results. With the help of CAS, all other RMW operations can be expressed. For example, fetch-and-add looks like this:
int FAA( int * pAddr, int nIncr )
{
    int nCur;
    do {
        nCur = *pAddr;
    } while ( !CAS( pAddr, nCur, nCur + nIncr ));
    return nCur;
}
The academic form of the CAS operation is not always convenient in practice: when CAS fails, we often want to know what value is currently in the memory cell. So consider another variant of CAS (a so-called valued CAS, still executed atomically):
int CAS( int * pAddr, int nExpected, int nNew )
atomically {
    if ( *pAddr == nExpected ) {
        *pAddr = nNew;
        return nExpected;
    }
    else
        return *pAddr;
}
C++11 covers both variants with the compare_exchange function (strictly speaking, there is no such single function in C++11; there are compare_exchange_strong and compare_exchange_weak, which I will discuss later):
bool compare_exchange( int volatile * pAddr, int& nExpected, int nNew );
The parameter nExpected is passed by reference: on input it contains the expected value of the variable at address pAddr, and on output it contains the value that was there before the call. (Translator's note: it actually returns the old value at pAddr.) The function returns true if the exchange succeeds (the variable held nExpected and has been set to nNew), and false if it fails (in which case nExpected now contains the current value of the variable at pAddr). This single primitive covers both variants of the academic CAS defined above. In practice, however, compare_exchange has a subtle pitfall: the nExpected parameter is passed by reference and may be changed by the call, which is not always acceptable.
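A minimal sketch of this by-reference behavior with the real C++11 interface (std::atomic<int>::compare_exchange_strong); the values are arbitrary:

#include <atomic>
#include <cassert>

int main() {
    std::atomic<int> v( 42 );
    int nExpected = 10;  // deliberately wrong guess
    bool bOk = v.compare_exchange_strong( nExpected, 99 );
    assert( !bOk );            // the exchange failed...
    assert( nExpected == 42 ); // ...and nExpected now holds the current value
    bOk = v.compare_exchange_strong( nExpected, 99 );  // retry with 42
    assert( bOk && v.load() == 99 );
}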
With compare_exchange, the fetch-and-add primitive above can be written as follows:
int FAA( int * pAddr, int nIncr )
{
    int nCur = *pAddr;
    // the loop body is empty: a failed compare_exchange refreshes nCur
    // with the current value of the variable
    do {} while ( !compare_exchange( pAddr, nCur, nCur + nIncr ));
    return nCur;
}
ABA problem
The CAS primitive is good for many purposes. However, in its application a serious problem may occur: the so-called ABA problem. To describe it, consider a typical usage pattern of a CAS operation:
int nCur = *pAddr;
while (true) {
    int nNew = /* calculate new value */;
    if ( compare_exchange( pAddr, nCur, nNew ))
        break;
}
In effect, we loop until the CAS succeeds. The loop is necessary because, between reading the current value at address pAddr and computing the new value nNew, the variable at pAddr can be changed by another thread.
The ABA problem can be described as follows. Suppose thread X reads the value A from a shared memory cell, and this value is a pointer to some data. Then thread Y changes the value of the cell to B and later back to A, but now the pointer A refers to different data. When thread X tries to change the value of the cell with the CAS primitive, the comparison with the previously read value A succeeds and the CAS completes successfully. But A now points to completely different data, so thread X breaks the internal object connections, which leads to failure.
Here is an implementation of a lock-free stack that reproduces the ABA problem [Mic04]:
// Shared variables
static NodeType * Top = NULL;   // Initially null

Push( NodeType * node ) {
    NodeType * t;
    do {
/*Push2*/   t = Top;
/*Push3*/   node->Next = t;
/*Push4*/ } while ( !CAS( &Top, t, node ));
}

NodeType * Pop() {
    NodeType * t;
    NodeType * next;
    do {
/*Pop1*/    t = Top;
/*Pop2*/    if ( t == NULL )
/*Pop3*/        return NULL;
/*Pop4*/    next = t->Next;
/*Pop5*/ } while ( !CAS( &Top, t, next ));
/*Pop6*/ return t;
}
The following sequence of actions destroys the stack structure (note that this sequence is not the only one that triggers the ABA problem):
Thread X: calls Pop(); line Pop4 is performed; variable values: t == A, next == A->next.
Thread Y: NodeType * pTop = Pop(); pTop == top of the stack, i.e. A.
Thread Y: Pop(); // removes one more element
Thread Y: Push( pTop ); // now the top of the stack is A again, but note that A->next has changed
Thread X: line Pop5 is performed; the CAS succeeds, but Top is assigned a value (the old A->next) that no longer exists in the stack, because thread Y has popped both A and A->next, while thread X's local variable next still holds the stale old value of A->next.
The ABA problem is a scourge of all lock-free containers based on the CAS primitive. It arises in multithreaded code when element A is removed from the container, replaced by element B, and then replaced by element A again, while other threads still hold pointers to the element being removed. Even if a thread physically deletes A and then calls new to create a new element, there is no guarantee that the allocator will not return the same address A. The problem also occurs in scenarios with more than two threads, in which case we can speak of the ABCBA problem, the ABABA problem, and so on.
To deal with the ABA problem, the element should at first be deleted only logically, deferring its physical deletion (memory reuse, or safe memory reclamation) until no competing thread holds a local or global reference to it.
Therefore, element deletion in a lock-free data structure involves two steps:
The first step excludes (unlinks) the element from the lock-free container;
In the second step (deferred), the element is physically deleted once no references to it remain.
I will describe the different strategies for deferred deletion in detail in a subsequent article.
Load-Linked / Store-Conditional
My guess is that the ABA problem with CAS prompted processor developers to look for another RMW operation that does not suffer from it. That is how the pair of operations load-linked / store-conditional (LL/SC) appeared. These operations are extremely simple; the pseudocode is as follows:
word LL( word * pAddr ) {
    return *pAddr;
}

bool SC( word * pAddr, word New ) {
    if ( data in pAddr has not been changed since the LL call ) {
        *pAddr = New;
        return true;
    }
    else
        return false;
}
The LL/SC pair works like operator brackets. The load-linked (LL) operation simply returns the current value at address pAddr. The store-conditional (SC) operation stores a new value at pAddr only if the data at pAddr has not changed since the LL; here "changed" means any change to the cache line containing pAddr. To implement LL/SC, the designers had to modify the cache structure: roughly speaking, each cache line carries an extra status bit. The LL read sets (links) this bit; any write to the cache line resets it; and before storing, the SC checks whether the bit is still set for that cache line. If the bit is 1, meaning the line has not changed, the value at pAddr is replaced with the new value and the SC succeeds; otherwise the SC fails and the value at pAddr is left unchanged.
CAS can be implemented via an LL/SC pair as follows:
bool CAS( word * pAddr, word nExpected, word nNew ) {
    if ( LL( pAddr ) == nExpected )
        return SC( pAddr, nNew );
    return false;
}
Note that although this code takes several steps, it performs an atomic CAS: the contents of the target memory cell either stay unchanged or change atomically. Implementing the LL/SC pair on architectures that provide only the CAS primitive is also possible, but far less trivially; I do not intend to discuss it here. If you are interested, refer to [Mic04].
Modern processor architectures fall into two camps. The first supports the CAS primitive in its instruction set: x86, Intel Itanium, and Sparc; the primitive first appeared in the IBM System 370. The second supports the LL/SC pair: PowerPC, MIPS, Alpha, and ARM; it first appeared in DEC architectures. It is worth noting that LL/SC is not implemented perfectly in modern architectures: for example, nested LL/SC calls with different addresses are not allowed, and the link flag can be reset spuriously.
From the C++ point of view, the standard does not consider LL/SC pairs at all; it describes only the atomic primitive compare_exchange (CAS) and the atomic primitives derived from it: fetch_add, fetch_sub, exchange, and so on. The standard implies that CAS is easy to implement with LL/SC, whereas implementing LL/SC on top of CAS is far from simple. Therefore, in order not to complicate life for C++ library developers, the Standards Committee introduced only compare_exchange into C++. That is sufficient for the implementation of lock-free algorithms.
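To illustrate why compare_exchange is sufficient, here is a sketch (my example, not from the standard) of an "atomic maximum", an RMW operation that C++11 does not provide directly but that is easily assembled from a compare_exchange loop:

#include <atomic>

// Sets a to max(a, nValue). On failure compare_exchange_weak refreshes
// nCur with the current value; we stop as soon as a is already >= nValue.
void atomic_fetch_max( std::atomic<int>& a, int nValue )
{
    int nCur = a.load();
    while ( nCur < nValue && !a.compare_exchange_weak( nCur, nValue ))
        ;
}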
False sharing
In modern processors, a cache line is 64-128 bytes long, and the tendency in newer models is toward larger lines. Main memory and the cache exchange data in blocks of L bytes, where L is the cache line length. Even if a single byte in a cache line changes, the whole line is considered invalid and must be synchronized with main memory. This is managed by the cache coherence protocol in multiprocessor, multicore architectures.
If different shared data happen to be stored in the same cache line (that is, at adjacent addresses), then from the processor's point of view a change to one of them invalidates the others in the same line. This situation is called false sharing. For LL/SC primitives, false sharing is destructive, because these primitives work at the granularity of a cache line: the load-linked (LL) operation links the cache line, and the store-conditional (SC) operation checks whether the link flag for that line has been reset before writing; if the flag has been reset, the write is not performed and SC returns false. Since the line length L is quite large, any change to the cache line, even one unrelated to the target data, causes the SC primitive to return false. The result can be a livelock: a scenario in which a multicore processor runs at full load yet makes no useful progress.
To deal with false sharing, each shared variable must occupy a cache line entirely; this is usually achieved with padding. The physical structure of the cache affects all operations, not only LL/SC but also CAS. Some research therefore builds data structures in a special way that takes the cache structure (mainly the cache line length) into account; once the data structure is laid out properly, performance can improve dramatically.
struct Foo {
    int volatile nShared1;
    char _padding1[64];     // padding for cache line = 64 bytes
    int volatile nShared2;
    char _padding2[64];     // padding for cache line = 64 bytes
};
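A C++11 alternative to manual padding is alignas; here is a sketch under the assumption of a 64-byte cache line (C++17 later standardized the constant std::hardware_destructive_interference_size for this purpose):

#include <atomic>

struct Foo {
    alignas(64) std::atomic<int> nShared1;  // starts its own cache line
    alignas(64) std::atomic<int> nShared2;  // placed 64 bytes after nShared1
};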
CAS derivatives
To finish, I would like to introduce two more useful primitives: double-word CAS (dwCAS) and double CAS (DCAS).
Double-word CAS is similar to ordinary CAS, except that it operates on a memory cell of double size: 64 bits on 32-bit architectures and 128 bits (at least 96) on 64-bit architectures. On architectures that provide LL/SC instead of CAS, LL/SC would have to operate on double words as well. As far as I know, only x86 supports dwCAS. So why is dwCAS so useful? With it, we can organize one of the classic solutions to the ABA problem: tagged pointers. The scheme associates an integer tag with each shared pointer; a tagged pointer can be described by the following structure:
template <typename T>
struct tagged_pointer {
    T * ptr;
    uintptr_t tag;

    tagged_pointer()
        : ptr( new T )
        , tag( 1 )
    {}
};
To be atomic, variables of this type must be double-word aligned: 8 bytes on 32-bit architectures and 16 bytes on 64-bit ones. The tag contains the version number of the data that ptr points to. I will cover tagged pointers in detail in a future article on safe memory reclamation; for now, note only that memory for T-type data (and the corresponding tagged_pointer) should not be physically deleted, but moved to a free-list (separate for each type T), from which it can be handed out again later with an incremented tag. This is what solves the ABA problem: the pointer is effectively compound, and the tag carries its version number. If the tagged_pointer passed to dwCAS has the same address but a different tag value, the dwCAS will not succeed.
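Here is a sketch of how a tagged pointer might be used with dwCAS through std::atomic (my illustration, not the libcds implementation; whether the compare_exchange is genuinely lock-free depends on the platform, and on x86-64 GCC/Clang may need the -mcx16 option to emit cmpxchg16b):

#include <atomic>
#include <cstdint>

template <typename T>
struct alignas( 2 * sizeof(void *) ) tagged_ptr {
    T *       ptr;
    uintptr_t tag;
};

// Replaces the pointer and bumps the version tag in one double-word CAS;
// an ABA'd pointer carrying a stale tag makes the comparison fail.
template <typename T>
bool cas_tagged( std::atomic< tagged_ptr<T> >& a,
                 tagged_ptr<T> expected, T * pNew )
{
    tagged_ptr<T> desired = { pNew, expected.tag + 1 };
    return a.compare_exchange_strong( expected, desired );
}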
The second primitive, double CAS (DCAS), is purely theoretical and has not been implemented in any modern processor architecture. Its pseudocode is as follows:
bool DCAS( int * pAddr1, int nExpected1, int nNew1,
           int * pAddr2, int nExpected2, int nNew2 )
atomically {
    if ( *pAddr1 == nExpected1 && *pAddr2 == nExpected2 ) {
        *pAddr1 = nNew1;
        *pAddr2 = nNew2;
        return true;
    }
    else
        return false;
}
DCAS operates on two arbitrary, not necessarily adjacent, memory cells: if both current values match the expected ones, it changes both cells at once.
Why is this primitive so interesting? Because with it, it is easy to build a lock-free doubly linked list (deque), and that data structure is the basis of many interesting algorithms. Many academic works focus on data structures based on DCAS. Although the primitive has not been implemented in hardware, there are works, [Fra03] being the best known, that describe how to build DCAS (and multi-CAS, for any number of addresses pAddr1 ... pAddrN) from ordinary CAS.
Performance
So what is the performance of atomic primitives?
Modern processors are so complex and unpredictable that programmers often cannot tell how long an instruction really takes. This is particularly true of atomic instructions, whose working mechanism involves internal synchronization, processor bus signaling, and so on. Much work has gone into measuring processor instruction latencies; the measurements I cite come from [McKen05]. In that article the authors compare the cost of an atomic increment and of the CAS primitive with the cost of a nop (no-operation). For the Intel Xeon 3.06 GHz (a 2005 model), an atomic increment costs about 400 nops and a CAS about 850-1000 nops; for the IBM Power4 1.45 GHz, an atomic increment costs about 180 nops and a CAS about 250 nops. These measurements are quite old and processor architectures have improved since then, but I would guess the order of magnitude remains the same.
As you can see, atomic primitives are quite expensive, so it is disadvantageous to apply them everywhere without weighing the trade-offs. For example, a binary-tree search algorithm that uses CAS merely to read the current tree node is not an algorithm I like: the overhead there is pointless. It is worth noting that with each generation of the Intel Core architecture, CAS becomes faster; apparently Intel has put considerable effort into improving the microarchitecture.
Volatile and atomicity
There is a mysterious keyword in C++: volatile. Volatile is sometimes believed to be connected with atomicity and memory ordering. This is wrong, although there are historical reasons for the belief. Volatile merely guarantees that the compiler will not cache the value in a register (optimizing compilers love to do that: the more registers, the more data they cache). Reading a volatile variable always means reading from memory, and writing a volatile variable always writes to memory. Anyone who changes volatile data concurrently needs to be aware of this.
In fact, volatile gives us no atomicity, mainly because it provides no memory fences. In some languages, such as Java and C#, volatile is endowed with a magical status that does provide ordering, but that is not the case in C++11: volatile carries no special ordering guarantees, and as we now know, proper ordering is essential for atomicity.
Therefore, with a compiler conforming to C++11, it is not necessary to declare atomic variables volatile. But with older compilers, if you implemented atomicity yourself, volatile could be necessary. In the following declaration:
class atomic_int {
    int m_nAtomic;
    // ...
};
the compiler is within its rights to optimize accesses to m_nAtomic (even though they are indirect). Therefore it is often useful to declare the field as int volatile m_nAtomic here.
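With a C++11 compiler the trick is no longer needed; here is a sketch of the modern equivalent, in which std::atomic both forbids the register-caching optimization and supplies the memory ordering that volatile never did:

#include <atomic>

class atomic_int {
    std::atomic<int> m_nAtomic;
public:
    int  load() const    { return m_nAtomic.load(); }
    void store( int n )  { m_nAtomic.store( n ); }
};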
Libcds
So what can we get from the libcds library? We have implemented atomic primitives in the style of C++11 for the x86, amd64, Intel Itanium, and Sparc architectures. If the compiler does not support C++11, libcds falls back on its own implementation of atomics. Besides ordinary atomic reads and writes, the main primitive used to build lock-free data structures is CAS; dwCAS is rarely needed. So far, the libcds library has no implementation of DCAS or multi-CAS, but these are likely to appear in the future; the main constraint is that the DCAS algorithm of [Fra03] is considered too inefficient by many researchers. Nevertheless, as I have noted before, "inefficient" is no final verdict in the lock-free world: what is inefficient on today's hardware may yet become quite efficient for particular hardware and tasks in the days to come.