What are the key technologies of Linux multi-core parallel programming?

2025-03-31 Update From: SLTechnology News&Howtos

This article looks at the key technologies of Linux multi-core parallel programming. The techniques introduced here are simple, fast, and practical, so let's dive in.

Background of multi-core parallel programming

Before Moore's Law began to fail, processor performance could keep up with application demands through techniques such as higher clock frequencies and hardware hyper-threading. As frequency scaling ran into physical limits and Moore's Law gradually broke down, integrating multiple cores became the mainstream way to improve processor performance; single-core processors are now hard to find on the market, which is evidence of this trend. To take full advantage of multi-core computing resources, multi-core parallel programming is unavoidable, and the Linux kernel is a typical multi-core parallel programming scenario. However, parallel programming in a multi-core environment faces many challenges.

The Challenges of Multi-core Parallel Programming

Mainstream computers today follow the von Neumann architecture, that is, a shared-memory computing model, and this model is not friendly to parallel computing. A typical computer hardware architecture of this kind has the following design features:

Multiple CPU cores increase the processor's computing power

Multi-level caches improve the efficiency of CPU access to main memory

Each CPU has local memory (NUMA, non-uniform memory access), which further improves the efficiency of access to main memory

A store buffer mitigates the write stalls caused by waiting for coherence acknowledgements on cache writes

An invalidate queue reduces invalidation-acknowledgement latency: the acknowledgement is sent as soon as the invalidate message is queued

Peripheral DMA allows devices to access main memory directly, improving CPU utilization

These hardware design features also introduce many problems, the biggest of which are cache coherence and out-of-order execution.

The cache-coherence problem is solved by the MESI cache-coherence protocol. MESI is implemented in hardware and is transparent to software. The protocol guarantees that all CPUs see modifications to a single variable in a single cache line in the same order, but it does not guarantee that modifications to different variables are seen in the same order on all CPUs. This leads to reordering. And that is not the only source; reordering has many causes:

Delayed processing in the store buffer can cause reordering

Delayed processing in the invalidate queue can cause reordering

Compiler optimizations can cause reordering

CPU hardware optimizations such as branch prediction and deep pipelining can cause reordering

Peripheral DMA can cause data reordering

As a result, even the atomicity of a simple ++ operation cannot be guaranteed. These problems must be solved by new technical means of multi-core parallel programming.

Key Technologies of Multi-core parallel programming

Locking technology

The Linux kernel provides a variety of locking mechanisms: spinlocks, semaphores, mutexes, read-write locks, sequence locks, and so on. A brief comparison follows; the implementation and usage details are not expanded here, and you can refer to the relevant chapters of books such as Linux Kernel Development.

Spinlock: does not sleep, no process context-switch overhead; usable in interrupt context and suited to short critical sections

Semaphore: may sleep, supports multiple concurrent holders inside the critical section at the same time; suited to cases where sleeping is possible or critical sections are long

Mutex: similar to a semaphore, but only one holder may be inside the critical section at a time

Read-write lock: supports concurrent readers, with mutual exclusion between writers and between writers and readers; readers can delay writers, so it is read-friendly and suited to read-mostly workloads

Sequence lock: supports concurrent readers, with mutual exclusion between writers and between writers and readers; writers force readers to retry, so it is write-friendly and suited to write-heavy workloads

Although locks effectively protect against races under parallel execution, their parallel scalability is poor, so they cannot fully exploit the performance of multiple cores. If lock granularity is too coarse, scalability suffers; if it is too fine, system overhead becomes huge, design becomes difficult, and deadlocks become easy to introduce. Beyond poor scalability and deadlock, locks bring many other problems, such as thundering herds, livelock, starvation, unfair locking, and priority inversion. However, there are techniques and guidelines that solve or mitigate these risks:

Acquire locks in a consistent global order (a lock hierarchy) to avoid deadlock

Use exponential backoff to mitigate livelock and starvation

Use scoped locks (lock trees) to mitigate the thundering-herd problem

Use priority inheritance to solve priority inversion

Atomic technology

The main purpose of atomic technology is to solve cache inconsistency and the way out-of-order execution breaks atomic access. The main atomic primitives are:

ACCESS_ONCE(): restricts only compiler optimization of memory accesses

barrier(): restricts only compiler reordering

smp_wmb(): write memory barrier; flushes the store buffer and restricts both compiler and CPU reordering of writes

smp_rmb(): read memory barrier; processes the invalidate queue and restricts both compiler and CPU reordering of reads

smp_mb(): full memory barrier; flushes both the store buffer and the invalidate queue, restricting compiler and CPU reordering of both reads and writes

atomic_inc() / atomic_read(), etc.: integer atomic operations

One more thing worth noting: to guarantee atomicity, atomic_inc() must keep the cache line coherent, and propagating a cache line across a multi-core system is expensive, so its parallel scalability is poor.

Lock-free technology

The atomic primitives of the previous section are one form of lock-free technology; others include RCU, hazard pointers, and so on. It is worth noting that all of these lock-free techniques rely on memory barriers.

Hazard pointers are mainly used for object lifetime management; they are similar to reference counting but have better parallel scalability.

RCU can be used in many scenarios, replacing read-write locks, reference counting, garbage collectors, waiting for events to complete, and more, with better parallel scalability. However, RCU is unsuitable for some scenarios, such as write-heavy workloads, long critical sections, or sleeping inside the critical section.

However, all lock-free primitives only solve the parallel-scalability problem on the reader side; write-side scalability can only be addressed by data partitioning.

Data partitioning technology

Partitioning data structures to reduce shared data is the fundamental way to achieve parallel scalability. Partition-friendly (that is, parallel-friendly) data structures include:

Arrays

Hash tables

Radix trees / sparse arrays

Skip lists

Using these easily partitioned data structures helps us improve parallel scalability through data partitioning.

Besides using appropriate data structures, sensible partitioning guidelines are also important:

Read-write partitioning: separate read-mostly data from write-mostly data

Path partitioning: split data according to independent code execution paths

Dedicated partitioning: bind frequently updated data to a specific CPU/thread

Ownership partitioning: split a data structure by the number of CPUs/threads into per-CPU/per-thread data

Of the four partitioning rules, ownership partitioning is the most thorough.

At this point, I believe you have a deeper understanding of the key technologies of Linux multi-core parallel programming. You might as well try them out in practice. Follow us to keep learning!
