This article discusses the shortcomings of Linux in multi-core scalability design. It goes into some detail, and I hope readers who are interested will find it helpful.
In fact, I don't want to discuss the concept of the microkernel, nor am I good at expounding concepts; that is encyclopedia material. But recently, with the release of Hongmeng (HarmonyOS), this topic has been blown far out of proportion, and I can't resist the temptation. Besides, I have always liked operating system topics, so let me take on this cliche.
When it comes to microkernels, their performance is routinely criticized because of IPC. Apart from this obvious "flaw", however, little attention seems to be paid to anything else, so I wrote something a little different.
I accept that the microkernel's performance "flaw" is caused by costly IPC (and it is), and I further assume that IPC performance can be optimized and indeed has been (even if nothing were done, such historical shortcomings tend to fade as hardware evolves). Dodging the core issue like this is not entirely fair, but I have to do it so that the rest of the article can flow.
The reason many people are not optimistic about the microkernel is largely that it is so different from the Linux kernel. People assume that any operating system kernel that differs from Linux must be defective in one way or another, because the Linux kernel has brainwashed us.
The design of the Linux kernel has solidified people's understanding of what an operating system kernel is, to the point where whatever Linux does is right and whatever contradicts Linux is probably wrong. But is the Linux kernel necessarily correct?
In my opinion, the Linux kernel is just a kernel that happened to appear at the right time, and happened to be open source, letting people peek inside an operating system kernel for the first time. That does not mean it is necessarily right; on the contrary, it may well be wrong. In the 1990s Windows NT was taking off, but it was hard to see what was inside it, and "Windows Internals" was all the rage; UNIX was mired in disputes, and GNU never delivered a usable kernel of its own. The Linux kernel satisfied people's curiosity, so it became the preconception of what an operating system should look like, and in the eyes of most people, its only possible shape.
This article is mainly about the scalability of the kernel.
Let me throw some cold water first: the Linux kernel is far from perfect in this respect.
True, from 2.6 to 5.3 over more than a decade the Linux kernel has kept refining its SMP multi-core scalability, but to be honest there has been no fundamental adjustment to the architecture:
The O(1) scheduler.
Load balancing across SMP scheduling domains.
Per-CPU data structures.
Lock-free data structures.
These are all details, nothing to be wowed by, plus ever finer management of cache flushing, the sort of thing you forget the day after reading about it.
This is reminiscent of how people kept optimizing the CSMA/CD algorithm before switched Ethernet appeared. Nothing was truly impressive until the switch emerged, and CSMA/CD was almost completely abandoned, because it was never the right thing in the first place.
What makes the switch right is, at its core, arbitration.
When a shared resource can only accommodate one visitor at a time, we call it "a shared resource that must be accessed serially". When multiple entities want to access such a resource, going one by one is inevitable. There are two ways to arrange the one-by-one:
Let the visitors contend for a lock, and whoever wins goes first.
Introduce an arbiter that queues the visitors and admits them in turn.
Which one is better? Read on.
Contention inevitably leads to conflicts, and conflicts delay everyone's progress. Which one would you choose?
For now, let us temporarily forget concepts such as macro kernel, microkernel, process isolation, process switching, cache flushing, and IPC. They do not help us understand the essence of the problem; on the contrary, they keep us from forming a new perception. For example, no matter how good you think the microkernel is, someone will jump out and say that IPC is its bottleneck. When you propose an optimization such as exchanging page table entries, someone else will say that process switching flushes the cache and that saving and restoring register context is expensive; then perhaps you bring up caches tagged with the process PID, and finally a "show me the code" leaves you speechless. Back and forth, and you still have not seen the whole picture; you have been trapped in the details.
So forget all this and look at a point of view:
For shared resources that must be accessed serially, the right thing to do is to introduce an arbitrator to queue up visitors, rather than leaving visitors to concurrently contend for locks!
The very concept of an "operating system" is not sacred; you can call it anything. In the early days it was called a monitor; nowadays we just call it an operating system, but that does not make the concept magical.
The operating system exists to coordinate many-to-one access by multiple processes (also an abstract concept; call them tasks if you like, it does not matter) to the underlying shared resources. The most typical such resource is probably the CPU, and almost everyone knows that CPU time needs to be scheduled, which is why task scheduling has always been a hot topic.
Notice that the CPU is not used by all tasks concurrently; the scheduler decides who gets to use it. Scheduling, that is, arbitration, is the essence of the operating system.
Then why, for the files, sockets, routing tables, and other resources shared across the system, do we resort to concurrent contention?! All shared resources should be scheduled for use, just like the CPU.
If we start from the most essential functions an operating system is supposed to provide, instead of taking Linux as the preconceived standard, we will find that the way the Linux kernel handles concurrency is obviously wrong!
Spin locks are used throughout the Linux kernel. They were obviously the simplest and easiest way to evolve from a single core to SMP: the goal was merely that nothing breaks.
Indeed, on a single core the spin lock does not spin as its name implies; in the single-core case the Linux spin lock implementation simply disables preemption, because that alone is enough to avoid problems.
But once SMP had to be supported, merely disabling preemption could no longer guarantee correctness, so spinning in place and waiting for the lock holder to leave became the most obvious solution, and the spin lock has been with us ever since. To this day the spin lock keeps being optimized, but however much it is optimized, it remains an ill-suited spin lock.
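To make "spinning" concrete, here is a minimal user-space sketch built on C11 atomics. The names toy_spin_lock and toy_spin_unlock are my own illustration, not the Linux kernel's actual spinlock code (which today uses ticket or queued locks, and on a single-core build reduces to little more than disabling preemption), but the busy-wait cost is the same idea.

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long shared_counter = 0;

static void toy_spin_lock(void)
{
	/* Busy-wait until the previous value of the flag was clear: this loop
	 * is the price every other CPU pays while the holder stays inside. */
	while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
		; /* spin */
}

static void toy_spin_unlock(void)
{
	atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		toy_spin_lock();
		shared_counter++;       /* the protected critical section */
		toy_spin_unlock();
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("%ld\n", shared_counter);   /* 2000000: the lock serialized access */
	return 0;
}

Every thread that loses the test-and-set burns CPU in that while loop for as long as the holder remains in the critical section, which is exactly the cost examined in the simulation below.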
It is clear that the Linux kernel was not designed for SMP in the first place, so its concurrency model is wrong, or at least inappropriate.
Below I use a set of user-space code to simulate how a macro kernel without arbitration and a microkernel with arbitration handle shared resource access. The code is fairly simple, so I did not add many comments.
The following code simulates the spin-lock contention model used when accessing shared resources in a macro kernel:
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>
#include <sys/timeb.h>

static int count = 0;
static int curr = 0;
static pthread_spinlock_t spin;
long long end, start;
int timer_start = 0;
int timer = 0;

long long gettime()
{
	struct timeb t;
	ftime(&t);
	return 1000 * t.time + t.millitm;
}

void print_result(int sig)
{
	(void)sig;
	printf("%d\n", curr);
	exit(0);
}

struct node {
	struct node *next;
	void *data;
};

void do_task()
{
	int i = 0, j = 2, k = 0;
	// For a fairer comparison: the microkernel simulation allocates memory,
	// so do a matching (dummy) allocation here as well.
	struct node *tsk = (struct node *)malloc(sizeof(struct node));

	pthread_spin_lock(&spin); // Lock the entire access-and-compute section
	if (timer && timer_start == 0) {
		struct itimerval tick = {0};
		timer_start = 1;
		signal(SIGALRM, print_result);
		tick.it_value.tv_sec = 10;
		tick.it_value.tv_usec = 0;
		setitimer(ITIMER_REAL, &tick, NULL);
	}
	if (!timer && curr == count) {
		end = gettime();
		printf("%lld\n", end - start);
		exit(0);
	}
	curr++;
	for (i = 0; i < 0xff; i++) {
		// Do some mildly time-consuming computation, simulating something like
		// a socket operation. Raise the intensity (e.g. increase 0xff) when
		// testing on a machine with more CPUs, otherwise the queueing overhead
		// would overwhelm the cost of the simulated task.
		k += i/j;
	}
	pthread_spin_unlock(&spin);
	free(tsk);
}

void *func(void *arg)
{
	while (1) {
		do_task();
	}
}

int main(int argc, char **argv)
{
	int err, i;
	int tcnt;
	pthread_t tid;

	count = atoi(argv[1]);  // total number of tasks to execute
	tcnt = atoi(argv[2]);   // number of worker threads
	if (argc == 4) {
		timer = 1;      // any extra argument: run for 10 seconds and report tasks done
	}
	pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
	start = gettime();
	// Create worker threads
	for (i = 0; i < tcnt; i++) {
		err = pthread_create(&tid, NULL, func, NULL);
		if (err != 0) {
			exit(1);
		}
	}
	sleep(3600);
	return 0;
}

In contrast, the microkernel model sends requests via IPC to a dedicated service process. The simulation code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>
#include <sys/timeb.h>

static int count = 0;
static int curr = 0;
long long end, start;
int timer = 0;
int timer_start = 0;
static int total = 0;

long long gettime()
{
	struct timeb t;
	ftime(&t);
	return 1000 * t.time + t.millitm;
}

struct node {
	struct node *next;
	void *data;
};

void print_result(int sig)
{
	(void)sig;
	printf("%d\n", total);
	exit(0);
}

struct node *head = NULL;
struct node *current = NULL;

void insert(struct node *node)
{
	node->data = NULL;
	node->next = head;
	head = node;
}

struct node *delete()
{
	struct node *tempLink = head;
	head = head->next;
	return tempLink;
}

int empty()
{
	return head == NULL;
}

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_spinlock_t spin;

int add_task()
{
	struct node *tsk = (struct node *)malloc(sizeof(struct node));

	pthread_spin_lock(&spin); // Only the task queue is locked, not the task itself
	if (timer || curr < count) {
		curr++;
		insert(tsk);
	}
	pthread_spin_unlock(&spin);
	return curr;
}

// Raise the intensity (e.g. increase 0xff) when testing on a machine with more
// CPUs, otherwise the queueing overhead would overwhelm the cost of the
// simulated task.
void do_task()
{
	int i = 0, j = 2, k = 0;
	for (i = 0; i < 0xff; i++) {
		k += i/j;
	}
}

void *func(void *arg)
{
	int ret;
	while (1) {
		ret = add_task();
		if (!timer && ret == count) {
			break;
		}
	}
	return NULL;
}

void *server_func(void *arg)
{
	while (timer || total != count) {
		struct node *tsk;
		pthread_spin_lock(&spin);
		if (empty()) {
			pthread_spin_unlock(&spin);
			continue;
		}
		if (timer && timer_start == 0) {
			struct itimerval tick = {0};
			timer_start = 1;
			signal(SIGALRM, print_result);
			tick.it_value.tv_sec = 10;
			tick.it_value.tv_usec = 0;
			setitimer(ITIMER_REAL, &tick, NULL);
		}
		tsk = delete();
		pthread_spin_unlock(&spin);
		do_task();
		free(tsk);
		total++;
	}
	end = gettime();
	printf("%lld %d\n", end - start, total);
	exit(0);
}

int main(int argc, char **argv)
{
	int err, i;
	int tcnt;
	pthread_t tid, stid;

	count = atoi(argv[1]);  // total number of tasks to execute
	tcnt = atoi(argv[2]);   // number of worker threads
	if (argc == 4) {
		timer = 1;      // any extra argument: run for 10 seconds and report tasks done
	}
	pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
	// Create the service (arbiter) thread
	err = pthread_create(&stid, NULL, server_func, NULL);
	if (err != 0) {
		exit(1);
	}
	start = gettime();
	// Create worker threads
	for (i = 0; i < tcnt; i++) {
		err = pthread_create(&tid, NULL, func, NULL);
		if (err != 0) {
			exit(1);
		}
	}
	sleep(3600);
	return 0;
}

Now let us compare the time cost of the two modes when executing the same total number of tasks under different thread-count constraints (see the comparison chart). In the microkernel simulation, the overhead of many threads accessing the shared data curr in parallel hardly changes as the number of threads grows, whereas in the macro kernel simulation the total time grows linearly with the number of threads. This extra cost is clearly the cost of the spin lock: on today's prevalent CPU cache architectures, the overhead of a queued spin lock grows linearly in exactly this way.
So why doesn't the lock overhead in the emulated code of the microkernel increase with the number of threads?
Because in a macro kernel the concurrent contexts that execute a task are isolated from each other, the entire task has to be protected by a lock, as in the Linux kernel's tcp_v4_rcv:
bh_lock_sock_nested(sk); /* How long this section takes is unpredictable, so the
                            time other CPUs spend spinning idly is unpredictable
                            too: inefficient and wasteful! */
ret = 0;
if (!sock_owned_by_user(sk)) {
	if (!tcp_prequeue(sk, skb))
		ret = tcp_v4_do_rcv(sk, skb);
} else if (unlikely(sk_add_backlog(sk, skb,
				   sk->sk_rcvbuf + sk->sk_sndbuf))) {
	bh_unlock_sock(sk);
	NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
	goto discard_and_relse;
}
bh_unlock_sock(sk);
In the microkernel code, by contrast, such a task is packaged up and handed to a dedicated service thread to schedule and execute, which greatly reduces the time spent inside the locked region.
In the isolated-context, concurrent-locking scenario, the macro kernel must lock the entire task, so the locking cost is huge; the microkernel only needs to lock the task queue, a cost that is independent of the specific task and therefore predictable.
Next, let us compare the time cost of the two modes executing the same number of tasks under different numbers of CPUs:
As the number of CPUs grows, the locking overhead of the macro kernel simulation increases linearly, while that of the microkernel simulation also increases, but far less noticeably.
Why is this? Look at the following comparison between the macro kernel and the microkernel. First, the macro kernel:
Take a look at the microkernel:
This is obviously the more modern approach: it not only reduces lock overhead and improves performance, it also greatly reduces CPU idling and improves CPU utilization.
Let's first look at the CPU utilization of code that simulates the macro kernel when it executes for 10 seconds:
If you look at the hot spots, you can guess that it is spinlock:
Clearly, the CPU utilization is high not because useful work is being done, but because the CPUs are spinning idly on the lock.
Let's take a look at how the code that simulates the microkernel behaves in the same situation:
Take a look at the hot spots:
Clearly there is still a hotspot in the spinlock, but it is much lower. With higher execution efficiency guaranteed, CPU utilization is not nearly as high, and the spare cycles can be used to run more meaningful work.
This article only demonstrates a qualitative effect. In practice, the task queue management of the microkernel's service process would be far more efficient; it can even be implemented in hardware. [See the switched fabric of a switch backplane.]
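As one hedged example of what "more efficient task queue management" could mean relative to the busy-polling server in the simulation above: the arbiter can sleep on a condition variable when the queue is empty instead of spinning. The names submit, take, and struct task below are my own, and a real microkernel would use its native IPC primitives rather than pthreads; this is only a sketch of the idea.

#include <stddef.h>
#include <pthread.h>

struct task {
	struct task *next;
};

static struct task *head = NULL;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

/* Producer side: lock only the enqueue, then wake the arbiter. */
void submit(struct task *t)
{
	pthread_mutex_lock(&qlock);
	t->next = head;
	head = t;
	pthread_cond_signal(&qcond);
	pthread_mutex_unlock(&qlock);
}

/* Arbiter side: sleep while the queue is empty instead of burning CPU. */
struct task *take(void)
{
	struct task *t;

	pthread_mutex_lock(&qlock);
	while (head == NULL)
		pthread_cond_wait(&qcond, &qlock);
	t = head;
	head = t->next;
	pthread_mutex_unlock(&qlock);
	return t;
}

The critical section is still just a couple of pointer operations, but the service thread no longer contributes to the CPU idling measured above while it has nothing to do.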
Having said all this, some will object: NO, your comparison is not rigorous, you only simulated access to shared data. If the code were genuinely parallelizable, wouldn't the microkernel scheme degrade performance? You would be crippling yourself for nothing, turning parallel into serial! True, but the kernel itself is shared: the operating system exists precisely to coordinate user processes' access to the underlying shared resources.
So real parallelism requires programmers to design applications that can be parallelized.
The kernel itself is shared. Multithreaded access to shared resources should be strictly serialized; concurrent lock contention is the most disorderly way to do that, and unified arbitration and scheduling is the most effective.
In daily life we can all see and understand why boarding a bus in a queue is more efficient than crowding around the door. In computer systems we see the same thing in switched Ethernet and PCIe: compared with CSMA/CD shared Ethernet, the switch is an arbitrating scheduler, and PCIe's message hub plays the same role.
In fact, even in a macro kernel, access to shared resources is not always handled by concurrent locking. For sensitive resources, such as hardware with strict latency requirements, the lowest layers of the system also use arbitration and scheduling: the queue schedulers sitting above the network card, for instance, and the corresponding I/O schedulers for disks.
For the macro kernel's higher-level logical resources, however, such as VFS file objects, socket objects, and various queues, access is not arbitrated. When multiple threads access them concurrently, the kernel falls back on the regrettable concurrent lock contention model. This is a last resort, because no single entity is in a position to arbitrate: the contexts that access these objects are isolated from one another.
A digression: when tuning a Linux system, it is basically enough to aim at hotspots of exactly this kind. A large share of hot paths come from this: opening and closing the same file, process context and softirq context operating on the same socket at the same time, softirq contexts on multiple CPUs pushing packets into the same queue during receive, and so on. And if you are not set on tuning Linux, then now that you know the Linux kernel's fundamental weakness in SMP environments, look more at the outside world; it may be better than the only thing in front of you.
When we evaluate traditional UNIX and Linux kernels, we should pay more attention to what they lack rather than assume they are right. [If you think they are right, it may just be because they are the first and only thing you have ever seen.]
If we must talk about concepts, then the one worth talking about is the virtual machine abstraction of modern operating systems.
The modern operating systems we usually talk about provide a multiplexing abstraction (including all the multi-process and multi-threading machinery) only for the "CPU plus memory" of the original von Neumann architecture; there is no such multiplexing abstraction for the file system, the network protocol stack, and so on. In other words, a modern operating system gives each process the illusion of an exclusive virtual machine, and that virtual machine contains only CPU and memory:
Timeslice scheduling makes the process think it has a monopoly on CPU.
Virtual memory makes the process think that it has exclusive memory.
There is no other virtual machine abstraction.
When processes use these abstracted resources, modern operating systems unquestionably rely on arbitration and scheduling:
The operating system provides a task scheduler to arbitrate time-multiplexing of the CPU (typically a multi-level feedback priority queue algorithm), handing physical CPU time slices to processes and threads.
The operating system provides a memory allocator to arbitrate the allocation of physical memory (typically a buddy system algorithm), handing physical memory to the virtual memory of processes and threads; a minimal sketch of the buddy idea follows this list.
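For illustration only, here is a minimal sketch of the buddy idea mentioned above. The 4 KiB page size and the order_for helper are assumptions of mine; the real allocator in the kernel's mm code is of course far more involved.

#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL        /* assume 4 KiB pages */

/* Smallest order such that a block of 2^order pages covers `bytes`. */
static int order_for(size_t bytes)
{
	size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	size_t req = 3 * PAGE_SIZE + 1;          /* a request just over 3 pages */
	int order = order_for(req);

	printf("request of %zu bytes -> order %d (%lu pages)\n",
	       req, order, 1UL << order);        /* order 2, i.e. 4 pages */
	/* If no free order-2 block exists, the allocator would split a larger
	 * free block into two "buddies", repeating until an order-2 block is
	 * obtained; freed buddies are merged back the same way. */
	return 0;
}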
Clearly, as stated at the beginning of this article, the operating system does not let processes contend concurrently for CPU and memory. Yet for almost every other resource the operating system imposes no such strict rule. It treats them in one of two ways:
Treat the other resources as not being part of the operating system's core; this gives rise to the microkernel, user-mode drivers, and so on.
Treat the other underlying resources as part of the operating system's core as well; this is the attitude of macro kernels such as Linux.
The attitude itself does not matter. Macro kernel, microkernel, user mode, kernel mode: these are just concepts, no big deal. The key question is:
How should access to shared resources be coordinated? Whether the resource is space or time, the choice is between concurrent lock contention and arbitrated scheduling.
Without doubt, the biggest controversy is how to coordinate access to the resources that are not virtualized per process, beyond CPU and memory: the file system and the network protocol stack. For either of them, excellent solutions exist today in both macro kernel and microkernel designs; unfortunately, none of these excellent solutions is the one adopted by the Linux kernel.
Incidentally, Nginx takes an approach similar in spirit to the microkernel, the switch, and PCIe, while Apache does not. There are many other examples that I will not repeat one by one; the point is that in the field of operating systems, the essential things are the elephant nobody sees, not the various named concepts.
Let me excerpt a passage from Wang Yin on the microkernel:
Talking about operating systems with some people is annoying, because I tend to discard certain terms and concepts and start from scratch. I try to understand such things from the starting point of the "nature of computation": their causes, their development, their current state, and possible improvements. I tend to care about "what this thing should look like" and "what else it could be (perhaps better)", not just "what it looks like now". People who do not understand this trait of mine, and who think they know something, often mistake me for not even understanding the basic terms. And so the conversation gets talked to death by them.
This is actually what I want to say.
So, forget microkernel and macro kernel, forget kernel mode and user mode, forget real mode and protected mode, and you will gain a deeper understanding of the essence: how to arbitrate access to shared resources.
That is all I have to share on the shortcomings of Linux in multi-core scalability design. I hope it has been helpful.