
What is the Linux CPU isolation method?


This article introduces Linux CPU isolation methods and the co-location (mixed deployment) background behind them.

Introduction

Co-location (mixed deployment, often called online-offline co-location) means improving node resource utilization by running online services (usually latency-sensitive, high-priority tasks) and offline tasks (usually CPU-hungry, low-priority tasks) on the same node at the same time. The key difficulty lies in the underlying resource isolation technology, which depends heavily on the OS kernel, and the resource isolation capability provided by the stock upstream Linux kernel is somewhat stretched (or at least imperfect) in the face of co-location requirements; deep hacking is still needed to meet production-level needs.

(Cloud-native) resource isolation technology mainly covers CPU, memory, IO, and network. This article focuses on CPU isolation and the related background; follow-up articles in this series will gradually expand to the other resources.

Background

Whether in an IDC or in the cloud, low resource utilization is a problem faced by most users and vendors. On the one hand, hardware is expensive (everyone has to buy it, most of the core technology is in other people's hands, so there is little pricing power and weak bargaining power) and its life cycle is short (it has to be replaced after a few years). On the other hand, it is embarrassing that such expensive equipment cannot be fully used. Take CPU utilization as an example: in most scenarios the average utilization is very low. If I guess it is no more than 20% (daily or weekly average), most readers would probably not object, which means the expensive hardware effectively delivers less than 1/5 of its value. Anyone running things carefully would feel the pain.

Therefore, improving the resource utilization of the host (node) is a goal worth pursuing, and the benefit is obvious. The approach is also very straightforward.

The conventional line of thinking: run more workloads. Easier said than done; who hasn't tried? The core difficulty is that typical services have pronounced peak-and-valley characteristics.

Ideally, different workloads would fill the machine evenly; in reality, utilization stays spiky, with high peaks and a low average.

Capacity planning has to be done for the worst case (assuming all services have the same priority). At the CPU level, capacity must be planned for the CPU peak (possibly the weekly or even monthly/annual peak), usually with some headroom on top to handle emergencies.

In reality the typical pattern is a high peak with a low average, so in most scenarios the actual average CPU utilization ends up very low.

The previous assumption was that "all services have the same priority", so the worst case of the business determines the behavior of the whole machine (low resource utilization). If we think differently and give services different priorities, there is far more room to maneuver: the quality of service of high-priority services can be guaranteed by sacrificing that of low-priority services (which is usually tolerable). We can then deploy more low-priority services alongside an appropriate number of high-priority services and improve overall resource utilization.

This is how mixed deployment (co-location) came about. The "mixing" here is essentially "prioritization": in a narrow sense it can simply be read as "online + offline", and in a broad sense it extends to the mixed deployment of services with multiple priority levels.

The core technologies involved include two aspects:

The underlying resource isolation technology, (usually) provided by the operating system (kernel); this is the core focus of this (series of) article(s).

The upper-layer resource scheduling technology, (usually) provided by the resource orchestration/scheduling framework above the OS (typically Kubernetes); this will be covered in another series of articles.

Co-location is also a hot topic and technical direction in the industry; the major vendors keep investing in it, the value is obvious, and the technical bar (barrier) is high. The related technology originated very early and has a long pedigree: the famous Kubernetes (and its predecessor Borg) actually grew out of Google's co-location scenarios. Judging from its history and results, Google is the industry benchmark, claiming an average CPU utilization of around 60%.

Technical challenge

As mentioned earlier, the underlying resource isolation technology is critical in co-location scenarios. "Resources" here fall into four broad categories:

CPU

Memory

IO

Network

This article focuses on CPU isolation, analyzing the technical difficulties, the current state, and possible schemes at the CPU isolation level.

CPU isolation

Among the four resource types above, CPU isolation can be considered the most fundamental isolation technology. On the one hand, CPU is a compressible (time-shareable) resource, so sharing it is relatively easy and the upstream solutions are comparatively usable; on the other hand, CPU is strongly coupled with the other resources, whose use (allocation/release) usually happens in process context and therefore indirectly depends on CPU. For example, when offline tasks are CPU-suppressed, their IO and network requests are (in most cases) suppressed as well, simply because the tasks do not get scheduled.

Therefore the effectiveness of CPU isolation also indirectly affects the isolation of the other resources; CPU isolation is the core isolation technology.

Kernel scheduler

Landing this in the OS, CPU isolation essentially depends entirely on the kernel scheduler, the basic functional unit with which the kernel distributes CPU time among workloads (to put it formally). Specifically (and narrowly), it corresponds to the default scheduler of the Linux kernel that we deal with most: the CFS scheduler (essentially a scheduling class, i.e. a set of scheduling policies).

The kernel scheduler decides when and which task (process) runs on a CPU; in co-location scenarios it therefore determines how much CPU time online and offline tasks get, and hence the effect of CPU isolation.
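As a concrete starting point, the scheduling class (policy) and nice value of any task can be inspected through standard syscalls; a minimal sketch in Python (the PID below is just a placeholder):

```python
import os

pid = 1234  # placeholder PID of a task to inspect; replace with a real one

policy = os.sched_getscheduler(pid)          # which scheduling class/policy the task uses
nice = os.getpriority(os.PRIO_PROCESS, pid)  # CFS nice value, -20..19

names = {os.SCHED_OTHER: "SCHED_OTHER (CFS)",
         os.SCHED_IDLE: "SCHED_IDLE",
         os.SCHED_FIFO: "SCHED_FIFO (RT)",
         os.SCHED_RR: "SCHED_RR (RT)"}
print(f"pid {pid}: policy={names.get(policy, policy)}, nice={nice}")
```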

Upstream kernel isolation effect

The Linux kernel scheduler provides five scheduling classes by default, but only two of them are practically usable for ordinary business workloads:

CFS

Real-time scheduler (rt/deadline)

In co-location scenarios, the essence of CPU isolation comes down to two requirements:

Try to suppress offline tasks when online tasks need to be run

When online tasks are not running, offline tasks are run using idle CPU

For "suppression", based on Upstream kernel (based on CFS), there are several ideas (schemes):

Priority

The priority of offline tasks can be lowered, or the priority of online tasks raised. Without changing the scheduling class (i.e. staying on the default CFS), the dynamically adjustable priority range is the nice range [-20, 20), that is, -20 to 19.

Priority manifests itself as the time slice a task can be allocated within a single scheduling period. Specifically:

The weight ratio of time-slice allocation between the normal priority 0 and the lowest priority 19 is 1024:15, roughly 68:1.

The weight ratio between the highest priority -20 and the normal priority 0 is 88761:1024, roughly 87:1.

The weight ratio between the highest priority -20 and the lowest priority 19 is 88761:15, roughly 5917:1.

The suppression ratio looks fairly high: by setting the nice value of offline tasks to 19 and keeping online tasks at the default 0, the weight ratio of online to offline time-slice allocation is 68:1.
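Demoting an offline task this way is a single setpriority call; a minimal sketch, assuming you already know the offline task's PID (the value below is a placeholder):

```python
import os

offline_pid = 4321  # placeholder PID of an offline (best-effort) task

# Push the offline task to the lowest CFS priority (nice 19);
# online tasks stay at the default nice 0, giving a ~68:1 weight ratio.
os.setpriority(os.PRIO_PROCESS, offline_pid, 19)
print("offline nice:", os.getpriority(os.PRIO_PROCESS, offline_pid))
```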

Assuming a single scheduling period of 24 ms (the default configuration on many systems), a rough estimate of the time slice an offline task can get per period is about 24 ms / 69 ≈ 348 µs, i.e. it can occupy roughly 1.4% of a CPU.
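These numbers can be reproduced from the CFS weight table (the kernel's sched_prio_to_weight[] maps nice -20, 0, and 19 to 88761, 1024, and 15); a back-of-the-envelope sketch that also assumes the 24 ms period mentioned above:

```python
# CFS load weights from the kernel's sched_prio_to_weight[] table.
W_NICE_M20, W_NICE_0, W_NICE_19 = 88761, 1024, 15

print(f"nice 0   vs nice 19: {W_NICE_0 / W_NICE_19:.1f} : 1")     # ~68 : 1
print(f"nice -20 vs nice 0 : {W_NICE_M20 / W_NICE_0:.1f} : 1")    # ~87 : 1
print(f"nice -20 vs nice 19: {W_NICE_M20 / W_NICE_19:.1f} : 1")   # ~5917 : 1

# One online task at nice 0 plus one offline task at nice 19 sharing a CPU,
# with an assumed 24 ms scheduling period:
period_ms = 24
offline_share = W_NICE_19 / (W_NICE_0 + W_NICE_19)
print(f"offline slice per period: {period_ms * offline_share * 1000:.0f} us "
      f"(~{offline_share * 100:.1f}% of the CPU)")                # ~346 us, ~1.4%
```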

The actual behavior differs a little, though: for throughput, CFS protects a minimum granularity per run (the minimum time a task runs once it is picked): sched_min_granularity_ns, set to 10 ms in many environments. This means that once an offline task preempts, it may run for 10 ms continuously, so the scheduling latency (RR switch latency) of online tasks may reach 10 ms.

There is also a minimum granularity protection on wakeup (a minimum runtime guarantee for the task that would be preempted by a wakeup): sched_wakeup_granularity_ns, commonly set to 4 ms. This means that once an offline task is running, the wakeup latency of online tasks (another typical scheduling latency) may reach 4 ms.
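Both tunables can be inspected directly. A small sketch that tries the classic /proc/sys/kernel sysctl paths first and then the debugfs location used by newer kernels (the exact locations vary by kernel version, so treat the paths as assumptions):

```python
from pathlib import Path

def read_tunable(name):
    """Read a scheduler tunable, trying the classic sysctl path first,
    then the debugfs location used by newer kernels (path assumed)."""
    candidates = (
        Path("/proc/sys/kernel") / name,                            # older kernels
        Path("/sys/kernel/debug/sched") / name.removeprefix("sched_"),
    )
    for path in candidates:
        try:
            return int(path.read_text())
        except (OSError, ValueError):
            continue
    return None

for name in ("sched_min_granularity_ns", "sched_wakeup_granularity_ns"):
    value = read_tunable(name)
    print(name, "=", f"{value} ns (~{value / 1e6:.1f} ms)" if value else "not found")
```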

In addition, adjusting priority does not change the preemption logic. When preemption is decided (on wakeup and at the periodic tick), priority is not taken into account: no different preemption strategy is applied just because offline tasks have a low priority (their preemption is neither suppressed nor made less frequent), which can lead offline tasks to preempt online tasks unnecessarily and cause interference.

Cgroup (CPU share)

The Linux kernel provides the CPU cgroup (corresponding to containers/pods). A container's priority can be controlled by setting its cgroup share value, so "suppression" can be achieved by lowering the share of the offline cgroup. For cgroup v1 the default cpu.shares value is 1024; for cgroup v2 the default cpu.weight value is 100 (both can of course be adjusted). If the offline cgroup's share/weight is set to 1 (the lowest value), then in CFS the corresponding time-slice weight ratios are roughly 1024:1 and 100:1, and the corresponding offline CPU share is about 0.1% and 1% respectively.
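In practice this "suppression" is just a write to the CPU controller's interface file. A minimal sketch, assuming a cgroup named offline already exists and the usual mount points are in place (paths differ between distributions; root is required to write):

```python
from pathlib import Path

# Assumed cgroup paths for an "offline" group; adjust to your mount layout.
CG_V2 = Path("/sys/fs/cgroup/offline")          # cgroup v2 unified hierarchy
CG_V1 = Path("/sys/fs/cgroup/cpu/offline")      # cgroup v1 cpu controller

if (CG_V2 / "cpu.weight").exists():
    # v2: default weight is 100, minimum is 1.
    (CG_V2 / "cpu.weight").write_text("1")
elif (CG_V1 / "cpu.shares").exists():
    # v1: default shares is 1024; the kernel clamps shares to a minimum of 2.
    (CG_V1 / "cpu.shares").write_text("2")
else:
    raise SystemExit("offline cgroup not found; create it first (mkdir under the cgroup mount)")
```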

The actual behavior is still constrained by sched_min_granularity_ns and sched_wakeup_granularity_ns, with logic similar to the priority scheme.

As with the priority scheme, the preemption logic is not optimized based on share values, so additional interference is still possible.

Special policy

CFS also provides a special scheduling policy, SCHED_IDLE, intended for very-low-priority tasks; it looks as if it were designed for "offline tasks". A SCHED_IDLE task is essentially a CFS task with a weight of 3, so its time-slice weight ratio against an ordinary (nice 0) task is 1024:3, roughly 341:1, which gives offline tasks about 0.3% of the CPU.
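Putting a task into SCHED_IDLE requires no cgroup setup at all; a minimal sketch using the standard-library wrapper around sched_setscheduler (the PID is a placeholder, and you must own the target process):

```python
import os

offline_pid = 4321  # placeholder PID of an offline task

# SCHED_IDLE ignores the RT priority field; it must be 0.
os.sched_setscheduler(offline_pid, os.SCHED_IDLE, os.sched_param(0))
assert os.sched_getscheduler(offline_pid) == os.SCHED_IDLE
```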

The actual behavior is again constrained by sched_min_granularity_ns and sched_wakeup_granularity_ns, similar to the priority scheme.

CFS does apply a special preemption optimization to SCHED_IDLE tasks (it suppresses preemption of other tasks by SCHED_IDLE tasks, reducing preemption opportunities), so in this respect SCHED_IDLE is a small step toward "fitting" co-location scenarios (even though upstream never intended it that way).

In addition, SCHED_IDLE is a per-task flag and there is no cgroup-level SCHED_IDLE marking capability, while CFS scheduling first picks a (task) group and then picks a task within that group; therefore, for cloud-native (container) co-location, SCHED_IDLE alone has little practical effect.

Overall, although CFS provides priorities (share and SCHED_IDLE are similar in principle; they are essentially priorities too) and suppresses low-priority tasks accordingly to some extent, the core design of CFS is "fairness", which fundamentally rules out "absolute suppression" of offline tasks. Even with the lowest priority (weight), offline tasks still obtain a fixed time slice, and that time slice is not spare CPU time but time carved out of the online tasks' share. In other words, CFS's "fair design" means the interference of offline tasks on online tasks cannot be completely avoided, and a perfect isolation effect cannot be achieved.

Moreover, lowering the priority of offline tasks (as all of these schemes do) squeezes the priority space of the offline tasks themselves; in other words, if you want to further differentiate priorities among offline tasks (there may well be QoS distinctions between offline tasks, and in practice there often are), there is nothing left to work with.

Furthermore, from the perspective of the underlying implementation, since online and offline tasks both use the CFS scheduling class, they share run queues (rq), their load is accounted together, and they share the load-balancing mechanism. On the one hand, operations on shared resources (such as run queues) require synchronization (locking) between online and offline, and the lock primitives themselves have no notion of priority, so offline interference cannot be excluded. On the other hand, load balancing cannot single out offline tasks for special treatment (for example more aggressive balancing to prevent starvation and improve CPU utilization), so the balancing behavior of offline tasks cannot be controlled.

Real-time priority

At this point you may wonder: if absolute preemption (suppression of offline tasks) is needed, why not use the real-time scheduling classes (rt/deadline)? Compared with CFS, the real-time classes deliver exactly the "absolute suppression" effect.

That's true. But under this approach, the online business has to be set to a real-time class while offline tasks remain on CFS, so that online can absolutely preempt offline. If you worry about offline starvation, the RT throttling mechanism (rt_throttle) ensures that offline tasks will not starve completely.
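A sketch of this idea: promote the online task to an RT policy and check the RT-throttling sysctls that keep CFS (offline) tasks from being starved completely. The PID is a placeholder, and changing policies requires root or CAP_SYS_NICE:

```python
import os
from pathlib import Path

online_pid = 1234  # placeholder PID of a latency-sensitive online task

# Promote the online task to SCHED_FIFO with a mid-range RT priority (1..99).
os.sched_setscheduler(online_pid, os.SCHED_FIFO, os.sched_param(50))

# RT throttling: RT tasks may use sched_rt_runtime_us out of every
# sched_rt_period_us (typically 950000/1000000, i.e. 95%), leaving the rest to CFS tasks.
runtime = int(Path("/proc/sys/kernel/sched_rt_runtime_us").read_text())
period = int(Path("/proc/sys/kernel/sched_rt_period_us").read_text())
print(f"RT tasks capped at {runtime}/{period} us per period ({100 * runtime / period:.0f}%)")
```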

It looks "perfect", but it's not. The essence of this approach will compress the priority space and survival space of online tasks (as opposed to the previous results of lowering the priority of offline tasks). As a result, online services can only use real-time scheduling classes (although most online businesses do not meet the characteristics of real-time types), and can no longer take advantage of the native capabilities of CFS (such as fair scheduling, Cgroup, etc., which are the rigid requirements of online tasks).

Simply put, the problem is that the real-time classes do not match the needs of the online tasks themselves. Online businesses are, in essence, not real-time tasks; forcing them into real-time brings serious side effects, such as starvation of system tasks (tasks that ship with the OS, e.g. various kernel threads and system services).

To sum up, for real-time priorities:

Recognize the "absolute suppression" ability of real-time types for CFS types (which is exactly what we want)

However, with the current upstream kernel implementation, the only way to get it is to put online tasks into a real-time class with higher priority than CFS, which is unacceptable in practical application scenarios.

Priority inversion

At this point you may still have a big question mark in your mind: with "absolute suppression", won't there be a priority inversion problem? What then?

The answer is: yes, there is a priority inversion problem.

The priority-inversion logic in this scenario works as follows. If online and offline tasks share resources (for example common data in the kernel, such as the /proc file system), and an offline task takes a lock while accessing a shared resource (lock in the abstract sense, not necessarily literally), then under "absolute suppression" it may never get to run again; when an online task later needs the same shared resource, it blocks waiting for the corresponding lock, priority inversion occurs, and the result can be a deadlock (or at least prolonged blocking). Priority inversion is a classic problem that any scheduling model has to consider.

A rough summary of the conditions under which priority inversion occurs:

Online and offline tasks share resources.

The shared resources are accessed concurrently and protected by sleeping locks.

After taking the lock, the offline task is completely and absolutely suppressed and never gets a chance to run. Concretely: all CPUs are 100% occupied by online tasks, leaving no opportunity for offline tasks to run. (In theory, as long as some CPU is idle, offline tasks can make use of it via the load-balancing mechanism.)

In cloud-native co-location scenarios, how to deal with priority inversion depends on how you look at the problem. Let us look at it from a few different angles.

How likely is priority inversion to occur? That depends on the actual application scenario. In theory, if online and offline businesses share no resources, priority inversion cannot occur. In cloud-native scenarios there are generally two situations:

(1) Secure-container scenarios. Here the business actually runs inside a "virtual machine" (loosely speaking), and the virtual machine itself guarantees isolation of most resources; priority inversion can basically be avoided in this scenario (and where it does exist, it can be handled separately).

(2) Ordinary container scenarios. Here the business runs in containers and some resources are shared, such as common kernel resources and shared file systems. As analyzed above, even with shared resources the conditions for priority inversion are quite stringent; the most critical one is that all CPUs are 100% occupied by online tasks. This is very rare in real scenarios and can be regarded as an extreme case; in practice such "extreme scenarios" can be handled separately.

Therefore, in (most) real cloud-native scenarios, we can assume that priority inversion can be avoided as long as the scheduler optimization/hack is good enough.

How should priority inversion be handled? Although it occurs only in extreme scenarios, what if it must be dealt with (upstream certainly has to consider it)?

(1) Upstream's approach. In the CFS implementation of the native Linux kernel, a certain weight is reserved even for the lowest priority (think of SCHED_IDLE), which means that even the lowest-priority task gets some time slice, so priority inversion is (basically) avoided. This has always been the community's attitude: stay generic, and cover even the extreme scenarios. That design is precisely why "absolute suppression" cannot be achieved. From a design point of view there is nothing wrong with it, but for cloud-native co-location it is not ideal: it has no sense of how starved offline tasks are, so even when offline tasks are not starving they may still preempt online tasks and cause unnecessary interference.

(2) Another approach, designed and optimized for cloud-native scenarios: detect offline starvation and the possibility of priority inversion, and let offline tasks preempt only when they are starving and inversion might otherwise occur (that is, only as a last resort). This avoids unnecessary preemption (interference) on the one hand and the priority inversion problem on the other, achieving a (relatively) ideal effect. Admittedly, such a design is less generic and less graceful, so upstream is unlikely to accept it.

Hyperthreading interference

So far one key issue has been left out: hyperthreading interference. This is another stubborn problem in co-location scenarios, and the industry has had no targeted solution for it.

The specific problem is that hyperthreads on the same physical core share core hardware resources, such as caches and execution units. When an online task and an offline task run on a pair of sibling hyperthreads at the same time, they interfere with each other through competition for these hardware resources, and CFS did not consider this problem in its design at all.
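The hyperthread (SMT sibling) topology can be read from sysfs, which is what a co-location system needs in order to avoid putting an online and an offline task on the same physical core; a minimal sketch:

```python
from pathlib import Path

def physical_core_siblings():
    """Group logical CPUs by physical core using the sysfs topology files."""
    groups = {}
    for cpu_dir in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
        sib_file = cpu_dir / "topology" / "thread_siblings_list"
        if sib_file.exists():
            siblings = sib_file.read_text().strip()   # e.g. "0,32" or "0-1"
            groups.setdefault(siblings, []).append(cpu_dir.name)
    return groups

for siblings, cpus in physical_core_siblings().items():
    print(f"physical core with hyperthreads {siblings}: {cpus}")
```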

As a result, the performance of online services suffers in co-location scenarios. In actual tests with CPU-intensive benchmarks, the performance interference caused by hyperthreading can exceed 40%.

Note: Intel's official figures put the performance of a full physical core (both hyperthreads busy) at only about 1.2x that of a single hardware thread.

Hyperthreading interference is a key problem in co-location scenarios, and CFS (almost) completely ignored it in its original design. This is not so much a design gap as a reflection of the fact that CFS was not designed for co-location but for more general, broader scenarios.

Core scheduling

At this point, readers with a kernel-scheduling background may have another question: haven't you heard of core scheduling? Can't it solve the hyperthreading-interference problem?

Indeed, a very professional question. Core scheduling is a feature submitted in 2019 by Peter Zijlstra, maintainer of the kernel scheduler subsystem (building on the coscheduling concept proposed earlier in the community). Its main goal is to solve (or rather mitigate / work around) the L1TF vulnerability (data leakage through caches shared between hyperthreads), and its main application scenario is cloud virtual machines: preventing data leakage caused by processes of different virtual machines running on the same pair of hyperthreads.

The core idea is to prevent differently tagged processes from running on the same pair of hyperthreads at the same time.
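For completeness, here is a sketch of how a process could tag itself with a core-scheduling cookie on a kernel built with CONFIG_SCHED_CORE (Linux 5.14+). The prctl constants are taken from linux/prctl.h as I understand them and should be treated as assumptions; ctypes is used because the Python standard library has no prctl wrapper:

```python
import ctypes, os

# Constants from <linux/prctl.h> (Linux >= 5.14); treat the values as assumptions.
PR_SCHED_CORE = 62
PR_SCHED_CORE_CREATE = 1                 # create a new cookie for the target task
PR_SCHED_CORE_SCOPE_THREAD_GROUP = 1

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Give the calling process (pid 0 = self) its own core-scheduling cookie:
# its threads may share a physical core with each other, but not with
# differently tagged (or untagged) tasks.
ret = libc.prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                 PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0)
if ret != 0:
    print("prctl(PR_SCHED_CORE) failed:", os.strerror(ctypes.get_errno()))
```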

Let me state some (personal) opinions directly:

Core scheduling can indeed be used to solve the problem of hyper-threading interference.

However, core scheduling was originally designed to address a security vulnerability (L1TF), not hyperthreading interference in co-location. Because it aims at security and absolute isolation, it needs complex (expensive) synchronization primitives (such as a core-wide rq lock), heavyweight feature implementations such as core-scoped task picking, and very heavy forced idling, plus matching mechanisms such as interrupt-context concurrency isolation.

The design and implementation of core scheduling are too heavy and too expensive, the performance regression is severe, and it cannot distinguish online from offline tasks; it is not well suited to (cloud-native) co-location scenarios.

In essence: core scheduling was not designed for cloud-native co-location scenarios.
