What is Linux kernel preemption

2025-02-24 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >


Shulou(Shulou.com)06/01 Report--

This article explains what Linux kernel preemption is. The content is simple, clear, and easy to follow; read along to understand how kernel preemption works.

Environment:

Processor architecture: arm64

Kernel source code: linux-5.11

Ubuntu version: 20.04.1

Code Reading tool: vim+ctags+cscope

We have all heard of kernel preemption, but do we really understand it? What does kernel preemption have to do with a preemptive kernel? What is the preemption counter actually for? In this article we discuss the technical details of kernel preemption so that everyone can understand it.

Note: this article focuses on the CFS scheduling class.

2. Kernel preemption and preemptive kernels

We often run the uname -a command and see the word "PREEMPT" in its output; yes, we are running a preemptive kernel.

# uname -a
Linux (none) 5.11.0-g08a3831f3ae1 #1 SMP PREEMPT Fri Apr 30 17:41:53 CST 2021 aarch64 GNU/Linux

So what is a preemptive kernel? A kernel that supports kernel preemption is called a preemptive kernel; one that does not is called a non-preemptive kernel. Then what is kernel preemption? Consider the periodic tick: for user tasks, each clock interrupt checks whether the task's actual runtime has exceeded its ideal runtime, or whether a higher-priority task sits in the run queue. If either condition holds, the rescheduling flag is set, and scheduling happens just before the interrupt returns to user mode. This is so-called user-task preemption.

But if a task running in kernel mode is interrupted and the interrupt wakes up a higher-priority task, can the awakened task be scheduled right away? There are two cases. In a preemptive kernel, the high-priority task may preempt the current task and be scheduled (preemption is allowed, for example, when the difference in virtual runtime between the two exceeds the preemption granularity). In a non-preemptive kernel, preemption is not allowed: the high-priority task only gets a chance to run once the current task finishes its kernel-mode work or schedules voluntarily.

In other words, a kernel that supports kernel preemption allows not only user-mode tasks but also kernel-mode tasks to be preempted (note that we say kernel mode, because user-space tasks can enter kernel mode through system calls and the like). This is very friendly to interactive or low-latency scenarios, such as handheld devices and desktop applications. A server, by contrast, wants higher throughput and more CPU time, with interactivity and latency secondary, so it is configured as a non-preemptive kernel.

The following figure shows the non-preemptive kernel scheduling:

The following figure shows preemptive kernel scheduling:

Comparing the two diagrams, we can see that with preemptive kernel scheduling, a high-priority task woken up in an interrupt gets a prompt response.

The choice between a preemptive and a non-preemptive kernel is described in kernel/Kconfig.preempt in the source tree:

config PREEMPT_NONE
	bool "No Forced Preemption (Server)"
	help
	  This is the traditional Linux preemption model, geared towards
	  throughput. It will still provide good latencies most of the
	  time, but there are no guarantees and occasional longer delays
	  are possible.

	  Select this option if you are building a kernel for a server or
	  scientific/computation system, or if you want to maximize the
	  raw processing power of the kernel, irrespective of scheduling
	  latencies.

config PREEMPT
	bool "Preemptible Kernel (Low-Latency Desktop)"
	depends on !ARCH_NO_PREEMPT
	select PREEMPTION
	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
	select PREEMPT_DYNAMIC if HAVE_PREEMPT_DYNAMIC
	help
	  This option reduces the latency of the kernel by making
	  all kernel code (that is not executing in a critical section)
	  preemptible.  This allows reaction to interactive events by
	  permitting a low priority process to be preempted involuntarily
	  even if it is in kernel mode executing a system call and would
	  otherwise not be about to reach a natural preemption point.
	  This allows applications to run more 'smoothly' even when the
	  system is under load, at the cost of slightly lower throughput
	  and a slight runtime overhead to kernel code.

	  Select this if you are building a kernel for a desktop or
	  embedded system with latency requirements in the milliseconds
	  range.

Two configuration options are listed above: one supports kernel preemption, the other does not. There are also PREEMPT_VOLUNTARY and PREEMPT_RT: the former adds explicit preemption points to the kernel, and the latter provides real-time behavior.

3. Reschedule flags and preemption counters

Some kernel paths do not allow scheduling, such as atomic context. If a high-priority task is woken there, or the tick finds that the rescheduling condition is met, the high-priority task cannot run immediately; the kernel only records that rescheduling is needed by setting the rescheduling flag. Later, when execution returns to a context where scheduling is allowed (for example when preemption is re-enabled), the flag is checked to decide whether to call the scheduler and pick the next task to run.

Setting the flag that marks the need for rescheduling:

// set the TIF_NEED_RESCHED flag in the current task's tsk->thread_info->flags
#define TIF_NEED_RESCHED	1	/* rescheduling necessary */

After this flag is set on some kernel path, scheduling happens at the nearest scheduling point (either when preemption is next re-enabled, or when the next interrupt or exception returns).

The rescheduling flag set on the current task only indicates that scheduling will happen soon, not immediately. For user tasks, scheduling happens just before an interrupt or exception returns to user mode. For kernel-mode tasks, setting the rescheduling flag is not enough to preempt the current task: the current task's preemption counter must also be 0.

For all kernel-mode tasks, the preemption counter is critical to rescheduling: as long as it is non-zero, an awakened task cannot get the CPU no matter how urgent it is. Let's take a look at this preemption counter:

tsk->thread_info->preempt.count

Let's take a look at the definition of preemption counters for the arm64 architecture:

struct thread_info {
	unsigned long	flags;		/* low level flags */
	...
	union {
		u64	preempt_count;	/* 0 => preemptible, <0 => bug */
		struct {
#ifdef CONFIG_CPU_BIG_ENDIAN
			u32	need_resched;
			u32	count;
#else
			u32	count;
			u32	need_resched;
#endif
		} preempt;
	};
	...
};

Notice that it is a union: some kernel paths use preempt_count, others use preempt. Why such an odd definition? Because one member can then encode two states at once: the rescheduling flag and the value of the preemption counter.

When rescheduling is needed, the TIF_NEED_RESCHED flag in flags is set and, at the same time, preempt.need_resched is cleared to 0. So when the check thread_info->preempt_count == 0 succeeds, it means both that the preemption counter is 0 and that the TIF_NEED_RESCHED flag has been set, and the task can be rescheduled (this check happens, for example, just before an interrupt returns to kernel mode).
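As a userspace sketch (not kernel code), the union trick can be modeled as follows. The layout below assumes a little-endian machine, matching the #else branch of the kernel definition; the type and function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of arm64's thread_info preemption union (little-endian
 * layout): the 64-bit preempt_count reads as 0 only when the preemption
 * counter is 0 AND need_resched has been cleared to 0 (0 means
 * "reschedule needed" -- the flag is deliberately inverted). */
union preempt_model {
    uint64_t preempt_count;     /* read both fields with one 64-bit load */
    struct {
        uint32_t count;         /* preemption counter (low word) */
        uint32_t need_resched;  /* 0 => resched needed (inverted flag) */
    } preempt;
};

/* One comparison answers "preemptible AND resched requested?". */
static int can_preempt(const union preempt_model *ti)
{
    return ti->preempt_count == 0;
}
```

With this layout the check performed on interrupt return collapses to a single 64-bit compare against zero, which is exactly why the union is defined this way.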

Let's see how to set the rescheduling flag:

// kernel/sched/core.c  resched_curr:
	if (cpu == smp_processor_id()) {
		set_tsk_need_resched(curr);
		set_preempt_need_resched();
		return;
	}

// arch/arm64/include/asm/preempt.h
static inline void set_preempt_need_resched(void)
{
	current_thread_info()->preempt.need_resched = 0;
}

When a kernel path needs to request rescheduling (for example on a clock tick), it calls resched_curr. As you can see, besides setting the TIF_NEED_RESCHED flag in the task's flags, it sets preempt.need_resched to 0.

How the rescheduling flag is cleared:

// kernel/sched/core.c  __schedule -- reached by both voluntary and preemptive scheduling
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

// arch/arm64/include/asm/preempt.h
static inline void clear_preempt_need_resched(void)
{
	current_thread_info()->preempt.need_resched = 1;
}

In the main scheduler, besides calling clear_tsk_need_resched to clear the TIF_NEED_RESCHED flag in the task's flags, clear_preempt_need_resched is called to set preempt.need_resched back to 1, clearing the rescheduling request.

The preemption counter's bit fields are laid out as follows:

Bits 0-7 are the preemption count, bits 8-15 the soft-interrupt count, bits 16-19 the hard-interrupt count, and bits 20-23 the NMI count. The corresponding bit field is incremented on entry to each kind of context, so while any field is set the preemption counter is non-zero and the task must not be preempted in kernel mode.
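A small userspace sketch of how those fields can be unpacked; the masks and shifts below simply mirror the layout just described (in the spirit of include/linux/preempt.h), and the function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of the preemption counter described above. */
#define PREEMPT_MASK  0x000000ffu  /* bits 0-7:   preempt_disable() depth */
#define SOFTIRQ_MASK  0x0000ff00u  /* bits 8-15:  softirq nesting         */
#define HARDIRQ_MASK  0x000f0000u  /* bits 16-19: hardirq nesting         */
#define NMI_MASK      0x00f00000u  /* bits 20-23: NMI nesting             */

#define SOFTIRQ_SHIFT 8
#define HARDIRQ_SHIFT 16
#define NMI_SHIFT     20

static unsigned preempt_depth(uint32_t pc) { return pc & PREEMPT_MASK; }
static unsigned softirq_depth(uint32_t pc) { return (pc & SOFTIRQ_MASK) >> SOFTIRQ_SHIFT; }
static unsigned hardirq_depth(uint32_t pc) { return (pc & HARDIRQ_MASK) >> HARDIRQ_SHIFT; }
static unsigned nmi_depth(uint32_t pc)     { return (pc & NMI_MASK) >> NMI_SHIFT; }

/* Atomic context: any field non-zero means preemption is forbidden. */
static int in_atomic_model(uint32_t pc) { return pc != 0; }
```

Entering a context adds the appropriate increment (1, 1 << SOFTIRQ_SHIFT, ...), leaving subtracts it; the single `pc != 0` test then covers all four fields at once.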

Therefore, the preemption counter serves two purposes: it identifies whether a kernel path is in atomic context, and it determines whether the task may be preempted while in kernel mode.

// include/linux/preempt.h
/*
 * Macros to retrieve the current execution context:
 *
 * in_nmi()             - We're in NMI context
 * in_hardirq()         - We're in hard IRQ context
 * in_serving_softirq() - We're in softirq context
 * in_task()            - We're in task context
 */
#define in_nmi()		(nmi_count())		// in NMI context?
#define in_hardirq()		(hardirq_count())	// in hard-interrupt context?
#define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)	// serving a softirq?
#define in_task()		(!(in_nmi() | in_hardirq() | in_serving_softirq()))	// in process context?

/*
 * The following macros are deprecated and should not be used in new code:
 * in_irq()       - Obsolete version of in_hardirq()
 * in_softirq()   - We have BH disabled, or are processing softirqs
 * in_interrupt() - We're in NMI, IRQ, SoftIRQ context or have BH disabled
 */
#define in_irq()		(hardirq_count())	// in hard-interrupt context?
#define in_softirq()		(softirq_count())	// softirqs disabled or being processed?
#define in_interrupt()		(irq_count())		// in any interrupt context?

// in atomic context? (preemption counter non-zero)
#define in_atomic()		(preempt_count() != 0)

4. Scheduling timing of kernel preemption

Here the scheduling timing falls into two cases: check points, where no scheduling happens yet, and real preemption points, where the main scheduler is actually called:

Check points:

On tick: when the condition is met (the task has used up its ideal runtime, or its runtime exceeds the minimum preemption granularity and a higher-priority task sits in the run queue), the TIF_NEED_RESCHED flag is set and scheduling happens at the nearest preemption point.

On wakeup: when the condition is met (the difference between the virtual runtimes of the awakened task and the current task exceeds the minimum wakeup preemption granularity, the awakened task's virtual runtime being the smaller), the TIF_NEED_RESCHED flag is set and scheduling happens at the nearest preemption point.

Preemption points:

On interrupt return to kernel mode: preemptive scheduling happens when the condition is met (the rescheduling flag is set and the preemption counter is 0).

On re-enabling preemption (e.g. enabling preemption explicitly, enabling the interrupt bottom half, releasing a spinlock): preemptive scheduling happens when the condition is met (rescheduling flag set and preemption counter 0).

On re-enabling soft interrupts: preemptive scheduling happens when the condition is met (rescheduling flag set and preemption counter 0).

Interrupt return to kernel mode is the regular preemption point: even if no other interrupt fires, the periodic tick interrupt still occurs, and the current task is preempted when the condition holds (rescheduling flag set and preemption counter 0). In critical sections where races could occur, kernel preemption must be disabled: some paths call preempt_disable directly, some indirectly (such as the critical section of a spinlock), and some disable soft interrupts; all of these make the preemption counter non-zero. Inside such a critical section, if an interrupt wakes a high-priority task, the interrupt cannot schedule on its return to kernel mode. So at the end of the critical section the kernel checks whether the scheduling condition is met and, if so, performs preemptive scheduling, letting the awakened task respond in time. In general, once a check point has set the rescheduling flag for the current task and the preemption counter is 0, scheduling happens at the nearest preemption point (the three cases above). Note also that a preempt-disabled critical section only forbids kernel preemption on the CPU where the current task runs; other CPUs can still preempt. If the critical section may be accessed from other CPUs, protect it with a spinlock instead.
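The "preempt at the end of the critical section" behavior described above can be sketched in userspace; the names below are illustrative models, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of a preemption point: scheduling may only happen when
 * the rescheduling flag is set AND the preemption counter is back to 0. */
struct task_model {
    uint32_t preempt_count;
    int      need_resched;
    int      schedule_calls;   /* counts how often the "scheduler" ran */
};

static void model_schedule(struct task_model *t)
{
    t->need_resched = 0;       /* __schedule() clears the flag */
    t->schedule_calls++;
}

static void model_preempt_disable(struct task_model *t)
{
    t->preempt_count++;
}

/* Mirrors the spirit of preempt_enable(): decrement, and if the counter
 * hits zero while a reschedule is pending, preempt right here. */
static void model_preempt_enable(struct task_model *t)
{
    if (--t->preempt_count == 0 && t->need_resched)
        model_schedule(t);
}
```

Nested critical sections just bump the counter further; only the outermost enable can actually trigger the preemption, which is why a wakeup inside the critical section is deferred rather than lost.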

4.1 Check points

1) On the clock-interrupt tick:

// kernel/sched/core.c
scheduler_tick -> curr->sched_class->task_tick(rq, curr, 0) -> task_tick_fair -> entity_tick -> check_preempt_tick:

	if (delta_exec > ideal_runtime) {	// 1. the current task has used up its ideal runtime
		resched_curr(rq_of(cfs_rq));	// set the rescheduling flag
		...
	}
	if (delta_exec < sysctl_sched_min_granularity)	// has the task run less than the minimum scheduling granularity?
		return;
	...
	if (delta > ideal_runtime)	// 2. the current task's vruntime exceeds the leftmost task's by more than the ideal runtime
		resched_curr(rq_of(cfs_rq));	// set the rescheduling flag

On every clock tick, scheduler_tick is called to check whether rescheduling is needed; the rescheduling flag is set when either of the following two conditions holds:

1. The current task's actual runtime exceeds its ideal runtime (this caps a task's runtime within one scheduling period, preventing a "rogue" task from hogging the CPU; the periodic clock interrupt is how the kernel reclaims the processor).

2. The current task's actual runtime exceeds the minimum scheduling granularity, and its virtual runtime exceeds that of the leftmost task in the red-black tree by more than the ideal runtime (so a higher-priority task in the red-black tree can preempt the current task).
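That tick-time decision can be condensed into a small userspace sketch (a model of check_preempt_tick with illustrative names; times are plain integers here):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the CFS tick check described above: request a resched when
 * (1) the task has used up its ideal slice, or (2) it has run at least
 * the minimum granularity and its vruntime leads the leftmost task's by
 * more than the ideal slice. Illustrative, not kernel code. */
static int tick_needs_resched(uint64_t delta_exec,      /* runtime this slice */
                              uint64_t ideal_runtime,   /* the task's slice   */
                              uint64_t min_granularity,
                              int64_t  vruntime_delta)  /* curr - leftmost    */
{
    if (delta_exec > ideal_runtime)
        return 1;                       /* condition 1: slice exhausted */
    if (delta_exec < min_granularity)
        return 0;                       /* too early to preempt */
    if (vruntime_delta < 0)
        return 0;                       /* current task is still "behind" */
    if ((uint64_t)vruntime_delta > ideal_runtime)
        return 1;                       /* condition 2: gap to leftmost too large */
    return 0;
}
```

The minimum-granularity guard keeps a freshly scheduled task from being bounced off the CPU before it has done any useful work.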

2) Wakeup preemption:

On fork and the normal wake-up path:

Fork path:

// kernel/fork.c
kernel_clone -> wake_up_new_task(p) -> check_preempt_curr(rq, p, WF_FORK)
  -> rq->curr->sched_class->check_preempt_curr(rq, p, flags)
    -> check_preempt_wakeup	// kernel/sched/fair.c

	if (wakeup_preempt_entity(se, pse) == 1) {
		// the awakened task's vruntime is smaller than the current task's
		// by more than the wakeup preemption granularity converted to virtual time
		/*
		 * Bias pick_next to pick the sched entity that is
		 * triggering this preemption.
		 */
		if (!next_buddy_marked)
			set_next_buddy(pse);
		goto preempt;
	}

	return;

preempt:
	resched_curr(rq);	// set the rescheduling flag

Normal wake-up path:

// kernel/sched/core.c
wake_up_process -> try_to_wake_up -> ttwu_queue -> ttwu_do_activate -> ttwu_do_wakeup -> check_preempt_curr(rq, p, wake_flags)

Whether a task is newly created or just woken up, it may preempt the current task. The condition is: the current task's virtual runtime exceeds the awakened task's by more than the minimum wakeup preemption granularity converted to virtual time (i.e. the awakened task's virtual runtime is sufficiently smaller).
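A userspace sketch of that wakeup test (in the spirit of wakeup_preempt_entity, simplified to a yes/no answer; names and units are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the wakeup preemption check described above: the woken task
 * preempts only if its vruntime is smaller than the current task's by
 * more than the wakeup granularity expressed in virtual time. */
static int wakeup_should_preempt(int64_t curr_vruntime,
                                 int64_t woken_vruntime,
                                 int64_t wakeup_gran_virtual)
{
    int64_t vdiff = curr_vruntime - woken_vruntime;

    if (vdiff <= 0)
        return 0;   /* woken task is not behind at all: no preemption */
    if (vdiff > wakeup_gran_virtual)
        return 1;   /* gap exceeds the granularity: preempt the current task */
    return 0;
}
```

The granularity threshold keeps two tasks with nearly equal vruntimes from ping-ponging the CPU on every wakeup.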

4.2 Preemption points

The check points described above only set the rescheduling flag; they do not let the preempting task run. The real preemption point is where the main scheduler is called.

1) Interrupt returns to kernel mode

With kernel preemption enabled, before an interrupt returns to kernel mode the kernel checks whether the current task has the rescheduling flag set and the preemption counter is 0; if both hold, preemptive scheduling is performed.

// arch/arm64/kernel/entry.S  el1_irq:
671 #ifdef CONFIG_PREEMPTION
672 	ldr	x24, [tsk, #TSK_TI_PREEMPT]	// get preempt count
673 alternative_if ARM64_HAS_IRQ_PRIO_MASKING
674 	/*
675 	 * DA_F were cleared at start of handling. If anything is set in DAIF,
676 	 * we come back from an NMI, so skip preemption
677 	 */
678 	mrs	x0, daif
679 	orr	x24, x24, x0
680 alternative_else_nop_endif
681 	cbnz	x24, 1f				// preempt count != 0 || NMI return path
682 	bl	arm64_preempt_schedule_irq	// irq en/disable is done inside
683 1:
684 #endif

When an interrupt occurs, el1_irq runs to handle it.

Line 672 reads the current task's thread_info.preempt_count; line 681 checks whether it is 0, and if it is, line 682 calls arm64_preempt_schedule_irq for preemptive scheduling (as analyzed in the previous section).

Let's take a look at preemptive scheduling:

arm64_preempt_schedule_irq -> preempt_schedule_irq -> __schedule(true)	// call the main scheduler for preemptive scheduling

2) When preemption is re-enabled

Enable preemption:

preempt_enable
  -> if (unlikely(preempt_count_dec_and_test()))	// decrement the preemption counter; did it reach 0?
         __preempt_schedule();
       -> preempt_schedule	// kernel/sched/core.c
         -> __schedule(true)	// call the main scheduler for preemptive scheduling

Release the spin lock:

spin_unlock -> raw_spin_unlock -> __raw_spin_unlock -> preempt_enable	// as above

3) When soft interrupts are re-enabled

local_bh_enable -> __local_bh_enable_ip -> preempt_check_resched
  -> if (should_resched(0))
         __preempt_schedule();
       -> preempt_schedule -> __schedule(true)	// call the main scheduler for preemptive scheduling

In fact, __schedule is called for both voluntary and preemptive scheduling, and __schedule itself runs in a preemption-disabled context: it must not be preempted while scheduling.

5. Low-latency handling in the non-preemptive kernel

Let's look at how a kernel without kernel preemption enabled deals with latency:

You will see cond_resched called on some of the more time-consuming paths, such as in file systems and on memory-reclaim paths. What is it for?

Here is an example of its use: on the memory-reclaim path, pages to be reclaimed are isolated from the tail of the inactive LRU list onto page_list, and eventually shrink_page_list is called:

// mm/vmscan.c  shrink_page_list:
	while (!list_empty(page_list)) {
		...
		cond_resched();
		...	// reclaim the page
	}

You can see that for each isolated reclaim candidate on page_list, cond_resched is called before processing it, voluntarily checking whether rescheduling is needed.

Let's take a look at the macro implementation of cond_resched:

// include/linux/sched.h
/*
 * cond_resched() and cond_resched_lock(): latency reduction via
 * explicit rescheduling in places that are safe. The return
 * value indicates whether a reschedule was done in fact.
 * cond_resched_lock() will drop the spinlock before scheduling.
 */
#ifndef CONFIG_PREEMPTION
extern int _cond_resched(void);
#else
static inline int _cond_resched(void) { return 0; }
#endif

#define cond_resched() ({			\
	___might_sleep(__FILE__, __LINE__, 0);	\
	_cond_resched();			\
})

We can clearly see that on a preemptive kernel (CONFIG_PREEMPTION=y) _cond_resched is an empty stub that performs no rescheduling check; only the non-preemptive kernel calls the real _cond_resched to check for preemption voluntarily.

Let's take a look at _ cond_resched:

// kernel/sched/core.c
#ifndef CONFIG_PREEMPTION
int __sched _cond_resched(void)
{
	if (should_resched(0)) {	// preemption counter 0 and resched flag set?
		preempt_schedule_common();	// preemptive scheduling
		return 1;
	}
	rcu_all_qs();
	return 0;
}
EXPORT_SYMBOL(_cond_resched);
#endif

It voluntarily checks whether preemption is possible (in fact, should_resched checks both that the preemption counter is 0 and that the current task has the rescheduling flag set) and, if so, performs preemptive scheduling.
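The overall pattern can be modeled in userspace: a long loop, shrink_page_list-style, offers a scheduling opportunity on each iteration but only schedules when a resched was actually requested (a sketch with illustrative names):

```c
#include <assert.h>

/* Userspace model of _cond_resched(): schedule only when the preemption
 * counter is 0 and a reschedule has been requested. Illustrative names. */
struct cpu_model {
    unsigned preempt_count;
    int need_resched;
    int reschedules;
};

static int model_cond_resched(struct cpu_model *c)
{
    if (c->preempt_count == 0 && c->need_resched) {
        c->need_resched = 0;  /* __schedule() clears the flag */
        c->reschedules++;
        return 1;
    }
    return 0;
}

/* A long kernel-style loop (like shrink_page_list) offering a voluntary
 * preemption point on each iteration. */
static int process_items(struct cpu_model *c, int items, int resched_at)
{
    for (int i = 0; i < items; i++) {
        if (i == resched_at)
            c->need_resched = 1;  /* pretend a tick set the flag here */
        model_cond_resched(c);
        /* ... process one item ... */
    }
    return c->reschedules;
}
```

The cost of the check is a couple of loads per iteration, while the benefit is that a pending high-priority wakeup waits at most one item's worth of work.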

In fact, for the non-preemptive kernel, many time-consuming kernel paths, especially those related to file-system operations and memory management, have been annotated by kernel developers with cond_resched to reduce latency (interested readers can count the call sites with grep and wc -l).

6. Voluntary kernel preemption

The kernel also offers a voluntary preemption model (CONFIG_PREEMPT_VOLUNTARY=y), which lets kernel developers voluntarily check for preemption during time-consuming operations, similar to the previous section.

config PREEMPT_VOLUNTARY
	bool "Voluntary Kernel Preemption (Desktop)"
	depends on !ARCH_NO_PREEMPT
	help
	  This option reduces the latency of the kernel by adding more
	  "explicit preemption points" to the kernel code. These new
	  preemption points have been selected to reduce the maximum
	  latency of rescheduling, providing faster application reactions,
	  at the cost of slightly lower throughput.

	  This allows reaction to interactive events by allowing a
	  low priority process to voluntarily preempt itself even if it
	  is in kernel mode executing a system call. This allows
	  applications to run more 'smoothly' even when the system is
	  under load.

	  Select this if you are building a kernel for a desktop system.

might_resched is used for this:

#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

As you can see, might_resched does something only when CONFIG_PREEMPT_VOLUNTARY=y; otherwise it expands to nothing.

Surprisingly, searching the kernel for uses of might_resched turns up hardly any direct callers, presumably because most time-consuming kernel paths already use cond_resched to check for scheduling opportunities.

7. Summary

This article has covered kernel preemption from several angles. The non-preemptive kernel is mainly used where throughput matters most, such as servers, while preemptive kernels suit scenarios demanding responsiveness, such as embedded devices and desktops. The scheduling timing of kernel preemption was analyzed from two angles, check points and preemption points: at a check point (e.g. on a clock tick or a task wakeup) the kernel decides whether the current task needs rescheduling and, if so, sets the rescheduling flag (need_resched) without scheduling immediately; scheduling then happens at the nearest preemption point. A preemption point is where the main scheduler is actually called, typically when an interrupt returns to kernel mode or when kernel preemption is re-enabled. Finally, we analyzed how the non-preemptive kernel reduces latency and how the voluntary-preemption kernel implements voluntary preemption.

Thank you for reading. That concludes "what is Linux kernel preemption"; hopefully you now have a deeper understanding of the topic. The specifics still need to be verified in practice.
