This article explains how the Linux kernel scheduler works, from early historical schedulers through the O(1) scheduler, CFS, and the real-time scheduling classes.
Why do you need scheduling?
Linux is a multitasking operating system, which means it can perform multiple tasks "simultaneously". On a single-core processor only one process can execute at any instant (tasks run concurrently), while on a multi-core processor tasks can truly execute in parallel. Regardless of the hardware, there may be many processes in memory that cannot all run at once, waiting to run or sleeping. The kernel component responsible for allocating CPU time to processes is the process scheduler.
The scheduler is responsible for maintaining the scheduling order of processes and selecting the next task to execute. Like most other modern operating systems, Linux implements preemptive multitasking: the scheduler can decide at any moment to stop a running process and give another process access to the CPU. Stopping a running process against its will is called "preemption". Preemption usually happens on a timer interrupt; when the interrupt fires, the scheduler checks whether a task switch is needed and, if so, performs a process context switch. The amount of time each process is allowed to run before being preempted is called the process's timeslice.
Tasks can usually be divided into interactive (I/O-bound) and non-interactive (CPU-bound) tasks. Interactive tasks rely heavily on I/O operations (GUI applications, for example) and usually do not use up the time slices assigned to them. Non-interactive tasks, such as numeric computation, need more CPU: they are typically preempted after exhausting their time slice and are rarely blocked on I/O requests. Of course, real-world applications may mix both kinds of behavior: a text editor mostly waits for user input, but consumes plenty of CPU while running a spell check.
The operating system's scheduling policy needs to balance these two kinds of tasks, ensuring each gets sufficient execution resources without significantly hurting the others. To maximize CPU utilization while keeping response times short, Linux tends to give non-interactive tasks larger time slices but run them less frequently, while I/O-bound tasks run more often but for shorter periods.
Scheduling-related process descriptors
Many of the fields in the process descriptor (task_struct) are used directly by the scheduling mechanism. Only some of the core parts are listed below and discussed in detail later.
struct task_struct {
	int prio, static_prio, normal_prio;
	unsigned int rt_priority;
	const struct sched_class *sched_class;
	struct sched_entity se;
	struct sched_rt_entity rt;
	...
	unsigned int policy;
	cpumask_t cpus_allowed;
	...
};
Descriptions of these fields are as follows:
prio, static_prio, and normal_prio express the priority of the process; its run time and preemption frequency depend on these values. rt_priority is used for real-time tasks.
sched_class indicates which scheduling class the process belongs to.
The meaning of sched_entity is special. A thread (in Linux, a synonym for a process or task) is usually called the minimum scheduling unit, but the Linux scheduler can schedule not only single tasks but also groups of processes, even all processes belonging to one user, as a whole. This enables group scheduling: CPU time is first allocated to a process group and then to individual threads within it. When this feature was introduced it greatly improved desktop interactivity: compilation jobs, for example, can be placed in their own group and scheduled without a noticeable impact on interactive work. So, to repeat: the Linux scheduler does not schedule processes directly; it schedules schedulable entities, represented by struct sched_entity. Note that it is not a pointer; the structure is embedded directly in the process descriptor. The rest of this discussion sticks to the simple case of single-process scheduling: since the scheduler is designed around scheduling entities, it treats an individual process as an entity and manipulates it through its sched_entity structure. sched_rt_entity plays the same role for real-time scheduling.
policy holds the task's scheduling policy, which usually means applying special scheduling decisions to particular classes of processes (longer time slices, higher priorities, and so on). The Linux kernel supports the following policies:
SCHED_NORMAL: the scheduling strategy used by ordinary tasks
SCHED_BATCH: unlike normal tasks, these are preempted less frequently, letting them run longer and make better use of caches, at the natural cost of interactivity. Well suited to batches of CPU-intensive jobs.
SCHED_IDLE: its priority is even lower than nice +19, but it is not truly the idle task.
SCHED_FIFO and SCHED_RR are soft real-time scheduling policies. They are defined by the POSIX standard and are handled by the real-time scheduler. SCHED_RR implements round-robin scheduling with fixed time slices, while SCHED_FIFO uses a first-in, first-out queue.
cpus_allowed expresses the CPU affinity of the task; it can be set from user space via the sched_setaffinity() system call.
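For illustration, here is a minimal user-space sketch (not from the original article) that pins the calling process to CPU 0 via sched_setaffinity(), which is what ultimately updates cpus_allowed in the kernel:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* allow CPU 0 only */

	/* pid 0 means "the calling process" */
	if (sched_setaffinity(0, sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}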
Priority
Process priority:
Normal task priority:
All Unix-like operating systems implement priority scheduling. The core idea is to assign each task a value that expresses its importance; tasks with the same priority take turns running. In Linux, every normal task is assigned a nice value, ranging from -20 to +19, with a default of 0.
The higher the nice value, the lower the task's priority (it is being "nice" to other tasks). The nice(int increment) system call, implemented in kernel/sys.c, modifies the priority of the current process. By default, a user may only increase the nice value (i.e. lower the priority) of processes they started; raising the priority (lowering the nice value), or changing another user's processes, requires root.
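A minimal sketch of the call from user space. Error handling is slightly unusual: nice() can legitimately return -1, so errno must be cleared before the call:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	errno = 0;
	int nv = nice(5);	/* lower our own priority by 5 */
	if (nv == -1 && errno != 0) {
		perror("nice");
		return 1;
	}
	printf("new nice value: %d\n", nv);
	return 0;
}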
Real-time task priority:
In Linux, besides ordinary tasks, there is another category: real-time tasks. These are tasks that must be guaranteed to execute within some time bound. There are two kinds:
Hard real-time tasks: have strict time limits; the task must complete within the deadline. For example, a helicopter's flight control system must respond to the pilot's input in time and perform the expected maneuver. Linux itself does not support hard real-time tasks, but modified versions based on it, such as RTLinux (often called an RTOS), do support hard real-time scheduling.
Soft real-time tasks: these also have time limits, just not strict ones. In other words, if the task runs a little late, it will not cause an irreparable catastrophe. In practice, soft real-time tasks come with some timing guarantee, but one should not rely on it too heavily. For example, VoIP software sends audio and video with soft real-time guarantees, but even if a little delay creeps in because the operating system is under heavy load, the impact is minor. In any case, soft real-time tasks always have higher priority than ordinary tasks.
Real-time task priorities in Linux range from 0 to 99, and interestingly the convention is the opposite of nice values: the higher the value, the higher the priority.
Like other Unix systems, Linux implements real-time priorities according to the "Real-time Extensions" defined in POSIX.1b. You can list the real-time tasks on a system with the following command:
$ ps -eo pid,rtprio,cmd
You can also inspect a single process with chrt -p <pid>, and change a real-time priority with chrt -p <prio> <pid>. Note that if you operate on a system process (an ordinary user's process is not normally real-time), you must have root permission to modify the real-time priority.
Process priority from the kernel perspective:
In reality, the priority of a task as seen by the kernel differs from what the user sees, and several factors enter into how priorities are computed and managed. The Linux kernel uses the range 0 to 139 to express task priority, and the lower the value, the higher the priority (note the inversion relative to user space). Of these, 0-99 is reserved for real-time processes, and 100-139 (mapping to nice values -20 to +19) is for ordinary processes.
We can see the units (scale) and macros the kernel uses to represent process priority, and to map user-space priorities into kernel space, in the kernel headers (include/linux/sched/prio.h in recent kernels):
#define MAX_NICE	19
#define MIN_NICE	-20
#define NICE_WIDTH	(MAX_NICE - MIN_NICE + 1)
...
#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO
#define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)	((prio) - DEFAULT_PRIO)

/*
 * 'User priority' is the nice value converted to something we
 * can work with better when scaling various scheduler parameters,
 * it's a [ 0 ... 39 ] range.
 */
#define USER_PRIO(p)		((p) - MAX_RT_PRIO)
#define TASK_USER_PRIO(p)	USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO		(USER_PRIO(MAX_PRIO))
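A quick sanity check of these mappings, as a hypothetical standalone program with the macro values copied from the kernel snippet above:

#include <stdio.h>

#define MAX_RT_PRIO	100
#define DEFAULT_PRIO	(MAX_RT_PRIO + 40 / 2)	/* 120 */
#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)	((prio) - DEFAULT_PRIO)

int main(void)
{
	printf("%d\n", NICE_TO_PRIO(-20));	/* 100: highest normal priority */
	printf("%d\n", NICE_TO_PRIO(0));	/* 120: the default */
	printf("%d\n", NICE_TO_PRIO(19));	/* 139: lowest priority */
	printf("%d\n", PRIO_TO_NICE(120));	/* 0 */
	return 0;
}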
Priority calculation:
There are several fields in task_struct to indicate the priority of the process:
int prio, static_prio, normal_prio;
unsigned int rt_priority;
static_prio is the "static" priority set by the user or the system, mapped into the kernel's representation:
p->static_prio = NICE_TO_PRIO(nice_value);
normal_prio holds a priority derived from static_prio and the process's scheduling policy (real-time or ordinary): tasks with the same static priority end up with different normal priorities under different policies. On fork, a child process inherits its parent's normal_prio.
prio is the "dynamic" priority, which changes in certain situations. For instance, the kernel may temporarily boost one task's priority so that it can preempt other, higher-priority tasks. Once static_prio is set, prio is computed as follows:
p->prio = effective_prio(p);

// kernel/sched/core.c defines the calculation:
static int effective_prio(struct task_struct *p)
{
	p->normal_prio = normal_prio(p);
	/*
	 * If we are RT tasks or we were boosted to RT priority,
	 * keep the priority unchanged. Otherwise, update priority
	 * to the normal priority:
	 */
	if (!rt_prio(p->prio))
		return p->normal_prio;
	return p->prio;
}

static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_dl_policy(p))
		prio = MAX_DL_PRIO - 1;
	else if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO - 1 - p->rt_priority;
	else
		prio = __normal_prio(p);
	return prio;
}

static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}
Load Weights:
Priorities make some tasks more important than others, so they receive more CPU time. The proportional relationship between nice values and time slices is maintained through load weights. A process's weight lives in task_struct->se.load, defined as follows:
struct sched_entity {
	struct load_weight load;	/* for load-balancing */
	...
};

struct load_weight {
	unsigned long weight;
	u32 inv_weight;
};
So that changes in nice value translate sensibly into changes in CPU time, the Linux kernel defines an array mapping nice values to weights:
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
Let's see how the mapping table is used. Suppose there are two tasks at priority 0; each gets 50% of the CPU (1024 / (1024 + 1024) = 0.5). If we now raise the priority of one task by 1 (nice value -1), that task should get roughly 10% more CPU time, while the other gets roughly 10% less. Checking the numbers: 1277 / (1024 + 1277) ≈ 0.55 and 1024 / (1024 + 1277) ≈ 0.45; the gap is indeed about 10%, as expected. The complete calculation lives in set_load_weight(), defined in kernel/sched/core.c:
static void set_load_weight(struct task_struct *p)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight *load = &p->se.load;

	/*
	 * SCHED_IDLE tasks get minimal weight:
	 */
	if (p->policy == SCHED_IDLE) {
		load->weight = scale_load(WEIGHT_IDLEPRIO);
		load->inv_weight = WMULT_IDLEPRIO;
		return;
	}

	load->weight = scale_load(prio_to_weight[prio]);
	load->inv_weight = prio_to_wmult[prio];
}
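To double-check the worked example above in code, here is a throwaway sketch using the two weights from prio_to_weight[]:

#include <stdio.h>

int main(void)
{
	const double w_nice_0 = 1024;	/* nice  0, from prio_to_weight[] */
	const double w_nice_m1 = 1277;	/* nice -1, from prio_to_weight[] */
	double total = w_nice_0 + w_nice_m1;

	printf("nice -1 task: %.1f%% of CPU\n", 100.0 * w_nice_m1 / total);	/* ~55.5 */
	printf("nice  0 task: %.1f%% of CPU\n", 100.0 * w_nice_0 / total);	/* ~44.5 */
	return 0;
}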
Scheduling Classes
Although the Linux kernel is written in C, which is not an OOP language (there is no notion of class as in C++/Java), the kernel code often simulates object orientation using structures of function pointers (hooks) to abstract behavior and data. The scheduling class is implemented this way (as are inode_operations, super_block_operations, and others). It is defined as follows (in kernel/sched/sched.h):
// For simplicity, parts of the code (e.g. SMP-related) are hidden.
struct sched_class {
	// Multiple sched_class instances are linked together.
	const struct sched_class *next;

	// Called when a task enters the runnable state. Puts the scheduling
	// entity (e.g. a task) into the run queue and increments nr_running
	// (the number of runnable tasks in the queue).
	void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);

	// Called when a task is no longer runnable. Removes the entity from
	// the queue and decrements nr_running.
	void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

	// Called when a task voluntarily gives up the CPU. Note that this
	// does not change the task's runnable state: it still waits in the
	// queue for the next schedule. Similar to dequeue_task followed by
	// enqueue_task.
	void (*yield_task)(struct rq *rq);

	// Called when a task becomes runnable, to check whether it should
	// preempt the currently running task.
	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

	// Selects the next task that is most suitable to run.
	struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);

	// Called when a task changes its scheduling class or task group.
	void (*set_curr_task)(struct rq *rq);

	// Usually called on a clock tick; may lead to a task switch.
	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);

	// Notifies the scheduler that a task has forked.
	void (*task_fork)(struct task_struct *p);

	// Notifies the scheduler that a task has died.
	void (*task_dead)(struct task_struct *p);
};
The concrete scheduling policies are implemented in the following modules under kernel/sched/:
core.c contains the core of the scheduler.
fair.c implements CFS (the Completely Fair Scheduler), used for ordinary tasks.
rt.c implements real-time scheduling, used for real-time tasks.
idle_task.c runs the idle task when there is nothing else to run.
The kernel decides which scheduling class implementation to use based on the task's scheduling policy (SCHED_*) and calls the corresponding methods. Processes with SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE map to fair_sched_class (implemented by CFS), while SCHED_RR and SCHED_FIFO map to rt_sched_class (the real-time scheduler).
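As an illustration, here is a minimal user-space sketch that switches the calling process to the real-time class with sched_setscheduler() (the priority value 10 is arbitrary; root or CAP_SYS_NICE is required):

#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };

	/* pid 0 means "the calling process"; this maps it to rt_sched_class */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("policy is now %d\n", sched_getscheduler(0));
	return 0;
}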
Run Queue (runqueue)
All runnable tasks are kept on a run queue, waiting for a CPU. Each CPU core has its own run queue, and a task can be on only one queue at any time. On multiprocessor machines, a load-balancing strategy may migrate tasks to other CPUs.
The run queue data structure is defined as follows (in kernel/sched/sched.h):
// For simplicity, parts of the code (SMP-related) are hidden.
// Each CPU has one of these run queues.
struct rq {
	// Number of runnable tasks currently in the queue (all sched classes).
	unsigned int nr_running;

#define CPU_LOAD_IDX_MAX 5
	unsigned long cpu_load[CPU_LOAD_IDX_MAX];

	// Load record of the run queue.
	struct load_weight load;

	// Nested CFS scheduler run queue.
	struct cfs_rq cfs;
	// Nested real-time scheduler run queue.
	struct rt_rq rt;

	// curr points to the descriptor of the currently running task;
	// idle points to the idle task (it runs only when nothing else is runnable).
	struct task_struct *curr, *idle;

	u64 clock;
	int cpu;
};
When will the scheduler be run?
In fact, the scheduling function schedule() is called in many scenarios, some directly and some implicitly (by setting TIF_NEED_RESCHED to prompt the kernel to run the scheduler as soon as possible). Among the scheduling points worth attention: when a clock interrupt occurs, scheduler_tick() is called; it updates scheduling-related statistics and triggers the scheduling class's periodic scheduling method, thereby scheduling indirectly. Taking the 2.6.39 source code as an example, the possible call chain is as follows:
scheduler_tick()
 └── task_tick()
      └── entity_tick()
           └── check_preempt_tick()
                └── resched_task()
                     └── set_tsk_need_resched()
Another scheduling point is when the currently running task goes to sleep, actively releasing the CPU. Typically the task is waiting for a specified event: it can add itself to a wait queue and loop, checking whether the desired condition holds. Before sleeping, the task sets its state to TASK_INTERRUPTIBLE (besides the awaited event, it can also be woken by a signal) or TASK_UNINTERRUPTIBLE (signals are ignored), then calls schedule() to pick the next task to run.
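A minimal kernel-style sketch of that sleep loop, using the standard wait-queue helpers; condition_is_true() is a hypothetical stand-in for the event being awaited:

DECLARE_WAIT_QUEUE_HEAD(my_wq);

static void wait_for_event(void)
{
	DEFINE_WAIT(wait);

	for (;;) {
		/* mark ourselves TASK_INTERRUPTIBLE and join the wait queue */
		prepare_to_wait(&my_wq, &wait, TASK_INTERRUPTIBLE);
		if (condition_is_true())
			break;
		schedule();	/* give up the CPU; the scheduler picks the next task */
	}
	finish_wait(&my_wq, &wait);
}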
Linux scheduler
Previous versions:
Linux 0.01 already had a simple scheduler, though certainly not one suited to systems with many processes. It maintained only a single global task array, which had to be traversed each time to find the next process to execute, and it placed a hard limit on the number of tasks (NR_TASKS was only 32 in the earliest versions). Let's look at how that scheduler was implemented:
/*
 * 'schedule()' is the scheduler function. This is GOOD CODE! There
 * probably won't be any reason to change this, as it should work well
 * in all circumstances (ie gives IO-bound processes good response etc)...
 */
void schedule(void)
{
	int i, next, c;
	struct task_struct **p;

	/* Traverse all tasks; if a task has a pending signal and is in the
	 * TASK_INTERRUPTIBLE state, wake it up. */
	for (p = &LAST_TASK; p > &FIRST_TASK; --p)
		if (*p) {
			if ((*p)->alarm && (*p)->alarm < jiffies) {
				(*p)->signal |= (1 << (SIGALRM - 1));
				(*p)->alarm = 0;
			}
			if ((*p)->signal && (*p)->state == TASK_INTERRUPTIBLE)
				(*p)->state = TASK_RUNNING;
		}

	while (1) {
		c = -1;
		next = 0;
		i = NR_TASKS;
		p = &task[NR_TASKS];
		/* Iterate over all tasks and find the runnable one with the
		 * largest remaining counter (time slice). */
		while (--i) {
			if (!*--p)
				continue;
			if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
				c = (*p)->counter, next = i;
		}
		if (c)
			break;
		/* No time slices left: traverse the tasks and reset them. */
		for (p = &LAST_TASK; p > &FIRST_TASK; --p)
			if (*p)
				(*p)->counter = ((*p)->counter >> 1) + (*p)->priority;
	}
	/* Switch to the next task to be executed. */
	switch_to(next);
}
O(n):
The scheduling algorithm used in the 2.4 kernel was simple and straightforward: it is called the O(n) scheduler (after its time complexity) because every search for the next task traverses all tasks in the system (a linked list).
Of course, this scheduler is a little more complex than the 0.01 algorithm, since it introduces the concept of an epoch: time is divided into epochs, one scheduling round each. In theory, every process should have run once by the end of an epoch, usually having used up its current time slice. In practice some tasks do not exhaust their slice, in which case half of the remaining time slice is added to the new one, letting them run longer in the next epoch.
Let's look at the core source code of the schedule() algorithm:
// The schedule() algorithm iterates over all tasks (O(n)), computes each
// task's goodness value, and picks the "best" one to run.
// Only part of the core source is shown, mainly to convey the idea.
asmlinkage void schedule(void)
{
	// Task (process) descriptors:
	//  1. prev: the currently running task
	//  2. next: the next task to run
	//  3. p: the task currently being examined
	struct task_struct *prev, *next, *p;
	int this_cpu, c;	// c holds the weight (goodness) value

repeat_schedule:
	// Default choice: the idle task.
	next = idle_task(this_cpu);
	c = -1000;

	list_for_each(tmp, &runqueue_head) {
		p = list_entry(tmp, struct task_struct, run_list);
		if (can_schedule(p, this_cpu)) {
			int weight = goodness(p, this_cpu, prev->active_mm);
			if (weight > c)
				c = weight, next = p;
		}
	}
}
The goodness() function computes a weight value; the basic idea is to start from the number of clock ticks the process has left (its time slice) and add a weight based on the process's priority. The return values mean:
-1000: do not select this process to run
0: the time slice is used up and the counter needs recalculating (the process may still be selected later)
a positive integer: the goodness value (the larger the better)
+1000: a real-time process, which must be selected to run next
Finally, a summary of the O(n) scheduler:
Its implementation is very simple, but it is not efficient (the more tasks, the longer each traversal takes).
It does not scale well: what about multi-core processors?
Its support for real-time scheduling is weak (even a high-priority real-time task can only be found after traversing the whole list).
O(1):
Ingo Molnár added a new scheduling algorithm in the 2.6 kernel that can pick tasks in constant time, hence the name O(1) scheduler. Let's look at the new features it introduced:
A global priority range of 0 to 139; the lower the value, the higher the priority.
The range is split into real-time (0-99) and normal (100-139) priorities. Higher-priority tasks get larger time slices.
Immediate preemption (early preemption). When the task state changes to TASK_RUNNING, the kernel checks whether it has a higher priority than the currently running task, and if so, preempts the currently running task and switches to it
Real-time tasks use static priority
Normal tasks use dynamic priorities. The task priority is recalculated after it has used its own time slice, and the kernel takes into account its past behavior and determines its level of interactivity. Interactive tasks are easier to schedule
The O(n) scheduler recalculates task priorities only at the end of each epoch (when all tasks have exhausted their time slices). The O(1) scheduler instead recalculates a task's priority as soon as its time slice quota runs out. To do this it maintains two queues per CPU, active and expired: the active queue holds tasks that still have time slice left, while the expired queue holds tasks that have used theirs up. When a task exhausts its slice, it is moved to the expired queue and its priority is recalculated. When every task in the active queue has moved to the expired queue, the two are swapped (active is pointed at the expired queue and vice versa). Priority recalculation and the queue switch are thus independent of the number of tasks and complete in O(1) time.
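A sketch of that swap, shaped after the 2.6-era schedule() (rq here is the per-CPU run queue shown below):

	if (unlikely(!rq->active->nr_active)) {
		/* All active tasks have expired: swap the two arrays. */
		struct prio_array *array = rq->active;
		rq->active = rq->expired;
		rq->expired = array;
	}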
In the scheduling algorithms described earlier, taking the highest-priority task required traversing the entire task list. The O(1) scheduler is different: it keeps a linked list of tasks for each priority, and all runnable tasks are distributed across the lists of the different priority queues.
Let's take a look at how the new runqueue is defined:
struct runqueue {
	unsigned long nr_running;	/* total number of runnable tasks (on this CPU) */
	struct prio_array *active;	/* pointer to the active queue */
	struct prio_array *expired;	/* pointer to the expired queue */
	struct prio_array arrays[2];	/* the actual per-priority task lists */
};
Let's take a look at how prio_array is defined:
struct prio_array {
	int nr_active;				/* total number of tasks in the lists */
	unsigned long bitmap[BITMAP_SIZE];	/* marks whether each priority list has tasks */
	struct list_head queue[MAX_PRIO];	/* task queues (one doubly linked list per priority) */
};
As you can see, prio_array contains a bitmap marking whether the task list for each priority is non-empty. Here is why the O(1) scheduler can find the next task to run in constant time:
Determine the priority in constant time: find the first set bit in the bitmap (there are 140 bits in total, and the search starts from the lowest bit, which guarantees that higher-priority tasks get to run first). The index found, call it priority, identifies a non-empty priority list.
Get the next task in constant time: take the first task from the list at queue[priority] and execute it (tasks of equal priority rotate through the list).
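Put together, the constant-time pick looks roughly like this, shaped after the 2.6 schedule(), where sched_find_first_bit() scans the 140-bit bitmap:

	struct prio_array *array = rq->active;
	int idx = sched_find_first_bit(array->bitmap);	/* O(1) bitmap scan */
	struct list_head *queue = array->queue + idx;
	struct task_struct *next =
		list_entry(queue->next, struct task_struct, run_list);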
Well, it's time to sum up the pros and cons of the O(1) scheduler:
The design is more complex and subtle than the O(n) scheduler.
Relatively speaking, it scales better, performs better, and has lower task-switching overhead.
The heuristics used to decide whether a task is interactive remained too complex and error-prone.
CFS: Single-core Scheduling
CFS stands for Completely Fair Scheduler. It implements a weight-based fair queueing algorithm that divides CPU time among runnable tasks (each task's weight derives from its nice value: the lower the nice value, the higher the weight). Each task has an associated virtual runtime, vruntime, which represents the CPU time the task has used, scaled by its priority. Two tasks with the same priority and the same vruntime have actually run for the same amount of time, meaning they have shared the CPU equally. To keep all tasks advancing fairly, whenever CFS needs to preempt the current task, it always picks the task with the smallest vruntime to run.
Before kernel version 2.6.38, every thread (task) was treated as an independent scheduling entity sharing resources with all other threads in the system, which meant a multi-threaded application received more resources than a single-threaded one. Since then, CFS has been continuously improved and now supports packaging an application's threads into a cgroup, whose vruntime is the sum of the vruntimes of all its threads. CFS then applies its algorithm across cgroups, guaranteeing fairness between them; once a cgroup is picked, its thread with the smallest vruntime runs, preserving fairness within the cgroup as well. Cgroups can be nested: systemd, for example, automatically configures cgroups to ensure fairness between different users, and then fairness among the applications each user runs.
CFS avoids starvation by scheduling every runnable thread within a fixed period. With 8 or fewer runnable threads, the default period is 48 ms; with more, the period grows with the thread count (6 ms times the number of threads; 6 ms is chosen to avoid preempting so often that context-switch overhead dominates). Because CFS always picks the thread with the smallest vruntime, it must prevent any thread's vruntime from falling so far below the others' that they would wait too long to be scheduled (a starvation problem). In practice, CFS keeps the vruntime spread between threads below the preemption granularity (6 ms), which the following two rules guarantee (a sketch of the period rule follows the list):
When a thread is created, its vruntime is set to the maximum vruntime of the threads waiting in the run queue.
When a thread wakes from sleep, its vruntime is updated to be at least the minimum vruntime of all scheduled threads. Basing it on the minimum vruntime also ensures that frequently sleeping threads are scheduled first, which suits desktop systems well and reduces response latency for interactive applications.
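As a rough sketch of the period rule described above (values as stated in the text; the kernel actually derives them from tunables such as sched_latency and the minimum granularity):

static unsigned int sched_period_ms(unsigned int nr_threads)
{
	/* 48 ms default period for up to 8 threads; beyond that, grow
	 * linearly so that each thread keeps a ~6 ms slice. */
	return (nr_threads <= 8) ? 48 : 6 * nr_threads;
}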
CFS also applies heuristic scheduling ideas to improve cache utilization. For example, when a thread is woken, CFS checks whether the difference between its vruntime and the running thread's vruntime is significant (the threshold is 1 ms); if not, it does not preempt the currently running task. This comes at the expense of scheduling latency, a deliberate tradeoff.
Multicore load balancing:
In a multi-core environment, Linux CFS distributes work across the processor cores. This is not the same as spreading threads evenly over the processors: for example, one CPU-intensive thread and ten frequently sleeping threads might execute on two cores, one dedicated to the CPU-intensive thread and the other handling the ten sleepers.
To balance work across processors, CFS uses a load metric for threads and processors. A thread's load corresponds to its average CPU utilization: a frequently sleeping thread has lower load than one that never sleeps. Like vruntime, a thread's load is weighted by its priority. A processor's load is the sum of the loads of its runnable threads. CFS tries to even out this load across processors.
CFS considers processor load when threads are created and woken: the scheduler first decides which processor's run queue the task should join. Heuristics come in here too; for example, with a producer-consumer workload, CFS spreads the woken consumer threads across as many of the machine's processors as possible, since any core is a candidate for running a woken thread.
Load balancing also occurs periodically, and every 4ms, each processor tries to steal some work from other processors. Of course, this work-stealing equalization method also takes into account the topology of the machine: processors try to steal work from other processors that are "closer" to them, rather than processors that are "farther" away (such as remote NUMA nodes). When the processor decides to steal a task from another processor, it tries to balance the load between the two and steal up to 32 threads. In addition, when the processor enters an idle state, it invokes the load balancer immediately.
On large NUMA machines, CFS does not naively compare the load of every CPU; instead, load is balanced hierarchically. Take a machine with two NUMA nodes: CFS first balances load among the processors within each NUMA node, then compares the loads of the nodes themselves (computed from the loads of the processors inside each) to decide whether to balance between the two nodes. If the load gap between NUMA nodes is below 25%, no balancing occurs. In general, the greater the distance between two processors (or processor groups), the larger the imbalance must be before load balancing is considered.
Run queue:
CFS uses a red-black tree (a self-balancing binary search tree with O(log N) insertion and lookup) to maintain the run queue. Nodes are keyed by the scheduling entity's vruntime, so the node with the smallest vruntime is the leftmost node of the tree.
Next, take a look at the definition of the cfs_rq data structure (in kernel/sched/sched.h):
struct cfs_rq {
	// Cumulative load weight of all tasks in the queue.
	struct load_weight load;
	// Number of runnable tasks in the queue.
	unsigned int nr_running;
	// Smallest vruntime in the run queue.
	u64 min_vruntime;
	// Root of the red-black tree holding the runnable tasks.
	struct rb_root tasks_timeline;
	// Cached leftmost node, i.e. the next task to schedule.
	struct rb_node *rb_leftmost;
	// The currently running scheduling entity.
	struct sched_entity *curr;
};
The CFS algorithm actually operates on scheduling entities (a more general abstraction that can stand for a thread, a cgroup, and so on). The scheduling entity is defined as follows (in include/linux/sched.h):
struct sched_entity {
	// Load weight of the scheduling entity (if the entity is a group,
	// this is the combined weight of all threads in the group).
	struct load_weight load;	/* for load-balancing */
	// Node in the red-black tree.
	struct rb_node run_node;
	// Whether the entity is currently on a run queue.
	unsigned int on_rq;

	// Timestamp when the entity started executing.
	u64 exec_start;
	// Total run time, updated by update_curr().
	u64 sum_exec_runtime;
	// Run time of the entity measured on the virtual clock.
	u64 vruntime;
	// Snapshot of sum_exec_runtime from the previous run.
	u64 prev_sum_exec_runtime;
};
Virtual clock:
So what exactly is the vruntime mentioned earlier, and why is it called virtual runtime? To achieve better fairness, CFS uses a virtual clock to measure how long a waiting scheduling entity would have been allowed to execute on a perfectly fair processor. The virtual clock has no concrete implementation; it is purely an abstract concept.
We can compute virtual run time from real time and the task's load weight. The algorithm is implemented in update_curr(), which updates the scheduling entity's time accounting and the CFS run queue's min_vruntime (the full definition is in kernel/sched/fair.c):
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	// Difference between now and when the entity started running,
	// i.e. the real elapsed run time.
	delta_exec = now - curr->exec_start;

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);
}

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	// If the task has the default priority (internal nice value 120),
	// the virtual run time equals the real run time. Otherwise it is
	// scaled via __calc_delta().
	if (unlikely(se->load.weight != NICE_0_LOAD))
		// The calculation is essentially equivalent to:
		//   delta = delta_exec * NICE_0_LOAD / curr->load.weight;
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	return delta;
}

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	u64 vruntime = cfs_rq->min_vruntime;

	// If a task is running right now, start from its vruntime.
	if (cfs_rq->curr)
		vruntime = cfs_rq->curr->vruntime;

	if (cfs_rq->rb_leftmost) {
		// The next scheduling entity to run.
		struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
						   struct sched_entity,
						   run_node);

		if (!cfs_rq->curr)
			vruntime = se->vruntime;
		else
			// Take the smaller of the two vruntimes.
			vruntime = min_vruntime(vruntime, se->vruntime);
	}

	// Take the max here so that min_vruntime grows monotonically.
	// (Think about why that is necessary.)
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
}
Finally, let's summarize the point of the virtual clock: while a task runs, its virtual time always increases, so it keeps moving toward the right side of the red-black tree.
For high-priority tasks the virtual clock ticks more slowly, so they drift toward the right side of the tree more slowly and are therefore more likely to be scheduled again.
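A worked sketch of the scaling performed by calc_delta_fair(), using weights from the prio_to_weight[] table earlier (NICE_0_LOAD is 1024):

static inline u64 vruntime_delta(u64 delta_exec, unsigned long weight)
{
	return delta_exec * 1024 / weight;	/* delta * NICE_0_LOAD / weight */
}

/*
 * nice  0 (weight 1024): 10 ms real time -> 10 ms of virtual time
 * nice -5 (weight 3121): 10 ms real time -> ~3.3 ms of virtual time,
 * so the higher-priority task drifts right in the tree more slowly.
 */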
Select the next task:
CFS always takes the leftmost node of the red-black tree as the next task to run. However, the function that implements this, __pick_first_entity(), does not actually perform a tree lookup (even though one would only cost O(log N)). Its definition is as follows (the full definition is in kernel/sched/fair.c):
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	// Simply return the cached leftmost node, which is faster
	// than walking down the tree.
	struct rb_node *left = cfs_rq->rb_leftmost;

	if (!left)
		return NULL;

	return rb_entry(left, struct sched_entity, run_node);
}
Real-time scheduler:
The Linux real-time scheduler is implemented in kernel/sched/rt.c. SCHED_RR tasks get a fixed time slice, while SCHED_FIFO tasks get none, as this helper shows:

static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
{
	/*
	 * Time slice is 0 for SCHED_FIFO tasks
	 */
	if (task->policy == SCHED_RR)
		return sched_rr_timeslice;
	else
		return 0;
}
The real-time run queue is defined as follows:
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
	// All real-time tasks of the same priority sit in the corresponding
	// active.queue linked list.
	struct rt_prio_array active;
	unsigned int rt_nr_running;
	struct rq *rq;	/* main runqueue */
};

/*
 * This is the priority-queue data structure of the RT scheduling class:
 */
struct rt_prio_array {
	/* include 1 bit for delimiter */
	// As in the O(1) scheduler, a bitmap marks whether the list for a
	// given priority is empty.
	DECLARE_BITMAP(bitmap, MAX_RT_PRIO + 1);
	struct list_head queue[MAX_RT_PRIO];
};
Similar to update_curr() in CFS, the update_curr_rt() function tracks the CPU usage of real-time tasks, collects statistics, updates the time slice, and so on, but it works in real time; there is no notion of virtual time here. See kernel/sched/rt.c for the full definition.
BFS & MuqSS Scheduler:
Generally speaking, BFS is a scheduler for desktop and mobile devices. Its design is deliberately simple, aiming to improve the interactivity of desktop applications, reduce response times, and enhance the user experience. It uses a single global task queue rather than an independent run queue per CPU. Although a single global queue requires a lock for concurrency safety, desktop systems usually have few processors, so the lock overhead is negligible. Each time it schedules, BFS picks the task with the smallest virtual deadline from the task list.
MuqSS is the same author's improvement on BFS, likewise intended for desktop task scheduling. It mainly addresses two problems with BFS:
Finding the next task to run required traversing the per-priority lists, an O(n) operation. The new scheduler introduces skip lists to solve this, reducing the lookup cost to O(1).
To cut the cost of global lock contention, it uses try_lock in place of lock.