What are the design principles of a scheduling system?


Scheduling is a very broad concept, and the term is used in many fields. In computer science, scheduling is a method of assigning tasks (Work) to resources. The tasks may be virtual computing units such as threads, processes, or data flows, which are scheduled onto hardware resources such as the CPU.

Figure 1 - Scheduling system design essentials

This article introduces common scenarios for scheduling systems and some key issues in the design process. The design of a scheduler ultimately boils down to one question: how to allocate and schedule resources efficiently to achieve our goals, which may include making rational use of resources, minimizing costs, or quickly matching supply and demand.

Figure 2-context and content of the article

In addition to introducing the common problems encountered in the design of the scheduling system, this paper will also deeply analyze the design, evolution and implementation principles of several common schedulers, including the process scheduler of the operating system, the runtime scheduler of Go language and the workload scheduler of Kubernetes, to help us understand the core principles of scheduler design.

Design principle

A scheduling system is, in essence, a scheduler (Scheduler). We can find schedulers in many systems: as noted above, not only does the operating system contain a scheduler, but programming languages, container orchestration platforms and many business systems contain scheduling systems or scheduling modules as well.

The core function of these scheduling modules is to allocate limited resources so as to maximize resource utilization or reduce the system's tail latency; the scheduling system faces the fundamental problem of imbalance between the demand for and the supply of resources.

Figure 3-Scheduler tasks and resources

This section introduces the key issues to consider when designing a scheduling system from several angles, including requirements research, scheduling principles and architecture design.

1. Requirements research

Before starting to build a scheduling system, the first task is to conduct detailed requirements research and analysis, and the following two things need to be completed in this process:

- Investigate the application scenario of the scheduling system and study in depth the characteristics of the tasks to be executed (Work) and of the resources available to execute them (Resource);
- Analyze the purpose of the scheduling system, which may be cost first, quality first, maximizing resource utilization, etc.; the scheduling purpose is generally dynamic and shifts as requirements change.

Application scenario

The application scenario of the scheduling system is the first problem to consider, and analyzing it carefully is very important: we need a deep understanding of the characteristics of the tasks to be executed and of the resources available to execute them in the current scenario. For the tasks to be executed, we need to analyze characteristics such as the following:

- whether the task has a deadline by which it must be completed;
- whether the task supports preemption, and what the specific preemption rules are;
- whether the task has prerequisite dependencies that must be satisfied first;
- whether the task can only run on specified resources;
- ...
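
These characteristics can be written down as a task specification. The C struct below is only an illustration; the field names (deadline, preemptible, required_node, etc.) are hypothetical and not taken from any particular system.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Illustrative task specification; the field names are hypothetical. */
struct task_spec {
    long   id;
    time_t deadline;       /* 0 if the task has no deadline */
    bool   preemptible;    /* may the scheduler suspend it mid-run? */
    int    priority;       /* larger value = scheduled earlier */
    long   depends_on[8];  /* ids of tasks that must finish first */
    int    n_deps;
    int    required_node;  /* -1 if the task may run on any resource */
};

/* A task is placeable on a node if it has no placement restriction or the
 * node matches its restriction. */
static bool can_run_on(const struct task_spec *t, int node) {
    return t->required_node == -1 || t->required_node == node;
}

int main(void) {
    struct task_spec t = { .id = 1, .deadline = 0, .preemptible = true,
                           .priority = 10, .n_deps = 0, .required_node = 3 };
    printf("task %ld on node 3: %s\n", t.id, can_run_on(&t, 3) ? "ok" : "no");
    return 0;
}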

The resources used to execute tasks can also be problematic: resources may be unevenly distributed, and different resources may process tasks at different speeds.

The diversity of tasks and resources shapes the design of the scheduling system. Here are a few simple examples to help readers understand the requirements-analysis process.

Figure 4-Linux operating system

In the operating system's process scheduler, the tasks to be scheduled are threads, which are generally in one of only two states, running or not running (waiting or terminated); the CPUs used to process these tasks are indivisible, and the same CPU can execute only one task at a time, which is a physical constraint. To summarize briefly, the tasks and resources of the operating system scheduler have the following characteristics:

- Task: thread.
  - Simple state: a thread is either being executed or not being executed;
  - Different priorities: tasks to be executed may have different priorities, and fairness between tasks must still be guaranteed when priority is taken into account.
- Resource: CPU time.
  - The resource cannot be subdivided: only one task can run on a CPU at a time.

In this scenario, the tasks to be executed are the basic units of operating system scheduling, threads, and the allocatable resource is CPU time. The Go language scheduler faces almost the same scenario as the operating system scheduler: its tasks are Goroutines and its allocatable resources are the threads running on the CPU.

Figure 5 - Container orchestration system Kubernetes

In addition to lower-level schedulers such as those of operating systems and programming languages, scheduling of containers and computing tasks is also common today. Kubernetes, as a container orchestration system, is responsible for scheduling containers in the cluster. Anyone with a little knowledge of it knows that the basic scheduling unit in Kubernetes is the Pod, and Pods are scheduled onto nodes (Node) for execution:

- Task: Pod.
  - Different priorities: Pods may have different priorities, and a high-priority system Pod can preempt the resources of a low-priority Pod;
  - Stateful: Pods can be stateless or stateful, and a stateful Pod depends on persistent storage volumes.

- Resource: Node.
  - Different types: different nodes offer different types of resources, including CPU, GPU and memory; these resources can be subdivided but belong to their node;
  - Instability: a node may suddenly become unavailable, for example because of a lost network connection or a damaged disk.

Scheduling systems are very common in daily life and work. In addition to the two scenarios above, other scenarios that require a scheduling system include CDN resource scheduling, order scheduling and offline task scheduling. In every scenario we need to think carefully about the characteristics of the tasks and resources, because they guide the system's designers.

Scheduling purpose

After analyzing the scheduling scenario in depth, we need to understand the purpose of scheduling. The scheduling purpose can be thought of as the cost function (Cost function) in machine learning: determining the scheduling purpose means defining the cost function. Common scheduling objectives are introduced in the book Scheduling Theory and include the following:

- Makespan: the time span from the first to the last task completing the schedule;
- Maximum lateness (Maximum Lateness): the largest amount by which any task exceeds its deadline;
- Total weighted completion time (Total weighted completion time): the sum of each task's weight multiplied by its completion time.
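
For a schedule that has already been produced, these objectives are straightforward to compute. A minimal sketch with made-up job data (the completion times, deadlines and weights are hypothetical):

#include <float.h>
#include <stdio.h>

struct job { double completion, deadline, weight; };

int main(void) {
    /* Hypothetical jobs that have already been scheduled. */
    struct job jobs[] = { {4.0, 5.0, 1.0}, {9.0, 6.0, 2.0}, {12.0, 15.0, 1.0} };
    int n = sizeof(jobs) / sizeof(jobs[0]);

    double makespan = 0, max_lateness = -DBL_MAX, weighted_completion = 0;
    for (int i = 0; i < n; i++) {
        if (jobs[i].completion > makespan)
            makespan = jobs[i].completion;                    /* C_max */
        double lateness = jobs[i].completion - jobs[i].deadline;
        if (lateness > max_lateness)
            max_lateness = lateness;                          /* L_max */
        weighted_completion += jobs[i].weight * jobs[i].completion;  /* sum of w_j * C_j */
    }
    printf("makespan=%.1f max lateness=%.1f total weighted completion=%.1f\n",
           makespan, max_lateness, weighted_completion);
    return 0;
}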

These are relatively theoretical scheduling objectives. The goal of most business scheduling systems is to optimize the indicators closest to the business: cost and quality. Striking a balance between cost and quality requires careful consideration and design. Because of limited space and the complexity of business scenarios, this article does not analyze how to balance cost and quality; such trade-offs usually have to be considered together with the business and have no universally applicable answer.

2. Scheduling principles

A scheduler with excellent performance is a prerequisite for achieving a specific scheduling purpose. We tend to ignore the extra overhead of scheduling when discussing scenarios and objectives, but the scheduler's latency and throughput cannot be ignored once the scheduling load becomes heavy. This section examines some important concepts related to scheduler implementation that can help us build a high-performance scheduler:

- cooperative scheduling and preemptive scheduling;
- single scheduler and multiple schedulers;
- work sharing and work stealing.

Cooperative and preemptive scheduling

Cooperative (Cooperative) and preemptive (Preemptive) scheduling are common multitasking strategies in operating systems. The definitions of the two scheduling methods are completely different:

- Cooperative scheduling allows a task to execute for an arbitrarily long time, until the task actively notifies the scheduler that it is giving up its resources;
- Preemptive scheduling allows a task to be suspended by the scheduler during execution, after which the scheduler decides which task to run next.

Figure 6 - Cooperative scheduling and preemptive scheduling

Task execution times and the extra overhead of context switching determine which scheduling method yields better performance. Figure 7 below shows a cooperative scheduler scheduling tasks: once the scheduler allocates resources to a task, it waits for the task to release them voluntarily. Although the four tasks in the figure have different execution times, each releases its resources only when it has finished, so the whole process requires only four context switches.

Figure 7 - Cooperative scheduling

Figure 8 shows preemptive scheduling. Because the scheduler does not know the execution time of each task in advance, it allocates a time slice to every task. Task 1 and task 4 finish the first time they are scheduled because their execution times are short, but task 2 and task 3 run past the limit assigned by the scheduler, so preemption is triggered to ensure fairness and other tasks in the waiting queue get resources. The whole scheduling process involves a total of six context switches.

Figure 8-preemptive scheduling
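
The context-switch counts described for Figures 7 and 8 can be reproduced with a toy calculation. The task lengths and time slice below are hypothetical, chosen only to match the four versus six switches discussed above:

#include <stdio.h>

/* Toy comparison of context-switch counts under cooperative and preemptive
 * scheduling.  Each time a task is (re)started on the CPU counts as one
 * context switch, matching the way the counts are described above. */
int main(void) {
    int run_time[] = {2, 7, 6, 1};          /* hypothetical task lengths */
    int n = 4, slice = 4;                   /* hypothetical time slice */

    /* Cooperative: each task runs to completion, one switch per task. */
    int coop_switches = n;

    /* Preemptive: a task is interrupted every `slice` units until it is done. */
    int preempt_switches = 0;
    for (int i = 0; i < n; i++) {
        int runs = (run_time[i] + slice - 1) / slice;   /* ceiling division */
        preempt_switches += runs;
    }
    printf("cooperative: %d switches, preemptive: %d switches\n",
           coop_switches, preempt_switches);
    return 0;
}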

If some tasks have very long execution times, cooperative scheduling lets those long-running tasks starve the others; but if the tasks to be executed are short and take roughly the same time, cooperative scheduling avoids the extra overhead of interrupting tasks and therefore gives better scheduling performance.

Because task execution times are uncertain in most cases, and because in cooperative scheduling a task that fails to give up its resources voluntarily blocks all other waiting tasks, scheduling systems generally center on preemptive scheduling while also supporting cooperative scheduling.

Single scheduler and multiple schedulers

Whether to use a single scheduler or multiple schedulers also needs careful consideration when designing a scheduling system. Multiple schedulers do not necessarily mean multiple processes; they may also be multiple scheduling threads within one process. They may schedule in parallel on multiple cores, concurrently on a single core, or use both parallelism and concurrency to improve performance.

Figure 9 - A single scheduler scheduling tasks and resources

However, because the decisions a scheduling system makes change the state of resources and the context of the system, and thereby affect subsequent scheduling decisions, serial scheduling by a single scheduler is the only way to schedule resources precisely. A single scheduler collects the context needed for scheduling through various channels and, after receiving a scheduling request, makes the currently optimal decision based on the state of tasks and resources.

As the scheduler evolves, the performance and throughput of a single scheduler may become the bottleneck, and we then need to introduce parallel or concurrent scheduling: partition the resources to be scheduled and let multiple schedulers each be responsible for the resources in a different partition, as sketched after the figure below.

Figure 10-multiple schedulers and resource partitions
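
A rough sketch of the partitioning idea, with a made-up modulo rule standing in for a real partition function:

#include <stdio.h>

/* Toy resource partitioning: each scheduler owns a disjoint slice of the
 * resources, so scheduling within a slice stays serial and needs no shared
 * lock.  The modulo rule is just an illustrative partitioning scheme. */
#define N_SCHEDULERS 3

static int owning_scheduler(int resource_id) {
    return resource_id % N_SCHEDULERS;   /* hypothetical partition function */
}

int main(void) {
    for (int resource = 0; resource < 9; resource++)
        printf("resource %d is scheduled by scheduler %d\n",
               resource, owning_scheduler(resource));
    return 0;
}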

Concurrent scheduling with multiple schedulers can greatly improve overall scheduler performance; the Go language scheduler is one example. The Go runtime hands multiple CPUs to different processors, which schedule separately, so the scheduler's performance is improved through parallel scheduling.

Both approaches above assume that precise scheduling is required: each scheduler in a multi-scheduler setup faces an unrelated set of resources, so scheduling within the same partition is still serial.

Figure 11-coarse-grained scheduling of multiple schedulers

It is also possible to have multiple schedulers schedule the same resources at the same time, but scheduling accuracy may have to be sacrificed: different schedulers may receive state updates at different times and therefore make different decisions. A load balancer can be seen as a multi-threaded, multi-process scheduler of this kind; because it has only limited information about tasks and resources, the result of such coarse-grained scheduling is that the load on different machines can vary greatly, so both small and large clusters are likely to end up with some overloaded instances.

Work sharing and work stealing

This section covers the two paradigms for redistributing tasks among multiple schedulers: work sharing (Work Sharing) and work stealing (Work Stealing). A single independent scheduler handles all tasks and resources itself, so it never faces the task/resource imbalance that multiple schedulers do. In most scheduling scenarios task execution times are uncertain, so if multiple schedulers each schedule their own resources, the queues of tasks waiting in the different schedulers will eventually diverge: some queues contain a large number of tasks while others are empty, which makes a task-redistribution strategy necessary.

Work sharing and work stealing are two completely different redistribution strategies. In work sharing, when a scheduler creates new tasks, it hands some of them to other schedulers; in work stealing, when a scheduler's resources are under-utilized, it steals pending tasks from other schedulers, as shown in the following figure:

Figure 12 - Work stealing scheduler

Both redistribution strategies add extra overhead to the system. Compared with work sharing, work stealing is triggered only when the current scheduler's resources are under-utilized, so it introduces less extra overhead and is more commonly used in production environments; both the Linux kernel and the Go language adopt the work-stealing strategy.
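
The stealing policy itself is simple. The sketch below shows it in single-threaded toy form; real implementations such as the Linux kernel and the Go runtime steal concurrently from per-processor queues, which this sketch does not attempt to model.

#include <stdio.h>

/* Toy illustration of work stealing between two schedulers. */
#define QCAP 16

struct sched_queue { int tasks[QCAP]; int len; };

/* An idle scheduler steals half of the victim's pending tasks. */
static void steal(struct sched_queue *thief, struct sched_queue *victim) {
    int n = victim->len / 2;
    for (int i = 0; i < n; i++)
        thief->tasks[thief->len++] = victim->tasks[--victim->len];
}

int main(void) {
    struct sched_queue a = { {1, 2, 3, 4, 5, 6}, 6 };
    struct sched_queue b = { {0}, 0 };            /* b has run out of work */

    if (b.len == 0 && a.len > 0)
        steal(&b, &a);
    printf("a has %d tasks, b has %d tasks\n", a.len, b.len);
    return 0;
}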

3. Architecture design

This section analyzes scheduler architecture from both an internal and an external perspective: the former looks at the relationships among the components inside the scheduler and the process of making scheduling decisions, while the latter looks at how multiple schedulers should cooperate and whether external services can help the scheduler make more reasonable decisions.

Inside the scheduler

When the scheduler receives the task to be scheduled, it will make a reasonable scheduling decision according to the collected status and the specification (Spec) of the task to be scheduled. We can understand the internal logic of common scheduling systems from the figure below.

Figure 13-Scheduler makes scheduling decisions

A common scheduler generally consists of two parts-the status module used to collect status and the decision module responsible for making decisions.

Status module

The status module collects as much information as possible through different channels to provide rich context for scheduling, including information such as resource attributes, utilization and availability. Depending on the scenario, the context may need to be stored in persistent storage such as MySQL, and a copy is usually cached in memory to reduce the cost of the scheduler accessing it.

Decision module

The decision module makes scheduling decisions based on the context collected by the status module and the task specification. Note that a scheduling decision is only valid at the moment it is made: at some point in the future a state change may mean that a previous decision no longer satisfies the task's requirements. For example, after the Kubernetes scheduler places workloads on certain nodes, those nodes may suddenly become unreachable because of network problems; the workloads on such a node can no longer run properly, which means the scheduling decision has become invalid.

The scheduler schedules the appropriate resources for the task through the following three steps:

- determine the scheduling order of the different tasks based on priority, creation time and other information;
- select suitable resources for the task through the two phases of filtering and scoring;
- when no resource meets the conditions, choose which victims to preempt.

Figure 14-scheduling Framework

The figure above shows the steps a typical decision module performs: determining priority, scoring the idle resources and choosing the victims of preemption. The last of the three steps is often optional, since some scheduling systems do not need to support preemptive scheduling.
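
As a rough illustration of the filtering and scoring phases, the sketch below uses hypothetical node fields and an arbitrary scoring rule; it is not the algorithm of any real scheduler such as Kubernetes.

#include <stdio.h>

struct node { const char *name; int free_cpu; int free_mem; int healthy; };

/* Filtering phase: keep only nodes that can possibly host the task. */
static int filter(const struct node *n, int cpu_req, int mem_req) {
    return n->healthy && n->free_cpu >= cpu_req && n->free_mem >= mem_req;
}

/* Scoring phase: prefer the node with the most spare capacity after placement. */
static int score(const struct node *n, int cpu_req, int mem_req) {
    return (n->free_cpu - cpu_req) + (n->free_mem - mem_req);
}

int main(void) {
    struct node nodes[] = {
        {"node-a", 4, 8, 1}, {"node-b", 2, 16, 1}, {"node-c", 8, 32, 0},
    };
    int cpu_req = 2, mem_req = 4, best = -1, best_score = -1;

    for (int i = 0; i < 3; i++) {
        if (!filter(&nodes[i], cpu_req, mem_req))
            continue;
        int s = score(&nodes[i], cpu_req, mem_req);
        if (s > best_score) { best_score = s; best = i; }
    }
    if (best >= 0)
        printf("scheduled on %s\n", nodes[best].name);
    else
        printf("no feasible node, consider preemption\n");
    return 0;
}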

Outside of the scheduler

If we treat the scheduler as a whole, architecture design from outside the scheduler takes a completely different perspective: how to use external systems to enhance the scheduler's capabilities. Here we introduce two such designs: the multi-scheduler and the inverse scheduler (Descheduler).

Multiple schedulers

The design of multiple schedulers was already analyzed in the section on single and multiple schedulers: we can partition the resources to be scheduled so that several scheduler threads or processes are each responsible for one partition, making full use of the parallelism of multi-core CPUs.

Inverse scheduler

The inverse scheduler is an interesting concept: it can remove scheduling decisions that are no longer correct, reduce the entropy in the system, and let the scheduler make new decisions based on the current state.

Figure 15-Scheduler and inverse scheduler

Introducing an inverse scheduler makes the whole scheduling system more robust: the scheduler makes the correct scheduling decision for the current state, while the inverse scheduler removes scheduling decisions that the current state has made wrong. Their functions appear opposite, but both aim to schedule more appropriate resources for the task.

The inverse scheduler is not widely used and its practical application scenarios are limited. The author first encountered this concept in the descheduler project incubated by the Kubernetes community, but because removing scheduling relationships may affect online services that are already running, Kubernetes uses it only in specific scenarios.

Operating system

The scheduler is an important operating system component; an operating system contains a process scheduler, a network scheduler and an I/O scheduler, among others. This section introduces the process scheduler.

Some readers may wonder: isn't the smallest unit of operating system scheduling the thread, so why talk about a process scheduler? In Linux, the scheduler dispatches neither processes nor threads; it dispatches task_struct structures, which can represent either, and it treats both processes and threads as tasks. We clarify this here to avoid confusion: we will use the term process scheduler, but keep in mind that the Linux scheduler does not distinguish between threads and processes.

Linux incorporates process and thread scheduling by treating them as one in the same. A process can be viewed as a single thread, but a process can contain multiple threads that share some number of resources (code and/or data).

Next, this section examines the types of scheduling systems in the operating system and the evolution of the Linux process scheduler.

1. Scheduler types

The operating system's process schedulers fall into three different types: the long-term scheduler, the medium-term scheduler and the short-term scheduler. These three types provide different functions, which we introduce in turn in this section.

Long-term scheduler

The long-term scheduler (Long-Term Scheduler), also known as the job scheduler (Job Scheduler), determines which tasks enter the scheduler's ready queue. When we try to execute a new program, the long-term scheduler is responsible for authorizing or delaying its execution. Its role is to balance the numbers of I/O-intensive and CPU-intensive processes running at the same time:

- if there are too many I/O-intensive tasks, the ready queue contains no tasks to schedule, the short-term scheduler has nothing to do, and CPU resources sit idle;
- if there are too many CPU-intensive tasks, the I/O waiting queue contains no tasks, and the I/O devices sit idle.

By balancing the I/O-intensive and CPU-intensive tasks that run at the same time, the long-term scheduler maximizes the use of the operating system's I/O and CPU resources.

Medium-term scheduler

The medium-term scheduler removes inactive, low-priority, frequently page-faulting or memory-hungry processes from memory, freeing resources for other processes.

Figure 16-medium-term scheduler

When a running process blocks on an I/O operation, it occupies memory without using computing resources; in that case the medium-term scheduler removes it from memory, and only once the I/O operation completes does the process rejoin the ready queue to wait for the short-term scheduler.

Short-term scheduler

The short-term scheduler is probably the most familiar one: it selects a process to execute from the ready queue. The selection uses a specific scheduling algorithm that also considers the process's priority, queueing time and other characteristics. Because the execution time available to each process is limited, the short-term scheduler runs very frequently.

2. Design and evolution

This section focuses on Linux's CPU scheduler, i.e. the short-term scheduler. Linux's CPU scheduler was not as complex at the beginning of its design as it is today; for a long time (v0.01 ~ v2.4), Linux process scheduling was handled by a simple function of just a few dozen lines. Let's first look at the history of the scheduler across versions:

- Initial scheduler · v0.01 ~ v2.4. Implemented in a few dozen lines of code with very simple functionality; it could handle at most 64 tasks at the same time.

- O(n) scheduler · v2.4 ~ v2.6. Scheduling requires traversing all tasks; when there are many tasks to execute, the interval between two runs of the same task becomes very long, causing a serious starvation problem.

- O(1) scheduler · v2.6.0 ~ v2.6.22. Achieved O(1) time complexity by introducing run queues and priority arrays; used local run queues instead of a global run queue to improve scalability on symmetric multiprocessors; introduced work stealing to keep tasks balanced across the run queues.

- Completely Fair Scheduler · v2.6.23 ~ present. Introduced the red-black tree and running time to guarantee scheduling fairness, and introduced scheduling classes to implement different scheduling strategies for different task types.

The evolution from the initial scheduler to today's complex Completely Fair Scheduler (CFS) is described in detail below.

Initial scheduler

Linux's initial process scheduler consisted of only two files, sched.h and sched.c. It may be hard to imagine that early versions of Linux used a schedule function of only a few dozen lines to schedule all of the operating system's processes:

void schedule(void)
{
    int i, next, c;
    struct task_struct **p;

    for (p = &LAST_TASK; p > &FIRST_TASK; --p) {
        ...
    }

    while (1) {
        c = -1;
        next = 0;
        i = NR_TASKS;
        p = &task[NR_TASKS];
        while (--i) {
            if (!*--p)
                continue;
            if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
                c = (*p)->counter, next = i;
        }
        if (c)
            break;
        for (p = &LAST_TASK; p > &FIRST_TASK; --p)
            if (*p)
                (*p)->counter = ((*p)->counter >> 1) + (*p)->priority;
    }
    switch_to(next);
}

In Linux, both processes and threads are represented by task_struct structures. All schedulable processes are stored in an array with an upper limit of 64 entries, so this scheduler can handle at most 64 processes.

Figure 17-initial process Scheduler

The function above first wakes up interruptible processes that have received a signal, and then finds the runnable process with the largest counter in the queue; counter is the number of time-slice ticks the process may still use. The function then executes different logic depending on the value of the time slice:

- if the largest counter time slice is greater than 0, call the assembly-implemented switch_to to switch to that process;
- if the largest counter time slice equals 0, meaning the runnable time of all processes is 0, then every process is given a new time slice.

Every 10ms, the Linux timer interrupt triggers do_timer, which decrements the counter of the currently running process by one; when the current process's counter reaches 0, scheduling is triggered again.
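
The refill rule in the code above, counter = (counter >> 1) + priority, means that a task which keeps waiting sees its counter converge towards roughly twice its priority, so tasks that have waited a long time gradually gain scheduling weight. A small user-space illustration of that formula (not kernel code):

#include <stdio.h>

/* Illustrates the time-slice refill rule from the schedule function above:
 * counter = (counter >> 1) + priority.  For a task that never gets to run,
 * the value converges towards roughly 2 * priority. */
int main(void) {
    int priority = 15, counter = 0;
    for (int round = 1; round <= 6; round++) {
        counter = (counter >> 1) + priority;
        printf("round %d: counter = %d\n", round, counter);
    }
    return 0;
}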

O(n) scheduler

The O(n) scheduler was used by Linux in v2.4 ~ v2.6. Because it traverses all tasks in the worst case, the time complexity of scheduling is O(n). The Linux scheduling algorithm divides CPU time into different epochs (Epoch), i.e. the time slices each task can use.

The source code of the O(n) scheduler can be found in sched.h and sched.c. Compared with the previous scheduler, its implementation is much more complex: in the schedule function it traverses all tasks in the run queue and calls the goodness function to compute each task's weight, choosing the process to run next:

asmlinkage void schedule(void)
{
    ...
still_running_back:
    list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p, this_cpu)) {
            int weight = goodness(p, this_cpu, prev->active_mm);
            if (weight > c)
                c = weight, next = p;
        }
    }
    ...
}

At the beginning of each epoch, the code above computes time slices for all tasks; because this has to run n times, the scheduler is called the O(n) scheduler. By default each task is assigned a time slice of roughly 200ms per epoch, but this way of scheduling and allocating is also the scheduler's biggest problem:

- after each round of scheduling completes, the system falls into a state where no task can be scheduled, which severely hurts scenarios that need interactive performance; for example, dragging the mouse on the desktop feels noticeably laggy;
- each time the highest-weight task is selected, all tasks in the array have to be traversed;
- the average time slice allocated by the scheduler is about 210ms, so when a program has 100 processes, the interval between two runs of the same process is 21s, which severely affects the usability of the operating system.

It is because of these problems with the O(n) scheduler that the Linux kernel replaced the implementation with a new scheduler two versions later.

O(1) scheduler

The O(1) scheduler was used in the Linux kernel from v2.6.0 to v2.6.22, a span of four years, and it can schedule processes in constant time. Its source code is also in sched.h and sched.c. Due to the increase in implementation and functional complexity, the scheduler's line count grew from 2,100 to 5,000, and it made the following improvements over the O(n) scheduler:

- it achieves O(1) scheduling time complexity;
- it supports scalability on symmetric multiprocessing (Symmetric multiprocessing, SMP) systems;
- it optimizes affinity on symmetric multiprocessors.

Data structure

The O(1) scheduler achieves constant time complexity through two important data structures: the run queue runqueue and the priority array prio_array. Each run queue holds two priority arrays, storing the active and the expired processes respectively:

struct runqueue {
    ...
    prio_array_t *active, *expired, arrays[2];
    ...
};

struct prio_array {
    unsigned int nr_active;
    unsigned long bitmap[BITMAP_SIZE];
    struct list_head queue[MAX_PRIO];
};

The nr_active in the priority array represents the number of active processes, while bitmap and list_head together make up the data structure shown in the following figure:

Figure 18-priority array

The priority array's bitmap contains 140 bits in total, each indicating whether a process of the corresponding priority exists. The priority array in figure 18 contains three processes with priority 2 and one process with priority 5; the flag bit of each priority corresponds to a linked list in the list_head array. The scheduler schedules with these data structures as follows:

- sched_find_first_bit is called to find the highest priority with runnable processes and allocate CPU resources accordingly;
- schedule selects the process at the head of that priority's linked list for execution;
- processes of the same priority are scheduled round-robin: after selecting a process to execute, schedule appends it to the end of its queue, guaranteeing that processes of the same priority run in turn (Round-Robin);
- the timer triggers the scheduler_tick function every 1ms; if the current process has exhausted its execution time it is moved to the expired array, and when no process in the active queue is left to run, schedule swaps the active and expired priority arrays.

These are the main rules by which the O(1) scheduler operates. Beyond them, the scheduler also has to support preemption, CPU affinity and other features, which are not covered here.
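
The constant-time lookup rests on the 140-bit bitmap: picking the next priority to run is just finding the first set bit, and the amount of work is bounded by the fixed number of priorities. Below is a rough user-space sketch of the idea, not the kernel's sched_find_first_bit implementation:

#include <stdio.h>

#define MAX_PRIO      140
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))
#define BITMAP_SIZE   ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Return the index of the first set bit, i.e. the highest priority that
 * currently has runnable tasks; -1 if the bitmap is empty. */
static int first_set_bit(const unsigned long bitmap[BITMAP_SIZE]) {
    for (int w = 0; w < BITMAP_SIZE; w++) {
        if (bitmap[w] == 0)
            continue;
        for (int b = 0; b < BITS_PER_LONG; b++)
            if (bitmap[w] & (1UL << b))
                return w * BITS_PER_LONG + b;
    }
    return -1;
}

int main(void) {
    unsigned long bitmap[BITMAP_SIZE] = {0};
    bitmap[5 / BITS_PER_LONG] |= 1UL << (5 % BITS_PER_LONG);   /* priority 5 has tasks */
    bitmap[2 / BITS_PER_LONG] |= 1UL << (2 % BITS_PER_LONG);   /* priority 2 has tasks */
    printf("next priority to run: %d\n", first_set_bit(bitmap));
    return 0;
}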

Local run queue

The global run queue was the main reason the earlier scheduler struggled to scale on symmetric multiprocessor architectures: to keep the run queue consistent, the scheduler has to take a global lock while scheduling, and as the number of processors grows, so does the lock contention, which seriously hurts scheduling performance. The O(1) scheduler solves this by introducing local run queues; each CPU obtains the run queue bound to it through this_rq, which reduces lock granularity and the chance of contention.

#define this_rq() (&__get_cpu_var(runqueues))

Figure 19-Global and local run queues

Since multiple processors no longer need to share a global run queue, scalability on symmetric multiprocessor architectures improves: when we add new processors we only need to add new run queues, and this approach does not introduce additional lock contention.

Priority and time slicing

The O(1) scheduler uses two kinds of priority: static task priority and dynamic task priority. By default a task's static priority is 0, but we can change it by calling nice. The scheduler rewards I/O-intensive tasks and penalizes CPU-intensive tasks, implementing dynamic priority adjustment on top of the static priority; since the processes that interact with users are I/O-intensive, this dynamic policy improves the user experience.
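
The reward/penalty idea can be sketched as a priority offset driven by how much time a task spends sleeping on I/O. The formula and numbers below are made up for illustration and do not match the kernel's actual heuristics:

#include <stdio.h>

/* Illustrative dynamic-priority rule: tasks that spend more time sleeping
 * on I/O get a bonus, CPU hogs get a penalty.  The ranges and formula are
 * hypothetical; smaller value means higher priority in this sketch. */
static int dynamic_prio(int static_prio, double sleep_ratio) {
    int bonus = (int)(sleep_ratio * 10) - 5;   /* -5 .. +5 */
    int prio = static_prio - bonus;
    if (prio < 0) prio = 0;
    if (prio > 139) prio = 139;
    return prio;
}

int main(void) {
    printf("interactive task: %d\n", dynamic_prio(120, 0.9)); /* I/O-bound, rewarded */
    printf("batch task:       %d\n", dynamic_prio(120, 0.1)); /* CPU-bound, penalized */
    return 0;
}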

Completely Fair Scheduler

The Completely Fair Scheduler (CFS) was merged into the kernel in v2.6.23 and is the kernel's default process scheduler; its goal is to maximize CPU utilization and interactive performance. In Linux v2.6.23, CFS consists of the following files:

- include/linux/sched.h
- kernel/sched_stats.h
- kernel/sched.c
- kernel/sched_fair.c
- kernel/sched_idletask.c
- kernel/sched_rt.c

As its name suggests, CFS provides complete fairness to the different processes: whenever some process has been treated unfairly, the scheduler runs it, keeping the running time of all processes fair. This way of ensuring fairness is like the saying "if there is too much water, add flour; if there is too much flour, add water":

- the scheduler finds the most unfairly treated process in the run queue and allocates computing resources to it; the allocated time is the difference between its running time and that of the other processes, plus the minimum unit of time a process can run;
- once that process finishes running, another process in the run queue has become the most unfairly treated, and the scheduler runs that new process.

The scheduling algorithm keeps computing each process's running time and repeatedly schedules the most unfairly treated process in the queue, ensuring that the difference in running time between processes is no greater than the minimum running time unit.
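
A drastically simplified sketch of that rule follows: it ignores weights, sleeping tasks and the red-black tree, and simply keeps running whichever task has accumulated the least runtime.

#include <stdio.h>

/* Toy CFS-like loop: always run the task that has accumulated the least
 * runtime, for one granularity at a time. */
struct task { const char *name; double runtime; };

int main(void) {
    struct task tasks[] = { {"A", 0}, {"B", 0}, {"C", 0} };
    int n = 3;
    double granularity = 1.0;                 /* minimum unit of runtime */

    for (int tick = 0; tick < 6; tick++) {
        int min = 0;
        for (int i = 1; i < n; i++)
            if (tasks[i].runtime < tasks[min].runtime)
                min = i;                      /* the most "unfairly treated" task */
        tasks[min].runtime += granularity;
        printf("tick %d: run %s (runtime %.1f)\n",
               tick, tasks[min].name, tasks[min].runtime);
    }
    return 0;
}

Run long enough, the tasks' accumulated runtimes never differ by more than the granularity, which is exactly the fairness property described above.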

Data structure

Although we will keep using the term run queue, CFS no longer uses a queue to store processes internally; cfs_rq is a new structure for managing runnable processes, and it uses a red-black tree (Red-black tree) instead of a linked list:

struct cfs_rq {
    struct load_weight load;
    unsigned long nr_running;

    s64 fair_clock;
    u64 exec_clock;
    s64 wait_runtime;
    u64 sleeper_bonus;
    unsigned long wait_runtime_overruns, wait_runtime_underruns;

    struct rb_root tasks_timeline;
    struct rb_node *rb_leftmost;
    struct rb_node *rb_load_balance_curr;

    struct sched_entity *curr;
    struct rq *rq;
    struct list_head leaf_cfs_rq_list;
};

The red-black tree (Red-black tree) is a self-balancing binary search tree whose insert, delete, update and lookup operations have worst-case time complexity proportional to the height of the tree, i.e. O(log n). The leftmost node of the tree, rb_leftmost, is the process with the shortest running time and is the next process to run.

Note: in the latest version of the CFS implementation, the kernel uses virtual runtime vruntime instead of wait time, but the basic scheduling principles and sorting methods have not changed much.

Scheduling process

The scheduling process of CFS is completed by the schedule function, and the execution of this function can be divided into the following steps:

- disable preemption on the current CPU;
- if there are no tasks in the current CPU's run queue, call idle_balance to take over part of other CPUs' run queues for execution;
- call pick_next_task to select the highest-priority task in the red-black tree;
- call context_switch to switch the running context, including the register state and the stack;
- re-enable preemption on the current CPU.

The CFS scheduling process is very similar to that of the O(1) scheduler; the differences are the addition of an optional work-stealing mechanism and the change in the underlying data structure.

Scheduling class

The scheduling class in CFS is an interesting concept that determines a process's scheduling policy. Each scheduling class contains a set of functions responsible for scheduling and is represented by the sched_class structure shown below:

struct sched_class {
    struct sched_class *next;

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    void (*yield_task) (struct rq *rq, struct task_struct *p);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

    unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
            struct rq *busiest,
            unsigned long max_nr_move, unsigned long max_load_move,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *all_pinned, int *this_best_prio);

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p);
    void (*task_new) (struct rq *rq, struct task_struct *p);
};

The scheduling class contains functions for initializing, enqueueing and dequeueing tasks; the design here is somewhat similar to interfaces in object-oriented design. The kernel contains the SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE, SCHED_FIFO and SCHED_RR scheduling classes, which implement the functions in sched_class to provide different scheduling behaviors.
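
The function-pointer table is what makes it easy to plug in different policies. The toy example below registers a made-up FIFO policy through a miniature version of such a table; it is not the kernel's interface.

#include <stdio.h>

/* Miniature version of the scheduling-class idea: a policy is a table of
 * function pointers, and the core scheduler only calls through the table. */
struct toy_rq { int tasks[8]; int len; };

struct toy_sched_class {
    const char *name;
    void (*enqueue_task)(struct toy_rq *rq, int task);
    int  (*pick_next_task)(struct toy_rq *rq);
};

static void fifo_enqueue(struct toy_rq *rq, int task) { rq->tasks[rq->len++] = task; }
static int  fifo_pick(struct toy_rq *rq) {
    int t = rq->tasks[0];
    for (int i = 1; i < rq->len; i++) rq->tasks[i - 1] = rq->tasks[i];
    rq->len--;
    return t;
}

static const struct toy_sched_class fifo_class = { "toy-fifo", fifo_enqueue, fifo_pick };

int main(void) {
    struct toy_rq rq = { {0}, 0 };
    fifo_class.enqueue_task(&rq, 1);
    fifo_class.enqueue_task(&rq, 2);
    printf("%s picked task %d\n", fifo_class.name, fifo_class.pick_next_task(&rq));
    return 0;
}

In the real kernel, each policy provides its own sched_class instance (for example fair_sched_class for CFS), and the core scheduler only ever calls through these function pointers.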
