Essentials of Scheduling System Design


Author | Draveness

Introduction: this article took the author about two months to write and runs to roughly 20,000 words. It is recommended to bookmark it or read it on a computer.

Scheduling is a very broad concept, and the term is used in many fields. In computer science, scheduling is the method by which work is assigned to resources. The work may be virtual computing tasks such as threads, processes, or data flows, which are scheduled onto hardware resources such as processors (CPUs).

Figure 1 - Essentials of scheduling system design

This article introduces the common scenarios of scheduling systems and the key issues in their design. The design of a scheduler ultimately boils down to one question: how to allocate and schedule resources efficiently to achieve our goals, which may include using resources rationally, minimizing cost, or quickly matching supply and demand.

Figure 2-context and content of the article

In addition to introducing the common problems encountered in the design of the scheduling system, this paper will also deeply analyze the design, evolution and implementation principles of several common schedulers, including the process scheduler of the operating system, the runtime scheduler of Go language and the workload scheduler of Kubernetes, to help us understand the core principles of scheduler design.

Design principle

A scheduling system is, in essence, a scheduler (Scheduler). Schedulers appear in many systems: as mentioned above, there is a scheduler not only in the operating system, but also in programming languages, container orchestration, and many business systems.

The core function of these scheduling modules is to allocate limited resources in order to maximize resource utilization or reduce the system's tail latency. What a scheduling system faces is the imbalance between the demand for and supply of resources.

Figure 3-Scheduler tasks and resources

In this section, we will introduce the key issues to consider when designing a scheduling system, including requirements research, scheduling principles, and architecture design.

1. Requirements research

Before starting to build a scheduling system, the first task is to conduct detailed requirements research and analysis, and the following two things need to be completed in this process:

Investigate the application scenario of the scheduling system and study in depth the characteristics of the tasks to be executed (Work) and the resources available to execute them (Resource);
analyze the purpose of the scheduling system, which may be cost first, quality first, or maximizing resource utilization; the scheduling purpose is generally dynamic and changes as requirements change.

Application scenario

The application scenario is the first problem to consider, and analyzing it carefully is very important: we need a deep understanding of the characteristics of the tasks to be executed and of the resources available to execute them in the current scenario. For the tasks to be executed, we need to analyze the following characteristics:

Whether the task has a deadline by which it must be completed;
whether the task supports preemption, and what the specific preemption rules are;
whether the task has prerequisite dependencies;
whether the task can only run on specified resources.

The resources used to execute tasks may also be unevenly loaded, and different resources may process tasks at different speeds.
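For example, the outcome of this analysis can be written down as a small data model before any scheduler code exists. The Go sketch below is only an illustration of such a model; every field name is hypothetical and not taken from any real system:

package scheduling

import "time"

// Task describes a unit of work as the scheduler sees it (illustrative fields only).
type Task struct {
    ID           string
    Deadline     *time.Time // nil when the task has no deadline
    Preemptible  bool       // whether the scheduler is allowed to suspend it
    DependsOn    []string   // IDs of tasks that must complete first
    NodeAffinity []string   // if non-empty, the task may only run on these resources
}

// Resource describes something that can execute tasks.
type Resource struct {
    ID       string
    Capacity int     // how many tasks it can execute at once (1 for a CPU core)
    Speed    float64 // relative processing speed; resources may be heterogeneous
    Healthy  bool    // resources can become unavailable at any time
}

Writing the characteristics down this way makes it obvious which questions the requirements research still has to answer, for example whether Deadline or Preemptible is ever actually used in the target scenario.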

The diversity of resources and tasks determines the design of the scheduling system. Here we give a few simple examples to help readers understand the process of requirements analysis of the scheduling system.

Figure 4-Linux operating system

In the operating system's process scheduler, the tasks to be scheduled are threads, which are generally only in one of two states, runnable or not runnable (waiting or terminated); the CPUs used to process these tasks cannot be subdivided, and one CPU can only execute one task at a time, which is a physical limitation. To summarize, the tasks and resources handled by the operating system scheduler have the following characteristics:

Task - thread. Simple state: a thread is either runnable or not runnable; different priorities: tasks to be executed may have different priorities, and fairness between tasks must still be guaranteed when priority is taken into account.

Resource - CPU time. Indivisible: a CPU can only run one task at a time.

In the above scenario, the task to be executed is the basic unit of operating system scheduling-threads, and the resource that can be allocated is CPU time. The scheduler of the Go language faces almost the same scenario as the scheduler of the operating system, where the task is Goroutine and the resources that can be allocated are threads running on CPU.

Figure 5 - Container orchestration system Kubernetes

Besides lower-level schedulers such as operating systems and programming languages, scheduling containers and computing tasks is also common today. Kubernetes, as a container orchestration system, is responsible for scheduling containers in a cluster. Anyone with a little knowledge of it knows that the basic scheduling unit in Kubernetes is the Pod, and Pods are scheduled onto nodes (Node) for execution:

Task - Pod. Different priorities: Pods may have different priorities, and a high-priority system Pod can preempt the resources of low-priority Pods; stateful: Pods can be stateless or stateful, and stateful Pods rely on persistent storage volumes.

Resource - Node. Different types: different nodes carry different types of resources, including CPU, GPU and memory, which can be subdivided but belong to the current node; instability: a node may become unavailable for unexpected reasons, such as loss of network connectivity or disk damage.

Scheduling system is very common in life and work. In addition to the above two scenarios, other scenarios that require scheduling system include CDN resource scheduling, order scheduling and offline task scheduling system. In different scenarios, we all need to think deeply about the characteristics of tasks and resources, which can guide the designers of the system.

Scheduling purpose

After an in-depth analysis of the scheduling scenario, we need to understand the purpose of scheduling. We can think of the scheduling purpose as the cost function (Cost function) in machine learning: determining the scheduling purpose means defining the cost function. Common scheduling objectives are introduced in the book Scheduling Theory and include the following:

Makespan: the time span from the first task starting to the last task completing its schedule;
maximum lateness (Maximum Lateness): the largest amount by which any task exceeds its deadline;
total weighted completion time (Total weighted completion time): the sum over all tasks of weight multiplied by completion time.
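To make these objectives concrete, the sketch below computes the three metrics for a schedule that has already finished, assuming we know each task's completion time, deadline and weight; the FinishedTask type and Objectives function are hypothetical helpers, not part of any library:

package scheduling

import "math"

// FinishedTask holds the data needed to evaluate a schedule after the fact.
type FinishedTask struct {
    Completion float64 // C_j: the time at which the task finished
    Deadline   float64 // d_j: the time by which it should have finished
    Weight     float64 // w_j: the importance of the task
}

// Objectives returns the makespan, the maximum lateness and the total weighted
// completion time of a finished schedule.
func Objectives(tasks []FinishedTask) (makespan, maxLateness, weighted float64) {
    maxLateness = math.Inf(-1) // lateness can be negative when every task finishes early
    for _, t := range tasks {
        if t.Completion > makespan {
            makespan = t.Completion // makespan: the largest C_j
        }
        if l := t.Completion - t.Deadline; l > maxLateness {
            maxLateness = l // maximum lateness: the largest (C_j - d_j)
        }
        weighted += t.Weight * t.Completion // total weighted completion time: sum of w_j * C_j
    }
    return
}

A scheduler that optimizes makespan and one that optimizes total weighted completion time can produce very different schedules from the same input, which is why the objective has to be fixed before the algorithm is chosen.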

These are the objectives of scheduling theory. The objective of most business scheduling systems is to optimize the metrics closely tied to the business: cost and quality. Achieving a balance between cost and quality requires careful consideration and design; due to limited space and the complexity of business scenarios, this article will not analyze how to balance them, which usually has to be considered together with the business itself.

2. Scheduling principles

A scheduler with good performance is a prerequisite for achieving a particular scheduling purpose. We often ignore the extra overhead of scheduling when discussing scenarios and objectives, but the scheduler's latency and throughput cannot be ignored once the scheduling load becomes heavy. This section examines some important concepts related to scheduler implementation that help us build a high-performance scheduler:

Cooperative scheduling and preemptive scheduling;
single scheduler and multiple schedulers;
work sharing and work stealing.

Cooperative and preemptive scheduling

Cooperative (Cooperative) and preemptive (Preemptive) scheduling are common multitasking strategies in operating systems. The definitions of the two scheduling methods are completely different:

Cooperative scheduling allows a task to run for an arbitrarily long time until it actively notifies the scheduler and gives up its resources; preemptive scheduling allows the scheduler to suspend a task during execution and decide anew which task to run next.

Figure 6-Collaborative scheduling and preemptive scheduling

The execution time of the tasks and the extra overhead of context switches determine which scheduling method performs better. Figure 7 below shows a cooperative scheduler scheduling tasks: once the scheduler allocates resources to a task, it waits for the task to release them voluntarily. Although the execution times of the four tasks in the figure differ, each releases its resources when it finishes, and the whole process requires only four context switches.

Figure 7-Collaborative scheduling

Figure 8 shows preemptive scheduling. Because the scheduler does not know the execution time of the tasks, it allocates a time slice to each one. Task 1 and task 4 finish within their first slice because their execution times are short, but task 2 and task 3 exceed the limit set by the scheduler, so preemption is triggered to guarantee fairness and the other tasks in the waiting queue get resources. Over the whole scheduling process, six context switches take place.

Figure 8-preemptive scheduling

If some tasks run for a very long time, cooperative scheduling lets those long-running tasks starve the others; but if the tasks to be executed are short and of roughly equal length, cooperative scheduling avoids the extra overhead of interrupting tasks and yields better scheduling performance.

Because task execution time is uncertain in most cases, and in cooperative scheduling a task that does not voluntarily give up its resources blocks all the others, scheduling systems generally focus on preemptive scheduling while also supporting cooperative scheduling.
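The trade-off between the two figures can be reproduced with a toy simulation: under cooperative scheduling each task causes exactly one switch, while under a fixed time slice long tasks are preempted and requeued, adding switches but bounding how long the others wait. This is only a counting exercise under assumed task lengths, not a real scheduler:

package scheduling

// preemptiveSwitches simulates round-robin scheduling with a fixed time slice and
// returns how many context switches happen before every task finishes.
// remaining holds each task's execution time in the same unit as slice.
func preemptiveSwitches(remaining []int, slice int) int {
    queue := append([]int(nil), remaining...)
    switches := 0
    for len(queue) > 0 {
        task := queue[0]
        queue = queue[1:]
        switches++ // the scheduler hands the CPU to this task
        if task > slice {
            queue = append(queue, task-slice) // slice exhausted: preempt and requeue the rest
        }
    }
    return switches
}

// cooperativeSwitches lets every task run until it voluntarily yields, so there
// is exactly one switch per task regardless of how long each task runs.
func cooperativeSwitches(remaining []int) int {
    return len(remaining)
}

With four hypothetical task lengths {2, 5, 8, 3} and a slice of 4, preemptiveSwitches returns 6 and cooperativeSwitches returns 4, which is consistent with the switch counts described for Figures 7 and 8.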

Single scheduler and multiple schedulers

Whether to use a single scheduler or multiple schedulers also needs to be carefully considered when designing a scheduling system. Multiple schedulers do not necessarily mean multiple processes, but may also be multiple scheduling threads in a process. They can choose to schedule in parallel on a multi-core, concurrent scheduling on a single core, or use both parallelism and concurrency to improve performance.

Figure 9 - A single scheduler scheduling tasks and resources

However, for the scheduling system, because the decisions it makes will change the state of resources and the context of the system, and then affect the subsequent scheduling decisions, so the serial scheduling of a single scheduler is the only way to accurately schedule resources. A single scheduler uses different channels to collect the context needed for scheduling, and after receiving the scheduling request, it will make the current optimal decision according to the task and resource situation.

With the continuous evolution of the scheduler, the performance and throughput of a single scheduler may be limited. We still need to introduce parallel or concurrent scheduling to solve the performance bottleneck. We need to partition the resources to be scheduled. Let multiple schedulers be responsible for scheduling resources in different areas.

Figure 10-multiple schedulers and resource partitions

Concurrent scheduling with multiple schedulers can greatly improve overall scheduler performance. The Go runtime, for example, assigns multiple CPUs to different processors (P) that schedule independently, improving scheduler performance through parallel scheduling.

The two scheduling methods described above are based on the premise that precise scheduling is required, and each scheduler in the multi-scheduler will face unrelated resources, so for the resources of the same partition, the scheduling is serial.

Figure 11-coarse-grained scheduling of multiple schedulers

It is also possible to let multiple schedulers schedule the same resources at the same time, but scheduling accuracy may have to be sacrificed: different schedulers may receive state updates at different times and therefore make different decisions. A load balancer can be regarded as such a multi-threaded or multi-process scheduler; because the information it holds about tasks and resources is limited, this coarse-grained scheduling leads to widely varying load on different machines, so both small and large clusters can end up with some overloaded instances.

Work sharing and work stealing

This section covers two paradigms for redistributing tasks among multiple schedulers: work sharing (Work Sharing) and work stealing (Work Stealing). A standalone scheduler handles all tasks and resources by itself, so it never faces an imbalance between schedulers. But in most scheduling scenarios task execution time is uncertain; if multiple schedulers each schedule the same kind of resources, their queues of tasks waiting to be scheduled will eventually diverge: some queues will hold many tasks while others hold none, so a task redistribution strategy has to be introduced.

Work sharing and work stealing are two completely different redistribution strategies. With work sharing, when a scheduler creates new tasks it hands some of them to other schedulers; with work stealing, when a scheduler's resources are under-utilized it steals pending tasks from other schedulers, as shown below:

Figure 12 - Work-stealing scheduler

Both redistribution strategies add extra overhead to the system. Compared with work sharing, work stealing is triggered only when the current scheduler's resources are under-utilized, so it introduces less overhead and is more common in production environments: both the Linux kernel and the Go runtime chose work stealing (see the sketch below).
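A minimal sketch of the work-stealing idea between per-scheduler run queues is shown below. It is deliberately simplified: plain slices guarded by a mutex stand in for the lock-free queues that real runtimes use, and the list of peers is assumed not to contain the worker itself:

package scheduling

import "sync"

// Worker owns a local run queue; an idle worker steals from its peers.
type Worker struct {
    mu    sync.Mutex
    tasks []func()
}

func (w *Worker) push(t func()) {
    w.mu.Lock()
    w.tasks = append(w.tasks, t)
    w.mu.Unlock()
}

// pop takes a task from the worker's own queue, if there is one.
func (w *Worker) pop() (func(), bool) {
    w.mu.Lock()
    defer w.mu.Unlock()
    if len(w.tasks) == 0 {
        return nil, false
    }
    t := w.tasks[len(w.tasks)-1]
    w.tasks = w.tasks[:len(w.tasks)-1]
    return t, true
}

// steal moves half of a victim's queue to the local queue. It is only called
// when the local queue is empty, so the extra cost is paid by idle workers.
func (w *Worker) steal(victim *Worker) (func(), bool) {
    victim.mu.Lock()
    n := (len(victim.tasks) + 1) / 2
    stolen := append([]func(){}, victim.tasks[:n]...)
    victim.tasks = victim.tasks[n:]
    victim.mu.Unlock()

    w.mu.Lock()
    w.tasks = append(w.tasks, stolen...)
    w.mu.Unlock()
    return w.pop()
}

// run drains the local queue and falls back to stealing from peers.
func (w *Worker) run(peers []*Worker) {
    for {
        if t, ok := w.pop(); ok {
            t()
            continue
        }
        stole := false
        for _, p := range peers {
            if t, ok := w.steal(p); ok {
                t()
                stole = true
                break
            }
        }
        if !stole {
            return // nothing runnable anywhere
        }
    }
}

Real implementations differ in the details (the Go runtime, for instance, uses fixed-size lock-free per-processor queues), but the control flow is the same: use local work first, and pay the cost of stealing only when idle.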

3. Architecture design

This section analyzes the architecture design of the scheduler from both the internal and external perspectives of the scheduler, the former analyzing the relationship between multiple components within the scheduler and the process of making scheduling decisions, and the latter analyzing how multiple schedulers should cooperate and whether there are other external services that can assist the scheduler to make more reasonable scheduling decisions.

Inside the scheduler

When the scheduler receives the task to be scheduled, it will make a reasonable scheduling decision according to the collected status and the specification (Spec) of the task to be scheduled. We can understand the internal logic of common scheduling systems from the figure below.

Figure 13-Scheduler makes scheduling decisions

A common scheduler generally consists of two parts-the status module used to collect status and the decision module responsible for making decisions.

Status module

The status module collects as much information as possible from different sources to provide rich context for scheduling, including resource attributes, utilization and availability. Depending on the scenario, the context may need to be persisted in storage such as MySQL, and a copy is usually cached in memory to reduce the cost of accessing it from the scheduler.

Decision module

The decision module will make scheduling decisions according to the context and task specifications collected by the state module. It should be noted that the scheduling decisions made are only valid now, and at some point in the future, the state change may cause the previous decisions not to meet the requirements of the task. For example, when we use the Kubernetes scheduler to schedule workloads to certain nodes, these nodes may suddenly become unavailable due to network problems. The workload on this node cannot work properly, that is, the scheduling decision is invalid.

The scheduler schedules the appropriate resources for the task through the following three steps:

Determine the scheduling order of different tasks based on priority, creation time and other information;
select suitable resources for the task through the two stages of filtering and scoring;
when no resource meets the conditions, choose which victims to preempt.

Figure 14-scheduling Framework

The figure above shows the steps a common scheduler decision module performs: determining priority, scoring idle resources, and choosing the victims of preemption. The last of the three steps is often optional; some scheduling systems do not need to support preemptive scheduling.
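The three steps can be read as a small pipeline. The sketch below only shows the shape of such a decision module; the Candidate, FilterFunc and ScoreFunc names are made up for illustration and are not any particular system's plugin API:

package scheduling

// Candidate is a resource being considered for one task (illustrative only).
type Candidate struct {
    Name string
    Free int // free capacity, in whatever unit the task requests
}

// FilterFunc reports whether a candidate can run the task at all.
type FilterFunc func(c Candidate) bool

// ScoreFunc ranks a candidate that passed every filter; higher is better.
type ScoreFunc func(c Candidate) int

// Decide applies the filtering and scoring stages and returns the best candidate.
// The third stage, choosing preemption victims when nothing fits, is omitted
// because it is optional in many schedulers.
func Decide(cands []Candidate, filters []FilterFunc, scores []ScoreFunc) (Candidate, bool) {
    var best Candidate
    bestScore, found := 0, false
    for _, c := range cands {
        feasible := true
        for _, f := range filters {
            if !f(c) {
                feasible = false
                break
            }
        }
        if !feasible {
            continue
        }
        total := 0
        for _, s := range scores {
            total += s(c)
        }
        if !found || total > bestScore {
            best, bestScore, found = c, total, true
        }
    }
    return best, found
}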

Outside of the scheduler

If we look at the scheduler as a whole, the architecture design from the outside of the scheduler will get a completely different perspective-how to use the external system to enhance the function of the scheduler. Here we will introduce two external designs of the scheduler, namely, the multi-scheduler and the inverse scheduler (Descheduler).

Multiple schedulers

The section on single and multiple schedulers has already analyzed the multi-scheduler design: we can partition the resources to be scheduled and let multiple scheduler threads or processes each be responsible for one partition, making full use of multi-core CPUs and parallelism.

Inverse scheduler

The inverse scheduler is an interesting concept: it removes scheduling decisions that are no longer correct, reducing entropy in the system and letting the scheduler make new decisions based on the current state.

Figure 15-Scheduler and inverse scheduler

The introduction of inverse scheduler makes the whole scheduling system more robust. The scheduler is responsible for making the correct scheduling decision according to the current state, and the inverse scheduler removes the wrong scheduling decision according to the current state. Their function seems to be the opposite, but the purpose is to schedule more appropriate resources for the task.

The inverse scheduler is not widely used and its practical scenarios are limited. The author first encountered the concept in the descheduler project incubated by Kubernetes; because removing an existing scheduling relationship may affect running online services, the descheduler is only used in specific scenarios.
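Conceptually, a descheduler is just a control loop that revisits earlier placements against the current state and removes the ones that no longer hold, so that the ordinary scheduler can decide again. The sketch below shows that loop with hypothetical Placement, stillValid and evict names; it is not the API of the Kubernetes descheduler project:

package scheduling

// Placement records a decision the scheduler made earlier.
type Placement struct {
    TaskID     string
    ResourceID string
}

// Descheduler periodically revisits placements and evicts the ones that are
// no longer valid.
type Descheduler struct {
    stillValid func(Placement) bool // re-evaluates a past decision against the current state
    evict      func(Placement)      // removes the placement so the task can be rescheduled
}

// Sweep runs one pass over all known placements. A production system would
// also rate-limit evictions to avoid disturbing too many running workloads at once.
func (d *Descheduler) Sweep(placements []Placement) {
    for _, p := range placements {
        if !d.stillValid(p) {
            d.evict(p)
        }
    }
}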

Operating system

The scheduler is an important component of the operating system, which includes the process scheduler, the network scheduler and the I/O scheduler. This section introduces the process scheduler.

Some readers may wonder: isn't the smallest scheduling unit of the operating system the thread, so why talk about a process scheduler here? In the Linux operating system, the scheduler schedules neither processes nor threads; it schedules task_struct structures, which can represent either, and it treats processes and threads alike as tasks. We clarify this here to avoid confusion: we will keep using the term process scheduler, but note that the Linux scheduler does not distinguish between threads and processes.

Linux incorporates process and thread scheduling by treating them as one in the same. A process can be viewed as a single thread, but a process can contain multiple threads that share some number of resources (code and/or data).

Next, this section examines the types of scheduling systems in the operating system and the evolution of the Linux process scheduler.

1. Types of process schedulers

The operating system divides process schedulers into three different types, namely, long-term schedulers, medium-term schedulers, and short-term schedulers. These three different types of schedulers provide different functions, which we will introduce in turn in this section.

Long-term scheduler

The long-term scheduler (Long-Term Scheduler), also known as the job scheduler (Job Scheduler), decides which tasks enter the scheduler's ready queue. When we try to execute a new program, the long-term scheduler is responsible for authorizing or delaying its execution. Its role is to balance the number of I/O-intensive and CPU-intensive tasks running at the same time:

If there are too many I/O-intensive tasks, the ready queue will contain nothing for the short-term scheduler to run and CPU resources will sit idle; if there are too many CPU-intensive tasks, the I/O wait queue will contain no tasks and the I/O devices will sit idle.

The long-term scheduler balances the I/O-intensive and CPU-intensive tasks running at the same time, maximizing use of the operating system's I/O and CPU resources.

Medium-term scheduler

The medium-term scheduler removes inactive, low-priority, page-error-prone, or memory-intensive processes from memory, freeing resources for other processes.

Figure 16-medium-term scheduler

When a running process gets blocked on an I/O operation, it occupies memory without using the CPU; in that case the medium-term scheduler can remove it from memory, and once the I/O completes the process rejoins the ready queue and waits for the short-term scheduler.

Short-term scheduler

The short-term scheduler should be the most familiar scheduler, which selects a process for execution from the ready queue. The selection of the process will use a specific scheduling algorithm, which will also consider the priority of the process, queue time and other characteristics. Because the execution time available to each process is limited, the short-term scheduler executes frequently.

2. Design and evolution

This section focuses on Linux's CPU scheduler, that is, the short-term scheduler. Linux's CPU scheduler was not as complex as it is today from the beginning; for a long time (v0.01 ~ v2.4) process scheduling was handled by a simple function of a few dozen lines. Let's first look at the history of the scheduler across versions:

Initial scheduler, v0.01 ~ v2.4. Implemented in a few dozen lines of code with very simple functionality; could handle at most 64 tasks at the same time.

O(n) scheduler, v2.4 ~ v2.6. Scheduling requires traversing all tasks; when there are many tasks to execute, the interval between two runs of the same task becomes very long and serious starvation occurs.

O(1) scheduler, v2.6.0 ~ v2.6.22. Achieved O(1) time complexity by introducing run queues and priority arrays; replaced the global run queue with per-CPU local run queues to improve scalability on symmetric multiprocessors; introduced work stealing to keep tasks balanced across run queues.

Completely Fair Scheduler, v2.6.23 ~ present. Introduced a red-black tree and accumulated running time to guarantee scheduling fairness, and introduced scheduling classes to implement different scheduling policies for different task types.

Here we describe in detail the evolution from the original scheduler to today's complex Completely Fair Scheduler (CFS).

Initial scheduler

Linux's original process scheduler consists of only two files, sched.h and sched.c. It may be hard to imagine that earlier versions of Linux used the schedule function, which is only a few dozen lines, to be responsible for scheduling operating system processes:

void schedule(void)
{
    int i, next, c;
    struct task_struct ** p;

    for (p = &LAST_TASK; p > &FIRST_TASK; --p) {
        ...
    }

    while (1) {
        c = -1;
        next = 0;
        i = NR_TASKS;
        p = &task[NR_TASKS];
        while (--i) {
            if (!*--p)
                continue;
            if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
                c = (*p)->counter, next = i;
        }
        if (c)
            break;
        for (p = &LAST_TASK; p > &FIRST_TASK; --p)
            if (*p)
                (*p)->counter = ((*p)->counter >> 1) + (*p)->priority;
    }
    switch_to(next);
}

Both processes and threads are regarded as task_struct structures in Linux. All scheduled processes are stored in an array with an upper limit of 64, and the scheduler can handle only 64 processes.

Figure 17-initial process Scheduler

The function above first wakes up interruptible processes that have received a signal, then finds the runnable process with the largest counter in the queue. counter is the number of time-slice ticks the process can still use, and the function behaves differently depending on its value:

If the maximum counter is greater than 0, switch_to (implemented in assembly) is called to switch to that process; if the maximum counter equals 0, meaning every process has used up its runnable time, all processes are given a new time slice.

Every 10ms, the operating system timer triggers do_timer, which decrements the counter of the currently running process by one; when the current process's counter reaches 0, scheduling is triggered again.

O(n) scheduler

The O(n) scheduler was used by Linux from v2.4 to v2.6. Because it traverses all tasks in the worst case, the time complexity of scheduling is O(n). The Linux scheduling algorithm divides CPU time into epochs (Epoch), i.e. the time slices each task can use.

We can find the scheduler's source code in the sched.h and sched.c files. Compared with the previous version, its implementation is much more complex: the schedule function traverses all tasks in the run queue and calls the goodness function to compute each task's weight, obtaining the next process to run:

asmlinkage void schedule(void)
{
    ...
still_running_back:
    list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p, this_cpu)) {
            int weight = goodness(p, this_cpu, prev->active_mm);
            if (weight > c)
                c = weight, next = p;
        }
    }
    ...
}

At the beginning of each epoch, the code above computes time slices for all tasks; because this has to be done n times, the scheduler is called the O(n) scheduler. By default each task gets a time slice of around 200ms within one epoch, and this way of scheduling and allocating is the O(n) scheduler's biggest problem:

After each round of scheduling there is a period in which no task has any time slice left, and scenarios that need good interactivity suffer badly; for example, dragging the mouse on the desktop feels visibly janky;
finding the task with the highest weight requires traversing all tasks in the queue;
the average time slice allocated by the scheduler is about 210ms, so when there are 100 processes the interval between two runs of the same process is about 21s, which severely affects the usability of the operating system.

It is because of the above problems with the scheduler that the Linux kernel replaces the implementation with a new scheduler after two versions.

O(1) scheduler

The O(1) scheduler was used in the Linux kernel from v2.6.0 to v2.6.22, for about four years; it can schedule processes in constant time. Its source code can be found in sched.h and sched.c. Due to the increase in implementation and functional complexity, the scheduler grew from 2,100 to 5,000 lines of code, with the following improvements over the O(n) scheduler:

Scheduling with O(1) time complexity;
better scalability on symmetric multiprocessing (Symmetric multiprocessing, SMP) systems;
optimized processor affinity on symmetric multiprocessors.

Data structure

The O(1) scheduler achieves constant-time scheduling through two important data structures: the run queue runqueue and the priority array prio_array. Each run queue holds two priority arrays, storing the active and the expired processes respectively:

struct runqueue {
    ...
    prio_array_t *active, *expired, arrays[2];
    ...
};

struct prio_array {
    unsigned int nr_active;
    unsigned long bitmap[BITMAP_SIZE];
    struct list_head queue[MAX_PRIO];
};

The nr_active in the priority array represents the number of active processes, while bitmap and list_head together make up the data structure shown in the following figure:

Figure 18-priority array

The priority array's bitmap contains 140 bits in total, each indicating whether a process of the corresponding priority exists. The priority array in Figure 18 contains three processes with priority 2 and one process with priority 5. Each priority's flag bit corresponds to a linked list in the list_head array. The O(1) scheduler uses these data structures as follows:

Call sched_find_first_bit to find the highest priority with runnable processes and allocate CPU resources according to priority;
call schedule to select a process from the head of that priority's linked list for execution;
processes of the same priority are scheduled round-robin: after each selection, schedule appends the chosen process to the end of its queue, which ensures that processes of the same priority are executed in turn (Round-Robin);
the timer triggers the scheduler_tick function every 1ms; if the current process has exhausted its time slice, it is moved to the expired array, and when the active queue has no runnable processes left, schedule swaps the active and expired priority arrays.

These are the main rules by which the O(1) scheduler operates. Beyond them, the scheduler also has to support preemption, CPU affinity and other features, which are not covered here.
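The reason the lookup takes constant time is that the number of priorities is fixed at 140, so finding the first set bit of the bitmap and taking the head of that priority's list never depend on how many tasks exist. The following Go code is a simplified user-space simulation of that idea, not kernel code:

package scheduling

import "math/bits"

const numPrio = 140 // fixed number of priorities, as in the O(1) scheduler

// prioArray mimics prio_array: one FIFO queue per priority plus a bitmap that
// marks which priorities currently have runnable tasks.
type prioArray struct {
    bitmap [3]uint64      // 140 bits fit into three 64-bit words
    queue  [numPrio][]int // task IDs queued at each priority
}

func (a *prioArray) enqueue(prio, task int) {
    a.queue[prio] = append(a.queue[prio], task)
    a.bitmap[prio/64] |= 1 << (prio % 64)
}

// pickNext finds the lowest-numbered (i.e. highest) non-empty priority with a
// few bit operations and pops the head of its queue; the cost does not grow
// with the number of tasks in the system.
func (a *prioArray) pickNext() (int, bool) {
    for word := 0; word < len(a.bitmap); word++ {
        if a.bitmap[word] == 0 {
            continue
        }
        prio := word*64 + bits.TrailingZeros64(a.bitmap[word])
        task := a.queue[prio][0]
        a.queue[prio] = a.queue[prio][1:]
        if len(a.queue[prio]) == 0 {
            a.bitmap[word] &^= 1 << (prio % 64)
        }
        return task, true
    }
    return 0, false
}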

Local run queue

The global run queue is the main reason the earlier scheduler scaled poorly on symmetric multiprocessor architectures: to keep the run queue consistent, the scheduler has to take a global lock during scheduling, and as the number of processors grows, lock contention grows with it, severely hurting scheduling performance. The O(1) scheduler solves this by introducing local run queues: each CPU obtains the run queue bound to it via this_rq, which reduces lock granularity and the chance of conflicts.

#define this_rq() (&__get_cpu_var(runqueues))

Figure 19-Global and local run queues

Multiple processors no longer need to share a global run queue, which improves scalability on symmetric multiprocessor architectures: when we add a new processor, we only need to add a new run queue, and this approach does not introduce more lock contention.

Priority and time slicing

The O(1) scheduler has two kinds of priority: static task priority and dynamic task priority. By default a task's static priority is 0, but we can change it by calling nice. The scheduler rewards I/O-intensive tasks and penalizes CPU-intensive tasks, adjusting the dynamic priority on top of the static priority; because the processes that interact with users are I/O-intensive, this dynamic policy improves the user experience.

Completely Fair Scheduler

The Completely Fair Scheduler (CFS) was merged into the kernel in v2.6.23 and is the kernel's default process scheduler; its goal is to maximize CPU utilization and interactive performance. In Linux kernel v2.6.23, CFS consists of the following files:

include/linux/sched.h
kernel/sched_stats.h
kernel/sched.c
kernel/sched_fair.c
kernel/sched_idletask.c
kernel/sched_rt.c

As its name suggests, CFS tries to be completely fair to all processes: whenever some process has been treated unfairly, the scheduler runs it, thereby keeping every process's running time fair. This way of ensuring fairness is like the saying "if there is too much flour, add water; if there is too much water, add flour":

The scheduler finds the process in the run queue that has been treated most unfairly and allocates computing resources to it; the amount allocated is the gap between its running time and that of the other processes plus the minimum unit of time a process can run. When that process finishes its slice, another process in the run queue has become the most unfairly treated one, and the scheduler runs that process next.

The scheduling algorithm keeps track of each process's running time and repeatedly schedules the most unfairly treated process in the queue, ensuring that the difference in running time between any two processes never exceeds the minimum running time unit.
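The rule is easy to state in code: always pick the runnable process with the smallest accumulated runtime, let it run for the minimum unit, and repeat. The sketch below simulates the rule with a plain slice; CFS itself keeps the tasks sorted in a red-black tree (and, in current kernels, orders them by virtual runtime), as described next:

package scheduling

// fairTask is a runnable process in the simulation: a name plus the total
// time it has run so far, counted in the minimum schedulable unit.
type fairTask struct {
    name    string
    runtime int
}

// pickMostUnfair returns the index of the task that has run the least,
// i.e. the one that has been treated most unfairly.
func pickMostUnfair(tasks []fairTask) int {
    best := 0
    for i, t := range tasks {
        if t.runtime < tasks[best].runtime {
            best = i
        }
    }
    return best
}

// step performs one scheduling decision: the chosen task runs for one minimum
// time unit, which keeps the runtime gap between any two tasks bounded.
func step(tasks []fairTask) string {
    i := pickMostUnfair(tasks)
    tasks[i].runtime++
    return tasks[i].name
}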

Data structure

Although we will keep using the term run queue, CFS no longer stores processes in a queue internally: cfs_rq is a new structure for managing runnable processes, and it uses a red-black tree (Red-black tree) instead of a linked list:

struct cfs_rq {
    struct load_weight load;
    unsigned long nr_running;

    s64 fair_clock;
    u64 exec_clock;
    s64 wait_runtime;
    u64 sleeper_bonus;
    unsigned long wait_runtime_overruns, wait_runtime_underruns;

    struct rb_root tasks_timeline;
    struct rb_node *rb_leftmost;
    struct rb_node *rb_load_balance_curr;

    struct sched_entity *curr;
    struct rq *rq;
    struct list_head leaf_cfs_rq_list;
};

A red-black tree (Red-black tree) is a self-balancing binary search tree: the worst-case time complexity of insertion, deletion, modification and lookup is proportional to the height of the tree. The leftmost node in the tree, rb_leftmost, has the smallest running time and is the next process to run.

Note: in the latest version of the CFS implementation, the kernel uses virtual runtime vruntime instead of wait time, but the basic scheduling principles and sorting methods have not changed much.

Scheduling process

The scheduling process of CFS is completed by the schedule function, and the execution of this function can be divided into the following steps:

Disable preemption on the current CPU;
if the current CPU's run queue has no tasks, call idle_balance to take part of another CPU's run queue to execute;
call pick_next_task to select the highest-priority task in the red-black tree;
call context_switch to switch the running context, including the register state and the stack;
re-enable preemption on the current CPU.

The CFS scheduling process is very similar to that of the O(1) scheduler; the differences are the addition of an optional work-stealing mechanism and the change of the underlying data structure.

Scheduling class

Scheduling class in CFS is an interesting concept, which can determine the scheduling strategy of a process. Each scheduling class contains a set of functions responsible for scheduling, and the scheduling class is represented by the sched_class structure shown below:

struct sched_class {
    struct sched_class *next;

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    void (*yield_task) (struct rq *rq, struct task_struct *p);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

    unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
            struct rq *busiest,
            unsigned long max_nr_move, unsigned long max_load_move,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *all_pinned, int *this_best_prio);

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p);
    void (*task_new) (struct rq *rq, struct task_struct *p);
};

The scheduling class contains functions for initializing, queuing, and dequeuing tasks, and the design here is slightly similar to that in object-oriented design. The kernel contains SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE, SCHED_FIFO and SCHED_RR scheduling classes, which implement functions in sched_class to provide different scheduling behaviors.

3. Summary

This section introduced the design principles and evolution of the operating system scheduler. A long time has passed since CFS was merged in 2007; the current scheduler has become more complex, and the community keeps improving the process scheduler.

From the evolution of the Linux scheduler we can see the change in mainstream system architecture: at first a scheduler of a few dozen lines of code could provide basic scheduling, whereas now tens of thousands of lines are needed to carry out complex scheduling that guarantees low latency and high throughput.

Due to limited space, it is difficult to analyze the operating system scheduler exhaustively here; you can find the Linux source code used by the author and analyze the process schedulers of different versions yourself.

4. Extended reading

Understanding the Linux 2.6.8.1 CPU Scheduler
CFS Scheduler
Inside the Linux 2.6 Completely Fair Scheduler
The Linux desktop may soon be a lot faster
Modular Scheduler Core and Completely Fair Scheduler
The Linux Scheduler: A Decade of Wasted Cores

Go language

Go is a programming language born in 2009. I believe many people have the impression that Go is syntactically simple and can support high concurrency services. Syntax simplicity is the top-level design philosophy of programming languages, and the high concurrency support of languages depends on the runtime scheduler, which is what this section will study.

Anyone with some knowledge of Go knows that its concurrency model is influenced by Communicating Sequential Processes (CSP), where Goroutines and Channels are respectively the entities and the medium of communication.

Figure 20-concurrency model for Go and Erlang

"Don't communicate through shared memory, we should use communication to share memory" is not only the design philosophy encouraged by the Go language, the older Erlang language actually follows the same design, but Erlang chose to use the Actor model, so we will not introduce the difference and relationship between CSP and Actor here. Interested readers can find relevant resources in recommended reading and references.

1. Design and evolution

Today's Go language scheduler has excellent performance, but if we look back at the v0.x version of the Go language scheduler, we will find that the original scheduler was very crude and could not support highly concurrent services. It took several large versions of iterations for the entire scheduler to achieve today's excellent performance.

Single-threaded scheduler, 0.x - source code. Contains only about 40 lines of code; only a single active thread, made up of the G-M model.

Multithreaded scheduler, 1.0 - source code. Introduced multithreaded scheduling; the global lock caused serious contention.

Work-stealing scheduler, 1.1 - source code. Introduced processor P, forming the current G-M-P model; implemented a work-stealing scheduler on top of processor P; in some cases a Goroutine would not yield the thread, causing starvation; an overly long program pause (Stop-the-world, STW) could leave the program unable to work.

Preemptive scheduler, 1.2 ~ present - source code. Cooperation-based preemptive scheduling, 1.2 ~ 1.13: the compiler inserts check instructions at function calls to implement cooperation-based preemption; garbage collection and tight loops could still let a Goroutine hold resources for a long time and pause the program. Signal-based preemptive scheduling, 1.14 ~ present: implements true signal-based preemption; preemption is triggered when garbage collection scans stacks; the preemption points are still not enough to cover all edge cases.

Non-uniform memory access scheduler, proposal. Partitions the various runtime resources; the implementation is so complex that it has not been put on the agenda to this day.

In addition to multithreading, task theft, and preemptive schedulers, the Go language community currently has a proposal for a non-uniform memory access (Non-uniform memory access,NUMA) scheduler that may one day be implemented by the Go language. In this section, we will in turn introduce the implementation of different versions of the scheduler and the scheduler proposals that may be implemented in the future.

Single thread scheduler

In the 0.x scheduler, Go contained only two structures: G for Goroutines and M for threads, and there was only one thread globally. We can find the source code of the single-threaded scheduler in the clean up scheduler commit. At that time the Go scheduler was implemented in C, and the scheduling function scheduler contained only about 40 lines of code:

static void
scheduler(void)
{
    G* gp;
    lock(&sched);

    if(gosave(&m->sched)){
        lock(&sched);
        gp = m->curg;
        switch(gp->status){
        case Grunnable:
        case Grunning:
            gp->status = Grunnable;
            gput(gp);
            break;
        ...
        }
        notewakeup(&gp->stopped);
    }

    gp = nextgandunlock();
    noteclear(&gp->stopped);
    gp->status = Grunning;
    m->curg = gp;
    g = gp;
    gogo(&gp->sched);
}

The function follows the procedure shown below:

Acquire the scheduler's global lock;
call gosave to save the stack registers and the program counter;
call nextgandunlock to get the next Goroutine that thread M needs to run and unlock the scheduler;
modify the Goroutine to be executed on the global thread m;
call the gogo function to run the latest Goroutine.

The only advantage of this single-threaded scheduler is its ability to run, but from this submission we can see two important data structures, G and M, which build the framework of the Go language scheduler.

Multithreaded scheduler

When Go 1.0 was officially released it already supported a multithreaded scheduler; compared with the completely unusable scheduler of the previous version, the Go team completed the step from unusable to usable at this stage. We can find the 1.0.1 version of the scheduler in proc.c. The multithreaded version of the schedule function contains more than 70 lines of code; the core logic is kept here:

static void
schedule(G *gp)
{
    schedlock();
    if(gp != nil){
        gp->m = nil;
        uint32 v = runtime·xadd(&runtime·sched.atomic, -1<<mcpuShift);
        ...
        switch(gp->status){
        case Grunning:
            gp->status = Grunnable;
            gput(gp);
            break;
        case ...:
        }
    } else {
        ...
    }

    gp = nextgandunlock();
    gp->status = Grunning;
    m->curg = gp;
    gp->m = m;
    runtime·gogo(&gp->sched, 0);
}

The overall logic is not much different from that of the single-threaded scheduler, but this version introduces the GOMAXPROCS variable to control the maximum number of threads in the program, so several active threads can exist at the same time.

The main problem of the multithreaded scheduler was lock contention during scheduling: the performance tests in Scalable Go Scheduler Design Doc found 14% of time being spent in the runtime.futex function. The scheduler implementation at the time had the following problems to solve:

A single global scheduler and a global lock: all scheduling state was centralized, creating lock contention;
threads often had to hand runnable Goroutines to each other, introducing latency and extra overhead;
every thread had to handle its own memory cache, leading to high memory usage and poor data locality (Data locality);
system calls frequently blocked and unblocked running threads, adding extra overhead.

The global locking problem here is similar to that encountered by the Linux operating system scheduler in the early days, and the solutions are more or less the same.

Work-stealing scheduler

In 2012, Dmitry Vyukov, an engineer at Google, pointed out the problems of the existing multithreaded scheduler in Scalable Go Scheduler Design Doc and proposed two improvements to it:

Introduce processor P into the G-M model; implement a work-stealing scheduler on top of processor P.

The work-stealing Go scheduler uses the G-M-P model that is still in use today. We can find the source code of the first work-stealing scheduler in the runtime: improved scheduler commit; its schedule function is even simpler than before:

static void
schedule(void)
{
    G *gp;
 top:
    if(runtime·gcwaiting) {
        gcstopm();
        goto top;
    }

    gp = runqget(m->p);
    if(gp == nil)
        gp = findrunnable();

    ...

    execute(gp);
}

If the runtime is waiting for garbage collection, gcstopm is called; runqget and findrunnable are called to obtain the Goroutine to execute from the local or the global run queue; finally, execute runs the Goroutine on the current thread M.

When the current processor's run queue contains no Goroutines, calling findrunnable triggers work stealing and randomly takes some Goroutines from the queues of other processors.

Processor P introduced in the runtime G-M-P model is the intermediate layer between thread M and Goroutine. From its structure, we can see the relationship between P and M and G.

struct P
{
    Lock;

    uint32 status;  // one of Pidle/Prunning/...
    P*  link;
    uint32 tick;    // incremented on every scheduler or system call
    M*  m;          // back-link to associated M (nil if idle)
    MCache* mcache;

    G** runq;
    int32 runqhead;
    int32 runqtail;
    int32 runqsize;

    G*  gfree;
    int32 gfreecnt;
};

Processor P holds a run queue runq, an array of runnable Goroutines, and it also holds a back-pointer to its thread M. When scheduling, the scheduler takes the Goroutine at the head of the processor's queue and puts it on thread M. The figure below shows the relationship between threads M, processors P and Goroutines in Go.

Figure 21-G-M-P model

The work-stealing multithreaded scheduler binds each thread to a separate CPU and manages tasks separately through the different processors, redistributing tasks among them; this improved the overall performance of the scheduler and of Go programs. The high performance of today's Go services benefits from this change.

Preemptive scheduler

The changes to Go's concurrency model improved the scheduler's performance, but the 1.1 scheduler still did not support preemptive scheduling: a program could only rely on Goroutines voluntarily giving up CPU resources. Go introduced cooperation-based preemptive scheduling in version 1.2 to solve the following problems:

A single Goroutine could keep running on a thread without ever switching to another Goroutine, causing starvation;
garbage collection needs to pause the whole program (Stop-the-world, STW), and without preemption this wait could last several minutes, leaving the whole program unable to work.

However, the preemption implemented in version 1.2 is cooperation-based. For a long time Go's scheduler had edge cases in which preemption was impossible, and it was not until 1.14 that true signal-based preemption solved part of the problem.

Preemptive scheduling based on collaboration

We can find the scheduler implementation after preemption was introduced in the proc.c file. Go implemented preemption on top of the segmented-stack mechanism of the time: every Goroutine gets a chance to enter the runtime and check whether it needs to be preempted when it makes a function call. Cooperation-based preemption was implemented through the following commits:

runtime: mark runtime.goexit as nosplit;

runtime: add stackguard0 to G. Introduces the stackguard0 field on the Goroutine; when the field is set to StackPreempt, the Goroutine will be preempted;

runtime: introduce preemption function (not used for now). Introduces the preemption functions preemptone and preemptall, which set a Goroutine's stackguard0 to StackPreempt, and introduces the preemption request StackPreempt;

runtime: preempt goroutines for GC. Calls preemptall in runtime·stoptheworld during garbage collection to set StackPreempt on the Goroutines of all processors, and adds preemption code to runtime·newstack so that the scheduler's preemption is triggered when stackguard0 equals StackPreempt;

runtime: preempt long-running goroutines. In system monitoring, if a Goroutine has been running for more than 10ms, retake and preemptone are called;

runtime: more reliable preemption. Fixes the issue where a Goroutine would not be preempted because it periodically performed non-blocking CGO or system calls.

From this series of commits we can see that the Go runtime issues a preemption request (StackPreempt) when garbage collection pauses the program or when system monitoring finds a Goroutine running for more than 10ms. Because the compiler inserts runtime.newstack at function calls, runtime.newstack checks at each call whether the Goroutine's stackguard0 is StackPreempt and, if so, triggers preemption and gives up the current thread.

This approach adds little extra overhead at run time and is relatively simple to implement, although it increases the complexity of the runtime; overall it was a fairly successful design. Because preemption here is achieved by the compiler inserting functions at specific points and requires a function call as the entry point to trigger it, it is cooperation-based preemptive scheduling.
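The mechanism can be imitated in user space: something sets a flag, and a task checks the flag at its own safe points and yields when asked. In the sketch below the explicit checkPreempt call at each loop iteration stands in for the checks the Go compiler inserts at function calls; this is only an analogy for cooperation-based preemption, not the runtime's actual code:

package scheduling

import (
    "runtime"
    "sync/atomic"
)

// preemptRequested plays the role of stackguard0 being set to StackPreempt:
// a supervisor stores 1 here when it wants the worker to yield.
var preemptRequested atomic.Int32

// checkPreempt is the "inserted check": workers call it at their safe points.
func checkPreempt() {
    if preemptRequested.Load() == 1 {
        preemptRequested.Store(0)
        runtime.Gosched() // voluntarily hand the processor back to the scheduler
    }
}

// worker yields only when it reaches a check; a loop that never reaches one
// can run forever, which is exactly the edge case cooperative preemption
// cannot cover and signal-based preemption later addressed.
func worker(iterations int) {
    for i := 0; i < iterations; i++ {
        // ... one unit of work ...
        checkPreempt()
    }
}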

Signal-based preemptive scheduling

Although cooperative preemption is implemented ingeniously, it leaves many edge cases; some of the remaining problems are listed in runtime: non-cooperative goroutine preemption:

runtime: tight loops should be preemptible #10958
An empty for{} will block large slice allocation in another goroutine, even with GOMAXPROCS > 1? #17174
runtime: tight loop hangs process completely after some time #15442

Go implemented non-cooperative preemptive scheduling in version 1.14. The implementation restructured existing logic and added new states and fields to the Goroutine to support preemption. The Go team implemented this feature through the following commits, and we can understand how it works in the order in which they were submitted:

runtime: add general suspendG/resumeG. Suspending a Goroutine is done during stack scanning; the stack-scan process is restructured around the functions runtime.suspendG and runtime.resumeG. When runtime.suspendG is called, the preemptStop of a running Goroutine is marked as true. The runtime.preemptPark function can suspend the current Goroutine, update its status to _Gpreempted and trigger rescheduling, handing over control of the thread;

runtime: asynchronous preemption function for x86. Adds the asynchronous preemption functions runtime.asyncPreempt and runtime.asyncPreempt2 for the x86 architecture;

runtime: use signals to preempt Gs for suspendG. Supports suspending a running Goroutine by sending a signal to its thread; registers the SIGURG signal handler runtime.doSigPreempt in runtime.sighandler; implements the runtime.preemptM function, which sends preemption requests to threads;

runtime: implement async scheduler preemption. Modifies the implementation of runtime.preemptone to add the asynchronous preemption logic.

The current preemptive scheduling is only triggered by the garbage collector's stack-scanning task. We can walk through the process that triggers it:

1. When the program starts, the SIGURG signal handler runtime.doSigPreempt is registered in runtime.sighandler;
2. when garbage collection triggers a stack scan, runtime.suspendG is called to suspend the Goroutine: a Goroutine in the _Grunning state is marked as preemptible, i.e. preemptStop is set to true, and runtime.preemptM is called to trigger preemption;
3. runtime.preemptM calls runtime.signalM to send the SIGURG signal to the thread;
4. the operating system interrupts the running thread and executes the pre-registered signal handler runtime.doSigPreempt;
5. runtime.doSigPreempt handles the preemption signal, obtains the current SP and PC registers and calls runtime.sigctxt.pushCall;
6. runtime.sigctxt.pushCall modifies the registers so that, when the program returns to user space, execution starts from runtime.asyncPreempt;
7. the assembly function runtime.asyncPreempt calls the runtime function runtime.asyncPreempt2;
8. runtime.asyncPreempt2 calls the runtime.preemptPark function;
9. runtime.preemptPark changes the state of the current Goroutine to _Gpreempted and calls runtime.schedule to put the current function to sleep and yield the thread; the scheduler chooses another Goroutine to continue execution.

The nine steps above show how signal-based preemptive scheduling executes. We should also discuss the choice of signal: the proposal chose SIGURG as the signal to trigger asynchronous preemption for the following four reasons:

The signal needs to be passed through transparently by debuggers;
it must not be used and intercepted by internal libc libraries;
the signal can occur at any time without triggering any consequences;
we need to handle the signal on multiple platforms.

The current preemptive scheduling does not solve all the potential problems. Because problems are more likely to occur during STW and stack scanning, and these are also safe points (Safe-points) where preemption is allowed, preemption was added there first; more preemption points may be added in the future.

Non-uniform memory access scheduler

At present, the non-uniform memory access (Non-uniform memory access, NUMA) scheduler is only a proposal for Go; because the proposal is too complex and the current scheduler's performance is good enough, it has not been implemented for the time being. The principle of the proposal is to split global resources so that each processor accesses nearby local resources, reducing lock contention and increasing data locality.

In the current runtime, threads, processors, the network poller, run queues, global memory allocator state, memory allocation caches and the garbage collector are all global resources. The runtime gives no guarantee of locality and has no awareness of the system topology; some of the structures can provide a degree of locality, but there is no such guarantee from a global point of view.

Figure 22-Go language NUMA Scheduler

As shown in the figure above, stacks, global run queues and thread pools are partitioned by NUMA node, while the network poller and timers are held by a single processor. Although this approach can exploit locality to improve scheduler performance, the implementation itself is too complex, so the Go team has not yet begun to implement this proposal.

2. Summary

The Go language scheduler iterated quickly in its first few versions, but it changed little from version 1.2 until version 1.14 introduced true preemptive scheduling, which solved a problem that had existed since 1.2. In the foreseeable future, the Go scheduler will evolve further, adding more preemption points and reducing the remaining edge cases.

This section is excerpted from the scheduler chapter of the book "Go Language Design and Implementation"; you can follow the link to learn more about the design and implementation principles of the Go language.

3. Extended reading

How Erlang does scheduling
Analysis of the Go runtime scheduler
Go's work-stealing scheduler
runtime: clean up async preemption loose ends
The Go netpoller
System Calls Make the World Go Round
Linux Syscall Reference

Kubernetes

Kubernetes is a production-grade container scheduling and management system. Over the past few years it has rapidly captured the market and become the de facto standard in the field of container orchestration.

Figure 23-Container orchestration system evolution

Kubernetes, which means "helmsman" in Greek, was originally created by several Google software engineers and was deeply influenced by the company's internal Borg and Omega projects; many of its designs were borrowed from Borg while improving on Borg's defects. Kubernetes is currently a project of the Cloud Native Computing Foundation (CNCF) and the solution many companies use to manage their distributed systems.

The scheduler is a core component of Kubernetes; its main function is to bind a Pod, the workload to be run, to a Node that will run it. Unlike other scheduling scenarios, although resource utilization matters in Kubernetes, it is only one of the factors the scheduler cares about: it has to support the many complex business requirements of container orchestration. In addition to the availability of CPU and memory, it must consider other domain-specific constraints, for example that two services cannot occupy the same port on the same machine, that several services should run on the same machine, or that resources should be scheduled according to node type.

These complex business scenarios and scheduling requirements make the internal design of the Kubernetes scheduler quite different from other schedulers, yet as a scheduler at the user application layer it offers many patterns and designs worth learning from. Next, this section describes the design and evolution of the Kubernetes scheduler.

1. Design and evolution

The evolution of the Kubernetes scheduler is relatively simple, and we can divide its evolution into two stages:

Scheduler based on predicates and priorities: v1.0.0 ~ v1.14.0
Scheduler based on the scheduling framework: v1.15.0 ~ present

From v1.0.0 to v1.14.0, Kubernetes released a total of 15 versions that all used predicates and priorities to manage the different scheduling algorithms; v1.15.0 began introducing the scheduling framework (as Alpha functionality) to restructure the existing scheduler. Here we analyze the evolution of the scheduler using the predicates and priorities of v1.14.0 and the scheduling framework of v1.17.0.

Predicate and priority algorithm

The predicate (Predicates) and priority (Priorities) scheduler is a design that has existed since Kubernetes v1.0.0 was released, and the final implementation in v1.14.0 is not much different from the original design. However, many improvements were introduced between v1.0.0 and v1.14.0:

Scheduler extender (v1.2.0) - Scheduler extension: changes the scheduler's decisions by calling an external scheduler extension;

Map-Reduce priority algorithm (v1.5.0) - MapReduce-like scheduler priority functions: lets the scheduler's priority algorithms support a Map-Reduce style of computation and optimizes scheduling performance by introducing a parallel Map phase;

Scheduler migration (v1.10.0) - Move scheduler code out of plugin directory: the code is moved from plugin/pkg/scheduler to pkg/scheduler, and kube-scheduler becomes an executable provided directly to the outside world.

Predicate and priority are two abstractions provided by Kubernetes in the scheduling system. The predicate algorithm uses the FitPredicate type, while the priority algorithm uses PriorityMapFunction and PriorityReduceFunction:

type FitPredicate func(pod *v1.Pod, meta PredicateMetadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error)

type PriorityMapFunction func(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error)

type PriorityReduceFunction func(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, result schedulerapi.HostPriorityList) error

v1.14.0 was also the first version in which the author participated in Kubernetes development, so the design of that time left a deep impression. The v1.14.0 Kubernetes scheduler computes the score of every node with the Map-Reduce style PriorityMapFunction and PriorityReduceFunction and selects the node with the highest score. The following figure shows how the scheduler executes in v1.14.0:

Figure 24-predicate and priority algorithm

As shown in the figure above, assume the scheduler has one predicate algorithm and one Map-Reduce priority algorithm. When choosing the most suitable of six nodes for a Pod, the six nodes are first filtered; the predicate algorithm in the figure filters out half of them, and the remaining three nodes receive scores of 5, 10, and 5 through the Map and Reduce phases. The scheduler finally selects node 4, which has the highest score.
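
The flow in the figure can be sketched with deliberately simplified stand-in types. The real algorithms implement the FitPredicate, PriorityMapFunction, and PriorityReduceFunction signatures shown above and operate on *v1.Pod and NodeInfo; the toy types pod, node, and the schedule function below are made up for illustration and only show the shape of the pipeline: filter with predicates, score the survivors in a Map phase, normalize in a Reduce phase, and pick the highest score.

package main

import "fmt"

// Simplified stand-ins for the real Kubernetes types; actual predicates and
// priorities receive *v1.Pod and schedulernodeinfo.NodeInfo values.
type pod struct{ cpuRequest int }

type node struct {
	name    string
	freeCPU int
}

// predicate plays the role of a FitPredicate: can this node run the pod at all?
type predicate func(p pod, n node) bool

// mapFn and reduceFn play the roles of PriorityMapFunction and
// PriorityReduceFunction in this toy model.
type mapFn func(p pod, n node) int

type reduceFn func(scores map[string]int)

func schedule(p pod, nodes []node, preds []predicate, m mapFn, r reduceFn) (string, bool) {
	// Predicate phase: filter out nodes that cannot run the pod.
	var feasible []node
	for _, n := range nodes {
		fits := true
		for _, pred := range preds {
			if !pred(p, n) {
				fits = false
				break
			}
		}
		if fits {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return "", false
	}

	// Map phase: score every feasible node independently (done in parallel by
	// the real scheduler); Reduce phase: normalize the scores as a whole.
	scores := make(map[string]int, len(feasible))
	for _, n := range feasible {
		scores[n.name] = m(p, n)
	}
	r(scores)

	// Select the node with the highest score, like genericScheduler.selectHost.
	best, bestScore := "", -1
	for name, s := range scores {
		if s > bestScore {
			best, bestScore = name, s
		}
	}
	return best, true
}

func main() {
	nodes := []node{{"node-1", 2}, {"node-2", 8}, {"node-3", 4}}
	p := pod{cpuRequest: 3}

	enoughCPU := predicate(func(p pod, n node) bool { return n.freeCPU >= p.cpuRequest })
	byFreeCPU := mapFn(func(p pod, n node) int { return n.freeCPU })
	capAt10 := reduceFn(func(scores map[string]int) {
		for name, s := range scores {
			if s > 10 {
				scores[name] = 10
			}
		}
	})

	// node-1 is filtered out by the predicate; node-2 ends up with the highest score.
	if host, ok := schedule(p, nodes, []predicate{enoughCPU}, byFreeCPU, capAt10); ok {
		fmt.Println("selected", host)
	}
}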

genericScheduler.Schedule is the method Kubernetes uses to select a node for a Pod. The code that checks boundary conditions and records metrics has been omitted from the listing below:

func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (result ScheduleResult, err error) {
	nodes, err := nodeLister.List()
	if err != nil {
		return result, err
	}
	if len(nodes) == 0 {
		return result, ErrNoNodesAvailable
	}

	filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
	if err != nil {
		return result, err
	}

	...

	priorityList, err := PrioritizeNodes(pod, g.nodeInfoSnapshot.NodeInfoMap, ..., g.prioritizers, filteredNodes, g.extenders)
	if err != nil {
		return result, err
	}

	host, err := g.selectHost(priorityList)
	return ScheduleResult{
		SuggestedHost:  host,
		EvaluatedNodes: len(filteredNodes) + len(failedPredicateMap),
		FeasibleNodes:  len(filteredNodes),
	}, err
}

1. Get all the nodes that currently exist in the system from NodeLister;
2. Call the genericScheduler.findNodesThatFit method to execute all the predicate algorithms in parallel and filter the nodes; the predicate algorithms filter based on the incoming Pod and Node, removing nodes with conflicting port numbers or insufficient resources, and the Filter method of every scheduler extender is called to assist the filtering;
3. Call the PrioritizeNodes function to score all the nodes: execute the PriorityMapFunction of each priority in parallel with the Pod and Node as parameters; call the PriorityReduceFunction of each priority with the Pod and the mapping from Node to score returned by that priority as parameters; call the Prioritize method of every scheduler extender; add up all the scores according to their weights and return the mapping from Node to score;
4. Call the genericScheduler.selectHost method to select the node with the highest score.

This is the scheduling process when predicates and priority algorithms are used. Here we have omitted the sorting of the scheduler's priority queue, the preemption that happens when scheduling fails, and the process of binding a Pod's persistent storage volumes to the Node, keeping only the core scheduling logic.

Scheduling framework

The Kubernetes scheduling framework is the latest scheduler design, proposed by Babak Salamat and Jonathan Basseri in 2018. The proposal defines the scheduling stages of Kubernetes and provides a well-designed plug-in-based interface. The scheduling framework views scheduling in Kubernetes as two cycles, Scheduling and Binding:

The scheduling cycle selects the most appropriate Node for the Pod among multiple Nodes; the binding cycle applies the scheduling decision to the cluster, including binding the Pod to the Node and binding persistent storage.

In addition to these two large cycles, the scheduling framework also contains 11 extension points (Extension Points): QueueSort, PreFilter, Filter, PostFilter, Score, Reserve, Permit, PreBind, Bind, PostBind, and Unreserve. They are triggered during the scheduling process and run in the following order:

Figure 25-Kubernetes scheduling framework
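
The plug-in-based design can be sketched with simplified stand-in interfaces. The real extension-point interfaces live in the scheduling framework's framework package and receive a CycleState, a *v1.Pod, and node information; the filterPlugin, scorePlugin, framework, and preferredNode names below are made up for illustration and only show how plug-ins register for extension points and are called in order.

package main

import "fmt"

// Simplified stand-ins for the framework's plug-in interfaces; the real ones
// receive a CycleState, a *v1.Pod, and node information and return a Status.
type filterPlugin interface {
	Name() string
	Filter(pod, node string) bool
}

type scorePlugin interface {
	Name() string
	Score(pod, node string) int
}

// framework keeps the plug-ins registered for each extension point and runs
// them in order, mirroring the Filter and Score parts of the scheduling cycle.
type framework struct {
	filters []filterPlugin
	scorers []scorePlugin
}

func (f *framework) runFilterPlugins(pod, node string) bool {
	for _, p := range f.filters {
		if !p.Filter(pod, node) {
			return false
		}
	}
	return true
}

func (f *framework) runScorePlugins(pod, node string) int {
	total := 0
	for _, p := range f.scorers {
		total += p.Score(pod, node)
	}
	return total
}

// preferredNode is a toy plug-in that implements both extension points.
type preferredNode struct{ want string }

func (preferredNode) Name() string { return "ToyPreferredNode" }

func (preferredNode) Filter(_, node string) bool { return node != "" }

func (p preferredNode) Score(_, node string) int {
	if node == p.want {
		return 10
	}
	return 0
}

func main() {
	fwk := &framework{
		filters: []filterPlugin{preferredNode{want: "node-2"}},
		scorers: []scorePlugin{preferredNode{want: "node-2"}},
	}
	for _, node := range []string{"node-1", "node-2"} {
		if fwk.runFilterPlugins("nginx", node) {
			fmt.Println(node, "score:", fwk.runScorePlugins("nginx", node))
		}
	}
}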

We can use the scheduler's Scheduler.scheduleOne method as the entry point to analyze the scheduler implementation based on the scheduling framework. Each call to this method completes the whole process of scheduling a node for a Pod. We divide the execution of this function into two phases, scheduling and binding. First, the scheduling phase:

func (sched *Scheduler) scheduleOne(ctx context.Context) {
	fwk := sched.Framework

	podInfo := sched.NextPod()
	pod := podInfo.Pod

	state := framework.NewCycleState()
	scheduleResult, _ := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)

	assumedPod := podInfo.DeepCopy().Pod

	allBound, err := sched.VolumeBinder.Binder.AssumePodVolumes(assumedPod, scheduleResult.SuggestedHost)
	if err != nil {
		return
	}

	if sts := fwk.RunReservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
		return
	}

	if err := sched.assume(assumedPod, scheduleResult.SuggestedHost); err != nil {
		fwk.RunUnreservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		return
	}
	...
}

1. Call the function returned by the internal priority queue's MakeNextPodFunc to get the next Pod waiting to be scheduled from the queue; the queue of waiting Pods is maintained using the QueueSort plug-in;
2. Call the genericScheduler.Schedule function to select a node; this process executes the PreFilter, Filter, PostFilter, and Score plug-ins;
3. Call the framework.RunReservePlugins function to run the Reserve plug-ins and reserve resources before entering the binding phase (the binding phase takes a long time, and reserving prevents the resources from being preempted in the meantime); if execution fails, call the framework.RunUnreservePlugins function to run the Unreserve plug-ins.
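
As a small illustration of the QueueSort step mentioned above, the sketch below mimics the idea behind the default sorting behavior with simplified types: Pods with a higher priority are scheduled first, and Pods with equal priority are ordered by how long they have been waiting. The queuedPod type and less function are made up for illustration; the real plug-in operates on the framework's queued Pod info types.

package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a simplified stand-in for the information the scheduling queue
// keeps about a pending Pod.
type queuedPod struct {
	name     string
	priority int32
	addedAt  time.Time
}

// less mirrors the default queue-sorting idea: Pods with a higher priority come
// first, and Pods with equal priority are ordered by how long they have waited.
func less(a, b queuedPod) bool {
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.addedAt.Before(b.addedAt)
}

func main() {
	now := time.Now()
	queue := []queuedPod{
		{name: "batch-job", priority: 0, addedAt: now.Add(-time.Minute)},
		{name: "critical-api", priority: 100, addedAt: now},
		{name: "web", priority: 0, addedAt: now.Add(-2 * time.Minute)},
	}

	sort.Slice(queue, func(i, j int) bool { return less(queue[i], queue[j]) })

	for _, p := range queue {
		fmt.Println(p.name) // critical-api, web, batch-job
	}
}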

Because each scheduling decision changes the scheduling context, this phase must be executed serially in Kubernetes. The binding phase is where the scheduling decision is carried out, and a new Goroutine is created to execute the binding cycle in parallel:

func (sched *Scheduler) scheduleOne(ctx context.Context) {
	...
	go func() {
		bindingCycleCtx, cancel := context.WithCancel(ctx)
		defer cancel()

		fwk.RunPermitPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)

		if !allBound {
			sched.bindVolumes(assumedPod)
		}

		fwk.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		if err := sched.bind(bindingCycleCtx, assumedPod, scheduleResult.SuggestedHost, state); err != nil {
			fwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		} else {
			fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
		}
	}()
}
