The load level in Linux does not exactly correspond to CPU overhead.

2025-07-19 Update From: SLTechnology News&Howtos

This article comes from the WeChat official account "Developing Internal Skills Practice" (ID: kfngxl). Author: Zhang Yanfei (allen)

Hello, everyone. I'm Brother Fei!

Load is one of the most commonly used performance indicators when checking the state of a Linux server. When we look at how an online server is doing, we often pull up the load and have a look. When the request pressure is too high, it is usually accompanied by a high load.

But do you really understand how load works? Here are a few questions to check whether your understanding of load is deep enough.

How is the load calculated?

Is there a positive correlation between load and CPU consumption?

How does the kernel expose load data to the application layer?

If your understanding of the above questions is not very accurate, then Brother Fei will take you to learn more about the load in Linux today!

1. Understanding how the load is viewed

We often use the top command to check the load of a Linux system. The load line of a typical top output is shown below.

    # top
    Load Avg: 1.25, 1.30, 1.95

The Load Avg in the output is what we usually call the load, also known as the system load average. A single instantaneous load value does not mean much, so Linux calculates averages over a period of time. The three numbers are the load averages for the past 1 minute, 5 minutes, and 15 minutes, respectively.

So where do the numbers displayed by top come from? In fact, the load values in top come from the pseudo file /proc/loadavg. You can see this by tracing the system calls of top with strace.

    # strace top
    openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 7

The kernel defines the open function for the pseudo file loadavg. When user mode accesses /proc/loadavg, the functions defined by the kernel are triggered; they read the load average variables in the kernel and, after a simple calculation, produce the output. The overall process is shown in the following figure.
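The user-space side of this can be sketched in a few lines of C. This is only an illustration of reading the pseudo file, not how top itself is implemented; the function names parse_loadavg and read_loadavg are made up for this example, and the sample line format assumed here is the three averages followed by other fields.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the first three fields of a /proc/loadavg line into out[0..2].
 * Returns 0 on success, -1 if the line does not start with three numbers.
 * (Hypothetical helper for illustration.) */
int parse_loadavg(const char *line, double out[3])
{
    assert(line != NULL);
    return sscanf(line, "%lf %lf %lf", &out[0], &out[1], &out[2]) == 3 ? 0 : -1;
}

/* Open /proc/loadavg (Linux only), read one line and parse it.
 * Returns 0 on success, -1 on any failure. */
int read_loadavg(double out[3])
{
    char buf[128];
    FILE *f = fopen("/proc/loadavg", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof buf, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_loadavg(buf, out);
}
```

Calling read_loadavg from a small main and printing the three values should give the same numbers that top shows on its load line.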

Let's walk through that flow in more detail. The pseudo file /proc/loadavg is defined in fs/proc/loadavg.c in the kernel. This file creates /proc/loadavg and assigns it the operations table loadavg_proc_fops.

    // file: fs/proc/loadavg.c
    static int __init proc_loadavg_init(void)
    {
        proc_create("loadavg", 0, NULL, &loadavg_proc_fops);
        return 0;
    }

loadavg_proc_fops contains the operation to run when the file is opened.

    // file: fs/proc/loadavg.c
    static const struct file_operations loadavg_proc_fops = {
        .open = loadavg_proc_open,
    };

When /proc/loadavg is opened from user mode, the open function pointer in loadavg_proc_fops, loadavg_proc_open, is called. loadavg_proc_open then hands off to loadavg_proc_show, where the core work is done.

    // file: fs/proc/loadavg.c
    static int loadavg_proc_show(struct seq_file *m, void *v)
    {
        unsigned long avnrun[3];

        // get the load average values
        get_avenrun(avnrun, FIXED_1/200, 0);

        // print the load average values
        seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
            LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
            LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
            LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
            nr_running(), nr_threads,
            task_active_pid_ns(current)->last_pid);
        return 0;
    }

The loadavg_proc_show function does two things:

Call get_avenrun to read the current load value

Print out the average load value according to a certain format

In the source code above you can see some odd-looking definitions such as FIXED_1/200, LOAD_INT and LOAD_FRAC. The code looks obscure because the kernel does not use floating-point types such as float and double; fractional values are simulated with integers instead. These macros convert between that integer representation and decimals. It is enough to know this background; there is no need to analyze it in depth.
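For readers who do want to see the integer trick, here is a small user-space sketch. The macro definitions are the ones I know from the kernel headers of that era (FSHIFT of 11 bits, so 1.0 is stored as 2048); format_load is a hypothetical helper written for this example, and FIXED_1 is declared with 1UL here for convenience.

```c
#include <assert.h>
#include <stdio.h>

#define FSHIFT   11               /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)  /* 1.0 in fixed point, i.e. 2048 */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* Format a fixed-point load value as "I.FF". Adding FIXED_1/200 (0.005)
 * first rounds to two decimals, mirroring the offset that
 * loadavg_proc_show passes to get_avenrun. Returns snprintf's result. */
int format_load(unsigned long fixed, char *buf, size_t len)
{
    unsigned long v = fixed + FIXED_1 / 200;

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%lu.%02lu", LOAD_INT(v), LOAD_FRAC(v));
}
```

For example, a stored value of 2560 is 2560 / 2048 = 1.25, and format_load writes the string 1.25 for it.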

In this way, users can read the load data calculated by the kernel by accessing /proc/loadavg. get_avenrun itself simply reads the global array avenrun.

    // file: kernel/sched/core.c
    void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
    {
        loads[0] = (avenrun[0] + offset) << shift;
        loads[1] = (avenrun[1] + offset) << shift;
        loads[2] = (avenrun[2] + offset) << shift;
    }

Now we can answer one of the opening questions: how does the kernel expose load data to the application layer?

The kernel defines a pseudo file, /proc/loadavg. Every time a user opens this file, the loadavg_proc_show function in the kernel is called; it reads the avenrun global array, converts the load averages from integers to decimals, and prints them.

This raises a new question: when and how is the data stored in the avenrun global array calculated?

2. How the kernel calculates the load

Continuing from the previous section, we now look at where the data in the avenrun global array comes from. The calculation is divided into two steps:

1. Each CPU periodically reports its instantaneous load: the current task count of each CPU is regularly refreshed into calc_load_tasks, and summing the load of every CPU gives the current instantaneous load of the system.

2. The system load average is calculated on a timer: based on the current instantaneous load of the whole system, a timer uses the exponentially weighted moving average method (an efficient algorithm for calculating averages) to calculate the load averages for the past 1 minute, 5 minutes, and 15 minutes.

Next, we cover these in two subsections.

2.1 Each CPU periodically folds in its load

The Linux kernel has a subsystem called the time subsystem, which initializes a high-resolution timer. In the handler of this timer, the load data on each CPU (number of running tasks + number of uninterruptible tasks) is regularly folded into the system-wide instantaneous load variable calc_load_tasks. The overall process is shown in the following figure.

Following that flow, the source code that sets up the high-resolution timer is as follows:

    // file: kernel/time/tick-sched.c
    void tick_setup_sched_timer(void)
    {
        // initialize the high-resolution timer sched_timer
        hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
        // set the expiry function of the timer to tick_sched_timer
        ts->sched_timer.function = tick_sched_timer;
    }

During initialization, the expiry function of the high-resolution timer is set to tick_sched_timer. Through this function each CPU performs some tasks periodically, and this is when the current system load is refreshed. One thing to note here is that each CPU has its own independent run queue.

Tracing through the source of tick_sched_timer, it calls tick_sched_handle => update_process_times => scheduler_tick in turn. Finally, in scheduler_tick, the load value of the current CPU is refreshed into calc_load_tasks. Because every CPU refreshes regularly, calc_load_tasks records the instantaneous load of the whole system.

Let's take a look at the core function scheduler_tick, which is responsible for refreshing:

    // file: kernel/sched/core.c
    void scheduler_tick(void)
    {
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);

        update_cpu_load_active(rq);
    }

This function gets the current CPU and its run queue rq, then calls update_cpu_load_active to refresh the load data of the current CPU into the global variable.

    // file: kernel/sched/core.c
    static void update_cpu_load_active(struct rq *this_rq)
    {
        calc_load_account_active(this_rq);
    }

    // file: kernel/sched/core.c
    static void calc_load_account_active(struct rq *this_rq)
    {
        long delta;

        // get the change in the load of the current run queue
        delta = calc_load_fold_active(this_rq);
        if (delta)
            // add it to the global instantaneous load
            atomic_long_add(delta, &calc_load_tasks);
    }

As calc_load_account_active shows, the change in the load of the current run queue is obtained through calc_load_fold_active and added to the global instantaneous load calc_load_tasks. After this, calc_load_tasks holds the total instantaneous load of the system at the current moment.

Let's expand to see how the load value is calculated based on the run queue:

    // file: kernel/sched/core.c
    static long calc_load_fold_active(struct rq *this_rq)
    {
        long nr_active, delta = 0;

        // tasks in the R and D states
        nr_active = this_rq->nr_running;
        nr_active += (long) this_rq->nr_uninterruptible;

        // return only the amount of change
        if (nr_active != this_rq->calc_load_active) {
            delta = nr_active - this_rq->calc_load_active;
            this_rq->calc_load_active = nr_active;
        }
        return delta;
    }

So it turns out that tasks in both the nr_running and nr_uninterruptible states are counted. These correspond to the tasks (processes or threads) shown in the R and D states in user space.

Because calc_load_tasks is a long-lived value, refreshing the count from a run queue only needs to apply the amount of change rather than the full number. That is why the function above returns a delta.
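The delta idea can be tried out in user space. The following is a simplified simulation, not kernel code: struct rq_sim, fold_active and account_active are names invented for this sketch, a plain long stands in for the atomic calc_load_tasks, and no locking or per-CPU machinery is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the per-CPU run queue fields used above. */
struct rq_sim {
    long nr_running;          /* tasks in the R state on this CPU */
    long nr_uninterruptible;  /* tasks in the D state on this CPU */
    long calc_load_active;    /* count this CPU reported last time */
};

/* Return only the change since this run queue last reported. */
long fold_active(struct rq_sim *rq)
{
    long nr_active, delta;

    assert(rq != NULL);
    nr_active = rq->nr_running + rq->nr_uninterruptible;
    delta = nr_active - rq->calc_load_active;
    rq->calc_load_active = nr_active;
    return delta;
}

/* One "tick" on one CPU: add this CPU's change to the global total. */
void account_active(struct rq_sim *rq, long *global_tasks)
{
    long delta = fold_active(rq);

    if (delta)
        *global_tasks += delta;
}
```

If one CPU reports 3 active tasks and later drops to 2, the second report contributes a delta of -1, so the global total always equals the sum over all CPUs without any CPU having to resend its full count.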

2.2 Periodically calculating the system load average

In the previous subsection we saw how calc_load_tasks, the current instantaneous load of the system, is updated. What is still missing is a mechanism to calculate the load average over the past 1 minute, 5 minutes, and 15 minutes.

The traditional way to calculate an average is to add up the values from the past period and divide by their count: add up the instantaneous loads at the past N time points and take the mean. In other words, given n numbers x1, x2, ..., xn, the average of the data set is (x1 + x2 + ... + xn) / n.

However, if you use this simple algorithm to calculate the average load, there are the following problems:

1. Data for every past sampling cycle must be stored

Assuming we sample every 10 milliseconds, a large array is needed to hold every sample, so the 15-minute average alone requires keeping 90,000 values (15 minutes * 60 seconds * 100 samples per second). And every time a new observation arrives, the oldest observation must be dropped from the moving window and the newest one added, so the array in memory is modified constantly.

2. The calculation is expensive

Each calculation adds up the whole array and divides by the total number of samples. Addition is simple, but summing that many numbers on every update is still costly.

3. The current trend is not expressed accurately

In the traditional average, every number carries the same weight. But for a real-time metric such as load average, the closer a sample is to the present, the more weight it should carry, because it better reflects the recent trend.

Therefore, Linux does not use the traditional averaging method we might expect, but the exponentially weighted moving average (Exponential Weighted Moving Average, EWMA).

The exponentially weighted moving average is widely used in deep learning, and the EMA moving average in the stock market uses a similar method as well. The mathematical expression of the algorithm is a1 = a0 * factor + a * (1 - factor), where a0 is the previous average, a is the newest sample, and a1 is the updated average. The algorithm takes some effort to understand; interested readers can look up the details themselves.

We only need to know that this method only needs the average of the previous time in the actual calculation, and does not need to save all the instantaneous load values. In addition, the closer to the present time point, the higher the weight, which can well express the recent change trend.
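To see why recent samples weigh more, the update rule can be unrolled. This is a standard property of the exponentially weighted moving average rather than something taken from the kernel source; the subscripted notation below is introduced only for this derivation, with a_n the average after the n-th update, x_n the n-th instantaneous sample, and e the factor.

```latex
\begin{align*}
a_n &= e\,a_{n-1} + (1 - e)\,x_n \\
    &= e\bigl(e\,a_{n-2} + (1 - e)\,x_{n-1}\bigr) + (1 - e)\,x_n \\
    &= (1 - e)\sum_{k=0}^{n-1} e^{k}\,x_{n-k} + e^{n}\,a_0
\end{align*}
```

A sample that is k updates old carries the weight (1 - e) e^k, which shrinks geometrically because 0 < e < 1, and the weights together with the e^n term add up to 1. Only the previous average a_{n-1} has to be stored to compute the next one.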

This calculation is also driven on a timer in the time subsystem, which uses the exponentially weighted moving average to compute the three averages.

Let's take a closer look at the execution process in the figure above. The time subsystem registers timer_interrupt as the handler function for the clock interrupt.

    // file: arch/ia64/kernel/time.c
    void __init time_init(void)
    {
        register_percpu_irq(IA64_TIMER_VECTOR, &timer_irqaction);
        ia64_init_itm();
    }

    static struct irqaction timer_irqaction = {
        .handler = timer_interrupt,
        .flags = IRQF_DISABLED | IRQF_IRQPOLL,
        .name = "timer"
    };

timer_interrupt is called on every clock tick, and it in turn calls the do_timer function.

    // file: kernel/time/timekeeping.c
    void do_timer(unsigned long ticks)
    {
        calc_global_load(ticks);
    }

calc_global_load is the core of the load average calculation. It reads calc_load_tasks, the current instantaneous load of the system, calculates the load averages for the past 1 minute, 5 minutes, and 15 minutes, and saves them to avenrun for user processes to read.

    // file: kernel/sched/core.c
    void calc_global_load(unsigned long ticks)
    {
        long active;

        // 1. get the current instantaneous load
        active = atomic_long_read(&calc_load_tasks);

        // 2. calculate the load averages
        avenrun[0] = calc_load(avenrun[0], EXP_1, active);
        avenrun[1] = calc_load(avenrun[1], EXP_5, active);
        avenrun[2] = calc_load(avenrun[2], EXP_15, active);
    }

Getting the instantaneous load is simple: it is just a read of a variable in memory. calc_load then uses the exponentially weighted moving average method mentioned earlier to calculate the load averages of the past 1 minute, 5 minutes, and 15 minutes. The implementation is as follows:

    // file: kernel/sched/core.c
    /*
     * a1 = a0 * e + a * (1 - e)
     */
    static unsigned long
    calc_load(unsigned long load, unsigned long exp, unsigned long active)
    {
        load *= exp;
        load += active * (FIXED_1 - exp);
        load += 1UL << (FSHIFT - 1);
        return load >> FSHIFT;
    }

Although the algorithm takes some effort to understand, the code is short and the amount of computation is small. It does not matter if you do not follow every detail; you only need to know that the kernel does not use the plain averaging method, but an algorithm that is fast to compute and expresses the trend of change better.
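To get a feel for how the three averages respond, calc_load can be run in user space. The sketch below copies the function body shown above; the constant values are the ones I know from the kernel headers, where the averages are refreshed roughly every 5 seconds, and that interval is an assumption not stated in this article. The simulate helper is invented for this example, and it scales the task count by FIXED_1 before calling calc_load, as the full kernel function does.

```c
#include <assert.h>

#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)  /* 1.0 in fixed point */
#define EXP_1    1884UL           /* 1/exp(5s/1min) in fixed point */
#define EXP_5    2014UL           /* 1/exp(5s/5min) */
#define EXP_15   2037UL           /* 1/exp(5s/15min) */

/* a1 = a0 * e + a * (1 - e), all values in fixed point. */
unsigned long calc_load(unsigned long load, unsigned long exp,
                        unsigned long active)
{
    load *= exp;
    load += active * (FIXED_1 - exp);
    load += 1UL << (FSHIFT - 1);  /* add 0.5 so the shift rounds to nearest */
    return load >> FSHIFT;
}

/* Apply `steps` updates while `nr_active` tasks stay in the R or D state.
 * Hypothetical helper: returns the resulting fixed-point average. */
unsigned long simulate(unsigned long load, unsigned long exp,
                       unsigned long nr_active, int steps)
{
    unsigned long active = nr_active * FIXED_1;
    int i;

    assert(steps >= 0);
    for (i = 0; i < steps; i++)
        load = calc_load(load, exp, active);
    return load;
}
```

Starting from an idle system with one task that suddenly becomes permanently runnable, 12 updates (about one minute under the 5-second assumption) bring the 1-minute average to roughly 0.63, about 1 - 1/e of the true value, while the 5-minute and 15-minute averages have moved much less.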

At this point we have an answer to the opening question "How is the load calculated?"

Linux periodically folds the number of tasks in the running and uninterruptible states on each CPU's run queue into a global instantaneous load value for the system, and then periodically uses the exponentially weighted moving average method to compute the load averages for the past 1 minute, 5 minutes, and 15 minutes.

3. The relationship between load average and CPU consumption

Many readers equate load average with CPU usage. They assume that if the load is high, CPU consumption must be high, and if the load is low, CPU consumption must be low.

In very old versions of Linux it was true that only runnable tasks were counted when calculating the load, and those tasks only need CPU. In those days load and CPU consumption were indeed positively correlated: the higher the load, the more tasks were running on a CPU or waiting for one, and the higher the CPU consumption.

But as we saw earlier, in Linux 3.10, the version used in this article, the load average tracks not only runnable tasks but also tasks in the uninterruptible sleep state. A task in the uninterruptible state does not occupy the CPU.

Therefore, a high load is not necessarily caused by the CPU being unable to keep up. It can also be caused by tasks entering the uninterruptible state because they are waiting on the disk or other resources.

Why was it changed this way? I searched online and found the reason in an email from as far back as 1993. The original message follows.

    From: Matthias Urlichs
    Subject: Load average broken ?
    Date: Fri, 29 Oct 1993 11:37:23 +0200

    The kernel only counts "runnable" processes when computing the load average.
    I don't like that; the problem is that processes which are swapping or
    waiting on "fast", i.e. noninterruptible, I/O, also consume resources.

    It seems somewhat nonintuitive that the load average goes down when you
    replace your fast swap disk with a slow swap disk...

    Anyway, the following patch seems to make the load average much more
    consistent WRT the subjective speed of the system. And, most important, the
    load is still zero when nobody is doing anything. ;-)

    --- kernel/sched.c.orig Fri Oct 29 10:31:11 1993
    +++ kernel/sched.c Fri Oct 29 10:32:51 1993
    @@ -414,7 +414,9 @@
         unsigned long nr = 0;

         for (p = &LAST_TASK; p > &FIRST_TASK; --p)
    -        if (*p && (*p)->state == TASK_RUNNING)
    +        if (*p && (*p)->state == TASK_RUNNING) ||
    +                   (*p)->state == TASK_UNINTERRUPTIBLE) ||
    +                   (*p)->state == TASK_SWAPPING))
                 nr += FIXED_1;
         return nr;
     }

This change was introduced in 1993. As the source change in this email shows, the load calculation formally added tasks in the TASK_UNINTERRUPTIBLE and TASK_SWAPPING states (the swapping state was later removed from Linux). In the body of the email the author also makes clear why tasks in the TASK_UNINTERRUPTIBLE state should be included. His explanation, restated:

"The kernel counts only runnable processes when computing the load average. I don't like that; the problem is that processes which are swapping or waiting on "fast", that is, uninterruptible, I/O also consume resources. It seems somewhat unintuitive that the load average drops when you replace a fast swap disk with a slow one. In any case, the following patch seems to make the load average much more consistent with the subjective speed of the system. And, most importantly, the load is still zero when nobody is doing anything. ;-)"

The main idea of this patch submitter is that the average load should represent the demand for all the resources of the system, not just the CPU resources.

Suppose a task in the TASK_UNINTERRUPTIBLE state is queued for disk I/O. It does not consume CPU at that moment, but it is waiting for a hardware resource such as the disk, so it should be reflected in the load average. That is why the author included tasks in the TASK_UNINTERRUPTIBLE state in the load average.

Therefore, a high load indicates heavy overall demand on the resources of the system. If the load rises, the CPU may be the bottleneck, or disk I/O may be, so you need to use other observation commands alongside the load to analyze the situation.
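One way to tell the two cases apart is to look at which state the tasks are actually in. As a rough sketch, the state of a task can be read from /proc/<pid>/stat, where it is the single character that follows the command name in parentheses. The helper names below are invented for this example, and walking /proc to apply them to every task is left out.

```c
#include <assert.h>
#include <string.h>

/* Extract the state character from the contents of /proc/<pid>/stat.
 * The command name is wrapped in parentheses and may itself contain
 * spaces or ')', so the search starts from the last ')'.
 * Returns '?' if the text does not have the expected shape. */
char stat_state(const char *stat_line)
{
    const char *close;

    assert(stat_line != NULL);
    close = strrchr(stat_line, ')');
    if (!close || close[1] != ' ' || close[2] == '\0')
        return '?';
    return close[2];
}

/* R (running or runnable) and D (uninterruptible sleep) are the two
 * states that count toward the load average. */
int counts_toward_load(char state)
{
    return state == 'R' || state == 'D';
}
```

If most of the tasks that count toward the load are in the D state, the pressure is more likely on disk or other I/O than on the CPU.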

4. Summary

Today I took you through how load works in Linux. Let's summarize what we have learned with a picture.

I divide the way load works into the following three steps.

1. The kernel periodically folds the load of each CPU into the instantaneous load of the system

2. The kernel uses the exponentially weighted moving average to quickly calculate the averages of the past 1, 5, and 15 minutes

3. The user process reads the load averages from the kernel by opening loadavg

Let's go back and summarize some of the issues mentioned at the beginning.

1. How is the load calculated?

The kernel periodically folds the number of tasks in the running and uninterruptible states on each CPU's run queue into a global instantaneous load value for the system, and then periodically uses the exponentially weighted moving average method to compute the load averages for the past 1 minute, 5 minutes, and 15 minutes.

2. Is there a positive correlation between load and CPU consumption?

A high load indicates heavy overall demand on the resources of the system. If the load rises, there may not be enough CPU resources, or there may not be enough disk I/O resources. So we cannot conclude from a rising load alone that CPU resources are insufficient.

3. How does the kernel expose load data to the application layer?

The kernel defines a pseudo file, /proc/loadavg. Whenever a user opens this file, the loadavg_proc_show function in the kernel is called, which reads the avenrun global array, converts the load averages from integers to decimals, and then prints them.
