
What are resource limits in K8s?


This article explains what resource limits are in K8s. The explanation is straightforward and easy to follow, so read along to learn how resource limits work in Kubernetes.

Foreword

When I started using Kubernetes at scale, I began to think about a problem I had not run into in my experiments: when the nodes in a cluster do not have enough resources, Pods get stuck in the Pending state. You cannot add CPU or memory to a node, so what can you do to get the Pod out of the Pending state? The easiest answer is to add another node, and I admit that is what I always did. In the end, that strategy fails to exploit one of Kubernetes's most important capabilities: its ability to optimize the use of compute resources. The real problem in these scenarios is not that the nodes are too small, but that we have not carefully calculated resource limits for our Pods.

Resource limits are one of the many configurations we can give Kubernetes, and they mean two things: what resources the workload needs in order to run, and how much of those resources it is allowed to consume at most. The first point matters to the scheduler, which uses it to select a suitable node. The second point matters to the kubelet, the daemon on each node responsible for keeping Pods healthy. Most readers probably already have some understanding of resource limits, but there are a lot of interesting details. In this two-part series I will first analyze memory limits in detail, and then analyze CPU limits in the second part.

Resource limits

Resource limits are set through the resources field of each containerSpec, which is a v1 ResourceRequirements API object. It specifies "limits" and "requests" for each kind of resource it controls. Currently there are only two kinds of resources: CPU and memory. A third resource type, persistent storage, is still in beta, and I will look at it in a future post. In most cases, the definition of a Deployment, StatefulSet, or DaemonSet contains a podSpec with one or more containerSpecs. Here is a complete yaml configuration of the v1 resources object:

resources:
  requests:
    cpu: 50m
    memory: 50Mi
  limits:
    cpu: 100m
    memory: 100Mi

This object can be read as follows: the container normally needs 5% of one CPU's time and 50MiB of memory (the requests), and is allowed to use at most 10% of one CPU's time and 100MiB of memory (the limits). I will explain the difference between requests and limits further below, but in general requests matter more at scheduling time and limits matter more at run time. Although resource limits are configured on each container, you can think of a Pod's resource limits as the sum of the limits of the containers in it, and that is how the system itself views the relationship.
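To make the context concrete, here is a minimal sketch of how this resources block sits inside a containerSpec within a Deployment's podSpec. The names (limit-test, busybox) and the apps/v1 apiVersion are illustrative assumptions, not taken from the article's cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: limit-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: limit-test
  template:
    metadata:
      labels:
        app: limit-test
    spec:                          # podSpec
      containers:                  # one containerSpec per container
      - name: limit-test
        image: busybox
        command: ["/bin/sh", "-c", "while true; do sleep 2; done"]
        resources:                 # v1 ResourceRequirements
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 100m
            memory: 100Mi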

Memory limit

Memory is usually easier to analyze than CPU, so I will start there. One of my goals is to show you how memory limits are implemented in the system, that is, what Kubernetes tells the container runtime (docker/containerd) to do, and what the container runtime in turn tells the Linux kernel to do. Starting with memory also lays the groundwork for the later analysis of CPU. First, let's review the earlier example:

resources:
  requests:
    memory: 50Mi
  limits:
    memory: 100Mi

The unit suffix Mi denotes MiB, so this resource object states that the container needs 50MiB of memory and can use at most 100MiB. There are other units that can express the same quantities. To understand how these values are used to control the container process, let's first create a Pod with no memory limits configured:

$ kubectl run limit-test --image=busybox --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created

With the kubectl command, we can verify that the Pod has no resource limits:

$ kubectl get pods limit-test-7cff9996fc-zpjps -o=jsonpath='{.spec.containers[0].resources}'
map[]

One of the coolest things about Kubernetes is that you can step outside the system and inspect each component, so let's log in to the node running the Pod and see how Docker runs the container:

$ docker ps | grep busy | cut -d' ' -f1
5c3af3101afb
$ docker inspect 5c3af3101afb -f '{{.HostConfig.Memory}}'
0

The .HostConfig.Memory field of this container corresponds to the --memory parameter of docker run, and a value of 0 means it was not set. What does Docker do with this value? To control the amount of memory a container process can use, Docker configures a set of control groups, or cgroups. Cgroups were merged into the Linux kernel in 2.6.24, released in January 2008, and they are a big topic. For our purposes, a cgroup is a collection of related properties that control how the kernel runs a process. There are separate cgroups for memory, CPU, and various devices. Cgroups are hierarchical, meaning each cgroup has a parent from which it inherits properties, all the way up to the root cgroup created when the system starts.
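You can watch this Docker-to-cgroup mapping without Kubernetes in the picture at all. The following is a rough sketch, assuming the cgroupfs driver and a typical /sys/fs/cgroup layout; the container name mem-demo is a placeholder:

# Run a container with an explicit 100 MiB memory limit
$ docker run -d --name mem-demo --memory=100m busybox /bin/sh -c "sleep 3600"

# Docker records the limit in bytes on the container...
$ docker inspect mem-demo -f '{{.HostConfig.Memory}}'
104857600

# ...and writes the same value into the container's memory cgroup (path varies by setup)
$ cat /sys/fs/cgroup/memory/docker/$(docker inspect mem-demo -f '{{.Id}}')/memory.limit_in_bytes
104857600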

Cgroups are easy to inspect through the /proc and /sys pseudo-filesystems, so it is easy to check how a container's memory cgroup is configured. In the container's PID namespace the root process has pid 1, but outside that namespace it has a system-level pid, which we can use to find its cgroups:

$ ps ax | grep /bin/sh
 9513 ?  Ss  0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/9513/cgroup
...
6:memory:/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574
...

I have listed only the memory cgroup, which is the one we care about. You can see the cgroup hierarchy mentioned earlier in the path. The important points are: first, the path starts with the kubepods cgroup, so our process inherits every property of that group, as well as those of burstable (Kubernetes places this Pod in the burstable QoS class) and of a per-Pod cgroup used for accounting. The last segment of the path is the cgroup our process actually belongs to. We can append this path to /sys/fs/cgroup/memory to get more information:

$ ls -l /sys/fs/cgroup/memory/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574
...
-rw-r--r-- 1 root root 0 Oct 27 19:53 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 27 19:53 memory.soft_limit_in_bytes
...

Again, I listed only the entries we care about. We will ignore memory.soft_limit_in_bytes for now and focus on memory.limit_in_bytes, which sets the memory limit. It corresponds to the --memory parameter of the Docker command, and therefore to the memory limit in Kubernetes. Let's see:

$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574/memory.limit_in_bytes
9223372036854771712

This is what appears on my node when no resource limit is set; the huge value effectively means "no limit" (a brief explanation is here: https://unix.stackexchange.com/questions/420906/what-is-the-value-for-the-cgroups-limit-in-bytes-if-the-memory-is-not-restricte). So if no memory limit is set in Kubernetes, Docker sets HostConfig.Memory to 0, which in turn places the container process under a memory cgroup whose memory.limit_in_bytes keeps its default value of "no limit". Now let's create a Pod with a 100MiB memory limit:

$ kubectl run limit-test --image=busybox --limits "memory=100Mi" --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created

Once again, we use kubectl to verify our resource configuration:

$ kubectl get pods limit-test-5f5c7dc87d-8qtdx -o=jsonpath='{.spec.containers[0].resources}'
map[limits:map[memory:100Mi] requests:map[memory:100Mi]]

You will notice that in addition to the limits we set, the Pod also gained requests. When you set limits but not requests, Kubernetes defaults requests to the same value as limits. This makes a lot of sense from the scheduler's point of view; I will come back to requests further below. With the Pod running, we can see how Docker configured the container and the memory cgroup of the process:

$ docker ps | grep busy | cut -d' ' -f1
8fec6c7b6119
$ docker inspect 8fec6c7b6119 --format '{{.HostConfig.Memory}}'
104857600
$ ps ax | grep /bin/sh
29532 ?  Ss  0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/29532/cgroup
...
6:memory:/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11
$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11/memory.limit_in_bytes
104857600

As you can see, Docker correctly set the memory cgroup for this process based on our containerSpec. But what does this mean at run time? Linux memory management is a complex topic, and what Kubernetes engineers need to know is this: when a host comes under memory pressure, the kernel may selectively kill processes, and a process using more than its memory limit has a higher chance of being killed. Because Kubernetes's job is to pack as many Pods as possible onto nodes, this can lead to unusual memory pressure on a node. If your container uses too much memory, it is very likely to be oom-killed. If Docker is notified of the kill by the kernel, Kubernetes will detect it and try to restart the Pod according to its settings.
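If you want to check whether a container was oom-killed, one quick way (a sketch; the pod name below is just the one from the earlier example) is to look at the container's last terminated state and the details reported by kubectl describe:

# Shows "OOMKilled" if the kernel killed the previous instance of the container
$ kubectl get pod limit-test-5f5c7dc87d-8qtdx \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Restart count and last state are also visible in human-readable form
$ kubectl describe pod limit-test-5f5c7dc87d-8qtdx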

So what about the memory requests that Kubernetes created by default? Does a 100MiB memory request affect the cgroup? Perhaps it sets the memory.soft_limit_in_bytes we saw earlier? Let's see:

$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11/memory.soft_limit_in_bytes
9223372036854771712

You can see that the soft limit is still at its default value of "no limit". Even though Docker supports setting it through the --memory-reservation parameter, Kubernetes does not use it. Does this mean it is unimportant to specify memory requests for your container? No. Requests are even more important than limits. Limits tell the Linux kernel when your process may be killed to free up memory. Requests help the Kubernetes scheduler find a suitable node to run your Pod. If you do not set them, or set them too low, there can be bad consequences.

For example, suppose you run a Pod with no memory requests but a high memory limit. Since Kubernetes defaults requests to the value of limits, the Pod may fail to schedule for lack of a node with enough resources, even though it does not actually need that much. On the other hand, if you run a Pod with requests set too low, you are effectively inviting the kernel to oom-kill it. Why? Suppose your Pod normally uses 100MiB of memory but you configure only 50MiB of requests. If there is a node with 75MiB of free memory, the Pod will be scheduled there. When the Pod's memory usage grows to 100MiB it puts the node under pressure, and the kernel may choose to kill your process. So we need to configure both memory requests and limits correctly.
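For the hypothetical 100MiB workload above, a more reasonable configuration would be to request roughly what it normally uses and allow some headroom in the limit. This is only a sketch; the right numbers depend entirely on your application's real memory profile:

resources:
  requests:
    memory: 100Mi   # what the process normally needs, so the scheduler picks a node that can actually hold it
  limits:
    memory: 150Mi   # headroom for spikes; beyond this the process becomes an oom-kill candidate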

I hope this section helps explain how container memory limits are set and enforced in Kubernetes, and why you need to set these values correctly. If you give Kubernetes the information it needs, it can schedule your workloads intelligently and make the most of your cloud resources.

CPU limits

CPU limits are more complex than memory limits, for reasons explained below. Fortunately CPU limits, like memory limits, are implemented with cgroups, so the ideas and tools from the previous section apply here too; we only need to pay attention to the differences. First, let's add CPU limits to the yaml from the earlier example:

resources:
  requests:
    memory: 50Mi
    cpu: 50m
  limits:
    memory: 100Mi
    cpu: 100m

The unit suffix m means one thousandth of a core, so 1 core = 1000m. This resource object therefore says the container process needs 50/1000 of a core (5%) in order to be scheduled, and is allowed to use at most 100/1000 of a core (10%). Similarly, 2000m means two full cores, which can also be written as 2 or 2.0. To understand how Docker and cgroups use these values to control the container, let's first create a Pod with only CPU requests configured:

$ kubectl run limit-test --image=busybox --requests "cpu=50m" --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created

We can verify that the Pod is configured with 50m CPU requests through the kubectl command:

$ kubectl get pods limit-test-5b4c495556-p2xkr -o=jsonpath='{.spec.containers[0].resources}'
map[requests:map[cpu:50m]]

We can also see that Docker configures the same resource limits for the container:

$ docker ps | grep busy | cut -d' ' -f1
f2321226620e
$ docker inspect f2321226620e --format '{{.HostConfig.CpuShares}}'
51

Why does it show 51 instead of 50? Because the Linux cgroup system and Docker divide one CPU core into 1024 shares, while Kubernetes divides it into 1000 millicores.
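In other words, millicores are converted into cgroup shares roughly as requests × 1024 / 1000, truncated to an integer. A quick sanity check in the shell (just the arithmetic, not the kubelet's actual code):

$ echo $(( 50 * 1024 / 1000 ))
51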

Shares set a relative weight for CPU across all cores; the default value is 1024. If there are two cgroups in the system, A and B, and A's shares value is 1024 while B's is 512, then A gets 1024 / (1024 + 512) ≈ 66% of the CPU and B gets ≈ 33%.

Shares have two characteristics:

If A is idle and does not use its 66% of CPU time, the remaining CPU time is given to B, so B's CPU utilization can exceed 33%.

If a new cgroup C is added with a shares value of 1024, A's share becomes 1024 / (1024 + 512 + 1024) = 40% and B's becomes 20%.

From the above two characteristics, we can see:

Shares impose no limit when the CPU is idle; they only take effect when the CPU is busy, which is an advantage.

Because the shares value is an absolute number, you have to compare it with the values of the other cgroups to know your relative share, and on a machine that runs many containers the number of cgroups keeps changing, so the effective limit changes too. You may set a high value, but someone else may set an even higher one, so shares cannot precisely cap CPU usage.
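You can observe this relative behavior with plain cgroup v1 tooling. The sketch below assumes the cpu controller is mounted at /sys/fs/cgroup/cpu and that you have root access; the group names A and B are arbitrary, and the 66%/33% split only shows up when the loops contend for the same core:

# Create two sibling cgroups with a 2:1 share ratio
$ sudo mkdir /sys/fs/cgroup/cpu/A /sys/fs/cgroup/cpu/B
$ echo 1024 | sudo tee /sys/fs/cgroup/cpu/A/cpu.shares
$ echo  512 | sudo tee /sys/fs/cgroup/cpu/B/cpu.shares

# Put one CPU-hungry loop into each group
$ ( while true; do :; done ) & echo $! | sudo tee /sys/fs/cgroup/cpu/A/cgroup.procs
$ ( while true; do :; done ) & echo $! | sudo tee /sys/fs/cgroup/cpu/B/cgroup.procs

# On a fully busy core, top should show the two loops at roughly 66% and 33% CPU;
# if there are idle cores, each loop simply takes a whole core, because shares
# only matter under contention.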

Just as Docker configures the container process's memory cgroup when you set a memory limit, it configures the process's cpu,cpuacct cgroup when you set CPU resources:

$ ps ax | grep /bin/sh
60554 ?  Ss  0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/60554/cgroup
...
4:cpu,cpuacct:/kubepods/burstable/pode12b33b1-db07-11e8-b1e1-42010a800070/3be263e7a8372b12d2f8f8f9b4251f110b79c2a3bb9e6857b2f1473e640e8e75
$ ls -l /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pode12b33b1-db07-11e8-b1e1-42010a800070/3be263e7a8372b12d2f8f8f9b4251f110b79c2a3bb9e6857b2f1473e640e8e75
total 0
drwxr-xr-x 2 root root 0 Oct 28 23:19 .
drwxr-xr-x 4 root root 0 Oct 28 23:19 ..
-rw-r--r-- 1 root root 0 Oct 28 23:19 cpu.shares

The Docker container's HostConfig.CpuShares attribute maps to the cgroup's cpu.shares attribute, which we can verify:

$ sudo cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podb5c03ddf-db10-11e8-b1e1-42010a800070/64b5f1b636dafe6635ddd321c5b36854a8add51931c7117025a694281fb11444/cpu.shares
51

You may be surprised that setting CPU requests propagates a value into the cgroup, while setting memory requests, as we saw earlier, does not. This is because the kernel's memory soft-limit feature is not useful to Kubernetes, whereas setting cpu.shares is. I will discuss why in more detail later. Now let's look at what happens when we also set a CPU limit:

$ kubectl run limit-test --image=busybox --requests "cpu=50m" --limits "cpu=100m" --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created

Once again use kubectl to verify our resource configuration:

$ kubectl get pods limit-test-5b4fb64549-qpd4n -o=jsonpath='{.spec.containers[0].resources}'
map[limits:map[cpu:100m] requests:map[cpu:50m]]

View the configuration of the corresponding Docker container:

$ docker ps | grep busy | cut -d' ' -f1
f2321226620e
$ docker inspect 472abbce32a5 --format '{{.HostConfig.CpuShares}} {{.HostConfig.CpuQuota}} {{.HostConfig.CpuPeriod}}'
51 10000 100000

You can clearly see that CPU requests corresponds to the Docker container's HostConfig.CpuShares property. CPU limits, on the other hand, is less obvious: it is controlled by two properties, HostConfig.CpuPeriod and HostConfig.CpuQuota, which map to two further properties of the process's cpu,cpuacct cgroup: cpu.cfs_period_us and cpu.cfs_quota_us. Let's take a look:

$ sudo cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod2f1b50b6-db13-11e8-b1e1-42010a800070/f0845c65c3073e0b7b0b95ce0c1eb27f69d12b1fe2382b50096c4b59e78cdf71/cpu.cfs_period_us
100000
$ sudo cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod2f1b50b6-db13-11e8-b1e1-42010a800070/f0845c65c3073e0b7b0b95ce0c1eb27f69d12b1fe2382b50096c4b59e78cdf71/cpu.cfs_quota_us
10000

As expected, these values match the ones in the container configuration. But how do these two properties derive from the 100m CPU limit we set on the Pod, and how do they enforce it? The answer is that CPU requests and CPU limits are implemented by two separate control systems. Requests use the cpu shares system: cpu shares divides each CPU core into 1024 slices and guarantees each process a proportional share of those slices. If there are 1024 slices in total and two processes each set cpu.shares to 512, they will each get roughly half of the available CPU time. However, the cpu shares system cannot enforce an upper bound on CPU usage: if one process does not use its share, the other is free to take the unused CPU.

Around 2010, engineers at Google and elsewhere noticed this problem, and a second, more powerful control system was later added to the Linux kernel: the CPU bandwidth control group. The bandwidth controller defines a period, usually 1/10 of a second (100,000 microseconds), and a quota, the amount of CPU time the process is allowed to use within each period; together the two set an upper bound on CPU usage. Both files are expressed in microseconds (us). The value of cfs_period_us can range from 1 millisecond (ms) to 1 second (s), and cfs_quota_us must be at least 1 ms. If cfs_quota_us is -1 (the default), there is no CPU time limit.

Here are a few examples:

# 1. Limit to 1 CPU (250ms of CPU time every 250ms)
$ echo 250000 > cpu.cfs_quota_us    /* quota = 250ms */
$ echo 250000 > cpu.cfs_period_us   /* period = 250ms */

# 2. Limit to 2 CPUs (cores) (1000ms of CPU time every 500ms, i.e. two full cores)
$ echo 1000000 > cpu.cfs_quota_us   /* quota = 1000ms */
$ echo 500000 > cpu.cfs_period_us   /* period = 500ms */

# 3. Limit to 20% of 1 CPU (10ms of CPU time every 50ms, i.e. 20% of one core)
$ echo 10000 > cpu.cfs_quota_us     /* quota = 10ms */
$ echo 50000 > cpu.cfs_period_us    /* period = 50ms */

In our example we set the Pod's CPU limit to 100m, which means 100/1000 of a CPU core, or 10,000 microseconds out of a 100,000-microsecond CPU period. So translating that limit into the cpu,cpuacct cgroup gives cpu.cfs_period_us=100000 and cpu.cfs_quota_us=10000. By the way, cfs stands for Completely Fair Scheduler, the default CPU scheduler in Linux. There is also a real-time scheduler with its own corresponding quota values.
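The bandwidth controller also keeps throttling statistics in the same cgroup, which is a handy way to tell whether a container is being held back by its CPU limit. A sketch; substitute the pod and container cgroup directories from the examples above for the placeholders:

# nr_periods:     how many enforcement periods have elapsed
# nr_throttled:   how many of those periods the group ran out of quota
# throttled_time: total time (in nanoseconds) the group's tasks spent throttled
$ cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/<pod-cgroup>/<container-cgroup>/cpu.stat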

Now let's sum it up:

The CPU requests you set in Kubernetes end up as the cgroup's cpu.shares value, while CPU limits are set as cpu.cfs_period_us and cpu.cfs_quota_us by the bandwidth control group. As with memory, CPU requests are mainly used to tell the scheduler how many cpu shares must be available on a node before the Pod can be scheduled there.

Unlike memory requests, setting CPU requests also sets a cgroup property, ensuring that the kernel actually grants that number of shares to the process.

CPU limits also behave differently from memory limits. If a container process uses more memory than its limit, it becomes a candidate for oom-killing. But a container process can essentially never exceed its CPU quota, so it will never be evicted for trying to use more CPU time than it was allocated: the limit is enforced by the kernel scheduler through throttling, so the process simply cannot go over it.

What happens if you do not set these properties, or set them inaccurately? As with memory, if you set only limits and not requests, Kubernetes defaults CPU requests to the same value as limits, which is fine if you know exactly how much CPU time your workload needs. What if you set only CPU requests and no limits? In that case Kubernetes makes sure the Pod is scheduled to a suitable node and that the node's kernel guarantees the Pod at least the cpu shares it requested, but your process is not prevented from using more CPU than it requested. Setting neither requests nor limits is the worst case: the scheduler has no idea what the container needs, and the process can use unlimited cpu shares, which may negatively affect the node.
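A practical way to see how requests add up against a node's capacity is kubectl describe node, which lists the node's allocatable resources and the total requests and limits of the Pods scheduled on it. A sketch; the node name my-node-1 is a placeholder:

# Allocatable CPU and memory on the node
$ kubectl get node my-node-1 -o jsonpath='{.status.allocatable}'

# The "Allocated resources" section sums the requests/limits of the Pods on this node
$ kubectl describe node my-node-1 | grep -A 10 "Allocated resources"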

Finally, configuring these parameters by hand for every Pod is tedious. Kubernetes provides the LimitRange resource, which lets us configure default request and limit values for a namespace.

Default limits

You already know from the discussion above that ignoring resource limits can have negative effects on your Pods, so you might wish you could configure default request and limit values for a namespace, so that sensible defaults are applied every time a new Pod is created. Kubernetes lets us do exactly that per namespace with the LimitRange resource. To set default resource limits, create a LimitRange in the corresponding namespace. Here is an example:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit
spec:
  limits:
  - default:
      memory: 100Mi
      cpu: 100m
    defaultRequest:
      memory: 50Mi
      cpu: 50m
  - max:
      memory: 512Mi
      cpu: 500m
  - min:
      memory: 50Mi
      cpu: 50m
    type: Container

A few of these fields may be confusing, so let me break them down.

The default field under limits specifies the default limits for each Pod, so any Pod that does not specify limits is automatically given limits of 100Mi memory and 100m CPU.

The defaultRequest field specifies the default requests for each Pod, so any Pod that does not specify requests is automatically given requests of 50Mi memory and 50m CPU.

The max and min fields are special: if they are set, a Pod in this namespace will not be created at all if its limits or requests fall outside these bounds. I have not found much use for these two fields yet; if you know one, please leave a comment and let me know.
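To see the LimitRange above in action, here is a rough sketch (the file name, pod names, and label selector are placeholders, and the exact text of any rejection message depends on your Kubernetes version):

# Apply the LimitRange in the target namespace
$ kubectl apply -f default-limit.yaml

# A Pod created without any resources gets the default requests and limits filled in
$ kubectl run defaults-test --image=busybox --command -- /bin/sh -c "sleep 3600"
$ kubectl get pods -l run=defaults-test -o jsonpath='{.items[0].spec.containers[0].resources}'

# A Pod whose limits exceed max (512Mi memory here) is rejected by the API server
# with a Forbidden error raised by the LimitRanger admission plugin
$ kubectl run too-big --image=busybox --limits "memory=1Gi" --command -- /bin/sh -c "sleep 3600"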

The defaults set in a LimitRange are ultimately applied by the LimitRanger plugin of the Kubernetes admission controller. Admission controllers are a series of plugins that can modify a Pod's spec after the API server receives the request and before the Pod is created. The LimitRanger plugin checks whether each Pod has limits and requests set, and if not, fills in the defaults from the LimitRange. You can tell that the LimitRanger plugin has set defaults on your Pod by looking at its annotations. For example:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu request for container limit-test'
  name: limit-test-859d78bc65-g6657
  namespace: default
spec:
  containers:
  - args:
    - /bin/sh
    - -c
    - while true; do sleep 2; done
    image: busybox
    imagePullPolicy: Always
    name: limit-test
    resources:
      requests:
        cpu: 100m

Thank you for reading. That covers what resource limits are in K8s; I hope this article has given you a deeper understanding of how they are set and enforced.
