Foreword
When I started using Kubernetes at scale, I began to think about a problem I had not encountered in my experiments: when the nodes in a cluster do not have enough resources, Pods get stuck in the Pending state. You cannot add CPU or memory to a node, so what can you do to get the Pod unstuck? The easiest answer is to add another node, and I admit that is what I always did. In the end, though, that strategy fails to take advantage of one of Kubernetes's most important capabilities: its ability to optimize the use of compute resources. The real problem in these scenarios is not that the nodes are too small, but that we have not carefully calculated resource limits for our Pods.
Resource limits are one of the many configurations we can give Kubernetes, and they mean two things: what resources the workload needs in order to run, and how much of a resource it is allowed to consume at most. The first point is important to the scheduler, which uses it to select a suitable node. The second point is important to the kubelet, the daemon on each node that is responsible for the health of running Pods. Most readers of this article probably already have some understanding of resource limits, but there are a lot of interesting details. In this two-part series I will first analyze memory limits in detail, and then analyze CPU limits in the second article.
Resource limits
Resource limits are set through the resources field of each container's containerSpec, which is a v1 ResourceRequirements API object. Each object specifies "limits" and "requests" for the resources it controls. Currently there are only two kinds of resources: CPU and memory. A third resource type, persistent storage, is still in beta, and I will analyze it in a future post. In most cases the definitions of a deployment, statefulset, or daemonset include a podSpec and one or more containerSpecs. Here is a complete yaml configuration of the v1 resource object:
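A minimal sketch, assuming the values described in the next paragraph (requests of 50m CPU and 50Mi memory, limits of 100m CPU and 100Mi memory):

resources:
  requests:
    cpu: 50m
    memory: 50Mi
  limits:
    cpu: 100m
    memory: 100Mi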
This object can be read as follows: this container normally needs 5% of a CPU and 50MiB of memory (requests), and it is allowed to use at most 10% of a CPU and 100MiB of memory (limits). I will explain the difference between requests and limits further below, but in general requests matter more at scheduling time and limits matter more at run time. Although resource limits are configured on each container, you can think of a Pod's resource limits as the sum of the limits of the containers in it, and we can observe this relationship from the system's point of view.
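To make that last point concrete, here is a hedged sketch of a Pod with two containers (the name and images are purely illustrative, not from the original example); from the system's point of view the Pod's memory limit is effectively the sum of the two container limits, 150MiB in this case:

apiVersion: v1
kind: Pod
metadata:
  name: limit-sum-demo            # illustrative name
spec:
  containers:
  - name: app
    image: busybox
    command: ["/bin/sh", "-c", "while true; do sleep 2; done"]
    resources:
      limits:
        memory: 100Mi             # container-level limit
  - name: sidecar
    image: busybox
    command: ["/bin/sh", "-c", "while true; do sleep 2; done"]
    resources:
      limits:
        memory: 50Mi              # container-level limit; the Pod total is 150Mi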
Memory limit
It is usually easier to analyze memory than CPU, so I will start there. One of my goals is to show you how memory limits are implemented in the system, that is, what Kubernetes asks the container runtime (docker/containerd) to do and what the container runtime asks the Linux kernel to do. Starting with memory limits also lays the groundwork for the later analysis of CPU. First, let's review the earlier example, restated here with just the memory portion:
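resources:
  requests:
    memory: 50Mi
  limits:
    memory: 100Mi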
The unit suffix Mi means MiB, so this resource object specifies that the container requires 50MiB of memory and can use at most 100MiB. There are other units that can express the same values. To see how these values are used to control the container process, let's first create a Pod with no configured memory limits:
$ kubectl run limit-test --image=busybox --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created
Using kubectl, we can verify that the Pod has no resource limits:
$ kubectl get pods limit-test-7cff9996fc-zpjps -o=jsonpath='{.spec.containers[0].resources}'
map[]
One of the coolest things about Kubernetes is that you can step outside the system and inspect each component directly, so let's log in to the node running the Pod and see how Docker is running the container:
$ docker ps | grep busy | cut -d' ' -f1
5c3af3101afb
$ docker inspect 5c3af3101afb -f "{{.HostConfig.Memory}}"
0
The container's .HostConfig.Memory field corresponds to the --memory parameter of docker run, and a value of 0 means it is not set. What does Docker do with this value? To control the amount of memory a container process can use, Docker configures a set of control groups, or cgroups. Cgroups were merged into the Linux 2.6.24 kernel in January 2008. They are a big topic; for our purposes, a cgroup is a collection of related properties that control how the kernel runs a process. There are cgroups for memory, CPU, and various devices. Cgroups are hierarchical, meaning that each cgroup has a parent from which it inherits properties, all the way up to the root cgroup created at system boot.
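As a rough sketch of this mapping (the container name memory-demo is just for illustration, and the cgroup path assumes cgroup v1 with Docker's default cgroupfs layout, so it may differ on your system), you can start a container with an explicit --memory value and read back the corresponding cgroup file:

$ docker run -d --name memory-demo --memory 100m busybox /bin/sh -c "while true; do sleep 2; done"
$ docker inspect memory-demo -f "{{.HostConfig.Memory}}"
104857600
$ cat /sys/fs/cgroup/memory/docker/$(docker inspect memory-demo -f "{{.Id}}")/memory.limit_in_bytes
104857600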
Cgroups can be inspected easily through the /proc and /sys pseudo-filesystems, so it is easy to check how the container's memory cgroup is configured. In the container's PID namespace the root process has pid 1, but outside that namespace it has a system-level pid, which we can use to find its cgroups:
$ ps ax | grep /bin/sh
 9513 ?        Ss     0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/9513/cgroup
...
6:memory:/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574
...
I have listed only the memory cgroup, which is the one we care about. You can see the cgroup hierarchy mentioned earlier in the path. The important points are: first, the path begins with the kubepods cgroup, so our process inherits every property of that group, as well as the properties of burstable (Kubernetes places this Pod in the burstable QoS class) and of a cgroup that represents the Pod itself. The last segment of the path is the cgroup our process actually uses. We can append it to /sys/fs/cgroup/memory to get more information:
$ ls -l /sys/fs/cgroup/memory/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574
...
-rw-r--r-- 1 root root 0 Oct 27 19:53 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 27 19:53 memory.soft_limit_in_bytes
Once again, I have listed only the entries we care about. We will not focus on memory.soft_limit_in_bytes for now; instead we turn to memory.limit_in_bytes, which sets the memory limit. It is the equivalent of the --memory parameter of the Docker command, which is to say the memory resource limit in Kubernetes. Let's see:
$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574/memory.limit_in_bytes
9223372036854771712
This is what ends up on my node when no resource limit is set; there is a simple explanation for this value here (https://unix.stackexchange.com/questions/420906/what-is-the-value-for-the-cgroups-limit-in-bytes-if-the-memory-is-not-restricte). So when no memory limit is set in Kubernetes, Docker sets HostConfig.Memory to 0, which in turn puts the container process in a memory cgroup whose memory.limit_in_bytes has the default value of "no limit". Now let's create a Pod with a 100MiB memory limit:
$ kubectl run limit-test --image=busybox --limits "memory=100Mi" --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created
Once again, we use kubectl to verify our resource configuration:
$ kubectl get pods limit-test-5f5c7dc87d-8qtdx -o=jsonpath='{.spec.containers[0].resources}'
map[limits:map[memory:100Mi] requests:map[memory:100Mi]]
You will notice that in addition to the limits we set, the Pod also gained a requests entry. When you set limits but not requests, Kubernetes defaults requests to the same value as limits. This makes a lot of sense from the scheduler's point of view, and I will discuss requests further below. Now that the Pod has started, we can see how Docker configured the container and the process's memory cgroup:
$ docker ps | grep busy | cut -d' ' -f1
8fec6c7b6119
$ docker inspect 8fec6c7b6119 --format '{{.HostConfig.Memory}}'
104857600
$ ps ax | grep /bin/sh
29532 ?        Ss     0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/29532/cgroup
...
6:memory:/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11
$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11/memory.limit_in_bytes
104857600
As you can see, Docker correctly set the memory cgroup for this process based on our containerSpec. But what does this mean at run time? Linux memory management is a complex topic, but what Kubernetes engineers need to know is this: when a host comes under memory pressure, the kernel may selectively kill processes, and a process that uses more memory than its limit has a much higher chance of being killed. Because part of Kubernetes's job is to pack as many Pods onto a node as possible, nodes can come under unusually high memory pressure. If your container uses too much memory, it is very likely to be oom-killed; when Docker gets that notification from the kernel, Kubernetes finds the container and, depending on the Pod's settings, tries to restart it.
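Here is a minimal sketch of how you could watch this happen, assuming a hypothetical Pod named oom-demo whose container deliberately allocates more memory than its 100MiB limit (the polinux/stress image, the allocation command, and the file name oom-demo.yaml are illustrative assumptions, not part of the original example):

apiVersion: v1
kind: Pod
metadata:
  name: oom-demo                # illustrative name
spec:
  containers:
  - name: stress
    image: polinux/stress       # assumed memory-stress image
    command: ["stress", "--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
    resources:
      limits:
        memory: 100Mi           # the container tries to allocate well beyond this

$ kubectl apply -f oom-demo.yaml
$ kubectl get pod oom-demo -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# expected once the container has been oom-killed and restarted at least once:
OOMKilled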
So what about the memory requests that Kubernetes created by default? Does a 100MiB memory request affect the cgroup? Perhaps it sets the memory.soft_limit_in_bytes we saw earlier? Let's see:
$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11/memory.soft_limit_in_bytes
9223372036854771712
You can see that the soft limit is still set to the default "no limit" value. Even though Docker supports setting it through the --memory-reservation parameter, Kubernetes does not use it. Does that mean it is unimportant to specify memory requests for your container? No. Requests are, if anything, more important than limits. Limits tell the Linux kernel when it may kill your process to reclaim memory; requests help the Kubernetes scheduler find the right node to run your Pod. If you don't set them, or set them too low, there can be bad consequences.
For example, suppose you run a Pod without configuring memory requests but with a fairly high limit. Since Kubernetes defaults requests to the value of limits, the Pod may fail to schedule if no node has that much free memory, even though it does not actually need it. On the other hand, if you run a Pod with a requests value that is too low, you are effectively inviting the kernel to oom-kill it. Why? Suppose your Pod normally uses 100MiB of memory, but you configure only 50MiB of memory requests for it. If there is a node with 75MiB of free memory, the Pod will be scheduled onto it. When the Pod's memory consumption grows to 100MiB it puts that node under pressure, and the kernel may choose to kill your process. So we need to configure a Pod's memory requests and limits correctly.
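As a closing sketch (the 100Mi and 150Mi values here are illustrative, not taken from the example above), a sensible configuration sets requests close to the workload's typical memory usage and limits somewhat above it:

resources:
  requests:
    memory: 100Mi   # roughly the typical working-set size, used by the scheduler
  limits:
    memory: 150Mi   # headroom before the kernel may oom-kill the process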