What are the decision factors affecting Kubernetes scheduling?


This article looks at the decision-making factors that affect Kubernetes scheduling and at the tools you can use to influence where your Pods are placed.

Which node has available resources?

When selecting an appropriate node, the scheduler checks whether each node has sufficient resources to run the Pod. If you have declared the CPU and memory the Pod requires (through requests and limits), the scheduler uses the following formula to calculate the memory available for scheduling on a given node:

Schedulable memory = total node memory - reserved memory

Reserved memory refers to:

Memory used by Kubernetes daemons such as kubelet and containerd (a container runtime).

Memory used by the node's operating system, for example kernel daemons.

By using this equation, the scheduler ensures that too many Pods do not compete for all of a node's resources, which could otherwise cause problems such as triggering the OOM killer.
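For reference, here is a minimal sketch of a Pod that declares the requests and limits the scheduler works with; the Pod name and the specific values are illustrative, not taken from this article:

apiVersion: v1
kind: Pod
metadata:
  name: web                  # hypothetical name, for illustration
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:              # the scheduler filters and ranks nodes using these values
        cpu: "250m"
        memory: "256Mi"
      limits:                # enforced at runtime by the kubelet and container runtime
        cpu: "500m"
        memory: "512Mi"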

Influencing the scheduling process

Without any user intervention, the scheduler performs the following steps when assigning a Pod to a node:

The scheduler detects that a new Pod has been created but has not yet been assigned to a node.

It examines the Pod's requirements and filters out all unsuitable nodes.

The remaining nodes are ranked by weight, with the highest-weighted node first.

The scheduler selects the first node in the sorted list and assigns the Pod to it.
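To see the outcome of these steps for a particular Pod, you can check which node it was assigned to and read the scheduler's events; the Pod name below is a placeholder:

kubectl get pods -o wide           # the NODE column shows where each Pod was placed
kubectl describe pod <pod-name>    # the Events section shows "Scheduled" or "FailedScheduling" with the reason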

Typically, we let the scheduler select the appropriate node automatically (provided the Pod is configured with resource requests and limits). Sometimes, however, you need to influence this decision, either by forcing the scheduler to pick a specific node or by adding weight to certain nodes so that they are favored over others for Pod scheduling.

Let's see how we can do this:

Node name

In the simplest node selection configuration, you can force a Pod to run on a specific node by putting that node's name in .spec.nodeName. For example, the following YAML forces the Pod to be scheduled on app-prod01:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: app-prod01

Note that this is the simplest but least recommended node selection method, for the following reasons:

If a node with the specified name cannot be found (for example, because the hostname has changed), the Pod will not run.

If the node does not have the resources the Pod needs, the Pod will fail and will not be rescheduled to another node.

It tightly couples Pods to their nodes, which is bad design practice.

Node selector

The first and easiest way to influence the scheduler's decision is the .spec.nodeSelector field in the Pod definition or Pod template (if you are using a controller such as a Deployment). nodeSelector accepts one or more key-value label pairs that must be set on a node for the Pod to be scheduled there. Suppose you recently bought two machines with SSD disks and want all database-related Pods scheduled on SSD-backed nodes for the best database performance. The YAML for the DB Pod might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
  - name: mongodb
    image: mongo
  nodeSelector:
    disktype: ssd

With this definition, only nodes carrying the disktype=ssd label are considered when the scheduler selects a node for the Pod.

In addition, you can use any of the built-in labels automatically assigned to nodes to influence the selection, for example the node hostname (kubernetes.io/hostname), architecture (kubernetes.io/arch), or operating system (kubernetes.io/os).
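For the nodeSelector above to match, the disktype=ssd label must actually be present on a node. A quick sketch of how you might inspect and set it, with a placeholder node name:

kubectl get nodes --show-labels               # lists each node with all its labels, including the built-in ones
kubectl label nodes <node-name> disktype=ssd  # adds the label assumed by the DB Pod example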

Node affinity

Node selectors are useful when you need a specific kind of node to run your Pods, but they are limited: only nodes matching all of the defined labels are considered for Pod placement. Node affinity gives you more flexibility by letting you define hard and soft requirements. A hard requirement must match for a node to be selected. Soft requirements, on the other hand, add weight to nodes with specific labels so that they rank higher than their peers; nodes without the soft-requirement labels are not ignored, they simply receive a lower weight.

Let's take an example: our database is I/O-intensive, so we need the database Pods to always run on SSD-backed nodes. In addition, if the Pods land on nodes in zone1 or zone2, which are physically closer to the application nodes, latency will be lower. A Pod definition that meets these needs might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - zone1
            - zone2
  containers:
  - name: db
    image: mongo

The nodeAffinity section uses the following parameters to define hard and soft requirements:

requiredDuringSchedulingIgnoredDuringExecution: when deploying the DB Pod, the node must carry the disk-type=ssd label.

preferredDuringSchedulingIgnoredDuringExecution: when ranking nodes, the scheduler gives higher weight to nodes labeled zone=zone1 or zone=zone2. A node with disk-type=ssd and zone=zone1 is preferred over one that has disk-type=ssd but no zone label, or one in a different zone. The weight can be any value between 1 and 100; the larger the number, the higher the matching node ranks relative to the others.

Note that node affinity gives you more freedom to express which labels should (or should not) exist on the target node. In this example we use the In operator to list several values, any one of which may match on the target node. Other operators are NotIn, Exists, DoesNotExist, Gt (greater than), and Lt (less than). It is worth noting that NotIn and DoesNotExist implement what is known as node anti-affinity.

Node affinity and node selectors are not mutually exclusive; they can coexist in the same definition. In that case, both the node selector and the node affinity hard requirements must be satisfied, as sketched below.
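As an illustrative sketch of that rule (reusing the labels from the earlier examples), a Pod combining both mechanisms is only scheduled on a node that satisfies the nodeSelector and the affinity hard requirement:

apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  nodeSelector:
    disktype: ssd            # must match...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone        # ...and so must this hard requirement
            operator: In
            values:
            - zone1
  containers:
  - name: db
    image: mongo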

Pod affinity

Node selectors and node affinity (and anti-affinity) help us influence the scheduler's decision about where to place Pods. However, they only select based on node labels; they take no account of the labels on the Pods themselves. You may need to select based on Pod labels in situations such as these:

All middleware Pods need to be placed on the same node as the Pods labeled role=frontend, to reduce the network latency between them.

As a security best practice, we do not want middleware Pods to be co-located with the Pods that handle user authentication (role=auth). This is not a strict requirement.

As you can see, these requirements cannot be met with node selectors or node affinity, because Pod labels are not considered in that selection process; only node labels are.

To meet these needs, we use Pod affinity and anti-affinity. In essence, they work the same way as node affinity and anti-affinity: a hard requirement must be met for a node to be selected, while a soft requirement increases a node's weight and thus its chance of being selected, without being mandatory. Here is an example:

apiVersion: v1
kind: Pod
metadata:
  name: middleware
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: role
              operator: In
              values:
              - auth
          topologyKey: kubernetes.io/hostname
  containers:
  - name: middleware
    image: redis

In the above Pod definition file, we set the hard and soft requirements as follows:

requiredDuringSchedulingIgnoredDuringExecution: our Pod must be scheduled on a node that is already running a Pod labeled role=frontend.

preferredDuringSchedulingIgnoredDuringExecution: our Pod should not (but may) be scheduled on a node running a Pod labeled role=auth. As with node affinity, the soft requirement sets a weight from 1 to 100 to adjust a node's ranking relative to other nodes. In our example the soft requirement sits under podAntiAffinity, making a node that runs a Pod labeled role=auth less likely to be selected when the scheduler makes its decision.

topologyKey is used to make finer-grained decisions about the domain in which the rules apply. It accepts a label key that must be present on the nodes considered during selection. In our example we use kubernetes.io/hostname, an auto-populated label added to every node by default that holds the node's hostname. You can also use other auto-populated labels, or even custom ones; for example, you might want the Pod affinity rule to apply per rack or per zone.
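For instance, here is a sketch of the same affinity rule applied per zone instead of per host, assuming the nodes carry the well-known topology.kubernetes.io/zone label; the Pod name is illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: middleware-zonal     # hypothetical name, for illustration
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
        topologyKey: topology.kubernetes.io/zone   # co-locate within the same zone, not necessarily the same node
  containers:
  - name: middleware
    image: redis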

A note about IgnoredDuringExecution

You may have noticed that both the hard and soft requirements carry the IgnoredDuringExecution suffix. This means that once the scheduling decision has been made, the scheduler does not try to move Pods that are already placed, even if the conditions change. For example, suppose a node affinity rule schedules a Pod to a node labeled app=prod. If that label later changes to app=dev, the existing Pod is not terminated and relaunched on another node carrying app=prod. This behavior may change in the future to let the scheduler keep checking node and Pod affinity (and anti-affinity) rules after deployment.

Taints and tolerations

In some scenarios you may want to prevent Pods from being scheduled onto specific nodes. Perhaps you are running tests or scanning a node for threats and don't want your applications to be affected. Node anti-affinity can achieve this, but it is a significant administrative burden because you would need to add anti-affinity rules to every new Pod deployed to the cluster. For this scenario, you should use a taint instead.

When a node carries a taint, Pods cannot be scheduled onto it unless they tolerate the taint. A toleration is simply a key-value pair that matches the taint. Let's illustrate with an example:

The host web01 needs to be tainted so that it accepts no more Pods. The taint command is as follows:

kubectl taint nodes web01 locked=true:NoSchedule

The above command places a taint on the node named web01 with the following attributes:

A key-value pair, locked=true, which any Pod that should run on the node must tolerate.

A taint effect of NoSchedule. The effect defines how the taint behaves, and it can be one of the following:

NoSchedule: Pods will not be scheduled onto this node.

PreferNoSchedule: the scheduler avoids this node when possible, similar to a soft affinity rule.

NoExecute: new Pods are not scheduled onto the node, and existing Pods that do not tolerate the taint are evicted from it.
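As a quick sketch of how these effects are applied and removed in practice, using the web01 node and the locked=true taint from this example:

kubectl taint nodes web01 locked=true:PreferNoSchedule   # softer taint: the scheduler avoids the node when possible
kubectl taint nodes web01 locked=true:NoExecute          # also evicts running Pods that do not tolerate the taint
kubectl taint nodes web01 locked=true:NoSchedule-        # a trailing dash removes the taint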

A Pod that tolerates this taint and can run on web01 might be defined like this:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
  tolerations:
  - key: "locked"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Let's take a closer look at the tolerations part of this definition:

For a valid toleration we specify the key (locked), the value (true), and the operator.

The operator can be one of two values:

Equal: the key, value, and effect must all match the node's taint.

Exists: only the key needs to match the taint; the value is not compared.

If you use the Exists operator and omit the key, value, and effect entirely, the toleration matches everything, and a Pod carrying it can be scheduled onto a node with any taint.
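A minimal sketch of such a catch-all toleration; the Pod and container names are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: tolerate-everything  # hypothetical name, for illustration
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - operator: "Exists"       # no key, value, or effect specified: matches any taint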

Note that adding a toleration to a Pod does not guarantee that it will be placed on the tainted node; it only allows that to happen. If you want to force the Pod onto the tainted node, you must also add node affinity to its definition, as discussed earlier.

TL;DR

Automatic placement of Pods on nodes is one of the reasons Kubernetes exists. As an administrator, as long as you declare your Pods' requirements properly, you don't have to worry about whether a node has enough free resources to run them. Sometimes, however, you need to intervene and override the scheduler's decision about where to place Pods. In this article we discussed several ways to influence that decision. Here is a quick recap:

Node name: you can force a Pod to run on a particular node by putting the node's hostname in the Pod's .spec.nodeName field. The scheduler's selection algorithm is bypassed entirely. This method is not recommended.

Node selector: by labeling nodes, you can use the Pod's nodeSelector field to specify one or more key-value labels that must exist on a node for it to be selected to run the Pod. This approach is recommended because it adds flexibility and keeps the node-Pod relationship loosely coupled.

Node affinity: this approach adds even more flexibility in choosing which nodes are considered for a particular Pod. With node affinity, a Pod can strictly require nodes with specific labels, or merely express a preference for them by having the scheduler give those nodes more weight.

Pod affinity and anti-affinity: use this method when a Pod should (or should not) be co-located with other Pods on the same node. Pod affinity steers a Pod toward nodes that already run Pods with a specific label, while Pod anti-affinity tells the scheduler to avoid nodes running Pods with a specific label.

Taints and tolerations: with this approach you don't decide which node a Pod goes to; you decide whether a node accepts all Pods or only selected ones. By tainting a node, you tell the scheduler not to consider it for any Pod unless the Pod tolerates the taint. A toleration consists of a key, a value, and a taint effect.

