
How to analyze Taint, Toleration and Node Affinity in Kubernetes Advanced scheduling


Many readers who are new to Kubernetes are unsure how to analyze Taint, Toleration and Node Affinity in advanced scheduling. This article summarizes the concepts, measured behavior and common pitfalls; I hope it helps you work through the problem.


1. Taint and Toleration

Theoretical support for taints

1.1 What are the taint effects?

Taint effects (effect):

PreferNoSchedule: the scheduler tries to avoid placing a Pod on a node with this taint. If that cannot be avoided (for example, other nodes lack resources), the Pod can still be scheduled onto the tainted node.

NoSchedule: a Pod that does not tolerate this taint will never be scheduled onto the node; Pods managed directly by the kubelet (static Pods) are not restricted. Pods that were already running on the node before the taint was added can continue to run.

NoExecute: the scheduler will not place new Pods on a node with this taint, and Pods already running on the node that do not tolerate it are evicted.

The precondition for all of this is that the taint key on the node matches the toleration key on the Pod.
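As a minimal sketch of this key matching (the node name node-1 and the taint key dedicated=gpu are hypothetical, not from the article):

kubectl taint nodes node-1 dedicated=gpu:NoSchedule

tolerations:
- key: dedicated
  operator: Equal
  value: gpu
  effect: NoSchedule

Only a Pod carrying a toleration whose key (and, for Equal, value) matches the node's taint is considered to tolerate it.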

1.2 Measured behavior of taint effects

When the Pod has no toleration and the node has a taint:

When the node's taint effect is PreferNoSchedule, Pods already on the node are not evicted; a new Pod without a matching toleration can still be scheduled onto the node.

When the node's taint effect is NoSchedule, Pods already on the node are not evicted, but new Pods will not be scheduled onto it.

When the node's taint effect is NoExecute, Pods already on the node are evicted (the eviction delay is controlled by the tolerationSeconds field; a value less than or equal to 0 means immediate eviction), and new Pods will not be scheduled onto it.
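For reference, the three effects can be applied and removed with kubectl as follows (the node name and key are placeholders, not from the article):

kubectl taint nodes node-1 key1=value1:PreferNoSchedule
kubectl taint nodes node-1 key1=value1:NoSchedule
kubectl taint nodes node-1 key1=value1:NoExecute

# the trailing "-" removes a taint again
kubectl taint nodes node-1 key1=value1:NoExecute-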

When the node has a taint and the Pod has the corresponding toleration, the measured results are described below.

In the toleration settings, what is the difference between the two operators, Exists and Equal?

In the configuration:

With Exists, the value must be omitted (or an empty string); only whether the key matches the node's taint key matters.

With Equal, both key and value must be set, and they must match the key and value of the node's taint.
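A short side-by-side sketch of the two operators, reusing the status=unavailable taint that appears later in the article:

# Exists: value is omitted; only the key has to match
tolerations:
- key: status
  operator: Exists
  effect: NoSchedule

# Equal: key and value must both match the node's taint
tolerations:
- key: status
  operator: Equal
  value: unavailable
  effect: NoSchedule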

To deepen the understanding of the two:

If a node has multiple taints and the Pod uses Exists to tolerate only one of them, the Pod still cannot be scheduled onto the node, because it does not tolerate the node's other taints.

If a node has multiple taints and the Pod uses Equal to tolerate only one of them, the Pod likewise cannot be scheduled onto the node, because it still does not tolerate the node's other taints.

If you want a Pod to be scheduled onto a node that carries multiple taints, the Pod must tolerate every taint on that node.
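For example, a node carrying two taints (both hypothetical) requires one toleration per taint before the Pod can land there:

kubectl taint nodes node-1 status=unavailable:NoSchedule
kubectl taint nodes node-1 disk=ssd-only:NoSchedule

tolerations:
- key: status
  operator: Equal
  value: unavailable
  effect: NoSchedule
- key: disk
  operator: Exists
  effect: NoSchedule

With only one of the two tolerations, scheduling onto node-1 still fails.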

1.3 Tips for tolerations

In a toleration, if key and value are empty and the operator is Exists, the Pod tolerates all taint keys on a node (note: the effect field still applies). For example, if a Pod's toleration has empty key and value, operator Exists and effect NoExecute, the Pod will still not be scheduled onto a node whose taint effect is NoSchedule.

When setting a toleration, if the Pod's effect field is left empty, the toleration matches all taint effects.

When setting a toleration, if key and value are empty, the operator is Exists, and the effect is also empty, the Pod tolerates every taint on every node, as if no taints were set at all.

By default, the operator is Equal.

If the node's taint effect is NoExecute and you do not want the Pod to be evicted immediately, you can set tolerationSeconds (a delay before eviction). If the value is 0 or negative, the Pod is evicted immediately; if it is greater than 0, eviction starts after that many seconds.
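A sketch of a toleration that delays eviction, assuming a hypothetical 300-second grace period:

tolerations:
- key: status
  operator: Equal
  value: unavailable
  effect: NoExecute
  tolerationSeconds: 300

Such a Pod may stay on the node for 300 seconds after a matching NoExecute taint is added, and is evicted afterwards.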

From the test results: as long as the node has a taint with effect NoExecute, the Pod cannot run normally on that node regardless of whether it tolerates the taint (it keeps being deleted and recreated). The reason is that being scheduled onto the node is the scheduler's decision, while killing the Pod is the local kubelet's decision; the two components act independently. The only exception is the following configuration.

tolerations:
- operator: Exists

# With this toleration the Pod tolerates all taints and all effects, so it can be scheduled onto every node, including nodes whose taint effect is NoExecute.

1.4 Common misconceptions

1.4.1 Once a node is tainted, a Pod can be scheduled and run normally as long as it tolerates the taint. (wrong)

When the node's taint effect is NoExecute and the Pod's toleration for this taint also uses effect NoExecute (with the configuration below, which sets tolerationSeconds: 0), the Pod can be scheduled, but it keeps cycling through Terminating and ContainerCreating.

Taint the node:

kubectl taint nodes 1xx status=unavailable:NoExecute

Toleration set on the Pod:

tolerations:
- effect: NoExecute
  key: status
  operator: Equal
  tolerationSeconds: 0
  value: unavailable

The exception:

tolerations:
- operator: Exists

# As noted above, this toleration matches all taints and all effects, so the Pod can be scheduled onto (and keep running on) every node, including nodes whose taint effect is NoExecute.

1.4.2 When a node has multiple taints, the Pod can be scheduled onto the node as long as the Exists operator matches one of them. (wrong)

The Pod only matches one of the taints and still fails to match the others, so it cannot be scheduled.

1.4.3 When setting a toleration, you only need to match key and value; the effect does not matter. (wrong)

The effect setting is just as important as (if not more important than) key and value. If the toleration's effect does not match, the Pod will also fail to be scheduled onto the node.

1.4.4 If a Pod has no tolerations at all, it can never be scheduled onto a tainted node. (wrong)

If the node's taint effect is PreferNoSchedule, a Pod without any toleration can still be scheduled onto the node. PreferNoSchedule only means "prefer not to schedule here"; when no other node is available, the Pod can still land on this node.

2. Node Affinity

Node Affinity allows a given application to be scheduled onto specified nodes. This improves application stability, reduces resource contention between important and unimportant workloads, and limits the impact of unimportant services on important ones. It can also be used to isolate tenants, providing each tenant with a dedicated runtime environment according to its needs.

2.1 Key points of Node Affinity configuration

There are two main categories of Node Affinity configuration:

requiredDuringSchedulingIgnoredDuringExecution (strong affinity)

preferredDuringSchedulingIgnoredDuringExecution (preferred affinity)

However, real-world configuration raises several questions:

1. How strong is strong affinity?

2. What does preferred affinity actually prefer?

3. Strong affinity can be configured in two ways; what is the difference between them?

4. How does the weight value in preferred affinity work? Does a higher value mean a higher priority, or does a smaller value rank higher (with 1 being the highest)?

5. In a preferred-affinity configuration, if a Pod matches several Labels on node A and a single Label on node B, and the sum of node A's Label weights equals the weight of node B's single Label, will the Pod be scheduled onto node A first?

6. When scaling down, are Pods killed starting from the nodes with low weight?

7. If a Pod is bound to a node through strong affinity and is running normally on it, will deleting the node's Label cause the Pod to be restarted or to drift to another node?

We cannot answer these questions by guessing from comments and intuition alone; we have to verify them through actual testing, otherwise the cost of changing the configuration once it is already in production will be much higher.

Strong affinity: requiredDuringSchedulingIgnoredDuringExecution

Example node Label settings:

level: important, general, unimportant
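For illustration, labels like these could be applied with kubectl; the node addresses follow the 10.x.x.* placeholders used below, and the exact label-to-node assignment is an assumption, since the original table is not reproduced here:

kubectl label nodes 10.x.x.80 level=important app=1
kubectl label nodes 10.x.x.85 master=1
kubectl label nodes 10.x.x.78 level=general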

Pod configuration and the matching logic:

Note: strong affinity can be configured in two ways: an AND operation and an OR operation.

requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: level
      operator: In
      values:
      - important
    - key: app
      operator: In
      values:
      - "1"

In the AND configuration, both the level=important Label and the app=1 Label appear in the same matchExpressions entry. In other words, the Pod will only select nodes that carry both Labels.

With this Node Affinity setting, the two Labels are intersected: only nodes that satisfy both Labels enter the Pod's scheduling pool. In this example, only 10.x.x.80 qualifies, so the Pod can only be scheduled onto that node; if that node does not have enough resources, scheduling of the Pod fails.

Pod OR-operation configuration:

requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: level
      operator: In
      values:
      - important
  - matchExpressions:
    - key: level
      operator: In
      values:
      - unimportant

In the OR configuration, nodeSelectorTerms contains several separate matchExpressions entries. In other words, a node only needs to match one of the entries, not all of them, to be selectable by the Pod.

For example:

The node Labels follow the previous example. As long as a node satisfies one of the Pod's Label requirements, it is included in the Pod's scheduling pool. Here the candidate nodes are: 10.x.x.78, 10.x.x.79, 10.x.x.80, 10.x.x.86, 10.x.x.87, 10.x.x.88.

Preferred affinity: preferredDuringSchedulingIgnoredDuringExecution

Its attitude is: schedule the Pod onto a node with the specified Label if possible; if not, it is not mandatory, and the Pod may choose another node, even one that has no Labels at all or whose Labels do not match at all.

Pod preferred affinity settings:

preferredDuringSchedulingIgnoredDuringExecution:
- preference:
    matchExpressions:
    - key: level
      operator: In
      values:
      - important
  weight: 5
- preference:
    matchExpressions:
    - key: app
      operator: In
      values:
      - "1"
  weight: 4
- preference:
    matchExpressions:
    - key: master
      operator: In
      values:
      - "1"
  weight: 10

Example: the node Labels follow the previous example. With the configuration above, we observe the following:

In the test, the Pod is scheduled onto 10.x.x.85 first: app=1 has weight 4 and level=important has weight 5, so node 10.x.x.80 scores 9, which is still lower than the score of node 10.x.x.85 (which matches master=1, weight 10).

2.2 Answers to the questions above

The difference between strong affinity and preferred affinity shows up in how the Pod selects nodes. With strong affinity, if a node cannot satisfy the Pod's Label requirements, the Pod will never be scheduled onto that node, even if this means the Pod fails to be scheduled at all (yes, it is that stubborn). With preferred affinity, scheduling onto the optimal node is great, but if that is not possible the second best will do; there is always some node that works. (answer question 1)

Preference is determined by the weight values configured on the Pod, not by how many of the node's Labels are matched. (answer question 2)

The preferred affinity configuration has an extra weight field. The higher the value, the greater the weight and the more likely the Pod is to be scheduled onto a node carrying that Label. (answer question 4)

If a node carries multiple Labels and satisfies all of the Pod's Label requirements, but the sum of the weights of those Labels is still lower than the weight of another node's single Label, the Pod prefers the node with the higher total weight. If the Pod matches all of node A's Labels and a single Label of node B, and node A's Label weights sum to exactly the weight of node B's single Label, then whether the Pod goes to A or B is random (random as far as affinity is concerned; other factors apply in practice). The number of matched Labels is irrelevant. (answer question 5)

When creating or scaling up Pods, nodes with the highest total matched Label weight are preferred. If such a node cannot accept the Pod for other reasons (for example, insufficient memory), the node with the next highest weight is chosen, and finally nodes whose Labels do not match or that have no Labels at all. When scaling down, however, it is worth noting that Pods are killed at random rather than starting from the low-weight nodes as we expected. (answer question 6)

The answer is no: a running Pod will not be moved to a new node when the node's Label is deleted. Only when the Pod is recreated for some reason (meaning its name changes and scheduling is triggered again; if the name does not change, the scheduler is not involved and the Pod merely restarts in place) will it be scheduled onto a node that satisfies the affinity rules. (answer question 7)
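The label-deletion test behind question 7 can be reproduced with a command like the following (the node address is the article's placeholder):

kubectl label nodes 10.x.x.80 level-

The running Pod stays where it is; the affinity rules are only re-evaluated when the Pod is recreated and scheduled again.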

3. Summary of the use of taints and Node Affinity

Taints are usually used in an exclusionary way: a node should reject most Pods and accept only a few, or should not take part in the workload at all. Typical examples: master nodes do not schedule workload Pods, to keep the master components stable; and nodes with special resources such as GPUs, which most applications do not need but a few applications require.
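Two common illustrations of this exclusionary pattern (the master taint key shown is the one traditionally applied by kubeadm; newer versions use node-role.kubernetes.io/control-plane, and the GPU key is just an example):

kubectl taint nodes master-1 node-role.kubernetes.io/master=:NoSchedule

kubectl taint nodes gpu-node-1 gpu=true:NoSchedule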

Node Affinity, by contrast, is usually used in an inclusive way: we want an application's Pods to be deployed onto a specific set of nodes. It can also be used for exclusion, commonly called node anti-affinity, simply by setting the operator to NotIn.
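A minimal sketch of such an anti-affinity rule, reusing the level label from the earlier example (the rule itself is hypothetical):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: level
          operator: NotIn
          values:
          - unimportant

This keeps the Pod off any node labeled level=unimportant.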

For taints: if a node's taint effect is NoSchedule or NoExecute, Pods without a matching toleration can never be scheduled onto that node.

For Node Affinity: if nodes have Labels but a Pod has no Node Affinity settings at all, the Pod can still be scheduled onto those nodes.

After reading the above, have you mastered how to analyze Taint, Toleration and Node Affinity in Kubernetes advanced scheduling? Thank you for reading!
