This article explains how to achieve a high SLO on a large-scale Kubernetes cluster. The content is fairly detailed; interested readers are welcome to use it as a reference, and I hope it is helpful to you.
Why SLO?
Gartner defines SLO as follows: within the framework of an SLA, the SLO is the target the system must achieve, ensuring as far as possible that callers succeed. Since SLI/SLO/SLA are easy to confuse, let's first look at how the three relate:
SLI defines a measurement indicator that describes whether a service meets a "good" standard, for example that a Pod is delivered within 1 minute. SLIs are usually defined in terms of latency, availability, throughput, and success rate.
SLO defines a target: the percentage of an SLI that meets the "good" standard over a period of time, for example that 99% of Pods are delivered within 1 minute. Once a service publishes its SLO, users form expectations about its quality.
SLA is an agreement derived from the SLO; it is typically used to determine what the service provider owes when the target ratio defined by the SLO is not met. SLAs are usually written down as legally binding contracts and are used between a service provider and external customers (for example, Aliyun and Aliyun's users). When an SLO between internal services is broken, the consequence is usually accountability rather than financial compensation.
Therefore, within the system we focus mainly on SLOs.
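To make the SLI/SLO relationship above concrete, here is a minimal Go sketch that computes the fraction of Pod deliveries meeting the 1-minute SLI and checks it against a 99% SLO; the record type and field names are illustrative assumptions, not taken from the system described in this article.

```go
package main

import (
	"fmt"
	"time"
)

// DeliveryRecord is a hypothetical summary of one Pod delivery attempt.
type DeliveryRecord struct {
	PodUID    string
	Duration  time.Duration // time from creation request to Ready
	Succeeded bool
}

// sloAttainment returns the fraction of deliveries that met the SLI:
// "Pod becomes Ready within the deadline".
func sloAttainment(records []DeliveryRecord, deadline time.Duration) float64 {
	if len(records) == 0 {
		return 1.0
	}
	good := 0
	for _, r := range records {
		if r.Succeeded && r.Duration <= deadline {
			good++
		}
	}
	return float64(good) / float64(len(records))
}

func main() {
	records := []DeliveryRecord{
		{"pod-a", 40 * time.Second, true},
		{"pod-b", 55 * time.Second, true},
		{"pod-c", 90 * time.Second, true}, // too slow: violates the SLI
	}
	attained := sloAttainment(records, time.Minute)
	fmt.Printf("SLI attainment: %.2f%% (SLO target: 99%%, met: %v)\n",
		attained*100, attained >= 0.99)
}
```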
What do we care about in a large K8s cluster?
As production environments keep evolving, K8s clusters become more complex and larger in scale. Ensuring the availability of a K8s cluster at large scale is a hard problem for many vendors. For K8s clusters, we usually care about the following questions:
The first question is whether the cluster is healthy: are all components working properly, and how many Pod creations are failing in the cluster? This is a question of overall metrics.
The second question is what is happening in the cluster: are there any exceptions, and what are users doing in the cluster? This is a question of tracing capability.
The third question is, once an exception occurs, which component went wrong and caused the success rate to drop? This is a question of root-cause localization.
So, how do we solve the above problems?
First, we define a set of SLOs to describe the availability of the cluster.
Next, we must be able to trace the life cycle of Pods in the cluster; for failed Pods, we also need to analyze the cause of failure so that the abnormal component can be located quickly.
Finally, we need to eliminate cluster anomalies through targeted optimizations.
SLIs on Large K8s Cluster
Let's first take a look at some indicators of the cluster.
The first indicator: cluster health. It currently takes one of three values, Healthy/Warning/Fatal, which correspond to the alerting system: if a P2 alarm occurs, the cluster is Warning; if a P0 alarm occurs, the cluster is Fatal and must be handled immediately.
The second indicator: success rate, here meaning the success rate of Pod creation. This is a very important indicator: Ant creates millions of Pods per week, so a fluctuation in the success rate translates into a large number of Pod failures, and a drop in the Pod success rate is the most direct sign of an abnormal cluster.
The third indicator: the number of residual Terminating Pods. Why not a deletion success rate? Because at the scale of millions, even a Pod deletion success rate of 99.9% still leaves thousands of Terminating Pods. That many residual Pods occupy application capacity, which is unacceptable in a production environment.
The fourth indicator: service availability. It is measured by probes; a probe failure means the service is unavailable. This indicator is defined for the master components.
The last indicator: the number of faulty machines, a node-level indicator. Faulty machines are physical machines that cannot deliver Pods correctly, for example because the disk is full or the load is too high. Faulty machines must be quickly discovered, quickly isolated, and repaired in time, since every failure eats into cluster capacity.
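As a sketch of how such cluster-level indicators could be exposed to a monitoring system, here is a minimal example using the Prometheus Go client; the metric names and labels are assumptions for illustration, not those of the actual system.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Pod creation results, labeled by outcome, drive the success-rate SLI.
	podCreateTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "slo_pod_create_total",
			Help: "Pod creation attempts by result (success/failure).",
		},
		[]string{"result"},
	)

	// Residual Terminating Pods are tracked as an absolute count, not a rate.
	terminatingPods = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "slo_terminating_pods",
		Help: "Number of Pods stuck in Terminating state.",
	})

	// Faulty machines that cannot deliver Pods correctly.
	faultyNodes = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "slo_faulty_nodes",
		Help: "Number of nodes currently marked as faulty.",
	})
)

func init() {
	prometheus.MustRegister(podCreateTotal, terminatingPods, faultyNodes)
}
```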
Success criteria and failure reason classification
With the cluster indicators in place, we need to refine them and define the criteria for success.
Let's first look at the Pod creation success rate. We divide Pods into ordinary Pods and Job-class Pods: an ordinary Pod has RestartPolicy Always, while a Job-class Pod has RestartPolicy Never or OnFailure. Both have a delivery deadline, for example delivery must complete within 1 minute. The delivery criterion for an ordinary Pod is that it reaches Ready within 1 minute; for a Job-class Pod it is that the Pod reaches Running, Succeeded, or Failed within 1 minute. Of course, the creation time excludes the PostStartHook execution time.
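A minimal sketch of the creation success check just described might look like the following; the record fields and the way hook time is excluded are simplified assumptions, not the original implementation.

```go
package slo

import "time"

// PodRecord is a hypothetical summary of one Pod's observed lifecycle.
type PodRecord struct {
	RestartPolicy     string        // "Always", "Never", or "OnFailure"
	Phase             string        // "Pending", "Running", "Succeeded", "Failed"
	Ready             bool          // Ready condition reached
	Elapsed           time.Duration // creation request -> current state
	PostStartHookTime time.Duration // excluded from the delivery time
}

// deliveredInTime applies the success criteria: an ordinary Pod
// (RestartPolicy=Always) must be Ready within the deadline; a Job-class Pod
// (Never/OnFailure) must reach Running, Succeeded, or Failed.
func deliveredInTime(p PodRecord, deadline time.Duration) bool {
	effective := p.Elapsed - p.PostStartHookTime // exclude PostStartHook execution
	if effective > deadline {
		return false
	}
	if p.RestartPolicy == "Always" {
		return p.Ready
	}
	switch p.Phase {
	case "Running", "Succeeded", "Failed":
		return true
	}
	return false
}
```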
For Pod deletion, the success criterion is that the Pod is removed from etcd within the specified time. Likewise, the deletion time excludes the PreStop hook period.
For faulty machines, they must be discovered, isolated, and degraded as quickly as possible. For example, if a physical machine's disk becomes read-only, the node must be tainted within 1 minute. How long a faulty machine takes to recover depends on the cause of the failure; for example, if a system failure requires reinstalling the operating system, the recovery time is longer.
With these criteria in place, we also classified the reasons for Pod failures: some are caused by the system and need our attention, while others are caused by users and do not.
For example, RuntimeError is a system error: something is wrong with the underlying runtime. ImagePullFailed means Kubelet failed to download the image; because Ant has a webhook that validates image access, image download failures are usually caused by the system.
Failures caused by users cannot be fixed on the system side; we only expose these failure reasons to users through a query interface and let users resolve them themselves. ContainerCrashLoopBackOff, for example, is usually caused by the user's container exiting on its own.
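A sketch of how failure reasons might be split into system-owned and user-owned categories; only the reasons named above are mapped, and everything else is treated as unknown.

```go
package slo

// Owner says who is responsible for fixing a Pod failure.
type Owner int

const (
	OwnerSystem  Owner = iota // infrastructure team must investigate
	OwnerUser                 // surfaced to the user via the query interface
	OwnerUnknown
)

// classifyFailure maps a failure reason to its owner. RuntimeError and
// ImagePullFailed are treated as system problems as described above;
// ContainerCrashLoopBackOff is attributed to the user's container.
func classifyFailure(reason string) Owner {
	switch reason {
	case "RuntimeError", "ImagePullFailed":
		return OwnerSystem
	case "ContainerCrashLoopBackOff":
		return OwnerUser
	default:
		return OwnerUnknown
	}
}
```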
The infrastructure
Around the SLO goals we built a system that, on the one hand, shows the current cluster indicators to end users and operators and, on the other hand, lets its components cooperate to analyze the current cluster state, identify the factors affecting the SLO, and provide data support for improving the cluster's Pod delivery success rate.
Viewed top-down, the top-level components mainly face the various indicators, such as cluster health, Pod creation/deletion/upgrade success rates, the number of residual Pods, and the number of unhealthy nodes. Among them, the Display Board is what we usually call the monitoring dashboard.
We also built an Alert subsystem with flexible configuration: for each indicator it can trigger different alarm channels such as phone calls, text messages, and e-mail, based on either the percentage drop or the absolute value of the indicator.
The Analysis System produces a more detailed report on cluster operation by analyzing the historical indicator data together with the collected node metrics and master component metrics. Specifically:
The Weekly Report subsystem gives statistics on this week's Pod creations/deletions/upgrades in the current cluster, together with a summary of the failure reasons.
Terminating Pods Number lists the Pods added over a period of time that cannot be deleted through the normal K8s mechanism, along with the reasons they are left behind.
Unhealthy Nodes lists the total available time of all nodes in the cluster, the available time of each node, the operation and maintenance records of each node, and the nodes that cannot recover automatically and need manual intervention.
To support these functions, we developed a Trace System that analyzes and shows the specific reason why an individual Pod creation/deletion/upgrade failed. It consists of three modules: log and event collection, data analysis, and Pod lifecycle display:
The log and event collection module collects the operation logs and pod/node events of each master component and node component, and stores them indexed by pod/node.
The data analysis module analyzes the time spent in each stage of the Pod lifecycle and determines why a Pod failed or why a node became unavailable.
Finally, the Report module exposes an interface and UI to end users, showing them the Pod lifecycle and the cause of the error.
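As an illustration only, a per-Pod lifecycle record produced by such a trace system might look roughly like this; the stage names and fields are assumptions, not the actual data model.

```go
package trace

import "time"

// Stage is one step on the Pod delivery link, reconstructed from the
// collected component logs and events.
type Stage struct {
	Name      string // e.g. "scheduling", "volume-mount", "image-pull", "container-start"
	Component string // component responsible for this stage
	Start     time.Time
	End       time.Time
}

// PodTrace is the lifecycle record indexed by pod UID.
type PodTrace struct {
	PodUID    string
	Node      string
	Stages    []Stage
	Delivered bool
	Reason    string // failure reason when Delivered is false
}

// SlowestStage returns the stage that took the longest, which is usually
// where root-cause analysis starts.
func (t PodTrace) SlowestStage() (Stage, bool) {
	if len(t.Stages) == 0 {
		return Stage{}, false
	}
	slowest := t.Stages[0]
	for _, s := range t.Stages[1:] {
		if s.End.Sub(s.Start) > slowest.End.Sub(slowest.Start) {
			slowest = s
		}
	}
	return slowest, true
}
```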
The trace system
Next, let's take a failed Pod creation as an example and walk through the workflow of the tracing system.
After the user enters a pod UID, the tracing system finds the Pod's lifecycle analysis record through the pod index and determines whether the delivery succeeded. Of course, the stored data not only provides basic information to end users; it also lets us analyze the running state of the cluster and of each node from the Pod lifecycles across the cluster. For example, if too many Pods are scheduled onto hot nodes, their deliveries compete for node resources, the node load rises while its delivery capacity drops, and the final symptom is Pod delivery timeouts on that node.
As another example, historical statistics give us a baseline for the execution time of each stage in the Pod lifecycle. Using this baseline as the evaluation standard, we compare the average time and the time distribution across different versions of a component and give improvement suggestions. In addition, by looking at the share of the total Pod lifecycle taken by the steps each component is responsible for, we find the steps with the largest share and provide data support for subsequent optimization of Pod delivery time.
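A rough sketch of how a per-stage time baseline could be derived from historical samples, using a simple percentile over collected durations; the real analysis is certainly more elaborate.

```go
package trace

import (
	"sort"
	"time"
)

// stageBaseline returns the given percentile (e.g. 0.90) of historical
// durations for one lifecycle stage; the result serves as the baseline
// against which new component versions are compared.
func stageBaseline(samples []time.Duration, percentile float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(percentile * float64(len(sorted)-1))
	return sorted[idx]
}
```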
Node Metrics
A healthy cluster requires not only that the master components remain highly available; node stability cannot be ignored either.
If a Pod creation is compared to an RPC call, then each node is an RPC service provider, and the total capacity of the cluster equals the sum of the Pod creation requests each node can handle. Every unavailable node means a drop in cluster delivery capacity and available resources, so we must keep the cluster's nodes as highly available as possible; every failed Pod delivery/deletion/upgrade also means higher user cost and a worse experience, so only when the cluster's nodes stay healthy can the Pods scheduled onto them be delivered successfully.
In other words, we should not only detect node anomalies as early as possible, but also repair nodes as quickly as possible. By analyzing each component's role on the Pod delivery link, we added metrics for the various components and converted the host's running state into metrics collected into a database. Combined with the Pod delivery results on each node, we can build a model to predict node availability, analyze whether a node has an unrecoverable anomaly, and adjust each node's weight in the scheduler accordingly, thereby improving the Pod delivery success rate.
Failed Pod creations and upgrades can be resolved by user retries, but failed Pod deletions are different: although components keep retrying in line with K8s's final-state philosophy, dirty data eventually appears, for example a Pod already deleted from etcd while residue remains on the node. We therefore designed and implemented a patrol system that queries the apiserver for the Pods scheduled to the current node, compares them with what actually exists on the node, finds residual processes, containers, volume directories, cgroups, and network devices, and tries to release these residual resources through other means.
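A simplified sketch of the comparison step in such a patrol system: the set of Pods the apiserver expects on the node is diffed against what is actually present locally, and anything unattributed is reported as residue. The data sources are abstracted into plain maps here, purely for illustration.

```go
package patrol

// Residue lists local resources that no longer belong to any Pod scheduled
// to this node and should be cleaned up.
type Residue struct {
	Containers []string
	CgroupDirs []string
	VolumeDirs []string
}

// findResidue compares the Pods the apiserver says are scheduled here
// (keyed by pod UID) with the pod UIDs that local containers, cgroups and
// volume directories are attributed to.
func findResidue(expectedPods map[string]bool,
	localContainers, localCgroups, localVolumes map[string]string) Residue {

	var r Residue
	for name, podUID := range localContainers {
		if !expectedPods[podUID] {
			r.Containers = append(r.Containers, name)
		}
	}
	for dir, podUID := range localCgroups {
		if !expectedPods[podUID] {
			r.CgroupDirs = append(r.CgroupDirs, dir)
		}
	}
	for dir, podUID := range localVolumes {
		if !expectedPods[podUID] {
			r.VolumeDirs = append(r.VolumeDirs, dir)
		}
	}
	return r
}
```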
Unhealthy node
Next, let's look at how faulty machines are handled.
There are many data sources for identifying a faulty machine, mainly the node's monitoring indicators, for example:
A certain type of Volume fails to mount
NPD (Node Problem Detector), a framework from the community
The Trace system, for example Pod creation on a node repeatedly reporting image download failures
SLO, for example a large number of residual Pods on a single machine
We developed a number of Controllers to inspect for these faults and produce a list of faulty machines. One faulty machine can carry several faults. Depending on the fault, different actions are taken, mainly: applying a Taint to prevent Pod scheduling; lowering the Node's priority; or applying automatic remediation directly. Some special cases, such as a full disk, require manual investigation.
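As an illustration of the "apply a Taint to prevent Pod scheduling" action, here is a minimal client-go sketch (it assumes a reasonably recent client-go); the taint key is an assumed example and error handling is reduced to the essentials, so this is not the original Controller code.

```go
package faultmachine

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintFaultyNode keeps new Pods off a node by adding a NoSchedule taint
// that records the detected fault.
func taintFaultyNode(ctx context.Context, client kubernetes.Interface, nodeName, fault string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	taint := corev1.Taint{
		Key:    "example.com/fault", // assumed taint key, for illustration only
		Value:  fault,
		Effect: corev1.TaintEffectNoSchedule,
	}
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key {
			return nil // node is already tainted for a fault
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```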
The faulty-machine system produces a daily report showing what it has done that day. Developers can keep improving the whole faulty-machine handling system by adding Controllers and processing rules.
Tips on improving SLO
Next, let's share some ways to achieve a high SLO.
First, in the process of improving the success rate, the biggest problem we faced was image downloads. A Pod must be delivered within the specified time, and image downloads usually take a lot of time. For this we defined a dedicated error, ImagePullCostTime, based on the computed image download time: it means the image download took so long that the Pod could not be delivered on time.
Fortunately, Dragonfly, Alibaba's image distribution platform, supports image lazy loading, i.e. remote images: when Kubelet creates a container, it no longer needs to download the image first, which greatly speeds up Pod delivery. For details on image lazy loading, see Alibaba Dragonfly's own sharing.
Second, improving the success rate of a single Pod gets harder and harder as the rate rises, so retries can be introduced at the workload level. At Ant, the PaaS platform keeps retrying until the Pod is delivered successfully or the overall attempt times out; of course, nodes that failed previously must be excluded when retrying.
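A sketch of what such a workload-level retry with node exclusion could look like; the delivery function and the way failed nodes are tracked are placeholders for whatever the platform actually does.

```go
package paas

import (
	"errors"
	"time"
)

// deliverFunc attempts one Pod delivery while avoiding the given nodes and
// reports which node was used and whether delivery succeeded in time.
type deliverFunc func(excludedNodes map[string]bool) (node string, ok bool)

// deliverWithRetry keeps retrying a Pod delivery until it succeeds or the
// overall timeout expires, excluding every node that already failed.
func deliverWithRetry(deliver deliverFunc, timeout time.Duration) error {
	excluded := map[string]bool{}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		node, ok := deliver(excluded)
		if ok {
			return nil
		}
		excluded[node] = true // do not retry onto a node that just failed
	}
	return errors.New("pod delivery timed out after retries")
}
```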
Third, critical DaemonSets must be checked. If a critical DaemonSet is missing while Pods are scheduled onto the node, problems follow easily and the creation/deletion link is affected. This check needs to be hooked into the faulty-machine system.
Fourth, many plugins, such as CSI plugins, need to register with Kubelet. It can happen that everything on the node looks normal but the registration with Kubelet failed; such a node also cannot provide Pod delivery and likewise needs to be hooked into the faulty-machine system.
Finally, because the cluster has a very large number of users, isolation is very important. On top of permission isolation, QPS isolation and capacity isolation are also needed to prevent one user's Pods from exhausting the cluster's capacity, so as to protect the interests of other users.
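A minimal sketch of per-user QPS isolation using one token bucket per tenant (via golang.org/x/time/rate); the limits and structure are illustrative, not the platform's actual isolation mechanism.

```go
package isolation

import (
	"sync"

	"golang.org/x/time/rate"
)

// TenantLimiter gives each user an independent token bucket so that one
// user's burst of Pod requests cannot starve the others.
type TenantLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	qps      rate.Limit
	burst    int
}

func NewTenantLimiter(qps rate.Limit, burst int) *TenantLimiter {
	return &TenantLimiter{
		limiters: map[string]*rate.Limiter{},
		qps:      qps,
		burst:    burst,
	}
}

// Allow reports whether the given user may issue one more request right now.
func (t *TenantLimiter) Allow(user string) bool {
	t.mu.Lock()
	l, ok := t.limiters[user]
	if !ok {
		l = rate.NewLimiter(t.qps, t.burst)
		t.limiters[user] = l
	}
	t.mu.Unlock()
	return l.Allow()
}
```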
That's all for how to achieve a high SLO on a large-scale Kubernetes cluster. I hope the content above is helpful and gives you something to learn from. If you found the article useful, please share it so more people can see it.