2025-03-28 Update From: SLTechnology News&Howtos
Dilemma and layout of Kubernetes auto-scaling

Contents:
2.1 The dilemma of traditional auto-scaling
    1. Problems of elastic scaling in Kubernetes
    2. Extension of the concept of elastic scaling
2.2 Kubernetes auto-scaling layout
2.1 The dilemma of traditional auto-scaling
In the traditional sense, the main problem solved by elastic scaling is the contradiction between capacity planning and actual load.
In the chart, the blue watermark indicates cluster resource capacity, which continues to expand as load increases, while the red curve indicates how the cluster's actual resource load changes.
The purpose of auto-scaling is to solve the problem that arises when the actual load increases faster than the cluster's resource capacity can respond.
In the traditional understanding, for example with a group of web servers, you add machines when the load is high and remove them when the load falls.
So what is the difference between traditional elastic scaling and elastic scaling in Kubernetes?
Elastic scaling in Kubernetes is a much-discussed topic, but to understand it we should start from traditional elastic scaling and see how the two differ. Traditionally, the problem to be solved is the contradiction between capacity planning and actual load. In the chart above, the blue area is the available resources and the red curve is the actual load. Think of the blue area as a cluster pool: for example, a web tier of four servers at 4C8G each, 16C32G in total. When a holiday such as Singles' Day arrives, the load spikes and you may need something like 18C36G, which exceeds what the pool can provide. At that point you must expand capacity quickly, adding one or more machines to absorb the actual load. This is where elastic scaling comes from: when resource utilization crosses a threshold, respond quickly by expanding capacity. The traditional approach tends to be slow, and the load may exceed capacity before the response arrives, because real traffic spikes come from sudden events, or even malicious attacks, that you are not prepared for. So the contradiction is between available resources and actual load, and the key challenges are whether the pool can scale out quickly, scale back in quickly, and reduce cost during troughs. All of this is genuinely hard to do well.
At present there is no particularly good alternative to adding servers in advance, which addresses only part of the contradiction, such as absorbing some fast-arriving traffic. For known events such as Singles' Day, companies like Taobao, JD.com, and Alibaba expand in advance: they maintain huge resource pools, place much of their business in them, and reserve large amounts of capacity for the cluster to use, just as the rest of us add servers ahead of time, including on the cloud.
Moving to Kubernetes does not by itself resolve this contradiction. Elastic scaling still aims to resolve the tension between capacity planning and actual load, and truly fast scale-out and scale-in remain elusive, but Kubernetes changes the picture compared with the traditional approach. Consider how traditional auto-scaling decides to act: if the load exceeds the pool's original capacity, how do we judge that a machine should be added automatically, even if adding it is relatively slow? The strategy is threshold-based, generally on CPU and memory; beyond these there are few good signals. Public clouds such as AWS and Aliyun work this way: you set a threshold, and if overall resource usage exceeds it, a machine is added. In practice the cluster always keeps some reservation rather than being filled completely, except under abnormal traffic or attacks. Based on historical access trends, a reservation of 20% to 30% gives you a buffer window, so that a 20% traffic surge can still be absorbed. Increasing these reservations helps guarantee cluster availability.
However, carrying this traditional CPU-utilization-percentage approach directly into Kubernetes may not be realistic.
1. The problems of elastic scaling in Kubernetes
The general practice is to reserve cluster resources, usually about 20%, to keep the cluster available. This approach seems fine, but in Kubernetes it runs into the following two problems.
1. Utilization percentages are fragmented by non-uniform machine specifications
In a Kubernetes cluster there is usually more than one machine type. Suppose the cluster contains two kinds of machines, 4C8G and 16C32G. A 10% resource reservation means completely different absolute amounts on the two specifications.
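The arithmetic behind this fragmentation is simple; a minimal sketch using the two specifications from the example above (the node names are only labels):

```python
# The same 10% reservation reserves very different absolute capacity
# on nodes of different specifications.
specs = {"4C8G": (4, 8), "16C32G": (16, 32)}  # (CPU cores, memory GiB)
reserve = 0.10

for name, (cpu, mem_gb) in specs.items():
    # 10% of a 4C8G node is 0.4 cores / 0.8 GiB; of a 16C32G node, 1.6 cores / 3.2 GiB.
    print(f"{name}: reserve {cpu * reserve:.1f} cores, {mem_gb * reserve:.1f} GiB")
```

The same percentage threshold therefore describes a 4x difference in real headroom between the two node types.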
With a traditional web tier, server specifications are basically uniform, and the load balancer generally uses round-robin, rarely least-connections. If specifications differ, say one 4C8G server and one 16C32G server with the same round-robin weight, the larger machine is largely wasted.
On the cloud, configurations are basically uniform, and so is expansion; an IDC may have machines of different specifications so that existing hardware can be put to use. In Kubernetes, however, there is no need to unify the machines; you can even throw high-spec servers into the mix, because Kubernetes logically treats all nodes as one large resource pool. The scheduler decides whether a pod can be placed based on actual capacity rather than round-robin, which keeps per-node utilization relatively high. So machines of different specifications matter little for scheduling, but they do matter for elastic scaling.
For example, take three servers: 4C8G, 16C64G, and 8C16G. If scaling is still driven by CPU and memory percentages, then 80% utilization does not mean the same thing on each of them: the same percentage corresponds to very different absolute amounts, so expanding capacity based on node percentages clearly does not work.
Scale-in is harder still. With the same three machines and falling load, you must evaluate which node to remove, looking for an idle one, and the judgment is again based on CPU and memory. If all three show similar percentage utilization, the choice seems arbitrary, but removing the large machine takes far more capacity out of the cluster, and overall utilization headroom drops sharply. So the main difficulty lies in scale-in.
Especially in scale-in scenarios, to keep the cluster stable after shrinking, we usually remove a node from the cluster, and its utilization percentage is an important indicator of whether it can be removed. If a large-specification machine shows low percentage utilization and is chosen for scale-in, the containers rescheduled off it are likely to cause resource contention after the node is gone. If you preferentially remove small-specification machines instead, a great deal of resource redundancy may remain after scale-in.
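To make the percentage trap concrete, here is an illustrative sketch (not the actual cluster-autoscaler logic, and the node names and numbers are invented): three nodes all at 80% CPU utilization, where percentage alone cannot rank scale-in candidates but absolute workload can.

```python
# Hypothetical nodes, all at 80% CPU utilization.
nodes = [
    {"name": "node-a", "cpu_cap": 4.0,  "cpu_used": 3.2},
    {"name": "node-b", "cpu_cap": 16.0, "cpu_used": 12.8},
    {"name": "node-c", "cpu_cap": 8.0,  "cpu_used": 6.4},
]

for n in nodes:
    n["pct"] = n["cpu_used"] / n["cpu_cap"]   # identical 0.8 for all three

# By percentage the nodes are indistinguishable, yet draining node-b would
# force 12.8 cores of work to reschedule versus only 3.2 for node-a.
# One plausible heuristic: prefer the node with the least work to move.
candidate = min(nodes, key=lambda n: n["cpu_used"])
print(candidate["name"])  # node-a
```

This is only one dimension of the decision; as the next section explains, pod requests must also be taken into account.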
2. Machine utilization cannot be judged from host metrics alone
In most production environments, actual resource utilization does not stay at a high watermark, but from the scheduler's point of view, allocation should be kept at a relatively high watermark, so that the cluster remains stable without wasting too many resources.
In Kubernetes, a pod's resource specification is generally based on two values: request, the reference value used for scheduling, and limit, the maximum resource cap. Scale-in and scale-out decisions are generally made against requests. If a pod requests 2C2G, the scheduler must reserve that specification for it regardless of whether the pod is actually using it, so when computing cluster utilization you cannot simply ignore requests. Cluster resources are therefore not calculated from the host alone: there are two dimensions, the node's actual usage and the pods' requests, one more dimension than just looking at the node. Auto-scaling must take this into account. It is not the case that a pod which requested 2C2G but is currently idle can be discounted: if you scale in based only on actual load and later the utilization of those requested resources rises, the remaining nodes may not have enough capacity, pods will compete for resources, and the cluster will be impacted before new nodes can be added. That is the second problem.
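The two dimensions can be sketched numerically. In this hypothetical example (all numbers invented), a scale-in decision looks safe when judged by actual usage but unsafe when judged by requests:

```python
# Hypothetical state of the nodes that would remain after a scale-in:
total_cpu   = 24.0   # allocatable CPU cores on the remaining nodes
requested   = 20.0   # sum of pod CPU requests the scheduler must honor
actual_used = 9.0    # sum of actual CPU usage right now

usage_ratio   = actual_used / total_cpu   # 0.375 -- looks comfortably low
request_ratio = requested / total_cpu     # ~0.833 -- the scheduler's view

# Against an illustrative 80% safety threshold, the two dimensions disagree:
safe_by_usage   = usage_ratio < 0.8    # True: plenty of idle CPU
safe_by_request = request_ratio < 0.8  # False: nearly fully committed
print(safe_by_usage, safe_by_request)  # True False
```

Scaling in here would leave the cluster over-committed the moment pods grow back toward their requested 2C2G-style reservations, which is exactly the contention problem described above.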
How can these problems be solved?
For the first problem, fragmented utilization percentages across machine specifications, the best remedy is conceptually simple: make the configurations the same. In practice this is often unrealistic, because the machines a company buys, including legacy hardware, are all pressed into the cluster pool to make rational use of them. You can at least keep newly purchased machines uniform, but mixed specifications remain a handicap.
For the second problem, that utilization cannot be computed from the host alone, scale-in decisions must consider not only node utilization but also the requests made by pods across the whole cluster: you must account for the requested specifications, not just the actual load. These are the available means.
More generally, there is a tension between cost savings and availability. Now that business workloads have diversified, classifying them helps resolve this tension.
2. Extension of the concept of elastic scaling
Not all businesses have traffic peaks, and increasingly subdivided business patterns create a trade-off between cost savings and availability.
Online load type: microservices, websites, APIs
Offline task type: offline computing, machine learning
Timed task type: scheduled batch computing
Different load types have different requirements for elastic scaling: online loads are sensitive to scale-out time, offline tasks are sensitive to price, and timed tasks are sensitive to scheduling.
Traditional workloads map to the online load type. Offline workloads, such as offline computing and machine learning, are periodic and need not be online in real time; while running, as with big-data processing, they consume far more resources than online loads, so cost is the concern. You cannot permanently provision machines at peak specification for occasionally large offline tasks, since peak-spec pricing is much higher. The third kind is timed tasks, scheduled batch computing such as periodic jobs and backups; these are most sensitive to scheduling, and when tasks pile up a global scheduling system must dispatch and allocate them. So the concept of auto-scaling extends to the business level, where workloads are reasonably distinguished: online loads (microservices, websites, APIs) are treated as before, with scale-out time, that is, sensitivity of expansion, as the main consideration; for offline tasks the question is whether enough resources can be provided while they run and reclaimed when they are idle, freeing resources for others and saving overhead; and timed tasks are handled through scheduling.
The next question is how to lay this out in Kubernetes. We cannot simply scale on host utilization with one extra dimension bolted on; instead, the Kubernetes ecosystem provides many components for the different business forms above and for a variety of scenarios.
2.2 Kubernetes auto-scaling layout
In the Kubernetes ecosystem, different components are provided across multiple dimensions and levels to meet different scaling scenarios.
There are three types of elastic scaling:
CA (Cluster Autoscaler): node-level automatic scale-out/scale-in; the cluster-autoscaler component
HPA (Horizontal Pod Autoscaler): automatic scaling of the number of Pods
VPA (Vertical Pod Autoscaler): vertical scaling; automatic adjustment of a Pod's resource configuration, mainly CPU and memory; the addon-resizer component
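For HPA, the core scaling rule is documented by Kubernetes itself: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A minimal sketch (the function name and example numbers are our own):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Core HPA rule: scale replicas in proportion to how far the
    observed metric is from its target, rounding up."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 10 replicas averaging 90% CPU against a 50% target -> scale out to 18.
print(hpa_desired_replicas(10, 90.0, 50.0))  # 18
# Load falls to 20% average CPU -> scale in to 4.
print(hpa_desired_replicas(10, 20.0, 50.0))  # 4
```

The real controller adds tolerances, stabilization windows, and min/max replica bounds on top of this proportional rule.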
On the cloud, it is recommended to combine HPA with cluster-autoscaler to manage the auto-scaling of the cluster.
When a newly created pod cannot be scheduled, it stays in a pending state and receives no resources. For an online load, some waiting time while new nodes are added may be tolerable. As long as requests are the limiting factor, the cluster remains stable even when requests are fully allocated and a new pod cannot be placed, because each pod also has a maximum cap, its limit; if actual usage pushed past those bounds the cluster would become unstable. This is where node expansion comes in: the CA (cluster autoscaler) component implements automatic scale-out and scale-in at the node level. At present this component mainly integrates with public clouds such as Aliyun, Azure, and AWS, driving their virtual machines to expand and shrink your cluster, though you can also build similar components to achieve automatic scaling yourself. The second type works at the pod level and mainly targets your existing resource pool: if the pool has spare capacity, new pods can be scheduled into it. Say your application has 10 replicas handling a concurrency of 10,000 and the load grows; scaling out to 20 replicas gives you a concurrency of 20,000, and as long as the cluster pool has enough resources, this absorbs the business load. So in Kubernetes, scaling is generally done along two dimensions, node and pod, both of them horizontal. The third dimension, which is used less often, is the vertical scaling of a pod: adjusting its resource specification, such as the limit, to raise its quota. This is currently rare.
At present, node-level and pod-level scaling are the most commonly used.