Author | Zhang Xiaoyu (Zhongyuan), Technical Expert, Alibaba Cloud Container Platform
Foreword: resource utilization has long been a concern for platform operators and R&D engineers alike. Drawing on the Alibaba container platform team's work in this area, the author presents a set of practices for improving resource utilization, in the hope of prompting discussion and reflection.
Introduction
Have you ever had this experience: you have a Kubernetes cluster, you are about to deploy an application, and you must decide how many resources to allocate to its containers?
It is hard to say. Because of how Kubernetes itself works, a container's resources are essentially a static configuration.
If resources turn out to be insufficient, giving the container more requires rebuilding the Pod; if we over-allocate, the worker nodes cannot host many containers.
So the question is: can we allocate container resources on demand? That is the question this talk sets out to answer.
Real challenges in the production environment
First, allow us to lay out the challenges of our actual production environment. You may remember Tmall's Double 11 shopping festival in 2018, when total turnover reached 213.5 billion yuan. From that single figure you can infer the scale, in application types and numbers, of the system behind such trading volume.
At this scale, the familiar terms of container scheduling, such as container orchestration, load balancing, cluster scaling, cluster upgrades, application releases, and gray releases, are no longer easy matters once qualified by the words "very large-scale cluster". Scale itself is our biggest challenge. Operating and managing such a huge system while delivering the DevOps experience the industry advertises is like teaching an elephant to dance. Though as Jack Ma once said, elephants should do what elephants do; why should they go dancing at all?
Help from Kubernetes
With the question of whether the elephant can dance in mind, let us start from the systems behind apps such as Taobao and Tmall.
The deployment of such Internet systems has gone through three stages: traditional deployment, virtual-machine deployment, and container deployment. Compared with traditional deployment, virtual machines offer better isolation and security, but at an unavoidable cost in performance. Container deployment proposed a lighter-weight way to achieve isolation and security against that backdrop, and our systems have followed the same path. If the underlying system is a giant ship carrying a large number of containers, we need an excellent captain to schedule and arrange them, so that the ship can steer past hazard after hazard, operate with less difficulty and more flexibility, and finally reach its destination.
Ideal and reality
At the beginning, contemplating the beautiful prospects of containerization and Kubernetes, we imagined ideal container orchestration like this:
Calm: our engineers face complex challenges composedly, smiling and confident instead of frowning. Elegant: every online change is made as serenely as tasting red wine, gracefully pressing the Enter key. Orderly: from development to testing to gray release, everything flows in one smooth motion. Stable: the system is robust; whatever winds blow from east, west, north, or south, it stands firm, with year-round availability of N nines. Efficient: manpower is saved, achieving "work happily, live seriously".
However, the ideal was plump and the reality bony. What greeted us instead was confusion and embarrassment of every shape.
Messy, because as a new technology stack, much of the supporting tooling and workflow was still in its infancy. Tools that ran well in demos exposed all sorts of hidden problems, one after another, once rolled out at scale in real situations. From development to operations, everyone was running on the back foot. "Large-scale rollout" also means facing diverse production environments directly: heterogeneous machines, complex requirements, even users' entrenched habits, and so on.
Beyond the exhausting confusion, the system also faced application containers crashing in various ways: insufficient memory leading to OOM kills, CPU quotas set too low causing processes to be throttled, insufficient bandwidth making response latency spike sharply, and even cliff-like drops in transaction volume at access peaks because the system lacked capacity. All of this let us accumulate a great deal of experience with Kubernetes in large-scale commercial scenarios.
Facing the stability problem
Problems have to be faced. As a master once said: if something seems wrong, then something is wrong. So we had to analyze what the problem actually was. Given the memory OOM kills and the CPU throttling, we can infer that the initial resources we allocated to the containers were insufficient.
Insufficient resources inevitably degrade the stability of the whole application service. Take the scenario in the figure above: although these are replicas of the same application, perhaps because load balancing is imperfect, or because the application itself is heterogeneous, or even because the machines are heterogeneous, resources of the same numerical value do not carry equal value and significance for different replicas. Numerically they appear to have been allocated the same resources, but under real load the distribution is very likely uneven.
In resource-overcommit scenarios, serious resource contention arises when the node as a whole runs short of resources, or when the CPU share pool is insufficient. Resource contention is one of the biggest threats to application stability, so we should do our utmost to remove such threats from the production environment.
We all know stability matters, especially to front-line R&D staff who hold the life and death of millions of containers in their hands. A single careless operation can cause a production accident with sweeping impact.
Therefore, following standard practice, we put systematic work into both prevention and mitigation.
On the prevention side, we run full-link stress tests and use scientific means to predict in advance the replica counts and resources an application will need. Where resources cannot be budgeted accurately, the only option is redundant allocation. On the mitigation side, once heavy traffic arrives, we can degrade unimportant services and temporarily scale out the major applications.
But for a traffic surge lasting only a few minutes, such an expensive combination of punches hardly seems cost-effective. Perhaps we can come up with solutions that better match our expectations.
Resource utilization
Reviewing our application deployment: the containers on a node usually belong to a variety of applications, and these applications are generally not at their access peaks at the same time. For hosts running mixed workloads, it would be more scientific if the resources of the containers could be allocated against each other's off-peak periods.
An application's resource demand may wax and wane periodically, like the moon. Online businesses, especially trading businesses, show clear periodicity in resource usage: in the early morning and morning their usage is not high, while at noon and in the afternoon it rises.
For example, a moment that is important for application A may not matter much to application B, so it is a good choice to suppress B appropriately and free resources for A. It sounds a bit like time-division multiplexing. If instead we configured every application's resources for its traffic peak, a great deal would be wasted.
Besides online applications with high real-time requirements, we also have offline and real-time computing applications: offline computing is not so sensitive to CPU, memory, or network resources, or to timing, and can run at almost any time; real-time computing can be very time-sensitive.
In the early days, our workloads were deployed on separate nodes by application type. As the diagram above suggests, if they reuse resources by time division, the actual maximum usage is not 2/2/1/5 but the peak combined demand of the important, urgent applications at any single moment, that is, 3. If we can monitor each application's real usage and assign it a reasonable value, we can achieve a real improvement in resource utilization.
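The arithmetic behind time-division reuse can be sketched with hypothetical numbers (illustrative only, not the figures from the diagram above): the capacity needed by co-located applications is the peak of their combined load, which is usually well below the sum of their individual peaks.

```python
# Hourly CPU demand (cores) for three hypothetical applications whose
# peaks fall at different times of day.
online_trade = [1, 1, 3, 1]   # trading app: peaks at midday
offline_batch = [2, 1, 1, 2]  # batch jobs: peak at night
realtime_calc = [1, 2, 1, 1]  # real-time computing

# Dedicated nodes per application type: each must reserve its own peak.
sum_of_peaks = max(online_trade) + max(offline_batch) + max(realtime_calc)

# Time-shared colocation: capacity only needs to cover the combined peak.
combined = [a + b + c for a, b, c in zip(online_trade, offline_batch, realtime_calc)]
peak_of_sum = max(combined)

print(sum_of_peaks)  # 7 cores reserved under separate deployment
print(peak_of_sum)   # 5 cores suffice under time-shared colocation
```

The gap between the two numbers is exactly the utilization headroom that mixed deployment recovers.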
For e-commerce applications built on heavyweight Java frameworks and their related technology stacks, neither HPA nor VPA is an easy fix in the short term.
Take HPA first: we may be able to pull up a Pod and create a new container within seconds, but is the newly created container actually ready to serve? From creation to availability can take a long time, while for a big promotion or flash sale the traffic "flood peak" may last only a few minutes or a dozen minutes. If we wait until all the HPA replicas are available, the marketing event may already be over.
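A back-of-envelope sketch of the timing problem, with hypothetical numbers: if a heavyweight Java replica needs most of the flood peak's duration just to become ready, HPA contributes little to that peak.

```python
# Hypothetical timings for a heavyweight Java web application.
peak_duration_min = 10   # how long the traffic "flood peak" lasts
replica_ready_min = 8    # Pod creation -> JVM warm-up -> actually serving

# Minutes during which the new replica can actually absorb traffic.
useful = max(0, peak_duration_min - replica_ready_min)

print(useful)                                 # minutes of real help
print(round(useful / peak_duration_min, 1))   # fraction of the peak covered
```

With these numbers the new replica serves only the last fifth of the spike, which is why a faster, in-place mechanism is needed.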
As for the community's current VPA, its logic of deleting the old Pod and creating a new one is even harder to accept. Taken together, we need a more practical solution to fill the gap left by HPA and VPA in single-node resource scheduling.
Solution delivery standards
First, we set deliverable criteria for the solution, namely "stability, utilization, and automation all at once, and intelligence where possible", and then refined them:
Safe and stable: the tool itself must be highly available, and the algorithms and implementations it uses must be controllable. On-demand allocation for business containers: it must predict a workload's near-future resource consumption from its real-time consumption, so that users understand its true future demand. Low overhead: the tool's own resource consumption must be as small as possible, so that it never becomes an operational burden. Easy to operate and extensible: usable without training, while remaining open to user customization. Fast detection and timely response: real-time behavior is the key characteristic that distinguishes it from HPA and VPA in solving this resource-scheduling problem.
Design and implementation
The figure above shows our initial tool flow: when an application faces high business traffic, its demand for CPU, memory, or other resources grows. Based on the real-time base data gathered by the Data Collector, the Data Aggregator builds a portrait of a container or an entire application and feeds it to the Policy Engine, which immediately modifies the parameters in the container's cgroup file directory.
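For CPU, "modifying the parameters in the container's cgroup file directory" amounts to rewriting the CFS quota files. The sketch below is a hedged illustration (cgroup v1 file names; the function and paths are ours, not the Policy Engine's actual code), demonstrated against a temporary directory rather than a live /sys/fs/cgroup:

```python
import os
import tempfile

def set_cpu_quota(cgroup_dir, cores, period_us=100_000):
    """Grant a container `cores` CPUs by writing CFS period/quota (cgroup v1)."""
    quota_us = int(cores * period_us)
    with open(os.path.join(cgroup_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota_us))
    return quota_us

# Demo against a temp dir; on a real node cgroup_dir would be the
# container's directory under /sys/fs/cgroup/cpu/.
demo_dir = tempfile.mkdtemp()
print(set_cpu_quota(demo_dir, 1.5))  # 150000 us quota per 100000 us period = 1.5 cores
```

Because this only writes files that the kernel re-reads on the fly, the adjustment takes effect without restarting the container, which is the whole point of the design.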
Our earliest architecture was as simple as our thinking: make intrusive changes inside kubelet. Although we only added a few interfaces, this approach is really not elegant; every Kubernetes upgrade poses a challenge for upgrading the Policy Engine and its related components.
To iterate quickly and decouple from kubelet, we evolved the implementation again: containerize the key components. This achieves the following:
No intrusive modification of Kubernetes core components; easy iteration and release; and, with the help of Kubernetes's QoS class mechanism, controllable container resource allocation and resource cost.
In subsequent evolution we also worked on integrating with HPA and VPA, since they and the Policy Engine are complementary. Our architecture therefore evolved further into the following shape: when the Policy Engine cannot handle a more complex scenario, it escalates events so that the central side can make a more global decision, scaling out horizontally or adding resources vertically.
Let us discuss the design of the Policy Engine in more detail. The Policy Engine is the core component for intelligent scheduling and Pod resource adjustment on a single node. It consists of an API server, a command center, and an execution layer, the executor.
The API server serves external requests to query and set the Policy Engine's running state. The command center decides on Pod resource adjustments based on the real-time container portraits and the physical machine's own load and resource usage. The executor carries out the command center's decisions, adjusting containers' resource limits; it also persists the revision info of every adjustment, so that it can roll back in case of failure.
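The persisted revision info might look like the following minimal sketch. The article only says that each adjustment's revision is saved so a failure can be rolled back; the structure here is hypothetical:

```python
class Adjuster:
    """Toy executor that records each change as a revision for rollback."""

    def __init__(self):
        self.revisions = []  # persisted revision info: (resource, previous value)
        self.current = {}    # resource name -> currently applied value

    def apply(self, resource, value):
        # Record the old value before applying the new one.
        self.revisions.append((resource, self.current.get(resource)))
        self.current[resource] = value

    def rollback(self):
        # Undo the most recent adjustment.
        resource, old = self.revisions.pop()
        if old is None:
            del self.current[resource]
        else:
            self.current[resource] = old

a = Adjuster()
a.apply("cpu_quota", 150_000)
a.apply("cpu_quota", 50_000)   # this adjustment turns out to misbehave
a.rollback()                   # self-healing: restore the previous value
print(a.current["cpu_quota"])  # 150000
```

In production the revision log would be written to durable storage, so a crashed controller can still undo its last action on restart.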
The command center periodically obtains real-time container portraits from the Data Aggregator, including aggregated statistics and prediction data. It first judges the node's state: an abnormal disk or a network failure, for example, means the node has had an exception and the site must be preserved, so no Pod resource adjustments are made, to avoid shaking the system and interfering with operations and debugging. If the node is healthy, the command center applies its policy rules to filter the container data again, for example a container whose CPU rate is too high, or whose response time exceeds the safety threshold. For the set of containers that satisfy the conditions, it issues resource adjustment suggestions and passes them to the executor.
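The flow just described can be condensed into a few lines. The field names and threshold below are hypothetical, but the order of checks (node health first, then per-container rules) follows the text:

```python
def decide(node, containers, cpu_threshold=0.8):
    """Return adjustment suggestions for containers breaching the rules."""
    if not node.get("healthy", True):
        return []  # protect the site: never adjust on an abnormal node
    suggestions = []
    for c in containers:
        if c["cpu_rate"] > cpu_threshold:
            suggestions.append({"name": c["name"], "action": "raise_cpu_limit"})
    return suggestions

# An unhealthy node yields no adjustments at all.
print(decide({"healthy": False}, [{"name": "a", "cpu_rate": 0.95}]))
# On a healthy node, only the breaching container gets a suggestion.
print(decide({"healthy": True},
             [{"name": "a", "cpu_rate": 0.95}, {"name": "b", "cpu_rate": 0.4}]))
```

The early return on an unhealthy node is what the text calls "protecting the site": with the node already abnormal, any further change would only muddy the debugging picture.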
In architectural design, we follow the following principles:
Plug-in design: all rules and policies can be modified through configuration files, decoupled as far as possible from the core control-flow code and from the updates and releases of other components such as the Data Collector and Data Aggregator, which improves extensibility.
Stability, which includes the following aspects:
Controller stability: the command center's decisions are made on the premise of not harming single-machine, let alone overall, stability, covering both container performance stability and resource-allocation stability. For instance, each controller is currently responsible for controlling only one cgroup resource; that is, within a single time window the Policy Engine never adjusts multiple resources at once, to avoid allocation oscillation interfering with the adjustment's effect. Trigger-rule stability: the original trigger condition of one rule was that a container's performance metric exceeded the safety threshold, but to prevent a momentary spike from triggering a control action, we redefined the rule as: a low percentile of the metric over the past window exceeds the safety threshold. When this holds, most of the metric's samples during the period exceeded the threshold, and a control action is warranted. In addition, unlike the community's Vertical Pod Autoscaler, the Policy Engine never actively evicts or relocates containers; it modifies the container's cgroup files directly.
Self-healing: the execution of actions such as resource adjustment may hit exceptions, so we added a self-healing rollback mechanism to each controller to keep the whole system stable.
No reliance on prior application knowledge: stress-testing every application individually, customizing a strategy for each, or stress-testing in advance every combination of applications that might be co-located would carry enormous overhead and poor scalability. Our strategy is therefore as general as possible in its design, using metrics and control strategies that do not depend on any particular platform, operating system, or application.
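The trigger-rule stabilization described under the stability principle above can be sketched as follows; the window length, percentile, and threshold are illustrative choices:

```python
def should_trigger(window, threshold, pct=10):
    """Fire only if the pct-th (low) percentile of the window exceeds the
    threshold, i.e. most samples in the window were above it."""
    s = sorted(window)
    idx = max(0, int(len(s) * pct / 100) - 1)
    return s[idx] > threshold

# A single spike must NOT trigger a control action...
spike = [100, 100, 100, 100, 100, 100, 100, 100, 100, 900]
print(should_trigger(spike, 250))      # False: only one sample is high

# ...but a sustained breach must.
sustained = [300] * 10
print(should_trigger(sustained, 250))  # True: the whole window is high
```

Checking a low percentile rather than the maximum is what makes the controller ignore one-off noise while still reacting to genuine, sustained pressure.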
On the resource-adjustment side, cgroups let us isolate and limit each container's CPU, memory, network, and disk I/O bandwidth. We currently focus on adjusting container CPU resources, while exploring the feasibility of dynamically adjusting memory limits and swap usage in time-division-multiplexing scenarios to avoid OOM. Support for dynamically adjusting container network and disk I/O will come later.
Adjustment effect
The figure above shows some experimental results from our test cluster, where we mixed high-priority online applications with low-priority offline applications. The SLO is 250 ms: we want the 95th-percentile latency of the online applications to stay below that threshold.
As can be seen in the experimental results:
Before about 90 s, the online applications' load was very low, and both the mean and the percentiles of latency stayed below 250 ms. After 90 s we put pressure on the online applications, raising traffic and load, which pushed their 95th percentile above the SLO. At about 150 s our small-step control strategy was triggered and gradually throttled the offline applications competing for resources. By about 200 s the online applications' performance had returned to normal, with the 95th percentile of latency back below the SLO.
This shows the effectiveness of our control strategy.
Experiences and lessons
Let us summarize the experiences and lessons gained over the course of the project, in the hope that they help readers who run into similar problems and situations.
Avoid hard-coding; split components into microservices. This not only enables rapid evolution and iteration, but also makes it easier to circuit-break a misbehaving service.
Try not to call class-library interfaces that are still alpha or beta features. For example, we used to call the CRI API directly to read container information and perform some update operations, but as interface fields and methods changed, some of the features we had built became unavailable. Sometimes it may actually be more reliable to obtain an application's information through cruder but stable means than through an unstable API.
QoS-based dynamic resource adjustment: as mentioned earlier, Alibaba Group has tens of thousands of applications, and the call chains between them are very complex. Abnormal container performance in application A is not necessarily caused by resource shortage or contention on the local node; it may well be caused by access latency in its downstream application B or C, or in a database or cache. Because such information is invisible on a single node, resource adjustment based on single-node information alone can only follow a "best effort" strategy. In the future we plan to connect the resource-regulation links of the single node and the central side: the central side will combine the performance information and resource-adjustment requests reported by nodes to redistribute resources, reschedule containers, or trigger HPA, forming a closed-loop, cluster-level intelligent resource-regulation link that will greatly improve both stability and overall resource utilization across the cluster.
Resource vs. performance models: some readers may have noticed that our tuning strategy does not explicitly build a "resource vs. performance" model for each container.
Such models are common in academic papers: stress-test a handful of applications, offline or online, vary their resource allocation, measure their performance metrics, obtain the curve of performance as a function of resources, and finally feed it into a real-time resource-control algorithm. When the number of applications is small, the call chains are simple, and the cluster's hardware configurations are few, this stress-test-based approach can cover essentially all cases and find optimal or near-optimal resource adjustments, yielding better performance. But in Alibaba Group's scenario we have tens of thousands of applications, and many key applications release new versions frequently: once a new version ships, the old stress-test data, that is, the resource-performance model, no longer applies. Moreover, many of our clusters are heterogeneous, and performance data measured on one physical machine does not carry over to a different machine type. All of this stood in the way of directly applying the resource-control algorithms from academic papers. So for Alibaba Group's internal scenario we adopted this strategy: do not stress-test applications offline to obtain an explicit resource-performance model; instead, build a real-time, dynamic container portrait, using the container's resource statistics over the past window as the forecast for the near future, updated continuously; and on top of this dynamic portrait, run a resource-adjustment strategy of small steps taken quickly, watching as we go, doing the best we can.
Summary and outlook
To sum up, our work has mainly achieved the following benefits:
Through time-division reuse and the co-location of containers of different priorities (that is, online and offline workloads), together with dynamic adjustment of container resource limits, we ensure that online applications get sufficient resources under varying load, improving the cluster's overall resource utilization. Through intelligent, dynamic adjustment of container resources on each node, we reduce performance interference between applications and stabilize the performance of high-priority applications. Deployed as a DaemonSet, the various resource-adjustment strategies run automatically and intelligently on every node, reducing manual intervention and the human cost of operations.
Looking ahead, we hope to strengthen and expand our work in the following areas:
Closed-loop control link: as noted above, the lack of global information on a single node limits what local resource adjustment can achieve; it can only do its best. In the future we hope to connect with HPA and VPA so that the node and the central side adjust resources in concert, maximizing the benefit of elastic scaling.
Container rescheduling: even for the same application, different containers' loads and physical environments are dynamic; adjusting Pod resources on one machine cannot always satisfy dynamic demand. We hope the real-time container portraits on each node can give the central side more effective information and help the central scheduler make smarter rescheduling decisions.
Smarter strategies: our current adjustment strategies are still coarse-grained, and the adjustable resources are limited. We hope to make the strategies more intelligent and cover more resources, such as disk and network I/O bandwidth, to improve the effectiveness of resource adjustment.
Finer-grained portraits: the current container portraits are also rather rough, relying only on statistics and linear prediction, and the metric types describing container performance are limited. We hope to find more accurate, general-purpose metrics that reflect container performance, so as to describe a container's state and its demand for each resource more precisely.
Interference-source identification: we hope to find an effective way, on the node itself, to pinpoint the source of interference when application performance degrades, which also matters greatly for strategy intelligence.
Q & A
Q1: If I directly modify the container's cgroup, is the container guaranteed to get the resources?
A1: Container isolation is ultimately implemented at the cgroup level. If the host has enough free resources, setting a higher cgroup value lets the container obtain more resources; likewise, setting a lower cgroup value for a low-priority application restrains that container's operation.
Q2: How does the underlying layer distinguish online from offline priorities?
A2: The bottom layer cannot automatically tell who is online, who is offline, or whose priority is higher. We can express this through the various extension points Kubernetes provides; the simplest is to mark it with labels and annotations. Extending QoS classes is another approach, though the community's QoS classes are too conservative and leave users little room, so we have enhanced them in these respects and may push the changes to the community at an appropriate time. Automatic perception is a direction: sensing who the interference source is and which applications are bound by which resource. We are still developing this; to be truly dynamic, the system must be an intelligent one with automatic perception.
Q3: "Unlike the community's Vertical Pod Autoscaler, the Policy Engine does not actively evict or relocate containers but modifies the container's cgroup files directly." If it does not evict, what happens when the node's resources run out?
A3: That is a good question. First we must distinguish which resource is involved. For CPU, we can lower the cpu quota in the low-priority container's cgroup, suppressing its competition for CPU first, and then appropriately raise the relevant resource values of the high-priority container. For memory, we cannot simply reduce the low-priority container's cgroup value, or it would be OOM-killed; the adjustment of memory resources is quite special, and we will continue that discussion in another talk.
Q4: If you only modify cgroups, how do you let Kubernetes schedule more containers onto a single physical machine?
A4: As described above, containers' resource consumption is not constant; in many cases it is tidal. Deploying more applications, and completing more work, under the same resources is exactly what maximizing resource utilization means. Making resource overcommitment viable is the greatest value of the topic we are discussing.
Q5: So for low-priority containers, you set the request much smaller than the limit, and then dynamically adjust the cgroup?
A5: In the existing QoS model, you can think of the adjusted Pods as Burstable. However, instead of directly changing the limit in the Pod's metadata, we adjust the limit as reflected in the cgroup, and adjust it back once resource contention eases. We do not recommend letting a machine's cgroup data diverge from the central data in etcd for too long; if it diverges for long, we raise an alarm, as VPA does, and link with VPA to make the adjustment. Of course, during a container's peak period, any operation that rebuilds the container is unwise.
Q6: So the overall idea is that you initially overcommit the physical machine with a certain proportion of Pods, and then dynamically adjust the containers' cgroup values through policy?
A6: The dynamic adjustment is meaningful even when resources are completely abundant. It is not only when resources are full that high-priority applications suffer interference: in practice, once the host's CPU usage reaches a certain level, say 50%, application latency already grows. To fully guarantee the SLO of high-priority applications, sacrificing the smooth running of low-priority CPU consumers is also worthwhile.
Q7: Has the Policy Engine been considered for open source?
A7: There are open-source plans. The Policy Engine is closely tied to an organization's own application characteristics; e-commerce applications and big-data applications call for different strategies. We will first open-source the framework with some simple strategies attached; users can customize further strategies themselves.
Q8: Most applications I have encountered cannot perceive their cgroup configuration correctly, so in many cases parameters must be set from CPU or memory values in the startup arguments. In other words, even if the cgroup changes, it does not help them, so the applicable scenarios are limited.
A8: Restricting container resource usage is still valuable: limiting the low-priority applications alone improves the SLO of high-priority applications, even if the effect is less obvious. Stability considerations are equally important.
Q9: How is the Policy Engine currently used at Alibaba? At what production scale is this dynamic adjustment applied? Is it used together with the community's HPA and VPA?
A9: The Policy Engine is already in use in some Alibaba clusters; the scale cannot be disclosed for the time being. When it comes to linking many components, the community's HPA and VPA cannot quite meet our needs at the moment, so Alibaba's HPA and VPA are both developed in-house, though consistent with the community's principles. For Alibaba's open-source HPA, watch the OpenKruise community; I have no definite information about open-sourcing VPA.
Q10: When a node's resources are insufficient for a container to grow, can HPA or VPA provide the expansion?
A10: When a node runs short, the application can add replicas through HPA to cope. VPA, however, fails if it picks the original node for the update and can only schedule onto other resource-rich nodes. Under a steep traffic increase, rebuilding containers may not keep up with demand and may even trigger an avalanche: while a container is being rebuilt, the application's other, not-yet-upgraded replicas take on more traffic and get OOM-killed, and the newly started containers are OOM-killed the instant they come up. Restarting containers must therefore be done with caution. Rapid scale-out (HPA), or quickly raising high-priority containers' resources while suppressing low-priority ones, is more effective.
Follow the official account "Alibaba Cloud Native" and reply with the keyword "1010" to obtain the slides for this article.