
K8s cluster node online rate above 99.9%, scaling efficiency up 50%: the three deep improvements we made

2025-02-25 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--


This article is excerpted from the book "Different Double 11 Technologies: Cloud Native Practice in the Alibaba Economy".

Author | Zhang Zhen (Shouchen), Senior Technical Expert, Alibaba Cloud Cloud-Native Application Platform

Introduction: In 2019, Alibaba's core systems went 100% cloud-native, perfectly supporting the Double 11 promotion. This move to the cloud was unusual: it not only embraced Kubernetes, but used that embrace as an opportunity for a series of deep reforms to the operations and maintenance system.

As a cloud-native best practice, Kubernetes has become the de facto standard container orchestration engine. Kubernetes adoption inside Alibaba Group went through four main stages:

- Research and exploration: in the second half of 2017, Alibaba Group began experimenting with the Kubernetes API to transform its internal self-developed platform, and started adapting the application delivery pipeline to Kubernetes.
- Initial grayscale: in the second half of 2018, Alibaba Group and Ant Financial Services Group jointly invested in the Kubernetes technology ecosystem, working to replace the internal self-developed platform with Kubernetes; this achieved small-scale validation and carried part of that year's Double 11 traffic.
- Cloud grayscale: at the beginning of 2019, the Alibaba economy began a comprehensive move to the cloud. By redesigning the Kubernetes landing plan to fit the cloud environment and reforming outdated operations habits, Alibaba Group completed small-scale validation in cloud data centers.
- Large-scale landing: after June 18, 2019, Alibaba Group began rolling out Kubernetes across the board, met the goal of running all core applications on Kubernetes before the big promotion, and perfectly supported the Double 11 test.

Throughout these years of practice, one question has lingered in the minds of architects: given Alibaba's large and complex business, with its legacy of traditional operations habits and the systems built to support them, what should Kubernetes insist on? Where should it compromise? And what must change?

This article shares Alibaba's thinking on these questions in recent years. The answer is clear: embracing Kubernetes is not an end in itself. The point is to use the cloud-native transformation that Kubernetes brings as a lever on the business: to cure the chronic, stubborn diseases of the traditional operations system, release the elasticity of the cloud, and speed up application delivery.

In its Kubernetes implementation practice, Alibaba focused on the following key cloud-native transformations:

Transformation toward final-state (declarative) operations

In Alibaba's traditional operations system, application changes were carried out by PaaS: it created operation tickets, initiated workflows, and then issued changes to the container platform one by one.

When an application is released, PaaS looks up all the relevant containers in its database and issues a change to the container platform to modify the image of each one. Each change is in effect a workflow involving pulling the image, stopping the old container, and creating the new one. If any step errors or times out, PaaS must retry. In practice, to close the ticket on time, the retry is only executed a few times, and after several failed retries the only recourse is manual handling.

When an application is scaled in, PaaS deletes the list of containers specified by operations staff. If deleting a container fails or times out due to a host exception, PaaS can only retry again and again. To ensure the ticket completes, after a certain number of retries the deletion is simply declared successful. If the host later returns to normal, the "deleted" container may well still be running.

This traditional, process-oriented approach to container changes has problems it cannot solve:

A failed change is not guaranteed to eventually succeed

For example, once a container image change fails, PaaS cannot guarantee the container image will eventually be made consistent; once a container deletion fails, there is no guarantee the container will eventually be deleted. Both cases rely on patrol tasks to reconcile inconsistent containers, and because patrols run infrequently, their correctness and timeliness are hard to guarantee.

Concurrent changes conflict with each other

For example, an application's release and scale-out workflows must lock each other out, otherwise newly added containers would miss the image update. But once changes are serialized by locks, change efficiency drops sharply.

Kubernetes's capabilities offer an opportunity to solve these problems. Kubernetes workloads provide a declarative API for modifying an application's instance count and version. The workload controller watches the actual state of the pods and drives the number and version of pod instances toward the declared final state, which avoids conflicts between concurrent scale-out and release. The kubelet likewise repeatedly attempts to start a pod based on its spec until the pod matches the final state the spec describes. Retries are implemented inside the container platform itself and are no longer tied to the work ticket status of the application.
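As an illustrative sketch of this declarative model (a generic manifest, not Alibaba's actual workload definition; the application name and registry below are hypothetical), a single Deployment object declares both the instance count and the image, and the controller reconciles reality toward it no matter how many individual pod operations fail along the way:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # hypothetical application name
spec:
  replicas: 10              # desired instance count; scaling = editing this field
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        # releasing a new version = changing this tag; the controller
        # rolls pods toward the new spec and retries failures itself
        image: registry.example.com/demo-app:v2
```

Because release (the image tag) and scaling (the replica count) are two fields of one desired state rather than two workflows, they no longer need to lock each other out.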

Self-healing capability transformation

Under Alibaba's traditional operations system, the container platform only produced resources; starting the application and registering it for service discovery were performed by the PaaS system after the container started. This layering gave the PaaS systems maximum freedom and fueled the prosperity of Alibaba's first wave of container ecosystem after containerization. But the approach has a serious problem:

The container platform cannot independently recreate or scale containers; it must coordinate in complex ways with each PaaS in turn, and the upper-layer PaaS systems duplicate a great deal of work. This prevents the container platform from efficiently self-healing when a host fails or restarts, or when a process inside a container crashes or hangs, and it makes elastic scaling very complicated.

In Kubernetes, through the container's command and lifecycle hooks, the steps PaaS used to perform, starting the application and checking its startup status, can be built into the pod itself. In addition, by creating a Service object, containers can be associated with the corresponding service discovery mechanism, unifying the lifecycles of container, application, and service. The container platform no longer merely produces resources; it delivers services the business can use directly. This greatly simplifies fault self-healing and automatic elastic scaling after moving to the cloud.
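A minimal sketch of this idea (the names, script path, and ports below are illustrative assumptions, not Alibaba's actual configuration): startup actions move into a lifecycle hook, startup checks become a readiness probe, and a Service ties discovery to the pod's lifecycle through its labels:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  labels:
    app: demo-app
spec:
  containers:
  - name: demo-app
    image: registry.example.com/demo-app:v2
    lifecycle:
      postStart:
        exec:
          # hypothetical warm-up/registration step PaaS used to run externally
          command: ["/bin/sh", "-c", "/opt/app/post-start.sh"]
    readinessProbe:
      # the pod is added to Service endpoints only once this check passes,
      # replacing PaaS's external "is the app up yet" polling
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo-app          # discovery follows pod lifecycle via this label
  ports:
  - port: 80
    targetPort: 8080
```

When the platform recreates a failed pod, the hook and probe run again automatically, so self-healing no longer requires any linkage back to PaaS.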

Truly unleashing the elasticity of the cloud

In addition, when a host fails, PaaS traditionally scales out the application first and only then deletes the containers on that host. In large clusters, however, we found the scale-out step often gets stuck: the application's resource quota may be exhausted, or the cluster may lack free resources that satisfy the application's scheduling constraints. If scale-out fails, the containers on the host cannot be evicted, and the faulty host cannot be sent for repair. Over time, the cluster accumulates faulty machines that can be neither repaired nor drained.

In Kubernetes, the handling of a faulty machine is much simpler and blunter: instead of scaling out first, the containers on the faulty machine are deleted directly, and the workload controller then scales the application back up. At first glance this sounds reckless, and when Kubernetes landed, many PaaS engineers strongly rejected the approach, believing it would seriously harm business stability. In fact, the vast majority of core business applications maintain a certain amount of redundant capacity, in order to shift global traffic or absorb sudden traffic spikes. Temporarily deleting a small number of containers does not cause a business capacity shortfall at all.

The key problem is how to determine the business's available capacity. That is a hard problem in general, but for self-healing scenarios we do not need an accurate capacity assessment; we only need a pessimistic estimate that keeps self-healing moving. In Kubernetes, how much disruption an application can tolerate can be described quantitatively with a PodDisruptionBudget, for example, as the number or proportion of pods that may be evicted concurrently. This value can be set by referring to the batch proportion used during releases: if an application is usually released in 10 batches, its maxUnavailable can be set to 10% (for a proportion, even if the application has only 10 instances, Kubernetes still considers one instance evictable). What if an application allows none of its instances to be evicted? Sorry, such an application must be modified before it can enjoy the benefits of the cloud. Typically it can come to tolerate instance migration by changing its own architecture, or by automating its operations through an operator.
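The pessimistic estimate described above can be expressed as a small manifest (a sketch with a hypothetical application name, mirroring the 10-batch release cadence mentioned in the text):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app-pdb
spec:
  # at most 10% of matching pods may be voluntarily evicted at once,
  # matching a release cadence of 10 batches
  maxUnavailable: 10%
  selector:
    matchLabels:
      app: demo-app
```

Eviction of containers on a faulty machine then proceeds automatically up to this budget, without any per-application capacity negotiation with PaaS.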



