2025-04-06 Update From: SLTechnology News&Howtos
Author | Yao Jie, Senior Technical Expert, Cluster Management, Alibaba Cloud Container Platform
This article is excerpted from the book "A Different Double 11 Technology: Cloud Native Practice in the Alibaba Economy", which is available for download.
Introduction: Alibaba's engineers take pride in the fact that in 2019, Alibaba's Double 11 core systems ran on the cloud in a cloud-native way, smoothly supporting a traffic peak of 544,000 transactions per second and RMB 268.4 billion in trading volume. The computing power behind these massive transactions comes from the tight integration of container technology and X-Dragon (Shenlong) bare metal servers.
Forms of cloud machine resources within the Group
Alibaba's Double 11 runs on a three-region, five-unit architecture. Apart from two mixed-deployment units, the other three are cloud units. Validated during the 618 and 99 promotions, the performance and stability of the X-Dragon machines improved substantially and could stably support Double 11. This year, the three Double 11 trading cloud units were 100% based on X-Dragon bare metal, and the core e-commerce X-Dragon cluster reached tens of thousands of machines.
X-Dragon (Shenlong) architecture
Alibaba Cloud's ECS virtualization technology has gone through three generations: the first two were Xen and KVM, and X-Dragon is the third-generation ECS virtualization technology product developed by Alibaba Cloud. It has four major technical features: storage and network VMMs, together with ECS management, are deployed separately from compute virtualization; compute virtualization evolves further toward a Near-Metal Hypervisor; storage and network VMMs are accelerated by dedicated chip IP; and pooled production supports both elastic bare metal instances and ECS virtual machines.
In short, X-Dragon offloads network and storage virtualization overhead to an FPGA hardware acceleration card called the MOC card, eliminating the roughly 8% compute virtualization overhead of traditional ECS, while the manufacturing cost advantage of producing MOC cards at scale smooths out X-Dragon's overall cost. Because an X-Dragon instance behaves like a physical machine, it supports nested (secondary) virtualization, leaving ample room for the evolution of new virtualization technologies such as Kata and Firecracker.
Before Alibaba migrated to the X-Dragon architecture at scale, we were surprised to find that the Group's e-commerce containers ran 10% to 15% better on the cloud than on non-cloud physical machines. Analysis showed the main reasons: virtualization cost is offloaded to the MOC card, so X-Dragon's CPU and memory carry no virtualization overhead, and each container running on X-Dragon on the cloud has its own ENI (elastic network interface), which brings a clear performance advantage. At the same time, each container has its own ESSD cloud disk; a single disk delivers up to 1 million IOPS, 50 times that of an SSD cloud disk, outperforming non-cloud SATA and SSD local storage. This strengthened our determination to adopt X-Dragon at scale to support Double 11.
X-Dragon + containers + Kubernetes
In the era of All in Cloud, enterprise IT architecture is being reshaped, and cloud native has become the shortest path to unlocking the value of cloud computing. In 2019, Alibaba's Double 11 core systems went to the cloud in a cloud-native way, based on X-Dragon servers, lightweight cloud-native containers, and the new Kubernetes-compatible ASI (Alibaba Serverless Infrastructure) scheduling platform. In this setup, the Kubernetes Pod is tightly integrated with X-Dragon bare metal: Pods, as the delivery unit for business workloads, run on X-Dragon instances.
Pods run on X-Dragon in the following form:
ASI Pods run on X-Dragon bare metal nodes. Network and storage virtualization are offloaded to the MOC card, an independent hardware component, and accelerated with FPGA chip technology, so storage and network performance exceed both ordinary physical machines and ECS. The MOC card has its own operating system and kernel, and can dedicate separate CPU cores to AVS (network processing) and TDC (storage processing).

An ASI Pod consists of a main business container, an operations container (the star-agent sidecar), and other auxiliary containers (for example, an application's local cache container). Containers in a Pod share the network, UTS, and PID namespaces through the pause container (ASI disables PID namespace sharing). The Pod's main container and operations container share data volumes: cloud disks are declared through PVCs and mounted at the corresponding mount points. Under the ASI storage architecture, each Pod has independent cloud disk space, with read/write isolation and disk size limits. An ASI Pod connects directly to an ENI on the MOC card through the pause container. No matter how many containers it holds, a Pod occupies a fixed, independent resource quota, such as 16 cores (CPU) / 60 GiB (memory) / 60 GiB (disk).

Large-scale operation and maintenance of X-Dragon
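The Pod anatomy described above (main container plus star-agent sidecar, disabled PID namespace sharing, a per-Pod cloud disk declared through a PVC, and a fixed resource quota) can be sketched as a Kubernetes-style manifest. This is a minimal illustration in Python dict form; the names, images, and PVC claim are invented placeholders, not Alibaba's actual configuration.

```python
# Hypothetical sketch of an ASI-style Pod manifest, expressed as a Python dict
# in Kubernetes-manifest shape. Images, names, and the PVC claim are
# illustrative placeholders, not real ASI configuration.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trade-app-pod"},
    "spec": {
        # ASI disables PID-namespace sharing between the Pod's containers.
        "shareProcessNamespace": False,
        "containers": [
            {   # Main business container, carrying the Pod's fixed quota.
                "name": "main",
                "image": "registry.example.com/trade-app:v1",
                "resources": {"limits": {"cpu": "16", "memory": "60Gi"}},
                "volumeMounts": [{"name": "data", "mountPath": "/home/admin"}],
            },
            {   # Operations sidecar (star-agent), sharing the data volume.
                "name": "star-agent",
                "image": "registry.example.com/star-agent:v1",
                "volumeMounts": [{"name": "data", "mountPath": "/home/admin"}],
            },
        ],
        # Each Pod gets an independent cloud disk, declared via a PVC.
        "volumes": [
            {"name": "data",
             "persistentVolumeClaim": {"claimName": "trade-app-essd-pvc"}},
        ],
    },
}

main = pod["spec"]["containers"][0]
print(main["resources"]["limits"])  # the Pod's fixed quota, independent of container count
```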
In 2019, the core trading X-Dragon cluster on the Double 11 cloud reached a scale of tens of thousands of machines. Managing and maintaining such a large X-Dragon cluster is highly challenging. It requires building capabilities covering instance specification selection for the various cloud workloads, elastic scale-out and scale-in of large clusters, partitioning and control of node resources, statistical analysis of core indicators, base environment monitoring, downtime analysis, node label management, node restart/lockdown/release, node self-healing, faulty machine rotation, kernel patch upgrades, large-scale inspections, and more.
The following sections expand on several of these areas:
Instance specification
First, different instance specifications need to be planned for different types of workloads, covering infrastructure and services with different characteristics such as the entry layer, core business systems, middleware, databases, and cache services. Some need high-performance computing, some need high network packet send/receive capability, and some need high-performance disk I/O. Systematic, holistic planning is needed up front to avoid poor instance specification choices that hurt business performance and stability. The core configuration parameters of an instance specification include vCPU count, memory, number of ENIs, number of cloud disks, system disk size, data disk size, and network packet send/receive capability (PPS).
The main machine model for core e-commerce applications is 96C/527G, and each Kubernetes Pod occupies one ENI and one EBS cloud disk, so the per-instance limits on ENIs and cloud disks are critical. This time, X-Dragon raised the ENI limit to 64 and the EBS cloud disk limit to 40, effectively avoiding waste of CPU and memory resources. In addition, the core configuration differs slightly across business types. For example, because they must absorb large volumes of ingress traffic, aserver instances in the entry layer place high demands on the MOC card's packet send/receive capability; to keep the AVS software network switch from saturating its CPU, the network/storage CPU allocation on the X-Dragon MOC card differs by workload: regular compute X-Dragon instances use a 4:8 split, while aserver X-Dragon instances need a 6:6 split. As another example, hybrid (online/offline colocation) models need separate local NVMe disks for offline tasks. Reasonable planning of models and specifications for different business types greatly reduces cost while ensuring performance and stability.
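The interaction between the per-instance ENI/cloud disk limits and the Pod quota can be checked with simple arithmetic: Pod density on a node is capped by whichever resource runs out first. This is a back-of-the-envelope sketch using the figures quoted above (96C/527G host, 16C/60G Pods, one ENI and one cloud disk per Pod); the formula is illustrative, not ASI's actual scheduler logic.

```python
# Back-of-the-envelope Pod density check for the 96C/527G host described
# above, assuming each Pod takes a fixed 16C/60G quota plus one ENI and one
# cloud disk. The binding constraint is whichever resource is exhausted first.
def max_pods(node_cpu, node_mem_gib, eni_limit, disk_limit,
             pod_cpu=16, pod_mem_gib=60):
    return min(node_cpu // pod_cpu,        # CPU constraint
               node_mem_gib // pod_mem_gib, # memory constraint
               eni_limit,                   # one ENI per Pod
               disk_limit)                  # one cloud disk per Pod

# With limits raised to 64 ENIs / 40 disks, CPU is the binding constraint.
print(max_pods(96, 527, eni_limit=64, disk_limit=40))  # 6

# With a low ENI limit (hypothetical), ENIs would cap density and strand
# CPU and memory, which is the waste the raised limits avoid.
print(max_pods(96, 527, eni_limit=4, disk_limit=40))   # 4
```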
Resource elasticity
Double 11 requires massive computing resources to support peak traffic, but these resources cannot be held year-round, so daily clusters and big-promotion clusters must be partitioned sensibly. In the weeks before Double 11, a large number of X-Dragon instances can be requested from Alibaba Cloud through elastic node scale-out, deployed into independent big-promotion cluster groups, and Kubernetes Pods scaled out at scale to deliver business computing resources. Immediately after Double 11, the Pod containers in the big-promotion clusters are scaled in batch by batch, and the X-Dragon instances in those clusters are destroyed and returned as a whole; only the regular X-Dragon instances are held day to day. Elastically scaling the big-promotion clusters up and down greatly reduces cost. In addition, in terms of delivery cycle, the time from request to machine creation has been shortened from hours or days to minutes since moving to the cloud this year: thousands of X-Dragon instances, including compute, network, and storage resources, can be provisioned within 5 minutes, and a cluster can be created and its nodes imported into Kubernetes within 10 minutes. This greatly improves cluster creation efficiency and lays the foundation for a normalized elastic resource pool in the future.
Core indicators
For operating and maintaining a large-scale X-Dragon cluster, three core indicators measure the overall health of the cluster: the downtime rate, the schedulable rate, and the online rate.
The causes of X-Dragon downtime on the cloud generally fall into hardware problems and kernel problems. Daily downtime trend statistics and root-cause analysis let us quantify cluster stability and head off the risk of large-scale outages. The schedulable rate is a key indicator of cluster health: for various hardware and software reasons (load above 1000, disk pressure, a missing docker process, a missing kubelet process, and so on), containers cannot be scheduled onto abnormal machines, and in a Kubernetes cluster such machines show status NotReady. For Double 11 2019, we greatly improved faulty-machine rotation efficiency using X-Dragon's downtime-restart and cold-migration capabilities, keeping the X-Dragon schedulable rate above 98% and allowing big-promotion resources to be delivered calmly and on schedule. Meanwhile, the Double 11 X-Dragon downtime rate stayed below 0.2‰, which is quite stable.
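The two indicators quoted above reduce to simple ratios. This is a minimal sketch with toy node statuses, assuming the schedulable rate is the fraction of Ready nodes and the downtime rate is expressed in per-mille (‰) of the fleet; the real pipelines would of course read from cluster monitoring, not hard-coded lists.

```python
# Minimal sketch of two cluster-health indicators: schedulable rate
# (fraction of Ready nodes) and downtime rate in per-mille. Node statuses
# here are toy data, not real cluster output.
def schedulable_rate(statuses):
    ready = sum(1 for s in statuses if s == "Ready")
    return ready / len(statuses)

def downtime_rate_permille(down_events, node_count):
    return 1000.0 * down_events / node_count

statuses = ["Ready"] * 9_900 + ["NotReady"] * 100  # toy 10,000-node cluster
print(f"schedulable: {schedulable_rate(statuses):.1%}")       # 99.0%
print(f"downtime: {downtime_rate_permille(2, 10_000):.2f}~")  # 0.20~ (per-mille)
```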
Label management
As cluster size grows, management gets harder. For example, how do you filter out all machines in the production environment in the "cn-shanghai" Region with instance specification "ecs.ebmc6-inc.26xlarge"? We implement batch resource management by defining a large number of preset tags. Under the Kubernetes architecture, machines are managed through labels. A label is a key-value pair, and labels can be attached to X-Dragon nodes through the standard Kubernetes interface. For example, a machine's instance specification label can be defined as "sigma.ali/machine-model": "ecs.ebmc6-inc.26xlarge", and its Region as "sigma.ali/ecs-region-id": "cn-shanghai". With a well-designed label management system, we can quickly select machines from tens of thousands of X-Dragon nodes and perform routine operations such as grayscale batched service releases, batch restarts, and batch reclamation.
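Label-based selection is an exact-match filter over key-value pairs, analogous to `kubectl get nodes -l key=value`. The sketch below shows that filter in miniature; the label keys follow the sigma.ali/* convention quoted above, while the node records themselves are made up.

```python
# Toy version of label-based node filtering, analogous to
# `kubectl get nodes -l ...`: keep nodes whose labels match every key-value
# pair in the selector. Node data is invented for illustration.
def select_nodes(nodes, selector):
    return [n["name"] for n in nodes
            if all(n["labels"].get(k) == v for k, v in selector.items())]

nodes = [
    {"name": "node-1",
     "labels": {"sigma.ali/ecs-region-id": "cn-shanghai",
                "sigma.ali/machine-model": "ecs.ebmc6-inc.26xlarge"}},
    {"name": "node-2",
     "labels": {"sigma.ali/ecs-region-id": "cn-hangzhou",
                "sigma.ali/machine-model": "ecs.ebmc6-inc.26xlarge"}},
]

print(select_nodes(nodes, {"sigma.ali/ecs-region-id": "cn-shanghai",
                           "sigma.ali/machine-model": "ecs.ebmc6-inc.26xlarge"}))
# ['node-1']
```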
Downtime analysis
For very large clusters, daily downtime is common, and statistics and analysis of outages are important for identifying systemic risk. Downtime comes in many kinds: hardware failures cause downtime, and kernel bugs cause downtime too. When a machine goes down, business is interrupted and some stateful applications are affected. We monitor resource-pool downtime through ssh and port-ping inspections, track the historical downtime trend, and raise alarms on any sudden increase. At the same time, we analyze correlations among the downed machines: grouping by machine room, environment, unit, and node group to see whether failures cluster in a specific machine room; grouping by model and CPU to see whether they relate to specific hardware; and grouping by OS and kernel version to see whether they all occur on specific kernels.
The root cause of each outage is also analyzed comprehensively to determine whether it was a hardware failure or an intentional operations event. The kernel's kdump mechanism generates a vmcore on crash; we classify the information extracted from vmcores to count the outages associated with each vmcore type. Error messages in the kernel log, such as MCE logs and soft lockup errors, also reveal whether the system was abnormal before and after the outage.
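The correlation analysis described above amounts to grouping outage records along several dimensions and looking for values that dominate. A minimal sketch with invented records (room, model, and kernel values are placeholders):

```python
# Sketch of the correlation analysis described above: count outage records
# along several dimensions; a dimension where one value dominates hints at a
# correlated root cause (a bad machine room, model, or kernel version).
# The records are invented examples.
from collections import Counter

outages = [
    {"room": "EU13", "model": "ecs.ebmc6-inc.26xlarge", "kernel": "4.9.151"},
    {"room": "EU13", "model": "ecs.ebmc6-inc.26xlarge", "kernel": "4.9.151"},
    {"room": "NA61", "model": "ecs.ebmg5s.24xlarge",    "kernel": "4.19.24"},
]

for dim in ("room", "model", "kernel"):
    counts = Counter(rec[dim] for rec in outages)
    print(dim, counts.most_common(1))  # most frequent value per dimension
```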
Through this series of downtime analysis efforts, the corresponding problems are submitted to the kernel team; kernel experts analyze the vmcores, and hotfixes are issued for the kernel defects that cause downtime.
Node self-healing
Operating a large-scale X-Dragon cluster inevitably runs into software and hardware failures, and on the cloud the technology stack is deeper, so the problems are more complex. Relying solely on manual handling is unrealistic; automation is essential. 1-5-10 node self-healing provides the ability to discover an abnormality within 1 minute, locate it within 5 minutes, and repair it within 10 minutes. The main X-Dragon machine abnormalities include downtime, hung machines, high host load, full disk space, too many open files, and unavailable core services (kubelet, Pouch, star-agent). The main repair actions include downtime restart, eviction of business containers, restart of abnormal software, and automatic disk cleaning. More than 80% of problems can be self-healed by restarting the downed machine and evicting its business containers to other nodes. In addition, we monitor two system events, X-Dragon Reboot (restart) and Redeploy (instance migration), to automatically repair system or hardware failures on the NC side.
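At its core, the repair stage of 1-5-10 is a dispatch from detected anomaly to repair action. The table below paraphrases the anomalies and actions listed above; the mapping itself is an assumption for illustration, not ASI's real rule set.

```python
# Illustrative 1-5-10 self-healing dispatch: map a detected anomaly to a
# repair action. The anomaly/action names paraphrase the article; the table
# is an assumption, not ASI's actual rules.
REPAIR_ACTIONS = {
    "host_down":          "reboot_and_evict_pods",
    "host_hung":          "reboot_and_evict_pods",
    "high_load":          "evict_business_pods",
    "disk_full":          "clean_disk",
    "too_many_openfiles": "restart_service",
    "kubelet_dead":       "restart_service",
    "pouch_dead":         "restart_service",
}

def self_heal(anomaly):
    # Unknown anomalies are escalated to a human rather than auto-repaired.
    return REPAIR_ACTIONS.get(anomaly, "escalate_to_oncall")

print(self_heal("host_down"))      # reboot_and_evict_pods
print(self_heal("weird_anomaly"))  # escalate_to_oncall
```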
Prospects for the future
On Double 11 2020, the infrastructure of the Alibaba economy will be 100% based on Kubernetes, the next-generation hybrid-cloud architecture based on runV secure containers will be rolled out at scale, and the lightweight container architecture will evolve to its next stage.
Against this backdrop, Kubernetes node management in the Alibaba economy will, on the one hand, move toward pooled management: opening up cloud inventory management, improving node elasticity, using off-peak resources according to business characteristics, and further shortening machine holding time to significantly cut costs.
In terms of technical objectives, we will build a core engine based on a Kubernetes Machine Operator to provide highly flexible node operation scheduling and support desired-state-driven node maintenance. On the other hand, based on complete global data collection and analysis capabilities, we will provide first-class full-link monitoring, analysis, and kernel diagnostics to comprehensively improve the stability of the container base environment, and supply performance observation and diagnosis support for the evolution toward lightweight containers and immutable infrastructure.
"A Different Double 11 Technology: Cloud Native Practice in the Alibaba Economy"
Behind Double 11's RMB 268.4 billion in turnover lie repeated attempts at, and practice with, one technical problem after another.
This time, we dug deep into the practical details of cloud native technology during Double 11, selected 22 representative articles, and compiled them into the book "A Different Double 11 Technology: Cloud Native Practice in the Alibaba Economy".
It brings you the highlights of Double 11 cloud native technology:
Detailed accounts of the problems encountered in running the ultra-large Double 11 Kubernetes cluster and their solutions; the technical details of the best cloud-native combination, Kubernetes + containers + X-Dragon, that put 100% of the core systems on the cloud; and the ultra-large-scale Double 11 Service Mesh landing solution.
The Alibaba Cloud Native WeChat official account (ID: Alicloudnative) focuses on microservices, Serverless, containers, Service Mesh, and other technical domains, tracks popular cloud-native technology trends and large-scale cloud-native landing practices, and is the official account that best understands cloud-native developers.