
How Kubernetes Is Changing Cloud Infrastructure


This article describes in detail how Kubernetes has changed Meituan's cloud infrastructure. I hope interested readers find it helpful.

I. Background and Status Quo

Kubernetes is an open source system that brought container applications into large-scale industrial production environments, and it is the de facto standard in the field of cluster scheduling. It has been widely accepted by the industry and applied at scale. Kubernetes has become the management engine of Meituan's cloud infrastructure: it not only brings efficient resource management and greatly reduced costs, but also lays a solid foundation for promoting Meituan's cloud native architecture, supporting platforms such as Serverless and our cloud native distributed database in completing their containerization and cloud native transformation.

Since 2013, Meituan has been building a cloud infrastructure platform with virtualization technology at its core. In 2016, it began to explore container technology and deploy it internally, building the Hulk 1.0 container platform on top of the original OpenStack resource management capabilities. In 2018, Meituan began building the Hulk 2.0 platform on Kubernetes. By the end of 2019, we had basically completed the containerization transformation of Meituan's cloud infrastructure. In 2020, convinced that Kubernetes is the future standard of cloud infrastructure, we began to explore the adoption and evolution of a cloud native architecture.

Currently, we have built a cloud infrastructure around technologies such as Kubernetes and Docker that supports service and application management for all of Meituan. The containerization rate has reached more than 98%, with dozens of Kubernetes clusters, tens of thousands of managed nodes, and hundreds of thousands of Pods. For disaster recovery, however, we cap a single cluster at 5K nodes.

The following figure shows our current scheduling system architecture, built on the Kubernetes engine: a unified resource management system with Kubernetes at its core, serving the various PaaS platforms and businesses. Besides directly supporting Hulk containerization, it also directly supports platforms such as Serverless and Blade, realizing the containerization and cloud native transformation of the PaaS layer.

II. Obstacles and benefits of the transition from OpenStack to Kubernetes

For a company with a mature technology stack, transforming the entire infrastructure is never smooth. In the era of the OpenStack cloud platform, the main problems we faced included the following:

Complex architecture, difficult operation and maintenance: the compute resource management modules in the overall OpenStack architecture are very large and complex, and troubleshooting and reliability were persistent problems.

Prominent environment-inconsistency problems: this was a common problem across the industry before container images emerged, and it hindered rapid business rollout and stability.

Virtualization itself consumes significant resources: the hypervisor consumes roughly 10% of host resources, which is a huge waste once the cluster is large enough.

Long resource delivery and reclamation cycles, inflexible deployment: on the one hand, the whole virtual machine creation process was lengthy; on the other hand, the various resource initialization and configuration steps were time-consuming and error-prone. The cycle from requesting machine resources to delivering them was long, and rapid resource allocation was a problem.

Pronounced peaks and troughs, serious resource waste: with the rapid development of the mobile Internet, the gap between the company's business peaks and troughs kept widening. To ensure service stability, we had to provision for peak demand, which left resources seriously idle during troughs and caused waste.

2.1 The process and obstacles of containerization

To solve the problems of virtual machines, Meituan began to explore a more lightweight container technology, which became the Hulk 1.0 project. Constrained by the resource environment and architecture of the time, however, Hulk 1.0 was a container platform whose resource management layer was still built on the original OpenStack: OpenStack provided the underlying host resource management, which met the business demand for elastic resources and cut the entire resource delivery cycle from minutes to seconds.

However, as Hulk 1.0 was promoted and rolled out, some new problems were exposed:

Poor stability: because the underlying resource management capabilities of OpenStack were reused, the whole scale-up process involved two layers of resource scheduling, the data synchronization flow was complex, and isolation between data centers was poor. A problem in one data center often affected scale-up and scale-down in other data centers as well.

Insufficient capabilities: with many systems and cross-departmental cooperation involved, migration and recovery of faulty nodes was hard to achieve, resource types were relatively limited, and fault troubleshooting and communication were inefficient.

Poor scalability: the control plane of Hulk 1.0 had limited ability to manage the underlying resources and could not be extended quickly for new scenarios and requirements.

Performance: the business demanded ever-faster delivery of scaled-up, elastic resources, while the weak isolation of container technology caused interference between business services and drew negative feedback.

After a period of optimization and improvement, these problems still could not be fully solved, and we had to rethink the architectural rationality of the entire container platform. By then Kubernetes had gradually won recognition and adoption in the industry, and its clear architecture and advanced design ideas gave us hope, so we built a new container platform on Kubernetes. In the new platform, Hulk is based entirely on the native Kubernetes API and connects to the internal release and deployment systems through the Hulk API. The two API layers decouple the whole architecture and give each domain clear boundaries: application management and resource management can iterate independently, and the powerful orchestration and resource management capabilities of Kubernetes come to the fore.

The core idea of containerization is to let Kubernetes do resource management well, while an upper control layer absorbs the dependencies on Meituan's application management and operations systems. This preserves native Kubernetes compatibility, reduces subsequent maintenance costs, and lets resource management converge quickly. It also lowers the learning cost for users requesting resources on the new platform, which matters greatly: it is the "foundation" that allowed us to migrate infrastructure resources quickly and at scale.

2.2 Challenges and strategies in the containerization process

2.2.1 Complex, flexible, dynamic and configurable scheduling strategies

Meituan has many products, a wide variety of business lines, and diverse application characteristics, so we have correspondingly many requirements for resource types and scheduling strategies. For example, some businesses need specific resource types (SSD, high memory, high I/O, and so on), and some need specific spreading strategies (by data center, by service dependency, and so on). Handling such diverse needs is a big problem.

To solve these problems, we added a policy engine to the scale-up pipeline. A business can enter custom policy requirements for its own application APPKEY; at the same time, based on service profiles produced by big data analysis, the engine recommends policies according to business characteristics and the company's application management policies. These policies are ultimately saved to the policy center. During scale-up, we automatically attach the corresponding requirements to application instances as labels, which finally take effect in Kubernetes and deliver the expected resources.
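As an illustration of how such policy labels might be stamped onto instances, here is a minimal Go sketch using client-go. The ResourcePolicy record, the label keys and the pod name are hypothetical stand-ins; the real policy engine and its schema are internal to Meituan.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// ResourcePolicy is a hypothetical record fetched from the policy center
// by application APPKEY; the fields are illustrative only.
type ResourcePolicy struct {
	ResourceType string // e.g. "ssd", "high-mem"
	SpreadDomain string // e.g. "zone" for per-data-center spreading
}

// applyPolicyLabels stamps the policy onto a Pod as labels, which
// scheduler-side rules could then match against node labels.
func applyPolicyLabels(ctx context.Context, cs kubernetes.Interface,
	ns, podName string, p ResourcePolicy) error {

	pod, err := cs.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	// Hypothetical label keys; the real platform defines its own schema.
	pod.Labels["policy.example.com/resource-type"] = p.ResourceType
	pod.Labels["policy.example.com/spread-domain"] = p.SpreadDomain
	_, err = cs.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	err = applyPolicyLabels(context.Background(), cs, "default", "demo-pod",
		ResourcePolicy{ResourceType: "ssd", SpreadDomain: "zone"})
	fmt.Println("label update:", err)
}
```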

2.2.2 Refined resource scheduling and operation

Refined resource scheduling and operation is driven mainly by two considerations: the complex resource demand scenarios of the business, and the shortage of resources.

We deploy multiple Kubernetes clusters on private and public cloud resources. Some of these clusters carry general business, and some are dedicated to specific applications. We organize cloud resources along the cluster dimension, including division by data center and differentiation by machine model. Within a cluster, we build dedicated zones for different business types according to their needs, isolating resource pools to meet business requirements. At a finer granularity, we perform cluster-level scheduling for application-level resource, disaster recovery and stability requirements, finally achieving fine-grained isolation and scheduling of CPU, memory and disk on top of different underlying hardware and software.

2.2.3 Improvement and governance of application stability

Whether on VMs or on the original container platform, there had always been application stability problems, so we needed to invest more in the SLA guarantees for applications.

2.2.3.1 Container reuse

In production, host restarts are a very common scenario. They may be active or passive, but from the user's point of view a host restart means some of the user's in-container data may be lost, and the cost is high. We need to avoid migrating or rebuilding containers and instead recover them in place after the restart. However, as we all know, Kubernetes offers only three restart policies for the containers in a Pod: Always, OnFailure and Never; after a host restart, containers are recreated.

To solve this problem, we added a Reuse policy to the container restart policy types. The process is as follows:

When kubelet runs SyncPod and the restart policy is Reuse, it looks up the Pod's App containers in the exited state. If they exist, it pulls up the most recent App container (there may be more than one); if not, it creates a new container directly.

It updates the pause container ID that the App container maps to with the new pause container's ID, establishing the mapping between the Pod's new pause container and the original App containers.

It then restarts the App container to complete the Pod status synchronization. In the end, container data is not lost even if the host is restarted or the kernel is upgraded.
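The following is a simplified, hypothetical Go sketch of what such a Reuse branch in the pod sync loop could look like, mirroring the steps above. The types and callbacks are stand-ins, not the real kubelet interfaces; the actual implementation lives inside Meituan's modified kubelet.

```go
package main

import "fmt"

// Simplified stand-ins for kubelet's view of a Pod's containers.
type ContainerStatus struct {
	ID      string // runtime ID of an existing (possibly exited) container
	Name    string
	Exited  bool
	PauseID string // sandbox (pause container) this container is mapped to
}

// syncPodWithReuse re-attaches exited App containers to the newly created
// pause container and restarts them in place, instead of recreating them,
// so on-disk container data survives a host reboot or kernel upgrade.
func syncPodWithReuse(newPauseID string, statuses []ContainerStatus,
	restart func(containerID string) error,
	create func(name string) error) error {

	for i := range statuses {
		st := &statuses[i]
		if st.Exited && st.ID != "" {
			// Steps 1-2: remap the surviving container to the new sandbox.
			st.PauseID = newPauseID
			// Step 3: pull the old container back up rather than create one.
			if err := restart(st.ID); err != nil {
				return err
			}
			continue
		}
		// No reusable container: fall back to a normal create.
		if err := create(st.Name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	statuses := []ContainerStatus{{ID: "c1", Name: "app", Exited: true, PauseID: "old-pause"}}
	restart := func(id string) error { fmt.Println("restarting", id); return nil }
	create := func(name string) error { fmt.Println("creating", name); return nil }
	_ = syncPodWithReuse("new-pause", statuses, restart, create)
}
```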

2.2.3.2 NUMA awareness and binding

Another user pain point concerns container performance and stability. We kept receiving business feedback that containers with identical configurations showed significant performance differences, mainly in the form of high request latency in some containers. Through testing and in-depth analysis, we found that these containers were accessing CPUs across NUMA nodes; once we confined each container's CPU usage to a single NUMA node, the problem disappeared. Therefore, for latency-sensitive services, the scheduling side needs to be aware of NUMA node usage to guarantee consistent and stable application performance.

To solve this problem, we collect NUMA node allocation at the Node layer and add NUMA node awareness and scheduling at the scheduler layer, keeping resource usage balanced. For sensitive applications that must be bound to a NUMA node, the scale-up fails if no suitable Node is found; applications that do not need NUMA node binding can choose a best-effort strategy.
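A minimal Go sketch of the fit check, under the assumption that a node-level agent reports per-NUMA-node free cores: pick a NUMA node that can hold the entire CPU request, fail when binding is mandatory and nothing fits, and fall back to best effort otherwise. All names are illustrative.

```go
package main

import "fmt"

// NumaNode tracks free CPU cores on one NUMA node of a host; a node-level
// agent is assumed to report these figures to the scheduler.
type NumaNode struct {
	ID       int
	FreeCPUs int
}

// fitNuma picks a NUMA node that can hold the entire CPU request, so a
// container is never split across NUMA nodes. With strict binding, no fit
// means the scale-up fails; otherwise placement falls back to best effort.
func fitNuma(nodes []NumaNode, requestCPUs int, strict bool) (int, bool) {
	best, bestFree := -1, -1
	for _, n := range nodes {
		if n.FreeCPUs >= requestCPUs && n.FreeCPUs > bestFree {
			best, bestFree = n.ID, n.FreeCPUs
		}
	}
	if best >= 0 {
		return best, true
	}
	if strict {
		return -1, false // binding required but impossible: fail the scale-up
	}
	return 0, true // best effort: accept a cross-NUMA placement
}

func main() {
	nodes := []NumaNode{{ID: 0, FreeCPUs: 4}, {ID: 1, FreeCPUs: 12}}
	id, ok := fitNuma(nodes, 8, true)
	fmt.Println(id, ok) // 1 true: only NUMA node 1 can hold all 8 cores
}
```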

2.2.3.3 Other stability optimizations

At the scheduling level, we added load awareness, plus spreading and optimization strategies based on service-profile application characteristics, to the scheduler.

For faulty container detection and handling, an alarm self-healing component built on a feature library can detect, analyze and handle alarms within seconds.

Applications with special resource requirements, such as high I/O and high memory, are placed in isolated zones to avoid affecting other applications.

2.2.4 Platform service containerization

Anyone who has done ToB business knows that every product ends up with big-customer solutions, and the same situation exists internally for a company like Meituan. Containerizing platform businesses has distinct characteristics: instance counts are huge, in the thousands or tens of thousands, so resource costs are high; and these businesses rank high in importance, are generally very core, and demand high performance and stability. It is therefore impractical to solve their problems with a one-size-fits-all approach.

Here we take the MySQL platform as an example. The database business demands very high stability, performance and reliability, and it previously ran mainly on physical machines, so the cost pressure was high. For database containerization, we customized and optimized resource allocation mainly from the host side:

For CPU allocation, each Pod monopolizes its own set of CPU cores to avoid contention between Pods (see the sketch after this list).

Stability is improved by allowing a custom SWAP size to absorb short bursts of high traffic, and by turning off NUMA and PageCache.

For disk allocation, each Pod gets exclusive disks to isolate IOPS, and disks are pre-allocated and pre-formatted to speed up scale-out and improve resource delivery efficiency.

Scheduling supports a dedicated spreading strategy and a scale-down confirmation step to avoid the risks of scaling down.
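As a rough illustration of the exclusive-CPU idea, the Go sketch below reserves whole cores from a host's free pool for one Pod and writes them to a cpuset cgroup. The cgroup path and the allocation bookkeeping are assumptions for illustration; a real agent would integrate with the container runtime.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// allocateExclusiveCPUs reserves n whole cores from the host's free pool
// for one database Pod, so cores are never shared between Pods.
func allocateExclusiveCPUs(free []int, n int) (alloc, rest []int, err error) {
	if len(free) < n {
		return nil, nil, fmt.Errorf("need %d cores, only %d free", n, len(free))
	}
	return free[:n], free[n:], nil
}

// writeCpuset pins the allocation by writing a comma-separated core list
// (e.g. "0,1,2,3") into the Pod's cpuset cgroup. The path is illustrative.
func writeCpuset(cgroupDir string, cpus []int) error {
	parts := make([]string, len(cpus))
	for i, c := range cpus {
		parts[i] = fmt.Sprint(c)
	}
	return os.WriteFile(cgroupDir+"/cpuset.cpus",
		[]byte(strings.Join(parts, ",")), 0o644)
}

func main() {
	free := []int{0, 1, 2, 3, 4, 5, 6, 7}
	alloc, rest, err := allocateExclusiveCPUs(free, 4)
	if err != nil {
		panic(err)
	}
	fmt.Println("pod cores:", alloc, "remaining:", rest)
	_ = writeCpuset // e.g. writeCpuset("/sys/fs/cgroup/cpuset/pod-xyz", alloc)
}
```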

In the end, we improved database delivery efficiency 60-fold, and in most cases performance is even better than on the previous physical machines.

2.2.5 Priority guarantees for business resources

For an enterprise, cost considerations mean resources will always be in short supply, so guaranteeing the supply and allocation of resources is very important.

Business budget quotas determine resource supply, and dedicated zones guarantee dedicated resources.

Elastic resource pools are built, and public clouds are opened up, to handle sudden resource demands.

Resources are granted according to the priority of the business and application type, ensuring that core businesses get resources first.

Multiple Kubernetes clusters across multiple data centers provide disaster recovery against cluster or data center failures.

2.2.6 Landing of Cloud Native Architecture

After migrating to Kubernetes, we went a step further and landed the cloud native architecture.

To remove the obstacles to cloud native application management, we designed and implemented Meituan's cloud native application management engine, KubeNative, which makes application configuration and information management transparent to the business platform. A business platform only needs to create native Pod resources, without attending to the details of application information synchronization and management, and each PaaS platform can extend the control plane's capabilities by running its own Operator.
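To make the Operator extension point concrete, here is a minimal controller-runtime sketch of a hypothetical control-plane extension that watches native Pods and synchronizes their metadata to an external application management system. KubeNative's real mechanics are internal to Meituan; only the general Operator pattern is shown.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodInfoSyncer is a minimal, hypothetical Operator: it watches native
// Pods and pushes application metadata to an external system so the
// business platform never has to.
type PodInfoSyncer struct {
	client.Client
}

func (r *PodInfoSyncer) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Placeholder hook: synchronize pod metadata with the application
	// management side (service registry, configuration center, ...).
	syncToAppManagement(pod.Name, pod.Labels)
	return ctrl.Result{}, nil
}

func syncToAppManagement(name string, labels map[string]string) {
	// Platform-specific synchronization would go here.
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(&PodInfoSyncer{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```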

The following figure shows our overall cloud native application management architecture, which supports the landing of platforms such as the Hulk container platform, Serverless and TiDB.

2.3 Benefits of the infrastructure migration

98% of the company's business has been containerized, significantly improving the efficiency of resource management and business stability.

Kubernetes cluster stability exceeds 99.99%.

Kubernetes has become the standard of Meituan's internal cluster management platform.

III. Challenges and strategies for operating large-scale Kubernetes clusters

Throughout the infrastructure migration, beyond solving historical legacy issues and building supporting systems, the rapid growth in the scale and number of our Kubernetes clusters brought a new challenge: how to operate large-scale Kubernetes clusters stably and efficiently. Over the past few years of running Kubernetes, we have gradually worked out a set of operational practices whose feasibility has been verified.

3.1 Optimization and upgrade of core components

We originally used Kubernetes 1.6, whose performance and stability were relatively poor. Problems appeared as we reached 1K nodes, and at 5K nodes the cluster was basically unusable: scheduling performance was very poor, cluster throughput was low, occasional "avalanches" occurred, and the scale-up path took ever longer.

Our analysis and optimization of the core components can be summarized across four areas: kube-apiserver, kube-scheduler, etcd and containers.

For kube-apiserver, to shorten the long stretches of 429 (Too Many Requests) retries during restarts, we implemented multi-level flow control, as the sketch below illustrates, cutting the unavailability window from 15 minutes to 1 minute; we reduced cluster load by cutting down and avoiding List operations from external systems; and we balanced traffic across control-plane nodes through an internal VIP to keep the control nodes stable.
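The sketch shows the flavor of multi-level flow control: expensive LIST requests get a much smaller budget than ordinary requests, and rejected callers receive 429 with Retry-After so they back off. The limits and the request classifier are illustrative assumptions, not Meituan's actual rules.

```go
package main

import (
	"net/http"
	"strings"

	"golang.org/x/time/rate"
)

// Two tiers of token buckets: expensive LIST requests get a far smaller
// budget than everything else, so a relist storm after an apiserver
// restart cannot starve the cluster. The numbers are illustrative.
var (
	listLimiter    = rate.NewLimiter(rate.Limit(50), 100)
	defaultLimiter = rate.NewLimiter(rate.Limit(2000), 4000)
)

// isList is a deliberately crude classifier for this sketch: collection
// URLs end in a plural resource segment and carry no watch parameter.
// A real gateway would reuse apiserver's request-info parsing.
func isList(r *http.Request) bool {
	return r.Method == http.MethodGet &&
		r.URL.Query().Get("watch") != "true" &&
		strings.HasSuffix(r.URL.Path, "s")
}

func flowControl(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		lim := defaultLimiter
		if isList(r) {
			lim = listLimiter
		}
		if !lim.Allow() {
			// 429 plus Retry-After keeps clients backing off instead of
			// hammering a recovering control plane.
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", flowControl(backend))
}
```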

At the kube-scheduler layer, we enhanced the scheduling awareness policies, making scheduling results more stable than before; the predicate-interruption and local-optimum strategies we proposed for scheduling performance optimization have been merged into the community and become general strategies.

For etcd operations, splitting out an independent Event cluster relieves pressure on the main database (upstream kube-apiserver supports this via the --etcd-servers-overrides flag), and deploying on high-spec SSD physical machines sustains five times the normal daily peak traffic.

At the container level, container reuse improves container fault tolerance, fine-grained CPU allocation improves application stability, and pre-mounting container disks speeds up Node fault recovery.

In addition, the community iterates very quickly, and newer versions are better in stability and feature support, so upgrading is inevitable. How to guarantee a successful upgrade, however, is a big challenge, especially when we do not have enough buffer resources to shuffle workloads around.

For cluster upgrades, the common industry solution is to upgrade the original cluster in place, which has the following problems:

The range of target versions is limited and cross-major-version upgrades are impossible: you can only step from a low version to a higher one little by little, which is time-consuming and has a low success rate.

The risk to the control plane is uncontrollable: especially when the API changes, existing data may be overwritten and the upgrade may not even be rollbackable.

Users notice it, because containers must be recreated: this is a real pain point, and with in-place upgrades container recreation is unavoidable.

For these reasons, we studied Kubernetes' control over the container level in depth, and designed and implemented a scheme that smoothly migrates container data from a low-version cluster to a high-version cluster, refining the cluster upgrade into an in-place hot upgrade of the containers on each host, performed at Node granularity and pausable and rollbackable at any time. The new scheme mainly uses external tools to migrate Node and Pod data from the low-version cluster to the high-version cluster, and it resolves the compatibility issues between Pod objects and containers. Its core ideas: make the low version compatible with the high version's API; refresh the containers' hashes so that the containers under a Pod are not recreated; and migrate Pod and Node resource data from the low-version cluster to the high-version cluster with tooling.
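Kubelet decides whether to restart a container by comparing a hash of the container spec against the hash recorded on the running container, which is why "refreshing the hash" keeps containers alive across the migration. The Go sketch below illustrates only that idea, with simplified stand-in types and hashing; the real hash rules are version-specific kubelet internals.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// Simplified stand-ins: kubelet records a hash of each container's spec on
// the running container and restarts the container when the hash it
// computes no longer matches.
type ContainerSpec struct {
	Name  string
	Image string
}

type RunningContainer struct {
	ID       string
	SpecHash uint64
}

// hashSpec stands in for the version-specific hash rules of the target
// kubelet.
func hashSpec(spec ContainerSpec) uint64 {
	b, _ := json.Marshal(spec)
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

// refreshHash re-stamps a migrated container with the hash the high-version
// cluster would compute, so its kubelet sees no difference and leaves the
// container running instead of recreating it.
func refreshHash(c *RunningContainer, spec ContainerSpec) {
	c.SpecHash = hashSpec(spec)
}

func main() {
	spec := ContainerSpec{Name: "app", Image: "app:v1"}
	c := RunningContainer{ID: "abc123", SpecHash: 42} // hash from the old version
	refreshHash(&c, spec)
	fmt.Printf("container %s now carries hash %d; the new kubelet keeps it\n",
		c.ID, c.SpecHash)
}
```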

The highlights of the scheme mainly include the following four aspects:

Cluster upgrades in a large-scale production environment are no longer a problem.

It solves the uncontrollable risk of the existing approaches: risk is confined to the host level, and upgrades are safer.

It is general-purpose, can upgrade to any version, and the scheme has a long life cycle.

It elegantly solves the problem of container recreation during upgrades, achieving a true in-place hot upgrade.

3.2 Platformization and operational efficiency

Operating large-scale clusters is very challenging, and meeting the needs of fast-growing businesses and users is a great test for the team; we need to consider cluster operation and R&D capabilities along different dimensions.

In building operation and maintenance capabilities for the Kubernetes and etcd clusters, we focus on safe operation, efficient maintenance, standardized management and cost savings. To that end, we have platformized the management and operation of Kubernetes and etcd clusters across six aspects: feature extension, performance and stability, daily operation and maintenance, fault recovery, data operation, and security control.

For a Kubernetes team not running a public cloud business, manpower is very limited; besides daily cluster operation there are R&D tasks, so we care a great deal about operational efficiency. We have gradually migrated daily operation and maintenance onto a Kubernetes cluster management platform built inside Meituan.

Cluster management is standardized and visualized, avoiding raw command-line and console ("black and white screen") operations.

Problems are converged through alarm self-healing and automatic inspection, so even with dozens of clusters our operational efficiency remains high, and on-call engineers rarely need to intervene.

Turning every operation and maintenance action into a workflow not only improves efficiency but also reduces the probability of failures caused by human operation.

By analyzing operational data, we further refine resource scheduling and fault prediction, discover risks earlier, and improve operational quality.

3.3 Risk control and reliability assurance

With our large scale and broad business coverage, any cluster failure directly affects service stability and even user experience. Through many operational failures and security pressures, we have formed a replicable set of risk control and reliability assurance strategies.

We divide the whole risk control chain into five levels: metrics, alarms, tools, mechanisms & measures, and people:

Metric collection: core metrics are collected at the node, cluster, component and resource levels as the data source.

Risk notification: a multi-level, multi-dimensional alarm mechanism covering the core metrics.

Tool support: the risk of mis-operation is reduced through proactive, reactive and workflow-based tooling.

Mechanism assurance: testing, grayscale verification, release confirmation and drills reduce oversights.

People are the root of risk, and we keep investing in training and rotation to guarantee the response to problems.

For reliability verification and operation, we firmly believe effort must be invested: cluster health is evaluated through cluster inspections and pushed reports; regular downtime drills ensure real failures can be recovered quickly; and everyday problems are folded into full-link testing, forming a closed loop.

IV. Summary and Future Prospects

4.1 Experience

Our landing of Kubernetes remains fully compatible with the community's Kubernetes API; we extend only through plug-ins and try not to change native behavior at the control level.

Learn from community features selectively and upgrade with clear expectations; do not blindly chase community versions, and try to maintain one core stable version per year.

Use user pain points as the breakthrough. Businesses are pragmatic: why should they migrate? They fear trouble and will not cooperate, so find the business pain points and approach them from the angle of helping the business, and the effect will be different.

Showing the value of internal cluster management and operation is also a very important link: once users see the value and businesses see the potential benefits, they will come to you on their own initiative.

In the container age, you cannot look only at Kubernetes itself. For infrastructure within an enterprise, "upward" and "downward" integration and compatibility are also critical. "Upward" means providing interfaces for users in business scenarios, because containers do not serve the business directly; application deployment, service governance, scheduling and many other aspects are involved. "Downward" means combining containers with the underlying infrastructure, where supporting more resource types, stronger isolation and higher resource utilization are the key issues.

4.2 Prospects for the future

Unified scheduling: a small number of VMs will exist for a long time, but maintaining two sets of infrastructure products simultaneously is very costly, so we are also moving to have Kubernetes manage both VMs and containers.

VPA: explore using the Vertical Pod Autoscaler to further improve overall resource efficiency.

Cloud native application management: we have already landed cloud native application management in production; next we will expand its coverage and keep improving R&D efficiency.

Cloud native architecture landing: promote cloud native adoption across middleware, storage systems, big data and search businesses.

That concludes this share on how Kubernetes has changed cloud infrastructure. I hope the content above has been helpful.
