Detailed explanation of Alibaba Sigma scheduling and cluster management system architecture


Updated 2025-05-09. From: SLTechnology News&Howtos



Key points

Over nine years of Double 11 (Singles' Day), Alibaba's transaction volume grew 280-fold, its peak transaction rate grew more than 800-fold, and the number of systems grew explosively. The complexity and difficulty of supporting Double 11 increased exponentially along the way. The essence of the Double 11 peak is to maximize user experience and cluster throughput within a limited budget, and to absorb the peak at a reasonable cost.

This article explains in detail how Alibaba supports such a huge system from three angles: the unified scheduling system, the mixed-deployment architecture, and the cloud architecture.

Unified scheduling system

Sigma, built in 2011, is the scheduling system that serves Alibaba's online business. A complete, scheduling-centered cluster management system has grown up around it.

Sigma has three layers of "brains" that interact and cooperate: Alikenel, SigmaSlave and SigmaMaster. Alikenel is deployed on every physical machine and enhances the kernel. It flexibly adjusts resource allocation and time-slice allocation according to priority and policy, and, driven by rules configured from the upper layers, makes decisions on task latency, time-slice preemption, and eviction of tasks that preempt unreasonably. SigmaSlave allocates container CPUs and handles emergency scenarios on its own machine. Because the local Slave can decide and react quickly when a latency-sensitive task suffers interference, it avoids the service loss that a slow global decision would cause. SigmaMaster is the most powerful, central brain: it sees the whole picture and makes resource scheduling and algorithmic optimization decisions for container deployment across a large number of physical machines.

The whole architecture follows a final-state-oriented (desired-state) design. After a request is received, its data is written to the persistent storage layer; the scheduler identifies the scheduling requirement and decides where resources are allocated; the Slave detects the state change and drives local allocation and deployment. The system's overall coordination and eventual consistency are very good. We started building the scheduling system in 2011, rewrote it in Go in 2016, and made it compatible with the Kubernetes API in 2017. We hope to build and grow together with the ecosystem.
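The final-state-oriented flow described above can be illustrated with a minimal sketch. This is not Sigma's code: the `Store` class, the function names, and the round-robin placement are all hypothetical, chosen only to show the three steps (persist the request, let the scheduler place it, let the node agent converge).

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical stand-in for the persistent storage layer."""
    desired: dict = field(default_factory=dict)    # app -> requested replica count
    placement: dict = field(default_factory=dict)  # app -> node name per replica


def submit(store, app, replicas):
    """Accepting a request only records the desired state."""
    store.desired[app] = replicas


def schedule(store, nodes):
    """Central scheduler: make placements match the desired replica counts."""
    for app, want in store.desired.items():
        placed = store.placement.setdefault(app, [])
        while len(placed) < want:
            # Round-robin is a placeholder for the real placement algorithm.
            placed.append(nodes[len(placed) % len(nodes)])
        del placed[want:]  # scale in when fewer replicas are wanted


def reconcile_node(store, node, running):
    """Node agent: converge the local containers toward this node's placements.

    `running` maps app -> local container count and is updated in place.
    Returns the list of (action, app, count) steps that were needed.
    """
    target = {}
    for app, placed in store.placement.items():
        count = placed.count(node)
        if count:
            target[app] = count
    actions = []
    for app in sorted(set(target) | set(running)):
        diff = target.get(app, 0) - running.get(app, 0)
        if diff > 0:
            actions.append(("start", app, diff))
        elif diff < 0:
            actions.append(("stop", app, -diff))
    running.clear()
    running.update(target)
    return actions
```

Calling `reconcile_node` a second time with no change in the store returns an empty list. That idempotence is what makes a design of this kind tolerant of retries and restarts.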

Mixed-deployment architecture

Alibaba began promoting the mixed-deployment architecture in 2014, and it is now deployed at large scale inside the company. Online services are long-lived, governed by complex rules and policies, and sensitive to latency. Computing tasks, by contrast, are short-lived, generate a large volume of scheduling requests, demand high throughput, come with varying priorities, and are insensitive to latency. Because these two kinds of scheduling have fundamentally different needs, the mixed-deployment architecture runs them in parallel: a single physical machine can be scheduled by both Sigma and Fuxi, which unifies the base environment. Sigma starts PouchContainer containers through SigmaAgent. Fuxi also claims resources on the same physical machine to start its own computing tasks. All online tasks run in PouchContainer containers, which allocate server resources to them, and offline tasks fill the space left over so that the physical machine's resources are fully used. That is how the two kinds of tasks are deployed together.
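As a rough illustration of "offline tasks fill the space left over", the sketch below admits offline tasks into whatever capacity the online containers are not using. The function name, the task tuples, and the `reserve_ratio` safety buffer are assumptions for this example and are not described in the article.

```python
def fill_with_offline(capacity, online_used, offline_tasks, reserve_ratio=0.25):
    """Return the names of offline tasks that fit into the idle capacity.

    capacity       -- total CPU cores on the machine
    online_used    -- cores currently consumed by online containers
    offline_tasks  -- list of (name, cores) in priority order
    reserve_ratio  -- share of capacity kept free as a buffer (assumed value)
    """
    headroom = capacity - online_used - capacity * reserve_ratio
    admitted = []
    for name, cores in offline_tasks:
        if cores <= headroom:
            admitted.append(name)
            headroom -= cores
    return admitted


tasks = [("batch-a", 20), ("batch-b", 16), ("batch-c", 8)]
quiet = fill_with_offline(64, 10, tasks)   # online load is low
busy = fill_with_offline(64, 30, tasks)    # online load has risen
```

With a lightly loaded machine, more offline work is admitted; as online usage rises, the headroom shrinks and fewer offline tasks fit.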

Key technologies for kernel resource isolation

CPU hyper-threading (HT) isolation: the Noise Clean kernel feature addresses contention between online and offline tasks for hyper-threading resources.

CPU scheduling isolation: a Task Preempt feature, added on top of CFS, raises the scheduling priority of online tasks.

CPU cache isolation: CAT isolates the L3 cache (LLC) between online and offline tasks (Broadwell and later).

Memory isolation: CGroup isolation and OOM priority; Bandwidth Control lowers offline quotas to achieve bandwidth isolation.

Memory elasticity: improves the mixed-deployment result without adding memory. While online tasks are idle, offline tasks may exceed their memcg limit; when online tasks need the memory, offline tasks release it promptly.

Network QoS isolation: management and control traffic is marked gold, online traffic silver, and offline traffic bronze, and bandwidth is guaranteed by tier.
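The tiered bandwidth guarantee can be pictured with a small sketch. The article only says that bandwidth is guaranteed by tier; the strict-priority policy and all names below are assumptions for illustration, and the real mechanism presumably lives in the kernel rather than in application code.

```python
# Highest priority first: control plane, online services, offline tasks.
TIERS = ["gold", "silver", "bronze"]


def allocate_bandwidth(link_mbps, demand):
    """Grant bandwidth tier by tier; lower tiers get what is left over."""
    remaining = link_mbps
    grant = {}
    for tier in TIERS:
        grant[tier] = min(demand.get(tier, 0), remaining)
        remaining -= grant[tier]
    return grant


grant = allocate_bandwidth(1000, {"gold": 100, "silver": 700, "bronze": 500})
```

In this example the offline (bronze) tier asks for 500 Mbps but receives only the 200 Mbps that remain after gold and silver are satisfied.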

Key technologies in online cluster management

Application profiling: profile each application's memory, CPU, disk, and network capacity to learn its characteristics, its resource specification requirements, and its real resource usage at different times; then analyze how the overall specifications correlate with time and optimize scheduling as a whole.

Affinity, mutual exclusion, and task priority: determine which applications, when placed together, need relatively less total computing power and deliver relatively higher throughput. Such applications have a certain affinity for one another.

Different scenarios call for different strategies. On Double 11 the strategy is stability first: a tiling (spreading) strategy uses all resources so that every machine sits at the lowest water level. In daily scenarios the strategy is utilization first: resources already in use are pushed to the highest water level, which frees up a large number of whole machines for large-scale computation.

Applications support automatic shrinking, vertical scaling, and time-sharing reuse.

Rapid scale-out and scale-in of the entire site, elastic memory technology, and so on.
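The two placement strategies above, stability first and utilization first, can be contrasted in a short sketch. The `pick_node` function and the strategy names `spread` and `pack` are hypothetical; they only show how the same request lands on different machines under each policy.

```python
def pick_node(nodes, request, strategy):
    """Choose a node for a request of `request` cores.

    nodes    -- dict of name -> (used_cores, capacity_cores)
    strategy -- "spread" (stability first) or "pack" (utilization first)
    Returns the chosen node name, or None if nothing fits.
    """
    candidates = {
        name: used / cap
        for name, (used, cap) in nodes.items()
        if cap - used >= request
    }
    if not candidates:
        return None
    names = sorted(candidates)  # stable tie-breaking by name
    if strategy == "spread":
        # Tiling: lowest water level first, so load is evened out.
        return min(names, key=candidates.get)
    if strategy == "pack":
        # Fill the fullest machine that still fits, keeping others empty.
        return max(names, key=candidates.get)
    raise ValueError(f"unknown strategy: {strategy}")


cluster = {"a": (2, 32), "b": (20, 32), "c": (30, 32)}
spread_choice = pick_node(cluster, 4, "spread")
pack_choice = pick_node(cluster, 4, "pack")
```

Here node `c` has only 2 free cores and is excluded, so spreading picks the nearly idle `a` while packing picks the busier `b`, leaving `a` free for large jobs.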

Mixed deployment means introducing computing tasks into the online service clusters to improve day-to-day resource efficiency. After offline tasks are introduced, average CPU utilization rises from 10% to more than 40%, while the latency impact on latency-sensitive services stays below 5%, which is entirely acceptable. Our mixed-deployment cluster has now reached a scale of thousands of machines and has been validated on the core transaction path during the Double 11 promotion. This optimization saves more than 30% of servers on a daily basis. This year the deployment will be expanded tenfold to realize the benefits at scale.

Time-sharing reuse improves resource efficiency further. The curve in the figure above is the traffic curve of one of our applications. It is very regular: the left side is the trough at night and the right side is the peak during the day. Ordinary mixed deployment raises utilization to 40% by occupying the resources in the blue shaded part of the figure. Elastic time-sharing reuse uses application profiles to find the trough periods in an application's traffic, shrinks the application's capacity during those periods to release a large amount of memory and CPU, and schedules more computing tasks into the freed resources. With this technique, average CPU utilization rises to more than 60%.
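A minimal sketch of the elastic time-sharing idea follows: given an hourly traffic profile, compute how many instances the application needs in each hour and how many can be released to computing tasks. The function, its parameters, and the sample numbers are illustrative assumptions, not figures from Alibaba.

```python
import math


def elastic_plan(hourly_qps, qps_per_instance, peak_instances, min_instances=1):
    """Return a list of (instances_needed, instances_released) per hour."""
    plan = []
    for qps in hourly_qps:
        needed = math.ceil(qps / qps_per_instance)
        # Never drop below a safety floor or exceed the peak allocation.
        needed = min(max(needed, min_instances), peak_instances)
        plan.append((needed, peak_instances - needed))
    return plan


# Night trough, morning ramp, daytime peak (made-up values).
plan = elastic_plan([100, 950, 2000], qps_per_instance=100, peak_instances=20)
```

In the trough hour almost the entire allocation can be handed to offline work, and at the peak nothing is released, which mirrors the regular day and night curve described above.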

PouchContainer and the progress of containerization

Full containerization is the key technology for improving operations capability and enabling unified scheduling. First, an introduction to Alibaba's internal container product, PouchContainer. It has been built and in production since 2011, originally based on LXC, and in early 2015 it began to absorb Docker's image capabilities and become compatible with container standards. Alibaba's containers are distinctive: they are integrated with the Alibaba kernel, which greatly improves security isolation, and they are currently deployed inside Alibaba Group at a scale of millions.

Now for PouchContainer's development path. In the past we used virtual machine virtualization, and moving from virtualization to containers posed many challenges for the operations system, because migrating that system carries a high technical cost. We achieved a seamless migration from the point of view of Alibaba's internal operations and applications: containers have an independent IP, ssh login, an independent file system, and resource isolation with visible usage. After 2015, Alibaba adopted container standards, formed the new container product PouchContainer, and integrated it into the entire operations system.

PouchContainer's isolation is very good. It is a rich container: you can log in to it and see how many processes it runs and how many resources they occupy. A process crashing does not bring the container down, so a container can run many processes. Compatibility is also very good; older kernel versions are supported, which helps when reusing existing infrastructure. After validation at the scale of millions of deployed containers, we developed a P2P image distribution mechanism that greatly improves distribution efficiency. PouchContainer is also compatible with more industry standards, helps drive standards forward, and supports RunC, RunV, RunLXC, and others. Having been tested at the scale of millions of containers, it is stable and efficient, and we consider it the best choice for enterprises pursuing full containerization.

PouchContainer's architecture is fairly clear: Pouchd interacts with kubelet, Swarm, and Sigma. On storage, we have worked with the industry to build the CSI standard, and distributed storage such as Ceph and Pangu is supported. On the network side, lxcfs is used to enhance isolation, and multiple standards are supported.

PouchContainer now covers most of Alibaba's business units. In 2017 it reached a deployment of millions of containers; online services are 100% containerized, and computing tasks have begun to be containerized, which evens out the operations cost of heterogeneous platforms. It covers multiple operating modes, multiple programming languages, and the DevOps system. PouchContainer covers almost all of Alibaba's business segments, such as Ant Financial, transactions, and middleware.

PouchContainer's open-source plan was announced on October 10, 2017, it was officially open-sourced on November 19, and the first major version is planned for March 2018. By open-sourcing PouchContainer we hope to advance the container field and the maturity of its standards, and to offer the industry a differentiated, competitive technology choice. It helps traditional IT enterprises reuse what they already have, so that older infrastructure can also enjoy the benefits of cloud-native technology, and it lets newer IT enterprises enjoy the advantages of stability at scale and compatibility with multiple standards.

PouchContainer open source address: https://github.com/alibaba/pouch

Cloud architecture

Cloud architecture operation and maintenance system

Clusters are divided into online task clusters, computing task clusters, and ECS clusters. The basic operations systems, such as resource management, single-machine operations, state management, command channels, and monitoring and alerting, have all been connected. For the Double 11 scenario, we carve out a separate zone on the cloud that interconnects with the other scenarios. In this interconnected zone, Sigma can request resources from the computing cluster's servers to produce Pouch containers, or request ECS instances through the cloud's open API to produce container resources. In daily scenarios, Fuxi can request resources from Sigma and create the containers it needs.

In the Double 11 scenario, a large number of online services are built on containers using the large-scale operations system, including mixed deployment at the business layer. Each cluster runs online services, stateful services, and big data analysis. Alibaba Cloud's dedicated clusters also host online services and stateful data services, realizing "datacenter as a computer": multiple data centers are managed as if they were one computer, and the resources that business growth requires can be scheduled across multiple platforms. Building a hybrid cloud lets us obtain servers at very low cost to handle the peak.

Server scale comes first; time-sharing reuse and mixed deployment then greatly improve resource utilization. Together they deliver elastic resources, smooth reuse, and flexible mixed deployment of tasks, and they meet the business capacity target with the fewest servers, in the shortest time, and at the best efficiency. Through this cloud architecture, we cut new IT costs for Double 11 by 50% and daily IT costs by 30%, which has brought an explosion of technical value in cluster management and scheduling and also shows that the spread of container, orchestration, and scheduling technology is inevitable.

The Alibaba scheduling system team is committed to building the most efficient scheduling and cluster management system in the world, and to building the best cloud solution through enterprise containers and container platforms. We look forward to working with colleagues across the industry to lower IT costs for the whole industry and accelerate enterprise innovation.

Shutong (Ding Yu) is a senior technical expert at Alibaba who has taken part in Double 11 eight times. He is responsible for Alibaba's high-availability architecture and Double 11 stability, and heads Alibaba's container, scheduling, cluster management, and operations technology.

The Alibaba scheduling system team provides scheduling, container, and cluster management infrastructure for the Alibaba economy, optimizes Alibaba's overall cloud efficiency and cost, and gives the Alibaba economy and its cloud business strong technical competitiveness. The team is committed to building the world's leading and most efficient scheduling and cluster management system, along with an efficient and stable enterprise-class rich container engine.

The Pouch container team provides the Alibaba economy with container technology at the infrastructure level, helping Alibaba fully containerize its business and laying a solid foundation for the group's "cloud" strategy. The team is committed to building a world-leading, efficient, and stable enterprise-class rich container engine.

An excellent team is always eager for talent. If you want to work at the technical core and challenge the limits of computing, please join us; if you want to work with dedicated and excellent people, please join us; if you care deeply about elegant code, please join us!

The following positions are permanently open: Golang engineer, Java engineer, scheduling architect, container architect, mixed-deployment architect, cluster resource management R&D expert, container PaaS platform technical expert, enterprise container platform solution architect, and others.

Referral contact: xianshu.lj@alibaba-inc.com
