2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)11/24 Report--
Recently, Zhang Dong, chief scientist of Inspur Yunhai (InCloud OS), and senior researcher Kui Kaiyuan published a paper entitled "A System Design Method for One Cloud, Multiple Cores" in issue 9 (2023) of Communications of the China Computer Federation. The paper analyzes in depth the key challenges behind "one cloud, multiple cores" (one-cloud-multi-core), explains its system design method and key technical routes, and on that basis lays out a three-stage development roadmap, offering a new way to evolve one-cloud-multi-core toward the goal of application-aware, architecture-agnostic operation.
In recent years, strong market demand has accelerated the development of cloud computing hardware and software in China, and an innovation and industry chain has taken initial shape, spanning chips, whole machines, cloud operating systems, middleware and application software. As digital and intelligent transformation deepens across industries, application scenarios are diversifying, and more and more data centers are building diversified computing power, which poses new challenges for pooled management and flexible scheduling.
The central processing unit (CPU) is the most widely used computing device, and the heterogeneity created by combining multiple vendors and multiple architectures is especially pronounced. x86 processors from Intel, AMD and others still dominate the data center, but their share is gradually shrinking; the ARM architecture, with advantages such as many cores and low power consumption, is developing strongly; and the open-source RISC-V architecture is also on the rise. At the same time, against the backdrop of global supply-chain restructuring, R&D and production of core components in China have entered a stage of vigorous growth, but because of a late start, differing technical routes and uneven maturity, multiple heterogeneous processors will coexist and develop side by side for a long time.
Key scientific problems of one-cloud-multi-core
As a computing power supply model that pursues performance per cost, cloud computing processors are shifting from a single architecture to multiple heterogeneous architectures as they are upgraded, replaced and expanded. Given the differences in function, performance and reliability among heterogeneous processors, meeting the requirements of efficiency and stability, enabling low-cost or cost-free switching of applications across processors, avoiding supply risks, and keeping key business running stably over the long term make one-cloud-multi-core an inevitable trend in cloud computing.
The Internet industry started one-cloud-multi-core work for the public cloud early, using its technical and financial reserves to develop cost-effective processors and break dependence on the x86 architecture, for example Amazon's ARM-based Graviton processors. Facing the contradiction between resource diversity in private clouds and the complexity of northbound applications, domestic industries such as telecommunications and energy have also begun one-cloud-multi-core research and construction, initially managing multiple heterogeneous resource pools through a cloud management platform. Although this creates a unified entry point, resource supply remains inefficient because the resource pools are fragmented and applications cannot be orchestrated across architectures.
The Yunhai cloud operating system (InCloud OS), Apsara Stack, EasyStack and others achieve unified scheduling and interconnection of heterogeneous resources through a single resource pool, but at this stage they mainly solve the problem of mixed "multi-core" deployment, which is still far from application-centric cross-architecture operation and low-cost switching. To support stable operation, smooth switching and elastic scaling of services under multi-core coexistence, the following scientific and technical problems urgently need to be solved.
1. Cross-architecture application portability and running-environment equivalence. When an application runs on nodes with different processor architectures in a one-cloud-multi-core system, the program itself must first be portable across architectures. Beyond that, hierarchical, modular, complex applications must be able to migrate dynamically, be invoked remotely, or scale horizontally across heterogeneous nodes; how to guarantee cross-architecture equivalence of the running environment (operating system, runtime, dependent libraries, etc.) is a challenge (see figure 1).
2. Quantitative analysis of heterogeneous computing power and load-aware scheduling. The performance of heterogeneous CPUs can differ by a factor of 2 to 10, and the computing power gap between nodes with heterogeneous acceleration units can reach an order of magnitude. When applications migrate, switch or scale between heterogeneous nodes, the user experience must stay consistent and the business's service level agreement (SLA) must be honored. How to model and quantify equivalence relations between heterogeneous computing power, so as to achieve load-aware balanced scheduling and adaptive elastic scaling, has become a key scientific problem.
3. State consistency of distributed applications on non-peer architectures. Unlike traditional distributed systems with equivalent nodes, the non-equivalence of heterogeneous nodes in a one-cloud-multi-core system cannot be ignored. For distributed cloud-native applications on such non-peer nodes, achieving efficient consensus negotiation and data synchronization for stateful tasks, and non-intrusive dynamic traffic control and smooth traffic splitting for stateless tasks, has become a key technical difficulty in cross-architecture cloud-native application orchestration.
Design and key technologies of one-cloud-multi-core systems
Niklaus Wirth, an ACM Turing Award winner, proposed the famous formula "algorithms + data structures = programs", which reveals the temporal and spatial nature of programs. Extending this with software definition, a one-cloud-multi-core system comprises not only the instruction logic and data state of the data plane, but also the control of multiple heterogeneous resources. A one-cloud-multi-core system can therefore be abstracted as "resource management + running programs + data state".
Resource management abstracts hardware resources such as compute, storage, network and security through software definition, and supplies applications with resource encapsulations and running environments at granularities such as virtual machine, container and bare metal. Running programs are divided by hierarchical decoupling into the resource layer, platform layer and application layer, covering both the applications that carry user services and the resource management programs. Data state refers to the transient data in memory, the data persisted in databases, and the traffic state on which programs depend.
By this definition, a one-cloud-multi-core system should be designed from three aspects: runnability of programs, manageability of resources, and transferability of state.
1. Runnability of programs
To run across architectures in a one-cloud-multi-core system, the primary design goal of a program is runnability, i.e. the ability to be ported to and executed on different processor architectures. The technical routes include cross-platform languages, cross-compilation and instruction translation (see Table 1).
Cross-platform languages, represented by Java and Python, let the architecture-independent part of a program run across architectures, but architecture-related problems remain: (1) runtime environment dependence, for example a Java program in a one-cloud-multi-core system needs a Java virtual machine (JVM) runtime for each architecture; (2) native library dependence, for example libraries accessed through the Java Native Interface (JNI) still have to be ported per platform.
Cross-platform compilation, i.e. cross-compilation, generates executables for other architectures with the help of an architecture-specific toolchain and environment. It produces per-architecture binaries from architecture-independent source code, but the resulting executables must still be matched one-to-one with processor architectures.
Binary translation, i.e. instruction-set translation, is a research hotspot for cross-architecture application migration. It can be implemented at the software level or the chip level, and both are constrained by the translation system. Software-level binary translation requires modifying the application's running environment, which adds complexity, while chip-level binary translation suffers serious performance loss. For example, current translators reach only 60%-70% of directly compiled performance for pure computation, dropping to 30%-40% when system calls, locks and similar operations are involved; binary translation also still faces instruction-set incompatibilities, such as Advanced Vector Extensions (AVX) instructions.
Equivalent encapsulation of the runtime
A cross-platform language solves the application's cross-architecture problem but requires a per-architecture runtime; cross-compilation solves cross-architecture compilation but leaves runtime dynamic-library dependencies. So when a program runs in a one-cloud-multi-core system, a modern complex application must consider not only its own runnability but also the runtimes it depends on. A feasible approach is to package the application together with its runtime dependencies as standardized containers, which then serve as the basic resource encapsulation for cross-architecture deployment and switching.
In other words, container images for different architectures are built from the same source code: if the program is written in a cross-platform language, the script or intermediate code is packaged together with the runtime as a container; if it is written in a non-cross-platform language, binaries for each architecture are produced by cross-compilation and then packaged with their dependent libraries as containers. This process can be automated by a pipeline job that builds the images and pushes them to the image registry.
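The per-architecture images built by such a pipeline are typically exposed behind one logical name, and deployment resolves the right variant for each node. A minimal sketch of that resolution step (all registry and image names here are hypothetical, mimicking a multi-arch manifest list):

```python
# Hypothetical manifest: one logical image, one entry per architecture
# the pipeline built it for.
MANIFEST = {
    "myapp:1.0": {
        "amd64": "registry.example.com/myapp:1.0-amd64",
        "arm64": "registry.example.com/myapp:1.0-arm64",
    }
}

def resolve_image(image: str, node_arch: str) -> str:
    """Return the architecture-specific image for a node, or raise if the
    image was never built for that architecture."""
    variants = MANIFEST[image]
    if node_arch not in variants:
        raise ValueError(f"{image} has no build for {node_arch}")
    return variants[node_arch]

print(resolve_image("myapp:1.0", "arm64"))
# -> registry.example.com/myapp:1.0-arm64
```

In practice container registries implement this lookup natively via manifest lists, so a node simply pulls the logical tag and receives the matching variant.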
In summary, the runnability design of one-cloud-multi-core programs has three parts: compile and run applications across architectures, build standard container packages, and achieve lightweight deployment through cloud resource orchestration and management (see figure 2).
2. Manageability of resources
Manageability of resources covers architecture awareness and quantitative analysis of computing power, as well as system-oriented balanced resource scheduling and business-oriented elastic scaling.
Architecture awareness technology
Architecture awareness is the key to node scheduling and adaptive display of interface functions in a one-cloud-multi-core system, and the basis for program runnability and lifecycle management of resource encapsulations. It can be implemented through collectors, schedulers and interceptors (see figure 3). (1) The collector gathers and reports each node's CPU architecture, hardware features and other information, and builds a host list that includes architecture characteristics. (2) The scheduler selects matching host nodes for resource encapsulations of various granularities using a cascade filter mechanism: several independent filters are loaded and the creation request is matched against hosts in turn. In a one-cloud-multi-core scenario, a cascaded architecture-awareness filter recognizes the image architecture tag in the resource-encapsulation creation request and filters host nodes by matching CPU architecture characteristics. (3) The interceptor maintains a dynamically extensible "architecture-function" mapping matrix, parses the action and architecture characteristics of each resource-encapsulation management request, intercepts requests and feeds back display results, so that functional differences between architectures are automatically recognized and dynamically extended, underlying implementation differences are shielded, and a unified resource management view is presented.
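The cascade filter mechanism in step (2) can be sketched as follows (a hedged illustration with hypothetical host and request fields, not InCloud OS code): each filter sees only the hosts that survived the previous one, and the architecture-awareness filter matches the request's image architecture tag against the architecture reported by the collector.

```python
# Hypothetical host list as reported by the collector.
hosts = [
    {"name": "h1", "arch": "x86_64", "free_cpu": 8},
    {"name": "h2", "arch": "aarch64", "free_cpu": 16},
    {"name": "h3", "arch": "aarch64", "free_cpu": 2},
]

def arch_filter(request, candidates):
    """Architecture-awareness filter: keep hosts matching the image arch tag."""
    return [h for h in candidates if h["arch"] == request["image_arch"]]

def capacity_filter(request, candidates):
    """Keep hosts with enough free CPU for the resource encapsulation."""
    return [h for h in candidates if h["free_cpu"] >= request["cpu"]]

def schedule(request, candidates, filters=(arch_filter, capacity_filter)):
    # Cascade: each filter narrows the survivors of the previous one.
    for f in filters:
        candidates = f(request, candidates)
    return candidates

req = {"image_arch": "aarch64", "cpu": 4}
print([h["name"] for h in schedule(req, hosts)])  # ['h2']
```

The surviving candidates would then be ranked by the balanced scheduling algorithm described below; adding a new constraint is just appending another filter to the cascade.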
Computing power quantification technology
Because processors differ in computing power, the same application with the same resource-encapsulation specification (the same number of CPU cores, the same memory, etc.) performs differently in heterogeneous environments. By application scenario, computing power can be divided into general-purpose CPU computing power and heterogeneous XPU computing power; the main problem facing a one-cloud-multi-core system today is CPU diversity and heterogeneity. ARM and x86 processors from different vendors differ in instruction set, core count, process technology and so on, and therefore in performance. These differences can be described by computing-power equivalence relations, divided hierarchically into specification computing power, effective computing power and business computing power (see Table 2).
Specification computing power is the most general; effective computing power targets specific load types; and business computing power is closest to real application scenarios. Because loads and applications are so diverse, however, calculating effective and business computing power requires cooperation with the upstream and downstream ecosystem.
Balanced scheduling technology
At the resource level, selecting nodes for resource encapsulations schedules load according to node computing capacity, a constrained optimization problem whose goal is to maximize resource utilization. After the cascade filters run, the balanced scheduling algorithm picks, from the filtered host nodes, the one with the relatively lowest load. For a one-cloud-multi-core system, the key step is quantitative analysis of node computing power: specification coefficients of each resource type are evaluated from specification computing power, and combined with numerical methods such as normalization and dominant resource fairness, the available computing power of each node can be calculated. The normalization-based algorithm is as follows:
The score Score_j of node j is the sum of weighted scores over its r resource types (CPU, memory, disk, etc.), as in formula (1): Score_j = Σ(i=1..r) Score_ji. The weighted score of each resource type is given by formula (2): Score_ji = ResourceNormalized_ji × WeightMultiplier_i × Coefficient_ji, where ResourceNormalized_ji is the min-max positive normalization of node j's allocatable amount of resource i, as in formula (3). WeightMultiplier_i is the resource weight, which can be tuned for CPU-, memory- or IO-intensive loads to reflect each resource's importance. Coefficient_ji is the specification computing-power coefficient of each resource; for example, if the specification computing power of ARM and x86 CPUs is quantified as 1:2, their specification coefficients are 1 and 2 respectively, so with equal CPU core counts the x86 node is scheduled first. This realizes balanced scheduling based on computing-power quantification in a one-cloud-multi-core scenario.
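The scoring in formulas (1)-(3) can be sketched directly (illustrative numbers and field names, not the production algorithm): each node's score is the weighted sum, over resource types, of its min-max-normalized free capacity scaled by a per-architecture specification coefficient.

```python
def min_max(value, lo, hi):
    """Formula (3): min-max positive normalization; degenerate range -> 1."""
    return (value - lo) / (hi - lo) if hi > lo else 1.0

def score(node, nodes, weights, coeff):
    """Formulas (1)-(2): weighted, coefficient-scaled normalized resources."""
    total = 0.0
    for res, w in weights.items():
        values = [n[res] for n in nodes]
        norm = min_max(node[res], min(values), max(values))
        total += norm * w * coeff[node["arch"]][res]
    return total

nodes = [
    {"name": "x86-1", "arch": "x86", "cpu": 8, "mem": 32},
    {"name": "arm-1", "arch": "arm", "cpu": 8, "mem": 32},
]
weights = {"cpu": 0.7, "mem": 0.3}          # load is CPU-intensive
# ARM:x86 specification computing power quantified as 1:2,
# hence CPU coefficients 1 and 2.
coeff = {"x86": {"cpu": 2.0, "mem": 1.0}, "arm": {"cpu": 1.0, "mem": 1.0}}

best = max(nodes, key=lambda n: score(n, nodes, weights, coeff))
print(best["name"])  # x86-1: with equal free cores the x86 node wins
```

As the text notes, with identical free cores the higher x86 coefficient dominates; once the x86 node fills up, its normalized free capacity drops and the ARM node starts winning, which is the load-balancing behavior intended.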
Elastic scaling technology
To support elastic scaling across business peaks and troughs, resource encapsulations must be planned accurately, scheduled quickly and made computationally equivalent, so that application services scale correctly, quickly and precisely (see figure 4). (1) Resource planning: based on the probability distribution of application load over a given cycle, a load trend model is built from historical time series to draw load portraits and capacity portraits relating application load, quality of service and resources; resource-encapsulation scaling demand is planned through load trend prediction and anomaly feedback. (2) Fast scheduling: building on architecture awareness and balanced scheduling, an expanding resource encapsulation is quickly scheduled to the best node and the application service is started, ensuring timely response. (3) When auto scaling switches resource encapsulations across architectures, the computing power of the different architectures is characterized by computing-power quantification, and the equivalence relation of the resource encapsulations is calculated from effective and business computing power, ensuring that quality of service scales linearly as resources grow or shrink.
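The resource-planning step in (1) can be illustrated with a deliberately naive trend predictor (window size, capacity and thresholds are assumptions for illustration, not the paper's model): forecast the next load point from recent history and size the replica count before the peak arrives, rather than reacting after it.

```python
def predict_next(history, window=3):
    """Naive trend forecast: last value plus the mean of recent deltas."""
    recent = history[-window:]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    trend = sum(deltas) / len(deltas) if deltas else 0.0
    return recent[-1] + trend

def plan_replicas(history, per_replica_capacity, min_replicas=1):
    """Size replicas so the predicted load fits (ceiling division)."""
    predicted = predict_next(history)
    return max(min_replicas, -(-int(predicted) // per_replica_capacity))

rps_history = [100, 150, 200, 250]        # requests/s, trending upward
print(predict_next(rps_history))          # 300.0
print(plan_replicas(rps_history, per_replica_capacity=100))  # 3
```

A production planner would use the periodic load portraits the text describes (and anomaly feedback to correct mispredictions), but the control flow is the same: predict, then provision ahead of demand.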
3. Transferability of state
Migrating application state at the resource layer moves persistent data, transient memory state, peripheral configuration and network traffic to the target node as a whole, covering all data state in the resource encapsulation. Because this involves not only the application itself but also the operating system, middleware and more, migration is difficult. To address this, one can follow the idea of decoupling the resource, platform and application layers, and adopt state synchronization and traffic splitting based on cloud-native microservice governance.
Migration of resource encapsulations
Live migration of virtual machines is relatively mature: the pre-copy algorithm iteratively transfers the source VM's incremental memory state to the destination host, and optimizations such as post-copy, hybrid copy and hardware-accelerated compression speed up memory-copy convergence, reduce downtime and improve migration efficiency. But VM migration still has limitations: generational gaps between CPUs from the same vendor, compatibility across vendors and architectures, and the absence of live migration between different architectures. Research on live container migration started later; it is essentially the migration of a process group. Current work is mostly based on Checkpoint/Restore In Userspace (CRIU) to migrate container runtime state, with a series of derived optimizations that shorten migration time and unavailability time.
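The convergence behavior of pre-copy can be shown with a toy model (the page counts and dirty rate are illustrative assumptions, not measurements): each round ships the pages dirtied during the previous round, and once the remaining dirty set is small enough the VM pauses for the final copy, which is the downtime phase the optimizations try to shrink.

```python
def pre_copy(total_pages, dirty_rate, stop_threshold, max_rounds=30):
    """Toy pre-copy loop.

    dirty_rate: fraction of the just-copied pages dirtied again while
    the copy was in flight. Returns (rounds, pages copied while paused).
    """
    to_copy = total_pages
    for rounds in range(1, max_rounds + 1):
        if to_copy <= stop_threshold:
            break  # small enough: pause the VM and copy the rest
        to_copy = int(to_copy * dirty_rate)
    return rounds, to_copy

rounds, downtime_pages = pre_copy(
    total_pages=1_000_000, dirty_rate=0.2, stop_threshold=1_000)
print(rounds, downtime_pages)  # 6 320
```

The model also shows why pre-copy fails to converge for write-heavy workloads: with a dirty rate near 1 the loop hits `max_rounds` with a large residual set, which is what motivates post-copy and hybrid variants.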
In addition, adaptive container live migration matches CPU and network bandwidth resources by dynamically adjusting the compression algorithm's acceleration factor, reducing container snapshot transfer time. Although whole-unit migration at VM and container granularity has seen research and application, problems such as large migration data volumes, long downtime and long total migration time remain, making smooth cross-architecture application switching difficult. With the development of cloud-native technology, combining migration with service governance has become a feasible route, whose key technologies are data synchronization for stateful services and traffic switching for stateless services.
Data state synchronization
Synchronizing the state of multiple replicas depends on distributed consensus algorithms. ACM Turing Award winner Leslie Lamport proposed Paxos, a message-passing, highly fault-tolerant consensus algorithm; ZooKeeper's ZAB, MySQL's wsrep, and the Raft protocol used by etcd and Redis all build on its core ideas to keep data state consistent. On top of this, data state synchronization at the platform layer of a one-cloud-multi-core system must further consider node asymmetry. The following takes Raft as an example.
Leader election: the leader periodically sends heartbeats to all followers to assert its status. When a follower receives no heartbeat within its timeout, it becomes a candidate and starts an election. In a one-cloud-multi-core system, differences in node processing capacity and network conditions mean a single timeout value affects nodes unevenly. An adaptive method based on maximum likelihood estimation can prevent nodes with long heartbeat delays and weak processing capacity from repeatedly triggering elections, while still letting strong nodes initiate elections quickly. For the voting strategy, node priorities or a narrowed random-timeout range make it easier for strong nodes to win a majority of votes.
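One way to read the maximum-likelihood idea (a sketch under an assumed Gaussian delay model; the factor k and floor are tuning assumptions, not values from the paper): the ML estimates of heartbeat delay are simply the sample mean and standard deviation, so each follower can set its election timeout to mean + k·sigma. Jittery, slow nodes then get longer timeouts and stop triggering spurious elections, while stable nodes keep short ones.

```python
import statistics

def election_timeout(heartbeat_intervals_ms, k=4.0, floor_ms=150.0):
    """Adaptive timeout: ML estimates (mean, sigma) under a Gaussian model,
    with a floor so stable nodes never time out absurdly fast."""
    mean = statistics.fmean(heartbeat_intervals_ms)
    sigma = statistics.pstdev(heartbeat_intervals_ms)
    return max(floor_ms, mean + k * sigma)

fast_node = [50, 52, 48, 51, 49]       # stable, low-latency heartbeats
slow_node = [50, 120, 60, 200, 90]     # jittery, congested link
print(election_timeout(fast_node))     # hits the 150 ms floor
print(election_timeout(slow_node))     # well above 300 ms
```

The same per-node statistics could also feed the voting strategy: a node with a low estimated delay is a better leader candidate and can be given priority or a narrower random-timeout range.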
Log replication: using a quorum-write mechanism, the leader receives client requests, sends write proposals to followers and collects votes; a proposal commits only with more than half the votes. Heterogeneous nodes in a one-cloud-multi-core system are laid out as disaster-recovery availability zones (AZs) to ensure that every availability zone receives the writes.
Business traffic splitting
Cloud-native applications distribute traffic to stateless replica instances through gateways or load balancers; for stateless workloads, traffic is the state. In a one-cloud-multi-core system, when applications migrate or scale between heterogeneous nodes, traffic must be split and drained to the replicas on the corresponding nodes. To keep quality of service from degrading, the specification and number of equivalent target replicas are determined by quantitative analysis of effective and business computing power, the traffic share each carries is allocated accordingly, and traffic switching is fully decoupled from business logic, which can be realized with the service mesh approach.
The control plane detects replica changes, generates a traffic-splitting policy, and pushes it to the network proxies and gateways. For east-west traffic, the proxy hijacks traffic and forwards it to the different replicas in proportion to the policy; for north-south traffic, the gateway forwards according to the policy (see figure 5). During the transient of a traffic switch, factors such as a target replica that has not finished starting or TCP connection delays can degrade quality of service, causing unresponsiveness or packet loss; warm-up, probes, retries and traffic-draining techniques can ensure smooth cross-architecture switching.
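Proportional forwarding can be sketched as weighted request routing (a hedged illustration: the 2:1 weights stand in for a computing-power equivalence result, and the replica names are hypothetical). Hashing the request id makes the choice deterministic and stateless, which is how a proxy can split traffic without coordination:

```python
import hashlib

def pick_replica(request_id: str, weights: dict) -> str:
    """Deterministically map a request to a replica, proportional to weight."""
    total = sum(weights.values())
    # Stable hash of the request id -> position on [0, total).
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % total
    for replica, w in sorted(weights.items()):
        if h < w:
            return replica
        h -= w
    raise AssertionError("unreachable")

# Assumed 2:1 effective-computing-power ratio between the replicas.
weights = {"replica-arm": 1, "replica-x86": 2}
counts = {"replica-arm": 0, "replica-x86": 0}
for i in range(3000):
    counts[pick_replica(f"req-{i}", weights)] += 1
print(counts["replica-x86"] > counts["replica-arm"])  # roughly a 2:1 split
```

Shifting traffic during a migration then amounts to the control plane republishing the weight table; the proxies pick up the new proportions without touching business logic.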
Development path of one-cloud-multi-core
Following the system design of resource manageability, program runnability and state transferability, one-cloud-multi-core can evolve gradually through three stages (see figure 6).
Phase 1: hybrid deployment, unified management, unified view
The first stage targets manageability: unified pooled management, a unified service catalog, and unified operations and maintenance monitoring of heterogeneous processor nodes, with cross-architecture application deployment and collaboration achieved through same-source heterogeneous builds, offline migration, manual switching and business partitioning. One-cloud-multi-core construction at home and abroad is currently mostly at this stage. Following the system design method, the author's team has, in the development of InCloud OS, proposed continuous integration based on same-source heterogeneous builds, continuous delivery based on immutable infrastructure, and architecture-aware scheduling, which compile a single mainline of cloud operating system source code into executables for heterogeneous nodes and achieve minute-level builds of C/C++, Java, Python and Go code on 8 mainstream processors, providing a reference scheme for various kinds of applications (see figure 7).
In cloud platforms built on InCloud OS, a single resource pool supports all mainstream processor architectures and cascades up to 1000 nodes per controller, realizing unified management and interconnection of one-cloud-multi-core data centers more than 1000 kilometers apart, supporting diverse digital and intelligent business needs, and producing technical specifications and a reference architecture (see figure 8).
Phase 2: business traction, hierarchical decoupling, architecture upgrade
Building on the first stage, and to further enable low-cost cross-architecture switching of applications, the second stage realizes cross-architecture application migration, mixed multi-architecture deployment and traffic splitting through hierarchical decoupling and architecture upgrades. The author's team has made preliminary explorations at the resource, platform and application layers.
1. At the resource layer, to improve migration applicability across diverse CPUs, an online migration method based on consistent snapshots is proposed, combined with a GuestOS-aware adaptation mechanism. Through changed-data-block tracking and multi-threaded asynchronous optimization, fast and complete migration of 10 TB class virtual machines is achieved. After migration, the system runs hardware checks at initialization, and if a required CPU feature is unsupported it switches to fallback measures to keep the system running. For Windows virtual machines in particular, CPU and firmware self-adaptation supports desktop versions from Windows XP and server versions from Windows 2000 upward, and the method has been applied in production environments. However, VM-level migration is not application-aware; migration may risk database or application anomalies, and application developers need to cooperate in verifying availability after migration.
2. At the platform layer, the solution currently used in production realizes cross-architecture operation of stateful applications through data synchronization and business partitioning (see figure 9). Based on InCloud OS, x86 and ARM database cluster services and a data synchronization service are provided. The synchronization service captures data changes from the source database's write-ahead log (WAL); compression, transaction merging and network packet encapsulation reduce network protocol overhead and latency, while multi-task parallel replay of packets and a native loading mechanism on the target side raise replay efficiency, achieving sub-second data synchronization. The application is designed with a read-write separation architecture, writing to and reading from the x86 database while treating the ARM database as read-only, realizing cross-architecture database operation in a one-cloud-multi-core scenario.
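The capture-and-merge step of such a pipeline can be sketched as follows (the event shape and batch size are hypothetical, standing in for the WAL-capture and transaction-merging optimizations described above): changes read from the WAL are grouped into whole transactions and shipped in batches, amortizing network round-trips.

```python
from collections import defaultdict

def batch_wal(events, max_batch=100):
    """Group WAL events by transaction id and emit whole transactions
    in batches of at most max_batch transactions."""
    txns = defaultdict(list)
    for ev in events:
        txns[ev["xid"]].append(ev["change"])
    batches, current = [], []
    for xid in sorted(txns):          # replay in transaction order
        current.append((xid, txns[xid]))
        if len(current) >= max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

wal = [
    {"xid": 1, "change": "INSERT a"},
    {"xid": 2, "change": "UPDATE b"},
    {"xid": 1, "change": "INSERT c"},   # same transaction as the first event
]
print(batch_wal(wal, max_batch=2))
```

On the target side, each batch can then be replayed by parallel workers, one transaction at a time, which is the parallel-replay mechanism the text credits for sub-second synchronization.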
3. At the application layer, InCloud OS completed the first SPEC Cloud benchmark in a one-cloud-multi-core scenario in January 2023, verifying resource manageability with a single resource pool hosting multiple x86 and ARM processor architectures, cross-architecture runnability of the compute-intensive clustering algorithm K-means, and state transferability of the IO-intensive distributed database Cassandra, combined with the balanced scheduling algorithm. Scalability exceeded 90%, performance beat the SLA baseline by 20%, and average instance launch time bettered the previous world record by 25%.
Stage 3: software definition, computing power standards, full-stack multi-core
One-cloud-multi-core means integrating chips with the cloud and coordinating platform with ecosystem. In the third stage, through cooperation across the industry chain, from processors, whole machines and cloud operating systems to databases, middleware and applications, applications are completely decoupled from processor architectures to guarantee long-term stable business operation.
1. At the computing resource layer, while improving processor performance and reliability, system-level design defines standardization and compatibility requirements for processor design, and binary translation technology is continuously optimized through application practice. Beyond supporting multiple CPUs, unified abstraction is extended to heterogeneous computing power such as GPUs and DPUs to realize heterogeneous acceleration collaboration.
2. At the platform layer, break through variable-granularity resource scheduling and allocation driven by application characteristics, solve adaptive configuration and scheduling between application types and resource encapsulations, and research function topology orchestration, efficient scheduling and fast startup, solving the agile construction and elastic scaling of large-scale cloud-native applications.
3. At the application layer, push applications to support multi-core same-source heterogeneous builds, refine best practices for cloud-native transformation and upgrading, and work with the resource and platform layers to achieve application-aware, architecture-agnostic smooth switching and elastic scaling.
4. On computing power measurement, standards and evaluation, research quantitative methods for heterogeneous effective computing power and, together with professional evaluation institutions and the upstream and downstream of the industry chain, establish one-cloud-multi-core industry standards.
Conclusion: one-cloud-multi-core is the inevitable way to handle the coexistence of multiple chips in the data center. To address cross-architecture runnability of applications, quantitative analysis of computing power, load-aware scheduling, and distributed state consistency on non-peer architectures, the author's team has proposed the core design concepts and system design method for one-cloud-multi-core systems.
1. Adhere to a systems mindset: scenario-driven, system-level design. Shift from CPU-centric to system-centric design patterns, and establish an application-oriented technical route of heterogeneous integration, software definition and software-hardware co-design that continuously improves computing efficiency and energy efficiency.
2. Strengthen ecosystem cooperation: layered decoupling and open standards. Decouple processors, whole machines, cloud operating systems, middleware and applications layer by layer, and through ecosystem cooperation eliminate the vertical lock-in and ecosystem fragmentation caused by single technology routes, standardizing one-cloud-multi-core.
3. Set a development roadmap: iterative innovation and continuous evolution. From hybrid deployment, offline migration and manual switching, to smooth switching and elastic scaling based on architecture upgrades, and on to computing power standards and full-stack multi-core.
Current research and practice are in the transition from the first stage to the second. Technologies for program runnability, resource manageability and state transferability have been explored and laid out; the next step is to strengthen cooperation between the industry chain and the innovation chain and iterate toward the goal of application-aware, architecture-agnostic operation, so that the theoretical foundation of one-cloud-multi-core computing becomes more solid and complete, software-hardware collaboration and software-defined mechanisms more mature and effective, application-aware scenario paradigms clearer and more practicable, and the industrial ecosystem more standardized.
© 2024 shulou.com SLNews company. All rights reserved.