What are the design and principles of the Volcano architecture?

This article introduces the design and principles of the Volcano architecture. The content is quite detailed; interested readers can use it as a reference, and we hope it will be helpful to you.

The background of Volcano

The figure above shows an analysis we have done, divided into three layers. The bottom is the resource management layer; the middle layer holds the domain frameworks, including AI systems, HPC, batch, workflow management, as well as microservices and traffic governance; above that sit the applications of the various industries.

As applications in some industries become more complex, more and more capabilities are required from the solution. For example, more than 10 years ago, a solution for the financial industry had a very simple architecture: a database and some ERP middleware could handle most of a bank's business.

Now a bank collects a large amount of data every day; it needs Spark for data analysis, and even data lake products to build a data warehouse, on top of which it runs analysis and generates reports. At the same time, it also uses AI systems to simplify business processes, and so on.

As a result, some industry applications today are far more complex than they were 10 years ago, and a single application may use one or more of the domain frameworks above. For industry applications, the demand is to integrate multiple domain frameworks; for the domain frameworks, the demand is that the resource management layer below provides unified resource management.

Kubernetes increasingly plays the role of this unified resource manager. It can serve not only the HPC domain frameworks but also act as the resource manager for the big data field. Volcano is, in essence, a batch processing system built on Kubernetes; the goal is for HPC, big data, and AI applications in the layers above to run more efficiently on a unified Kubernetes.

What kind of problems does Volcano solve?

Challenge 1: scheduling strategies for high-performance workloads

E.g. Fair-share, gang-scheduling

Challenge 2: support for multiple types of job lifecycle management

E.g. Multiple pod templates, error handling

Challenge 3: support for multiple kinds of heterogeneous hardware

E.g. GPU, FPGA

Challenge 4: performance optimization for high-performance workloads

E.g. Scalability, throughput, network, runtime

Challenge 5: support for resource management and time-sharing

E.g. Queue, Reclaim

Volcano architecture

In the architecture diagram, the blue parts are components of K8s itself, and the green parts are new components added by Volcano.

Job submission process:

1. kubectl creates a Job (a Volcano CRD) object in kube-apiserver after the request passes Admission (a sketch of such a Job object follows this list).

2. The JobController creates the corresponding Pods (the replicas) according to the configuration of the Job.

3. After the Pods and the PodGroup are created, vc-scheduler goes to kube-apiserver to obtain the Pod/PodGroup and node information.

4. After obtaining the information, vc-scheduler selects an appropriate node for each Pod according to its configured scheduling policies.

5. After a node is assigned to a Pod, the kubelet on that node gets the Pod's configuration from kube-apiserver and starts the corresponding containers.
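To make the Job object from step 1 concrete, here is a minimal sketch in Go of a Volcano Job with a PS/worker layout. It assumes the volcano.sh/apis batch/v1alpha1 Go types (Job, JobSpec, TaskSpec) and the field names shown; treat it as an illustration and check the fields against the Volcano version you actually use.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	batchv1alpha1 "volcano.sh/apis/pkg/apis/batch/v1alpha1"
)

// buildJob sketches a Volcano Job with two task templates (ps and worker).
// minAvailable tells the gang scheduler how many Pods must be placeable
// before any of them is started.
func buildJob() *batchv1alpha1.Job {
	return &batchv1alpha1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "tf-training"},
		Spec: batchv1alpha1.JobSpec{
			SchedulerName: "volcano", // handled by vc-scheduler, not kube-scheduler
			Queue:         "default", // queue used for multi-tenant sharing
			MinAvailable:  6,         // 2 ps + 4 workers, started as one gang
			Tasks: []batchv1alpha1.TaskSpec{
				{Name: "ps", Replicas: 2, Template: corev1.PodTemplateSpec{ /* ps pod spec */ }},
				{Name: "worker", Replicas: 4, Template: corev1.PodTemplateSpec{ /* worker pod spec */ }},
			},
		},
	}
}

func main() { _ = buildJob() }
```

With a single object like this, the JobController in step 2 can create both the ps and worker Pods plus the PodGroup, which is what allows the whole job to be scheduled as one unit.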

Several points that need to be emphasized:

Scheduling policies in vc-scheduler are all in the form of plug-ins, e.g. DRF, Priority, Gang

vc-controllers contains the QueueController, JobController, PodGroupController and gc-controller

vc-scheduler can schedule not only batch computing jobs but also microservice jobs, and it can coexist with kube-scheduler through the multi-scheduler mechanism.

Introduction to some components: Controller

On the left is the Volcano Job Controller, which covers not only the lifecycle management of Volcano Jobs but also general job management. We provide unified job management: as long as you use Volcano, you do not need to deploy a variety of operators; you can run your jobs directly.

On the right is the CRD Job Controller, which is integrated through the underlying PodGroup.

Scheduler architecture

The Scheduler supports dynamic configuration and loading. On the left is the apiserver, holding the Job, Pod, and PodGroup objects; the Scheduler itself is divided into three parts: the first layer is the Cache, the middle layer is the scheduling process, and on the right are the scheduling algorithms in the form of plug-ins. The Cache stores the Pod and PodGroup information created in the apiserver and processes it into JobInfos. The OpenSession step in the middle layer takes the Pods and PodGroups from the Cache, loads the algorithm plug-ins on the right, and runs the scheduling work with them.
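As a rough illustration of this cache-plus-session-plus-plugins structure, the sketch below models one scheduling cycle in Go. All type and function names here (Plugin, Session, openSession) are hypothetical stand-ins that mirror the flow described above; they are not Volcano's actual code.

```go
package main

import "fmt"

// Plugin is a hypothetical scheduling algorithm loaded into a session,
// mirroring plug-ins such as DRF, Priority, or Gang.
type Plugin interface {
	Name() string
	OnSessionOpen(s *Session)
}

// Session is a hypothetical per-cycle snapshot taken from the Cache.
type Session struct {
	Jobs  []string // JobInfos snapshotted from the cache
	Nodes []string // node info snapshotted from the cache
}

// openSession snapshots the cache and hands the snapshot to each plugin;
// the scheduling actions (allocate, preempt, ...) would then run against it.
func openSession(cacheJobs, cacheNodes []string, plugins []Plugin) *Session {
	s := &Session{Jobs: cacheJobs, Nodes: cacheNodes}
	for _, p := range plugins {
		p.OnSessionOpen(s) // each plugin registers its ordering/predicate logic
	}
	return s
}

type gang struct{}

func (gang) Name() string           { return "gang" }
func (gang) OnSessionOpen(*Session) { fmt.Println("gang plugin registered") }

func main() {
	s := openSession([]string{"job-a"}, []string{"node-1"}, []Plugin{gang{}})
	fmt.Printf("session opened with %d jobs, %d nodes\n", len(s.Jobs), len(s.Nodes))
	// when the cycle ends, the session is closed and its state is discarded
}
```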

The transition between states is based on different operations, as shown in the following figure.

In addition, we have added a number of states to Pods and PodGroups. The blue part in the figure shows the states that come with K8s; the green part shows session-level states: for each scheduling cycle we create a session, and these states are valid only within that scheduling cycle, becoming invalid once the cycle ends; the states in the yellow part are kept in the Cache. The purpose of adding these states is to reduce interactions between the scheduler and the API server, thereby improving scheduling performance.

These Pod states open up more possibilities for optimizing the scheduler. For example, when evicting Pods, it is cheaper to evict a Pod in the Binding or Bound state than one in the Running state (think: are there other states in which a Pod could be evicted?); and because the states are recorded inside the Volcano scheduler, communication with kube-apiserver is reduced. At present, however, the Volcano scheduler uses only part of this state machinery: the current preemption/reclaim actions only evict Pods in the Running state. This is mainly because complete state synchronization is hard to achieve in a distributed system, and evicting Pods in the Binding and Bound states would lead to many state races.
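A small sketch of the three levels of state discussed above, with hypothetical state names; the real state set and where Volcano keeps each state may differ.

```go
package main

// Hypothetical illustration of the three levels of Pod state discussed above.
// K8s-level states live in kube-apiserver; session-level states exist only
// within one scheduling cycle; cache-level states persist across cycles
// inside the scheduler, avoiding extra round trips to the API server.

type podPhase string

const (
	// K8s-level (blue in the figure): persisted in kube-apiserver.
	Pending podPhase = "Pending"
	Running podPhase = "Running"

	// Session-level (green): discarded when the scheduling cycle ends.
	Allocated podPhase = "Allocated"
	Pipelined podPhase = "Pipelined"

	// Cache-level (yellow): kept in the scheduler cache between cycles.
	Binding podPhase = "Binding"
	Bound   podPhase = "Bound"
)

// evictionCost is a toy ordering: the earlier a Pod is in its lifecycle,
// the cheaper it is to evict.
func evictionCost(p podPhase) int {
	switch p {
	case Binding, Bound:
		return 1 // not running yet, cheap to undo
	case Running:
		return 10 // work already done would be lost
	default:
		return 0
	}
}

func main() { _ = evictionCost(Bound) }
```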

What benefits can it bring in terms of function?

Support mixed deployment of multiple types of jobs

Support multi-queues for multi-tenant resource sharing, resource planning, and time-sharing reuse of resources

Support a variety of advanced scheduling strategies to effectively improve the resource utilization of the whole cluster (a fair-share sketch follows this list)

Support real-time resource monitoring for fine-grained resource scheduling, e.g. of hot spots and network bandwidth; container engine and network performance optimization, e.g. load-free
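As one hedged example of such an advanced strategy, the sketch below computes DRF-style dominant shares for two jobs; it only illustrates the idea behind fair-share scheduling and does not reproduce Volcano's drf plugin.

```go
package main

import "fmt"

// resources a job has currently been allocated, and the cluster totals.
type resources struct{ cpu, memGiB float64 }

// dominantShare returns the DRF dominant share: the maximum, over all
// resource types, of allocated/cluster-total. Jobs with the lowest
// dominant share are served first, which is what fair-share plugins aim for.
func dominantShare(alloc, total resources) float64 {
	cpuShare := alloc.cpu / total.cpu
	memShare := alloc.memGiB / total.memGiB
	if cpuShare > memShare {
		return cpuShare
	}
	return memShare
}

func main() {
	total := resources{cpu: 64, memGiB: 256}
	jobA := resources{cpu: 16, memGiB: 32} // CPU-heavy
	jobB := resources{cpu: 4, memGiB: 96}  // memory-heavy
	fmt.Printf("job A dominant share: %.2f\n", dominantShare(jobA, total)) // 0.25
	fmt.Printf("job B dominant share: %.2f\n", dominantShare(jobB, total)) // 0.38
	// the scheduler would prefer to allocate to job A next, since its share is lower
}
```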

Distributed training scenario: Gang-scheduler

Case 1: 1 job with 2 ps + 4 workers

Case 2: 2 jobs, each with 2 ps + 4 workers

Case 3: 5 jobs, each with 2 ps + 4 workers

Comparing Volcano with kubeflow + kube-scheduler: in Case 1, with sufficient resources, the two behave the same. In Case 2, two jobs run at the same time without sufficient resources; without gang scheduling, one of the jobs ends up busy-waiting. In Case 3, when the number of jobs increases to 5, there is a high probability of deadlock, and typically only 2 of the jobs complete.
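A minimal sketch of the gang-scheduling rule behind these cases: either the whole gang (minAvailable Pods) can be placed, or nothing is started. The function and its inputs are hypothetical simplifications, not Volcano's implementation.

```go
package main

import "fmt"

// gangSchedulable reports whether a job with the given per-pod CPU request
// can have at least minAvailable pods placed on the free capacity of the
// cluster. If not, gang scheduling starts none of them, so a half-placed
// job never holds resources while waiting for the rest (the deadlock in Case 3).
func gangSchedulable(podCPU float64, minAvailable int, freeCPUPerNode []float64) bool {
	placed := 0
	free := append([]float64(nil), freeCPUPerNode...) // copy so we can decrement
	for placed < minAvailable {
		fitted := false
		for i := range free {
			if free[i] >= podCPU {
				free[i] -= podCPU
				placed++
				fitted = true
				break
			}
		}
		if !fitted {
			return false // cannot place the whole gang, so place nothing
		}
	}
	return true
}

func main() {
	// 2 ps + 4 workers, 1 CPU each, on a small cluster.
	fmt.Println(gangSchedulable(1.0, 6, []float64{4, 4}))    // true: whole gang fits
	fmt.Println(gangSchedulable(1.0, 6, []float64{2, 2, 1})) // false: only 5 pods fit
}
```

With this all-or-nothing rule, five jobs cannot each grab part of the cluster and then block one another while waiting for the remaining Pods.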

IOAware

The sum of the execution times of 3 jobs; each job has 2 ps + 4 workers

The execution time of the default scheduler fluctuates greatly

The increase in execution time depends on the proportion of data in the job.

Reduce Pod Affinity/Anti-Affinity and improve the overall performance of the scheduler

Big data scenario

Spark-sql-perf (TPC-DS, master)

104 queries concurrently

(8 CPUs, 64 GB, 1600 SSD) * 4 nodes

Kubernetes 1.13

Driver: 1 CPU, 4 GB; Executor: (1 CPU, 4 GB) * 5

Without a fixed driver node, up to 26 queries can run at the same time

Because Volcano provides job-level resource reservation, overall performance is improved by ~30%.

HPC scenario: MPI on Volcano

Planned GPU sharing feature

1) Computational power optimization:

GPU hardware acceleration, TensorCore

GPU sharing

Ascend adaptation

2) Scheduling algorithm optimization:

Job/Task model, providing unified batch scheduling for AI-class jobs

Multi-task queuing to support multi-tenant / multi-department cluster sharing

Optimized affinity scheduling, gang scheduling, etc., across the multiple tasks within a single Job

Mainstream distributed training models such as PS-Worker and Ring AllReduce

3) Process optimization:

Container image

CICD process

Log monitoring

Volcano can also support scheduling for larger clusters: we now run clusters of 10,000 nodes and a million containers, with scheduling performance of up to 2,000 Pods per second.

1) Orchestration:

etcd split by database and table, e.g. Events moved into a dedicated etcd instance, wal/snapshot placed on dedicated disks

Controller-manager runs active-active, with work spread across instances via consistent hashing (see the sketch after this list)

Elastic scaling of kube-apiserver based on workload
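To illustrate the consistent-hashing idea mentioned in the list above, the sketch below spreads resource keys across controller-manager instances on a simple hash ring; the ring here is generic and hypothetical (no virtual nodes), not the actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring: each controller instance owns the
// arc of the ring up to its point, so adding or removing one instance only
// moves the keys on that arc instead of reshuffling everything.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(instances []string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, in := range instances {
		p := hashOf(in)
		r.points = append(r.points, p)
		r.owner[p] = in
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ownerOf returns the controller instance responsible for a resource key.
func (r *ring) ownerOf(key string) string {
	h := hashOf(key)
	for _, p := range r.points {
		if h <= p {
			return r.owner[p]
		}
	}
	return r.owner[r.points[0]] // wrap around
}

func main() {
	r := newRing([]string{"controller-0", "controller-1", "controller-2"})
	for _, job := range []string{"jobs/train-a", "jobs/train-b", "jobs/etl-1"} {
		fmt.Println(job, "->", r.ownerOf(job))
	}
}
```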

2) Scheduling:

Improve single-scheduler throughput through EquivalenceCache, algorithm pruning, and other techniques

Run multiple active schedulers over a shared resource view to improve the scheduling rate

3) Network:

Increase single-node container density and per-cluster ENI capacity through trunk ports

Pre-allocate network interfaces through a warm pool to speed up interface provisioning

Support large-scale, rapidly changing cloud-native application networking based on eBPF/XDP, e.g. Service and network policy

4) Engine:

Containerd concurrent startup optimization

Support shimv2 to increase the density of single-node containers

Image download acceleration, lazy loading

Cromwell Community Integration

Cromwell is workflow scheduling software in which different jobs can be defined. It is widely used in the fields of gene sequencing and genomic computing.

Native support for Volcano in the Cromwell community

Enterprise version has been launched on Huawei Cloud GCS

Support for job dependencies through Cromwell

Volcano provides job-oriented and data-dependent scheduling

Volcano CLI

KubeSim

Brief introduction:

A tool for cluster performance testing and scheduler simulation

Simulates a large-scale K8S cluster without being limited by real resources

Full K8S API calls, but Pods are not actually created

It has already supported simulation work for large-scale and scheduling-focused projects on the product side

Overall structure:

Worker cluster: hosts the kubemark virtual nodes (hollow Pods)

Master cluster: manages the kubemark virtual nodes (hollow nodes)

Hollow pod = hollow kubelet + hollow proxy

That is all we have shared here about the design and principles of the Volcano architecture. I hope the above content is of some help to you and lets you learn something more. If you found the article useful, please share it so that more people can see it.
