This article introduces the design and principles of the Volcano architecture. The content is quite detailed; interested readers can refer to it, and we hope you find it helpful.
The background of Volcano
The figure above shows an analysis we have done, divided into three layers: at the bottom is the resource management layer; in the middle are the domain frameworks, including AI systems, HPC, Batch, workflow management, as well as microservices and traffic governance; above that sit the industry applications.
As industry applications become more complex, they require more and more supporting systems. For example, more than 10 years ago, a solution for the financial industry had a very simple architecture: a database plus some ERP middleware could handle most banking business.
Today, a bank collects a large amount of data every day; it needs Spark for data analysis, and even data lake products to build a data warehouse on which to run analytics and generate reports. At the same time, it also uses AI systems to simplify business processes, and so on.
As a result, industry applications today are more complex than they were 10 years ago, and a single application may rely on one or more of the domain frameworks above. For the industry applications, the demand is to integrate multiple domain frameworks; for the domain frameworks, the demand is that the resource management layer below provide unified resource management.
Kubernetes increasingly takes on this role of unified resource manager. It can serve not only HPC domain frameworks but also act as the resource manager for the big data field. Volcano is a batch processing system built on Kubernetes; the goal is for HPC at the top, big data applications in the middle, and AI at the bottom layer to all run more efficiently on a unified Kubernetes.
What kind of problems does Volcano solve?
Challenge 1: scheduling policies for high-performance workloads
E.g. Fair-share, gang-scheduling
Challenge 2: lifecycle management for multiple types of jobs
E.g. Multiple pod templates, error handling
Challenge 3: support for multiple kinds of heterogeneous hardware
E.g. GPU, FPGA
Challenge 4: performance optimization for high-performance workloads
E.g. Scalability, throughput, network, runtime
Challenge 5: resource management and time-sharing
E.g. Queue, Reclaim
Volcano architecture
In the figure, the blue parts are K8s's own components, and the green parts are the new components added by Volcano.
Job submission process:
1. kubectl creates a Job (a Volcano CRD) object in kube-apiserver, after the request passes Admission.
2. The JobController creates the corresponding Pods according to the Job's configuration, e.g. its replicas.
3. After the Pods and the PodGroup are created, vc-scheduler fetches the Pod/PodGroup and node information from kube-apiserver.
4. With this information, vc-scheduler selects an appropriate node for each Pod according to its configured scheduling policies.
5. After a node is assigned to a Pod, the kubelet on that node gets the Pod's configuration from kube-apiserver and starts the corresponding containers.
A few points worth emphasizing:
Scheduling policies in vc-scheduler are all implemented as plug-ins, e.g. DRF, Priority, Gang
vc-controllers contains the QueueController, JobController, PodGroupController, and gc-controller
vc-scheduler can schedule not only batch computing jobs but also microservice jobs, and it can coexist with kube-scheduler through the multi-scheduler mechanism
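To make the submission flow concrete, here is a minimal sketch of creating such a Job from Go. It assumes the upstream volcano.sh/apis module and its generated clientset; the package paths, field names, and the kubeconfig path are assumptions based on the upstream project and may differ across versions, so treat this as a sketch rather than an authoritative example.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	batchv1alpha1 "volcano.sh/apis/pkg/apis/batch/v1alpha1"
	vcclient "volcano.sh/apis/pkg/client/clientset/versioned"
)

// podTemplate returns a one-container Pod template for a task.
func podTemplate(image string) corev1.PodTemplateSpec {
	return corev1.PodTemplateSpec{
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers:    []corev1.Container{{Name: "main", Image: image}},
		},
	}
}

func main() {
	// Build a client from a kubeconfig (assumed path).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	vc, err := vcclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A Job with two task templates (2 ps + 4 workers), matching the
	// distributed-training cases discussed later. minAvailable=6 asks
	// the gang plug-in to schedule all six Pods or none of them.
	job := &batchv1alpha1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "tf-job", Namespace: "default"},
		Spec: batchv1alpha1.JobSpec{
			SchedulerName: "volcano",
			MinAvailable:  6,
			Tasks: []batchv1alpha1.TaskSpec{
				{Name: "ps", Replicas: 2, Template: podTemplate("ps-image")},
				{Name: "worker", Replicas: 4, Template: podTemplate("worker-image")},
			},
		},
	}
	if _, err := vc.BatchV1alpha1().Jobs("default").Create(
		context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```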
Component introduction: Controller
On the left is the Volcano Job Controller, which covers not only the lifecycle management of Volcano Jobs but job management in general. Volcano provides unified job management: as long as you use Volcano, you do not need to deploy a variety of operators; you can run your jobs directly.
On the right are the CRD job controllers, which are integrated through the PodGroup underneath.
Scheduler architecture
The Scheduler supports dynamic configuration and loading. On the left is the apiserver, holding Jobs, Pods, and PodGroups; the Scheduler itself is divided into three parts: the first layer is the Cache, the middle layer is the scheduling workflow, and on the right are the scheduling algorithms in plug-in form. The Cache stores the Pod and PodGroup information created in the apiserver and processes it into JobInfo objects. OpenSession in the middle layer takes the Pods and PodGroups from the Cache and, together with the algorithm plug-ins on the right, runs the scheduling work.
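The following is a minimal, self-contained sketch of this session-plus-plugins pattern; the type and function names here are illustrative, not Volcano's actual internals. Each scheduling cycle opens a session over a snapshot of the cache, lets registered plug-ins install their decision functions, and discards the session state when the cycle ends.

```go
package main

import "fmt"

// JobInfo is a simplified stand-in for the scheduler cache's view of a job.
type JobInfo struct {
	Name     string
	Priority int
}

// Session is scoped to one scheduling cycle; its state is discarded at close.
type Session struct {
	Jobs    []JobInfo
	jobLess func(a, b JobInfo) bool // installed by plug-ins
}

// Plugin hooks into a session; a real scheduler registers many such hooks.
type Plugin interface {
	OnSessionOpen(ssn *Session)
}

// priorityPlugin orders jobs by priority, mimicking a "priority" plug-in.
type priorityPlugin struct{}

func (priorityPlugin) OnSessionOpen(ssn *Session) {
	ssn.jobLess = func(a, b JobInfo) bool { return a.Priority > b.Priority }
}

// openSession snapshots the cache and lets every plug-in configure the session.
func openSession(cache []JobInfo, plugins []Plugin) *Session {
	ssn := &Session{Jobs: append([]JobInfo(nil), cache...)}
	for _, p := range plugins {
		p.OnSessionOpen(ssn)
	}
	return ssn
}

func main() {
	cache := []JobInfo{{"analytics", 1}, {"training", 9}}
	ssn := openSession(cache, []Plugin{priorityPlugin{}})

	// Pick the job the plug-ins rank first in this cycle.
	best := ssn.Jobs[0]
	for _, j := range ssn.Jobs[1:] {
		if ssn.jobLess(j, best) {
			best = j
		}
	}
	fmt.Println("schedule first:", best.Name)
}
```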
The transitions between states are driven by different operations, as shown in the following figure.
In addition, we have added a number of states to Pods and PodGroups. The blue parts in the figure are the states K8s comes with; the green parts are session-level states: a session is created for each scheduling cycle, and these states are valid only within that cycle; once the scheduling cycle ends, they are invalidated. The yellow states are kept in the Cache. The purpose of adding these states is to reduce interactions between the scheduler and the apiserver, thereby improving scheduling performance.
These Pod states give the scheduler more room for optimization. For example, when evicting Pods, it is cheaper to evict a Pod in the Binding or Bound state than one in the Running state (think: are there other states from which a Pod could be evicted?); and since the states are recorded inside the Volcano scheduler, communication with kube-apiserver is reduced. At present, however, the Volcano scheduler uses only part of this state machine; for example, preemption/reclaim currently evicts only Pods in the Running state. This is mainly because complete state synchronization is hard to achieve in a distributed system, and evicting Pods in the Binding and Bound states would introduce many state races.
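As a rough illustration of why scheduler-internal states save apiserver round-trips, here is a self-contained sketch (all names are illustrative, not Volcano's internals): the scheduler marks a pod Binding in its own cache immediately, and undoing that local state is far cheaper than evicting a Running container.

```go
package main

import "fmt"

// PodPhase extends the native K8s phases with scheduler-internal ones.
type PodPhase string

const (
	Pending PodPhase = "Pending" // native K8s state
	Binding PodPhase = "Binding" // session-level: bind request issued
	Bound   PodPhase = "Bound"   // cache-level: bind confirmed locally
	Running PodPhase = "Running" // native K8s state
)

// cache tracks scheduler-local pod phases without asking the apiserver.
type cache struct{ phases map[string]PodPhase }

// markBinding records the transition locally; the real bind call to the
// apiserver can proceed asynchronously.
func (c *cache) markBinding(pod string) { c.phases[pod] = Binding }

// evictionCost says how expensive it is to undo a pod in a given phase:
// rolling back a local Binding is much cheaper than killing a Running pod.
func evictionCost(p PodPhase) int {
	switch p {
	case Binding, Bound:
		return 1 // only scheduler-local state to roll back
	case Running:
		return 10 // must terminate a live workload
	default:
		return 0
	}
}

func main() {
	c := cache{phases: map[string]PodPhase{"worker-0": Pending}}
	c.markBinding("worker-0")
	fmt.Println("eviction cost:", evictionCost(c.phases["worker-0"]))
}
```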
What functional benefits does this bring?
Support mixed deployment of multiple types of jobs
Support multiple queues for multi-tenant resource sharing, resource planning, and time-shared reuse of resources
Support a variety of advanced scheduling policies to effectively improve the resource utilization of the whole cluster (see the DRF sketch below)
Support real-time resource monitoring as input for fine-grained resource scheduling, such as hot spots and network bandwidth; container engine and network performance optimization, e.g. Load-free
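One of the advanced policies mentioned above is DRF (Dominant Resource Fairness), which vc-scheduler ships as a plug-in. Below is a minimal, self-contained sketch of the core DRF idea (all names are illustrative): each tenant's share is the largest fraction it consumes of any single resource, and the tenant with the smallest dominant share is served next.

```go
package main

import "fmt"

// resources is a resource vector: CPU cores and GiB of memory.
type resources struct{ cpu, mem float64 }

var capacity = resources{cpu: 64, mem: 256}

// dominantShare is the largest fraction of any single cluster resource
// that a tenant currently consumes; DRF equalizes this across tenants.
func dominantShare(used resources) float64 {
	cpuShare := used.cpu / capacity.cpu
	memShare := used.mem / capacity.mem
	if cpuShare > memShare {
		return cpuShare
	}
	return memShare
}

// nextTenant picks the tenant with the smallest dominant share.
func nextTenant(usage map[string]resources) string {
	best, bestShare := "", 2.0
	for tenant, used := range usage {
		if s := dominantShare(used); s < bestShare {
			best, bestShare = tenant, s
		}
	}
	return best
}

func main() {
	usage := map[string]resources{
		"team-a": {cpu: 32, mem: 32}, // dominant share 0.5 (CPU-heavy)
		"team-b": {cpu: 8, mem: 96},  // dominant share 0.375 (memory-heavy)
	}
	fmt.Println("serve next:", nextTenant(usage)) // team-b
}
```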
Distributed training scenario: Gang-scheduler
Case 1: 1 job with 2 ps + 4 workers
Case 2: 2 jobs, each with 2 ps + 4 workers
Case 3: 5 jobs, each with 2 ps + 4 workers
Comparing Volcano against kubeflow + kube-scheduler: in Case 1, with sufficient resources, the two behave the same; in Case 2, where two jobs run at the same time without sufficient resources, one of the jobs will sit busy-waiting if there is no gang-scheduling; in Case 3, when the number of jobs increases to 5, deadlock becomes highly likely, and generally only 2 of the jobs complete.
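The deadlock in Case 3 comes from partial allocation: each job grabs a few pods' worth of resources, but no job ever gets all six of its members. A gang scheduler avoids this by admitting a job only when the entire gang fits, as in this self-contained sketch (names and numbers are illustrative):

```go
package main

import "fmt"

// job needs `members` pods of `podCPU` cores each; gang scheduling admits
// it only when the whole gang fits, avoiding partial allocations.
type job struct {
	name    string
	members int
	podCPU  int
}

// tryAdmit reserves capacity for the entire gang or for none of it.
func tryAdmit(freeCPU int, j job) (int, bool) {
	need := j.members * j.podCPU
	if need > freeCPU {
		return freeCPU, false // admit nothing: no stranded partial pods
	}
	return freeCPU - need, true
}

func main() {
	freeCPU := 16
	// Five identical jobs of 2 ps + 4 workers (6 pods * 2 cores = 12).
	for i := 1; i <= 5; i++ {
		j := job{name: fmt.Sprintf("job-%d", i), members: 6, podCPU: 2}
		var ok bool
		freeCPU, ok = tryAdmit(freeCPU, j)
		fmt.Printf("%s admitted=%v freeCPU=%d\n", j.name, ok, freeCPU)
	}
	// Only job-1 is admitted (needs 12 of 16); the others wait as whole
	// gangs instead of each grabbing a few pods and deadlocking the cluster.
}
```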
IOAware
Total execution time of 3 jobs, each with 2 ps + 4 workers
The execution time under the default scheduler fluctuates greatly
The increase in execution time depends on the proportion of data in the job
Reducing Pod affinity/anti-affinity improves the overall performance of the scheduler
Big data scenario
spark-sql-perf (TPC-DS, master)
104 queries concurrently
(8 CPU, 64 G, 1600 SSD) * 4 nodes
Kubernetes 1.13
Driver: 1 CPU, 4 G; Executor: (1 CPU, 4 G) * 5
Without a fixed driver node, up to 26 queries can run at the same time
Because Volcano provides job-level resource reservation, overall performance improves by ~30%
HPC scenario: MPI on Volcano
GPU sharing feature planning
1) Computational power optimization:
GPU hardware acceleration, TensorCore
GPU sharing
Ascend (昇腾) adaptation
2) Scheduling algorithm optimization:
Job/Task model, providing unified batch scheduling for AI-class jobs
Multiple task queues to support multi-tenant / multi-department cluster sharing
Affinity-optimal scheduling, Gang Scheduling, etc. among the multiple tasks within a single Job
Mainstream distributed training models such as PS-Worker and Ring AllReduce
3) Process optimization:
Container image
CICD process
Log monitoring
Volcano supports scheduling for larger clusters: we now run 10,000 nodes and 1 million containers, with scheduling performance up to 2,000 Pods per second.
1) Orchestration:
etcd sharding by database and table, e.g. Events placed in a separate store; wal/snapshot mounted on dedicated disks
Active-active controller-manager, spreading the work through consistent hashing
Elastic scaling of kube-apiserver based on workload
2) Scheduling:
Improve single-scheduler throughput via EquivalenceCache, algorithm pruning, and other techniques (see the sketch below)
Active-active schedulers sharing a resource view to improve the scheduling rate
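As a rough illustration of the EquivalenceCache idea (the names below are illustrative, not Volcano's internals): pods created from the same template fall into one equivalence class, so a node-fit predicate evaluated for one pod can be reused for its siblings instead of being recomputed.

```go
package main

import "fmt"

// fitKey identifies an (equivalence class, node) pair; pods created from
// the same template share a class and therefore share predicate results.
type fitKey struct{ class, node string }

// equivCache memoizes expensive node-fit predicate results.
type equivCache struct {
	results map[fitKey]bool
	misses  int
}

// fits returns a cached answer when available, otherwise evaluates the
// predicate once and stores the result for every sibling pod.
func (c *equivCache) fits(class, node string, predicate func() bool) bool {
	key := fitKey{class, node}
	if v, ok := c.results[key]; ok {
		return v
	}
	c.misses++
	v := predicate()
	c.results[key] = v
	return v
}

func main() {
	c := &equivCache{results: map[fitKey]bool{}}
	expensive := func() bool { return true } // stand-in for real predicates

	// 4 worker pods from one template against one node: 1 evaluation, 3 hits.
	for i := 0; i < 4; i++ {
		c.fits("worker-template", "node-1", expensive)
	}
	fmt.Println("predicate evaluations:", c.misses) // 1
}
```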
3) Network:
Increase single-node container density and single-cluster ENI capacity through trunkport
Pre-allocate network interfaces through a Warm Pool to speed up interface provisioning
Support large-scale, rapidly changing cloud-native application networks based on eBPF/XDP, e.g. Service, network policy
4) Engine:
Containerd concurrent startup optimization
Support shimv2 to increase single-node container density
Image download acceleration via lazy loading
Cromwell Community Integration
Cromwell is workflow scheduling software with which different jobs can be defined; it is widely used in gene sequencing and genomics computing.
Native support for Volcano in the Cromwell community
The enterprise version has been launched on Huawei Cloud GCS
Job dependencies are supported through Cromwell
Volcano provides job-oriented and data-dependent scheduling
Volcano CLI
KubeSim
Brief introduction:
A tool for cluster performance testing and scheduling experiments
Simulates large-scale K8s clusters without resource restrictions
Complete K8s API behavior, but Pods are not actually created
It has already supported large-scale and scheduling simulation work on the product side
Overall structure:
Worker cluster: hosts the kubemark virtual nodes (hollow pods)
Master cluster: manages the kubemark virtual nodes (hollow nodes)
Hollow pod = hollow kubelet + hollow proxy
That is all that is shared here on the design and principles of the Volcano architecture; we hope the above content is helpful and that you can learn more from it. If you think the article is good, you can share it for more people to see.