Today I will talk to you about how to use Elastic Training Operator. Many people may not know much about it, so to help you understand, I have summarized the following content; I hope you can get something out of this article.
Background
Thanks to the natural advantages of cloud computing in resource cost and elastic scaling, more and more customers are willing to build their AI systems on the cloud. Cloud-native technologies such as containers and Kubernetes have become the shortest path to unlocking cloud value, and building AI platforms on Kubernetes in the cloud has become a trend.
When faced with complex models or large amounts of data, the computing power of a single machine often cannot meet the requirements. By using a distributed training framework such as Alibaba's AIACC or the community's Horovod, only a few lines of code need to be changed to turn a single-node training job into a distributed one. On Kubernetes, the common choices are the Kubeflow community's tf-operator, which supports the TensorFlow PS mode, and mpi-operator, which supports Horovod's MPI allreduce mode.
Current situation
Kubernetes and cloud computing provide agility and scalability. With components such as cluster-autoscaler we can set elastic policies for training tasks, and take advantage of Kubernetes' on-demand elasticity to reduce GPU idle time.
However, this scaling model still falls short for offline training tasks:
Fault tolerance is not supported. When some workers fail because of device problems, the whole task has to be stopped and restarted from scratch.
Training tasks generally run for a long time and occupy a large amount of compute, and the task itself lacks elasticity. When resources are tight, they cannot be released on demand for other workloads unless the task is terminated.
Training tasks run for a long time and do not support dynamic worker configuration, so preemptible (spot) instances cannot be used safely to maximize cost effectiveness on the cloud.
What is needed is elastic training: allowing a training task to dynamically scale its workers out or in while it is running, without ever interrupting the training.
If you are interested in the implementation principles of elastic training, you can refer to the Elastic Horovod design document; they will not be covered in detail in this article.
In mpi-operator, the workers participating in training are designed and maintained as static resources. Supporting an elastic training mode not only makes tasks more flexible, it also brings challenges to operations, for example:
The horovodrun launcher provided by Horovod must be used as the entry point. In Horovod, the launcher logs into the workers over SSH, so an SSH tunnel must be opened between the launcher and the workers.
The Elastic Driver module responsible for elasticity obtains the latest worker topology by calling a user-specified discover_host script, and starts or stops worker instances accordingly. Whenever the workers change, the first step is to update the return value of the discover_host script (a sketch of the expected output is shown after this list).
In scenarios such as preemption or spot pricing, it is sometimes necessary to scale in specific workers, and the native Kubernetes workload primitives (Deployment, StatefulSet) cannot express removing designated instances.
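To make the discover_host contract concrete, the following is a minimal sketch of such a script, assuming the Elastic Horovod discovery format of one "hostname:slots" line per available worker. The hostnames are placeholders for illustration; in et-operator the real script is generated by the controller and rewritten whenever the topology changes.

#!/bin/sh
# Minimal host-discovery sketch: Elastic Horovod re-runs this script and
# expects one "<hostname>:<slots>" line per worker that is currently available.
# The hostnames below are placeholders for illustration only.
echo "elastic-training-worker-0:1"
echo "elastic-training-worker-1:1"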
Solution
To solve the above problems, we designed and developed et-operator. It provides a TrainingJob CRD to describe the training task, plus ScaleOut and ScaleIn CRDs to describe scale-out and scale-in operations. Combining them makes our training tasks more elastic. We have open-sourced the project and welcome feature requests, discussion, and feedback.
Open source solution address: https://github.com/AliyunContainerService/et-operator
Design
TrainingJob Controller has the following main functions:
Maintain the creation / deletion lifecycle of a TrainingJob and manage its sub-resources.
Execute scaling operations.
Provide fault tolerance: when a worker is evicted, create a new worker and add it back to the training.
1. Resource creation
The order in which the sub-resources of a TrainingJob are created is as follows:
Create the workers, including a Service and Pod for each, and mount the Secret containing the SSH public key.
Create the ConfigMap, which contains the discover_host script and the hostfile.
Create the launcher and mount the ConfigMap. Because the hostfile changes later with the worker topology, it is copied from the ConfigMap into a separate directory by an init container. (A sketch for inspecting these sub-resources follows this list.)
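After a TrainingJob is created, you can sanity-check that these sub-resources exist with plain kubectl queries. This is only a usage sketch: the exact names of the generated Services, ConfigMap, and Secret are chosen by et-operator, so it simply filters by the job name used later in this article.

kubectl get pods      | grep elastic-training   # launcher and worker pods
kubectl get svc       | grep elastic-training   # per-worker services
kubectl get configmap | grep elastic-training   # hostfile / discover_host ConfigMap
kubectl get secret    | grep elastic-training   # SSH key secret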
TrainingJob related resources:
The configuration of the TrainingJob CR is divided into Launcher and Worker. In the Launcher, specify the image and the startup command of the task. By default, et-operator generates a hostfile and a discover_host script according to the worker allocation; the discover_host script is mounted into the Launcher at /etc/edl/discover_hosts.sh and is passed to horovodrun in the entry script through the --host-discovery-script parameter. In the Worker settings, specify the worker image and GPU usage; the allowed range for the number of worker replicas can be set through maxReplicas / minReplicas.
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  cleanPodPolicy: Running
  etReplicaSpecs:
    launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - sh
            - -c
            - horovodrun -np 2 --min-np 1 --max-np 9 --host-discovery-script /etc/edl/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py
            image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
    worker:
      maxReplicas: 9
      minReplicas: 1
      replicas: 2
      template:
        spec:
          containers:
          - image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
status:
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
  - elastic-training-worker-2
  - elastic-training-worker-3
  phase: Succeeded
  replicaStatuses:
    Launcher:
      active: 1
      succeeded: 1
    Worker:
      active: 4
2. Worker scale-out / scale-in
In addition to TrainingJob, et-operator supports ScaleOut and ScaleIn CRDs, which are used to issue scale-out and scale-in operations against training tasks.
When a ScaleOut CR is created, the ScaleOutController triggers a Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, find the TrainingJob targeted by the Scaler and set it in the CR's OwnerReferences.
Take an example of a ScaleOut operation:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  creationTimestamp: "2020-11-04T13:54:26Z"
  name: scaleout-ptfnk
  namespace: default
  ownerReferences:
  - apiVersion: kai.alibabacloud.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TrainingJob
    name: elastic-training          # points to the TrainingJob to be scaled out
    uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
spec:
  selector:
    name: elastic-training
  toAdd:
    count: 2
When the TrainingJobController detects an update to a ScaleOut CR belonging to one of its TrainingJobs, it triggers the Reconcile of that TrainingJob, traverses and filters the ScaleIn and ScaleOut CRs whose OwnerReference points to the TrainingJob, and decides which scaling operation to execute based on their creation time and status.
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  # ... Launcher and Worker spec
status:
  currentScaler: ScaleIn:default/scaleout-ptfnk
  phase: Scaling
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
ScaleOut task CR (diagram omitted)
ScaleIn task CR (diagram omitted)
Detailed working process (diagram omitted)
Running
1. Install ET-Operator
mkdir -p $(go env GOPATH)/src/github.com/aliyunContainerService
cd $(go env GOPATH)/src/github.com/aliyunContainerService
git clone https://github.com/aliyunContainerService/et-operator
cd et-operator
kubectl create -f deploy/all_in_one.yaml
Check that the CRDs are installed:
# kubectl get crd
NAME                                CREATED AT
scaleins.kai.alibabacloud.com       2020-11-11T11:16:13Z
scaleouts.kai.alibabacloud.com      2020-11-11T11:16:13Z
trainingjobs.kai.alibabacloud.com   2020-11-11T11:16:13Z
Check the running status of the controller, which is installed in the kube-ai namespace by default:
# kubectl -n kube-ai get po
NAME                                              READY   STATUS              RESTARTS   AGE
et-operator-controller-manager-7877968489-c5kv4   0/2     ContainerCreating   0          5s
2. Run TrainingJob
Run the pre-prepared example:
kubectl apply -f examples/training_job.yaml
Check the running status:
# kubectl get trainingjob
NAME               PHASE     AGE
elastic-training   Running   77s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          7s
elastic-training-worker-0   1/1     Running   0          10s
elastic-training-worker-1   1/1     Running   0          9s
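To confirm that training is actually progressing, you can also follow the launcher logs or describe the TrainingJob. This is just a usage sketch, reusing the pod and job names from the output above:

kubectl logs -f elastic-training-launcher        # horovodrun / training output
kubectl describe trainingjob elastic-training    # events and replica status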
3. Scale in training task workers
When scaling in, you can specify the workers to remove through the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR.
If the scale-in is configured through count, workers are removed from the highest index downward.
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    count: 1
If you want to scale down a specific Worker, you can configure podNames:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
    - elastic-training-worker-1
Run a scale-in example that removes one worker by count:
kubectl create -f examples/scale_in_count.yaml
Check the status of the scale-in operation and the training task:
# kubectl get scalein
NAME                   PHASE            AGE
scalein-sample-t8jxd   ScaleSucceeded   11s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          47s
elastic-training-worker-0   1/1     Running   0          50s
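To see that the scale-in has been propagated to Horovod, you can look at the host-discovery script mounted in the launcher. A sketch, assuming the /etc/edl/discover_hosts.sh path described earlier; its exact content is generated by et-operator:

kubectl exec elastic-training-launcher -- cat /etc/edl/discover_hosts.sh   # show the generated script
kubectl exec elastic-training-launcher -- sh /etc/edl/discover_hosts.sh    # print the current worker list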
4. Scale out the training task
In the ScaleOut CR, specify the number of workers to add through the spec.toAdd.count field:
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: elastic-training-scaleout-9dtmw
  namespace: default
spec:
  selector:
    name: elastic-training
  timeout: 300
  toAdd:
    count: 2
Run the example:
kubectl create -f examples/scale_out.yaml
Check the status of the scale-out operation and the training task:
# kubectl get scaleout
NAME                              PHASE            AGE
elastic-training-scaleout-9dtmw   ScaleSucceeded   30s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          2m5s
elastic-training-worker-0   1/1     Running   0          2m8s
elastic-training-worker-1   1/1     Running   0          40s
elastic-training-worker-2   1/1     Running   0          40s

Having read the above, do you have a better understanding of how to use Elastic Training Operator? If you want to learn more, please follow the industry information channel. Thank you for your support.