How to use Kubernetes and Helm for efficient hyperparameter tuning

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to use Kubernetes and Helm for efficient hyperparameter tuning. Many people run into the resource and time constraints described below in real projects, so the article walks through a practical example of how to deal with them.

The problems with a hyperparameter sweep

A hyperparameter sweep trains the same model once for every combination of hyperparameter values, so it needs either a lot of computing resources or a lot of time.

If the trainings for the different combinations run in parallel, they require a lot of computing resources.

If the trainings for all combinations run one after another on a fixed set of computing resources, it takes a long time to get through every combination.

So in practice, most people settle for a reasonably good combination found through a very limited number of manual adjustments to the hyperparameters.
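
To make the trade-off above concrete, here is a minimal Python sketch that counts the combinations in a small grid and compares the two extremes. The grid matches the values.yaml used later in this article; the 20 minutes per training run is a made-up placeholder, not a measurement.

```python
from itertools import product

# Grid taken from the demo's values.yaml.
learning_rates = [0.001, 0.01, 0.1]
hidden_layers = [5, 6, 7]

# ASSUMPTION: placeholder duration of a single training run, not a measured value.
MINUTES_PER_RUN = 20

# Every (learning rate, hidden layers) pair is one training run.
combos = list(product(learning_rates, hidden_layers))
n_runs = len(combos)

# Extreme 1: a single GPU, all runs back to back.
sequential_minutes = n_runs * MINUTES_PER_RUN

# Extreme 2: one GPU per run, everything at once.
gpus_for_full_parallelism = n_runs

print(f"{n_runs} combinations")
print(f"sequential on 1 GPU: {sequential_minutes} minutes")
print(f"fully parallel: {gpus_for_full_parallelism} GPUs for {MINUTES_PER_RUN} minutes")
```

The count is the product of the list lengths, so adding a third hyperparameter with four candidate values would turn 9 runs into 36.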

Kubernetes + Helm: a powerful combination

With Kubernetes and Helm, you can easily explore a very large hyperparameter space while maximizing cluster utilization to optimize costs.

Helm lets us package applications into charts and parameterize them easily. For a hyperparameter sweep, the chart's template can use the values configured in the Helm chart to generate one TFJob per hyperparameter combination and deploy it for training. The same chart can also deploy a single TensorBoard instance that monitors all of these TFJobs, so we can quickly compare the results of every combination. Training jobs for combinations that perform poorly can then be deleted early, which frees cluster computing resources and reduces cost.

Hyperparameter sweep demo with Kubernetes + Helm

Helm Chart

We will use the example in Azure/kubeflow-labs/hyperparam-sweep for this demo.

First, use the following Dockerfile to build the training image:

FROM tensorflow/tensorflow:1.7.0-gpu
COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN mkdir ./output
RUN mkdir ./logs
RUN mkdir ./checkpoints
RUN pip install -r requirements.txt
COPY ./* /app/
ENTRYPOINT ["python", "/app/main.py"]

The main.py training script is as follows:

import click
import tensorflow as tf
import numpy as np
from skimage.data import astronaut
from scipy.misc import imresize, imsave, imread

img = imread('./starry.jpg')
img = imresize(img, (100, 100))
save_dir = 'output'
epochs = 2000


def linear_layer(X, layer_size, layer_name):
    with tf.variable_scope(layer_name):
        W = tf.Variable(tf.random_uniform([X.get_shape().as_list()[1], layer_size], dtype=tf.float32), name='W')
        b = tf.Variable(tf.zeros([layer_size]), name='b')
        return tf.nn.relu(tf.matmul(X, W) + b)


@click.command()
@click.option("--learning-rate", default=0.01)
@click.option("--hidden-layers", default=7)
@click.option("--logdir")
def main(learning_rate, hidden_layers, logdir='./logs/1'):
    X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name='X')
    y = tf.placeholder(dtype=tf.float32, shape=(None, 3), name='y')
    current_input = X
    for layer_id in range(hidden_layers):
        h = linear_layer(current_input, 20, 'layer{}'.format(layer_id))
        current_input = h

    y_pred = linear_layer(current_input, 3, 'output')

    # loss will be distance between predicted and true RGB
    loss = tf.reduce_mean(tf.reduce_sum(tf.squared_difference(y, y_pred), 1))
    tf.summary.scalar('loss', loss)

    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    merged_summary_op = tf.summary.merge_all()

    res_img = tf.cast(tf.clip_by_value(tf.reshape(y_pred, (1,) + img.shape), 0, 255), tf.uint8)
    img_summary = tf.summary.image('out', res_img, max_outputs=1)

    xs, ys = get_data(img)

    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph)
        test_writer = tf.summary.FileWriter(logdir + '/test')
        batch_size = 50
        for i in range(epochs):
            # Get a random sampling of the dataset
            idxs = np.random.permutation(range(len(xs)))
            # The number of batches we have to iterate over
            n_batches = len(idxs) // batch_size
            # Now iterate over our stochastic minibatches:
            for batch_i in range(n_batches):
                batch_idxs = idxs[batch_i * batch_size: (batch_i + 1) * batch_size]
                sess.run([train_op, loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]})
                if batch_i % 100 == 0:
                    c, summary = sess.run([loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]})
                    train_writer.add_summary(summary, (i * n_batches * batch_size) + batch_i)
                    print("epoch {}, (L2) loss {}".format(i, c))

            if i % 10 == 0:
                img_summary_res = sess.run(img_summary, feed_dict={X: xs, y: ys})
                test_writer.add_summary(img_summary_res, i * n_batches * batch_size)


def get_data(img):
    xs = []
    ys = []
    for row_i in range(img.shape[0]):
        for col_i in range(img.shape[1]):
            xs.append([row_i, col_i])
            ys.append(img[row_i, col_i])

    xs = (xs - np.mean(xs)) / np.std(xs)
    return xs, np.array(ys)


if __name__ == "__main__":
    main()

When docker build creates the image, it packages the starry.jpg picture from the root directory so that main.py can read it.

main.py uses a model based on Andrej Karpathy's image painting demo. Its goal is to draw a new picture that is as close as possible to the original, Vincent van Gogh's Starry Night.
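
To illustrate what "as close as possible" means here, the following is a minimal pure-Python sketch of the training data and loss that main.py works with: each pixel coordinate is an input, its RGB color is the target, and the loss is the squared RGB distance averaged over pixels. The function l2_loss and the 2x2 image are assumptions made for this illustration; the real script reads starry.jpg and uses NumPy and TensorFlow.

```python
from statistics import mean, pstdev


def get_data(img):
    """Turn an image (rows x cols x RGB, nested lists) into (coordinate, color) pairs."""
    coords, colors = [], []
    for row_i, row in enumerate(img):
        for col_i, pixel in enumerate(row):
            coords.append([row_i, col_i])
            colors.append(list(pixel))
    # Normalize with one global mean and standard deviation over all
    # coordinate values, mirroring (xs - np.mean(xs)) / np.std(xs).
    flat = [v for pair in coords for v in pair]
    mu, sigma = mean(flat), pstdev(flat)
    xs = [[(r - mu) / sigma, (c - mu) / sigma] for r, c in coords]
    return xs, colors


def l2_loss(y_true, y_pred):
    """Mean over pixels of the summed squared difference of the RGB channels."""
    per_pixel = [
        sum((t - p) ** 2 for t, p in zip(true_px, pred_px))
        for true_px, pred_px in zip(y_true, y_pred)
    ]
    return sum(per_pixel) / len(per_pixel)


# ASSUMPTION: a tiny synthetic 2x2 image stands in for starry.jpg.
tiny_img = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

xs, ys = get_data(tiny_img)
all_black = [[0, 0, 0] for _ in ys]

print("normalized coordinates:", xs)
print("loss of a perfect prediction:", l2_loss(ys, ys))
print("loss of an all-black prediction:", l2_loss(ys, all_black))
```

A perfect reconstruction has a loss of zero, while a network that outputs only black pixels is penalized by the full brightness of the target image.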

The configuration in Helm chart values.yaml is as follows:

image: ritazh/tf-paint:gpu
useGPU: true
hyperParamValues:
  learningRate:
    - 0.001
    - 0.01
    - 0.1
  hiddenLayers:
    - 5
    - 6
    - 7

image: the Docker image used by the training jobs, that is, the image built earlier.

useGPU: Boolean. The default, true, means that GPUs are used for training. If it is set to false, build the image from the tensorflow/tensorflow:1.7.0 base image instead.

hyperParamValues: the hyperparameter values to sweep over. Here we configure only two hyperparameters, learningRate and hiddenLayers.

The Helm chart mainly contains the definition of the TFJob, the Deployment for TensorBoard, and the definition of its Service:

# First we copy the values of values.yaml into variables to make it easier to access them
{{- $lrlist := .Values.hyperParamValues.learningRate -}}
{{- $nblayerslist := .Values.hyperParamValues.hiddenLayers -}}
{{- $image := .Values.image -}}
{{- $useGPU := .Values.useGPU -}}
{{- $chartname := .Chart.Name -}}
{{- $chartversion := .Chart.Version -}}

# Then we loop over every value of $lrlist (learning rate) and $nblayerslist (hidden layer depth)
# This will create 1 TFJob for every pair of learning rate and hidden layer depth
{{- range $i, $lr := $lrlist }}
{{- range $j, $nblayers := $nblayerslist }}
apiVersion: kubeflow.org/v1alpha1
kind: TFJob # Each one of our trainings will be a separate TFJob
metadata:
  name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training
  labels:
    chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}"
spec:
  replicaSpecs:
    - template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: tensorflow
              image: {{ $image }}
              env:
              - name: LC_ALL
                value: C.UTF-8
              args:
                # Here we pass a unique learning rate and hidden layer count to each instance.
                # We also put the values between quotes to avoid potential formatting issues
                - --learning-rate
                - {{ $lr | quote }}
                - --hidden-layers
                - {{ $nblayers | quote }}
                - --logdir
                - /tmp/tensorflow/tf-paint-lr{{ $lr }}-d-{{ $nblayers }} # We save the summaries in a different directory
{{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU
              resources:
                limits:
                  nvidia.com/gpu: 1
{{ end }}
              volumeMounts:
              - mountPath: /tmp/tensorflow
                subPath: module8-tf-paint # As usual we want to save everything in a separate subdirectory
                name: azurefile
          volumes:
            - name: azurefile
              persistentVolumeClaim:
                claimName: azurefile
---
{{- end }}
{{- end }}
# We only want one TensorBoard instance running for all our jobs, and not 1 per job.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: tensorboard
  name: module8-tensorboard
spec:
  ports:
  - port: 80
    targetPort: 6006
  selector:
    app: tensorboard
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: tensorboard
  name: module8-tensorboard
spec:
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      volumes:
        - name: azurefile
          persistentVolumeClaim:
            claimName: azurefile
      containers:
      - name: tensorboard
        command:
          - /usr/local/bin/tensorboard
          - --logdir=/tmp/tensorflow
          - --host=0.0.0.0
        image: tensorflow/tensorflow
        ports:
        - containerPort: 6006
        volumeMounts:
        - mountPath: /tmp/tensorflow
          subPath: module8-tf-paint
          name: azurefile

With the hyperparameter configuration above, helm install generates 9 TFJobs, one for each combination of the 3 learningRate values and 3 hiddenLayers values we specified.
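
The expansion performed by the template's two nested range loops can be sketched in Python as follows. This is only an illustration of the logic, not how Helm renders templates: the function expand_jobs is a hypothetical helper, and Python's number formatting may differ from what Helm writes into the rendered manifest.

```python
# Same structure as the hyperParamValues block in values.yaml.
values = {
    "hyperParamValues": {
        "learningRate": [0.001, 0.01, 0.1],
        "hiddenLayers": [5, 6, 7],
    }
}


def expand_jobs(values):
    """Return one (job name, container args) pair per hyperparameter combination."""
    hp = values["hyperParamValues"]
    jobs = []
    # Outer loop over learning rates (index i), inner loop over layer counts (index j),
    # in the same order as the chart's range statements.
    for i, lr in enumerate(hp["learningRate"]):
        for j, layers in enumerate(hp["hiddenLayers"]):
            name = f"module8-tf-paint-{i}-{j}"
            args = [
                "--learning-rate", str(lr),
                "--hidden-layers", str(layers),
                # Each job writes its summaries to its own directory.
                "--logdir", f"/tmp/tensorflow/tf-paint-lr{lr}-d-{layers}",
            ]
            jobs.append((name, args))
    return jobs


jobs = expand_jobs(values)
for name, args in jobs:
    print(name, " ".join(args))
```

The two indices in each job name identify the combination: the first is the position in learningRate, the second the position in hiddenLayers.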

The main.py training script has three parameters:

Argument          Description                               Default value
--learning-rate   Learning rate value                       0.01
--hidden-layers   Number of hidden layers in our network    7
--logdir          Path to save TensorFlow's summaries       None

Helm Install

A single helm install command deploys the training jobs for all hyperparameter combinations. Here we use only single-node training, but distributed training is also possible.

$ helm install .
NAME: telling-buffalo
LAST DEPLOYED:
NAMESPACE: tfworkflow
STATUS: DEPLOYED

RESOURCES:
==> v1/Service
NAME                  TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
module8-tensorboard   LoadBalancer   10.0.142.217                 80:30896/TCP   1s

==> v1beta1/Deployment
NAME                  DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
module8-tensorboard   1         1         1            0           1s

==> v1alpha1/TFJob
NAME                   AGE
module8-tf-paint-0-0   1s
module8-tf-paint-1-0   1s
module8-tf-paint-1-1   1s
module8-tf-paint-2-1   1s
module8-tf-paint-2-2   1s
module8-tf-paint-0-1   1s
module8-tf-paint-0-2   1s
module8-tf-paint-1-2   1s
module8-tf-paint-2-0   0s

==> v1/Pod(related)
NAME                                   READY   STATUS              RESTARTS   AGE
module8-tensorboard-7ccb598cdd-6vg7h   0/1     ContainerCreating   0          1s

After deploying the chart, list the created Pods. You should see one Pod for each TFJob, as well as a single TensorBoard instance that monitors all of them:

$ kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
module8-tensorboard-7ccb598cdd-6vg7h       1/1     Running   0          16s
module8-tf-paint-0-0-master-juc5-0-hw5cm   0/1     Pending   0          4s
module8-tf-paint-0-1-master-pu49-0-jp06r   1/1     Running   0          14s
module8-tf-paint-0-2-master-awhs-0-gfra0   0/1     Pending   0          6s
module8-tf-paint-1-0-master-5tfm-0-dhhhv   1/1     Running   0          16s
module8-tf-paint-1-1-master-be91-0-zw4gk   1/1     Running   0          16s
module8-tf-paint-1-2-master-r2nd-0-zhws1   0/1     Pending   0          7s
module8-tf-paint-2-0-master-7w37-0-ff0w9   0/1     Pending   0          13s
module8-tf-paint-2-1-master-260j-0-l4o7r   0/1     Pending   0          10s
module8-tf-paint-2-2-master-jtjb-0-5l84q   0/1     Pending   0          9s

Note: some Pods are Pending because the cluster has only a limited number of GPUs available. If the cluster has three GPUs, at most three TFJobs (each TFJob requests one GPU) can train in parallel at any given time.
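
The effect of the GPU limit on start times can be sketched with a small simulation. This is a simplified first-come, first-served model, not the Kubernetes scheduler, which makes no such ordering guarantee; the function schedule is a hypothetical helper, and the 20 minutes per job is a placeholder rather than a measured duration.

```python
import heapq


def schedule(durations, n_gpus):
    """Assign each job, in order, to whichever GPU becomes free first.

    Returns the start time of every job and the time at which the last job ends.
    """
    free_at = [0] * n_gpus  # time at which each GPU becomes free
    heapq.heapify(free_at)
    start_times = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available GPU
        start_times.append(start)
        heapq.heappush(free_at, start + duration)
    makespan = max(free_at)
    return start_times, makespan


# ASSUMPTION: every one of the 9 jobs runs for 20 minutes (placeholder value).
durations = [20] * 9
starts, makespan = schedule(durations, n_gpus=3)

print("start times (minutes):", starts)
print("all jobs finished after", makespan, "minutes")
```

In this model only three jobs start immediately and the remaining six wait, which corresponds to the three Running and six Pending training Pods in the listing above.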

Identify the best hyperparameter combination as early as possible with TensorBoard

The TensorBoard Service is also created automatically when helm install runs, and you can use the Service's External-IP to connect to TensorBoard.

$ kubectl get service
NAME                  TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
module8-tensorboard   LoadBalancer   10.0.142.217                 80:30896/TCP   5m

Open the public IP address of TensorBoard in a browser to see the dashboard. (TensorBoard takes a while to display the images.)

Here we can see that the models for some hyperparameter combinations perform better than others. For example, every model with a learning rate of 0.1 produces a completely black image and performs very poorly. After a few minutes, we can see that the two best-performing hyperparameter combinations are:

hidden layers = 5, learning rate = 0.01

hidden layers = 7, learning rate = 0.001

At this point, we can immediately kill the training jobs of the other, poorly performing models and release valuable GPU resources.
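
One way to work out which jobs to remove is to map the two combinations to keep back to the job names generated by the chart. The sketch below only prints commands and does not run anything. It assumes that the TFJob custom resource is registered in the cluster so that kubectl delete tfjob resolves, and that kubectl already points at the namespace of the release; job_name is a hypothetical helper.

```python
# Value lists in the same order as in values.yaml; the list positions
# are the indices that appear in the TFJob names.
learning_rates = [0.001, 0.01, 0.1]
hidden_layers = [5, 6, 7]

# The two (learning rate, hidden layers) combinations reported above as performing best.
keep = {(0.01, 5), (0.001, 7)}


def job_name(i, j):
    """Name the chart gives to the job for learning rate i and layer count j."""
    return f"module8-tf-paint-{i}-{j}"


to_delete = [
    job_name(i, j)
    for i, lr in enumerate(learning_rates)
    for j, layers in enumerate(hidden_layers)
    if (lr, layers) not in keep
]

for name in to_delete:
    # ASSUMPTION: the TFJob CRD is installed, so "tfjob" is a valid resource type.
    print(f"kubectl delete tfjob {name}")
```

Printing the commands first makes it easy to review them before running any of them against the cluster.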
