How to use Kubeflow

2025-02-24 Update From: SLTechnology News & Howtos


Shulou(Shulou.com)06/01 Report--

This article explains how to use Kubeflow. The approach introduced here is simple, fast, and practical; let's walk through it.

Guidelines for using Kubeflow

This guide is translated from https://github.com/openthings/kubeflow/blob/master/user_guide.md.

Translation by openthings, 2018-05-23: https://my.oschina.net/u/2306127/blog/1808582

Kubeflow (https://github.com/kubeflow) is a machine learning workflow tool based on Kubernetes (https://kubernetes.io, container orchestration and management software) and TensorFlow (https://tensorflow.org, deep learning library). Ksonnet is used to manage the application packages.

This article briefly introduces the basic concepts and methods for deploying and operating Kubeflow. An understanding of Kubernetes, Tensorflow, and Ksonnet will be very helpful for following along; click the links below to view the relevant material.

Kubernetes

Tensorflow

Ksonnet

For a hands-on example of deploying Kubeflow and running a simple training task, take a look at this tutorial.

Environmental requirements

Kubernetes >= 1.8 (see here); installation notes: https://my.oschina.net/u/2306127/blog/1628082

Ksonnet version 0.9.2 or later. (See below for an explanation of why ksonnet is used.)

Deploy Kubeflow

We will use Ksonnet to deploy Kubeflow to the Kubernetes cluster; local clusters, GKE, and Azure are supported.

Initialize a directory containing the ksonnet application:

ks init my-kubeflow

Install the Kubeflow packages into the ksonnet application.

Some installation scripts are available at https://github.com/openthings/kubernetes-tools/tree/master/kubeflow

# For a list of releases see:
# https://github.com/kubeflow/kubeflow/releases
VERSION=v0.1.2
cd my-kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow/core@${VERSION}
ks pkg install kubeflow/tf-serving@${VERSION}
ks pkg install kubeflow/tf-job@${VERSION}

Create the Kubeflow core component. The core component includes:

JupyterHub

TensorFlow job controller

ks generate core kubeflow-core --name=kubeflow-core
# Enable collection of anonymous usage metrics
# Skip this step if you don't want to enable collection.
# Or set reportUsage to false (the default).
ks param set kubeflow-core reportUsage true
ks param set kubeflow-core usageId $(uuidgen)

Ksonnet allows the Kubeflow deployment to be parameterized and configured on demand. We define two environments: nocloud and cloud.

ks env add nocloud
ks env add cloud

The nocloud environment is for minikube and other standard k8s clusters; the cloud environment is for GKE and Azure.

If we use GKE, we configure the cloud environment's parameters to use GCP features:

ks param set kubeflow-core cloud gke --env=cloud

If the cluster was created on Azure with AKS/ACS:

ks param set kubeflow-core cloud aks --env=cloud

If it was created with acs-engine instead:

ks param set kubeflow-core cloud acsengine --env=cloud

Then we set ${KF_ENV} to cloud or nocloud to reflect the environment we are using in this tutorial.

KF_ENV=cloud # or nocloud

By default, Kubeflow does not persist anything you do in a Jupyter notebook.

If the container is destroyed or recreated, all contents, including notebooks and other files, will be deleted.

In order to persist these files, the user needs a default StorageClass defined for persistent volumes.

You can run the following command to check if there is a storage class.

kubectl get storageclass

Users with a default storage class defined can use the jupyterNotebookPVCMount parameter to create a volume that will be mounted into the notebook:

ks param set kubeflow-core jupyterNotebookPVCMount /home/jovyan/work

Here we mount the volume at /home/jovyan/work because notebooks always run as user jovyan.

The selected directory will be stored on the cluster's default storage (typically a persistent disk).

Create the deployment namespace and set it as part of the environment. You can set namespace to a name that better suits your own kubernetes cluster:

NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}
ks env set ${KF_ENV} --namespace ${NAMESPACE}

Then apply the components to our Kubernetes cluster.

ks apply ${KF_ENV} -c kubeflow-core

At any time, you can use ks show to inspect the Kubernetes object definitions of a specific ksonnet component:

ks show ${KF_ENV} -c kubeflow-core

Usage Reporting

When enabled, Kubeflow reports anonymous usage data using spartakus, a Kubernetes reporting tool. Spartakus does not report any personal information. See here for more details. Reporting is entirely voluntary and can be turned off as follows:

ks param set kubeflow-core reportUsage false
# Delete any existing deployments of spartakus
kubectl delete -n ${NAMESPACE} deploy spartakus-volunteer

To explicitly enable usage reporting, set reportUsage to true, as shown below:

ks param set kubeflow-core reportUsage true
# Delete any existing deployments of spartakus
kubectl delete -n ${NAMESPACE} deploy spartakus-volunteer

Reported data is a significant contribution to Kubeflow, so please consider turning it on. The data allows us to improve the project and helps companies working on Kubeflow assess their ongoing investment.

You can improve data quality by giving each Kubeflow deployment a separate ID.

ks param set kubeflow-core usageId $(uuidgen)

Open Jupyter Notebook

The kubeflow-core component deploys JupyterHub and a corresponding load-balancer service. To check their status, use the following kubectl command:

kubectl get svc -n ${NAMESPACE}
NAME       TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)   AGE
...
tf-hub-0   ClusterIP  None           <none>       8000/TCP  1m
tf-hub-lb  ClusterIP  10.11.245.94   <none>       80/TCP    1m
...

By default, we use ClusterIP to access the JupyterHub UI. This can be changed to:

NodePort (for non-cloud), by running:

ks param set kubeflow-core jupyterHubServiceType NodePort
ks apply ${KF_ENV}

LoadBalancer (for cloud), by running:

ks param set kubeflow-core jupyterHubServiceType LoadBalancer
ks apply ${KF_ENV}

However, this exposes the Jupyter notebook to the Internet (a potential security risk).

To connect locally to Jupyter Notebook, you can use:

PODNAME=`kubectl get pods --namespace=${NAMESPACE} --selector="app=tf-hub" --output=template --template="{{with index .items 0}}{{.metadata.name}}{{end}}"`
kubectl port-forward --namespace=${NAMESPACE} $PODNAME 8000

Then open http://127.0.0.1:8000 in your browser; if a proxy is configured, you may need to bypass it for that address.

You will see a prompt window.

Log in using any username/password.

Click the "Start My Server" button and a dialog box will open.

Select a CPU or GPU image; the pre-built Docker images are listed in the Image menu. You can also enter a Tensorflow image name directly to run.

Allocate memory, CPU, GPU, and other resources according to demand. (1 CPU and 2Gi of memory is a good starting point for the initial exercise.)

To assign GPUs, make sure your cluster has enough available GPUs; a GPU will be used exclusively by the container instance. If there are not enough resources, the instance will hang forever in Pending status.

Check whether there are enough nvidia GPUs available:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

If GPUs are available, you can schedule your server onto a GPU node by specifying the following JSON in the Extra Resource Limits section: {"nvidia.com/gpu": "1"}
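The custom-columns query above can also be done programmatically, e.g. by parsing `kubectl get nodes -o json`. A minimal sketch; the sample node objects below are illustrative, not real cluster output:

```python
import json

# Illustrative, trimmed sample of `kubectl get nodes -o json` output.
nodes_json = """
{
  "items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "2"}}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"allocatable": {"cpu": "4"}}}
  ]
}
"""

def allocatable_gpus(nodes):
    """Map node name -> allocatable nvidia.com/gpu count (0 if the key is absent)."""
    return {
        item["metadata"]["name"]:
            int(item["status"]["allocatable"].get("nvidia.com/gpu", 0))
        for item in nodes["items"]
    }

print(allocatable_gpus(json.loads(nodes_json)))  # {'gpu-node-1': 2, 'cpu-node-1': 0}
```

In a real cluster you would feed this function the output of `kubectl get nodes -o json` instead of the sample string.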

Click Spawn

The pod will be named jupyter-${USERNAME}, where ${USERNAME} is the name you use to log in.

For GKE users with IAP turned on, the pod name will be different. If you log in as USER@DOMAIN.EXT, the pod is named:

Jupyter-accounts-2egoogle-2ecom-3USER-40DOMAIN-2eEXT

The image is nearly 10 GB and may take a long time to download, depending on your network.

Check the status of pod by:

kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME}

When you are finished, the Jupyter Notebook initial interface will open.

The container image provided above can be used for training Tensorflow models with Jupyter. The image includes all the required plugins, including Tensorboard, for rich visualization and exploratory analysis of models.

To test the installation, we will run a basic hello-world application (from mnist_softmax.py):

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

Paste the above example into a new Python 3 Jupyter notebook and press shift+enter to execute it. You should get an accuracy of about 0.9014 on the test data.

It is important to note that on most cloud providers, the public IP address is exposed to the internet and is by default an endpoint with no security controls. For production deployments, use SSL and authentication; refer to the documentation.

Using TensorFlow Serving to provide model services

We treat each deployed model as a component in the APP.

Create a model component on cloud:

MODEL_COMPONENT=serveInception
MODEL_NAME=inception
MODEL_PATH=gs://kubeflow-models/inception
ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME}
ks param set ${MODEL_COMPONENT} modelPath ${MODEL_PATH}

(Or) create the model component on NFS; see components/k8s-model-server for details:

MODEL_COMPONENT=serveInceptionNFS
MODEL_NAME=inception-nfs
MODEL_PATH=/mnt/var/nfs/general/inception
MODEL_STORAGE_TYPE=nfs
NFS_PVC_NAME=nfs
ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME}
ks param set ${MODEL_COMPONENT} modelPath ${MODEL_PATH}
ks param set ${MODEL_COMPONENT} modelStorageType ${MODEL_STORAGE_TYPE}
ks param set ${MODEL_COMPONENT} nfsPVC ${NFS_PVC_NAME}

Deploy the model component. Ksonnet picks up the parameters already set in your environment (e.g. cloud, nocloud) and customizes the deployment accordingly:

ks apply ${KF_ENV} -c ${MODEL_COMPONENT}

As before, some pods and services have been created in your cluster. You can query kubernetes for the service endpoint:

kubectl get svc inception -n ${NAMESPACE}
NAME       TYPE          CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
...
inception  LoadBalancer  10.35.255.136   ww.xx.yy.zz   9000:30936/TCP   28m
...

The endpoint you can use with inception_client is then ww.xx.yy.zz:9000.

The model at gs://kubeflow-models/inception is publicly accessible. However, if your environment is not configured with google cloud credentials, TF Serving will not be able to read the model; see the issue for a sample error. To set up google cloud credentials, point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the credential file, or run gcloud auth login. See the doc for more detailed instructions.
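A sketch of the credential setup just described; the key-file path here is a placeholder, not a real location:

```shell
# Option 1: point client tools at a service-account key file (placeholder path).
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Option 2: authenticate interactively instead.
gcloud auth login
```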

Serving models through Seldon

Seldon-core provides deployment for any machine learning runtime packaged in a Docker container.

Install seldon package:

ks pkg install kubeflow/seldon

Create the core component:

ks generate seldon seldon

Seldon allows complex runtime graphs to be deployed for model inference. For an end-to-end integration example, see the kubeflow-seldon example; see the seldon-core documentation for more details.

Submit a TensorFlow training task

Note: before submitting a training task, you first need Kubeflow deployed to your cluster. When submitting training tasks, make sure the TFJob custom resource is available.

We treat each TensorFlow job as a component in APP.

A. Create a work task

Create a component for the training task:

JOB_NAME=myjob
ks generate tf-job ${JOB_NAME} --name=${JOB_NAME}

In order to configure this job, you need to set a series of parameters. To view the parameter list, run:

ks prototype describe tf-job

Parameters are set using ks param; to set the Docker image:

IMAGE=gcr.io/tf-on-k8s-dogfood/tf_sample:d4ef871-dirty-991dde4
ks param set ${JOB_NAME} image ${IMAGE}

You can edit the params.libsonnet file and set the parameters directly.

Warning: the current command line setting parameters do not work due to escaping sequence problems (see ksonnet/ksonnet/issues/235). Therefore, setting the parameters requires editing the params.libsonnet file directly.

B. Run the work task

ks apply ${KF_ENV} -c ${JOB_NAME}

C. Monitor the work task

To monitor the execution of tasks, see TfJob docs.

D. Delete the work task

ks delete ${KF_ENV} -c ${JOB_NAME}

Run an example: TfCnn

Kubeflow comes with a ksonnet prototype that is suitable for running TensorFlow CNN Benchmarks.

Create a component:

CNN_JOB_NAME=mycnnjob
ks generate tf-cnn ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}

Submit tasks:

ks apply ${KF_ENV} -c ${CNN_JOB_NAME}

Check the running status (note that tf-cnn jobs are also tfjobs; refer to the TfJob docs):

kubectl get -o yaml tfjobs ${CNN_JOB_NAME}

Delete Task:

ks delete ${KF_ENV} -c ${CNN_JOB_NAME}

The prototype provides a series of parameters controlling how the task runs (such as using GPUs, distributed execution, etc.). To view the parameters, run:

ks prototype describe tf-cnn

Submit a PyTorch training task

Note: before submitting tasks, you need a cluster with Kubeflow deployed (see "Deploy Kubeflow" above). Before submitting, make sure the PyTorchJob custom resource is available.

We think of each PyTorch task as a component in APP.

Create a component for the work task.

JOB_NAME=myjob
ks generate pytorch-job ${JOB_NAME} --name=${JOB_NAME}

To configure the work task, you need to set a series of parameters. To display the parameters, run:

ks prototype describe pytorch-job

Parameters are set using ks param; to set the Docker image:

IMAGE=
ks param set ${JOB_NAME} image ${IMAGE}

You can also edit the file params.libsonnet to set the parameters directly.

Warning: the current command line setting parameters do not work due to escaping sequence problems (see ksonnet/ksonnet/issues/235). Therefore, setting the parameters requires editing the params.libsonnet file directly.

Run work tasks:

ks apply ${KF_ENV} -c ${JOB_NAME}

Delete a work task:

ks delete ${KF_ENV} -c ${JOB_NAME}

Advanced customization

Data scientists often require a POSIX-compatible file system:

For example, most HDF5 libraries require POSIX; object stores like GCS or S3 do not work.

When a shared POSIX file system is mounted to a notebook environment, data scientists can work together on the same dataset.

Here we will show how to deploy Kubeflow to meet this requirement.

Set the disks parameter to a comma-separated list of the Google persistent disks you want to mount.

These disks must be on the same zone in your cluster.

These disks need to be created manually through gcloud or Cloud console.

These disks must not be referenced by any existing VM or POD.

Create a disk:

gcloud --project=${PROJECT} compute disks create --zone=${ZONE} ${PD_DISK1} --description="PD to back NFS storage on GKE." --size=1TB
gcloud --project=${PROJECT} compute disks create --zone=${ZONE} ${PD_DISK2} --description="PD to back NFS storage on GKE." --size=1TB

Configure the environment to use these disks:

ks param set --env=cloud kubeflow-core disks ${PD_DISK1},${PD_DISK2}

Deploy the environment.

ks apply cloud

Start Jupyter and you will see your NFS volumes mounted as /mnt/${DISK_NAME}. In a Jupyter cell, run:

!df

You will see the following output:

Filesystem                                                     1K-blocks    Used  Available Use% Mounted on
overlay                                                         98884832 8336440   90532008   9% /
tmpfs                                                           15444244       0   15444244   0% /dev
tmpfs                                                           15444244       0   15444244   0% /sys/fs/cgroup
10.11.254.34:/export/pvc-d414c86a-e0db-11e7-a056-42010af00205 1055841280   77824 1002059776   1% /mnt/jlewi-kubeflow-test1
10.11.242.82:/export/pvc-33f0a5b3-e0dc-11e7-a056-42010af00205 1055841280   77824 1002059776   1% /mnt/jlewi-kubeflow-test2
/dev/sda1                                                       98884832 8336440   90532008   9% /etc/hosts
shm                                                                65536       0      65536   0% /dev/shm
tmpfs                                                           15444244       0   15444244   0% /sys/firmware

Here jlewi-kubeflow-test1 and jlewi-kubeflow-test2 are the names of PDs.
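If you want to check the NFS mounts from code rather than eyeballing `!df`, a small sketch; the parsing heuristic (a `:` in the filesystem column) and the trimmed sample output are illustrative:

```python
def nfs_mounts(df_output):
    """Return {mount_point: filesystem} for host:/path (NFS-style) filesystems."""
    mounts = {}
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6 and ":" in fields[0]:    # e.g. 10.11.254.34:/export/...
            mounts[fields[5]] = fields[0]
    return mounts

# Illustrative df output, trimmed to the root filesystem plus two NFS rows.
sample = """Filesystem 1K-blocks Used Available Use% Mounted on
overlay 98884832 8336440 90532008 9% /
10.11.254.34:/export/pvc-d414c86a 1055841280 77824 1002059776 1% /mnt/jlewi-kubeflow-test1
10.11.242.82:/export/pvc-33f0a5b3 1055841280 77824 1002059776 1% /mnt/jlewi-kubeflow-test2"""

print(nfs_mounts(sample))
```

In a notebook you could capture the real output with `df_output = !df` (joining the lines) and pass it to the same function.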

Troubleshooting

Minikube

In Minikube, the Virtualbox/VMware drivers have a known problem with TensorFlow Serving; the KVM/KVM2 driver is recommended instead. The issue is tracked in kubernetes/minikube#2377.

We recommend increasing the total resources allocated to Minikube, as follows:

minikube start --cpus 4 --memory 8096 --disk-size=40g

Minikube allocates 2048Mb of RAM to the virtual machine by default, which is not enough for JupyterHub.

The disk capacity must accommodate Kubeflow's Jupyter images, which, with their additional libraries, exceed 10G.

If you see a jupyter-xxxx pod stuck in Pending status, get its description information:

Warning  FailedScheduling  8s (x22 over 5m)  default-scheduler  0/1 nodes are available: 1 Insufficient memory.

Then try recreating the Minikube cluster (and re-applying Kubeflow with Ksonnet) with more resources.

RBAC clusters

If the cluster you are running has RBAC enabled (see RBAC enabled), you may encounter the following error when running Kubeflow:

ERROR Error updating roles kubeflow-test-infra.jupyter-role: roles.rbac.authorization.k8s.io "jupyter-role" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["*"], APIGroups:["*"], Verbs:["*"]}] user=&{your-user@acme.com [system:authenticated] map[]} ownerrules=[PolicyRule{Resources:["selfsubjectaccessreviews"], APIGroups:["authorization.k8s.io"], Verbs:["create"]} PolicyRule{NonResourceURLs:["/api" "/api/*" "/apis" "/apis/*" "/healthz" "/swagger-2.0.0.pb-v1" "/swagger.json" "/swaggerapi" "/swaggerapi/*" "/version"], Verbs:["get"]}] ruleResolutionErrors=[]

The error indicates insufficient permissions. In most cases, this is solved by creating an appropriate clusterrolebinding and then redeploying kubeflow:

kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=your-user@acme.com

Replace your-user@acme.com with the user name prompted in the error message.

If you use GKE, you can refer to GKE's RBAC docs to learn how to set up RBAC, which is achieved through IAM on GCP.

Problems spawning Jupyter pods

If you have trouble spawning jupyter notebooks, check whether the pod has been scheduled to run:

kubectl -n ${NAMESPACE} get pods

Find the name of the pod that launched jupyter.

If you use username/password auth, the Jupyter pod will be named:

jupyter-${USERNAME}

If you use IAP on GKE, the pod will be named:

Jupyter-accounts-2egoogle-2ecom-3USER-40DOMAIN-2eEXT

Here USER@DOMAIN.EXT is the Google account when using IAP.

Once you know the name of pod:

kubectl -n ${NAMESPACE} describe pods ${PODNAME}

The events section will show why the pod could not be scheduled.

A common reason for not being able to schedule pod is that there are not enough resources available on the cluster.

OpenShift

If you deploy Kubeflow in an OpenShift environment (a wrapper around Kubernetes), you need to adjust the security contexts for the ambassador and jupyter-hub deployments in order to run:

oc adm policy add-scc-to-user anyuid -z ambassador
oc adm policy add-scc-to-user anyuid -z jupyter-hub

Once the security policy is set, delete the failed pods and allow them to be recreated as part of the project deployment.

You also need to adjust the permissions of the tf-job-operator service account so that TFJobs can run:

oc adm policy add-role-to-user cluster-admin -z tf-job-operator

Docker for Mac

Docker for Mac Community Edition comes with Kubernetes support (1.9.2), which can be enabled from the edge channel. If you use this private Kubernetes environment on a Mac, you may hit the following problem when deploying Kubeflow:

ks apply default -c kubeflow-core
ERROR Attempting to deploy to environment 'default' at 'https://127.0.0.1:8443', but cannot locate a server at that address

This error occurs because the default cluster address set when Docker for Mac was installed is https://localhost:6443. One option is to directly edit the created environments/default/spec.json file and set the "server" variable to the correct location, then retry the deployment. A better way, however, is to create the Ksonnet app with the desired kube config:

kubectl config use-context docker-for-desktop
ks init my-kubeflow

403 API rate limit exceeded error

Because ksonnet uses Github to pull kubeflow, the anonymous API call limit is quickly exhausted unless the user specifies a Github API token. To solve this, create a Github API token (refer to the guide here) and assign it to the GITHUB_TOKEN environment variable.

export GITHUB_TOKEN=

ks apply produces error "Unknown variable: env"

Kubeflow requires ksonnet version 0.9.2 or higher (see here). If ks apply is run with an older version of ksonnet, you will get the error Unknown variable: env, as shown below:

ks apply ${KF_ENV} -c kubeflow-core
ERROR Error reading /Users/xxx/projects/devel/go/src/github.com/kubeflow/kubeflow/my-kubeflow/environments/nocloud/main.jsonnet: /Users/xxx/projects/devel/go/src/github.com/kubeflow/kubeflow/my-kubeflow/components/kubeflow-core.jsonnet:8:49-52 Unknown variable: env

  namespace: if params.namespace == "null" then env.namespace else params.namespace

Check the ksonnet version as follows:

ks version

If the ksonnet version is earlier than v0.9.2, upgrade and recreate the app according to user_guide.

Why does Kubeflow use Ksonnet?

Ksonnet is a command-line tool that makes it easier to manage complex deployments consisting of multiple components; it is designed to work alongside kubectl.

Ksonnet allows us to generate Kubernetes manifests from parameterized templates. This makes it easy to customize manifests for specific use cases. In the example above, we created manifests for TfServing with a customized URI for the model.

One of the reasons we like ksonnet is that it treats environments (such as dev, test, staging, prod) as a first-class concept. For each environment, we deploy the same components with only the minor customizations that environment needs. We think this maps nicely onto normal workflows: for example, you can run a task locally without GPUs to get the code working, then move it to a cloud environment with plenty of GPU capacity.
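As a rough analogy in plain Python (not actual ksonnet or jsonnet), generating a manifest from base parameters plus per-environment overrides looks like this; the parameter names and environments are made up for illustration:

```python
# Base parameters for a component, plus per-environment overrides, merged to
# produce a final manifest -- conceptually what ksonnet does with params + envs.
base_params = {"name": "kubeflow-core", "reportUsage": False, "namespace": "default"}
env_overrides = {
    "nocloud": {"namespace": "kubeflow"},
    "cloud": {"namespace": "kubeflow", "cloud": "gke", "reportUsage": True},
}

def render(env):
    """Merge base params with the overrides for `env` and emit a tiny manifest."""
    params = {**base_params, **env_overrides.get(env, {})}
    return {
        "kind": "Deployment",
        "metadata": {"name": params["name"], "namespace": params["namespace"]},
        "labels": {"env": env},
    }

print(render("cloud")["metadata"]["namespace"])  # kubeflow
```

The same component rendered for an unknown environment simply falls back to the base parameters, which mirrors how the same ksonnet component yields slightly different manifests per environment.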

At this point, you should have a deeper understanding of how to use Kubeflow; why not try it out in practice?
