How to understand Kubeflow

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article shows you how to understand Kubeflow. The content is concise and easy to follow, and I hope the detailed introduction gives you something useful to take away.

When it comes to machine learning, especially deep learning, you are probably familiar with tools such as TensorFlow, PyTorch, and Caffe. In practice, however, training the model (the main problem those tools solve) is only a small part of the whole machine learning life cycle.

How do you prepare the data? How do you deploy the model after training? How do you move to the cloud? How do you scale? That is where the challenges begin. As machine learning has become widely used, many tools have been created to address the model deployment problem. For example:

GraphPipe, from Oracle

MLflow, from Databricks

Kubeflow, from Google

Today let's take a look at Kubeflow, launched by Google. Kubeflow, as its name implies, is Kubernetes + TensorFlow: an open source platform developed by Google to support the deployment of its own TensorFlow. It also supports other machine learning engines, such as PyTorch and the Python-based scikit-learn. Compared with other products, because it is built on top of the powerful Kubernetes, Kubeflow's future and ecosystem look more promising.

Kubeflow's main purpose is the simple, large-scale deployment of machine learning models in production systems. With Kubernetes, it can provide:

Simple, repeatable, portable deployment

Use microservices to provide loosely coupled deployment and management

Scale up as needed
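The "scale up as needed" point, for example, comes almost for free from Kubernetes: a model-serving container can be declared as a Deployment and scaled by editing a single field. A minimal sketch, in which the names and image are illustrative rather than anything shipped by Kubeflow:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server          # hypothetical name for a serving workload
spec:
  replicas: 3                 # scale up or down by changing this field
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model-server
        image: tensorflow/serving:latest   # illustrative serving image
        ports:
        - containerPort: 8500              # default TF Serving gRPC port
```

The same declarative file deployed to any cluster gives the "simple, repeatable, portable deployment" promised above.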

Kubeflow is a K8s-based machine learning toolset that provides a series of scripts and configurations to manage K8s components. Kubeflow is built on a K8s microservice architecture, and its core components include:

JupyterHub: multi-tenant Notebook service

TensorFlow/PyTorch/MPI/MXNet/Chainer: the main machine learning engines

Seldon: deployment of machine learning models on K8s

Argo: workflow engine based on K8s

Ambassador: API gateway

Istio: microservice management and telemetry collection

Ksonnet: K8s deployment tool

Because it is based on K8s, it is very convenient to extend with other capabilities. Other extensions provided by Kubeflow include:

Pachyderm: container- and K8s-based data pipelines ("git for data")

Weaveworks Flux: git-based configuration management

......

We can see that Kubeflow builds its microservices from the existing K8s ecosystem, which fully reflects the high extensibility of the microservice approach.

Let's take a look at how Kubeflow integrates these components to provide machine learning model deployment capabilities.

JupyterHub

Jupyter Notebook is a popular development tool for data scientists. It provides excellent interactivity and real-time feedback. JupyterHub provides a multi-user environment for Jupyter Notebook and contains the following components:

Multi-user Hub

Configurable HTTP proxy

Multiple single-user Notebook servers

Run the following command to access JupyterHub through port-forward:

kubectl port-forward tf-hub-0 8000:8000 -n <namespace>

On the first visit, you can create a notebook instance. You can select different images for the instance (including images with GPU support) and configure its resource parameters.

The interface of the created JupyterLab (JupyterLab is the new generation of Jupyter Notebook) looks as follows:

But I'm still used to the traditional notebook interface. The advantage of Lab is that it can run a console, which is nice. (Lab also supports opening the traditional notebook interface.)

Kubeflow integrates TensorBoard into the notebook image, which makes it easy to visualize and debug TensorFlow programs.

In a JupyterLab console, type the following command to start TensorBoard:

tensorboard --logdir /tmp/logs
2018-09-15 20:30:21: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
W0915 20:30:21.204606 Reloader tf_logging.py:121] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0915 20:30:21.204929 Reloader tf_logging.py:121] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W0915 20:30:21.205569 Reloader tf_logging.py:121] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
TensorBoard 1.8.0 at http://jupyter-admin:6006 (Press CTRL+C to quit)

Port-forward is also required to access TensorBoard. In the command below, <user> is the user name that created the notebook (Kubeflow creates a Pod per instance), and TensorBoard's default port is 6006.

kubectl port-forward jupyter-<user> 6006:6006 -n <namespace>

TensorFlow training

To support distributed TensorFlow training in Kubernetes, Kubeflow developed a K8s CRD, TFJob (tf-operator).

As shown in the figure above, distributed TensorFlow supports zero or more of each of the following process roles:

Chief: responsible for coordinating training tasks.

Ps (parameter servers): provide distributed storage for the model parameters.

Worker: responsible for the actual training of the model. In some cases, worker 0 can take on the role of Chief.

Evaluator: responsible for performance evaluation during training.
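These roles map directly onto the TF_CONFIG environment variable that distributed TensorFlow reads, and which the TFJob operator generates for each replica. A minimal sketch of what such a configuration looks like; the pod addresses below are made up for illustration, not something the operator is guaranteed to produce verbatim:

```python
import json
import os

# A cluster spec listing one process per role; the addresses are
# hypothetical pod DNS names of the kind the operator would generate.
cluster = {
    "chief": ["mycnnjob-chief-0:2222"],
    "ps": ["mycnnjob-ps-0:2222"],
    "worker": ["mycnnjob-worker-0:2222"],
    "evaluator": ["mycnnjob-evaluator-0:2222"],
}

def tf_config_for(task_type, task_index):
    """Build the TF_CONFIG JSON for one replica of the job."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# Each pod gets its own TF_CONFIG; for example, worker 0 would see:
os.environ["TF_CONFIG"] = tf_config_for("worker", 0)
print(os.environ["TF_CONFIG"])
```

Each replica receives the same cluster spec but a different task entry, which is how a single training script can play any of the four roles.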

The following YAML configuration is the CNN Benchmarks example provided by Kubeflow.

---
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  labels:
    ksonnet.io/component: mycnnjob
  name: mycnnjob
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Ps:
      template:
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
      tfReplicaType: PS
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure

Running this example in Kubeflow creates a TFJob. You can use kubectl to manage and monitor the Job.

# Monitor current status
kubectl get -o yaml tfjobs mycnnjob -n <namespace>

# View events
kubectl describe tfjobs mycnnjob -n <namespace>

# View logs
kubectl logs mycnnjob-[ps|worker]-0 -n <namespace>

TensorFlow Serving

Serving means that once the model is trained, a stable interface is provided for users to call in order to apply the model.

Based on TensorFlow's Serving functionality, Kubeflow provides a Ksonnet module for the TensorFlow Model Server to provide model-serving capability.

Once the model is deployed, the model is accessed and used through the endpoint exposed by API Gateway.

http://<gateway-host>/seldon/<deployment-name>/api/v0.1/predictions
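A minimal sketch of calling such a prediction endpoint, assuming the Seldon v0.1 JSON protocol; the gateway host and deployment name below are hypothetical, and the actual payload shape depends on the deployed model:

```python
import json

# Hypothetical endpoint: gateway host and deployment name are placeholders.
endpoint = "http://<gateway-host>/seldon/mnist-model/api/v0.1/predictions"

# The Seldon v0.1 protocol wraps inputs in a "data" envelope;
# here, one flattened 28x28 image of zeros as a stand-in input.
payload = {"data": {"ndarray": [[0.0] * 784]}}
body = json.dumps(payload)

# A real call would be along the lines of:
#   response = requests.post(endpoint, json=payload).json()
print(len(payload["data"]["ndarray"][0]))  # 784 input features
```

The point is that the served model is just another HTTP microservice behind the API gateway, addressable like any other K8s service.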

Argo

Machine learning can also be abstracted into one or more workflows. Kubeflow integrates Argo as its workflow engine for machine learning.

You can access the Argo UI in Kubeflow through kubectl proxy: http://localhost:8001/api/v1/namespaces/kubeflow/services/argo-ui/proxy/workflows

At this stage there is no actual Argo workflow running the machine learning examples, but Kubeflow is using Argo to build its own CI/CD system.
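For illustration, a minimal Argo Workflow of the kind that could wrap a training step might look like the following; the image, step name, and command are made up, and this is not a workflow shipped by Kubeflow:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-    # Argo appends a random suffix per run
spec:
  entrypoint: train             # the template to run first
  templates:
  - name: train
    container:
      image: tensorflow/tensorflow:1.8.0   # illustrative training image
      command: [python, -c]
      args: ["print('training step')"]     # placeholder for a real script
```

Submitted with `argo submit`, each step runs as its own pod, which is what makes Argo a natural fit for chaining data preparation, training, and deployment stages.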

Pachyderm

Pachyderm is a containerized data lake that provides git-like data version management and data pipelines for building your data science projects.

Kubeflow combines Google's two powerful weapons, Kubernetes and TensorFlow, to provide a toolkit and deployment platform for data science. We can see that it has many advantages:

Cloud-optimized: based on K8s, essentially all functions can be easily extended in the cloud, such as multi-tenancy, dynamic scaling, and support for AWS/GCP.

Microservice architecture with strong extensibility: being container-based, it is very easy to add new components.

Excellent DevOps and CI/CD support: with Ksonnet/Argo, deployment, component management, and CI/CD become very easy.

Multi-engine support: besides the deep learning engines mentioned in this article, Kubeflow can easily add new engines, such as Caffe2, which is under development.

GPU support

At the same time, we can also see some problems with Kubeflow:

There are many components with little coordination between them; it feels more like a collection of tools than an integrated, smooth workflow that unifies all the steps.

The documentation needs improvement.

Of course, the current version of Kubeflow is only 0.2.5, and I believe Kubeflow will develop well in the future.
