How to build Machine Learning system on Kubernetes

2025-01-16 Update | From: SLTechnology News & Howtos


Shulou (Shulou.com) 05/31 Report

This article introduces how to build a machine learning system on Kubernetes with Kubeflow Pipelines. It walks through an actual deployment on Alibaba Cloud; the steps are simple, fast, and practical.

What is Kubeflow Pipelines?

The Kubeflow Pipelines platform includes:

A management console for running and tracking experiments

A workflow engine (Argo) that executes multi-step machine learning workflows

An SDK for defining custom workflows; currently only Python is supported.

The goals of Kubeflow Pipelines are:

End-to-end task orchestration: supports composing and organizing complex machine learning workflows, which can be triggered manually, on a schedule, by events, or even by changes in data.

Simple experiment management: helps data scientists try out a wide range of ideas and frameworks and manage their experiments, making the transition from experiment to production easy.

Easy reuse through componentization: quickly build end-to-end solutions by reusing pipelines and components instead of rebuilding from scratch each time.

Run Kubeflow Pipelines on Alibaba Cloud

Having seen what Kubeflow Pipelines can do, you may want to try it out. At present, however, there are two challenges to using Kubeflow Pipelines in China:

Pipelines must be deployed as part of Kubeflow; Kubeflow ships too many default components, and deploying Kubeflow through Ksonnet is complicated.

Pipelines itself is deeply coupled to Google's cloud platform and cannot run on other cloud platforms or bare-metal servers.

To make it easier for users in China to install Kubeflow Pipelines, the Alibaba Cloud Container Service team provides a Kustomize-based Kubeflow Pipelines deployment solution. Unlike ordinary stateless Kubeflow services, Kubeflow Pipelines depends on stateful services such as mysql and minio, which means you need to consider how to persist and back up their data. In this example, we use Alibaba Cloud SSD cloud disks as the data persistence solution, automatically creating an SSD cloud disk for mysql and for minio respectively. With this, you can deploy the latest version of Kubeflow Pipelines on Alibaba Cloud on its own, without the rest of Kubeflow.
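To illustrate the persistence side, the sketch below writes a hypothetical PersistentVolumeClaim of the kind such a deployment relies on for mysql, assuming ACK's dynamic SSD cloud-disk storage class name `alicloud-disk-ssd`; the PVC name and namespace are illustrative, not taken from the actual manifests:

```shell
# Sketch: a PVC requesting a dynamically provisioned SSD cloud disk on ACK.
# The storage class, PVC name, and namespace are assumptions for illustration.
cat > /tmp/mysql-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pv-claim
  namespace: kubeflow
spec:
  storageClassName: alicloud-disk-ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF

# In a real cluster you would apply it with:
#   kubectl apply -f /tmp/mysql-pvc.yaml
grep 'storageClassName' /tmp/mysql-pvc.yaml
```

Dynamic provisioning means the cloud disk is created and bound automatically when the claim is applied, which is exactly what the deployment below does for you.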

Prerequisites

You need to install kustomize

In Linux and macOS environments, you can execute:

opsys=linux  # or darwin, or windows
curl -s https://api.github.com/repos/kubernetes-sigs/kustomize/releases/latest |\
  grep browser_download |\
  grep $opsys |\
  cut -d '"' -f 4 |\
  xargs curl -O -L
mv kustomize_*_${opsys}_amd64 /usr/bin/kustomize
chmod u+x /usr/bin/kustomize

In a Windows environment, you can download kustomize_2.0.3_windows_amd64.exe.

To create a Kubernetes cluster in Ali Cloud CCS, please refer to the documentation.

Deployment process

Access the Kubernetes cluster through ssh. For more information, please see the documentation.

Download the source code

yum install -y git
git clone --recursive https://github.com/aliyunContainerService/kubeflow-aliyun

Security configuration

Configure the TLS certificate. If you do not have a TLS certificate, you can generate one with the following commands:

yum install -y openssl
domain="pipelines.kubeflow.org"
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.key \
  -out kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.crt \
  -subj "/CN=$domain/O=$domain"

If you have a TLS certificate, please save the private key and certificate to kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.key and kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.crt respectively
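To sanity-check a certificate generated this way, you can inspect the subject that was embedded in it. The sketch below repeats the generation with the same flags against throwaway paths under /tmp (the paths are substitutes for the real overlay paths above):

```shell
# Generate a throwaway self-signed certificate (same flags as above, /tmp paths)
domain="pipelines.kubeflow.org"
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -subj "/CN=$domain/O=$domain" 2>/dev/null

# Print the certificate subject to confirm the CN and O fields
openssl x509 -in /tmp/tls.crt -noout -subject
```

The printed subject should contain CN and O both set to pipelines.kubeflow.org.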

Configure the login password for admin

yum install -y httpd-tools
htpasswd -c kubeflow-aliyun/overlays/ack-auto-clouddisk/auth admin
New password:
Re-type new password:
Adding password for user admin

First, use kustomize to generate the deployment YAML:

cd kubeflow-aliyun/
kustomize build overlays/ack-auto-clouddisk > /tmp/ack-auto-clouddisk.yaml

Check the region and availability zone of your Kubernetes cluster's nodes, and replace the defaults in the manifest accordingly. Assuming your cluster is located in cn-hangzhou-g, you can execute the following commands:

sed -i.bak 's/regionid: cn-beijing/regionid: cn-hangzhou/g' \
  /tmp/ack-auto-clouddisk.yaml
sed -i.bak 's/zoneid: cn-beijing-e/zoneid: cn-hangzhou-g/g' \
  /tmp/ack-auto-clouddisk.yaml

It is recommended that you check /tmp/ack-auto-clouddisk.yaml to confirm the modification has taken effect.
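One way to convince yourself the substitutions behave as expected is to run them against a small throwaway sample first. The field names below mirror the ones the sed commands target; the file content is otherwise made up:

```shell
# Throwaway sample containing the fields the sed commands target
cat > /tmp/region-sample.yaml <<'EOF'
regionid: cn-beijing
zoneid: cn-beijing-e
EOF

# Same substitutions as the deployment step above
sed -i.bak 's/regionid: cn-beijing/regionid: cn-hangzhou/g' /tmp/region-sample.yaml
sed -i.bak 's/zoneid: cn-beijing-e/zoneid: cn-hangzhou-g/g' /tmp/region-sample.yaml

# No cn-beijing references should remain
! grep -q 'cn-beijing' /tmp/region-sample.yaml && echo "region and zone replaced"
```

Note that `sed -i.bak` keeps the pre-edit file as a `.bak` backup, so you can diff against it if something looks wrong.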

Replace the container image addresses, changing gcr.io to registry.aliyuncs.com:

sed -i.bak 's/gcr.io/registry.aliyuncs.com/g' /tmp/ack-auto-clouddisk.yaml

Again, it is recommended that you check /tmp/ack-auto-clouddisk.yaml to confirm the modification has taken effect.
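The same dry-run trick works for the image substitution. The image names below are hypothetical stand-ins for the gcr.io images in the real manifest:

```shell
# Hypothetical image references standing in for the real gcr.io ones
cat > /tmp/image-sample.yaml <<'EOF'
image: gcr.io/ml-pipeline/api-server:0.1.14
image: gcr.io/ml-pipeline/frontend:0.1.14
EOF

# Same substitution as the deployment step above
sed -i.bak 's/gcr.io/registry.aliyuncs.com/g' /tmp/image-sample.yaml

# Every image should now point at the mirror registry
cat /tmp/image-sample.yaml
```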

Adjust the amount of disk space to use; for example, to increase the disk space to 200 GiB:

sed -i.bak 's/storage: 100Gi/storage: 200Gi/g' /tmp/ack-auto-clouddisk.yaml

Validate the pipelines YAML file:

kubectl create --validate=true --dry-run=true -f /tmp/ack-auto-clouddisk.yaml

Deploy pipelines with kubectl:

kubectl create -f /tmp/ack-auto-clouddisk.yaml

To access pipelines, we expose the pipelines service through an ingress. In this example, the ingress IP is 112.124.193.271, so the Pipelines management console is available at https://112.124.193.271/pipeline/

kubectl get ing -n kubeflow
NAME             HOSTS   ADDRESS           PORTS     AGE
ml-pipeline-ui   *       112.124.193.271   80, 443   11m
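If you want the console URL programmatically, one option is to parse the ADDRESS column. The sketch below runs against a saved copy of the sample output above; in a live cluster you would pipe `kubectl get ing -n kubeflow` directly into the awk step instead:

```shell
# Sample `kubectl get ing -n kubeflow` output, as captured above
cat > /tmp/ing.txt <<'EOF'
NAME             HOSTS   ADDRESS           PORTS     AGE
ml-pipeline-ui   *       112.124.193.271   80, 443   11m
EOF

# Pull the ADDRESS column for the ml-pipeline-ui ingress and build the URL
ip=$(awk '/^ml-pipeline-ui/ {print $3}' /tmp/ing.txt)
echo "https://${ip}/pipeline/"
```

kubectl can also emit structured fields directly via `-o jsonpath`, which avoids column parsing entirely; the awk version is shown here only because it works on captured text.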

Access the Pipelines management console

If you use a self-signed certificate, the browser will warn that the connection is not private; click to show details, then click to visit the website anyway. Enter the user name admin and the password you set in the security configuration step.

At this point, you can use pipelines to manage and run training tasks.

Q&A

Why are Alibaba Cloud SSD cloud disks used here?

Because Alibaba Cloud SSD cloud disks can be backed up automatically on a regular basis, ensuring that the metadata in pipelines is not lost.

How do I back up a cloud disk?

To back up the contents of a cloud disk, you can manually create a snapshot of the disk, or set an automatic snapshot policy so that snapshots are created on schedule.

How do I clean up my Kubeflow Pipelines deployment?

Cleanup consists of two parts:

Remove components of Kubeflow Pipelines

kubectl delete -f /tmp/ack-auto-clouddisk.yaml

Release the two cloud disks that were created for mysql and minio storage, respectively, by releasing them in the cloud disk console.

This concludes the introduction to building a machine learning system on Kubernetes. Thank you for reading.
