

How to Run Apache Spark on Kubernetes

2025-04-06 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces how to run Apache Spark on Kubernetes. Many people have questions about this topic in daily operation, so the editor has consulted various materials and organized them into simple, practical steps. I hope it helps resolve your doubts about running Apache Spark on Kubernetes. Now, please follow the editor and study it!

I. Introduction to the Data Mechanics platform

This is an introduction to the Data Mechanics platform. First of all, it is a serverless, fully managed platform: users do not have to care about machines at all, and all applications are launched and scaled within seconds. Moving from local development to online deployment is a seamless transition. The platform also provides automatic tuning of Spark configuration parameters, so the whole pipeline runs very fast and stays stable.

The platform is deployed inside a K8s cluster within the user's own account. This is a strong guarantee for overall security: the data permissions, including the operation permissions, all stay in the same account.

II. Spark on K8s

(1) Core concepts

First of all, native support for combining Spark with K8s appeared in Spark 2.3; before that, there were several other ways to run Spark. The first is Standalone mode, which is not used very much. The second is Apache Mesos, which is used more abroad, but its market share is gradually shrinking. The third is Yarn; the vast majority of enterprises today run Spark on Yarn clusters. The fourth is Kubernetes, and people are now gradually moving Spark onto K8s.

The architecture of Spark on K8s is shown in the following figure.

There are two ways to submit an application: one is spark-submit, the other is the spark-on-k8s operator. Their respective characteristics are shown in the following figure:
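As a reference, a minimal spark-submit invocation against a K8s cluster might look like the sketch below. The API server address, namespace, service account, and image name are all illustrative placeholders, not values from the talk:

```shell
# Submit the SparkPi example to a Kubernetes cluster in cluster mode.
# All concrete values (API server, image, namespace) are illustrative.
spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=my-registry/spark:3.1.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
```

With the operator, the same application would instead be declared as a `SparkApplication` YAML resource and applied with kubectl, which is what makes it friendlier for GitOps-style management.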

Then let's compare the dependency management of Yarn and K8s; this is where the two differ most. Yarn provides a global Spark version, a global Python version, and global package dependencies, and lacks environment isolation. K8s, by contrast, gives complete environment isolation: each application can run in a completely different environment with different versions. Yarn's package management solution is to upload dependency packages to HDFS. K8s package management solutions include: managing an image repository and putting dependency packages into the image; supporting per-package dependency management by uploading packages to OSS/HDFS; distinguishing different classes of tasks; and mixing the two modes above.
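To make the image-based approach concrete, a minimal Dockerfile might look like the following sketch. It assumes a PySpark application with pip dependencies; the base image tag and file names are illustrative, not from the talk:

```dockerfile
# Start from an official Apache Spark image (tag is illustrative).
FROM apache/spark:3.4.1

# Bake Python dependencies into the image so every driver and executor
# gets an identical, fully isolated environment.
COPY requirements.txt /opt/app/requirements.txt
RUN pip install --no-cache-dir -r /opt/app/requirements.txt

# Ship the application code inside the image as well.
COPY my_job.py /opt/app/my_job.py
```

Because each application builds its own image, two jobs can run side by side on the same cluster with conflicting library versions, which is exactly what Yarn's global environment cannot do.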

(2) Configuration and performance

Then let's talk about a pitfall with Spark executors. For example, assuming the K8s node is a 4-core, 16 GB ECS instance, the following configuration will fail to allocate an executor! This is shown in the following figure.

What is the reason? A Spark pod can only request resources up to a certain percentage of the node's resources, while spark.executor.cores=4 asks for all of the node's cores. As shown in the following figure, if we calculate that 85% of the node's resources are allocatable, then we should configure the request like this: spark.kubernetes.executor.request.cores=3400m.
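The arithmetic behind that number can be sketched as follows (a small illustration, not part of the original talk; the 85% allocatable fraction is the example figure used above):

```python
def executor_request_millicores(node_cores: int, allocatable_fraction: float) -> str:
    """Compute a spark.kubernetes.executor.request.cores value in millicores.

    K8s reserves part of each node's CPU for system daemons, so an executor
    pod must request less than the node's full core count or it stays pending.
    """
    millicores = int(node_cores * 1000 * allocatable_fraction)
    return f"{millicores}m"

# A 4-core node where ~85% of CPU is allocatable to pods:
print(executor_request_millicores(4, 0.85))  # -> 3400m
```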

Then there is a more important feature: dynamic resource allocation. Full support for dynamic resources is not currently available on K8s. For example, if a pod is killed, its shuffle files are lost, which forces recomputation.
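As a reference point, Spark 3.0 introduced shuffle tracking as a partial workaround: it enables dynamic allocation without an external shuffle service by keeping executors alive while their shuffle data is still needed. A configuration sketch (the executor counts are illustrative):

```properties
# Enable dynamic allocation without an external shuffle service.
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=20
```

This mitigates, but does not fully solve, the lost-shuffle-file problem described above.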

This part is about cluster autoscaling combined with dynamic allocation. The PPT slide above showed a solid frame and a dashed frame. In practice, the K8s cluster autoscaler adds a node when a pod is in the pending state and cannot be allocated resources. Autoscaling and dynamic resources then work together: when resources are available, an executor registers with the driver within about 10 seconds; when there are no resources, autoscaling first adds ECS nodes and then the executors are requested. The whole executor application process completes in roughly 1 to 2 minutes.

In fact, this also gives a better guarantee of run-time elasticity. Here is my own understanding of a more interesting, more cost-effective way to play it: spot instances can reduce costs by about 75%. A spot instance is a preemptible resource, suited to low-SLA, price-sensitive applications, and the architecture as a whole is designed to minimize cost. If an executor is killed, it recovers; but if the driver is killed, the whole application is in danger. So we configure node selectors and affinities to schedule the driver onto non-preemptible nodes and the executors onto spot instances.
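A sketch of that placement policy using Spark's per-role node-selector configuration. The label key and values are illustrative, not from the talk; per-role node selectors require a recent Spark version, and pod templates with affinities are the more general mechanism:

```properties
# Pin the driver to non-preemptible (on-demand) nodes so the
# application survives spot reclamation.
spark.kubernetes.driver.node.selector.node-lifecycle=on-demand
# Let executors run on cheaper, preemptible spot nodes;
# a killed executor is simply rescheduled.
spark.kubernetes.executor.node.selector.node-lifecycle=spot
```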

Then, the next issue is the IO problem of object storage. Spark on K8s usually targets object storage, where rename, list, and commit task/job operations can be very time-consuming. If the S3A committers or the JindoFS JobCommitter are not available, spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 should be set.
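For reference, the setting looks like this. Algorithm version 2 moves task output directly into the final destination at task commit, skipping the expensive job-level rename pass, at the cost of weaker atomicity if a job fails mid-commit:

```properties
# FileOutputCommitter v2: commit task output straight to the final
# destination instead of renaming everything again at job commit.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
```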

And shuffle performance. Shuffle IO is the key bottleneck for shuffle-bound workloads, and Spark 2.x writes shuffle data to the Docker file system by default. The Docker file system is very slow and needs to be replaced by a volume.
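In Spark's K8s backend, a mounted volume whose name starts with `spark-local-dir-` is used as local scratch/shuffle space instead of the container's overlay file system. A hostPath sketch (the mount and host paths are illustrative):

```properties
# Route Spark local/shuffle data to a hostPath volume instead of
# the slow container overlay (Docker) file system.
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/tmp/spark-local
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/mnt/disks/ssd0
```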

(3) Future work

For future work, I think the most important items are shuffle improvements and the separation of intermediate-data storage from compute, which is a big job. In addition, there is node decommissioning, support for uploading Python dependency files, and so on.

The advantages and disadvantages of choosing K8s are shown in the following figure:

The specific steps for deploying Spark on K8s are shown in the following figure:

III. Cloud-native thinking and practice of the EMR team

(1) Overall architecture

This is our overall architecture, as shown in the following figure:

(2) Dynamic resources & shuffle improvements

The shuffle service addresses these core issues:

▪ solves the dynamic resource allocation problem

▪ solves the pain point of mounting cloud disks, which are expensive and hard to size in advance

▪ solves the scalability and performance issues of NAS as central storage

▪ avoids task recomputation due to fetch failures and improves the stability of medium and large jobs

▪ improves job performance with tiered storage

(3) EMR Spark Cloud Native Planning

Within the EMR product system, create an EMR cluster of type ON ACK:

▪ JindoSpark image

▪ JindoFSService/JindoFSSDK enhances access to OSS data Lake

▪ JindoJobCommitter enhances the ability to commit jobs to OSS

▪ JindoShuffleService & enhanced dynamic resource capabilities

▪ ACK cluster connects to the old EMR cluster, and can access the old cluster table and HDFS data.

▪ Operator enhancements and dependency management, providing one-stop control commands

▪ one-stop cloud-native logging and monitoring platform

Keywords: kubernetes, apache spark, cloud native, data mechanics, spark on k8s

At this point, the study of "how to run Apache Spark on Kubernetes" is over. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it! If you want to continue learning more related knowledge, please keep following the website; the editor will keep working hard to bring you more practical articles!

Welcome to subscribe to "Shulou Technology Information" to get the latest news, interesting stories, and hot topics in the IT industry, and to keep up with the hottest and latest Internet news, technology news, and IT industry trends.



© 2024 shulou.com SLNews company. All rights reserved.
