How to Analyze a Kubernetes-Based Spark Deployment

This article explains in detail how to deploy Spark on Kubernetes. I hope you come away with a good understanding of the relevant concepts after reading it.

Yarn used to be the default resource orchestration and management platform for Hadoop. Recently, however, the situation has changed, especially for Spark: because Spark integrates well with object storage platforms such as S3 but is not tightly coupled to the other components of the Hadoop ecosystem, Kubernetes is rapidly replacing Yarn as the default orchestration platform for Spark systems built on object storage. In this article, we will take an in-depth look at how to build and deploy Spark containers on a Kubernetes cluster. Since Spark needs data to operate on, we will configure the Spark cluster to perform storage operations through the S3 API.

The first step in deploying an application on Kubernetes is to create a container. Although some projects provide official container images, as of this writing Apache Spark does not provide an official image. So we will create our own Spark container; let's start with the Dockerfile.

FROM java:openjdk-8-jdk

ENV hadoop_ver 2.8.2
ENV spark_ver 2.4.4

RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${hadoop_ver}/hadoop-${hadoop_ver}.tar.gz | \
        tar -zx && \
    ln -s hadoop-${hadoop_ver} hadoop && \
    echo Hadoop ${hadoop_ver} installed in /opt

RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${spark_ver}/spark-${spark_ver}-bin-without-hadoop.tgz | \
        tar -zx && \
    ln -s spark-${spark_ver}-bin-without-hadoop spark && \
    echo Spark ${spark_ver} installed in /opt

ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin
ENV HADOOP_HOME=/opt/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin
ENV LD_LIBRARY_PATH=$HADOOP_HOME/lib/native

RUN curl http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar -o /opt/spark/jars/hadoop-aws-2.8.2.jar
RUN curl http://central.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.3/httpclient-4.5.3.jar -o /opt/spark/jars/httpclient-4.5.3.jar
RUN curl http://central.maven.org/maven2/joda-time/joda-time/2.9.9/joda-time-2.9.9.jar -o /opt/spark/jars/joda-time-2.9.9.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.712/aws-java-sdk-core-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-core-1.11.712.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.712/aws-java-sdk-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-1.11.712.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-kms/1.11.712/aws-java-sdk-kms-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-kms-1.11.712.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.712/aws-java-sdk-s3-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-s3-1.11.712.jar

ADD start-common.sh start-worker start-master /
ADD core-site.xml /opt/spark/conf/core-site.xml
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin

In this Dockerfile, we first download Apache Spark and Hadoop from the official archives, and then fetch the associated jar packages from Maven. Once all of these files have been downloaded and unpacked into specific directories, we add the important configuration files to the image.

In the process, you can easily add your own environment-specific configuration.

We could have skipped these steps and used a pre-built image directly, but walking through them lets readers see what is inside the Spark container, and advanced users can modify it to meet their own needs.

The Dockerfile and other associated configuration files used in the above example can be obtained from this GitHub repository. If you want to use the contents of this repository, first clone it locally using the following command:

git clone git@github.com:devshlabs/spark-kubernetes.git

Now you can make any changes your environment needs, then build the image and push it to the container registry you use. In this article's example, I use Docker Hub as the container registry; the commands are as follows:

cd spark-kubernetes/spark-container
docker build . -t mydockerrepo/spark:2.4.4
docker push mydockerrepo/spark:2.4.4

Remember to replace mydockerrepo with your actual registry name.

Deploy Spark on Kubernetes

At this point, the Spark container image has been built and can be pulled and used. Let's use this image to deploy the Spark Master and Workers. The first step is to create the Spark Master, which we will do with a Kubernetes ReplicationController. In this article's example I use only a single instance; in a production environment with HA requirements, you may want to set the replica count to 3 or more.

kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      hostname: spark-master-hostname
      subdomain: spark-master-headless
      containers:
        - name: spark-master
          image: mydockerrepo/spark:2.4.4
          imagePullPolicy: Always
          command: ["/start-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m

In order for the Spark Worker nodes to discover the Spark Master node, we also need to create a headless service.
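
The repository contains the actual service manifests used by the commands below, so the following is only an illustrative sketch of what such a headless service might look like; the name spark-master-headless and the port entry are assumptions chosen to match the subdomain and ports declared in the controller above.

# Illustrative sketch only; the real manifest ships with the repository.
kind: Service
apiVersion: v1
metadata:
  name: spark-master-headless    # must match the subdomain set on the master pod
spec:
  clusterIP: None                # headless: gives the master pod a stable DNS name
  selector:
    component: spark-master
  ports:
    - port: 7077                 # Spark master RPC port (assumed; see the controller above)

The kubectl get all output later in this article also shows a regular spark-master service exposing ports 7077 and 8080 (RPC and web UI); that manifest likewise ships with the repository.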

After you have cloned the GitHub repository and entered the spark-kubernetes directory, you can start the Spark Master service with the following commands:

kubectl create -f spark-master-controller.yaml
kubectl create -f spark-master-service.yaml

Now, make sure that the Master node and all of its services are running properly, and then you can start deploying the Worker nodes. The Spark Worker replica count is set to 2, which you can modify as needed. The Worker startup command is as follows:

kubectl create -f spark-worker-controller.yaml
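
For reference, spark-worker-controller.yaml in the repository follows the same pattern as the master controller above. The sketch below is an approximation under that assumption: it reuses the image built earlier, points the command at the /start-worker script added in the Dockerfile, and sets replicas to 2 as described; the worker UI port 8081 is an assumption.

# Approximate sketch of the worker controller; the real manifest ships with the repository.
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-worker-controller
spec:
  replicas: 2                      # number of Spark workers; adjust as needed
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: mydockerrepo/spark:2.4.4
          imagePullPolicy: Always
          command: ["/start-worker"]   # script added to the image in the Dockerfile
          ports:
            - containerPort: 8081      # Spark worker web UI (assumed)
          resources:
            requests:
              cpu: 100m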

Finally, confirm that all services are running properly with the following command:

kubectl get all

Execute the above command, and you should see something similar to the following:

NAME                               READY   STATUS    RESTARTS   AGE
po/spark-master-controller-5rgz2   1/1     Running   0          9m
po/spark-worker-controller-0pts6   1/1     Running   0          9m
po/spark-worker-controller-cq6ng   1/1     Running   0          9m

NAME                         DESIRED   CURRENT   READY   AGE
rc/spark-master-controller   1         1         1       9m
rc/spark-worker-controller   2         2         2       9m

NAME               CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
svc/spark-master   10.108.94.160   <none>        7077/TCP,8080/TCP   9m

Submit a Job to the Spark cluster

Now let's submit a Job and see whether it works properly. Before that, however, you need a valid AWS S3 account and a bucket containing sample data. I downloaded the sample data from Kaggle (https://www.kaggle.com/datasna ... s.csv); after obtaining it, you need to upload it to your S3 bucket. Assuming the bucket name is s3-data-bucket, the sample data file will be located at s3-data-bucket/data.csv.

Once the data is ready, open a shell in the Spark master Pod to run the job. Taking the Pod named spark-master-controller-5rgz2 from the output above as an example, the command is as follows:

kubectl exec -it spark-master-controller-5rgz2 /bin/bash

Once you are inside the master Pod, you can start the Spark shell:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.132.147:4040
Spark context available as 'sc' (master = spark://spark-master:7077, app id = app-20170405152342-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Now let's give Spark the details of the S3 storage. Enter the following configuration at the scala> prompt shown above:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", "s3-access-key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "s3-secret-key")

Now, simply paste the following into the Scala prompt to submit the Spark Job (remember to modify the S3 related fields):

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.IntParam
import org.apache.spark.sql.SQLContext
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

val conf = new SparkConf().setAppName("YouTube")
val sqlContext = new SQLContext(sc)

import sqlContext.implicits._
import sqlContext._

val youtubeDF = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("s3a://s3-data-bucket/data.csv")

youtubeDF.registerTempTable("popular")

val fltCountsql = sqlContext.sql("select s.title from popular s")
fltCountsql.show()

Finally, you can update the Spark deployment using the kubectl patch command. For example, you can add more worker nodes when the load is high, and then remove them when the load drops.
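
As a rough illustration of that idea, the replica count on the worker ReplicationController can be raised or lowered with a small strategic-merge patch; the file name worker-scale.yaml and the count of 4 below are examples only.

# worker-scale.yaml (hypothetical file): bump the worker replica count,
# then apply it with: kubectl patch rc spark-worker-controller --patch "$(cat worker-scale.yaml)"
spec:
  replicas: 4

The same effect can be achieved with kubectl scale rc spark-worker-controller --replicas=4; either way, the new workers register with the Spark master once their pods are running.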

That covers how to analyze a Kubernetes-based Spark deployment. I hope the above content is of some help to you and adds to your knowledge. If you found the article useful, feel free to share it so more people can see it.
