In this article, the editor shares how to deploy Spark on Kubernetes. Many readers are not very familiar with this topic, so the article is shared here for your reference. I hope you learn a lot from reading it; let's get started!
Spark is a new generation of distributed in-memory computing framework and an Apache top-level open source project. Compared with the Hadoop MapReduce computing framework, Spark keeps intermediate results in memory, which can speed up processing by a factor of 10 to 100. It also provides a richer set of operators and uses resilient distributed datasets (RDDs) for iterative computation, making it better suited to data mining and machine-learning algorithms and greatly improving development efficiency.
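As a small illustration of the in-memory, iterative style of computation that RDDs enable, here is a generic sketch that could be run in spark-shell (it is not part of the deployment described below):

// Minimal sketch: cache an RDD in memory and reuse it across several passes.
// Assumes an existing SparkContext `sc`, as provided by spark-shell.
val numbers = sc.parallelize(1 to 1000000).cache()   // keep the dataset in memory

var total = 0L
for (i <- 1 to 10) {
  // each pass reuses the cached data instead of re-reading it from storage
  total += numbers.map(_.toLong * i).reduce(_ + _)
}
println(s"Accumulated total: $total")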
Build a Spark container
The first step in deploying an application on Kubernetes is to create a container. Although some projects provide official container images, as of this writing Apache Spark does not, so we will build our own Spark container. Let's start with the Dockerfile.
FROM java:openjdk-8-jdk

ENV hadoop_ver 2.8.2
ENV spark_ver 2.4.4

RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${hadoop_ver}/hadoop-${hadoop_ver}.tar.gz | \
        tar -zx && \
    ln -s hadoop-${hadoop_ver} hadoop && \
    echo Hadoop ${hadoop_ver} installed in /opt

RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${spark_ver}/spark-${spark_ver}-bin-without-hadoop.tgz | \
        tar -zx && \
    ln -s spark-${spark_ver}-bin-without-hadoop spark && \
    echo Spark ${spark_ver} installed in /opt

ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin
ENV HADOOP_HOME=/opt/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin
ENV LD_LIBRARY_PATH=$HADOOP_HOME/lib/native

RUN curl http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar -o /opt/spark/jars/hadoop-aws-2.8.2.jar
RUN curl http://central.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.3/httpclient-4.5.3.jar -o /opt/spark/jars/httpclient-4.5.3.jar
RUN curl http://central.maven.org/maven2/joda-time/joda-time/2.9.9/joda-time-2.9.9.jar -o /opt/spark/jars/joda-time-2.9.9.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.712/aws-java-sdk-core-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-core-1.11.712.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.712/aws-java-sdk-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-1.11.712.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-kms/1.11.712/aws-java-sdk-kms-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-kms-1.11.712.jar
RUN curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.712/aws-java-sdk-s3-1.11.712.jar -o /opt/spark/jars/aws-java-sdk-s3-1.11.712.jar

ADD start-common.sh start-worker start-master /
ADD core-site.xml /opt/spark/conf/core-site.xml
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin
In this Dockerfile, we first download Apache Spark and Hadoop from the official archives, and then fetch the associated jar packages from Maven. Once all of these files have been downloaded and unpacked into the appropriate directories, we add the important configuration files to the image.
In the process, you can easily add your own environment-specific configuration.
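For instance, you could append a couple of lines like the following to the Dockerfile to bake in a custom configuration file or environment variable (the file and variable below are purely hypothetical examples, not part of the repository):

# Hypothetical environment-specific additions
ADD log4j.properties /opt/spark/conf/log4j.properties
ENV SPARK_WORKER_MEMORY 2g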
We could have skipped these steps and used a pre-built image directly, but walking through them lets readers see what is inside the Spark container, and advanced users can modify it to meet their particular needs.
The Dockerfile and other associated configuration files used in the above example can be obtained from this GitHub repository. If you want to use the contents of this repository, first clone it locally using the following command:
git clone git@github.com:devshlabs/spark-kubernetes.git
Now, you can make any changes your environment requires, then build the image and push it to the container registry you use. In this article's example, I use Docker Hub as the container registry, with the following commands:
cd spark-kubernetes/spark-container
docker build . -t mydockerrepo/spark:2.4.4
docker push mydockerrepo/spark:2.4.4
Remember to replace mydockerrepo with your actual registry name.
Deploy Spark on Kubernetes
At this point, the Spark container image has been built and can be pulled and used. Let's use this image to deploy the Spark Master and Workers. The first step is to create the Spark Master, which we will do with a Kubernetes ReplicationController. In this article's example I use only a single instance for the Spark Master; in a production environment with HA requirements, you may want to set the number of replicas to 3 or more.
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      hostname: spark-master-hostname
      subdomain: spark-master-headless
      containers:
        - name: spark-master
          image: mydockerrepo/spark:2.4.4
          imagePullPolicy: Always
          command: ["/start-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
In order for the Spark Worker nodes to discover the Spark Master node, we also need to create a headless service. After you have cloned the GitHub repository and entered the spark-kubernetes directory, you can start the Spark Master service with the following commands:
kubectl create -f spark-master-controller.yaml
kubectl create -f spark-master-service.yaml
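The spark-master-service.yaml file comes from the repository and is not reproduced in this article. As a rough sketch of what such a manifest might contain, assuming service names that match the controller above (the actual file may differ):

# Sketch only: the actual spark-master-service.yaml in the repository may differ.
kind: Service
apiVersion: v1
metadata:
  name: spark-master-headless   # headless service (no cluster IP), used by Workers for DNS discovery
spec:
  clusterIP: None
  selector:
    component: spark-master
---
kind: Service
apiVersion: v1
metadata:
  name: spark-master            # regular service exposing the Spark and web UI ports
spec:
  ports:
    - name: spark
      port: 7077
      targetPort: 7077
    - name: http
      port: 8080
      targetPort: 8080
  selector:
    component: spark-master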
Now, make sure that the Master node and all of its services are running properly before deploying the Worker nodes. The number of Spark Worker replicas is set to 2, and you can modify it as needed. Start the Workers with the following command:

kubectl create -f spark-worker-controller.yaml
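The spark-worker-controller.yaml manifest is also taken from the repository and is not reproduced here. Assuming it mirrors the Master controller above, it might look roughly like the following (the container port and resource values are illustrative):

# Sketch only: the actual spark-worker-controller.yaml in the repository may differ.
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-worker-controller
spec:
  replicas: 2                        # number of Worker replicas; adjust as needed
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: mydockerrepo/spark:2.4.4
          imagePullPolicy: Always
          command: ["/start-worker"]   # script added to the image in the Dockerfile
          ports:
            - containerPort: 8081      # Spark Worker web UI port (assumed)
          resources:
            requests:
              cpu: 100m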
Finally, confirm that all services are running properly with the following command:

kubectl get all

The output should look similar to the following:

NAME                               READY   STATUS    RESTARTS   AGE
po/spark-master-controller-5rgz2   1/1     Running   0          9m
po/spark-worker-controller-0pts6   1/1     Running   0          9m
po/spark-worker-controller-cq6ng   1/1     Running   0          9m

NAME                         DESIRED   CURRENT   READY   AGE
rc/spark-master-controller   1         1         1       9m
rc/spark-worker-controller   2         2         2       9m

NAME               CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
svc/spark-master   10.108.94.160                 7077/TCP,8080/TCP   9m

Submit a Job to the Spark cluster
Now let's submit a Job to see whether everything works properly. Before that, however, you need a valid AWS S3 account and a bucket containing sample data. I downloaded the sample data from Kaggle, at https://www.kaggle.com/datasna ... s.csv. After obtaining it, you need to upload it to your S3 bucket; assuming the bucket name is s3-data-bucket, the sample data file is located at s3-data-bucket/data.csv.
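One way to perform that upload, assuming the AWS CLI is installed and configured with credentials for the bucket (file and bucket names as in the example above):

# Upload the sample data to the example bucket
aws s3 cp data.csv s3://s3-data-bucket/data.csv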
Once the data is ready, load it from the Spark master pod. Taking the Pod named spark-master-controller-5rgz2 as an example, the command is as follows:

kubectl exec -it spark-master-controller-5rgz2 /bin/bash

Once logged into the Spark master, you can start the Spark shell:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.132.147:4040
Spark context available as 'sc' (master = spark://spark-master:7077, app id = app-20170405152342-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Now let's give the Spark Master the details of the S3 storage. Enter the following configuration at the Scala prompt shown above:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", "s3-access-key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "s3-secret-key")
Now, simply paste the following into the Scala prompt to submit the Spark Job (remember to modify the S3 related fields):
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.IntParam
import org.apache.spark.sql.SQLContext
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

val conf = new SparkConf().setAppName("YouTube")
val sqlContext = new SQLContext(sc)

import sqlContext.implicits._
import sqlContext._

val youtubeDF = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("s3a://s3-data-bucket/data.csv")

youtubeDF.registerTempTable("popular")

val fltCountsql = sqlContext.sql("select s.title from popular s")
fltCountsql.show()
Finally, you can update the Spark deployment using the kubectl patch command. For example, you can add more Worker nodes when the load is high, and then remove them when the load drops.
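As an illustration (not taken from the repository), assuming the Worker ReplicationController created earlier, either of the following commands would resize the Worker pool:

# Scale the Workers up to 4 replicas while the load is high ...
kubectl scale rc spark-worker-controller --replicas=4

# ... or achieve the same thing with kubectl patch, then scale back down later
kubectl patch rc spark-worker-controller -p '{"spec": {"replicas": 2}}'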
That is everything in the article "How to deploy Spark in Kubernetes". Thank you for reading! I hope the content shared here has been helpful; if you would like to learn more, welcome to follow the industry information channel!