This article explains how to build a high-availability Spark cluster with Spark and ZooKeeper. The content is quite detailed; interested readers can use it as a reference, and I hope it is helpful to you.
Comparison of three distributed deployment modes of Spark
Currently, Apache Spark supports three distributed deployment modes: standalone, Spark on Mesos, and Spark on YARN. For more details, see the official Spark documentation.
Spark standalone mode distributed deployment
Environment introduction:

hostname  application
tvm11     zookeeper
tvm12     zookeeper
tvm13     zookeeper, spark (master), spark (slave), Scala
tvm14     spark (backup master), spark (slave), Scala
tvm15     spark (slave), Scala

Description:
Scala dependency:
Note that support for Java 7, Python 2.6, and old Hadoop versions before 2.6.5 was removed as of Spark 2.2.0. Support for Scala 2.10 was removed as of 2.3.0. Support for Scala 2.11 is deprecated as of Spark 2.4.1 and will be removed in Spark 3.0.
ZooKeeper: a single Master node is a single point of failure, so at least two Master nodes must be started, with ZooKeeper providing the failover. The configuration is relatively simple.
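Before pointing Spark at ZooKeeper, it is worth confirming that the ensemble on tvm11, tvm12, and tvm13 is actually healthy. A minimal sanity check, assuming a standard ZooKeeper installation with zkServer.sh on the PATH, the default client port 2181, and the four-letter-word commands enabled:

$ zkServer.sh status          # should report Mode: leader on one node and Mode: follower on the others
$ echo ruok | nc tvm11 2181   # a healthy server replies "imok"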
Install scala
As the note above shows, Spark is strictly tied to the Scala version: spark-2.4.5 depends on scala-2.12.x, so install scala-2.12.x first; scala-2.12.10 is chosen here. Use the binary installation:
Download the installation package
After decompression it is ready to use.
$ wget https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz
$ tar zxvf scala-2.12.10.tgz -C /path/to/scala_install_dir
If the whole system should use this version of Scala, you can add it to the user's environment variables (.bashrc or .bash_profile).
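For example, a minimal sketch of the environment-variable addition to ~/.bashrc, assuming Scala was unpacked to /path/to/scala_install_dir as above (adjust to your actual directory):

export SCALA_HOME=/path/to/scala_install_dir/scala-2.12.10
export PATH=$SCALA_HOME/bin:$PATH

After sourcing the file, scala -version should report 2.12.10.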
Install spark
Set up passwordless SSH between the work user accounts on the three Spark machines.
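A minimal sketch of the passwordless SSH setup, assuming the work user account is named work on tvm13, tvm14, and tvm15 (substitute your actual user, and repeat from each node that needs to reach the others):

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa   # generate a key pair with no passphrase
$ ssh-copy-id work@tvm13
$ ssh-copy-id work@tvm14
$ ssh-copy-id work@tvm15
$ ssh work@tvm14 hostname                    # should print tvm14 without prompting for a password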
Now download the installation package on the master machine: tvm13.
Download address
Pay attention to the Hadoop version in the package name (match it to the existing environment; if there is no matching pre-built package, choose the source version and compile it yourself).
Unzip it to the installation directory.
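For example, a sketch of the download and unpack step, assuming the pre-built spark-2.4.5-bin-hadoop2.7 package from the Apache archive and the install directory used later in this article:

$ wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
$ tar zxvf spark-2.4.5-bin-hadoop2.7.tgz -C /data/template/s/spark/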
Configure spark
There are two main Spark service configuration files: spark-env.sh and slaves.
spark-env.sh: environment variables for running Spark
slaves: the list of worker servers
Configure spark-env.sh:
cp spark-env.sh.template spark-env.sh
export JAVA_HOME=/data/template/j/java/jdk1.8.0_201
export SCALA_HOME=/data/template/s/scala/scala-2.12.10
export SPARK_WORKER_MEMORY=2048m
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=2
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=tvm11:2181,tvm12:2181,tvm13:2181 -Dspark.deploy.zookeeper.dir=/data/template/s/spark"

# About the meaning of the SPARK_DAEMON_JAVA_OPTS parameters:
# -Dspark.deploy.recoveryMode=ZOOKEEPER              # use the ZooKeeper service for master failure recovery
# -Dspark.deploy.zookeeper.url=host1:2181,host2:2181 # ZooKeeper ensemble addresses (hostname:port)
# -Dspark.deploy.zookeeper.dir=/spark                # directory in ZooKeeper where Spark writes its recovery state
# Other parameter meanings: https://blog.csdn.net/u010199356/article/details/89056304
Configure slaves:
cp slaves.template slaves
# A Spark Worker will be started on each of the machines listed below.
tvm13
tvm14
tvm15
Configure spark-defaults.conf, which mainly provides default settings used when Spark runs jobs (these can also be overridden on the command line):
# http://spark.apache.org/docs/latest/configuration.html#configuring-logging
# spark-defaults.conf
spark.app.name                                YunTuSpark
spark.driver.cores                            2
spark.driver.memory                           2g
spark.master                                  spark://tvm13:7077,tvm14:7077
spark.eventLog.enabled                        true
spark.eventLog.dir                            hdfs://cluster01/tmp/event/logs
spark.serializer                              org.apache.spark.serializer.KryoSerializer
spark.serializer.objectStreamReset            100
spark.executor.logs.rolling.time.interval     daily
spark.executor.logs.rolling.maxRetainedFiles  30
spark.ui.enabled                              true
spark.ui.killEnabled                          true
spark.ui.liveUpdate.period                    100ms
spark.ui.liveUpdate.minFlushPeriod            3s
spark.ui.port                                 4040
spark.history.ui.port                         18080
spark.ui.retainedJobs                         100
spark.ui.retainedStages                       100
spark.ui.retainedTasks                        1000
spark.ui.showConsoleProgress                  true
spark.worker.ui.retainedExecutors             100
spark.worker.ui.retainedDrivers               100
spark.sql.ui.retainedExecutions               100
spark.streaming.ui.retainedBatches            100
spark.ui.retainedDeadExecutors                10
spark.executor.extraJavaOptions               -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

HDFS resource preparation
Because spark.eventLog.dir points to HDFS storage, the corresponding directory needs to be created in HDFS in advance:
hdfs dfs -mkdir -p hdfs://cluster01/tmp/event/logs

Configure system environment variables
Edit ~/.bashrc:
export SPARK_HOME=/data/template/s/spark/spark-2.4.5-bin-hadoop2.7
export PATH=$SPARK_HOME/bin/:$PATH

Distribution
After the above configuration is complete, distribute /path/to/spark-2.4.5-bin-hadoop2.7 to each slave node and configure the environment variables on each node.
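A minimal sketch of the distribution step, assuming rsync and scp are available, the install path from the environment-variable section above, and that the slaves' ~/.bashrc can simply be copied from the master (otherwise edit each node's file by hand):

$ rsync -az /data/template/s/spark/spark-2.4.5-bin-hadoop2.7 tvm14:/data/template/s/spark/
$ rsync -az /data/template/s/spark/spark-2.4.5-bin-hadoop2.7 tvm15:/data/template/s/spark/
$ scp ~/.bashrc tvm14:~/
$ scp ~/.bashrc tvm15:~/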
Start
Start all services on the master node first:
./sbin/start-all.sh
Then start the master service separately on the backup node:
./sbin/start-master.sh
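To confirm the daemons are running, the JDK's jps tool can be used on each node (a quick sanity check, not part of the original steps):

$ jps   # expect a Master process on tvm13 and tvm14, and Worker processes on tvm13, tvm14, and tvm15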
View status
After startup, check the status in the web UI:
Master (port 8081): Status: ALIVE
Backup (port 8080): Status: STANDBY
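As an optional end-to-end check of the HA master URL, you can submit the bundled SparkPi example; this is only a sketch, and the exact examples jar name depends on the Scala version your distribution was built with:

$ $SPARK_HOME/bin/spark-submit \
    --master spark://tvm13:7077,tvm14:7077 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100

If the driver output ends with a line like "Pi is roughly 3.14...", the cluster is accepting and running applications.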
Done!
That is all on how to build a high-availability Spark cluster with Spark and ZooKeeper. I hope the content above has been helpful.