Deployment Plan for Spark HA


This article explains a deployment plan for Spark HA. The method is simple, fast, and practical; work through the sections in order.

Contents

1. Prepare the server environment
1.1 Server planning
1.2 Software versions

2. Install the ZooKeeper cluster

3. Install the Hadoop 2 HA cluster

4. Install the HBase HA cluster

5. Install the Spark HA cluster
5.1 Initialize the configuration
5.2 Install Spark
5.3 Configure environment variables (run as root; be sure to switch back to the ordinary user afterwards)
5.3.1 Modify the system environment variables (append)
5.3.2 Modify Spark's environment variables
5.3.3 Modify the slave nodes
5.4 Install the other cluster machines
5.5 Start Spark
5.6 Start HA mode
5.7 Check whether it started
5.8 Publish / stop a Driver
5.9 Enable and disable historical monitoring of Drivers
5.10 Hive on Spark configuration

6. Description of relevant parameters

1. Prepare the server environment

1.1 Server planning

ZooKeeper Cluster

Host        IP
zookeeper1  192.168.50.228
zookeeper2  192.168.50.229
zookeeper3  192.168.50.230

Spark HA Cluster

Host  IP              Master  Worker
nn1   192.168.50.221  Y       N
nn2   192.168.50.222  Y       N
dn1   192.168.50.225  N       Y
dn2   192.168.50.226  N       Y
dn3   192.168.50.227  N       Y

1.2 Software versions

Linux: CentOS 6 (kernel 2.6.32-431.el6.x86_64)

Hadoop: 2.6.0

ZooKeeper: 3.4.6

JDK/JRE: 1.7.0_75 (64-bit)

Spark: 1.3.0 (spark-1.3.0-bin-hadoop2.4)

2. Install the ZooKeeper cluster by referring to the "Zookeeper deployment document V1.0"

Spark relies on ZooKeeper for Master election, so ZooKeeper must be deployed first.

3. Install the Hadoop 2 HA cluster by referring to the "Hadoop2 HA Cluster deployment V1.0" document

In Spark standalone mode HDFS is optional; if the cluster runs on YARN, Hadoop must be deployed.

4. Install the HBase HA cluster by referring to the "HBase HA deployment document V1.0"

If Spark does not store data in HBase, HBase need not be deployed.

5. Install the Spark HA cluster

5.1 Initialize the configuration

1. Modify the host name.

Step 1: change it temporarily:

# hostname nn1

Step 2: make the change permanent, so it is not reset on the next restart. Edit /etc/sysconfig/network and set the hostname:

NETWORKING=yes
HOSTNAME=nn1

Step 3: add a DNS mapping so the hostname resolves directly to the local IP. Edit /etc/hosts and append a line at the end, for example:

192.168.50.221 nn1

Step 4: reboot the machine. After the reboot, run ping nn1; if the ping succeeds, the configuration is complete.

Configure the other machines in turn.
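Since every node needs the same hostname-to-IP mappings, the /etc/hosts entries can be emitted in one pass; a minimal sketch, using the host names and IPs from the server plan in section 1.1:

```shell
#!/bin/sh
# Sketch: emit the /etc/hosts entries for the whole cluster in one pass.
# Host names and IPs are taken from the server plan in section 1.1;
# append the output to /etc/hosts on every machine.
emit_hosts() {
  cat <<'EOF'
192.168.50.221 nn1
192.168.50.222 nn2
192.168.50.225 dn1
192.168.50.226 dn2
192.168.50.227 dn3
192.168.50.228 zookeeper1
192.168.50.229 zookeeper2
192.168.50.230 zookeeper3
EOF
}

emit_hosts
```

Typical usage on each node: emit_hosts | sudo tee -a /etc/hosts, then verify with ping nn1.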

2. Turn off the firewall: service iptables stop

Also disable the firewall's self-start service: chkconfig iptables off

Check that self-start is disabled: chkconfig --list | grep iptables (all entries should read "off").

Check the status:

# service iptables status
Firewall is stopped.

3. Create an application account and group (optional; a dedicated user is recommended)

For system security, it is recommended that each external application get its own account and group. Specific creation methods can be found online.

# create a new group
[root@nn2 ~]# groupadd bdata

# add the user and assign it to the group
[root@nn2 ~]# useradd -g bdata bdata

# set the password
[root@nn2 ~]# passwd bdata
Changing password for user bdata.
New password:
BAD PASSWORD: it does not contain enough DIFFERENT characters
BAD PASSWORD: is a palindrome
Retype new password:
passwd: all authentication tokens updated successfully.

4. Set up SSH

As root, edit /etc/ssh/sshd_config:

# vim /etc/ssh/sshd_config

Uncomment the following lines:

RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

After the change, restart sshd: service sshd restart

Then switch to the ordinary user for the remaining settings.

Generate an SSH key pair; with it, the cluster can log in to the slaves without a password at startup.

a. On each machine, run ssh-keygen -t rsa -P "". The passphrase is left empty because authentication happens every time Hadoop starts; a passphrase would have to be typed each time, which is troublesome. The command generates two files under the ~/.ssh folder.

b. Then run ssh-copy-id userName@machineName against every other machine; this quickly sends the local public key to the remote machine and appends it to authorized_keys automatically.

Verify:

[root@nn1 ~]# ssh nn1

After logging in, exit leaves the SSH session.

Note: if a password is still requested, the permissions on authorized_keys are probably wrong; fix them with chmod 600 ~/.ssh/authorized_keys.

Send the public key to every machine that should be reachable without a password.
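Since sshd silently falls back to password prompts when the key files are too permissive, the required permissions can be enforced in one helper; a minimal sketch, with the target directory parameterized (a detail added here) so the fix can be rehearsed against a scratch directory instead of the real ~/.ssh:

```shell
#!/bin/sh
# Sketch: enforce the permissions sshd expects before honoring
# passwordless login. The directory argument defaults to ~/.ssh;
# it is a parameter only so the fix can be tested safely.
fix_ssh_perms() {
  dir="${1:-$HOME/.ssh}"
  mkdir -p "$dir"
  touch "$dir/authorized_keys"
  chmod 700 "$dir"                  # only the owner may enter the directory
  chmod 600 "$dir/authorized_keys"  # only the owner may read/write the keys
}
```

Run fix_ssh_perms as the ordinary user on every node after ssh-copy-id.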

5. Install JDK/JRE

Spark is written in Scala and runs on the JVM, so a JRE or JDK must be installed before Spark can run. For testing convenience a full JDK is recommended (a JRE is enough in a production environment). The JDK installation steps are omitted here.

5.2 Install Spark

Download address:

http://mirrors.cnnic.cn/apache/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz

Since Spark is written in Scala, the Scala library is also required. Download address:

http://downloads.typesafe.com/scala/2.10.5/scala-2.10.5.tgz?_ga=1.126203213.1556481652.1427182777

5.3 Configure environment variables (switch to root to execute; be sure to switch back to the ordinary user afterwards)

5.3.1 Modify the system environment variables (append)

# vim /etc/profile

export JAVA_HOME=/home/utoken/software/jdk1.7.0_75
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export SCALA_HOME=/home/utoken/software/scala-2.10.5
export SPARK_HOME=/home/utoken/software/spark-1.3.0-bin-hadoop2.4
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin

5.3.2 Modify Spark's environment variables

5.3.2.1 Modify Spark's environment variable configuration file

# vim spark-env.sh

# import the environment variables Spark needs at runtime
export JAVA_HOME=/home/utoken/software/jdk1.7.0_75

# Why configure JAVA_HOME here even though it is already set in ~/.bash_profile on every machine: start-all.sh logs in to each slave over ssh and then starts the worker process. ~/.bash_profile is only executed for login shells, and the ssh session used here is a non-login shell, so .bash_profile is not sourced and the worker cannot find JAVA_HOME at startup. Solution: put the variables into .bashrc, which is executed whenever a shell starts, or export them in spark-env.sh as done here.

# with multiple Masters, the SPARK_MASTER_IP property must not be set, otherwise the additional Masters cannot start; the master address is instead specified per Application
# export SPARK_MASTER_IP=nn1

# specify the amount of memory available to each Worker (global)
export SPARK_WORKER_MEMORY=5g

# directory for the files of tasks Spark has executed
export SPARK_WORK_DIR=/home/utoken/datadir/spark/work

# directory for Spark's shuffle and other small temporary files; watch the number of open handles here
export SPARK_LOCAL_DIRS=/home/utoken/datadir/spark/tmp

# use ZooKeeper to guarantee HA, importing the corresponding properties
export SPARK_DAEMON_JAVA_OPTS="-Dsun.io.serialization.extendedDebugInfo=true -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181 -Dspark.deploy.zookeeper.dir=/spark"

Or, appending the options one at a time:

# specify the Spark recovery mode, here ZooKeeper mode; the default is NONE
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Dspark.deploy.recoveryMode=ZOOKEEPER"
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Dspark.deploy.zookeeper.url=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181"
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Dspark.deploy.zookeeper.dir=/spark"

Options:

spark.deploy.recoveryMode: the Master recovery mode; one of ZOOKEEPER, FILESYSTEM, or NONE (the default is NONE)

spark.deploy.zookeeper.url: the ZooKeeper server addresses

spark.deploy.zookeeper.dir: the ZooKeeper directory (here /spark) that holds the cluster metadata, including Workers, Drivers, and Applications

# required for the Spark on YARN cluster mode; the standalone cluster mode does not need these
export HADOOP_HOME=/home/utoken/software/hadoop-2.5.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

5.3.2.2 Modify Spark's global configuration for each Driver application

# vim spark-defaults.conf (takes effect for each Driver)

Notes on spark-defaults.conf: edit spark-defaults.conf on the machine where the driver runs. The file affects the applications that driver submits and the startup parameters of the executors providing compute resources for them. It therefore only needs to be edited on the driver's machine, not on the machines running a worker or master.

Configuration example:

spark.executor.extraJavaOptions -XX:MaxPermSize=1000m
spark.executor.memory 1g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.cores.max 32
spark.shuffle.manager SORT
spark.driver.memory 2g
spark.shuffle.consolidateFiles true

Details:

spark.executor.extraJavaOptions sets extra Java options; here a permanent-generation maximum of 1000m.

spark.executor.memory is the heap memory each executor gets at startup (taken from the Worker's memory); here 1g, the default is 512m.

spark.serializer is the serialization class, here org.apache.spark.serializer.KryoSerializer.

spark.cores.max is the maximum number of cores each Driver may use; here 32, the default is all cores the system owns.

spark.shuffle.manager is the shuffle management method; SORT is specified here, since sorting reduces the number of temporary files at a small cost in performance.

spark.driver.memory is the memory used by the Driver itself, here 2g.

spark.shuffle.consolidateFiles controls merging of small shuffle files: a map stage can leave behind many small files even though the reduce stage produces a single result; with this setting enabled, the small files produced by map are consolidated into one file.

The above properties can also be set individually in SparkConf and tuned to each machine's capabilities.

The priority of Spark properties is: SparkConf > command-line parameters > configuration file.

5.3.3 Modify the slave nodes

# vim slaves

dn1
dn2
dn3

5.4 Install the other cluster machines

Quick distribution of Scala:

scp -r /home/utoken/software/scala-2.10.5 utoken@dn1:/home/utoken/software/

Quick distribution of Spark:

scp -r spark-1.3.0-bin-hadoop2.4 utoken@dn1:/home/utoken/software/
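The scp commands above copy to dn1 only; the same copies must reach every worker. A minimal sketch of doing it in one loop, assuming the host names from the server plan and the install paths used above; the commands are echoed as a dry run, so drop the echo to actually copy:

```shell
#!/bin/sh
# Sketch: distribute Scala and Spark to every slave in one loop.
# Hosts follow the server plan; paths follow the install steps above.
# The scp commands are echoed (dry run); remove `echo` to really copy.
distribute() {
  for host in dn1 dn2 dn3; do
    echo scp -r /home/utoken/software/scala-2.10.5 "utoken@$host:/home/utoken/software/"
    echo scp -r /home/utoken/software/spark-1.3.0-bin-hadoop2.4 "utoken@$host:/home/utoken/software/"
  done
}

distribute
```

With passwordless SSH from section 5.1 in place, the real copies run unattended.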

5.5 Start / stop Spark

Start and stop the whole cluster:

[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-all.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-all.sh

Start and stop the Master separately:

[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-master.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-master.sh

Start and stop the slaves separately:

[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-slaves.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-slaves.sh

Start a single Worker on a slave and specify its port (not required unless there is a port conflict):

[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077 --webui-port 8082

5.6 Start HA mode

Step 1: start the whole cluster on nn1:

[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-all.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-all.sh

Step 2: start the standby Master on nn2:

[utoken@nn2 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-master.sh
[utoken@nn2 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-master.sh

5.7 Check whether it started

Master node:

[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ jps
31246 Master

Slave nodes:

[utoken@dn2 software]$ jps
7734 Worker

5.8 Publish / stop a Driver application

5.8.1 Publish an application:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Parameter description:

--class specifies the main class to execute, e.g. --class com.pride.market.spark.streaming.Bootstrap

--master specifies the master address, e.g. spark://nn1:7077

--deploy-mode specifies the mode in which the Driver runs; the default is client, the options are client and cluster

--supervise: whether to restart the Driver after a failure; limited to Spark standalone mode

--jars lists your application's dependency packages; separate multiple packages with commas

application-jar is your application's jar package. The jar must be visible to all nodes; you can upload it to the HDFS file system and reference it via an hdfs:// path. In cluster mode especially, the jar must be globally visible; in client mode, putting the jar at the end of the command is enough, and it is uploaded to each worker node automatically.

application-arguments are the arguments passed to your main class's main(args).

To see where each configuration option comes from, add the --verbose option for more detailed output.
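Putting the template and flags together, a concrete submission might look as follows; the main class and jar path are hypothetical placeholders, and the master URL follows the cluster plan (nn1:7077). The command is printed rather than executed so it can be inspected first:

```shell
#!/bin/sh
# Sketch: assemble a concrete spark-submit command from the template in
# 5.8.1. Class name and jar path are hypothetical; the master URL matches
# the cluster plan. printf only prints the command (dry run).
build_submit_cmd() {
  printf '%s ' ./bin/spark-submit \
    --class com.example.demo.Bootstrap \
    --master spark://nn1:7077 \
    --deploy-mode cluster \
    --supervise \
    --conf spark.executor.memory=1g \
    /home/utoken/apps/demo.jar arg1 arg2
}

build_submit_cmd
echo
```

Removing the printf wrapper and running the line directly submits the application to the cluster.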

5.8.2 Stop a task (Application/Driver)

1. Stop an application published in client mode:

[utoken@nn1 ~]$ jps
18376 SparkSubmit
[utoken@nn1 ~]$ kill -9 18376

2. Stop an application published in cluster mode:

./bin/spark-class org.apache.spark.deploy.Client kill <master-url> <driver-id>

Here <master-url> is the cluster's Master address, e.g. spark://nn1:7077, and <driver-id> can be found through the Master's port 8080 (http://nn1:8080), under Running Drivers in the Submission ID column.

Example:

[utoken@nn2 sbin]$ spark-class org.apache.spark.deploy.Client kill spark://nn1:7077 driver-20150505151651-0014

View the YARN task list:

# yarn application -list

Kill a YARN application:

[bdata@nn1 ~]$ yarn application -kill applicationId

Detailed commands and examples are on the official website:

http://spark.apache.org/docs/latest/submitting-applications.html

5.9 Enable / disable historical monitoring of Drivers

Each SparkContext launches a web UI that by default displays useful information about the application on port 4040, including:

a list of scheduler stages and tasks

a summary of RDD sizes and memory usage

environment information

information about the running executors

5.9.1 Modify the spark-defaults.conf configuration (without this the logs are not persisted and cannot be viewed once a job finishes)

Append the following options at the end:

# whether to enable event logging; enabled here
spark.eventLog.enabled true

# directory where the Driver tasks' logs are written
spark.eventLog.dir hdfs://dfscluster/sparktst/eventslog

# directory watched by the monitoring page; use it together with the two options above
spark.history.fs.logDirectory hdfs://dfscluster/sparktst/eventslog

Special note: hdfs://dfscluster/sparktst/eventslog is an HDFS directory; create it in advance. The Hadoop HA cluster name dfscluster is used here, so Hadoop's configuration file hdfs-site.xml must be copied into Spark's conf directory, otherwise Spark reports that the cluster name dfscluster cannot be found. For more information, see question 12.

More detailed documentation is on the official website:

http://spark.apache.org/docs/latest/monitoring.html

5.9.2 Start / stop the history service

[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ sbin/start-history-server.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ sbin/stop-history-server.sh

5.10 Hive on Spark configuration

5.10.1 Install and configure Hive according to the "Hive installation configuration" document.

5.10.2 Copy Hive's configuration file hive-site.xml into Spark's configuration directory conf, then distribute it to the whole Spark cluster:

[bdata@bdata4 hadoop]$ for i in {34..38}; do scp hive-site.xml 192.168.10.$i:/u01/spark-1.5.1/conf/; done

5.10.4 Testing

# start spark-shell in cluster mode
[bdata@bdata4 sbin]$ spark-shell --master spark://bdata4:7077

# build a HiveContext
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
16/01/13 17:30:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2f6c736

# run a SQL query; the table YHJHK_IY02 was created in Hive beforehand and its data already loaded, so it can be queried directly
scala> sqlContext.sql("select * from YHJHK_IY02 where AAC002 = '510922197303151981'").collect().foreach(println)

Data omitted.

Other ways to start

Start directly with spark-sql:

[bdata@bdata4 sbin]$ ./spark-sql --master spark://bdata4:7077

Start in YARN mode; Hive integration only supports yarn-client mode, not yarn-cluster:

[bdata@bdata4 sbin]$ ./spark-sql --master yarn-client

Both ways accept execution parameters appended to the command (note that some parameters take effect differently in different clusters), for example:

--driver-memory 1G --driver-cores 2 --executor-memory 4G

or

--driver-memory 1G --driver-cores 2 --executor-cores 4 --num-executors 8 --executor-memory 4G

Note: if the total memory requested by the submitted task exceeds the cluster's free memory, the task will not run, so always check how much total memory the cluster has left, regardless of the number of cores.

6. Description of relevant parameters

For more information, see the configuration table on the official website:

http://spark.apache.org/docs/latest/configuration.html

Driver Application settings (these can also be set in spark-defaults.conf):

# permitted message size: the default is too small and can cause communication failures, so a larger value can be set. This value is the maximum size of a message Spark sends through Akka actors (such as the output of a task), since all messaging in the Spark cluster goes through actors. The default is 10 MB; raise it when processing large-scale data.
spark.akka.frameSize=10

# number of partitions (roughly, the degree of parallelism); recommended to be 2-3 times the number of cores
spark.default.parallelism=9

# whether to automatically unpersist RDDs that are no longer needed (optimizes memory use)
spark.streaming.unpersist=true

# whether to consolidate many small shuffle files into one large file (optimizes disk use)
spark.shuffle.consolidateFiles=true

# put the stream data received within the given number of milliseconds into one block file (optimizes disk use)
spark.streaming.blockInterval=1000

# serialization class; only takes effect when the storage level uses serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
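As a quick way to apply the 2-3x-cores recommendation above, the parallelism for a machine can be derived from its core count; a minimal sketch using the upper end of the range (nproc is assumed available, as on any modern Linux):

```shell
#!/bin/sh
# Sketch: derive spark.default.parallelism from the local core count,
# using the upper end (3x) of the 2-3x-cores recommendation above.
cores=$(nproc)
parallelism=$((cores * 3))
echo "spark.default.parallelism=$parallelism"
```

The printed line can be pasted into spark-defaults.conf; tune the multiplier per workload.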

spark-env.sh

# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1)

Performance optimization

1. Spark has two important levers for performance optimization: data compression and serialization.

Data compression:

Tools: LZF and Snappy are in general use. LZF achieves a higher compression ratio; Snappy compresses faster.

Configuration:

spark-env.sh:

export SPARK_JAVA_OPTS="-Dspark.broadcast.compress"

In-program configuration: conf.set("spark.broadcast.compress", true)

Compression codec setting: spark.io.compression.codec

Serialization:

Purpose: inter-process communication and persisting data to disk

Tools: Java's built-in serialization and Kryo serialization; Kryo is compact, fast, lightweight, extensible, and highly customizable.

Special parameters for applications published to YARN:

--num-executors controls how many executors are allocated to the application; limited to Spark on YARN mode

--executor-memory controls the memory each allocated executor gets, 1G by default; supported in Spark standalone, YARN, and Mesos

--executor-cores controls the number of CPU cores each allocated executor gets

With the parameters above, an application submitted by a user can be kept from consuming too many system resources.

Other:

--supervise: whether to restart the Driver after a failure; limited to Spark standalone mode

--driver-cores NUM: the number of CPU cores used by the Driver program; limited to Spark standalone mode

--total-executor-cores NUM: the total number of executor cores; limited to Spark standalone and Spark on Mesos modes

2. Lineage

Narrow dependency: one parent partition maps to one child partition, or several parent partitions map to only one child partition.

Wide dependency: one parent partition maps to several child partitions, or several parent partitions map to several child partitions.

Prefer narrow dependencies: when data is lost, recomputation is cheap because few parent partitions and short chains are involved. Wide dependencies lead to over-computation: with more dependent parent and child partitions and more complex chains, recomputation costs more; and because the data must be recomputed through the parent partitions, child partition data that was never lost gets recomputed as well, adding a lot of redundant recomputation overhead.

3. Checkpoint

Checkpointing works best on an RDD that is already cached in memory; otherwise the checkpoint itself triggers recomputation and costs performance.

4. Shuffle

The word originally means shuffling cards: breaking a set of ordered data up and recombining it into irregular, random partitions. In Spark, Shuffle does the opposite of its original meaning: it converts a set of irregular data into a set of data with certain rules as far as possible.

5. Reduce repeated jar distribution

Each time Spark submits a task, it ships the corresponding jar to the machines the task runs on, so submitting the same task many times fills the disk with duplicate jars. To reduce this, upload the jars an application needs to HDFS and reference them by address; then every submission of the task refers to the same jar address instead of shipping the jar again, and the cluster's disk usage drops.

6. Monitor the performance of the Java virtual machine (Linux)

Recommended: YourKit, a friendly Java profiling tool.

Special notes (Spark vs. Flink):

1. Stream computing: Flink and Storm support millisecond-level computation; Spark currently (v1.5.1) only supports second-level computation, and can run on clusters of more than 100 nodes. Storm's smallest latency is currently around 100 ms. Flink processes data row by row, while Spark processes small batches of slices (RDDs), so in stream processing Spark inevitably adds some latency.

2. Hadoop compatibility: Flink is more compatible with Hadoop; for example, it can support HBase's native TableMapper and TableReducer. Its only drawback is that it supports only the old MapReduce API, not the new one. Spark supports neither TableMapper nor TableReducer.

3. SQL support: Spark's SQL support is broader than Flink's; in addition, Spark optimizes at the SQL level, while Flink optimizes at the API level.

4. Flink supports automatic optimization and iterative computation.

5. Spark's follow-up advantage is Spark SQL.

6. Spark Streaming's lineage fault-tolerance model keeps the data redundant: if the data comes from HDFS, HDFS keeps three replicas by default; if it comes from the network, each data stream is copied to two other machines for redundant fault tolerance.

This covers the Spark HA deployment plan; the best way to deepen the understanding is to work through the steps in practice.
