Shulou (shulou.com), SLTechnology News & Howtos > Servers, 06/01 Report
This article explains the deployment plan for Spark HA. The method described is simple, fast and practical; follow along to learn how to deploy a Spark HA cluster.
Contents
1. Prepare the server environment
1.1. Server planning
1.2. Software versions
2. Install the ZooKeeper cluster
3. Install the Hadoop 2 HA cluster
4. Install the HBase HA cluster
5. Install the Spark HA cluster
5.1. Initialize the configuration
5.2. Install Spark
5.3. Configure environment variables (execute as root; be sure to switch back to the ordinary user afterwards)
5.3.1. Modify the system environment variables (append)
5.3.2. Modify Spark's environment variables
5.3.3. Modify the slave nodes
5.4. Install the other cluster machines
5.5. Start Spark
5.6. Start HA mode
5.7. Check whether it started
5.8. Publish/stop a Driver
5.9. Enable/disable historical monitoring of the Driver
5.10. Hive on Spark configuration
6. Description of relevant parameters
1. Prepare the server environment
1.1. Server planning
ZooKeeper Cluster

Host       | IP
zookeeper1 | 192.168.50.228
zookeeper2 | 192.168.50.229
zookeeper3 | 192.168.50.230

Spark HA Cluster

Host | IP             | Master | Worker
nn1  | 192.168.50.221 | Y      | N
nn2  | 192.168.50.222 | Y      | N
dn1  | 192.168.50.225 | N      | Y
dn2  | 192.168.50.226 | N      | Y
dn3  | 192.168.50.227 | N      | Y
1.2. Software versions
Linux: CentOS 6 (kernel 2.6.32-431.el6.x86_64)
Hadoop: 2.6.0
ZooKeeper: 3.4.6
JDK/JRE: 1.7.0_75 (64-bit)
Spark: spark-1.3.0-bin-hadoop2.4
2. Install the ZooKeeper cluster (see "ZooKeeper Deployment Document V1.0")
Spark relies on ZooKeeper for Master election, so deploy ZooKeeper first.
3. Install the Hadoop 2 HA cluster (see "Hadoop2 HA Cluster Deployment V1.0")
In standalone mode Spark can run without HDFS; if jobs will be distributed on a YARN cluster, deploy Hadoop first.
4. Install the HBase HA cluster (see "HBase HA Deployment Document V1.0")
If Spark does not store data in HBase, HBase need not be deployed.
5. Install the Spark HA cluster
5.1. Initialize the configuration
1. Modify the hostname.
Step 1: change it temporarily:
# hostname nn1
Step 2: make the change permanent so it survives the next reboot. Modify HOSTNAME in /etc/sysconfig/network:
NETWORKING=yes
HOSTNAME=nn1
Step 3: add a DNS mapping so that the hostname resolves directly to the local IP address. Append a line to /etc/hosts, for example:
192.168.50.221 nn1
Step 4: reboot the machine. After the reboot, run ping nn1; if the ping succeeds, the configuration is complete.
Configure the other machines in the same way.
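The four steps above can be collected into a dry-run sketch that only prints the commands to apply on each node (HOST and IP are example values from the server plan; drop the leading echo and run as root to apply them for real):

```shell
#!/bin/sh
# Dry-run sketch: print the hostname setup commands for one node.
# HOST and IP are example values taken from the server plan above.
HOST=nn1
IP=192.168.50.221
echo "hostname $HOST"                                                   # temporary change
echo "sed -i 's/^HOSTNAME=.*/HOSTNAME=$HOST/' /etc/sysconfig/network"   # permanent change
echo "echo '$IP $HOST' >> /etc/hosts"                                   # DNS mapping
echo "ping -c 1 $HOST"                                                  # verify after reboot
```

Repeating the sketch with each node's name and IP covers the whole cluster.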
2. Turn off the firewall: service iptables stop
Also disable the firewall's automatic start: chkconfig iptables off
Check that automatic start is disabled: chkconfig --list | grep iptables (all entries should read "off").
Check the status:
# service iptables status
Firewall is stopped.
3. Create an application account and group (optional; a dedicated user is recommended)
For system security, it is recommended that each external application get its own account and group.
# create a new group
[root@nn2 ~]# groupadd bdata
# add a user in that group
[root@nn2 ~]# useradd -g bdata bdata
# set the password
[root@nn2 ~]# passwd bdata
Changing password for user bdata.
New password:
BAD PASSWORD: it does not contain enough DIFFERENT characters
BAD PASSWORD: is a palindrome
Retype new password:
passwd: all authentication tokens updated successfully.
4. Set up SSH
As root, modify /etc/ssh/sshd_config:
# vim /etc/ssh/sshd_config
Uncomment the following lines:
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
After modifying, restart sshd: service sshd restart
Switch to the ordinary user for the remaining settings.
Generate an SSH key pair; with it, the cluster daemons can log in between nodes without a password at startup.
a. On every machine run ssh-keygen -t rsa -P "". The passphrase is left empty because authentication happens every time Hadoop starts; having to type a password each time would be tedious. After execution, two files are generated in the ~/.ssh folder.
b. Then run ssh-copy-id userName@machineName against each of the other machines; this command quickly sends your public key to the peer and appends it to its authorized_keys automatically.
[root@nn1 ~]# ssh nn1
After logging in, exit leaves the SSH session.
Note: if you are still prompted for a password, it is usually a permission problem on authorized_keys; tighten it with chmod 600 ~/.ssh/authorized_keys.
Once your public key has been sent to the other machines, you can log in to them without a password.
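The key distribution in step b can be sketched as a loop over the nodes from the server plan (a dry run that only prints the commands; the user name bdata is an example, and dropping the echo runs them for real):

```shell
#!/bin/sh
# Dry-run sketch: print the key-distribution command for every node
# in the server plan. The user name bdata is an example.
NODES="nn1 nn2 dn1 dn2 dn3"
for host in $NODES; do
  echo "ssh-copy-id bdata@$host"
done
```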
5. Install JDK/JRE
Spark is written in Scala, so a JRE or JDK must be installed before it can run. For testing convenience, installing the JDK is recommended (a JRE is enough in production); JDK installation steps are omitted here.
5.2. Install Spark
Download address:
http://mirrors.cnnic.cn/apache/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz
Since Spark is written in Scala, the Scala library is also required. Download address:
http://downloads.typesafe.com/scala/2.10.5/scala-2.10.5.tgz?_ga=1.126203213.1556481652.1427182777
5.3. Configure environment variables (execute as root; be sure to switch back to the ordinary user afterwards)
5.3.1. Modify the system environment variables (append)
# vim /etc/profile
export JAVA_HOME=/home/utoken/software/jdk1.7.0_75
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export SCALA_HOME=/home/utoken/software/scala-2.10.5
export SPARK_HOME=/home/utoken/software/spark-1.3.0-bin-hadoop2.4
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin
5.3.2. Modify Spark's environment variables
5.3.2.1. Modify Spark's environment variable configuration file
# vim spark-env.sh
# import the environment variables Spark needs at runtime
export JAVA_HOME=/home/utoken/software/jdk1.7.0_75
# Why JAVA_HOME must be configured here even though it is already set in ~/.bash_profile on every machine: start-all.sh logs in to each slave machine through ssh and then starts the worker process there. ~/.bash_profile is only executed for login shells, and the ssh session used here is a non-login shell, so the file is not sourced and the worker cannot find JAVA_HOME at startup. Solution: put the variables in .bashrc, which is executed whenever a shell starts, or set them in this file.
# With multiple Masters, do not set SPARK_MASTER_IP, otherwise the additional Masters cannot start; the master address can instead be specified per Application.
# export SPARK_MASTER_IP=nn1
# amount of memory each Worker may use (global)
export SPARK_WORKER_MEMORY=5g
# directory holding the files of tasks Spark has executed
export SPARK_WORKER_DIR=/home/utoken/datadir/spark/work
# directory for shuffle output and other small temporary files; watch the number of open file handles here
export SPARK_LOCAL_DIRS=/home/utoken/datadir/spark/tmp
# use ZooKeeper to provide HA; import the corresponding options
export SPARK_DAEMON_JAVA_OPTS="-Dsun.io.serialization.extendedDebugInfo=true -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181 -Dspark.deploy.zookeeper.dir=/spark"
Or set the options one at a time, appending each to SPARK_DAEMON_JAVA_OPTS:
# recovery mode; ZooKeeper mode is used here (the default is NONE)
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Dspark.deploy.recoveryMode=ZOOKEEPER"
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Dspark.deploy.zookeeper.url=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181"
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Dspark.deploy.zookeeper.dir=/spark"
Options:
spark.deploy.recoveryMode: recovery mode (how the Master recovers); three values: ZOOKEEPER, FILESYSTEM, NONE (default NONE)
spark.deploy.zookeeper.url: server addresses of the ZooKeeper ensemble
spark.deploy.zookeeper.dir: ZooKeeper directory (here /spark) holding the cluster metadata, including Workers, Drivers and Applications
# needed for Spark on YARN cluster mode; standalone cluster mode does not need these
export HADOOP_HOME=/home/utoken/software/hadoop-2.5.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
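As a sanity check, the HA-related fragment above can first be written into a scratch directory and reviewed before copying it into $SPARK_HOME/conf. A minimal sketch, using the example paths from this document:

```shell
#!/bin/sh
# Sketch: write the spark-env.sh HA fragment to a scratch directory for review.
# Paths and ZooKeeper addresses are the example values used in this document.
CONF=$(mktemp -d)
cat > "$CONF/spark-env.sh" <<'EOF'
export JAVA_HOME=/home/utoken/software/jdk1.7.0_75
export SPARK_WORKER_MEMORY=5g
export SPARK_LOCAL_DIRS=/home/utoken/datadir/spark/tmp
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181 -Dspark.deploy.zookeeper.dir=/spark"
EOF
# count the export lines as a quick consistency check
N=$(grep -c '^export' "$CONF/spark-env.sh")
echo "$N export lines written to $CONF/spark-env.sh"
```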
5.3.2.2. Modify Spark's global configuration for each Driver application
# vim spark-defaults.conf (effective for every Driver)
About the spark-defaults.conf file
Scope: edit spark-defaults.conf on the machine where the driver runs. The file affects the applications submitted and run by that driver, and the startup parameters of the executors that provide the applications' computing resources. So it only needs to be edited on the driver's machine, not on the machines running the workers or the master.
Configuration example:
spark.executor.extraJavaOptions  -XX:MaxPermSize=1000m
spark.executor.memory            1g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.cores.max                  32
spark.shuffle.manager            SORT
spark.driver.memory              2g
spark.shuffle.consolidateFiles   true
Details:
spark.executor.extraJavaOptions sets extra Java options; here the maximum permanent generation is set to 1000m.
spark.executor.memory specifies the heap memory each executor gets at startup (taken out of the Worker's memory); here 1g, default 512m.
spark.serializer specifies the serializer class, here org.apache.spark.serializer.KryoSerializer.
spark.cores.max specifies the maximum number of cores each application may use; here 32, default all cores the system owns.
spark.shuffle.manager specifies the shuffle implementation; SORT is specified here. Sorting reduces the number of temporary files at a small performance cost.
spark.driver.memory specifies the amount of memory used by the Driver itself, here 2g.
spark.shuffle.consolidateFiles merges the many small files produced by map tasks into one file, so fewer small files are left behind after the reduce.
The above properties can also be set per application in SparkConf and tuned to each machine's capabilities.
Property precedence in Spark: SparkConf > command-line parameters > configuration file.
5.3.3. Modify the slave nodes
# vim slaves
dn1
dn2
dn3
5.4. Install the other cluster machines
Distribute Scala quickly:
scp -r /home/utoken/software/scala-2.10.5 utoken@dn1:/home/utoken/software/
Distribute Spark quickly:
scp -r spark-1.3.0-bin-hadoop2.4 utoken@dn1:/home/utoken/software/
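The same distribution can be looped over all worker nodes. A dry-run sketch that only prints the scp commands (node names from the server plan; drop the echo to copy for real):

```shell
#!/bin/sh
# Dry-run sketch: print the scp commands that distribute Scala and Spark
# to every worker node listed in the server plan.
WORKERS="dn1 dn2 dn3"
for host in $WORKERS; do
  echo "scp -r /home/utoken/software/scala-2.10.5 utoken@$host:/home/utoken/software/"
  echo "scp -r /home/utoken/software/spark-1.3.0-bin-hadoop2.4 utoken@$host:/home/utoken/software/"
done
```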
5.5. Start/stop Spark
Start and stop the whole cluster:
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-all.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-all.sh
Start and stop the Master separately:
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-master.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-master.sh
Start the slaves separately:
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-slaves.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-slaves.sh
Start a single Worker on a slave and specify its port (not required unless there is a port conflict):
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077 --webui-port 8082
5.6. Start HA mode
Step 1: start the whole cluster on the primary Master:
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-all.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-all.sh
Step 2: start the standby Spark Master on the second node:
[utoken@nn2 spark-1.3.0-bin-hadoop2.4]$ ./sbin/start-master.sh
[utoken@nn2 spark-1.3.0-bin-hadoop2.4]$ ./sbin/stop-master.sh
5.7. Check whether it started
Master node:
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ jps
31246 Master
Slave nodes:
[utoken@dn2 software]$ jps
7734 Worker
5.8. Publish/stop a Driver application
5.8.1. Publish an application:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Parameter description:
--class specifies the main class to execute, e.g. --class com.pride.market.spark.streaming.Bootstrap
--master specifies the master address, e.g. spark://nn1:7077
--deploy-mode specifies the mode the Driver runs in; the default is client, and the two options are client and cluster.
--supervise restarts the Driver after a failure; limited to Spark standalone mode.
--jars lists your application's dependency packages; separate multiple packages with commas.
application-jar is the jar package of your application. The jar must be visible on all nodes; you can upload it to HDFS and reference it as hdfs://...jar. In cluster mode in particular it must be globally visible; in client mode the jar given at the end of the command is automatically shipped to each worker node.
application-arguments are the arguments passed to your main class's main(args).
To see where each configuration option comes from, add the --verbose option to produce more detailed run information.
5.8.2. Stop a task (Application/Driver)
1. Stop an application published in client mode:
[utoken@nn1 ~]$ jps
18376 SparkSubmit
[utoken@nn1 ~]$ kill -9 18376
2. Stop an application published in cluster mode:
./bin/spark-class org.apache.spark.deploy.Client kill <master-url> <driver-ID>
where the master URL is the cluster's Master address, e.g. spark://nn1:7077.
The Driver ID can be found through port 8080 of the Master, address http://nn1:8080; look for the Submission ID under Running Drivers.
For example:
[utoken@nn2 sbin]$ spark-class org.apache.spark.deploy.Client kill spark://nn1:7077 driver-20150505151651-0014
View the YARN task list:
# yarn application -list
Kill a YARN application:
[bdata@nn1 ~]$ yarn application -kill <applicationId>
Detailed commands and examples are on the official website:
http://spark.apache.org/docs/latest/submitting-applications.html
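Putting the spark-submit options together, a sketch that assembles and prints a complete command line (the class name com.example.Main, the jar path and the argument are hypothetical placeholders; the command is only printed, not executed):

```shell
#!/bin/sh
# Sketch: assemble and print a spark-submit command line.
# com.example.Main, /path/to/app.jar and arg1 are hypothetical placeholders.
MASTER=spark://nn1:7077
CMD="./bin/spark-submit --class com.example.Main --master $MASTER --deploy-mode cluster --supervise /path/to/app.jar arg1"
echo "$CMD"
```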
5.9. Enable/disable historical monitoring of the Driver
Each SparkContext launches a web UI, by default on port 4040, displaying useful information about the application, including:
a list of scheduler stages and tasks
an overview of RDD sizes and memory usage
environment information
information about the running executors
5.9.1. Modify the spark-defaults.conf configuration (without it the log is not persisted, and once a run finishes its log can no longer be viewed)
Append the following options:
# whether to enable event logging; enabled here
spark.eventLog.enabled true
# directory where the Driver task's logs are generated
spark.eventLog.dir hdfs://dfscluster/sparktst/eventslog
# directory the monitoring page watches; must be used together with the two options above
spark.history.fs.logDirectory hdfs://dfscluster/sparktst/eventslog
Special note: hdfs://dfscluster/sparktst/eventslog is an HDFS directory, so create it in advance. The Hadoop HA cluster name dfscluster is used here, so Hadoop's configuration file hdfs-site.xml must be copied into Spark's conf directory, otherwise the cluster name dfscluster cannot be resolved. See question 12 for details.
More detailed documentation is on the official website:
http://spark.apache.org/docs/latest/monitoring.html
5.9.2. Start/stop the history service
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ sbin/start-history-server.sh
[utoken@nn1 spark-1.3.0-bin-hadoop2.4]$ sbin/stop-history-server.sh
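Before starting the history server, the event-log settings from 5.9.1 can be sanity-checked on a scratch copy: the event-log directory and the history server's watched directory must agree. A minimal sketch using the example paths from this document:

```shell
#!/bin/sh
# Sketch: write the event-log fragment from 5.9.1 to a scratch file and
# check that the event-log and history directories agree.
CONF=$(mktemp -d)
cat > "$CONF/spark-defaults.conf" <<'EOF'
spark.eventLog.enabled true
spark.eventLog.dir hdfs://dfscluster/sparktst/eventslog
spark.history.fs.logDirectory hdfs://dfscluster/sparktst/eventslog
EOF
EVT=$(awk '$1=="spark.eventLog.dir"{print $2}' "$CONF/spark-defaults.conf")
HIS=$(awk '$1=="spark.history.fs.logDirectory"{print $2}' "$CONF/spark-defaults.conf")
[ "$EVT" = "$HIS" ] && echo "event log and history directories match"
```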
5.10. Hive on Spark configuration
5.10.1. Install and configure Hive according to the "Hive Installation Configuration" document.
5.10.2. Copy Hive's configuration file hive-site.xml into Spark's configuration directory conf, then distribute it to the whole Spark cluster:
[bdata@bdata4 hadoop]$ for i in {34,35,36,37,38}; do scp hive-site.xml 192.168.10.$i:/u01/spark-1.5.1/conf/; done
5.10.4. Testing
# start spark-shell in cluster mode
[bdata@bdata4 sbin]$ spark-shell --master spark://bdata4:7077
# build a HiveContext
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
16/01/13 17:30:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2f6c736
# run an SQL query; the table YHJHK_IY02 was created in Hive in advance and its data already loaded, so it can be queried directly
scala> sqlContext.sql("select * from YHJHK_IY02 where AAC002 = '510922197303151981'").collect().foreach(println)
Data output omitted.
Other startup methods:
Start directly in spark-sql mode:
[bdata@bdata4 sbin]$ ./spark-sql --master spark://bdata4:7077
Start in YARN mode; integration with Hive only supports yarn-client mode, not yarn-cluster:
[bdata@bdata4 sbin]$ ./spark-sql --master yarn-client
Both methods accept execution parameters appended to the command (note that some parameters take effect differently in different clusters), e.g.:
--driver-memory 1G --driver-cores 2 --executor-memory 4G
or
--driver-memory 1G --driver-cores 2 --executor-cores 4 --num-executors 8 --executor-memory 4G
Note: if the total memory of the published task exceeds the physical machines' total memory, the task will not execute, so always check the cluster's remaining total memory, regardless of the number of cores.
6. Description of relevant parameters
More information is in the configuration table on the official website:
http://spark.apache.org/docs/latest/configuration.html
Driver application settings (these can also be set in spark-defaults.conf):
# permissible message size. The default is too small for some workloads and causes communication failures, so a larger value can be set. This value is the maximum size of the messages Spark passes through Akka actors (such as task outputs), since all messaging in a Spark cluster goes through actors. The default is 10m; increase it when processing large-scale data.
spark.akka.frameSize=10
# number of partitions (roughly, the degree of parallelism); recommended at 2-3 times the number of cores
spark.default.parallelism=9
# whether to automatically unpersist RDDs that will not be used again (optimizes memory usage)
spark.streaming.unpersist=true
# whether to merge multiple shuffle small files into one large file (optimizes disk usage)
spark.shuffle.consolidateFiles=true
# put the stream data received within the specified number of milliseconds into one block file (optimizes disk usage)
spark.streaming.blockInterval=1000
# serializer class; takes effect for Spark storage only when serialized storage is configured
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark-env.sh:
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1)
Performance optimization
1. Two important levers for performance optimization in Spark are data compression and serialization.
Data compression:
Tools: LZF and Snappy are generally used; LZF has a higher compression ratio, while Snappy compresses faster.
Configuration:
spark-env.sh:
export SPARK_JAVA_OPTS="-Dspark.broadcast.compress"
In-program configuration: conf.set("spark.broadcast.compress", true)
Configure the compression codec with: spark.io.compression.codec
Serialization:
Purpose: inter-process communication and persisting data to disk
Tools: Java's built-in serialization and Kryo serialization; Kryo is compact, fast, lightweight, extensible and highly customizable.
Several special parameters that need to be set for applications published to YARN:
--num-executors controls how many executors are allocated to the application; limited to Spark on YARN mode
--executor-memory controls the memory of each executor allocated to the application; default 1G; supported in Spark standalone, YARN and Mesos modes
--executor-cores controls the number of CPU cores of each executor allocated to the application
With the above parameters you can keep a submitted application from consuming too many system resources.
Others:
--supervise restarts the Driver after a failure; limited to Spark standalone mode
--driver-cores NUM sets the number of CPU cores used by the Driver program; limited to Spark standalone mode
--total-executor-cores NUM sets the total number of cores used by the executors; limited to Spark standalone and Spark on Mesos modes
2. Lineage
Narrow dependency: one parent partition corresponds to one child partition, or multiple parent partitions correspond to only one child partition.
Wide dependency: one parent partition corresponds to multiple child partitions, or multiple parent partitions correspond to multiple child partitions.
Narrow dependencies are preferred because recomputation after a failure is cheap, while wide dependencies lead to overcomputation: there are more dependent parent and child partitions and the chains are complex, so recomputation costs more. Moreover, since data must be recomputed through the parent partitions, child partition data that was never lost also gets recomputed, adding a lot of redundant recomputation overhead.
3. Checkpoint
It is best to checkpoint an RDD that is already cached in memory; otherwise a failure forces recomputation and costs performance.
4. Shuffle
The word originally means shuffling cards: scattering a set of ordered data into randomly partitioned disorder. In Spark, Shuffle does the opposite of its original meaning: it converts a set of irregularly partitioned data into data partitioned by some rule as far as possible.
5. Reduce repeated jar package copying and distribution
Every time Spark submits a task it ships the corresponding jar to the machines running the task, so submitting the same task many times stores duplicate jars and wastes disk space. To reduce this, upload the jars the application needs to HDFS and reference them by address; then repeated submissions all reference the same jar address, the jar is no longer shipped on every release, and disk usage drops.
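A dry-run sketch of this workflow: upload the jar once, then submit referencing its HDFS address (all paths and names here are hypothetical placeholders; the commands are printed, not executed):

```shell
#!/bin/sh
# Dry-run sketch: upload the application jar to HDFS once, then submit
# jobs referencing that address. Paths and class name are hypothetical.
JAR=hdfs://dfscluster/apps/myapp.jar
echo "hdfs dfs -mkdir -p /apps"
echo "hdfs dfs -put myapp.jar /apps/"
echo "./bin/spark-submit --class com.example.Main --master spark://nn1:7077 --deploy-mode cluster $JAR"
```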
6. Monitor the JVM's performance (Linux)
Recommendation: YourKit, a friendly Java profiling tool.
Special notes (Spark vs. Flink):
1. Stream computing: Flink and Storm support millisecond-level computing; Spark currently (v1.5.1) only supports second-level computing, but can run on clusters of more than 100 nodes. Storm's smallest latency is currently around 100 ms.
Flink processes row by row, while Spark processes small batches of slices (RDDs), so in stream processing Spark inevitably adds some latency.
2. Hadoop compatibility: Flink is more compatible with Hadoop; for example, it can support native HBase TableMapper and TableReducer. The only drawback is that only the old MapReduce API is supported, not the new one. Spark does not support the TableMapper and TableReducer methods.
3. SQL support: Spark's SQL support is broader than Flink's; in addition, Spark supports optimization of SQL, while Flink supports optimization at the API level.
4. Flink supports automatic optimization and iterative computation.
5. Spark's follow-up advantage: Spark SQL.
6. Spark Streaming's lineage fault tolerance keeps data redundantly. If the data comes from HDFS, HDFS's default three replicas provide the redundancy; if the data comes from the network, each data stream is copied to two other machines for redundant fault tolerance.
This concludes the explanation of the Spark HA deployment plan; try it out in practice.