This article introduces how to build a Spark cluster. Many people have questions about the process in day-to-day work, so the steps below are organized into a simple, easy-to-follow procedure; I hope it helps clear up any doubts about building a Spark cluster.
Spark is a unified computing engine for large-scale data processing. It covers scenarios that previously required several different distributed platforms, including batch processing, iterative computation, interactive queries, and stream processing, and integrates them all under a single framework.
1 Unzip the File
Upload the spark-3.0.0-bin-hadoop3.2.tgz file to the Linux machine and extract it to the target location.
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module
cd /opt/module
mv spark-3.0.0-bin-hadoop3.2 spark-standalone

2 Modify the Configuration Files
1) Enter the conf directory under the extracted directory and rename slaves.template to slaves
mv slaves.template slaves
2) Edit the slaves file and add the worker nodes
hadoop102
hadoop103
hadoop104
3) Rename spark-env.sh.template to spark-env.sh
mv spark-env.sh.template spark-env.sh
4) Edit the spark-env.sh file, adding the JAVA_HOME environment variable and the cluster's master node
export JAVA_HOME=/opt/module/jdk1.8.0_212
SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077
Note: port 7077 here plays the same role as port 8020 does for internal communication in Hadoop 3.x. Check the port against your own virtual machine's configuration.
5) Distribute the spark-standalone directory to the other nodes. An rsync-based sketch of the xsync command appears at the end of this section, in case your environment does not already provide it.
xsync spark-standalone
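The xsync command used above is not a standard Linux tool; in tutorials of this kind it is usually a small rsync wrapper placed on the PATH of every node. If your environment does not already provide it, a minimal sketch might look like the following (assuming passwordless ssh, rsync installed everywhere, and that it is run from hadoop102, so only hadoop103 and hadoop104 are targets; adjust the host list to your own cluster):

#!/bin/bash
# xsync: copy files or directories to the other cluster nodes with rsync.
# Hypothetical helper script; host names match this tutorial's cluster.
if [ $# -lt 1 ]; then
    echo "Usage: xsync <path> [<path> ...]"
    exit 1
fi
for host in hadoop103 hadoop104; do
    echo "==================== $host ===================="
    for file in "$@"; do
        if [ -e "$file" ]; then
            # Mirror the local parent directory on the remote host.
            pdir=$(cd -P "$(dirname "$file")" && pwd)
            fname=$(basename "$file")
            ssh "$host" "mkdir -p $pdir"
            rsync -av "$pdir/$fname" "$host:$pdir"
        else
            echo "$file does not exist!"
        fi
    done
done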
3 Start the Cluster
1) Execute the startup script:
sbin/start-all.sh
2) Check the running Java processes on the three servers (a jpsall-style helper script is sketched at the end of this section):
================ hadoop102 ================
3330 Jps
3238 Worker
3163 Master
================ hadoop103 ================
2966 Jps
2908 Worker
================ hadoop104 ================
2978 Worker
3036 Jps
3) View the Master resource-monitoring web UI: http://hadoop102:8080
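To avoid logging in to each server just to run jps, the check in step 2) can be wrapped in a small helper script. This is only a convenience sketch, not part of the Spark distribution; it assumes passwordless ssh between the nodes and that jps is on the PATH of a non-interactive shell (for example, exported via /etc/profile.d):

#!/bin/bash
# jpsall: show the Java processes on every node of the cluster.
for host in hadoop102 hadoop103 hadoop104; do
    echo "================ $host ================"
    ssh "$host" jps
done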
4 Submit an Application
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
--class specifies the main class of the program to run.
--master spark://hadoop102:7077 selects standalone deployment mode and connects to the Spark cluster.
spark-examples_2.12-3.0.0.jar is the jar package containing the class to run.
The number 10 is an argument passed to the program itself; here it sets the number of tasks for the application.
While a task is executing, multiple Java processes are created.
When executing tasks, all cores of the cluster's server nodes are used by default, with 1024 MB of memory per node.
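If the defaults described above are not what you want, the resources for a submission can be set explicitly. The flags below are standard spark-submit options for standalone mode; the values are only illustrative:

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

Here --executor-memory caps the memory of each executor and --total-executor-cores caps the total number of cores the application may take from the cluster.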
5 Configure the History Server
Because the application UI at hadoop102:4040 is no longer reachable once spark-shell stops, the cluster monitoring page cannot show how historical tasks ran. A history server is therefore configured during development to record task execution.
1) Rename spark-defaults.conf.template to spark-defaults.conf
mv spark-defaults.conf.template spark-defaults.conf
2) Edit the spark-defaults.conf file and configure the event log storage path
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs://hadoop102:8020/directory
Note: the Hadoop cluster must be running and the /directory path on HDFS must already exist.
sbin/start-dfs.sh
hadoop fs -mkdir /directory
3) Edit the spark-env.sh file and add the log configuration
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory -Dspark.history.retainedApplications=30"
Note: write this on a single line, with the options separated by spaces.
Parameter 1: the history server web UI is accessed on port 18080.
Parameter 2: the storage path of the history server logs.
Parameter 3: the number of Application history records to keep. When this number is exceeded, the oldest application information is deleted; this limits how many applications are held in memory, not how many are displayed on the page.
4) Distribute the configuration files
xsync conf
5) Restart the cluster and start the history server
sbin/start-all.sh
sbin/start-history-server.sh
6) Re-run the task
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
7) View the history server: http://hadoop102:18080. A command-line check is sketched below.
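Besides the web page, the history server also exposes a monitoring REST API on the same port, which is handy for a quick command-line check (assuming curl is installed):

# List the completed applications recorded by the history server (JSON output).
curl http://hadoop102:18080/api/v1/applications

If the submitted SparkPi run appears in the returned list, event logging and the history server are working.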
6 Configure High Availability (HA)
High availability is needed because the cluster currently has only one Master node, which creates a single point of failure. To solve this, several Master nodes are configured in the cluster: once the active Master fails, a standby Master takes over and provides service so that jobs can continue to run. High availability is generally coordinated through Zookeeper.
Cluster planning: hadoop102 runs the active Master and a Worker, hadoop103 runs the standby Master and a Worker, hadoop104 runs a Worker; Zookeeper runs on all three nodes.
1) Stop the cluster
sbin/stop-all.sh
2) Start Zookeeper
3) Edit the spark-env.sh file and add the following configuration
Comment out the following:
#SPARK_MASTER_HOST=hadoop102
#SPARK_MASTER_PORT=7077

Add the following:
# The default port of the Master monitoring page is 8080, which conflicts with Zookeeper here, so change it to 8989 (any custom port works). Keep this port in mind when visiting the UI monitoring page.
SPARK_MASTER_WEBUI_PORT=8989
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop102,hadoop103,hadoop104 -Dspark.deploy.zookeeper.dir=/spark"

Note: the export must be written on a single line, with the options separated by spaces.
4) Distribute the configuration files
xsync conf/
5) Start the cluster
sbin/start-all.sh
6) Start a separate Master on hadoop103; the Master on hadoop103 will be in standby state.
[bigdata@hadoop103 spark-standalone]$ sbin/start-master.sh
7) Submit the application to the high-availability cluster
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077,hadoop103:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
8) Stop the Master process on hadoop102
9) Check the Master resource-monitoring web UI of hadoop103; after a short while, the Master on hadoop103 is promoted to active. A scripted version of this failover check is sketched below.
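Steps 8) and 9) can be combined into a quick failover test. The sketch below is only one way to do it: it assumes the Master on hadoop102 was started with the sbin scripts, that 8989 is the web UI port configured above, and it greps the UI page for the status line as a rough check:

# On hadoop102: stop the active Master to simulate a failure.
sbin/stop-master.sh

# Zookeeper-based failover usually takes a minute or two; afterwards the
# status shown on hadoop103's Master web UI should change from STANDBY to ALIVE.
sleep 120
curl -s http://hadoop103:8989 | grep -i "status"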
This concludes the walkthrough of building a Spark cluster. I hope it has answered your questions; pairing the theory with hands-on practice is the best way to learn, so give it a try.