
The Method of Building a Spark Cluster

This article mainly introduces the method of building a Spark cluster. Many people have doubts about it in their daily work, so the editor has consulted a variety of materials and put together a simple, easy-to-follow procedure. I hope it helps answer your questions about building a Spark cluster. Now, please follow the editor through the steps!

Spark is a unified computing engine for large-scale data processing. It covers a variety of scenarios that previously required several different distributed platforms, including batch processing, iterative computing, interactive queries and stream processing, integrating them all through one unified framework.

1 Extract the file

Upload the spark-3.0.0-bin-hadoop3.2.tgz file to Linux and extract it to the specified location:

tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module
cd /opt/module
mv spark-3.0.0-bin-hadoop3.2 spark-standalone

2 Modify the configuration files

1) Enter the conf directory under the extracted path and rename the slaves.template file to slaves

mv slaves.template slaves

2) Modify the slaves file and add the worker nodes

hadoop102
hadoop103
hadoop104

3) Rename the spark-env.sh.template file to spark-env.sh

mv spark-env.sh.template spark-env.sh

4) Modify the spark-env.sh file to add the JAVA_HOME environment variable and the master node of the cluster

export JAVA_HOME=/opt/module/jdk1.8.0_212
SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077

Note: port 7077 plays the same role here that port 8020 plays for internal communication in Hadoop 3.x. Check the port against your own machine's configuration.
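
One way to check your own Hadoop port is to look at the fs.defaultFS value in core-site.xml, which shows the NameNode's RPC port (a minimal sketch; the installation path /opt/module/hadoop-3.1.3 is an assumption, adjust it to your environment):

# Path is an example; point it at your own Hadoop installation
grep -A 1 'fs.defaultFS' /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml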

5) Distribute the spark-standalone directory

xsync spark-standalone
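
xsync here is a custom cluster-distribution script; if your environment does not have it, plain rsync to each worker achieves the same result (a sketch assuming passwordless SSH and the same /opt/module layout on every node):

# Hypothetical replacement for xsync: copy the directory to the other nodes
for host in hadoop103 hadoop104; do
  rsync -av /opt/module/spark-standalone/ ${host}:/opt/module/spark-standalone/
done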

3 Start the cluster

1) Execute the startup script:

sbin/start-all.sh

2) Check the running processes on the three servers

================ hadoop102 ================
3330 Jps
3238 Worker
3163 Master
================ hadoop103 ================
2966 Jps
2908 Worker
================ hadoop104 ================
2978 Worker
3036 Jps

3) View the Master's resource monitoring Web UI: http://hadoop102:8080

4 Submit an application

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

--class specifies the main class of the program to be executed

--master spark://hadoop102:7077 specifies standalone deployment mode and the Spark cluster to connect to

spark-examples_2.12-3.0.0.jar is the jar package containing the class to run

The number 10 is the program's input argument, used to set the number of tasks for the current application

When a task is executed, multiple Java processes are generated

When executing tasks in standalone mode, the total number of cores across the server cluster nodes is used by default, with 1024 MB of memory per node.
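
If you do not want these defaults, spark-submit also accepts explicit resource flags (a sketch; the memory and core values below are only illustrative):

# 2G per executor and 4 total cores are example values
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 2G \
--total-executor-cores 4 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10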

5 Configure the history server

Because the cluster monitoring page at hadoop102:4040 can no longer show historical tasks once spark-shell is stopped, a history server is configured during development to record how tasks ran.

1) Rename the spark-defaults.conf.template file to spark-defaults.conf

mv spark-defaults.conf.template spark-defaults.conf

2) Modify the spark-defaults.conf file and configure the log storage path

spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop102:8020/directory

Note: the Hadoop cluster must be started and the /directory path on HDFS must exist ahead of time.

sbin/start-dfs.sh
hadoop fs -mkdir /directory
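
As a quick optional check that the directory now exists:

hadoop fs -ls /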

3) Modify the spark-env.sh file and add the log configuration

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory -Dspark.history.retainedApplications=30"

Note: write this on one line, with the options separated by spaces.

Parameter 1: the port number for Web UI access is 18080.

Parameter 2: specifies the history server's log storage path.

Parameter 3: specifies the number of Application history records to keep. When this value is exceeded, information about the oldest applications is deleted. This is the number of applications held in memory, not the number displayed on the page.

4) Distribute the configuration files

xsync conf

5) Restart the cluster and start the history server

sbin/start-all.sh
sbin/start-history-server.sh

6) Re-execute the task

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

7) View the history server: http://hadoop102:18080

6 Configure high availability (HA)

High availability is needed because the current cluster has only one Master node, which creates a single point of failure. To solve this, multiple Master nodes are configured in the cluster: once the active Master fails, a standby Master takes over so that jobs can continue to execute. High availability here is generally coordinated through Zookeeper.

Cluster planning: Zookeeper and a Worker run on each of hadoop102, hadoop103 and hadoop104, and a Master runs on both hadoop102 and hadoop103.

1) Stop the cluster

sbin/stop-all.sh

2) Start Zookeeper
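
If Zookeeper is not started through a cluster script, it can be started on each of hadoop102, hadoop103 and hadoop104 individually (a sketch; the installation path /opt/module/zookeeper is an assumption, adjust it to your environment):

# Run on every node; the path is an example
/opt/module/zookeeper/bin/zkServer.sh start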

3) Modify the spark-env.sh file and add the following configuration

Comment out the following:

#SPARK_MASTER_HOST=hadoop102
#SPARK_MASTER_PORT=7077

Add the following:

# The default access port of the Master monitoring page is 8080, but it conflicts with Zookeeper, so change it to 8989 or another custom value; keep this in mind when visiting the UI monitoring page
SPARK_MASTER_WEBUI_PORT=8989
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop102,hadoop103,hadoop104 -Dspark.deploy.zookeeper.dir=/spark"

Note: write the export on one line, with the options separated by spaces.

4) Distribute the configuration files

xsync conf/

5) Start the cluster

sbin/start-all.sh

6) Start a separate Master node on hadoop103; the Master on hadoop103 will be in standby state.

[bigdata@hadoop103 spark-standalone]$ sbin/start-master.sh

7) Submit the application to the high-availability cluster

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077,hadoop103:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

8) Stop the Master resource monitoring process on hadoop102
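
One way to do this is to run the stop script on hadoop102 (a sketch; finding the Master PID with jps and killing it works just as well):

[bigdata@hadoop102 spark-standalone]$ sbin/stop-master.sh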

9) Check the Master resource monitoring Web UI of hadoop103; after a short period of time, the Master on hadoop103 is promoted to the active state.

At this point, the study of "the method of building a Spark cluster" is over. I hope it has helped resolve your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
