How to Implement Standalone Mode and Its Configuration in the Spark Big Data Running Environment


Today I will talk to you about how to set up Standalone mode and its configuration in the Spark big data running environment. Many people may not understand this well, so to help you understand it better, I have summarized the following content. I hope you can gain something from this article.

Spark big data running environment: Standalone mode and related configuration

Standalone mode

Here we look at the cluster mode that uses only Spark's own nodes, namely the standalone deployment (Standalone) mode. Spark's Standalone mode follows the classic master-slave architecture.

Cluster planning:

hadoop102: Master, Worker
hadoop103: Worker
hadoop104: Worker

1 Unzip the file

Upload the spark-3.0.0-bin-hadoop3.2.tgz file to Linux and extract it to the specified location.

tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module
cd /opt/module
mv spark-3.0.0-bin-hadoop3.2 spark-standalone

2 Modify the configuration files

1) Enter the conf directory of the unzipped path and rename the slaves.template file to slaves

mv slaves.template slaves

2) Modify the slaves file and add the worker nodes

hadoop102
hadoop103
hadoop104

3) Rename the spark-env.sh.template file to spark-env.sh

mv spark-env.sh.template spark-env.sh

4) Modify the spark-env.sh file to add the JAVA_HOME environment variable and the master node of the cluster

export JAVA_HOME=/opt/module/jdk1.8.0_212
SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077

Note: port 7077 plays the same role as port 8020 in Hadoop 3.x, i.e. internal communication. Confirm the port against your own virtual machine configuration.

5) Distribute the spark-standalone directory

xsync spark-standalone

3 Start the cluster

1) Execute the startup script:

sbin/start-all.sh

2) View the running processes on the three servers

================ hadoop102 ================
3330 Jps
3238 Worker
3163 Master
================ hadoop103 ================
2966 Jps
2908 Worker
================ hadoop104 ================
2978 Worker
3036 Jps

3) View the Master resource-monitoring Web UI: http://hadoop102:8080
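As a quick command-line check that the Master UI is reachable (assuming curl is installed on the node), fetching the page should return the UI's HTML:

curl http://hadoop102:8080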

4 Submit an application

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

--class indicates the main class of the program to execute

--master spark://hadoop102:7077 is the standalone deployment mode, connecting to the Spark cluster

spark-examples_2.12-3.0.0.jar is the jar package containing the class to run

The number 10 is the program's input argument, used to set the number of tasks (for SparkPi, the number of partitions) for the current application

When a task is executed, multiple Java processes are spawned

By default, a task uses the total number of cores across the cluster's server nodes, with 1024m of memory on each node.
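If the defaults are not what you want, spark-submit also accepts explicit resource flags; a minimal sketch with illustrative values:

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 512m \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

Here --executor-memory caps the memory per executor, and --total-executor-cores caps the cores used across the whole standalone cluster.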

5 Configure the history service

Since the cluster-monitoring page at hadoop102:4040 can no longer show a task's history once spark-shell is stopped, a history server is configured during development to record what tasks ran.

1) Rename the spark-defaults.conf.template file to spark-defaults.conf

mv spark-defaults.conf.template spark-defaults.conf

2) Modify the spark-defaults.conf file and configure the log storage path

spark.eventLog.enabled true
spark.eventLog.dir     hdfs://hadoop102:8020/directory

Note: the Hadoop cluster must be started, and the /directory path must already exist on HDFS.

sbin/start-dfs.sh
hadoop fs -mkdir /directory
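To confirm the path exists before submitting jobs, list the HDFS root; a /directory entry should appear:

hadoop fs -ls /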

3) Modify the spark-env.sh file and add the log configuration

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory -Dspark.history.retainedApplications=30"

Note: write it on one line, with the options separated by spaces.

Parameter 1: the port number for Web UI access is 18080.

Parameter 2: specifies the history server's log storage path.

Parameter 3: specifies the number of Application history records to keep. When this value is exceeded, the oldest application information is deleted. This is the number of applications held in memory, not the number displayed on the page.

4) Distribute the configuration files

xsync conf

5) Restart the cluster and the history service

sbin/start-all.sh
sbin/start-history-server.sh

6) Re-execute the task

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

7) View the history service: http://hadoop102:18080
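To confirm the history server process itself is up, jps on the node where it was started should now show a HistoryServer process alongside Master and Worker:

jps | grep HistoryServer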

6 Configure high availability (HA)

High availability is needed because the current cluster has only one Master node, which creates a single point of failure. To solve this, the cluster is configured with multiple Master nodes: once the active Master fails, a standby Master takes over and provides services, ensuring that jobs continue to run. The high availability here is generally coordinated by Zookeeper.

Cluster planning:

hadoop102: Master, Zookeeper, Worker
hadoop103: Master, Zookeeper, Worker
hadoop104: Zookeeper, Worker

1) Stop the cluster

sbin/stop-all.sh

2) Start Zookeeper
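The article gives no command for this step; with a standard Zookeeper installation (the path is an assumption), it would typically be run on each of the three nodes:

bin/zkServer.sh start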

3) Modify the spark-env.sh file and adjust the configuration as follows

Comment out the following:

#SPARK_MASTER_HOST=hadoop102
#SPARK_MASTER_PORT=7077

Add the following:

# The Master monitoring page listens on port 8080 by default, but that conflicts with Zookeeper, so change it to 8989 (it can also be customized). Use this port when accessing the UI monitoring page.
SPARK_MASTER_WEBUI_PORT=8989
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop102,hadoop103,hadoop104 -Dspark.deploy.zookeeper.dir=/spark"

Note: write the export on one line, with the options separated by spaces.

4) Distribute the configuration files

xsync conf/

5) Start the cluster

sbin/start-all.sh
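Optionally, you can verify that Spark registered its recovery data under the /spark znode configured above (assuming the Zookeeper CLI is available):

bin/zkCli.sh -server hadoop102:2181
ls /spark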

6) Start a separate Master on hadoop103; the Master on the hadoop103 node will be in standby state

[bigdata@hadoop103 spark-standalone]$ sbin/start-master.sh

7) Submit the application to the highly available cluster

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077,hadoop103:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10

8) Stop the Master resource-monitoring process on hadoop102
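The article gives no command for this step; one way to simulate the failure is to kill the Master process by the PID that jps showed earlier (3163 in the output above; yours will differ):

kill -9 3163

Alternatively, running sbin/stop-master.sh on hadoop102 stops the Master cleanly.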

9) Check the Master resource-monitoring Web UI of hadoop103 (now at http://hadoop103:8989 because of the port change above); after a short period, the Master status of the hadoop103 node is promoted to active.

After reading the above, do you have a better understanding of how to set up Standalone mode and its configuration in the Spark big data running environment? If you want to learn more, please follow the industry information channel. Thank you for your support.
