This article describes how to use Spark in Hadoop. The editor thinks it is very practical and shares it with you as a reference; follow along to take a look.
I. What is Spark
Spark is a general-purpose distributed parallel computing framework, like Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark has the advantages of Hadoop MapReduce, but the biggest difference between them is that Spark performs iterative computation in memory: the intermediate output of a Spark job can be kept in memory, so there is no need to read and write HDFS between steps. In addition, a MapReduce job has only two stages, map and reduce, and processing ends when they finish, whereas Spark's computing model can be divided into n stages. Because it iterates in memory, after one stage finishes we can continue with many more stages, not just two.
Therefore, Spark is better suited to algorithms that need to iterate over the data, such as data mining and machine learning. It implements MapReduce's map and reduce operators and its computing model, and it also provides a richer set of operators, such as filter, join and groupByKey.
Spark is a platform for fast, general-purpose cluster computing. It extends the widely used MapReduce computing model and efficiently supports more kinds of computation, including interactive queries and stream processing. Speed matters when dealing with large data sets, and an important feature of Spark is that it can compute in memory, which makes it faster; even for complex computations on disk, Spark is still more efficient than MapReduce.
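To make the in-memory iteration point concrete, here is a minimal spark-shell (Scala) sketch, added for illustration and not part of the original walkthrough; it assumes the shell's predefined SparkContext sc. After the first action, the cached RDD is served from memory, so later passes do not recompute it or re-read it from storage.
// Illustrative only: run inside spark-shell, where sc (SparkContext) is predefined.
val data = sc.parallelize(1 to 1000000).cache()   // mark the RDD to be kept in memory
var total = 0L
for (i <- 1 to 5) {
  // Each pass launches a new job; after the first action the data is served from
  // memory instead of being recomputed or re-read from storage.
  total += data.map(_.toLong * i).reduce(_ + _)
}
println(total)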
II. Installation of Scala (all nodes)
Download the installation package
wget https://downloads.lightbend.com/scala/2.11.7/scala-2.11.7.tgz
Extract the installation package
tar xf scala-2.11.7.tgz
mv scala-2.11.7 /usr/local/scala
Configure the scala environment variable /etc/profile.d/scala.sh
# Scala ENV
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
Make the scala environment variable effective
source /etc/profile.d/scala.sh
III. Spark installation (all nodes)
1. Download and install
# download the installation package
wget https://mirrors.aliyun.com/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
# extract the installation package
tar xf spark-2.3.1-bin-hadoop2.7.tgz
mv spark-2.3.1-bin-hadoop2.7 /usr/local/spark
2. Configure Spark environment variables
Edit the file /etc/profile.d/spark.sh and modify it as follows:
# Spark ENV
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Make the environment variable effective
source /etc/profile.d/spark.sh
IV. Spark configuration (namenode01)
1. Configure spark-env.sh
Edit the file /usr/local/spark/conf/spark-env.sh to read as follows:
export JAVA_HOME=/usr/java/default
export SCALA_HOME=/usr/local/scala
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=namenode01
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
2. Configure slaves
Edit the file /usr/local/spark/conf/slaves to read as follows:
datanode01
datanode02
datanode03
3. Synchronize the configuration files to the other nodes
scp /usr/local/spark/conf/* datanode01:/usr/local/spark/conf/
scp /usr/local/spark/conf/* datanode02:/usr/local/spark/conf/
scp /usr/local/spark/conf/* datanode03:/usr/local/spark/conf/
4. Start the Spark cluster
The Spark service only uses Hadoop's HDFS cluster; Spark itself runs here as its own standalone cluster.
/usr/local/spark/sbin/start-all.sh
V. Check
1. jps
[root@namenode01 ~]# jps
14512 NameNode
23057 RunJar
14786 ResourceManager
30355 Jps
15894 HMaster
30234 Master
[root@datanode01 ~]# jps
3509 DataNode
3621 NodeManager
1097 QuorumPeerMain
9930 RunJar
15514 Worker
15581 Jps
3935 HRegionServer
[root@datanode02 ~]# jps
3747 HRegionServer
14153 Worker
3322 DataNode
3434 NodeManager
1101 QuorumPeerMain
14221 Jps
[root@datanode03 ~]# jps
3922 DataNode
4034 NodeManager
19186 Worker
19255 Jps
1102 QuorumPeerMain
4302 HRegionServer
2. Spark web interface
Visit http://192.168.1.200:8080/
3. spark-shell
While the spark-shell is running, we can also visit the web UI at http://192.168.1.200:4040 to see the currently executing tasks.
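For reference, a minimal interactive session might look like the sketch below. This is an illustrative addition, not part of the original article: the master URL follows the standalone convention for the SPARK_MASTER_IP set above, and the input path is a made-up example that, with HADOOP_CONF_DIR configured, would be resolved against HDFS.
// Launch the shell against the standalone master configured above, for example:
//   /usr/local/spark/bin/spark-shell --master spark://namenode01:7077
// Inside the shell, sc (SparkContext) is predefined; the input path below is a
// hypothetical example, resolved against HDFS because HADOOP_CONF_DIR is set.
val lines = sc.textFile("/tmp/input.txt")
val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)   // this job also shows up on the 4040 web UI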
Thank you for reading! This concludes the article on how to use Spark in Hadoop. I hope the content above is of some help to you; if you found it useful, feel free to share it so more people can see it.