
Construction of big data distributed platform Hadoop2.7.7 + Spark2.2.2

2025-01-22 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general parallel framework, open-sourced by UC Berkeley's AMP Lab (the AMP Lab of the University of California, Berkeley), in the same family as Hadoop MapReduce. Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a job can be kept in memory, so it no longer has to be written to and read from HDFS. This makes Spark better suited to MapReduce-style algorithms that need iteration, such as those used in data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but there are some differences between the two, and these differences make Spark superior for certain workloads: Spark keeps distributed datasets in memory, which lets it optimize iterative workloads in addition to interactive queries.

Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which lets Scala manipulate distributed datasets as easily as local collection objects.
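To make this concrete, here is a minimal sketch of the kind of code this refers to, assuming it is typed line by line into a spark-shell session (where the SparkContext sc is already provided); the numbers are arbitrary and only for illustration. The RDD is transformed with filter/map/reduce much like a local Scala collection, and cache() keeps it in memory so a second pass does not recompute it, which is what makes iterative workloads cheap.

val nums  = sc.parallelize(1 to 1000000)        // a distributed dataset (RDD) built from a local range
val evens = nums.filter(_ % 2 == 0).cache()     // transformed like a collection; kept in memory after first use
val sum   = evens.map(_.toLong).reduce(_ + _)   // first action computes the RDD and fills the cache
val count = evens.count()                       // second action reuses the cached in-memory data
println(s"even sum = $sum, even count = $count")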

Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system, for example through the third-party cluster framework Mesos. Developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, Spark can be used to build large, low-latency data analysis applications.

1. Preparatory work

(1) Three machines, which can be three virtual machines in VMware. I have three CentOS 6.7 machines, namely:
192.168.174.141 hd1 master
192.168.174.142 hd2 slave1
192.168.174.143 hd3 slave2
(2) Java environment: jdk1.8.0_73
(3) Create a new ordinary user:
useradd hadoop
passwd hadoop
New password:
Retype new password:
Grant it sudo (root) privileges by adding a "hadoop ALL=(ALL) ALL" line for hadoop under the root entry:
# change the permissions so the file can be edited
chmod 777 /etc/sudoers
vim /etc/sudoers
## Allow root to run any commands anywhere
root ALL=(ALL) ALL
hadoop ALL=(ALL) ALL
# restore the permissions
chmod 440 /etc/sudoers
(4) Configure passwordless ssh login:
# switch to the hadoop user and its home directory
su - hadoop
ssh-keygen -t rsa    # press Enter four times
# executing this command generates two files: id_rsa (private key) and id_rsa.pub (public key)
# copy the public key to the machines that should be reachable without a password
ssh-copy-id hd2
ssh-copy-id hd3

2. Install the Hadoop cluster

# on the hd1, hd2 and hd3 machines, create a new apps directory to store the hadoop and spark installation packages
mkdir -p /home/hadoop/apps/hadoop
cd /home/hadoop/apps/hadoop
# download hadoop 2.7.7 on the hd1 machine (after the hadoop configuration is changed on hd1, scp it to hd2, hd3, etc.)
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -zxvf hadoop-2.7.7.tar.gz
# configure the environment variables
sudo vim /etc/profile
# add HADOOP_HOME
export HADOOP_HOME=/home/hadoop/apps/hadoop/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# refresh the environment variables
source /etc/profile
# check the hadoop version
hadoop version
# configure JAVA_HOME for Hadoop
cd /home/hadoop/apps/hadoop/hadoop-2.7.7/etc/hadoop
vim hadoop-env.sh
# at about line 25, add:
export JAVA_HOME=/opt/soft/java/jdk1.8.0_73
# modify the configuration files
1. Modify core-site.xml:
vim core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hd1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/apps/hadoop/hadoop-2.7.7/tmp</value>
  </property>
</configuration>
2. Modify hdfs-site.xml:
vim hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hd1:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/apps/hadoop/hadoop-2.7.7/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/apps/hadoop/hadoop-2.7.7/tmp/dfs/data</value>
  </property>
</configuration>
3. Modify mapred-site.xml. This file is not present in the directory, so make a copy from the template first:
cp mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hd1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hd1:19888</value>
  </property>
</configuration>
4. Modify yarn-site.xml:
vim yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hd1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://hd1:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
5. Modify the contents of the slaves file. This file specifies which server nodes are datanode nodes; delete the default localhost entry in it:
cd /home/hadoop/apps/hadoop/hadoop-2.7.7/etc/hadoop
vim slaves
hd1
hd2
hd3
# copy the configured hadoop-2.7.7 directory on hd1 to the same directory on hd2 and hd3
cd /home/hadoop/apps/hadoop
scp -r hadoop-2.7.7 hadoop@hd2:/home/hadoop/apps/hadoop/
scp -r hadoop-2.7.7 hadoop@hd3:/home/hadoop/apps/hadoop/
scp /etc/profile root@hd2:/etc/    # then execute on hd2: source /etc/profile
scp /etc/profile root@hd3:/etc/    # then execute on hd3: source /etc/profile
# format the cluster
# format the namenode and datanode (execute this on hd1 (master) only; it is not needed on the slaves hd2 and hd3)
hdfs namenode -format
# close the firewall on all machines
service iptables stop
# start the hadoop cluster by executing two commands in sequence
# start hdfs
start-dfs.sh
# then start yarn
start-yarn.sh
# (or start both directly with the single command start-all.sh)
# verify the startup with the jps command on hd1, hd2 and hd3; the absence of any of the following processes indicates an error
# on hd1 you should see:
56310 NameNode
56423 DataNode
56634 SecondaryNameNode
56809 ResourceManager
56921 NodeManager
# on hd2:
16348 DataNode
16455 NodeManager
# on hd3:
13716 DataNode
13823 NodeManager
# view the cluster web pages
# hdfs page: http://hd1:50070/ or http://192.168.174.141:50070/

# yarn page: http://hd1:8088/ or http://192.168.174.141:8088/

# stop the cluster with stop-dfs.sh and stop-yarn.sh, or with the single command stop-all.sh

At this point, the Hadoop cluster build described above is complete!

3. Install and build the Spark cluster

Dependency: Scala. Spark is written in Scala, and Spark tasks written in Scala can operate on distributed datasets (RDDs) like local collection objects. Installing Scala is the same as installing the JDK. After installing Scala, you can check the version:
scala -version
Here we focus on the installation of Spark. It is a little easier than the Hadoop installation and the steps are similar, so without further ado, let's start.
# on the hd1 machine, as the hadoop user, create the spark directory
cd /home/hadoop/apps
mkdir spark
cd spark
# download the spark installation package
wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz
# extract it
tar -zxvf spark-2.2.2-bin-hadoop2.7.tgz
# and rename it
mv spark-2.2.2-bin-hadoop2.7 spark-2.2.2
# modify the environment variables
vim /etc/profile
export SPARK_HOME=/home/hadoop/apps/spark/spark-2.2.2
export PATH=$PATH:$SPARK_HOME/bin
# reload the environment
source /etc/profile
# modify the configuration file
cd /home/hadoop/apps/spark/spark-2.2.2/conf
mv spark-env.sh.template spark-env.sh
vim spark-env.sh
# spark has two deployment modes here: standalone mode and spark on yarn mode; configure whichever you choose
# 1. standalone mode
export JAVA_HOME=/opt/soft/java/jdk1.8.0_73
# Spark master node IP
export SPARK_MASTER_IP=hd1
# Spark master node port number
export SPARK_MASTER_PORT=7077
# 2. Spark on yarn configuration
export JAVA_HOME=/opt/soft/java/jdk1.8.0_73
export HADOOP_CONF_DIR=/home/hadoop/apps/hadoop/hadoop-2.7.7/etc/hadoop/
# modify the slaves file
cd /home/hadoop/apps/spark/spark-2.2.2/conf
vim slaves
hd2
hd3
# copy spark from hd1 to the hd2 and hd3 machines
cd /home/hadoop/apps/spark
scp -r spark-2.2.2/ hadoop@hd2:/home/hadoop/apps/spark
scp -r spark-2.2.2/ hadoop@hd3:/home/hadoop/apps/spark
# configure the environment variables: modify them on hd2 and hd3 separately, or copy the /etc/profile file on hd1 directly to hd2 and hd3
vim /etc/profile
export SPARK_HOME=/home/hadoop/apps/spark/spark-2.2.2
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
# at this point the Spark cluster is configured; next, start it
# (start the hadoop cluster before starting the spark cluster)
# start the spark cluster
cd /home/hadoop/apps/spark/spark-2.2.2/sbin
./start-all.sh
# test whether the Spark cluster started normally: execute jps on hd1, hd2 and hd3
# on hd1 (note the new Master process):
63124 Jps
56310 NameNode
56423 DataNode
63064 Master
56809 ResourceManager
56921 NodeManager
56634 SecondaryNameNode
# on hd2 and hd3 (note the new Worker process):
18148 Jps
16455 NodeManager
16348 DataNode
18079 Worker
# test spark-shell and the web pages
cd /home/hadoop/apps/spark/spark-2.2.2/bin
./spark-shell
# master page: http://hd1:8080/ or http://192.168.174.141:8080/

# spark-shell application page: http://hd1:4040/jobs/ or http://192.168.174.141:4040/jobs/
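As a final check that the cluster built above works end to end, here is a hedged sketch of a small word count typed into the spark-shell started in the previous step. The HDFS path /user/hadoop/input.txt is only a placeholder for a text file you have uploaded yourself (for example with hdfs dfs -mkdir -p /user/hadoop and hdfs dfs -put); it is not created by any step in this article. While the job runs, it should also appear on the http://hd1:4040/jobs/ page.

// placeholder input path; upload any text file to HDFS first, e.g.:
//   hdfs dfs -mkdir -p /user/hadoop && hdfs dfs -put somefile.txt /user/hadoop/input.txt
val lines  = sc.textFile("hdfs://hd1:9000/user/hadoop/input.txt")          // read the file from HDFS (fs.defaultFS is hdfs://hd1:9000)
val words  = lines.flatMap(_.split("\\s+"))                                // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)               // sum the counts per word across the cluster
counts.take(10).foreach(println)                                           // print a small sample of (word, count) pairs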

4. The build is complete!
