Install Hadoop pseudo-distributed mode (Single Node Cluster)


Purpose

This document describes how to install a single-node Hadoop cluster so that you can learn and experiment with Hadoop's HDFS and MapReduce.

Environment:

OS: CentOS release 6.5 (Final)

IP: 172.16.101.58

User: root

Hadoop: hadoop-2.9.0.tar.gz

SSH password-less login configuration

Because this installation is performed as the root user, you need to configure passwordless SSH login from root to the local node.

[root@sht-sgmhadoopdn-01]# ssh-keygen -t rsa

[root@sht-sgmhadoopdn-01 .ssh]# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

[root@sht-sgmhadoopdn-01 ~]# ssh localhost
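If ssh localhost still prompts for a password, the usual cause is overly permissive modes on the key files; a minimal fix, assuming the default key locations:

chmod 700 ~/.ssh

chmod 600 ~/.ssh/authorized_keys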

Java installation and configuration

[root@sht-sgmhadoopdn-01 ~]# cd /usr/java

[root@sht-sgmhadoopdn-01 java]# tar xf jdk-8u111-linux-x64.tar.gz

[root@sht-sgmhadoopdn-01 java]# chown -R root:root jdk1.8.0_111/

[root@sht-sgmhadoopdn-01 bin]# /usr/java/jdk1.8.0_111/bin/java -version

java version "1.8.0_111"

[root@sht-sgmhadoopdn-01]# vim ~/.bash_profile

export JAVA_HOME=/usr/java/jdk1.8.0_111

export PATH=$JAVA_HOME/bin:$PATH:$HOME/bin

export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

[root@sht-sgmhadoopdn-01 ~]# source ~/.bash_profile

[root@sht-sgmhadoopdn-01 ~]# which java

/usr/java/jdk1.8.0_111/bin/java

Download and extract Hadoop

[root@sht-sgmhadoopdn-01 local]# wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz

[root@sht-sgmhadoopdn-01 local]# tar xf hadoop-2.9.0.tar.gz

[root@sht-sgmhadoopdn-01 ~]# vim ~/.bash_profile

export HADOOP_HOME=/usr/local/hadoop-2.9.0

export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$PATH

[root@sht-sgmhadoopdn-01 ~]# source ~/.bash_profile

[root@sht-sgmhadoopdn-01 ~]# which hadoop

/usr/local/hadoop-2.9.0/bin/hadoop

[root@sht-sgmhadoopdn-01 local]# hadoop version

Hadoop 2.9.0

...

Hadoop jar command explained

jar runs a jar file; to run the job on YARN, use the yarn jar command instead.

The example takes all files in the input folder as input, filters for words matching the regular expression dfs[a-z.]+, counts the occurrences of each, and finally writes the results to the output folder:

Regular expression:

[a-z] matches any single character in the range a to z.

+ matches the preceding item one or more times.
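As a quick local sanity check, independent of Hadoop, you can try the same pattern with GNU grep (the -E, -o, and -h flags are standard) against the configuration files used as input below:

grep -Eoh 'dfs[a-z.]+' /usr/local/hadoop-2.9.0/etc/hadoop/*.xml | sort | uniq -c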

[root@sht-sgmhadoopdn-01 ~]# cd /usr/local/hadoop-2.9.0

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# mkdir -p input

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# cp etc/hadoop/*.xml input/

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'dfs[a-z.]+'

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# cat output/*

1       dfsadmin

Hadoop configuration file notes

(1) Hadoop's operating mode is determined by its configuration files (read when Hadoop starts), so to switch from pseudo-distributed mode back to standalone mode you need to delete the configuration items in core-site.xml.

(2) Although pseudo-distributed mode only requires fs.defaultFS and dfs.replication to run (as in the official tutorial), if the hadoop.tmp.dir parameter is not configured, the default temporary directory is /tmp/hadoop-${user.name}, which may be cleaned by the system on reboot, forcing you to run the format step again. We therefore set it, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir, to avoid errors in the following steps.

Modify the configuration file

Hadoop can run in pseudo-distributed mode on a single node: each Hadoop daemon runs as a separate Java process, and the node acts as both NameNode and DataNode while reading files from HDFS.

Hadoop's configuration files are located in /usr/local/hadoop-2.9.0/etc/hadoop/, and pseudo-distributed mode requires modifying two of them, core-site.xml and hdfs-site.xml. The files are in XML format; each configuration item is declared as a property with a name and a value.

[root@sht-sgmhadoopdn-01 hadoop]# cat /usr/local/hadoop-2.9.0/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop-2.9.0/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

[root@sht-sgmhadoopdn-01 hadoop]# cat /usr/local/hadoop-2.9.0/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-2.9.0/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-2.9.0/tmp/dfs/data</value>
    </property>
</configuration>
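Formatting the NameNode creates these directories if they are missing, but pre-creating them is a simple way to verify the paths and permissions; a minimal sketch assuming the values configured above:

mkdir -p /usr/local/hadoop-2.9.0/tmp/dfs/name /usr/local/hadoop-2.9.0/tmp/dfs/data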

[root@sht-sgmhadoopdn-01 hadoop]# vim /usr/local/hadoop-2.9.0/etc/hadoop/hadoop-env.sh

# export JAVA_HOME=${JAVA_HOME}

export JAVA_HOME=/usr/java/jdk1.8.0_111

Start the Hadoop cluster

# format the NameNode:

[root@sht-sgmhadoopdn-01 hadoop]# hdfs namenode -format
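If formatting succeeds, the console output should include a line roughly like "Storage directory /usr/local/hadoop-2.9.0/tmp/dfs/name has been successfully formatted." (the exact wording varies slightly between versions).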

# start the NameNode and DataNode daemons (this step starts three processes: NameNode, DataNode, and SecondaryNameNode)

[root@sht-sgmhadoopdn-01 hadoop]# /usr/local/hadoop-2.9.0/sbin/start-dfs.sh

# check process IDs and names with the jps command

[root@sht-sgmhadoopdn-01 logs]# jps

12704 DataNode

14273 Jps

12580 NameNode

27988 -- process information unavailable

13015 SecondaryNameNode

27832 -- process information unavailable

# you can also stop the daemons with stop-dfs.sh (the next time you start Hadoop there is no need to format the NameNode again; just run start-dfs.sh)

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# /usr/local/hadoop-2.9.0/sbin/stop-dfs.sh

After the processes start successfully, you can view NameNode and DataNode information in a browser and browse the files in HDFS online:

NameNode: http://172.16.101.58:50070
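A quick way to verify the web UI is reachable without a browser is to request it and print only the HTTP status code; this sketch assumes the IP and default port above, and 200 indicates success:

curl -s -o /dev/null -w "%{http_code}\n" http://172.16.101.58:50070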

Run a MapReduce job in Hadoop pseudo-distributed mode

# create the HDFS directory /user/root/input and copy local files into HDFS

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hdfs dfs -mkdir -p /user/root/input

[root@sht-sgmhadoopdn-01 ~]# hdfs dfs -ls

drwxr-xr-x   - root supergroup          0 2017-12-24 15:20 input

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hdfs dfs -put /usr/local/hadoop-2.9.0/etc/hadoop/*.xml /user/root/input

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hdfs dfs -ls /user/root/input

Found 8 items

-rw-r--r--   1 root supergroup       7861 2017-12-24 15:20 /user/root/input/capacity-scheduler.xml

-rw-r--r--   1 root supergroup       1040 2017-12-24 15:20 /user/root/input/core-site.xml

-rw-r--r--   1 root supergroup      10206 2017-12-24 15:20 /user/root/input/hadoop-policy.xml

-rw-r--r--   1 root supergroup       1091 2017-12-24 15:20 /user/root/input/hdfs-site.xml

-rw-r--r--   1 root supergroup        620 2017-12-24 15:20 /user/root/input/httpfs-site.xml

-rw-r--r--   1 root supergroup       3518 2017-12-24 15:20 /user/root/input/kms-acls.xml

-rw-r--r--   1 root supergroup       5939 2017-12-24 15:20 /user/root/input/kms-site.xml

-rw-r--r--   1 root supergroup        690 2017-12-24 15:20 /user/root/input/yarn-site.xml

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hadoop jar /usr/local/hadoop-2.9.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'dfs[a-z.]+'

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hdfs dfs -cat output/*

1       dfsadmin

# the result directory is not overwritten by default, so running the above example again fails with the error: hdfs://localhost:9000/user/root/output already exists; you need to delete output first

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hdfs dfs -rm -r /user/root/output

Deleted /user/root/output

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hadoop jar /usr/local/hadoop-2.9.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'dfs[a-z.]+'

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hdfs dfs -cat output/*

1       dfsadmin

1       dfs.replication

1       dfs.namenode.name.dir

1       dfs.datanode.data.dir

# you can also copy files from HDFS back to the local filesystem

[root@sht-sgmhadoopdn-01 hadoop-2.9.0]# hdfs dfs -get /user/root/output /usr/local/hadoop-2.9.0/
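Once copied, the result can be read straight from the local filesystem; assuming the destination directory used above:

cat /usr/local/hadoop-2.9.0/output/*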

Run YARN on a single node

(1) Newer versions of Hadoop use the new MapReduce framework (MapReduce V2, also known as YARN, Yet Another Resource Negotiator).

(2) YARN was split out of MapReduce and is responsible for resource management and task scheduling. MapReduce now runs on top of YARN, which provides high availability and scalability.

Starting Hadoop with start-dfs.sh as above only brings up the HDFS daemons, without a cluster scheduler. We can additionally start YARN and let it take over resource management and task scheduling.

(3) If you do not want to start YARN, be sure to rename the configuration file mapred-site.xml to mapred-site.xml.template, and change it back only when you need it (see the command after this list). Otherwise, if that file exists while YARN is not running, programs will fail with the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032". This is also why the file ships under the name mapred-site.xml.template.

(4) However, YARN's main purpose is better resource management and task scheduling for a cluster; on a single machine it shows no benefit and actually makes jobs run slightly slower. Whether to enable YARN on a single node therefore depends on the actual situation.
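To disable YARN again later, point (3) above amounts to renaming the file back:

mv /usr/local/hadoop-2.9.0/etc/hadoop/mapred-site.xml /usr/local/hadoop-2.9.0/etc/hadoop/mapred-site.xml.template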

[root@sht-sgmhadoopdn-01 hadoop]# mv /usr/local/hadoop-2.9.0/etc/hadoop/mapred-site.xml.template mapred-site.xml

[root@sht-sgmhadoopdn-01 hadoop]# cat mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

[root@sht-sgmhadoopdn-01 hadoop]# cat yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

[root@sht-sgmhadoopdn-01 hadoop]# jps

27988 -- process information unavailable

30341 DataNode

32663 Jps

27832 -- process information unavailable

30188 NameNode

30525 SecondaryNameNode

# run this only if you have already started HDFS with the start-dfs.sh script

[root@sht-sgmhadoopdn-01 hadoop]# /usr/local/hadoop-2.9.0/sbin/start-yarn.sh

# compared with the earlier jps output, ResourceManager and NodeManager processes now appear

[root@sht-sgmhadoopdn-01 hadoop]# jps

27988 -- process information unavailable

30341 DataNode

32758 ResourceManager

855 Jps

27832 -- process information unavailable

411 NodeManager

30188 NameNode

30525 SecondaryNameNode

# after startup you can access it in a browser:

ResourceManager: http://172.16.101.58:8088
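You can also confirm from the command line that the NodeManager has registered with the ResourceManager, using the standard YARN CLI:

yarn node -list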

Stop the Hadoop cluster

[root@sht-sgmhadoopdn-01 hadoop]# /usr/local/hadoop-2.9.0/sbin/stop-yarn.sh

[root@sht-sgmhadoopdn-01 hadoop]# /usr/local/hadoop-2.9.0/sbin/stop-dfs.sh

[root@sht-sgmhadoopdn-01 hadoop]# /usr/local/hadoop-2.9.0/sbin/mr-jobhistory-daemon.sh stop historyserver

No historyserver to stop
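The "No historyserver to stop" message simply means the JobHistory Server was never started in this walkthrough; if you want it (its web UI listens on port 19888 by default), start it with the matching command:

/usr/local/hadoop-2.9.0/sbin/mr-jobhistory-daemon.sh start historyserver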

Reference links:

http://www.powerxing.com/install-hadoop/

http://hadoop.apache.org/docs/r2.9.0/hadoop-project-dist/hadoop-common/SingleCluster.html
