Experimental environment
CentOS 6.X
Hadoop 2.6.0
JDK 1.8.0_65
Purpose
The purpose of this document is to help you quickly install and run Hadoop on a single machine so that you can get some hands-on experience with the Hadoop Distributed File System (HDFS) and the MapReduce framework, for example by running the sample programs or simple jobs on HDFS.
Prerequisites
Supported platforms
GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters of up to 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been thoroughly tested on Win32, so it is not supported as a production platform.
Install software
If the required software is not already installed on your machines, install it first.
Take CentOS as an example:
# yum install ssh rsync -y
# ssh must be installed and sshd must be running so that the Hadoop scripts can manage the remote Hadoop daemons.
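To double-check that sshd is present and running on CentOS 6, something like the following can be used (a minimal sketch using the SysV service tools available on CentOS 6):
# service sshd status
# chkconfig sshd on    # optional: make sure sshd starts on boot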
Create a user
# useradd -m hadoop -s /bin/bash    # create a new user named hadoop
Hostname resolution
# cat /etc/hosts | grep ocean-lab
192.168.9.70 ocean-lab.ocean.org ocean-lab
Install the JDK
JDK: http://www.oracle.com/technetwork/java/javase/downloads/index.html
Install the Java environment first:
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u65-b17/jdk-8u65-linux-x64.rpm"
# rpm -Uvh jdk-8u65-linux-x64.rpm
Configure Java
# echo "export JAVA_HOME=/usr/java/jdk1.8.0_65" > > / home/hadoop/.bashrc
# source / home/hadoop/.bashrc
# echo $JAVA_HOME
/ usr/java/jdk1.8.0_65
Download and install Hadoop
To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors.
Preparation for running a Hadoop cluster
# wget http://apache.fayea.com/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Extract the downloaded Hadoop distribution, then edit the etc/hadoop/hadoop-env.sh file and, at a minimum, set JAVA_HOME to the root of your Java installation.
# tar xf hadoop-2.6.0.tar.gz -C /usr/local
# mv /usr/local/hadoop-2.6.0 /usr/local/hadoop
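The hadoop-env.sh edit mentioned above can be as small as appending a single line (a sketch, run from the Hadoop installation directory and assuming the JDK path installed earlier):
# echo "export JAVA_HOME=/usr/java/jdk1.8.0_65" >> etc/hadoop/hadoop-env.sh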
Try the following command:
# bin/hadoop
The usage document for the hadoop script will be displayed.
Now you can start the Hadoop cluster in one of the three supported modes:
Stand-alone mode
Pseudo-distributed mode
Fully distributed mode
Running in stand-alone mode
By default, Hadoop is configured to run in non-distributed mode as a single Java process, which is very helpful for debugging.
Now we can run an example to get a feel for how Hadoop operates. Hadoop ships with a wealth of examples, including wordcount, terasort, join, and grep.
Here we run the grep example: it takes all the files in the input folder as input, filters out the words that match the regular expression dfs[a-z.]+ and counts their occurrences, and finally writes the results to the output folder.
# mkdir input
# cp ./etc/hadoop/*.xml input
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep ./input ./output 'dfs[a-z.]+'
# cat ./output/*
If the job runs successfully, it prints a lot of information about its progress; the final output is shown below. The result is written to the specified output folder, and cat ./output/* shows that the matching word dfsadmin appears once.
[10:57:58] [hadoop@ocean-lab hadoop-2.6.0]$ cat ./output/*
1 dfsadmin
Note that Hadoop does not overwrite output files by default, so running the example again will fail unless you delete ./output first.
Otherwise, the following error will be reported
INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/usr/local/hadoop-2.6.0/output already exists
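Clearing the local output directory before re-running avoids this error; using the path from the example above:
# rm -r ./output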
If prompted "INFO metrics.MetricsUtil: Unable to obtain hostName java.net.UnknowHostException", you need to execute the following command to modify the hosts file to add the IP mapping to your hostname:
# cat / etc/hosts | grep ocean-lab
192.168.9.70 ocean-lab.ocean.org ocean-lab
Running in pseudo-distributed mode
Hadoop can run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.
The node acts as both NameNode and DataNode, and jobs read the files stored in HDFS.
Before writing the pseudo-distributed configuration, we also need to set the Hadoop environment variables by appending the following to ~/.bashrc:
# Hadoop environment variables
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
# source ~/.bashrc    # make the settings take effect
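A quick sanity check that the variables and PATH took effect (a sketch; the hadoop command is available once $HADOOP_HOME/bin is on PATH):
# echo $HADOOP_HOME
/usr/local/hadoop-2.6.0
# hadoop version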
Configuration
Use the following etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Similarly, modify the configuration file etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp/dfs/data</value>
    </property>
</configuration>
A note about the Hadoop configuration items
Although only fs.defaultFS and dfs.replication are strictly required to run (as in the official tutorial), if hadoop.tmp.dir is left unset the default temporary directory is /tmp/hadoop-hadoop, which may be cleaned up by the system on reboot and would then force you to re-run the format step. So we set it explicitly and also specify dfs.namenode.name.dir and dfs.datanode.data.dir, otherwise errors may occur in the following steps.
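The effective values can be checked without starting any daemons (a sketch, run from the Hadoop installation directory; getconf is a sub-command of the hdfs front end in Hadoop 2.x):
# ./bin/hdfs getconf -confKey hadoop.tmp.dir
# ./bin/hdfs getconf -confKey fs.defaultFS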
Passwordless SSH setup
Now confirm whether you can log in to localhost with ssh without entering a password:
# ssh localhost date
If you cannot log in to localhost with ssh without entering a password, execute the following command:
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# chmod 600 ~/.ssh/authorized_keys
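After these steps, the earlier login test should now succeed without prompting. BatchMode makes ssh fail instead of asking for a password if key authentication is still not working (a quick check, not part of the original steps):
# ssh -o BatchMode=yes localhost date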
Format a new distributed file system:
$ bin/hadoop namenode -format
15-12-23 11:30:20 INFO util.GSet: VM type       = 64-bit
15-12-23 11:30:20 INFO util.GSet: 0.0299999329447746% max memory 966.7 MB = 297.0 KB
15-12-23 11:30:20 INFO util.GSet: capacity      = 2^15 = 32768 entries
15-12-23 11:30:20 INFO namenode.NNConf: ACLs enabled? false
15-12-23 11:30:20 INFO namenode.NNConf: XAttrs enabled? true
15-12-23 11:30:20 INFO namenode.NNConf: Maximum size of an xattr: 16384
15-12-23 11:30:20 INFO namenode.FSImage: Allocated new BlockPoolId: BP-823870322-192.168.9.70-1450841420347
15-12-23 11:30:20 INFO common.Storage: Storage directory /usr/local/hadoop-2.6.0/tmp/dfs/name has been successfully formatted.
15-12-23 11:30:20 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15-12-23 11:30:20 INFO util.ExitUtil: Exiting with status 0
15-12-23 11:30:20 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ocean-lab.ocean.org/192.168.9.70
************************************************************/
If successful, you will see the messages "successfully formatted" and "Exiting with status 0".
Note
The next time you start Hadoop, you do not need to format the NameNode again; just run ./sbin/start-dfs.sh.
Start NameNode and DataNode
$ ./sbin/start-dfs.sh
15-12-23 11:37:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-namenode-ocean-lab.ocean.org.out
localhost: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-datanode-ocean-lab.ocean.org.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is a5:26:42:a0:5f:da:a2:88:52:04:9c:7f:8d:6a:98:9b.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-secondarynamenode-ocean-lab.ocean.org.out
15-12-23 11:37:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable
[13:57:08] [hadoop@ocean-lab hadoop-2.6.0]$ jps
27686 SecondaryNameNode
28455 Jps
27501 DataNode
27405 NameNode
27006 GetConf
If any of these processes is missing, the startup failed; check the logs to debug.
After a successful startup, you can visit the web interface at http://[ip,fqdn]:50070 to view NameNode and DataNode information and to browse the files in HDFS online.
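If one of the daemons is missing from jps, its .log file under the logs directory usually explains why, and once everything is up the NameNode web UI should respond locally. A minimal check, assuming the install path and hostname used in this setup:
# tail -n 50 /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-namenode-ocean-lab.ocean.org.log
# curl -s http://localhost:50070/ | head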
Running a pseudo-distributed example
In stand-alone mode above, the grep example read local data; in pseudo-distributed mode it reads data from HDFS.
To use HDFS, you first need to create a user directory in HDFS:
# ./bin/hdfs dfs -mkdir -p /user/hadoop
# ./bin/hadoop fs -ls /user/hadoop
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2015-12-23 15:03 /user/hadoop/input
Next, copy the XML files in ./etc/hadoop to the distributed file system as the input files, that is, copy /usr/local/hadoop/etc/hadoop into /user/hadoop/input on HDFS. Since we are working as the hadoop user and have already created the corresponding user directory /user/hadoop, we can use a relative path such as input in the commands; its absolute path is /user/hadoop/input:
# ./bin/hdfs dfs -mkdir input
# ./bin/hdfs dfs -put ./etc/hadoop/*.xml input
After the copy completes, you can list the files in HDFS with the following command:
# ./bin/hdfs dfs -ls input
-rw-r--r-- 1 hadoop supergroup 4436 2015-12-23 16:46 input/capacity-scheduler.xml
-rw-r--r-- 1 hadoop supergroup 1180 2015-12-23 16:46 input/core-site.xml
-rw-r--r-- 1 hadoop supergroup 9683 2015-12-23 16:46 input/hadoop-policy.xml
-rw-r--r-- 1 hadoop supergroup 1136 2015-12-23 16:46 input/hdfs-site.xml
-rw-r--r-- 1 hadoop supergroup 620 2015-12-23 16:46 input/httpfs-site.xml
-rw-r--r-- 1 hadoop supergroup 3523 2015-12-23 16:46 input/kms-acls.xml
-rw-r--r-- 1 hadoop supergroup 5511 2015-12-23 16:46 input/kms-site.xml
-rw-r--r-- 1 hadoop supergroup 858 2015-12-23 16:46 input/mapred-site.xml
-rw-r--r-- 1 hadoop supergroup 690 2015-12-23 16:46 input/yarn-site.xml
A pseudo-distributed MapReduce job is run in the same way as in stand-alone mode, except that it reads its files from HDFS (you can delete the local input and output folders created in the stand-alone step to verify this).
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
View the results of the run (the output is located in HDFS):
./bin/hdfs dfs -cat output/*
1 dfsadmin
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.datanode.data.dir
The results are shown above; note that because we changed the configuration files, they differ from the stand-alone run.
(Figure: result of running the grep example in pseudo-distributed mode)
We can also get the running results back locally:
# rm -r ./output    # delete the local output folder first (if it exists)
# ./bin/hdfs dfs -get output ./output    # copy the output folder from HDFS to the local machine
# cat ./output/*
1 dfsadmin
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.datanode.data.dir
When running a Hadoop program, the output directory must not already exist; otherwise the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists" is raised. To run the job again, first delete the output folder:
# Delete the output folder
$ ./bin/hdfs dfs -rm -r output
Deleted output
The output directory must not exist when running a program
To prevent results from being accidentally overwritten, the output directory specified by a Hadoop program (such as output) must not already exist, otherwise the job fails with the error above, so the output directory has to be deleted before each run. When developing a real application, you can have the program delete the output directory itself at startup, which avoids this repetitive command-line step:
// requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path, org.apache.hadoop.mapreduce.Job
Configuration conf = new Configuration();
Job job = new Job(conf);
/* Delete the output directory (if any) so the job can be re-run */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);
To close Hadoop, run
./sbin/stop-dfs.sh
Start YARN
(A pseudo-distributed setup does not have to start YARN, and this generally does not affect program execution.)
Some readers may wonder why, after starting Hadoop, they do not see the JobTracker and TaskTracker described in older books: newer versions of Hadoop use the new MapReduce framework (MapReduce V2, also known as YARN, Yet Another Resource Negotiator).
YARN was split out of MapReduce and is responsible for resource management and job scheduling; MapReduce now runs on top of YARN, which provides high availability and scalability. A more detailed introduction to YARN is beyond the scope of this document; consult the relevant materials if you are interested.
Starting Hadoop with ./sbin/start-dfs.sh as above only starts the HDFS daemons, and MapReduce jobs still run locally. We can also start YARN and let it take over resource management and task scheduling.
First modify the configuration file etc/hadoop/mapred-site.xml (Hadoop 2.6.0 ships it as mapred-site.xml.template, so rename or copy it to mapred-site.xml first):
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Then modify the configuration file etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Then you can start YARN (run ./sbin/start-dfs.sh first):
# ./sbin/start-yarn.sh    # start YARN
# ./sbin/mr-jobhistory-daemon.sh start historyserver    # start the history server so job status can be viewed in the web UI
When enabled, you can see that there are two background processes, NodeManager and ResourceManager, via jps:
[09:18:34] [hadoop@ocean-lab ~]$ jps
27686 SecondaryNameNode
6968 ResourceManager
7305 Jps
7066 NodeManager
27501 DataNode
27405 NameNode
After starting YARN, examples are run in exactly the same way; only resource management and task scheduling change. Looking at the log output, you can see that without YARN the job is run by "mapred.LocalJobRunner", while with YARN enabled it is handled by "mapred.YARNRunner". One advantage of starting YARN is that you can follow job status through the web interface at http://[ip,fqdn]:8088/cluster.
(Figure: job information shown in the YARN web interface)
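One way to confirm which path is being used is to filter the client output when submitting the example again (a sketch; output2 is just a fresh, hypothetical output directory name so the job does not collide with the earlier run):
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output2 'dfs[a-z.]+' 2>&1 | grep -iE "LocalJobRunner|RMProxy"
Without YARN the filtered lines come from mapred.LocalJobRunner; with YARN running, the client first connects to the ResourceManager (client.RMProxy) before the job is scheduled.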
However, YARN mainly provides better resource management and task scheduling for a cluster; on a single machine it adds no real value and makes programs run slightly slower. Whether to enable YARN on a single machine therefore depends on your needs.
If you do not intend to start YARN, delete or rename mapred-site.xml (for example, back to mapred-site.xml.template).
Otherwise, if the configuration file exists but YARN is not running, programs will keep retrying with the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032".
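Renaming the file back to its template name is enough to take the YARN submission path out of the picture (a sketch, relative to the Hadoop installation directory):
# mv etc/hadoop/mapred-site.xml etc/hadoop/mapred-site.xml.template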
Similarly, the script to shut down YARN is as follows:
# ./sbin/stop-yarn.sh
# ./sbin/mr-jobhistory-daemon.sh stop historyserver
Common Hadoop commands
# View the list of files in HDFS
hadoop fs -ls /usr/local/log/
# Create a directory
hadoop fs -mkdir /usr/local/log/test
# Delete files
hadoop fs -rm /usr/local/log/07
# Upload a local file to the HDFS directory /usr/local/log/
hadoop fs -put /usr/local/src/infobright-4.0.6-0-x86_64-ice.rpm /usr/local/log/
# Download a file from HDFS
hadoop fs -get /usr/local/log/infobright-4.0.6-0-x86_64-ice.rpm /usr/local/src/
# View a file
hadoop fs -cat /usr/local/log/zabbix/access.log.zabbix
# View basic HDFS usage
# hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Configured Capacity: 29565767680 (27.54 GB)
Present Capacity: 17956433920 (16.72 GB)
DFS Remaining: 17956405248 (16.72 GB)
DFS Used: 28672 (28 KB)
DFS Used%: 0.005%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Live datanodes (1):
Name: 127.0.0.1:50010 (localhost)
Hostname: ocean-lab.ocean.org
Decommission Status: Normal
Configured Capacity: 29565767680 (27.54 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 11609333760 (10.81 GB)
DFS Remaining: 17956405248 (16.72 GB)
DFS Used%: 0.005%
DFS Remaining%: 60.73%
Configured Cache Capacity: 0 (0B)
Cache Used: 0 (0B)
Cache Remaining: 0 (0B)
Cache Used%: 100.00%
Cache Remaining%: 0.005%
Xceivers: 1
Last contact: Thu Dec 24 09:52:14 CST 2015
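The deprecation notice at the top of this report points to the hdfs front end; the equivalent non-deprecated invocation is:
# hdfs dfsadmin -report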
With that, you have mastered the basic configuration and use of Hadoop.