Experimental environment
CentOS 6.X
Hadoop 2.6.0
JDK 1.8.0_65
Purpose
The purpose of this document is to help you quickly install and run Hadoop on a single machine so that you can get some hands-on experience with the Hadoop Distributed File System (HDFS) and the MapReduce framework, for example by running the sample programs or simple jobs on HDFS.
Prerequisites
Supported platforms
GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters of up to 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been thoroughly tested on Win32, so it is not supported as a production platform.
Install software
If the required software is not already installed on your machines, install it first.
Take CentOS as an example:
# yum install ssh rsync -y
# ssh must be installed and sshd must be running so that the Hadoop scripts can manage the remote Hadoop daemons.
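To double-check that sshd is present and running on CentOS 6, something like the following can be used (a minimal sketch using the SysV service tools available on CentOS 6):
# service sshd status
# chkconfig sshd on    # optional: make sure sshd starts on boot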
Create a user
# useradd -m hadoop -s /bin/bash    # create a new user named hadoop
Hostname resolution
# cat /etc/hosts | grep ocean-lab
192.168.9.70 ocean-lab.ocean.org ocean-lab
Install the JDK
JDK: http://www.oracle.com/technetwork/java/javase/downloads/index.html
Install the Java environment first:
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u65-b17/jdk-8u65-linux-x64.rpm"
# rpm -Uvh jdk-8u65-linux-x64.rpm
Configure Java
# echo "export JAVA_HOME=/usr/java/jdk1.8.0_65" > > / home/hadoop/.bashrc
# source / home/hadoop/.bashrc
# echo $JAVA_HOME
/ usr/java/jdk1.8.0_65
Download and install Hadoop
To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors.
Preparation for running a Hadoop cluster
# wget http://apache.fayea.com/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Extract the downloaded Hadoop distribution, then edit the etc/hadoop/hadoop-env.sh file and, at a minimum, set JAVA_HOME to the root of your Java installation.
# tar xf hadoop-2.6.0.tar.gz -C /usr/local
# mv /usr/local/hadoop-2.6.0 /usr/local/hadoop
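The hadoop-env.sh edit mentioned above can be as small as appending a single line (a sketch, run from the Hadoop installation directory and assuming the JDK path installed earlier):
# echo "export JAVA_HOME=/usr/java/jdk1.8.0_65" >> etc/hadoop/hadoop-env.sh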
Try the following command:
# bin/hadoop
The usage document for the hadoop script will be displayed.
Now you can start the Hadoop cluster in one of the three supported modes:
Stand-alone mode
Pseudo-distributed mode
Fully distributed mode
Running in stand-alone mode
By default, Hadoop is configured to run in non-distributed mode as a single Java process, which is very helpful for debugging.
Now we can run an example to get a feel for how Hadoop operates. Hadoop ships with a wealth of examples, including wordcount, terasort, join, and grep.
Here we run the grep example: it takes all the files in the input folder as input, filters out the words that match the regular expression dfs[a-z.]+ and counts their occurrences, and finally writes the results to the output folder.
# mkdir input
# cp ./etc/hadoop/*.xml input
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep ./input ./output 'dfs[a-z.]+'
# cat ./output/*
If the job runs successfully, it prints a lot of information about its progress; the final output is shown below. The result is written to the specified output folder, and cat ./output/* shows that the matching word dfsadmin appears once.
[10:57:58] [hadoop@ocean-lab hadoop-2.6.0]$ cat ./output/*
1 dfsadmin
Note that Hadoop does not overwrite output files by default, so running the example again will fail unless you delete ./output first.
Otherwise, the following error will be reported
INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/usr/local/hadoop-2.6.0/output already exists
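Clearing the local output directory before re-running avoids this error; using the path from the example above:
# rm -r ./output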
If prompted "INFO metrics.MetricsUtil: Unable to obtain hostName java.net.UnknowHostException", you need to execute the following command to modify the hosts file to add the IP mapping to your hostname:
# cat / etc/hosts | grep ocean-lab
192.168.9.70 ocean-lab.ocean.org ocean-lab
Running in pseudo-distributed mode
Hadoop can run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.
The node acts as both NameNode and DataNode, and jobs read the files stored in HDFS.
Before writing the pseudo-distributed configuration, we also need to set the Hadoop environment variables by appending the following to ~/.bashrc:
# Hadoop environment variables
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
# source ~/.bashrc    # make the settings take effect
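A quick sanity check that the variables and PATH took effect (a sketch; the hadoop command is available once $HADOOP_HOME/bin is on PATH):
# echo $HADOOP_HOME
/usr/local/hadoop-2.6.0
# hadoop version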
Configuration
Use the following etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Similarly, modify the configuration file etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-2.6.0/tmp/dfs/data</value>
    </property>
</configuration>
A note about the Hadoop configuration items
Although only fs.defaultFS and dfs.replication are strictly required to run (as in the official tutorial), if hadoop.tmp.dir is left unset the default temporary directory is /tmp/hadoop-hadoop, which may be cleaned up by the system on reboot and would then force you to re-run the format step. So we set it explicitly and also specify dfs.namenode.name.dir and dfs.datanode.data.dir, otherwise errors may occur in the following steps.
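The effective values can be checked without starting any daemons (a sketch, run from the Hadoop installation directory; getconf is a sub-command of the hdfs front end in Hadoop 2.x):
# ./bin/hdfs getconf -confKey hadoop.tmp.dir
# ./bin/hdfs getconf -confKey fs.defaultFS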
Passwordless SSH setup
Now confirm whether you can log in to localhost with ssh without entering a password:
# ssh localhost date
If you cannot log in to localhost with ssh without entering a password, execute the following command:
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# chmod 600 ~/.ssh/authorized_keys
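After these steps, the earlier login test should now succeed without prompting. BatchMode makes ssh fail instead of asking for a password if key authentication is still not working (a quick check, not part of the original steps):
# ssh -o BatchMode=yes localhost date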
Format a new distributed file system:
$ bin/hadoop namenode -format
15-12-23 11:30:20 INFO util.GSet: VM type       = 64-bit
15-12-23 11:30:20 INFO util.GSet: 0.0299999329447746% max memory 966.7 MB = 297.0 KB
15-12-23 11:30:20 INFO util.GSet: capacity      = 2^15 = 32768 entries
15-12-23 11:30:20 INFO namenode.NNConf: ACLs enabled? false
15-12-23 11:30:20 INFO namenode.NNConf: XAttrs enabled? true
15-12-23 11:30:20 INFO namenode.NNConf: Maximum size of an xattr: 16384
15-12-23 11:30:20 INFO namenode.FSImage: Allocated new BlockPoolId: BP-823870322-192.168.9.70-1450841420347
15-12-23 11:30:20 INFO common.Storage: Storage directory /usr/local/hadoop-2.6.0/tmp/dfs/name has been successfully formatted.
15-12-23 11:30:20 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15-12-23 11:30:20 INFO util.ExitUtil: Exiting with status 0
15-12-23 11:30:20 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ocean-lab.ocean.org/192.168.9.70
************************************************************/
If successful, you will see the messages "successfully formatted" and "Exiting with status 0".
Note
The next time you start Hadoop, you do not need to format the NameNode again; just run ./sbin/start-dfs.sh.
Start NameNode and DataNode
$ ./sbin/start-dfs.sh
15-12-23 11:37:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-namenode-ocean-lab.ocean.org.out
localhost: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-datanode-ocean-lab.ocean.org.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is a5:26:42:a0:5f:da:a2:88:52:04:9c:7f:8d:6a:98:9b.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-secondarynamenode-ocean-lab.ocean.org.out
15-12-23 11:37:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable
[13:57:08] [hadoop@ocean-lab hadoop-2.6.0]$ jps
27686 SecondaryNameNode
28455 Jps
27501 DataNode
27405 NameNode
27006 GetConf
If any of these processes is missing, the startup failed; check the logs to debug.
After a successful startup, you can visit the web interface at http://[ip,fqdn]:50070 to view NameNode and DataNode information and to browse the files in HDFS online.
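If one of the daemons is missing from jps, its .log file under the logs directory usually explains why, and once everything is up the NameNode web UI should respond locally. A minimal check, assuming the install path and hostname used in this setup:
# tail -n 50 /usr/local/hadoop-2.6.0/logs/hadoop-hadoop-namenode-ocean-lab.ocean.org.log
# curl -s http://localhost:50070/ | head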
Running a pseudo-distributed example
In stand-alone mode above, the grep example read local data; in pseudo-distributed mode it reads data from HDFS.
To use HDFS, you first need to create a user directory in HDFS:
# ./bin/hdfs dfs -mkdir -p /user/hadoop
# ./bin/hadoop fs -ls /user/hadoop
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2015-12-23 15:03 /user/hadoop/input
Next, copy the XML files in ./etc/hadoop to the distributed file system as the input files, that is, copy /usr/local/hadoop/etc/hadoop into /user/hadoop/input on HDFS. Since we are working as the hadoop user and have already created the corresponding user directory /user/hadoop, we can use a relative path such as input in the commands; its absolute path is /user/hadoop/input:
# ./bin/hdfs dfs -mkdir input
# ./bin/hdfs dfs -put ./etc/hadoop/*.xml input
After the copy completes, you can list the files in HDFS with the following command:
# ./bin/hdfs dfs -ls input
-rw-r--r-- 1 hadoop supergroup 4436 2015-12-23 16:46 input/capacity-scheduler.xml
-rw-r--r-- 1 hadoop supergroup 1180 2015-12-23 16:46 input/core-site.xml
-rw-r--r-- 1 hadoop supergroup 9683 2015-12-23 16:46 input/hadoop-policy.xml
-rw-r--r-- 1 hadoop supergroup 1136 2015-12-23 16:46 input/hdfs-site.xml
-rw-r--r-- 1 hadoop supergroup 620 2015-12-23 16:46 input/httpfs-site.xml
-rw-r--r-- 1 hadoop supergroup 3523 2015-12-23 16:46 input/kms-acls.xml
-rw-r--r-- 1 hadoop supergroup 5511 2015-12-23 16:46 input/kms-site.xml
-rw-r--r-- 1 hadoop supergroup 858 2015-12-23 16:46 input/mapred-site.xml
-rw-r--r-- 1 hadoop supergroup 690 2015-12-23 16:46 input/yarn-site.xml
A pseudo-distributed MapReduce job is run in the same way as in stand-alone mode, except that it reads its files from HDFS (you can delete the local input and output folders created in the stand-alone step to verify this).
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
View the results of the run (the output is located in HDFS):
./bin/hdfs dfs -cat output/*
1 dfsadmin
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.datanode.data.dir
The results are shown above; note that because we changed the configuration files, they differ from the stand-alone run.
(Figure: result of running the grep example in pseudo-distributed mode)
We can also get the running results back locally:
# rm -r ./output    # delete the local output folder first (if it exists)
# ./bin/hdfs dfs -get output ./output    # copy the output folder from HDFS to the local machine
# cat ./output/*
1 dfsadmin
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.datanode.data.dir
When running a Hadoop program, the output directory must not already exist; otherwise the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists" is raised. To run the job again, first delete the output folder:
# Delete the output folder
$ ./bin/hdfs dfs -rm -r output
Deleted output
The output directory must not exist when running a program
To prevent results from being accidentally overwritten, the output directory specified by a Hadoop program (such as output) must not already exist, otherwise the job fails with the error above, so the output directory has to be deleted before each run. When developing a real application, you can have the program delete the output directory itself at startup, which avoids this repetitive command-line step:
// requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path, org.apache.hadoop.mapreduce.Job
Configuration conf = new Configuration();
Job job = new Job(conf);
/* Delete the output directory (if any) so the job can be re-run */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);
To close Hadoop, run
./sbin/stop-dfs.sh
Start YARN
(A pseudo-distributed setup does not have to start YARN, and this generally does not affect program execution.)
Some readers may wonder why, after starting Hadoop, they do not see the JobTracker and TaskTracker described in older books: newer versions of Hadoop use the new MapReduce framework (MapReduce V2, also known as YARN, Yet Another Resource Negotiator).
YARN was split out of MapReduce and is responsible for resource management and job scheduling; MapReduce now runs on top of YARN, which provides high availability and scalability. A more detailed introduction to YARN is beyond the scope of this document; consult the relevant materials if you are interested.
Starting Hadoop with ./sbin/start-dfs.sh as above only starts the HDFS daemons, and MapReduce jobs still run locally. We can also start YARN and let it take over resource management and task scheduling.
First modify the configuration file etc/hadoop/mapred-site.xml (Hadoop 2.6.0 ships it as mapred-site.xml.template, so rename or copy it to mapred-site.xml first):
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Then modify the configuration file etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Then you can start YARN (run ./sbin/start-dfs.sh first):
# ./sbin/start-yarn.sh    # start YARN
# ./sbin/mr-jobhistory-daemon.sh start historyserver    # start the history server so job status can be viewed in the web UI
When enabled, you can see that there are two background processes, NodeManager and ResourceManager, via jps:
[09:18:34] [hadoop@ocean-lab ~]$ jps
27686 SecondaryNameNode
6968 ResourceManager
7305 Jps
7066 NodeManager
27501 DataNode
27405 NameNode
After starting YARN, examples are run in exactly the same way; only resource management and task scheduling change. Looking at the log output, you can see that without YARN the job is run by "mapred.LocalJobRunner", while with YARN enabled it is handled by "mapred.YARNRunner". One advantage of starting YARN is that you can follow job status through the web interface at http://[ip,fqdn]:8088/cluster.
(Figure: job information shown in the YARN web interface)
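One way to confirm which path is being used is to filter the client output when submitting the example again (a sketch; output2 is just a fresh, hypothetical output directory name so the job does not collide with the earlier run):
# ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output2 'dfs[a-z.]+' 2>&1 | grep -iE "LocalJobRunner|RMProxy"
Without YARN the filtered lines come from mapred.LocalJobRunner; with YARN running, the client first connects to the ResourceManager (client.RMProxy) before the job is scheduled.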
However, YARN mainly provides better resource management and task scheduling for a cluster; on a single machine it adds no real value and makes programs run slightly slower. Whether to enable YARN on a single machine therefore depends on your needs.
If you do not intend to start YARN, delete or rename mapred-site.xml (for example, back to mapred-site.xml.template).
Otherwise, if the configuration file exists but YARN is not running, programs will keep retrying with the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032".
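Renaming the file back to its template name is enough to take the YARN submission path out of the picture (a sketch, relative to the Hadoop installation directory):
# mv etc/hadoop/mapred-site.xml etc/hadoop/mapred-site.xml.template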
Similarly, the script to shut down YARN is as follows:
# ./sbin/stop-yarn.sh
# ./sbin/mr-jobhistory-daemon.sh stop historyserver
Common Hadoop commands
# View the list of files in HDFS
hadoop fs -ls /usr/local/log/
# Create a directory
hadoop fs -mkdir /usr/local/log/test
# Delete files
hadoop fs -rm /usr/local/log/07
# Upload a local file to the HDFS directory /usr/local/log/
hadoop fs -put /usr/local/src/infobright-4.0.6-0-x86_64-ice.rpm /usr/local/log/
# Download a file from HDFS
hadoop fs -get /usr/local/log/infobright-4.0.6-0-x86_64-ice.rpm /usr/local/src/
# View a file
hadoop fs -cat /usr/local/log/zabbix/access.log.zabbix
# View basic HDFS usage
# hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Configured Capacity: 29565767680 (27.54 GB)
Present Capacity: 17956433920 (16.72 GB)
DFS Remaining: 17956405248 (16.72 GB)
DFS Used: 28672 (28 KB)
DFS Used%: 0.005%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Live datanodes (1):
Name: 127.0.0.1:50010 (localhost)
Hostname: ocean-lab.ocean.org
Decommission Status: Normal
Configured Capacity: 29565767680 (27.54 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 11609333760 (10.81 GB)
DFS Remaining: 17956405248 (16.72 GB)
DFS Used%: 0.005%
DFS Remaining%: 60.73%
Configured Cache Capacity: 0 (0B)
Cache Used: 0 (0B)
Cache Remaining: 0 (0B)
Cache Used%: 100.00%
Cache Remaining%: 0.005%
Xceivers: 1
Last contact: Thu Dec 24 09:52:14 CST 2015
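The deprecation notice at the top of this report points to the hdfs front end; the equivalent non-deprecated invocation is:
# hdfs dfsadmin -report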
With that, you have mastered the basic configuration and use of Hadoop.