2025-01-31 Update From: SLTechnology News&Howtos, Shulou (Shulou.com) 06/03 Report
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, making full use of the power of the cluster for high-speed computation and storage.
The Hadoop core project provides the basic services for building a cloud computing environment on low-end hardware, and it also provides the necessary APIs for software running in the cloud.
The two basic parts of the Hadoop kernel are the MapReduce framework (the cloud computing environment) and the HDFS distributed file system. In the core framework of Hadoop, MapReduce is often called mapred and HDFS is often called dfs. HDFS provides storage for massive data, and MapReduce provides computation over it.
The core concept of MapReduce is to divide the input data into different logical blocks, and the Map task first processes each block separately in parallel. The processing results of these logical blocks are reassembled into different sorted sets, which are finally processed by the Reduce task.
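The same divide, sort, and combine flow can be sketched with ordinary shell tools (a toy illustration only, not Hadoop itself): `tr` plays the role of the map phase, `sort` the shuffle, and `uniq -c` the reduce.

```shell
# Map: split the input into one word per line.
# Shuffle: sort brings identical keys together.
# Reduce: uniq -c aggregates each group into a count.
printf 'dfs mapred dfs hdfs dfs\n' | tr ' ' '\n' | sort | uniq -c
```

The output is one count per distinct word (3 for dfs, 1 each for hdfs and mapred), which is exactly the shape of a MapReduce wordcount result.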
HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data, which suits applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access).
Reference: hadoop.apache.org
Experimental environment: rhel6.5
Master: server7; slaves: server8 and server9. Note: the hostnames must resolve to each other on every machine.
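One way to satisfy the mutual-resolution requirement without DNS is identical /etc/hosts entries on every machine (a sketch; the addresses follow the 172.25.0.x scheme used later in this article, so adjust them to your network):

```
172.25.0.7  server7.example.com server7
172.25.0.8  server8.example.com server8
172.25.0.9  server9.example.com server9
```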
Installation and basic configuration
Each machine creates a hadoop user with uid 900; the password is redhat.
On server7, as the hadoop user, under /home/hadoop:
# tar zxf hadoop-1.2.1.tar.gz -C /home/hadoop
# cd /home/hadoop
# ln -s hadoop-1.2.1/ hadoop
# sh jdk-6u32-linux-x64.bin   // install java
# ln -s jdk1.6.0_32 java
# vim ~/.bash_profile   // configure the path
export JAVA_HOME=/home/hadoop/java
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
# source ~/.bash_profile
# echo $JAVA_HOME   // shows /home/hadoop/java
# cd hadoop/conf
# vim hadoop-env.sh
Modify: export JAVA_HOME=/home/hadoop/java
# mkdir ../input
# cp *.xml ../input   // prepare some sample input files
# cd ..
# bin/hadoop jar hadoop-examples-1.2.1.jar
Running the jar without further arguments lists the example programs it provides for operating on input, such as grep (search), sort (sorting), and wordcount (counting).
# bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'   // search the files in input for strings beginning with dfs followed by lowercase letters or dots, and store the results in the automatically created output folder
# cd output/
# ls
# cat *
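The pattern passed to the example grep job is an ordinary regular expression; the intended form appears to be 'dfs[a-z.]+' (dfs followed by one or more lowercase letters or dots). Its behaviour can be previewed with plain grep -E before running the Hadoop job:

```shell
# Matches property names such as dfs.replication; a line without a
# lowercase letter or dot right after "dfs" produces no match.
printf 'dfs.replication\ndfs.data.dir\nmapred.job.tracker\n' | grep -oE 'dfs[a-z.]+'
```

This prints dfs.replication and dfs.data.dir, while mapred.job.tracker yields nothing, mirroring what the Hadoop grep example extracts from the copied *.xml files.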
Then introduce the three working modes of hadoop.
Stand-alone mode (standalone)
Stand-alone mode is the default mode of Hadoop. When the Hadoop source package is first decompressed, Hadoop knows nothing about the hardware environment, so it conservatively chooses the minimum configuration. All three XML configuration files are empty in this default mode. When the configuration files are empty, Hadoop runs entirely locally. Because there is no need to interact with other nodes, stand-alone mode does not use HDFS and does not load any Hadoop daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.
Pseudo-distributed mode (Pseudo-Distributed Mode)
Pseudo-distributed mode runs Hadoop on a "single node cluster", where all daemons are running on the same machine. This mode adds code debugging to stand-alone mode, allowing you to check memory usage, HDFS input and output, and other daemon interactions.
Fully distributed mode
The Hadoop daemon runs on a cluster.
The above operation is in stand-alone mode.
Hadoop distributed deployment
Structure:
The master node includes the name node (namenode), the secondary name node (secondarynamenode), and the jobtracker daemon (the so-called master daemons), as well as the utilities and browser interfaces used to manage the cluster.
The slave nodes include the tasktracker and the data node (datanode, the slave daemons). The difference between the two setups is that the master node runs the daemons that manage and coordinate the Hadoop cluster, while the slave nodes run the daemons that implement Hadoop file system (HDFS) storage and the MapReduce function (data processing).
The role of each daemon in the Hadoop framework:
Namenode is the master server in Hadoop that manages file system namespaces and access to files stored in the cluster.
The secondary namenode is not a redundant backup of the namenode; it provides periodic checkpointing and cleanup tasks.
You can find a namenode and a secondary namenode in each Hadoop cluster.
Datanode manages storage connected to nodes (there can be multiple nodes in a cluster). Each node that stores the data runs a datanode daemon.
Each cluster has a jobtracker, which is responsible for scheduling work on the datanode.
Each slave node runs a tasktracker; the tasktrackers do the actual work.
Jobtracker and tasktracker operate master-slave: the jobtracker dispatches work across the datanodes and the tasktrackers execute the tasks. The jobtracker also monitors submitted work; if a datanode fails for some reason, the jobtracker reschedules its tasks elsewhere.
The following implements pseudo-distribution
For convenience, set ssh password-free.
Hadoop user on Server7.
# ssh-keygen
# ssh-copy-id localhost
# ssh localhost   // log in to this machine without a password
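If the interactive prompts of ssh-keygen are unwanted, the key can be generated non-interactively. A sketch, writing to a temporary directory so it cannot clobber an existing key; in the actual setup you would use the default ~/.ssh/id_rsa path:

```shell
# -q: quiet, -t rsa: key type, -N '': empty passphrase, -f: output file.
keydir=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$keydir/id_rsa"
ls "$keydir"    # id_rsa (private key) and id_rsa.pub (public key)
```

The public key is then appended to ~/.ssh/authorized_keys on the target host, which is exactly what ssh-copy-id automates.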
Modify the configuration file:
# cd hadoop/conf
# vim core-site.xml
Add the following (specifies the namenode):
  <property>
    <name>fs.default.name</name>
    <value>hdfs://172.25.0.7:9000</value>
  </property>
# vim mapred-site.xml
Add the following (specifies the jobtracker):
  <property>
    <name>mapred.job.tracker</name>
    <value>172.25.0.7:9001</value>
  </property>
# vim hdfs-site.xml
Add the following (specifies the number of copies of each block; 1 because pseudo-distributed stores the single copy locally):
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
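Note that in each of these files the property entries must sit inside the file's single configuration element, which the snippets above omit. For example, the complete hdfs-site.xml for this pseudo-distributed setup would look roughly like:

```
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```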
# cd ..
# bin/hadoop namenode -format   // format the namenode
# bin/start-dfs.sh   // start hdfs
# jps   // view the Java processes
You can see that secondarynamenode, namenode and datanode have all started. Namenode and datanode are on the same machine, so this is a pseudo-distributed setup.
# bin/start-mapred.sh   // start mapreduce
# bin/hadoop fs -put input test   // upload input to hdfs, where it is renamed test
Browse the network interfaces of NameNode and JobTracker, and their addresses are:
NameNode-http://172.25.0.7:50070/
JobTracker-http://172.25.0.7:50030/
View namenode
# bin/hadoop fs -ls test   // list the files in the test directory in hdfs
View files under test on Web
The following implements a fully distributed mode
Both the master and the slaves install nfs-utils and start the rpcbind service (rpcbind tells NFS clients which port numbers the server's NFS services use; think of rpc as a mediation service). The slaves then use hadoop directly from the master over NFS, which removes the per-machine installation and configuration.
On server7, start the nfs service
# vim /etc/exports
/home/hadoop *(rw,all_squash,anonuid=900,anongid=900)
// share the hadoop home directory; all client users are squashed to uid/gid 900 (the hadoop user)
On server8 and server9:
# mount 172.25.0.7:/home/hadoop /home/hadoop/   // mount the shared directory
On server7, as the hadoop user, edit hdfs-site.xml under hadoop/conf and change the number of copies from 1 to 2.
# cd hadoop/conf
# vim slaves   // add the slaves
172.25.0.8
172.25.0.9
# vim masters   // set the master
172.25.0.7
Format the pseudo-distributed file system before starting fully distributed mode
# cd..
# bin/stop-all.sh   // stop jobtracker, namenode and secondarynamenode
# bin/hadoop-daemon.sh stop tasktracker
# bin/hadoop-daemon.sh stop datanode   // stop tasktracker and datanode
# bin/hadoop namenode -format
# bin/start-dfs.sh   // the output shows connections to server8 and server9
# bin/start-mapred.sh
The jobtracker process has been added.
On server8, jps shows three processes: Jps, DataNode and TaskTracker.
The slaves can also upload, query, and so on:
# bin/hadoop fs -put input test
# bin/hadoop jar hadoop-examples-1.2.1.jar grep test out 'dfs[a-z.]+'
On server7
# bin/hadoop dfsadmin -report   // display hdfs information
Since no files have been added yet, DFS Used% is 0%.
# dd if=/dev/zero of=bigfile bs=1M count=200
# bin/hadoop fs -put bigfile test
The web page now shows DFS Used of 403.33 MB (replication 2: each of the two slaves stores a copy of the 200 MB file).
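As a sanity check on that figure, raw DFS usage is roughly file size times replication factor (the few extra MB reported come from the files uploaded earlier):

```shell
# 200 MB file stored with 2 replicas consumes about 400 MB of raw DFS capacity.
filesize_mb=200
replication=2
echo $((filesize_mb * replication))   # 400
```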
Note: operational errors sometimes put hadoop into safe mode, in which uploads and similar operations fail.
Just run the following command:
# bin/hadoop dfsadmin -safemode leave
Hadoop supports real-time expansion and can add slaves online.
Add slave server10: install nfs-utils and start the rpcbind service, add a hadoop user with uid 900, mount /home/hadoop from server7, and add 172.25.0.10 to slaves under hadoop/conf.
Note: hostname resolution of server10 must be added on the master and slave before adding server10.
On server10, hadoop users
# bin/hadoop-daemon.sh start datanode
# bin/hadoop-daemon.sh start tasktracker
On server7
# bin/hadoop dfsadmin -report
You can see the information of server10
You can see that server10's DFS Used is 0; the data on server9 can now be migrated to server10.
Data migration:
Data migration is the process of moving rarely or unused files to a secondary storage system.
Hadoop removes server9 datanode nodes online to achieve data migration:
# bin/hadoop-daemon.sh stop tasktracker   // a node undergoing data migration should not keep running a tasktracker, otherwise exceptions occur
Modify conf/hdfs-site.xml on the master (dfs.hosts.exclude is an HDFS setting, so it belongs in hdfs-site.xml)
Add the following:
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/hadoop-1.0.4/conf/datanode-excludes</value>
  </property>
Create datanode-excludes under conf and add the datanodes to be removed, one per line:
# vim datanode-excludes
172.25.0.9   // remove node server9
# cd..
# bin/hadoop dfsadmin -refreshNodes   // refresh the nodes online
# bin/hadoop dfsadmin -report
You can see the server9 status: Decommission in progress
To delete a tasktracker node online
Modify conf/mapred-site.xml on server7 and add:
  <property>
    <name>mapred.hosts.exclude</name>
    <value>/home/hadoop/hadoop-1.0.4/conf/tasktracker-excludes</value>
  </property>
Create the tasktracker-excludes file and add the hostnames to be removed, one per line:
server9.example.com
# bin/hadoop mradmin -refreshNodes
When the status of this node is displayed as Decommissioned, the data migration is complete and can be safely shut down.
The hadoop 1.2.1 version is old, and the jobtracker's scheduling capacity is limited, so it easily becomes a bottleneck when there are many slaves. Moving to the newer version 2.6.4 is a good choice.
Stop the process and delete the file:
On server7
# bin/stop-all.sh
# cd / home/hadoop
# rm -fr hadoop java hadoop-1.2.1 jdk1.6.0_32
# rm -fr /tmp/*
On the slaves:
# bin/hadoop-daemon.sh stop datanode
# bin/hadoop-daemon.sh stop tasktracker
# rm -fr /tmp/*
The following operation is basically the same as above.
As the hadoop user on server7, under /home/hadoop:
# tar zxf jdk-7u79-linux-x64.tar.gz -C /home/hadoop/
# ln -s jdk1.7.0_79 java
# tar zxf hadoop-2.6.4.tar.gz
# ln -s hadoop-2.6.4 hadoop
# cd hadoop/etc/hadoop
# vim hadoop-env.sh
export JAVA_HOME=/home/hadoop/java
export HADOOP_PREFIX=/home/hadoop/hadoop
# cd / home/hadoop/hadoop
# mkdir input
# cp etc/hadoop/*.xml input
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar grep input output 'dfs[a-z.]+'
# cat output/*
The grep job runs with a native-library warning, which can cause problems when the cluster is large; the hadoop native libraries need to be added.
# tar -xf hadoop-native-64.2.6.0.tar -C hadoop/lib/native/
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar grep input output 'dfs[a-z.]+'
There is no warning after compilation.
# cd etc/hadoop
# vim slaves
172.25.0.8
172.25.0.9
# vim etc/hadoop/core-site.xml
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.25.0.7:9000</value>
  </property>
# vim hdfs-site.xml
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
# bin/hdfs namenode -format
# sbin/start-dfs.sh
# jps
# ps -ax   // shows the namenode and secondarynamenode processes
# bin/hdfs dfs -mkdir /user/hadoop
# bin/hdfs dfs -put input/ test
The uploaded input can be seen on the web as test.
MapReduce's JobTracker/TaskTracker mechanism requires large-scale adjustments to fix its shortcomings in scalability, memory consumption, threading model, reliability and performance.
In order to fundamentally solve the performance bottleneck of the old MapReduce framework and promote the longer-term development of Hadoop, since version 0.23.0 the MapReduce framework of Hadoop has been completely reconstructed and fundamentally changed. The new Hadoop MapReduce framework is named MapReduce V2, or Yarn.
# vim etc/hadoop/yarn-site.xml
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>server7.example.com</value>
  </property>
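A complete minimal yarn-site.xml for this setup might therefore look as follows. The yarn.nodemanager.aux-services entry is commonly required in addition so that MapReduce jobs can shuffle data on YARN; it is an assumption here, since the original text does not show it:

```
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>server7.example.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```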
# sbin/start-yarn.sh
# jps
On server8, jps shows that the processes have started.