Introduction
As enterprise data and Hadoop came into ever wider use, the framework design of Hadoop 1.x became increasingly unable to meet users' needs. Apache kept patching Hadoop 1.x and finally launched a new generation, Hadoop 2.x. Given the industry's changing use of distributed systems and the long-term development of the Hadoop framework, MapReduce's JobTracker/TaskTracker mechanism needed large-scale adjustment to fix its shortcomings in scalability, memory consumption, threading model, reliability and performance. To fundamentally resolve the performance bottlenecks of the old MapReduce framework and promote the longer-term development of Hadoop, the MapReduce framework was completely rebuilt starting with version 0.23.0. The new Hadoop MapReduce framework is named MapReduce v2, or YARN.
I. Differences between Hadoop 1.x and Hadoop 2.x
1.1 Differences in HDFS
1.1.1 Hadoop1.x
In Hadoop 1.x, HDFS uses a master/slave design: the cluster is managed by a NameNode and DataNodes. HDFS in Hadoop 1.x is divided into two parts: the Namespace and the Block Storage Service. The Namespace lives entirely on the NameNode and includes the metadata of all files, the fsimage file, the edits file, and so on. The Block Storage Service, by contrast, is distributed across the NameNode and the DataNodes: the mapping between blocks and DataNodes is stored on the NameNode, while the block data itself is distributed across the DataNodes. This is shown in figure 1, the HDFS schematic diagram of Hadoop 1.x.
Fig. 1 Schematic diagram of HDFS in Hadoop 1.x
Disadvantages:
(1) Because the NameNode is the center of the entire cluster, if the NameNode goes down, the whole cluster is paralyzed, and nothing works until the NameNode is restarted.
(2) There is only one NameNode, and the performance of a single machine is limited. Since information about every DataNode is kept on the NameNode, DataNodes cannot, even in theory, be added horizontally without limit; this is why a NameNode supports at most 4,000 nodes.
1.1.2 Hadoop 2.x
Hadoop 2.x implements federated HDFS: multiple NameNodes coexist, and each NameNode manages one Namespace, as shown in the HDFS diagram of figure 2.
Fig. 2 HDFS diagram of Hadoop2.x
Block Pool: the set of all blocks managed by one NameNode. A NameNode together with its Block Pool forms a self-contained management unit, each managing its own blocks.
In federated HDFS, each Namespace manages its own blocks, but all of these blocks are stored across the entire DataNode cluster. As shown in the figure above, the Namespaces are isolated from each other: even if one NameNode goes down, it affects neither the other Namespaces nor the blocks they manage on the DataNodes.
Advantages:
(1) DataNodes can be scaled out horizontally without restriction.
(2) Multiple NameNodes can execute tasks concurrently, improving the throughput of the HDFS system.
(3) Reliability is greatly improved: the crash of a single NameNode will not paralyze the whole system.
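As a minimal sketch of how federation is wired up (not part of this lab; the nameservice IDs ns1/ns2 and NameNode hosts nn1/nn2 below are hypothetical), hdfs-site.xml simply lists every NameNode, and each DataNode registers with all of them:
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<!-- one RPC address per nameservice; nn1/nn2 are placeholder hostnames -->
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1:9000</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2:9000</value>
</property>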
1.2 Differences in MapReduce
1.2.1 Hadoop1.x
The process by which Hadoop 1.x runs a MapReduce task is:
(1) The Job Client submits the job to the JobTracker (on the NameNode node), and the JobTracker sends query requests to each node to check the number of tasks executing on each DataNode.
(2) The JobTracker collects the DataNodes' information and allocates resources to the job.
(3) All the resources and information needed by the MapReduce task are copied to the DataNodes.
(4) After a DataNode accepts the task, it reads its local blocks and forms the corresponding Map and Reduce tasks. Management of these tasks is supervised entirely by the TaskTracker on the DataNode. This is shown in the MapReduce diagram of figure 3.
Fig. 3 schematic diagram of MapReduce
As the figure shows, the JobTracker is the center of the whole Hadoop 1.x MapReduce framework, taking on accepting jobs, computing resources, allocating resources, communicating with DataNodes, and so on. The Hadoop 1.x framework was very popular when it was released, but as demand grew, MapReduce in Hadoop 1.x (MapReduce v1) could no longer keep up, mainly because of the following problems:
(1) The JobTracker is the core of the whole of MapReduce v1, so it is a single point of failure.
(2) The JobTracker manages every task of the entire MapReduce job, which consumes resources: when there are too many map/reduce tasks, the JobTracker uses a great deal of memory, increasing the risk that the JobTracker fails.
(3) When the JobTracker queries DataNodes, it considers only the number of tasks in use rather than actual memory and CPU utilization, so memory may overflow if two memory-hungry map/reduce tasks are executed on one node.
(4) Some classes in the code run to more than 3,000 lines, so the responsibility of a whole class is not clear enough and any change is a huge job, which makes maintenance and modification difficult for developers.
1.2.2 Hadoop2.x
To cope with growing demand and the drawbacks of MapReduce v1, Apache redesigned it, and MapReduce v2, that is, the YARN framework, emerged. Let's introduce the YARN framework, shown in the schematic diagram of figure 4.
Fig. 4 schematic diagram of YARN
Terminology:
ResourceManager: hereinafter referred to as RM. The central control module of YARN, responsible for unified planning of resource usage.
NodeManager: hereinafter referred to as NM. The per-node resource module of YARN, responsible for starting and managing containers.
ApplicationMaster: hereinafter referred to as AM. Each application in YARN launches an AM, which is responsible for requesting resources from the RM, asking NMs to start containers, and telling the containers what to do.
Container: the resource container. All applications in YARN run inside containers; the AM also runs in a container, but the AM's container is requested from the RM.
(1) ResourceManager: in MapReduce v1, the JobTracker had two jobs, resource management and task scheduling. In the YARN framework these two core jobs are separated, and the resource-management half becomes the new ResourceManager. The ResourceManager is responsible for managing the state of the resources (memory, CPU, disk, bandwidth, and so on) provided by each NodeManager node. During a MapReduce job, the RM calculates the resources of the entire cluster and allocates appropriate resources for the job.
(2) Container: an abstract description of a bundle of a node's resources, such as memory and CPU.
(3) ApplicationMaster: each MapReduce job corresponds to one AM. The AM is responsible for asking the ResourceManager for the resource containers needed to execute the job, tracking the state of the tasks, managing the tasks, and handling the causes of task failure.
(4) NodeManager: the agent on each machine in the framework and the container in which tasks execute. It manages the node's information, such as memory, CPU, hard disk, network, and other resources.
Advantages of YARN over MapReduce v1:
(1) The huge burden borne by the JobTracker is split between the ResourceManager and the NodeManagers: resource management and task scheduling are distributed across different nodes, so the work is distributed and optimized.
(2) ResourceManager resource allocation no longer depends on the number of slots; tasks are allocated according to the nodes' memory, which makes load balancing work far better.
(3) An ApplicationsManager component on the ResourceManager node is responsible for managing the state of every ApplicationMaster process, which implements monitoring of the tasks.
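To see this division of labor in practice, one can submit a job and watch YARN track its ApplicationMaster. A minimal sketch, assuming the examples jar bundled with the 2.6.4 distribution:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar pi 2 10
$ yarn application -list

The second command asks the ResourceManager for the applications it is tracking, including each one's state and progress.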
1.3 Other differences
MapReduce has become just another application on YARN, like HBase and Hive; the default block size in Hadoop 1.x is 64 MB, while in Hadoop 2.x it is 128 MB; and in 2.x, in addition to DataNodes reporting their status to the NameNode, NodeManagers also report their status to the ResourceManager.
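The block-size change is easy to confirm on a running 2.x cluster with the stock getconf tool (the value is reported in bytes, so 128 MB shows up as 134217728):

$ hdfs getconf -confKey dfs.blocksize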
II. Upgrading Hadoop 1.x to Hadoop 2.x
Versions: old version Hadoop 1.0.3; new version Hadoop 2.6.4.
HOST Information:
Download the installation packages for the upgrade:
hadoop-2.6.4.tar.gz: http://apache.opencas.org/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
jdk-8u77-linux-x64.tar.gz: download from the official website
Placement path of the packages: /usr/local/src/
Create a new HDFS system directory and test files:
[root@namenode ~]# hadoop fs -mkdir /test
[root@namenode ~]# hadoop fs -put /home/hadoop/hadoop/conf/* /test/
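Before changing anything, it is worth confirming that the test data actually landed in HDFS; a quick listing (same paths as above) should show the copied conf files:

[root@namenode ~]# hadoop fs -ls /test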
Extract the jdk installation package (for each node):
[root@namenode ~]# cd /usr/local/src
[root@namenode ~]# tar zxvf jdk-8u77-linux-x64.tar.gz
Back up the old jdk (each node needs to operate):
[root@namenode ~]# mv /usr/local/jdk1.6 /usr/local/jdk1.6.bak
Replace the new jdk version (each node needs to operate):
[root@namenode ~]# mv jdk1.8.0_77 /usr/local/jdk/
Modify the jdk environment (each node needs to operate):
[root@namenode ~]# vim /etc/profile
Change JAVA_HOME:
export JAVA_HOME=/usr/local/jdk
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
[root@namenode ~]# source /etc/profile
Verify that the jdk upgrade succeeded:
[root@namenode ~]# java -version
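Since every node needs the same JDK, a small loop over the other nodes saves logging in to each one by hand. A sketch, assuming passwordless ssh and the node1/node2 hostnames used later in this lab:

[root@namenode ~]# for h in node1 node2; do ssh $h /usr/local/jdk/bin/java -version; done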
2.2 Operations on the namenode node
Extract the hadoop2.6 package:
[root@namenode ~]# tar zxvf hadoop-2.6.4.tar.gz
Back up the hadoop1.0 (each node operates):
[root@namenode ~]# mkdir /home/hadoop/backup
[root@namenode ~]# mv /home/hadoop/hadoop /home/hadoop/backup/
Back up the metadata of the cluster's namenode (the folder configured by dfs.name.dir in ${HADOOP_HOME}/conf/hdfs-site.xml):
[root@namenode ~]# mkdir -p /data/backup
[root@namenode ~]# cp -r /data/work/hdfs/name /data/backup/hdfsname.20160418.bak
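A metadata backup is only useful if it is faithful, so it is worth comparing the copy against the original before proceeding (a sketch using the paths above; it prints nothing but the confirmation if the trees match):

[root@namenode ~]# diff -r /data/work/hdfs/name /data/backup/hdfsname.20160418.bak && echo backup OK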
Install hadoop2.6:
[root@namenode ~]# mv /usr/local/src/hadoop-2.6.4 /home/hadoop/hadoop
[root@namenode ~]# chown -R hadoop.hadoop /home/hadoop/hadoop
Switch to the hadoop user:
[root@namenode ~]# su - hadoop
Modify the user environment (each node operates):
[hadoop@namenode ~]$ vim /home/hadoop/.bash_profile
Modify:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_HOME_WARN_SUPPRESS=1
export PATH
[hadoop@namenode ~]$ source /home/hadoop/.bash_profile
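A quick sanity check that the shell now resolves the new installation; it should report version 2.6.4:

[hadoop@namenode ~]$ hadoop version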
2.3 Modify the configuration files
[hadoop@namenode ~]$ cd /home/hadoop/hadoop/etc/hadoop/
[hadoop@namenode hadoop]$ vim hadoop-env.sh
Modify: export JAVA_HOME=/usr/local/jdk
Add: export HADOOP_PREFIX=/home/hadoop/hadoop
Add: export HADOOP_HEAPSIZE=15000
[hadoop@namenode hadoop]$ vim yarn-env.sh
Modify: export JAVA_HOME=/usr/local/jdk
[hadoop@namenode hadoop]$ vim mapred-env.sh
Modify: export JAVA_HOME=/usr/local/jdk
[hadoop@namenode hadoop]$ vim hdfs-site.xml
<property>
  <name>dfs.namenode.http-address</name>
  <value>namenode:50070</value>
  <description>The address from which the NameNode serves fsimage and edits</description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>node2:50090</value>
  <description>The address from which the SecondaryNameNode fetches the latest fsimage</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Number of copies of each file stored in HDFS; the default is 3</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file:///home/hadoop/hadoop2.2/hdfs/namesecondary</value>
  <description>Local filesystem path where the secondary stores the temporary image; if this is a comma-separated list of directories, the image is redundantly copied to all of them. Valid only for the secondary.</description>
</property>
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/work/hdfs/name/</value>
  <description>Local filesystem path where the NameNode persists the namespace and transaction logs</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/work/hdfs</value>
  <description>Comma-separated list of local directories where a DataNode stores its block files</description>
</property>
<property>
  <name>dfs.stream-buffer-size</name>
  <value>131072</value>
  <description>Buffer size used by hadoop when reading and writing HDFS files; map output also uses this buffer. The default of 4 KB is very conservative for current hardware; 128 KB (131072) or even 1 MB is reasonable (too large a value may overflow memory in map and reduce tasks)</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>Interval between two checkpoints, in seconds; valid only for the secondary</description>
</property>
[hadoop@namenode hadoop]$ vim mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
[hadoop@namenode hadoop]$ vim yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
[hadoop@namenode hadoop]$ vim core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode:9000/</value>
  <description>Hostname and port of the namenode</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp</value>
  <description>Directory where temporary files are stored</description>
</property>
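Once the four files are edited, the effective values can be read back with the stock getconf tool, which is a cheap way to catch XML typos before starting anything:

[hadoop@namenode hadoop]$ hdfs getconf -confKey fs.defaultFS
[hadoop@namenode hadoop]$ hdfs getconf -confKey dfs.replication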
Create the new directories (on all nodes):
$ mkdir /home/hadoop/tmp
$ mkdir /data/work/hdfs/namesecondary/
$ chown -R hadoop.hadoop /home/hadoop/tmp/
$ chown -R hadoop.hadoop /data/work/hdfs/namesecondary/
Start HDFS:
[hadoop@namenode ~]$ start-dfs.sh
[hadoop@namenode ~]$ hadoop-daemon.sh start namenode -upgrade
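The -upgrade start can take a while, since the NameNode rewrites its metadata into the new layout. Progress can be followed in the NameNode log (a sketch; the exact file name depends on the user and hostname running the daemon):

[hadoop@namenode ~]$ tail -f $HADOOP_HOME/logs/hadoop-hadoop-namenode-namenode.log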
Restart all the daemons:
[hadoop@namenode ~]$ stop-dfs.sh
[hadoop@namenode ~]$ start-all.sh
Check whether the metadata was retained successfully:
[hadoop@namenode ~]$ hadoop fs -ls /
After success, stop all the daemons:
[hadoop@namenode ~]$ stop-all.sh
Modify /home/hadoop/hadoop/etc/hadoop/slaves:
[hadoop@namenode ~]$ vim slaves
Modify to:
node1
node2
Copy the hadoop files to the other nodes:
[hadoop@namenode ~]$ scp -r /home/hadoop/hadoop node2:/home/hadoop/
[hadoop@namenode ~]$ scp -r /home/hadoop/hadoop node1:/home/hadoop/
On node1 and node2, fix the ownership of the hadoop directory:
$ chown -R hadoop.hadoop /home/hadoop/hadoop
Start the daemons on the namenode:
[hadoop@namenode ~]$ start-all.sh
Under the dfs.namenode.name.dir directory of the namenode and datanodes (/data/work/hdfs/name in this lab) there will now be an extra folder, previous/; you can also check the running daemons with jps. The previous/ folder is a backup of the data from before the upgrade, and it is also required for a rollback.
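Two follow-up actions relate to that previous/ folder, both standard HDFS commands. Once the upgraded cluster has been verified, finalizing the upgrade deletes previous/ and reclaims its space; note that finalizing is irreversible. Until then, the cluster can instead be rolled back to the pre-upgrade state (stop the daemons first):

[hadoop@namenode ~]$ hdfs dfsadmin -finalizeUpgrade
[hadoop@namenode ~]$ hadoop-daemon.sh start namenode -rollback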