Hadoop1.x version upgrade Hadoop2.x

2025-01-19 Update — SLTechnology News & Howtos

Introduction

As enterprise data and Hadoop adoption have grown, the framework design of Hadoop 1.x has become increasingly unable to meet users' needs. Apache kept patching Hadoop 1.x and finally launched a new generation, Hadoop 2.x. Given the industry's shift toward distributed systems and the long-term development of the Hadoop framework, MapReduce's JobTracker/TaskTracker mechanism needed large-scale adjustment to fix its shortcomings in scalability, memory consumption, threading model, reliability, and performance. To fundamentally solve the performance bottlenecks of the old MapReduce framework and promote the longer-term development of Hadoop, the MapReduce framework has been completely rebuilt since version 0.23.0, and fundamental changes have taken place. The new Hadoop MapReduce framework is named MapReduce v2, or YARN.

I. Differences between Hadoop 1.x and Hadoop 2.x

1.1 Differences in HDFS

1.1.1 Hadoop1.x

In Hadoop 1.x, HDFS follows a master/slave design: the cluster is managed by a NameNode and DataNodes. Hadoop 1.x HDFS is divided into two parts: the Namespace and the Block Storage Service. The Namespace resides entirely on the NameNode and includes the metadata of all files, the fsimage files, the edits files, and so on. The Block Storage Service, on the other hand, is distributed across the NameNode and the DataNodes: the mapping between blocks and DataNodes is stored on the NameNode, while the block data itself is distributed across the DataNodes. This is shown in Figure 1.

Fig. 1 Schematic diagram of Hadoop 1.x HDFS

Disadvantages:

1) Because the NameNode is the center of the entire cluster, once the NameNode goes down the whole cluster is paralyzed, and the problem is not resolved until the NameNode is restarted.

2) There is only one NameNode, the performance of a single machine is limited, and information about every DataNode is stored on the NameNode, so DataNodes cannot be added horizontally without limit; this is why a NameNode supports at most about 4,000 nodes.

1.1.2 Hadoop 2.x

Hadoop 2.x implements federated HDFS: multiple NameNodes coexist, and each NameNode manages one Namespace, as shown in Figure 2.

Fig. 2 Schematic diagram of Hadoop 2.x HDFS

Block Pool: the set of all blocks managed by a single NameNode. A NameNode together with its Block Pool forms a self-contained management unit that administers its own blocks.

In federated HDFS, each Namespace manages its own blocks, but all of these blocks are stored across the entire DataNode cluster. As shown in the figure above, the Namespaces are isolated from one another: even if one NameNode goes down, it affects neither the other Namespaces nor the blocks in the DataNodes that they manage.
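The federated layout above is configured in hdfs-site.xml by declaring the nameservices. A minimal sketch, assuming two NameNodes on hypothetical hosts nn-host1 and nn-host2 (the service names ns1/ns2 and both hosts are placeholders, not part of this article's cluster):

```xml
<!-- Hypothetical two-NameNode federation; ns1/ns2 and the hosts are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn-host1:9000</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn-host2:9000</value>
</property>
```

Each DataNode registers with every NameNode listed in dfs.nameservices, which is how a single DataNode cluster can store the blocks of all the isolated Namespaces.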

Advantages:

(1) DataNodes can be scaled out horizontally without restriction.

(2) Multiple NameNodes can execute tasks concurrently, improving the throughput of the HDFS system.

(3) Reliability is greatly improved: the crash of a single NameNode no longer paralyzes the whole system.

1.2 Differences in MapReduce

1.2.1 Hadoop 1.x

The process by which Hadoop 1.x runs a MapReduce job is:

(1) The Job Client submits the job to the JobTracker (on the NameNode), and the JobTracker sends query requests to each node to check the number of Tasks executing on each DataNode.

(2) The JobTracker collects the DataNodes' information and allocates resources to the Job.

(3) All the resources and information needed by the MapReduce job are copied to the DataNodes.

(4) After a DataNode accepts the task, it reads its local blocks and forms the corresponding Map and Reduce tasks. Management of these tasks is supervised entirely by the TaskTracker on the DataNode. This is shown in Figure 3.

Fig. 3 Schematic diagram of MapReduce

As the figure shows, the JobTracker is the center of the whole Hadoop 1.x MapReduce framework, taking on the functions of accepting jobs, computing resources, allocating resources, communicating with DataNodes, and so on. The Hadoop 1.x framework was very popular when it was released, but as demands grew, Hadoop 1.x's MapReduce (MapReduce v1) could no longer keep up, mainly because of the following problems:

(1) The JobTracker is the core of all of MapReduce v1, so it is a single point of failure.

(2) The JobTracker manages every task of the entire MapReduce job, which consumes resources. When there are too many map/reduce tasks, the JobTracker consumes a great deal of memory, increasing the risk of JobTracker failure.

(3) The JobTracker tracks DataNode resources only by the number of Tasks in use, without considering actual memory and CPU utilization, so if two memory-hungry map/reduce Tasks are scheduled on one node, memory may overflow.

(4) Some classes in the code exceed 3,000 lines, so the responsibilities of each class are not clear enough and changes are a huge undertaking, which increases the difficulty of maintenance and modification for developers.

1.2.2 Hadoop2.x

To cope with growing demands and the drawbacks of MapReduce v1, Apache redesigned it, resulting in MapReduce v2, that is, the YARN framework. Let's introduce the YARN framework, shown in the schematic diagram of Figure 4.

Fig. 4 Schematic diagram of YARN

Terminology:

ResourceManager: hereinafter referred to as RM. The central control module of YARN is responsible for unified planning of the use of resources.

NodeManager: hereinafter referred to as NM. The resource node module of YARN, which is responsible for starting and managing container.

ApplicationMaster: hereinafter referred to as AM. Each application in YARN launches an AM, which is responsible for requesting resources from RM, asking NM to start container, and telling container what to do.

Container: the resource container. All applications in YARN run inside containers; the AM also runs in a container, but the AM's container is requested from the RM.

(1) ResourceManager: in MapReduce v1, the JobTracker had two jobs, resource management and task scheduling. In the YARN framework these two core jobs are separated, and the resource-management half becomes the new ResourceManager. The ResourceManager manages the state of the resources (memory, CPU, disk, bandwidth, and so on) provided by each NodeManager. During a MapReduce job, the RM calculates the resources of the entire cluster precisely and allocates appropriate resources to the job.

(2) Container: an overall description of the memory, CPU and other resources of a node.

(3) ApplicationMaster: each MapReduce job corresponds to one AM. The AM is responsible for asking the ResourceManager for the resource containers needed to execute the job, managing the job's processes according to their state, and handling the causes of process failures.

(4) NodeManager: the per-machine agent of the framework and the container in which tasks execute; it manages the node's resources such as memory, CPU, disk, and network.
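The RM/NM relationship described above surfaces in yarn-site.xml: NodeManagers locate the ResourceManager through the yarn.resourcemanager.hostname property. A minimal sketch, assuming the RM runs on the host named namenode as in this article's cluster (this property is an illustration; the article's own yarn-site.xml below sets only the shuffle service):

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <!-- every NodeManager registers with the RM at this host -->
  <value>namenode</value>
</property>
```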

Advantages of YARN over MapReduce v1:

(1) The huge burden borne by the JobTracker is split between the ResourceManager and the NodeManagers. Resource management and task scheduling are distributed across different nodes, distributing and optimizing the workload.

(2) The ResourceManager's resource allocation no longer depends on the number of slots but allocates tasks according to the nodes' memory, which makes load balancing better.

(3) An ApplicationMasters service on the ResourceManager node is responsible for tracking the status of each ApplicationMaster process, thereby monitoring the tasks.

1.3 other differences

MapReduce has become an application on YARN, just like HBase and Hive. The default block size in Hadoop 1.x is 64 MB, while in Hadoop 2.x it is 128 MB. And in 2.x, in addition to the DataNodes reporting status to the NameNode, the NodeManagers also report status to the ResourceManager.
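The block-size difference mentioned above is governed by dfs.blocksize in hdfs-site.xml (the Hadoop 2.x property name; 1.x used dfs.block.size). A sketch showing how to set it explicitly, with the value in bytes:

```xml
<property>
  <name>dfs.blocksize</name>
  <!-- 134217728 bytes = 128 MB, the Hadoop 2.x default -->
  <value>134217728</value>
</property>
```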

II. Upgrading from Hadoop 1.x to Hadoop 2.x

Versions: old version, Hadoop 1.0.3; new version, Hadoop 2.6.4.

HOST Information:

Download the installation package for upgrade:

hadoop-2.6.4.tar.gz: http://apache.opencas.org/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz

jdk-8u77-linux-x64.tar.gz: download from the official website

Package location: /usr/local/src/

Create a new HDFS system directory and test files:

[root@namenode ~]# hadoop fs -mkdir /test

[root@namenode ~]# hadoop fs -put /home/hadoop/hadoop/conf/* /test/

Extract the jdk installation package (for each node):

[root@namenode ~]# cd /usr/local/src

[root@namenode ~]# tar zxvf jdk-8u77-linux-x64.tar.gz

Back up the old jdk (each node needs to operate):

[root@namenode ~]# mv /usr/local/jdk1.6 /usr/local/jdk1.6.bak

Replace the new jdk version (each node needs to operate):

[root@namenode ~]# mv jdk1.8.0_77 /usr/local/jdk/

Modify the jdk environment (each node needs to operate):

[root@namenode ~]# vim /etc/profile

Change JAVA_HOME:

export JAVA_HOME=/usr/local/jdk

export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export PATH=$PATH:$JAVA_HOME/bin

[root@namenode ~]# source /etc/profile

Verify that the jdk was installed successfully:

[root@namenode ~]# java -version

2.2 NameNode operations

Extract the hadoop2.6 package:

[root@namenode ~]# tar zxvf hadoop-2.6.4.tar.gz

Back up the hadoop1.0 (each node operates):

[root@namenode ~]# mkdir /home/hadoop/backup

[root@namenode ~]# mv /home/hadoop/hadoop /home/hadoop/backup/

Back up the cluster NameNode's metadata (the folder configured by dfs.name.dir in ${HADOOP_HOME}/conf/hdfs-site.xml):

[root@namenode ~]# cp -r /data/work/hdfs/name /data/backup/hdfsname.20160418.bak

Install hadoop2.6:

[root@namenode ~]# mv /usr/local/src/hadoop-2.6.4 /home/hadoop/hadoop

[root@namenode ~]# chown -R hadoop.hadoop /home/hadoop/hadoop

Switch to the hadoop user:

[root@namenode ~]# su - hadoop

Modify the user environment (each node operates):

[hadoop@namenode ~]$ vim /home/hadoop/.bash_profile

Modify:

export HADOOP_HOME=/home/hadoop/hadoop

export PATH=$PATH:$HADOOP_HOME:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export HADOOP_HOME_WARN_SUPPRESS=1

export PATH

[hadoop@namenode ~]$ source /home/hadoop/.bash_profile


2.3 Modify the configuration files

[hadoop@namenode ~]$ cd /home/hadoop/hadoop/etc/hadoop/

[hadoop@namenode hadoop]$ vim hadoop-env.sh

Modify: export JAVA_HOME=/usr/local/jdk

Add: export HADOOP_PREFIX=/home/hadoop/hadoop

export HADOOP_HEAPSIZE=15000

[hadoop@namenode hadoop]$ vim yarn-env.sh

Modify: export JAVA_HOME=/usr/local/jdk

[hadoop@namenode hadoop]$ vim mapred-env.sh

Modify: export JAVA_HOME=/usr/local/jdk

[hadoop@namenode hadoop]$ vim hdfs-site.xml

<property>
  <name>dfs.namenode.http-address</name>
  <value>namenode:50070</value>
  <description>The address from which the NameNode serves fsimage and edits files</description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>node2:50090</value>
  <description>The address from which the SecondaryNameNode fetches the latest fsimage</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Number of replicas for files stored in HDFS; the default is 3</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file:///home/hadoop/hadoop2.2/hdfs/namesecondary</value>
  <description>Local filesystem path where the secondary stores the temporary image; if this is a comma-separated list of directories, the image is redundantly copied to all of them. Valid only for the secondary.</description>
</property>
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/work/hdfs/name/</value>
  <description>Local filesystem path where the NameNode persists the namespace and edit logs</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/work/hdfs</value>
  <description>Comma-separated list of local directories where the DataNode stores block files</description>
</property>
<property>
  <name>dfs.stream-buffer-size</name>
  <value>131072</value>
  <description>The default is 4 KB. Used as the buffer when Hadoop reads and writes HDFS files, and also for map output. 4 KB is very conservative for current hardware; it can be set to 128 KB (131072) or even 1 MB (though very large values may cause map and reduce tasks to run out of memory)</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>Interval between two checkpoints, in seconds; valid only for the secondary</description>
</property>

[hadoop@namenode hadoop]$ vim mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

[hadoop@namenode hadoop]$ vim yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

[hadoop@namenode hadoop]$ vim core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode:9000/</value>
  <description>The hostname and port of the NameNode</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp</value>
  <description>Directory where temporary files are stored</description>
</property>

Create the new file directories (on all nodes):

$ mkdir /home/hadoop/tmp

$ mkdir /data/work/hdfs/namesecondary/

$ chown -R hadoop.hadoop /home/hadoop/tmp/

$ chown -R hadoop.hadoop /data/work/hdfs/namesecondary/

Start HDFS:

[hadoop@namenode ~]$ start-dfs.sh

[hadoop@namenode ~]$ hadoop-daemon.sh start namenode -upgrade

Restart all daemon threads

[hadoop@namenode ~]$ stop-dfs.sh

[hadoop@namenode ~]$ start-all.sh

Check whether the metadata was retained successfully:

[hadoop@namenode ~]$ hadoop fs -ls /

Stop all daemons after success

[hadoop@namenode ~]$ stop-all.sh

Modify /home/hadoop/hadoop/etc/hadoop/slaves:

[hadoop@namenode ~]$ vim slaves

Modify:

node1

node2

Copy the hadoop file to another node

[hadoop@namenode ~]$ scp -r /home/hadoop/hadoop node2:/home/hadoop/hadoop/

[hadoop@namenode ~]$ scp -r /home/hadoop/hadoop node1:/home/hadoop/hadoop/

On node1 and node2, fix the ownership of the hadoop directory:

$ chown -R hadoop.hadoop /home/hadoop/hadoop

Start the daemons on the namenode:

[hadoop@namenode ~]$ start-all.sh

Under the dfs.namenode.name.dir directory on the namenode and the data directories on the datanodes (/data/work/hdfs/name in this lab), there is now an extra previous/ folder; you can also check the running processes with jps. The previous/ folder is a backup of the pre-upgrade data, and it is also required for a rollback.
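Because previous/ exists, the upgrade can still be undone, or, once everything checks out, made permanent. A sketch of both operations, assuming they are run as the hadoop user on the namenode with the Hadoop 2.x scripts on the PATH:

```shell
# Roll back to the pre-upgrade state (restores from the previous/ directory)
hadoop-daemon.sh start namenode -rollback

# Or, after verifying that the upgraded cluster works, finalize the upgrade;
# this deletes previous/ and makes the upgrade irreversible
hdfs dfsadmin -finalizeUpgrade
```

Until the upgrade is finalized, the previous/ directories consume extra disk space on every node, so finalizing once the cluster is verified is worthwhile.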
