What's the difference between Hadoop 1.x and Hadoop 2.x?

Updated: 2025-03-31. From: SLTechnology News&Howtos.

Shulou (Shulou.com), 05/31 report.

This article explains the differences between Hadoop 1.x and Hadoop 2.x. It is meant as a practical reference, so follow along.

Background: why Hadoop 2.0 was created

HDFS and MapReduce in Hadoop 1.0 both have problems with high availability and scalability.

Problems in MapReduce

The heavy access load on the JobTracker limits the scalability of the system.

It is difficult to support computing frameworks other than MapReduce, such as Spark and Storm.

Problems in HDFS

The NameNode is a single point of failure, which makes HDFS difficult to use in online scenarios.

The NameNode is under heavy load and its memory is limited, which limits the scalability of the system.
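To make the NameNode memory limit concrete, here is a back-of-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object; that figure, the function name estimate_namenode_heap_bytes, and the choice to count only files and blocks are illustrative assumptions, not measurements from this article.

```python
# Rough estimate of NameNode heap needed to hold namespace metadata.
# ASSUMPTION: about 150 bytes per namespace object (a rule of thumb,
# not a measured value); directories are ignored for simplicity.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(num_files: int, blocks_per_file: float = 1.0) -> int:
    """Return an approximate heap size in bytes for files plus their blocks."""
    num_blocks = int(num_files * blocks_per_file)
    return (num_files + num_blocks) * BYTES_PER_OBJECT


if __name__ == "__main__":
    for files in (10_000_000, 100_000_000, 1_000_000_000):
        gib = estimate_namenode_heap_bytes(files) / 2**30
        print(f"{files:>13,} files -> about {gib:.1f} GiB of heap")
```

Because all of this metadata lives in the heap of a single NameNode in Hadoop 1.0, the number of files a cluster can hold is bounded by one machine's memory, however many DataNodes are added.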

HDFS 2.x

HDFS 2.x resolves the single point of failure and the memory limits of HDFS 1.0.

Solve the single point of failure

Reference: HDFS High Availability Using the Quorum Journal Manager

Reference: ZooKeeper Getting Started Guide

HDFS HA: solved by running an active NameNode and a standby NameNode

If the active NameNode fails, the cluster switches to the standby NameNode.

Solve the problem of limited memory

HDFS Federation

Scales horizontally by supporting multiple NameNodes

Each NameNode is in charge of part of the directory tree.

All NameNodes share the storage resources of all DataNodes.

In 2.x only the architecture has changed; the way HDFS is used remains the same.

The change is transparent to HDFS users.

The commands and APIs of HDFS 1.x can still be used.

Active and standby NameNodes

Solve the single point of failure

The active NameNode serves clients; the standby NameNode synchronizes the active NameNode's metadata so that it is ready to take over.

All DataNodes report block information to both NameNodes simultaneously.
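The synchronization described above can be sketched as a small in-memory simulation. This is not Hadoop code: the classes SharedJournal, NameNode, and DataNode and all their methods are hypothetical names, and the shared journal only stands in for whatever shared edit log the cluster uses (for example, the Quorum Journal Manager referenced earlier).

```python
class SharedJournal:
    """Hypothetical stand-in for the shared edit log."""

    def __init__(self):
        self.edits = []

    def append(self, edit):
        self.edits.append(edit)


class NameNode:
    def __init__(self, name, journal):
        self.name = name
        self.journal = journal
        self.namespace = set()   # known file paths
        self.block_map = {}      # block id -> set of DataNode names
        self.applied = 0         # number of journal edits already replayed

    def create_file(self, path):
        """Called on the active NameNode: log the change, then apply it."""
        self.journal.append(("create", path))
        self.catch_up()

    def catch_up(self):
        """Replay journal edits that this NameNode has not applied yet."""
        for op, path in self.journal.edits[self.applied:]:
            if op == "create":
                self.namespace.add(path)
        self.applied = len(self.journal.edits)

    def receive_block_report(self, datanode, block_ids):
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode)


class DataNode:
    def __init__(self, name, block_ids):
        self.name = name
        self.block_ids = block_ids

    def report_blocks(self, namenodes):
        """Send the same block report to every NameNode."""
        for namenode in namenodes:
            namenode.receive_block_report(self.name, self.block_ids)


if __name__ == "__main__":
    journal = SharedJournal()
    active, standby = NameNode("nn1", journal), NameNode("nn2", journal)
    active.create_file("/data/a.txt")
    standby.catch_up()
    DataNode("dn1", ["blk_1", "blk_2"]).report_blocks([active, standby])
    print(standby.namespace == active.namespace, standby.block_map == active.block_map)
```

Because the standby already holds both the namespace and the block locations, it does not have to rebuild either one when it takes over.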

Two switching options

Manual switching: switch between active and standby with a command; suitable for HDFS upgrades and similar occasions. (X)

Automatic switching: based on ZooKeeper. (√)

Automatic switching scheme based on ZooKeeper

ZooKeeper Failover Controller (ZKFC): monitors NameNode health and registers the NameNode with ZooKeeper

After the active NameNode dies, the ZKFCs compete for a lock in ZooKeeper on behalf of their NameNodes, and the NameNode whose ZKFC acquires the lock becomes active.
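The lock competition can be illustrated with a toy model. It is only a sketch under stated assumptions: FakeZooKeeper is an in-memory stand-in for a single ZooKeeper lock node, the NameNode and ZKFC classes and the tick method are hypothetical, and real deployments also fence the old active NameNode, which is omitted here.

```python
class FakeZooKeeper:
    """In-memory stand-in for one ZooKeeper lock node (not a real client)."""

    def __init__(self):
        self.lock_holder = None

    def try_acquire(self, owner):
        if self.lock_holder is None:
            self.lock_holder = owner
        return self.lock_holder == owner

    def release(self, owner):
        if self.lock_holder == owner:
            self.lock_holder = None


class NameNode:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.state = "standby"


class ZKFC:
    """One failover controller per NameNode."""

    def __init__(self, namenode, zk):
        self.namenode = namenode
        self.zk = zk

    def tick(self):
        """One monitoring round: check health, then hold or compete for the lock."""
        nn = self.namenode
        if not nn.healthy:
            self.zk.release(nn.name)   # models the lock disappearing with the failed node
            nn.state = "down"
        elif self.zk.try_acquire(nn.name):
            nn.state = "active"
        else:
            nn.state = "standby"
        return nn.state


if __name__ == "__main__":
    zk = FakeZooKeeper()
    nn1, nn2 = NameNode("nn1"), NameNode("nn2")
    fc1, fc2 = ZKFC(nn1, zk), ZKFC(nn2, zk)
    print(fc1.tick(), fc2.tick())   # nn1 wins the lock first
    nn1.healthy = False             # simulate a crash of the active NameNode
    print(fc1.tick(), fc2.tick())   # nn2 takes over
```

The key property is that at most one controller holds the lock at any time, so at most one NameNode is active.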

HDFS 2.x Federation

With multiple NameNodes/namespaces, the storage and management of metadata are distributed across several nodes, so the NameNode/namespace layer can scale horizontally by adding machines.

Federation spreads the load of a single NameNode across multiple nodes, so HDFS performance does not degrade when the volume of HDFS data is large. Multiple namespaces can also be used to isolate different types of applications, by assigning the HDFS storage and management of each type of application to a different NameNode.
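One way to picture "each NameNode is in charge of part of the directory tree" is a client-side mount table that maps path prefixes to NameNodes. The sketch below is illustrative only: the table contents, the NameNode names, and the function resolve_namenode are hypothetical, and longest-prefix matching is an assumption about how such a table could be resolved.

```python
# Hypothetical mount table: directory prefix -> NameNode that owns that namespace.
MOUNT_TABLE = {
    "/user": "nn-user",
    "/logs": "nn-logs",
    "/warehouse": "nn-warehouse",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for path, using longest-prefix match."""
    best = None
    for prefix in mount_table:
        # Match the prefix itself or anything strictly below it,
        # so "/userdata" does not match "/user".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace is mounted for {path!r}")
    return mount_table[best]


if __name__ == "__main__":
    for p in ("/user/alice/data.csv", "/logs/2025/03/31", "/warehouse"):
        print(p, "->", resolve_namenode(p))
```

Each NameNode then only needs to hold the metadata for its own subtree, while the blocks themselves can live on any DataNode in the shared pool.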

YARN

YARN: Yet Another Resource Negotiator

YARN is the new resource management system introduced in Hadoop 2.0; it evolved directly from MRv1.

Core idea: separate the resource management and task scheduling functions of the JobTracker in MRv1, and implement them in the ResourceManager and ApplicationMaster processes respectively.

ResourceManager: responsible for resource management and scheduling of the whole cluster; there is only one in the whole cluster

ApplicationMaster: responsible for application-related matters, such as task scheduling, task monitoring and fault tolerance; each application corresponds to one ApplicationMaster

With the introduction of YARN, multiple computing frameworks can run in a cluster.

Each application corresponds to one ApplicationMaster

At present, several computing frameworks can run on YARN, such as MapReduce, Spark and Storm.
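The split between cluster-wide resource management and per-application scheduling can be sketched as follows. This is a toy model, not the YARN API: the class and method names are hypothetical, containers are reduced to a simple counter, and the container that a real ApplicationMaster itself occupies is ignored.

```python
class ResourceManager:
    """Cluster-wide resource bookkeeping only; it knows nothing about tasks."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.app_masters = {}

    def submit(self, app_id, framework, tasks):
        """Start one ApplicationMaster for the submitted application."""
        am = ApplicationMaster(app_id, framework, tasks, self)
        self.app_masters[app_id] = am
        return am

    def allocate(self, requested):
        """Grant as many containers as are free, up to the requested number."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-application scheduling: decides which of its own tasks run next."""

    def __init__(self, app_id, framework, tasks, rm):
        self.app_id = app_id
        self.framework = framework
        self.pending = list(tasks)
        self.done = []
        self.rm = rm

    def run_round(self):
        """Request containers, run one task per container, then hand them back."""
        granted = self.rm.allocate(len(self.pending))
        running = self.pending[:granted]
        self.pending = self.pending[granted:]
        self.done.extend(running)
        self.rm.release(granted)
        return granted


if __name__ == "__main__":
    rm = ResourceManager(total_containers=4)
    mr = rm.submit("app-1", "MapReduce", [f"map-{i}" for i in range(6)])
    spark = rm.submit("app-2", "Spark", ["stage-0", "stage-1", "stage-2"])
    while mr.pending or spark.pending:
        print("MapReduce ran", mr.run_round(), "tasks; Spark ran", spark.run_round())
```

The ResourceManager never inspects tasks, which is why frameworks other than MapReduce can share the same cluster: each one only has to supply its own ApplicationMaster.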

MapReduce On YARN

MapReduce running on YARN is called MRv2.

MapReduce jobs run directly on YARN, rather than on the MRv1 system built from JobTracker and TaskTracker.

JobTracker and TaskTracker do not exist in Hadoop 2.0.

The basic functions of the MRv2 module:

YARN: responsible for resource management and scheduling

MRAppMaster: responsible for task segmentation, task scheduling, task monitoring and fault tolerance of an application / job

Map/Reduce Task: the task execution engine, the same as in MRv1

Each application / job (MapReduce job) corresponds to one MRAppMaster

If a single application / job fails, other applications / jobs are not affected, and YARN restarts its failed MRAppMaster.

After a task fails, the MRAppMaster requests resources again to rerun it.

The MRAppMaster is responsible for application / job related matters, including distributing the resources obtained from YARN to its internal tasks, task segmentation, task health monitoring and fault tolerance.
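The task-level fault tolerance described above can be sketched as a retry loop inside the application's master. This is an illustration, not MRAppMaster code: run_job, request_container, and run_task are hypothetical names, and the limit of 4 attempts per task is an assumption chosen for the example.

```python
def run_job(tasks, request_container, run_task, max_attempts=4):
    """Run every task, requesting a fresh container whenever an attempt fails.

    request_container(task) -> container handle (hypothetical callback)
    run_task(task, container) -> True on success, False on failure
    Returns a dict mapping each task to the number of attempts it needed.
    """
    attempts = {}
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            container = request_container(task)   # re-apply for resources
            attempts[task] = attempt
            if run_task(task, container):
                break                              # task succeeded
        else:
            # Every attempt failed: give up on the whole job.
            raise RuntimeError(f"{task} failed {max_attempts} times; failing the job")
    return attempts


if __name__ == "__main__":
    remaining_failures = {"map-1": 1}   # map-1 fails once, then succeeds
    requested = []

    def request_container(task):
        requested.append(task)
        return len(requested)           # fake container id

    def run_task(task, container):
        if remaining_failures.get(task, 0) > 0:
            remaining_failures[task] -= 1
            return False
        return True

    print(run_job(["map-0", "map-1", "reduce-0"], request_container, run_task))
    print(requested)
```

Only the failed task is retried, and only this job's master is involved; other applications on the cluster are untouched.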
