This article explains the problems in Hadoop 1 and how Hadoop 2 addresses them. The content is straightforward and easy to follow; please read along with the explanation below.
Here is an explanation of Hadoop 1 and Hadoop 2, with a diagram for each architecture; take a look.
Hadoop 1.0
From the diagram above, you can clearly see the flow and design of the original MapReduce framework:
First, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates regularly with the machines in the cluster via heartbeats, decides which programs should run on which machines, and manages all job failures, restarts, and other operations.
A TaskTracker runs on every machine in the MapReduce cluster, and its main job is to monitor the resources of its own machine.
The TaskTracker also monitors the health of the tasks running on its machine. It sends this information to the JobTracker via heartbeat, and the JobTracker collects it to decide which machines a newly submitted job should run on. The dotted arrows in the diagram represent this sending and receiving of messages.
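To make this flow concrete, here is a minimal sketch (not from the original article) of how a client submitted a job to the JobTracker using the classic Hadoop 1.x API in the org.apache.hadoop.mapred package; the mapper and reducer classes are left as placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClassicJobSubmit {
    public static void main(String[] args) throws Exception {
        // JobConf describes the job; JobClient hands it off to the JobTracker.
        JobConf conf = new JobConf(ClassicJobSubmit.class);
        conf.setJobName("word-count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // The actual Mapper/Reducer implementations would be set here, e.g.:
        // conf.setMapperClass(MyMapper.class);
        // conf.setReducerClass(MyReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // runJob() submits the job to the JobTracker and polls it for
        // progress until the job finishes.
        JobClient.runJob(conf);
    }
}
```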
As you can see, the original MapReduce architecture is simple and straightforward. In the first few years after its launch it produced a large number of success stories and received broad support and recognition from the industry. However, as the scale and workload of distributed clusters grew, the problems of the original framework gradually surfaced. The main problems are as follows:
The JobTracker is the centralized processing point of MapReduce, so it is a single point of failure.
The JobTracker takes on too many tasks, resulting in excessive resource consumption. When there are too many MapReduce jobs, the memory overhead becomes large, which also increases the risk of JobTracker failure. This is the basis of the industry's general conclusion that the MapReduce of old Hadoop can only support about 4,000 host nodes.
On the TaskTracker side, using the number of map/reduce tasks as the representation of resources is too simplistic, since it does not take CPU or memory consumption into account. If two tasks with large memory consumption are scheduled together, an OOM error can easily occur.
Also on the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. If there are only map tasks or only reduce tasks in the system at a given time, resources are wasted; this is the cluster resource utilization problem mentioned earlier (a small configuration sketch follows this list of problems).
When analyzing the source code, you will find that it is very difficult to read, often because a single class does too many things and exceeds 3,000 lines of code. As a result, class responsibilities are unclear, which increases the difficulty of bug fixing and version maintenance.
From an operational point of view, the old Hadoop MapReduce framework forces a system-level upgrade whenever there is any change, important or not (such as bug fixes, performance improvements, or new features). Worse, it forces every client of the distributed cluster to update at the same time, regardless of user preference. These updates make users waste a lot of time verifying that their previous applications still work with the new version of Hadoop.
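As an illustration of the fixed slot model criticized above, here is a small sketch (an illustration, not from the original article) that reads the per-TaskTracker slot limits from a Hadoop 1.x configuration; mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum are the Hadoop 1.x property names, and both default to 2.

```java
import org.apache.hadoop.conf.Configuration;

public class SlotConfig {
    public static void main(String[] args) {
        // In Hadoop 1.x every TaskTracker advertises a fixed number of map and
        // reduce slots, set statically in mapred-site.xml, regardless of how
        // much CPU or memory each task actually needs.
        Configuration conf = new Configuration();
        conf.addResource("mapred-site.xml"); // assumes the file is on the classpath
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("map slots: " + mapSlots + ", reduce slots: " + reduceSlots);
        // A reduce-only phase leaves all map slots idle (and vice versa),
        // which is the resource-utilization problem described above.
    }
}
```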
Hadoop 2.0
From the perspective of the industry's evolving use of distributed systems and the long-term development of the Hadoop framework, MapReduce's JobTracker/TaskTracker mechanism needed large-scale adjustment to fix its shortcomings in scalability, memory consumption, threading model, reliability, and performance. Over the past few years the Hadoop development team made various bug fixes, but those fixes became increasingly expensive, showing that it was becoming more and more difficult to change the original framework.
To fundamentally solve the performance bottlenecks of the old MapReduce framework and promote the longer-term development of Hadoop, the MapReduce framework was completely refactored starting from version 0.23.0 and underwent fundamental changes. The new Hadoop MapReduce framework is named MapReduce V2, or YARN.
The fundamental idea of the refactoring is to split the two main functions of the JobTracker, resource management and task scheduling/monitoring, into separate components. The new ResourceManager globally manages the allocation of computing resources for all applications, while each application's ApplicationMaster is responsible for the corresponding scheduling and coordination. An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager and the NodeManager on each machine manage the user's processes on that machine and organize the computation.
In fact, each application's ApplicationMaster is a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor tasks.
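To make this division of labor concrete, here is a minimal sketch (not from the original article) of how a client asks the ResourceManager to launch an application's ApplicationMaster, using the Hadoop 2.x YARN client API (org.apache.hadoop.yarn.client.api.YarnClient); the application name, container sizes, and the AM launch command are illustrative assumptions.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmit {
    public static void main(String[] args) throws Exception {
        // The client only talks to the ResourceManager; per-job logic now
        // lives in the ApplicationMaster that this submission launches.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");
        // Resources for the ApplicationMaster container itself: 1024 MB, 1 vcore.
        ctx.setResource(Resource.newInstance(1024, 1));
        // Placeholder command that would start the ApplicationMaster process.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("java -Xmx512m MyApplicationMaster"),
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
    }
}
```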
In the figure above, the ResourceManager supports hierarchical application queues, each of which is guaranteed a certain proportion of the cluster's resources. In a sense it is a pure scheduler: it does not monitor or track the status of applications during execution, and likewise it does not restart tasks that fail because of application bugs or hardware errors.
The ResourceManager schedules resources based on the requirements of each application; different applications need different types of resources and therefore different containers. Resources include memory, CPU, disk, network, and so on. This is a significant departure from the fixed-type resource usage model of the old MapReduce, which hurt cluster utilization. The ResourceManager provides a plug-in interface for scheduling policies that is responsible for allocating cluster resources among multiple queues and applications; the scheduling plug-in can be based on the existing capacity scheduling or fair scheduling models.
In the figure above, the NodeManager is the per-machine framework agent. It is responsible for the containers that execute applications, monitors their resource usage (CPU, memory, disk, network), and reports it to the scheduler.
Each application's ApplicationMaster is responsible for requesting appropriate resource containers from the scheduler, running tasks, tracking the application's status, monitoring its progress, and handling the causes of task failures.
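Below is a minimal sketch (not from the original article) of the request side of this interaction: an ApplicationMaster registering with the ResourceManager and asking for containers with explicit memory and CPU requirements, using the Hadoop 2.x AMRMClient API; the host/port/tracking-URL arguments and the container sizes are illustrative assumptions.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmResourceRequest {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this ApplicationMaster with the ResourceManager
        // (illustrative host/port/tracking-URL values).
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for one container with 2048 MB of memory and 2 vcores; this is
        // the per-container resource model that replaces fixed map/reduce slots.
        Resource capability = Resource.newInstance(2048, 2);
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // One allocate() heartbeat; a real AM would loop until its containers arrive.
        AllocateResponse response = rmClient.allocate(0.1f);
        for (Container allocated : response.getAllocatedContainers()) {
            System.out.println("Got container " + allocated.getId()
                    + " on " + allocated.getNodeId());
        }
    }
}
```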
Thank you for reading. The above is the content of "What are the problems in Hadoop 1?". After studying this article, I believe you have a deeper understanding of the problems of Hadoop 1; specific usage still needs to be verified in practice.