2025-01-19 Update | SLTechnology News & Howtos (shulou) > Servers
Shulou (Shulou.com) 05/31 report
This article explains the differences between MRv1 and YARN: how the two frameworks differ in architecture, what problems motivated the redesign, and how a job runs under each. The content is detailed and easy to follow; I hope you find it helpful.
YARN itself is not the next-generation MapReduce (MRv2). The next-generation MapReduce keeps exactly the same programming interface and data-processing engine (MapTask and ReduceTask) as the first generation (MRv1); MRv2 can be regarded as reusing these modules of MRv1. What changed is the resource-management and job-management system. In MRv1 both functions are implemented by the JobTracker, which combines the two roles; in MRv2 they are separated: job management is handled by an ApplicationMaster, while resource management is carried out by the new system, YARN. Because YARN is general-purpose, it can serve as the resource-management system for computing frameworks other than MapReduce, such as Spark and Storm. A computing framework running on YARN is commonly called "X on YARN", for example "MapReduce on YARN", "Spark on YARN", or "Storm on YARN".
This is clearly described on the official website:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Problems with the original Hadoop MapReduce framework
In the field of big-data storage and distributed processing, Hadoop is a well-known, excellent open-source framework for distributed file storage and processing. Its general introduction will not be repeated here; readers can refer to the official Hadoop documentation. Anyone who has used or studied the old Hadoop framework (0.20.0 and earlier) will be familiar with the original MapReduce architecture shown below:
Figure 1. Original Hadoop MapReduce architecture
The figure above clearly shows the flow and design ideas of the original MapReduce framework:
First, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates regularly with the machines in the cluster via heartbeats, decides which programs should run on which machines, and manages failed jobs, restarts, and other operations.
A TaskTracker runs on every machine in the MapReduce cluster. Its main job is to monitor the resources of its own machine, along with the health of the tasks currently running there. The TaskTracker sends this information to the JobTracker via heartbeat, and the JobTracker uses it to decide which machines a newly submitted job should run on. The dotted arrows in the figure represent this exchange of messages.
This original MapReduce architecture is simple and straightforward. In its first few years it accumulated a large number of success stories and won broad support from the industry. But as the scale and workload of distributed clusters grew, the framework's problems gradually surfaced. The main ones are as follows:
The JobTracker is the centralized processing point of MapReduce and thus a single point of failure.
The JobTracker takes on too many responsibilities, which leads to excessive resource consumption. When there are too many MapReduce jobs, memory overhead grows large, which in turn increases the risk of JobTracker failure. This is behind the industry's common conclusion that the old Hadoop MapReduce can scale to only about 4,000 hosts.
On the TaskTracker side, using the number of map/reduce tasks as the measure of resources is too crude: it does not take CPU or memory consumption into account. If two tasks with large memory footprints are scheduled onto the same node, out-of-memory (OOM) errors easily occur.
On the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. If at a given time the system has only map tasks or only reduce tasks, the other kind of slot sits idle and resources are wasted; this is the cluster resource-utilization problem mentioned earlier.
Reading the source code also reveals that it is very hard to follow: individual classes do too many things, often exceeding 3,000 lines of code, so class responsibilities are unclear, which increases the difficulty of bug fixing and version maintenance.
From an operational point of view, the old Hadoop MapReduce framework forces a system-level upgrade for any change, important or not (bug fixes, performance improvements, new features). Worse, it forces every client of the distributed cluster to update at the same time, regardless of user preference, so users waste a lot of time verifying that their existing applications still work on the new version of Hadoop.
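The slot-based waste described in the list above can be sketched as a toy comparison. This is not Hadoop code, just a minimal model: typed slots cap how many tasks can run during a map-only or reduce-only phase, while generic containers do not.

```java
// Toy illustration (not the Hadoop API): why fixed map/reduce slots waste
// resources compared with YARN-style generic containers.
public class SlotVsContainer {
    // MRv1-style node: slots are typed, so reduce slots sit idle
    // during a map-only phase.
    static int mrv1RunnableTasks(int mapSlots, int reduceSlots,
                                 int pendingMapTasks, int pendingReduceTasks) {
        return Math.min(mapSlots, pendingMapTasks)
             + Math.min(reduceSlots, pendingReduceTasks);
    }

    // YARN-style node: any pending task can take any free container.
    static int yarnRunnableTasks(int containers, int pendingTasks) {
        return Math.min(containers, pendingTasks);
    }

    public static void main(String[] args) {
        // A node with 4 map slots + 4 reduce slots vs. 8 generic containers,
        // during a map-only phase with 8 pending map tasks.
        System.out.println("MRv1 runnable: " + mrv1RunnableTasks(4, 4, 8, 0)); // 4
        System.out.println("YARN runnable: " + yarnRunnableTasks(8, 8));       // 8
    }
}
```

With the same hardware, the MRv1-style node runs only 4 of the 8 pending map tasks because its reduce slots cannot be reused, while the container model runs all 8.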
Principle and Operation Mechanism of the New Hadoop YARN Framework
Judging from industry trends in distributed systems and the long-term development of the Hadoop framework, MapReduce's JobTracker/TaskTracker mechanism needed a large-scale overhaul to fix its shortcomings in scalability, memory consumption, threading model, reliability, and performance. Over the years the Hadoop development team made many bug fixes, but these fixes became increasingly expensive, a sign that changing the original framework was getting harder and harder.
To fundamentally remove the performance bottlenecks of the old MapReduce framework and support the longer-term development of Hadoop, the MapReduce framework was completely rebuilt starting with version 0.23.0. The new Hadoop MapReduce framework is named MapReduce V2, or YARN; its architecture is shown in the following figure:
Figure 2. New Hadoop MapReduce framework (YARN) architecture
The fundamental idea of the refactoring is to split the two main functions of the JobTracker, resource management and task scheduling/monitoring, into separate components. The new ResourceManager globally manages the allocation of computing resources for all applications, while each application's ApplicationMaster handles its own scheduling and coordination. An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager, together with the NodeManager daemon on each machine, manages the user processes on that machine and organizes the computation.
In effect, each application's ApplicationMaster is a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor tasks.
In the figure above, the ResourceManager supports hierarchical application queues, each entitled to a certain proportion of the cluster's resources. In a sense it is a pure scheduler: it does not monitor or track application status during execution, nor does it restart tasks that fail due to application bugs or hardware errors.
The ResourceManager schedules resources according to each application's requirements; different applications need different types of resources, and therefore different containers. Resources include memory, CPU, disk, network, and so on. This differs markedly from the fixed-slot resource model of the old MapReduce, which hurt cluster utilization. The ResourceManager exposes a pluggable scheduling policy that is responsible for dividing cluster resources among queues and applications; the scheduler plug-in can be based on the existing capacity-scheduling or fair-scheduling models.
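As a concrete illustration of the pluggable scheduler and hierarchical queues, the fragments below follow the property names documented for Hadoop's CapacityScheduler; the queue names "prod" and "dev" and the capacity percentages are illustrative, not from this article.

```xml
<!-- yarn-site.xml: select the scheduler plug-in
     (CapacityScheduler here; FairScheduler is the common alternative). -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<!-- capacity-scheduler.xml: a two-queue hierarchy under root;
     capacities are percentages of cluster resources. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```

Each queue then receives its share of the cluster, which is exactly the "hierarchical application queues enjoying a proportion of cluster resources" described above.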
In the figure above, the NodeManager is the per-machine agent of the framework. It hosts the containers in which applications execute, monitors each application's resource usage (CPU, memory, disk, network), and reports it to the scheduler.
Each application's ApplicationMaster is responsible for requesting appropriate resource containers from the scheduler, running tasks, tracking application status, monitoring progress, and handling the causes of task failures.
YARN task execution process
The basic design idea is to split the JobTracker of MapReduce into two independent services: a global resource manager, the ResourceManager, and a per-application ApplicationMaster. The ResourceManager is responsible for resource management and allocation across the whole system, while each ApplicationMaster manages a single application.
When a user submits an application to YARN, YARN runs it in two phases: the first phase starts the ApplicationMaster; in the second phase the ApplicationMaster creates the application, requests resources for it, and monitors its entire run until it completes. As shown in Figure 2 above, the YARN workflow consists of the following steps:
Step 1: The user submits an application to YARN, including the ApplicationMaster program, the command to start the ApplicationMaster, the user program, and so on.
Step 2: The ResourceManager allocates the first Container for the application and communicates with the corresponding NodeManager, asking it to start the application's ApplicationMaster in this Container.
Step 3: The ApplicationMaster first registers with the ResourceManager, so that the user can view the application's running status directly through the ResourceManager. It then requests resources for each task and monitors their status until the run finishes, that is, it repeats steps 4 through 7.
Step 4: The ApplicationMaster polls the ResourceManager over RPC to request and receive resources.
Step 5: Once the ApplicationMaster obtains a resource, it communicates with the corresponding NodeManager and asks it to start the task.
Step 6: After the NodeManager has set up the task's running environment (environment variables, JAR packages, binary programs, etc.), it writes the task start-up command into a script and launches the task by running that script.
Step 7: Each task reports its status and progress to the ApplicationMaster over RPC, so that the ApplicationMaster always knows the state of every task and can restart a task when it fails. While the application is running, the user can query its current state from the ApplicationMaster at any time via RPC.
Step 8: When the application finishes, the ApplicationMaster deregisters from the ResourceManager and shuts itself down.
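The polling in step 4 deserves a closer look: each heartbeat to the ResourceManager may return only part of the requested containers, so the ApplicationMaster keeps asking until the request is filled. The sketch below is a toy model of that loop, not the Hadoop API; the per-heartbeat grant limit is an illustrative assumption.

```java
// Toy model (not the Hadoop API) of step 4: the ApplicationMaster polls the
// ResourceManager over RPC, accumulating partial container grants until
// its request is satisfied.
public class AllocationPolling {
    // Pretend the RM hands out at most 'grantPerHeartbeat' containers per
    // allocate() round trip (an illustrative number, not a real RM limit).
    static int heartbeatsUntilSatisfied(int requested, int grantPerHeartbeat) {
        int granted = 0, heartbeats = 0;
        while (granted < requested) {
            granted += Math.min(grantPerHeartbeat, requested - granted);
            heartbeats++;                     // one allocate() round trip
        }
        return heartbeats;
    }

    public static void main(String[] args) {
        // 10 containers requested, at most 4 granted per heartbeat -> 3 rounds.
        System.out.println(heartbeatsUntilSatisfied(10, 4)); // prints 3
    }
}
```

In real YARN the ApplicationMaster's heartbeat also doubles as liveness reporting, which is why the loop runs for the whole life of the application rather than stopping once resources arrive.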
That concludes this comparison of MRv1 and YARN. I hope it helps; thank you for reading.