SLTechnology News&Howtos > Internet Technology | Updated 2025-03-17
This article analyzes the MapReduce running architecture and YARN resource scheduling. The content is detailed and easy to follow; if the topic interests you, read on in depth. I hope it proves helpful.
Preface
One day, a research institute designed the blueprints for a private jet, and a company built the jet from those blueprints. A wealthy buyer liked the plane and bought it at a high price. Before the plane could take off, he had to apply to air traffic control for a route; once the application was approved, he hired a pilot. Finally, the pilot flew the plane, and the owner soared into the sky in the jet he wanted.
The process above can be summarized as:
Blueprints --> private jet --> air traffic control (apply for a route) --> hire a pilot --> fly the plane (soar into the sky).
With that example in mind, let's look at the running process of a MapReduce application:
In 2006, the MapReduce computing framework was incorporated into the Hadoop project and gradually came into public view. For a time, major companies and developers everywhere were learning to use MapReduce.
One day, you develop an application (App) based on the MapReduce model and run it on a server. After submission, the App first applies to the resource manager for resources. Once resources are granted, the App goes to the task scheduler, which arranges tasks for it. Finally, those tasks execute the App's logic to complete the distributed computation.
The process above can be summarized as:
MapReduce framework --> application --> resource scheduler (apply for resources) --> task scheduler (execute the App) --> distributed parallel computing.
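To make the map and reduce phases concrete, here is a minimal Python sketch of a word count, the canonical MapReduce example. This is a single-process simulation of the model, not Hadoop code: the input "splits", the shuffle step, and the phase functions are all simplified stand-ins for what the framework does across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map task: emit a (word, 1) pair for each word in one line of input."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: sum the counts collected for one word."""
    return key, sum(values)

# Simulated input splits, as if stored on different nodes
splits = ["hello hadoop", "hello yarn hello mapreduce"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'hello': 3, 'hadoop': 1, 'yarn': 1, 'mapreduce': 1}
```

In real Hadoop, each split would be processed by a map task on the node holding that data, and the shuffle would move the grouped pairs to the reduce tasks over the network.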
Since running a program requires both a resource scheduler and a task scheduler, how does MapReduce actually run on each node?
As mentioned in the earlier introduction to Hadoop, the two major versions of Hadoop are composed differently, and the biggest difference lies in how resources and tasks are scheduled when MapReduce applications run. Next, let's look at how each version handles scheduling.
Hadoop1.x version
Official diagram:
In Hadoop 1.x, MR ships with its own resource scheduler, built on a master-slave architecture, which introduces two new roles: JobTracker and TaskTracker.
Analyzing the flow shown in the diagrams above, it is not difficult to see:
JobTracker is under heavy pressure and is a single point of failure: it is both the resource scheduling master and the task scheduling master, and it also monitors the resource load of the entire cluster.
TaskTracker, the slave, manages the resources of its own node while keeping a heartbeat with JobTracker, reporting its resource status and fetching tasks to run.
Resource management and compute scheduling are tightly coupled, so it is difficult for other frameworks to join the cluster.
JobTracker and TaskTracker (JTT for short) are bundled with MR. To run Spark on the same cluster, you would have to implement another set of JTT by hand; the cluster would then have two sets of JTT, each managing the resources of the entire cluster.
When an MR program applies for resources (say, a large share of them), the JTT implemented for Spark has no idea that MR's JTT has already claimed those resources; Spark's JTT still believes the whole cluster is free. When a Spark job then runs and also requests a large share, the cluster no longer has enough to give. The result is resource contention, caused by the lack of any global view of resources: no single framework has global control over the cluster.
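The contention problem above can be sketched in a few lines of Python. This is a toy model of two independent per-framework schedulers, not Hadoop code: each scheduler tracks only its own allocations, so both believe the same resources are free.

```python
class Scheduler:
    """A per-framework scheduler that sees only its OWN allocations."""
    def __init__(self, name):
        self.name = name
        self.allocated = 0

    def request(self, cluster, amount):
        # The scheduler believes the cluster is free apart from its own usage,
        # because it cannot see allocations made by other frameworks.
        believed_free = cluster["total"] - self.allocated
        if amount <= believed_free:
            self.allocated += amount
            cluster["used"] += amount
            return True
        return False

cluster = {"total": 100, "used": 0}
mr, spark = Scheduler("MR-JobTracker"), Scheduler("Spark-JobTracker")
mr.request(cluster, 80)      # MR claims 80 of 100 units
spark.request(cluster, 80)   # Spark also "succeeds": it saw 100 units free
print(cluster["used"])       # 160: the cluster is over-committed
```

A single global scheduler, which is exactly what YARN introduces below, would have rejected the second request.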
The design idea of MR is "move computation, not data", also called data locality: ship the computation to the node where the data lives rather than shipping data to the computation. In the flow above, however, the data consumed by reduce is transmitted over the network during the shuffle, which means data moves toward the computation. This produces heavy network IO and reduces the efficiency of the cluster.
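A back-of-the-envelope comparison shows why data locality matters. The sizes below are illustrative assumptions (a typical 128 MB HDFS block and a hypothetical 5 MB application jar), not Hadoop measurements:

```python
# Illustrative sizes, not Hadoop measurements
BLOCK_SIZE = 128 * 1024 * 1024   # a 128 MB HDFS block
JOB_JAR = 5 * 1024 * 1024        # a hypothetical 5 MB application jar

def bytes_over_network(strategy, num_blocks):
    """Compare the two placement strategies by network traffic."""
    if strategy == "move_data":      # ship every data block to one compute node
        return num_blocks * BLOCK_SIZE
    if strategy == "move_compute":   # ship the jar to each node holding a block
        return num_blocks * JOB_JAR
    raise ValueError(strategy)

for s in ("move_data", "move_compute"):
    print(s, bytes_over_network(s, num_blocks=100) // (1024 * 1024), "MB")
# move_data 12800 MB
# move_compute 500 MB
```

Moving the small program to the large data cuts network traffic by orders of magnitude, which is why only the shuffle between map and reduce should cross the network.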
For these reasons, Hadoop 2.x completely abandoned MR's built-in JTT and introduced a new resource management tool: YARN. Here's how YARN works.
Hadoop2.x version
Official diagram:
In Hadoop 2.x, Hadoop extracted and decoupled its resource scheduler into a standalone component: YARN (Yet Another Resource Negotiator). YARN also follows a master-slave architecture: the master is the ResourceManager and the slaves are NodeManagers.
The YARN resource scheduling flow, as shown in the official flowchart, can be broken into roughly nine steps:
1. The client submits an application to the ResourceManager (RM).
2. The RM allocates the first Container for the application and asks the corresponding NodeManager (NM) to launch the ApplicationMaster (AM) inside it.
3. The AM registers itself with the RM, so the client can query the application's status through the RM.
4. The AM requests resources (Containers) from the RM for the application's tasks.
5. The RM grants Containers to the AM according to cluster capacity and its scheduling policy.
6. The AM asks the NMs that host the granted Containers to launch the tasks.
7. The NMs set up the task environment and start the tasks inside the Containers.
8. Each task reports its status and progress to the AM, so the AM can monitor it and restart it on failure.
9. When the application finishes, the AM unregisters from the RM and shuts down, and its Containers are released.
The core idea of YARN is to split the resource management and task scheduling functions of Hadoop 1.x's JobTracker into two separate processes: ResourceManager and ApplicationMaster.
ResourceManager: responsible for resource management and scheduling across the entire cluster.
ApplicationMaster: responsible for application-level concerns such as task scheduling, task monitoring, and fault tolerance.
YARN also provides fault tolerance for the AM itself: if the AM dies, the RM restarts it on another node with sufficient resources.
Each MapReduce job corresponds to one AM.
NodeManager: manages the resources of a single node, keeps a heartbeat with the RM, and handles commands from both the RM and the ApplicationMaster.
Container: the resource abstraction on an NM node, encapsulating memory, CPU, disk, network, and other resources.
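The interplay of these roles can be sketched in Python. This is a toy simulation of the concepts, not the YARN API: the RM is the single source of truth for free resources, the AM requests Containers from it, and Containers (here reduced to a memory amount on a node) are the unit of allocation.

```python
class Container:
    """Resource abstraction on an NM node; reduced here to memory only."""
    def __init__(self, node, mem):
        self.node, self.mem = node, mem

class ResourceManager:
    """Global scheduler: the single source of truth for cluster resources."""
    def __init__(self, nodes):
        self.free = dict(nodes)            # node name -> free memory (MB)

    def allocate(self, mem):
        # Grant a Container on the first node with enough free memory.
        for node, avail in self.free.items():
            if avail >= mem:
                self.free[node] -= mem
                return Container(node, mem)
        return None                        # cluster cannot satisfy the request

class ApplicationMaster:
    """Per-application master: asks the RM for Containers, tracks its tasks."""
    def __init__(self, rm):
        self.rm, self.containers = rm, []

    def run_tasks(self, n, mem_each):
        for _ in range(n):
            c = self.rm.allocate(mem_each)
            if c:
                self.containers.append(c)
        return len(self.containers)

rm = ResourceManager({"node1": 4096, "node2": 4096})
am = ApplicationMaster(rm)
print(am.run_tasks(3, mem_each=2048))  # 3: all three Containers were granted
print(rm.free)                         # {'node1': 0, 'node2': 2048}
```

Because every framework's AM must go through the one RM, the over-commitment problem from the Hadoop 1.x example cannot occur: a second application's requests are checked against the same global free list.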
With the introduction of YARN, multiple computing frameworks can run on one cluster, with each application getting its own ApplicationMaster. Frameworks that currently run on YARN include MapReduce, Spark, Storm, and others.
That's all for the MapReduce running architecture and YARN resource scheduling. I hope the content above has helped you improve. Thank you for reading!