How should we analyze the distributed resource scheduling framework YARN? This article works through that question in detail, in the hope of giving readers who face it a simple and feasible approach.
The background of YARN's creation
Let's first take a look at the architecture of MapReduce 1.x and its problems.
Architecture of Hadoop 1.x
As shown in the figure, the 1.x architecture is also a master-slave (master/slaves) design: one JobTracker with multiple TaskTrackers.
JobTracker: responsible for resource management and job scheduling. TaskTracker: regularly reports its node's health, resource usage, and job execution status to the JobTracker; it also receives commands from the JobTracker and is responsible for starting and killing the actual task processes. A MapReduce job is split into Map tasks and Reduce tasks, which the TaskTrackers execute and report on.
The disadvantages of such an architecture are:
There is only one JobTracker, which handles all cluster transactions centrally: it is a single point of failure, and it is under too much pressure to scale.
The JobTracker does too much: it must maintain both the state of each job and the state of each job's tasks, which consumes excessive resources.
Only MR jobs are supported; other computing frameworks, such as Spark and Storm, are not.
Separate clusters (a Spark cluster, a Hadoop cluster, and so on) cannot be managed uniformly: resource utilization is low, resources cannot be shared between them, and the cost of operation and maintenance is high.
Overview of YARN
YARN, short for Yet Another Resource Negotiator, is an operating-system-level resource scheduling framework.
The fundamental idea of MRv2 is to split the two main functions of the original JobTracker, resource management and job scheduling/monitoring, into two separate daemons: a global ResourceManager (RM) and one ApplicationMaster (AM) per application, where an application is a single MapReduce job or a DAG of jobs. The ResourceManager and the NodeManagers (NM) form the data-computation framework: the ResourceManager arbitrates resource usage across the cluster, and any client or running ApplicationMaster that wants to run a job or task must request resources from the RM. The ApplicationMaster is a framework-specific library; the MapReduce framework has its own AM implementation, and users can also implement their own. At run time, the AM works with the NMs to launch and monitor tasks.
Reference: https://blog.51cto.com/14048416/2342195
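To make the client/RM relationship concrete, here is a minimal sketch of submitting an application, assuming Hadoop's hadoop-yarn-client API; it is not the article's own code, and the AM class in the launch command is a hypothetical placeholder:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Connect to the RM using the cluster configuration on the classpath.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the RM for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");

        // Describe how an NM should launch the AM's container.
        // "MyApplicationMaster" is a hypothetical AM class, not a real one.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java MyApplicationMaster"));
        appContext.setAMContainerSpec(amContainer);

        // Resources the AM container itself needs: memory in MB, virtual cores.
        appContext.setResource(Resource.newInstance(1024, 1));

        // The RM accepts the application and starts the AM on some NM.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}
```

Note that the client never contacts an NM directly: it only describes the AM container, and the RM picks an NM to launch it on.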
Schematic diagram of Hadoop 2.x YARN
The figure above shows where YARN sits: above HDFS and beneath the various applications. Many different types of computing framework can therefore run in the same cluster, share the data on the same HDFS cluster, and be scheduled by the same overall resource manager. This is the "XXX on YARN" pattern: Spark on YARN, MapReduce on YARN, Storm on YARN, Flink on YARN. The advantage is that cluster resources are shared among the computing frameworks and allocated on demand, which improves cluster resource utilization.
Architecture and core components of YARN
The YARN architecture comprises five core components: ResourceManager (RM), NodeManager (NM), ApplicationMaster (AM), Container, and Client. It is still a master-slave structure, in the form of 1 RM + N NMs. Their roles are as follows (a small client-side sketch follows the list):
1) RM: only one serves the cluster at any time (in production an active/standby pair is used to guard against failure); it is responsible for the unified management and scheduling of cluster resources. Its main duties are:
Processing client requests: submitting a job, killing a job.
Monitoring the NMs: if an NM fails, the RM tells the affected AMs which of their tasks were running on that NM, and each AM decides whether to rerun those tasks.
2) NM: there are many of them in the cluster, each managing and using the resources of its own node. Its duties are:
Regularly reporting the node's resource usage and health to the RM.
Receiving and processing commands from the RM, such as starting a Container to run an AM.
Handling commands from AMs, such as starting Containers to run tasks.
Managing the resources of its single node.
3) AM: each application has exactly one AM (one per MapReduce job, one per Spark application), which manages that application. Its duties are:
Requesting resources (cores, memory, etc.) from the RM for the application and then assigning them to its internal tasks.
Communicating with NMs to start or stop tasks; both the tasks and the AM itself run in Containers.
Note that a single NM may run many tasks belonging to different AMs.
4) Container
A Container encapsulates resources such as CPU and memory; it is an abstraction of the environment in which a task runs.
Both the AM and the tasks run in Containers.
5) Client
Initiates the corresponding requests, for example:
Submitting a job and querying its progress.
Killing a job.
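As a small illustration of the components above, the following sketch (again assuming the hadoop-yarn-client API) asks the RM for its global view of the cluster: the NMs that heartbeat to it, and the applications, each with its own AM, that it is tracking:

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterView {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Every RUNNING entry is an NM that reports its resources to the RM.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + "  used=" + node.getUsed()
                    + "  capability=" + node.getCapability());
        }

        // Each application the RM tracks has (or had) its own AM.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getYarnApplicationState()
                    + "  progress=" + report.getProgress());
        }

        yarnClient.stop();
    }
}
```

This is essentially the same information the YARN web UI displays.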
Schematic diagram of the YARN execution process
① The client submits a job request to the RM.
②③ The RM first starts a Container on an NM to run the AM.
④ After the AM starts, it registers with the RM (the job is managed by the AM; once registered, users can query the job's progress from the AM through the RM) and applies to the RM for resources (cores, memory). The RM assigns the corresponding NM resources to the AM.
⑤⑥ The AM issues instructions to the corresponding NMs, and the NMs start Containers to run the tasks.
That is the basic, general flow of a YARN execution. A MapReduce job corresponds to the MapReduce ApplicationMaster, a Spark job corresponds to the Spark ApplicationMaster, and other job types each have their own ApplicationMaster. A condensed sketch of the AM's side of this flow follows.
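Here is a condensed, hypothetical sketch of steps ④ to ⑥ from the AM's point of view, assuming Hadoop's AMRMClient and NMClient APIs; a real AM would call allocate() in a heartbeat loop and track completed containers, and the task command below is just a placeholder:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SimpleAppMaster {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Step ④: the AM registers itself with the RM ...
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // ... and asks the RM for a task container (memory in MB, vcores).
        Resource capability = Resource.newInstance(512, 1);
        rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // Steps ⑤⑥: when the RM grants a container on some NM, ask that NM
        // to launch the task in it. (A real AM polls allocate() repeatedly.)
        for (Container container : rmClient.allocate(0.0f).getAllocatedContainers()) {
            ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
            ctx.setCommands(Collections.singletonList("echo hello-from-task")); // placeholder task
            nmClient.startContainer(container, ctx);
        }

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}
```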
We configured YARN earlier; refer to the configuration and usage examples of YARN in Hadoop. Two configuration files are mainly involved: mapred-site.xml and yarn-site.xml. The start-yarn.sh command starts the RM and the NMs (use stop-yarn.sh to stop the YARN processes). Once YARN is running, we can view the YARN cluster in a web browser, including the state of the nodes, the running status of the tasks, and so on.
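For reference, the canonical minimal entries in those two files, assuming the standard Hadoop single-cluster setup (site-specific addresses omitted), look like this:

```xml
<!-- mapred-site.xml: run MapReduce on YARN instead of the 1.x JobTracker -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

```xml
<!-- yarn-site.xml: the NM auxiliary service that serves map output to reducers -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```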
This is the answer to the question of how to analyze the distributed resource scheduling framework YARN. I hope the content above is of some help to you.