2025-03-31 Update From: SLTechnology News&Howtos (shulou)
Shulou (Shulou.com) 06/03 Report
I. Introduction
Yarn plays the role of resource manager and task scheduler in the Hadoop ecosystem. Before discussing its schedulers, take a brief look at Yarn's architecture.
The figure above shows the basic architecture of Yarn. ResourceManager is the core component of the architecture and is responsible for managing memory, CPU, and other resources across the entire cluster. ApplicationMaster handles task scheduling for an application throughout its life cycle. NodeManager is responsible for supplying and isolating resources on its own node. A Container can be thought of abstractly as the vessel in which a task runs. The schedulers discussed in this article are components inside ResourceManager. We will look at three of them: the FIFO scheduler, the Capacity scheduler, and the Fair scheduler.
II. FIFO scheduler
The figure above is a schematic diagram of the execution process of the FIFO scheduler. FIFO stands for First In, First Out, and this scheduler is one of the earliest scheduling strategies used by Hadoop. It can be understood simply as a Java queue: only one job runs in the cluster at any given time. All applications execute in the order in which they were submitted, and a job starts only after the job ahead of it in the queue has completed. The FIFO scheduler runs each job with exclusive use of the cluster's resources. The advantage is that a job can make full use of the entire cluster; the disadvantage is that short-running, high-priority, or interactive jobs must wait for the jobs ahead of them, so a single very large job can block everything behind it. FIFO scheduling is therefore simple to implement but cannot meet the requirements of many practical scenarios, which led to the emergence of the Capacity scheduler and the Fair scheduler.
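The queueing behavior described above can be sketched in a few lines of Python. This is a toy model of the scheduling policy, not Yarn code; the job names and durations are made up:

```python
from collections import deque

def run_fifo(jobs):
    """Simulate FIFO scheduling: each (name, duration) job holds the whole
    cluster until it finishes; returns each job's completion time."""
    queue = deque(jobs)          # jobs in submission order
    clock = 0
    finish = {}
    while queue:
        name, duration = queue.popleft()
        clock += duration        # the job runs exclusively to completion
        finish[name] = clock     # everything behind it is blocked until now
    return finish

# A long job submitted first blocks a short interactive query:
print(run_fifo([("big_etl", 100), ("ad_hoc_query", 2)]))
# the 2-unit query cannot finish before t=102
```

The short query's completion time is dominated by the large job ahead of it, which is exactly the blocking problem the Capacity and Fair schedulers address.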
III. Capacity scheduler
The figure above is a schematic diagram of the execution process of the Capacity scheduler. The Capacity scheduler can be understood as a set of resource queues, which are allocated among users. For example, suppose that, to match the needs of the business, the whole cluster is divided into two queues, A and B, and that queue A is further subdivided into sub-queues A.1 and A.2. The allocation can then be pictured as the following tree structure:
- A [60%]
  - A.1 [40%]
  - A.2 [60%]
- B [40%]
This tree structure means that queue A is allotted 60% of the cluster's total resources and queue B 40%. Queue A is divided into two sub-queues, with A.1 receiving 40% and A.2 receiving 60% of queue A's resources. These shares are allocations, not hard caps: it is not the case that, after submitting a task, queue A may use only its 60% while queue B's 40% sits idle. As long as another queue's resources are idle, a queue with submitted tasks may borrow them, and how much it may borrow is governed by configuration. The relevant parameter settings are described below.
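The queue tree above could be expressed in capacity-scheduler.xml roughly as follows. The queue names and percentages come from the article's example; this is a minimal sketch, not a complete configuration:

```xml
<!-- capacity-scheduler.xml: the A/B queue tree from the example above -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>A,B</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.queues</name>
    <value>A1,A2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.B.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.A1.capacity</name>
    <value>40</value>  <!-- 40% of queue A's 60% -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.A2.capacity</name>
    <value>60</value>  <!-- 60% of queue A's 60% -->
  </property>
</configuration>
```

Note that sibling capacities at each level (A + B, and A1 + A2) add up to 100.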
The Capacity scheduler has the following features:
- Hierarchical queue design: sub-queues can use all of the resources allotted to their parent queue, so hierarchical management makes it easier to allocate and limit resources sensibly.
- Capacity guarantees: each queue is assigned a proportion of the resources, which ensures that no queue can monopolize the entire cluster.
- Security: each queue has strict access control. Users can only submit tasks to their own queues and cannot modify or access tasks in other queues.
- Elasticity: free resources can be allocated to any queue; when multiple queues contend, resources are balanced in proportion to their configured shares.
- Multi-tenancy: queue capacity limits allow multiple users to share the same cluster while guaranteeing each queue its allotted capacity, improving utilization.
- Operability: Yarn supports dynamic modification of capacities, permissions, and so on directly at run time, and provides an administrator interface that shows the current queue status. An administrator can add a queue at run time but cannot delete one, and can also stop a queue so that it accepts no further tasks; if a queue is set to STOPPED, tasks cannot be submitted to it or to its sub-queues.
- Resource-based scheduling: applications with different resource requirements (memory, CPU, disk, and so on) can be coordinated.
Configuration of related parameters:
(1) capacity: the queue's guaranteed resource capacity (as a percentage). When the system is busy, each queue's capacity should be guaranteed; when a queue has few applications, the remaining resources can be shared with other queues. Note that the capacities of the queues at each level should sum to 100.
(2) maximum-capacity: the upper limit (percentage) of the queue's resource usage. Because of resource sharing, a queue may use more than its configured capacity; this parameter caps how far it can grow. (This is the cap, mentioned earlier, on how many resources a busy queue may borrow from idle queues.)
(3) user-limit-factor: a multiple of the queue capacity that a single user may consume. For example, with the default value of 1, no single user can take more than the queue's configured capacity at any one time; a value of 2 would let one user consume up to twice the queue's capacity.
(4) maximum-applications: the maximum number of applications waiting or running in the cluster or queue. This is a hard limit: once the number of applications exceeds it, subsequently submitted applications are rejected. The default is 10000. The cluster-wide limit is set with yarn.scheduler.capacity.maximum-applications (effectively the default for all queues), while an individual queue can override it with yarn.scheduler.capacity.<queue-path>.maximum-applications.
(5) maximum-am-resource-percent: the upper limit on the proportion of cluster resources that may be used to run ApplicationMasters. This parameter is usually used to limit the number of active applications. It is a floating-point value; the default is 0.1, i.e. 10%. The cluster-wide limit is set with yarn.scheduler.capacity.maximum-am-resource-percent (effectively the default), while an individual queue can override it with yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent.
(6) state: the queue state, either STOPPED or RUNNING. If a queue is STOPPED, users cannot submit applications to it or to its sub-queues. Similarly, if the root queue is STOPPED, no applications can be submitted to the cluster at all, but applications that are already running can still run to completion, so the queue can be drained gracefully.
(7) acl_submit_applications: defines which Linux users and user groups may submit applications to a given queue. Note that this property is inherited: if a user can submit applications to a queue, they can also submit to all of its sub-queues. The value lists users and groups separated by commas, with a single space separating the user list from the group list, e.g. "user1,user2 group1,group2".
(8) acl_administer_queue: assigns administrators to the queue, who can control all of the queue's applications, for example killing any one of them. This property is likewise inherited: an administrator of a queue is also an administrator of all of its sub-queues.
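Pulling several of these parameters together, a per-queue fragment for the example queue A might look like the following. The property names are the standard CapacityScheduler ones; the values and the user/group names are illustrative:

```xml
<!-- capacity-scheduler.xml fragment: limits and ACLs for queue A -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.A.maximum-capacity</name>
    <value>80</value>  <!-- A may borrow idle resources up to 80% of the cluster -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.user-limit-factor</name>
    <value>1</value>   <!-- one user may use at most 1x the queue's capacity -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.state</name>
    <value>RUNNING</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.acl_submit_applications</name>
    <value>user1,user2 group1,group2</value>  <!-- users space groups -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.A.acl_administer_queue</name>
    <value>admin1 opsgroup</value>
  </property>
</configuration>
```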
IV. Fair scheduler
The figure above is a schematic diagram of the Fair scheduler's execution within a single queue. The Fair scheduler allocates resources so that, over the whole timeline, all jobs receive an even share. By default, the Fair scheduler schedules only memory fairly. When a single task is running in the cluster, it occupies all of the cluster's resources; when other tasks are submitted, resources that are freed up are allocated to the new jobs, so that eventually every task receives roughly the same amount of resources.
The Fair scheduler can also work across multiple queues. As shown in the figure above, suppose there are two users, A and B, each with their own queue. When A starts a job and B has nothing to submit, A gets all of the cluster's resources. When B then starts a job, A's task keeps running, but queue A gradually releases some of its resources, and after a while the two tasks each hold half of the cluster. If B now starts a second job while the other tasks are still running, it shares queue B's resources with B's first job, so each of queue B's two jobs uses a quarter of the cluster, while queue A's job still uses half. The cluster's resources thus end up shared equally between the two users.
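The halving and quartering in this example follows from a simple rule: split the cluster across queues by weight, then split each queue's share evenly across its running jobs. A small sketch (a toy model of the instantaneous fair share, not Yarn code; the function and queue names are made up):

```python
def fair_share(cluster, queue_weights, jobs_per_queue):
    """Instantaneous fair share: the cluster splits across queues by weight,
    then each queue's share splits evenly across its running jobs."""
    total_w = sum(queue_weights.values())
    shares = {}
    for q, w in queue_weights.items():
        q_share = cluster * w / total_w          # queue-level share
        for j in range(jobs_per_queue.get(q, 0)):
            # each running job in the queue gets an equal slice
            shares[f"{q}.job{j+1}"] = q_share / jobs_per_queue[q]
    return shares

# Two equal-weight queues; A runs one job, B runs two:
print(fair_share(100, {"A": 1, "B": 1}, {"A": 1, "B": 2}))
# A's job holds half the cluster; each of B's jobs holds a quarter
```

This reproduces the article's scenario: queue A's single job keeps 50% while queue B's two jobs get 25% each.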
Configuration of related parameters:
(1) yarn.scheduler.fair.allocation.file: the location of the allocation file, an XML file describing the queues and their properties. The file must be strictly valid XML. If a relative path is given, the file is looked up on the classpath (under the conf directory). The default is fair-scheduler.xml.
(2) yarn.scheduler.fair.user-as-default-queue: whether to use the submitting user's name as the default queue name when no queue name is specified. If set to false (and no queue name is specified), all jobs share the "default" queue. The default is true.
(3) yarn.scheduler.fair.preemption: whether to enable preemption. The default is false. In this version, the feature is still experimental.
(4) yarn.scheduler.fair.assignmultiple: whether to allow multiple container assignments in a single heartbeat. The default is false.
(5) yarn.scheduler.fair.max.assign: if assignmultiple is true, the maximum number of containers that can be assigned in one heartbeat. The default is -1, meaning unlimited.
(6) yarn.scheduler.fair.locality.threshold.node: a float between 0 and 1 expressing, as a fraction of the cluster size, how many scheduling opportunities may be passed up while waiting for a container that satisfies the node-local condition before settling for one that does not. The default of -1.0 means no scheduling opportunities are ever given up.
(7) yarn.scheduler.fair.locality.threshold.rack: the same as above, but for the rack-local condition.
(8) yarn.scheduler.fair.sizebasedweight: whether an application's weight is based on its size (number of jobs). The default is false; if true, larger applications receive more resources.
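A minimal yarn-site.xml fragment enabling the Fair scheduler and setting a few of the parameters above might look like this (the chosen values are illustrative, not recommendations):

```xml
<!-- yarn-site.xml: switch ResourceManager to the Fair scheduler -->
<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>fair-scheduler.xml</value>  <!-- relative path, resolved on the classpath -->
  </property>
  <property>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>
</configuration>
```

The referenced allocation file then defines the queues themselves, e.g. an `<allocations>` root element containing `<queue name="A"><weight>1.0</weight></queue>` entries.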
V. Summary
If the business logic is simple, or you are new to Hadoop, the FIFO scheduler is recommended; if you need to control the priority of certain applications while still making full use of cluster resources, use the Capacity scheduler; if you want multiple users or queues to share cluster resources fairly, choose the Fair scheduler. I hope you can choose the right scheduler for your business needs.