
Hadoop YARN: scheduling performance Optimization practice

2025-03-28 Update From: SLTechnology News&Howtos


Background

As the resource management system of Hadoop, YARN is responsible for the management of computing resources and job scheduling on Hadoop clusters.

Meituan's YARN is a branch based on community version 2.7.1. It currently serves offline, real-time, and machine learning workloads.

Offline business mainly runs data warehouse jobs based on Hive on MapReduce and Spark SQL.

Real-time business mainly runs streaming computing jobs based on Spark Streaming and Flink.

Machine learning business mainly runs computing tasks on TensorFlow, MXNet, and MLX (a large-scale machine learning system developed by Meituan Dianping).

YARN faces many problems in high availability, scalability, and stability. The most serious scalability issue is scheduler performance degradation as the cluster and business scale grow. From a business perspective, assume a cluster of 1,000 nodes, each providing 100 CPUs of computing power, where every task uses 1 CPU and runs for 1 minute on average, and the cluster always has demand for more than 100,000 CPUs at peak. If the scheduler can only schedule 50,000 tasks per minute, then at minute granularity cluster resource utilization is 50,000 / (100 × 1,000) = 0.5, meaning 50% of the cluster's computing resources go unused purely because of scheduling capacity.

As the cluster expands and business volume grows, scheduling capacity gradually degrades under the rising pressure. Even assuming it stays at 50,000 tasks per minute, at a scale of 5,000 nodes and without any optimization, cluster resource utilization would be 50,000 / (100 × 5,000) = 10%, leaving the remaining 90% of machine resources unusable.
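The back-of-envelope arithmetic above can be sketched as follows. This is a minimal illustration, not YARN code; the class and method names are assumptions made for this example.

```java
// Estimates minute-granularity cluster utilization when the scheduler,
// not the hardware, is the bottleneck: with 1-minute tasks that each use
// 1 CPU, concurrent capacity is cpusPerNode * nodes tasks, refilled at
// tasksPerMin tasks per minute.
public class UtilizationEstimate {
    static double utilization(long tasksPerMin, int cpusPerNode, int nodes) {
        return (double) tasksPerMin / ((long) cpusPerNode * nodes);
    }

    public static void main(String[] args) {
        System.out.println(utilization(50_000, 100, 1_000)); // 0.5 at 1,000 nodes
        System.out.println(utilization(50_000, 100, 5_000)); // 0.1 at 5,000 nodes
    }
}
```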

After this problem is solved, when the cluster has spare resources, the demand for job resources can be quickly met, and the computing resources of the cluster can be fully utilized.

The following sections walk through the core modules of the Hadoop YARN scheduling system, uncover the root causes of the performance problems above, and propose systematic solutions, ultimately enabling a single Hadoop YARN cluster to support tens of thousands of nodes and tens of thousands of concurrently running jobs.

Overall architecture

YARN architecture

YARN is responsible for job resource scheduling: finding resources in the cluster that meet job requirements, helping jobs start tasks, and managing job life cycles.

For detailed architectural design of YARN, please refer to the official documentation of Hadoop.

Resource abstraction

YARN abstracts cluster resources along two dimensions: CPU and memory.

```java
class Resource {
    int cpu;      // number of CPU cores
    int memoryMb; // memory in MB
}
```

A job applies for resources from YARN with a request of type List<ResourceRequest>:

```java
class ResourceRequest {
    int numContainers;   // number of containers required
    Resource capability; // resources of each container
}
```

YARN's response to the job is a List<Container>:

```java
class Container {
    ContainerId containerId; // globally unique container identifier in YARN
    Resource capability;     // resource information of the container
    String nodeHttpAddress;  // hostname of the NodeManager on which the container can start
}
```

YARN scheduling architecture


Terminology

ResourceScheduler is the scheduler of YARN and is responsible for the allocation of Container.

AsyncDispatcher is a single-threaded event dispatcher that sends scheduled events to the scheduler.

ResourceTrackerService is a resource tracking service, which is mainly responsible for receiving and processing NodeManager heartbeat information.

ApplicationMasterService is the RPC service of the job, which is mainly responsible for receiving the heartbeat information of the job.

AppMaster is the program controller of the job, which is responsible for interacting with YARN to obtain / release resources.

Scheduling process

Job resource application process: the AppMaster informs YARN of its resource requirements (List<ResourceRequest>) through a heartbeat and retrieves the resources (List<Container>) the scheduler has allocated since the last heartbeat.

Scheduler resource allocation process: a NodeManager heartbeat triggers the scheduler to assign Containers to that NodeManager.

Resource requests and allocations are therefore asynchronous. ResourceScheduler is an abstract class that must be implemented; the community provides the Fair Scheduler (FairScheduler) and the Capacity Scheduler (CapacityScheduler). Meituan Dianping uses the Fair Scheduler, which suits the characteristics of its business model.

Organization of Fair Scheduler operations

In the Fair Scheduler, apps are mounted as leaves on the tree of queues shown below.

Core scheduling process

The scheduler locks the FairScheduler object to avoid conflicts on core data structures.

The scheduler picks a node in the cluster and, starting from the root queue ROOT, selects a sub-queue at each layer according to the fairness policy; in the leaf queue it selects an App by the same policy, and then tries to find suitable resources on that node for the App.

For each layer queue, perform the following process:

Queue pre-check: check whether the resource usage of the queue has exceeded the Quota of the queue

Sort subqueues / App: sort subqueues / App according to fair scheduling policy

Recursive scheduling subqueue / App

For example, the path to a schedule is ROOT-> ParentQueueA-> LeafQueueA1-> App11, and this time the schedule assigns Container to App11 from node.

Pseudo code

```java
class FairScheduler {
    /* input:  NodeId node
     * output: the amount of resources allocated to one container of an app
     * root is the root Queue of the tree of queues */
    synchronized Resource attemptScheduling(NodeId node) {
        return root.assignContainer(node);
    }
}

class Queue {
    Resource assignContainer(NodeId node) {
        if (!preCheck(node)) return;        // pre-check
        sort(this.children);                // sort
        if (this.isParent) {
            for (Queue q : this.children)
                q.assignContainer(node);    // recursive call
        } else {
            for (App app : this.runnableApps)
                app.assignContainer(node);
        }
    }
}

class App {
    Resource assignContainer(NodeId node) { ... }
}
```

Fair Scheduler Architecture

The Fair Scheduler is a multi-threaded, asynchronously cooperating architecture. To keep data consistent during scheduling, the main process takes the FairScheduler object lock, and the core scheduling process is executed by a single thread. This means Container allocation is serial, which is the root cause of the scheduler's performance bottleneck.

Scheduler Lock: the FairScheduler object lock.

AllocationFileLoaderService: responsible for hot-reloading the fair scheduling policy configuration file and updating the queue data structures.

Continuous Scheduling Thread: the core scheduling thread, which continuously executes the core scheduling process described in the previous section.

Update Thread: updates queue resource demands, performs the Container preemption process, and so on.

Scheduler Event Dispatcher Thread: the scheduler event handler, which handles events such as App added, App finished, node added, and node removed.

Performance evaluation

The architecture of the Fair Scheduler is introduced above; under large-scale business pressure this system has performance problems. At the application layer, this shows up as job resource demand going unmet. At the module level, multiple modules cooperate and each has performance problems to a greater or lesser degree. How do we evaluate whether the system's performance meets the needs of the online business, and how do we evaluate its carrying capacity? We need a performance target for the system. Therefore, before discussing the optimization scheme, we first discuss how to evaluate the performance of a scheduling system.

Generally, an online serving system's performance is evaluated by the QPS it can carry and the TP99 response latency. A scheduling system differs: its performance cannot be evaluated by RPC response latency (for the RPCs ResourceManager receives from NodeManager and AppMaster), because those RPC calls are asynchronous with respect to the scheduling process. However poor scheduling performance is, RPC response times are almost unaffected, and likewise, however poor RPC responses are, scheduling performance is almost unaffected.

Business indicators-effective scheduling

First, consider the business-level indicator of the scheduling system from the perspective of meeting business needs. The business goal of the scheduling system is to satisfy business resource demand, and the indicator for this is effective scheduling (validSchedule). In a production environment, as long as validSchedule is up to standard, we consider the current scheduler to meet the needs of the online business.

validSchedulePerMin indicates whether scheduling performance in a given minute is up to standard: 1 if it is, 0 if not.

validPending = min(queuePending, QueueMaxQuota)

if (usage / total > 90% || validPending == 0):
    validSchedulePerMin = 1  // utilization above 90%, or no valid resource demand: the scheduler's performance is up to standard
if (validPending > 0 && usage / total < 90%):
    validSchedulePerMin = 0  // utilization below 90% while valid resource demand exists: not up to standard

Where:

validPending is the valid resource demand of jobs in the cluster; queuePending is the resource demand of all jobs in a queue; QueueMaxQuota is the maximum resource quota of that queue; usage is the amount of cluster resources already in use; total is the total cluster resources.

The 90% threshold accounts for resource fragmentation: every node in the pool may hold a small amount of resources that cannot satisfy any resource request, a problem similar to memory fragmentation in Linux. Because offline tasks execute very briefly, resources are reclaimed quickly; in offline computing scenarios, scheduling efficiency matters far more than precisely managing resource fragments, so the offline scheduling policy ignores fragmentation for now.

validSchedulePerDay, the daily compliance rate of scheduling performance, is:

validSchedulePerDay = Σ validSchedulePerMin / 1440

At the current online business scale, the business targets are: validSchedulePerMin > 0.9; validSchedulePerDay > 0.99.
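The metric definition above can be expressed as a small sketch. This is an illustration of the formula, not production monitoring code; the method signatures are assumptions.

```java
// Computes the validSchedule metrics defined above: a minute is "up to
// standard" when utilization exceeds 90% or no valid demand remains.
public class ValidSchedule {
    static int validSchedulePerMin(long queuePending, long queueMaxQuota,
                                   long usage, long total) {
        long validPending = Math.min(queuePending, queueMaxQuota);
        return (usage > 0.9 * total || validPending == 0) ? 1 : 0;
    }

    // Daily compliance rate: fraction of the day's 1440 minutes up to standard.
    static double validSchedulePerDay(int[] perMinute) {
        int sum = 0;
        for (int v : perMinute) sum += v;
        return sum / 1440.0;
    }

    public static void main(String[] args) {
        // 95% utilization: up to standard even with pending demand.
        System.out.println(validSchedulePerMin(1000, 500, 95, 100)); // 1
        // 50% utilization with valid demand unmet: not up to standard.
        System.out.println(validSchedulePerMin(1000, 500, 50, 100)); // 0
    }
}
```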

System performance indicator-number of scheduled Container per second

The essence of a scheduling system is assigning Containers to jobs, so we propose the system performance indicator CPS: Containers scheduled per second. In the production environment, validSchedule being up to standard shows the current scheduler meets online business needs. In the test environment, we instead watch CPS under different pressures to find the upper limit of the current system's capacity and guide further performance optimization.

CPS depends on the test pressure: the greater the pressure, the lower the CPS. From the Fair Scheduler architecture above, CPS is related to the following factors:

The total amount of resources in the cluster: the more resources, the more Containers the cluster can run concurrently, and the greater the scheduling pressure. Since the CPU and memory of individual physical machines differ little, total cluster resources mainly depend on the number of physical nodes.

The number of App running in the cluster; the more jobs, the more information you need to schedule, and the greater the scheduling pressure.

The number of queues in the cluster; the more the number of queues, the more information you need to schedule and the greater the scheduling pressure.

The execution time of each task in the cluster; the shorter the task execution time, the faster the release of resources, then the more free resources dynamically generated, the greater the pressure on the scheduling system.

For example, consider a cluster of 1,000 nodes running 1,000 Apps concurrently, distributed over 500 queues, with each Container of each App executing for 1 minute. If, under this pressure and with plentiful resource demand, the scheduling system can schedule 1,000 Containers per second, then its CPS under these conditions is 1,000/s.

Scheduling pressure simulator

In the online environment, we can observe the scheduling system indicators above to see whether current performance meets business requirements. But a performance optimization strategy cannot be tested directly online, so we must be able to verify offline that the scheduler's performance meets the requirements of the online business; only then can an effective optimization be rolled out to production.

Could we study the scheduler's performance by building an offline cluster of the same size? In theory yes, but it would require an enormous number of physical machines, a huge cost for the company. We therefore need a pressure simulator for the scheduler, one that can reproduce YARN's scheduling process without large amounts of hardware.

The community provides Scheduler Load Simulator (SLS), an open-source stress simulation tool for the scheduler.

As shown in the figure above, on the left is the architecture of the open-source SLS: everything runs in one process, the Scheduler in the ResourceManager module is simulated by threads, and both Apps and NMs (NodeManagers) are simulated by threads as well. Job resource requests and NM node heartbeats are plain method calls.

The problems with open source architecture are:

Simulating large numbers of Apps and NMs requires many threads, which compete with the scheduler thread for CPU and distort the evaluation of the scheduler.

SLS's Scheduler Wrapper adds unreasonable logic that seriously affects the performance of the scheduler.

For generality, SLS does not instrument FairScheduler's scheduling process to obtain performance indicators; it only collects peripheral metrics such as queue resource demand, queue resource usage, App resource demand, and App resource usage. These are not performance indicators and cannot be used to locate system bottlenecks.

Given these problems, we restructured the architecture. On the right is the modified design: the simulation logic of SLS's Scheduler Wrapper is stripped out and replaced with a real ResourceManager, and SLS is only responsible for simulating jobs' resource requests and nodes' heartbeat reports. Because the ResourceManager is real, the metrics exposed in the online production environment and the offline stress-test environment are exactly the same, so online and offline indicators can be compared directly. Detailed code: YARN-7672.

Fine-grained monitoring index

Stress testing with the scheduling pressure simulator shows that validSchedule fails to meet the standard, but not where the bottleneck lies, so fine-grained indicators are needed. Because the scheduling process is single-threaded, the way to obtain them is to instrument FairScheduler and collect the per-minute time spent in key functions during scheduling, with the goal of finding the most time-consuming function and thus the system bottleneck. For example, adding timing statistics before and after the preCheck function collects the time preCheck consumes during scheduling.
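The instrumentation idea described above can be sketched as a small timing accumulator. This is an illustrative helper, not the actual Hadoop metrics code; the class and method names are assumptions.

```java
import java.util.function.Supplier;

// Wraps a key scheduling function with timers and accumulates the total
// elapsed time; a metrics thread can drain the total once per minute to
// produce the per-minute fine-grained indicators described above.
public class FunctionTimer {
    private long accumulatedNanos = 0;

    <T> T time(Supplier<T> fn) {
        long start = System.nanoTime();
        try {
            return fn.get();          // e.g. the preCheck call being measured
        } finally {
            accumulatedNanos += System.nanoTime() - start;
        }
    }

    // Read and reset the per-minute total (called by the metrics thread).
    long drainNanos() {
        long v = accumulatedNanos;
        accumulatedNanos = 0;
        return v;
    }
}
```

Since the core scheduling process is single-threaded, a plain long accumulator suffices and the instrumentation overhead stays negligible.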

Based on the above ideas, we define more than 10 fine-grained indicators, and the key indicators are:

Parent queue preCheck time per minute

Parent queue sort time per minute

Sub-queue preCheck time per minute

Subqueue sorting time per minute

Time to allocate resources to a job per minute

Time spent per minute skipping jobs that have no resource demand

Key optimization point

In the first stress test, the pressure applied was the then-current peak of the online production environment (1,000 nodes, 1,000 concurrent jobs, 1,000 queues, 40-second single-Container execution time). After each round of optimization improved scheduler performance enough to meet business requirements, the test pressure was raised according to predicted business growth and optimization was iterated.

The following figure shows the performance optimization timeline, and the vertical axis is the scheduling performance CPS.

Optimizing the sort comparison function

In the core scheduling process, the second step sorts the sub-queues. The fine-grained indicators clearly show that the scheduling process consumes 50 seconds per minute in total, of which sorting takes 30 seconds, the largest share, so sorting time was optimized first.

The quicksort itself leaves no room for optimization, but further analysis shows that the sort comparison function has very high time complexity.

The most expensive part is obtaining the resource usage (resourceUsage) of a queue or job. In the original algorithm, whenever two queues are compared and resourceUsage is needed, it is computed on the spot by recursively summing the resourceUsage of all jobs under the queue, which leads to an enormous amount of repeated calculation.

Optimization strategy: replace on-the-spot calculation with calculation in advance.

Advance-calculation algorithm: when a Container (with resource amount containerResource) is allocated to an App, recursively walk up and add containerResource to each parent queue's resourceUsage (parent resourceUsage += containerResource); when a Container of an App is released, subtract it in the same way (parent resourceUsage -= containerResource). With this, the time complexity of reading a queue's resourceUsage drops to O(1).
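The advance-calculation idea can be sketched as follows. The resource is simplified to a single long, and the class and field names are assumptions, not FairScheduler's actual types.

```java
import java.util.ArrayList;
import java.util.List;

// resourceUsage is maintained incrementally on allocate/release by
// propagating the delta up to the root, so reads during sort comparisons
// are O(1) instead of a recursive recomputation.
class SchedulableQueue {
    final SchedulableQueue parent;
    final List<SchedulableQueue> children = new ArrayList<>();
    long resourceUsage; // always up to date, never recomputed on the spot

    SchedulableQueue(SchedulableQueue parent) {
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }

    // Called when a Container is allocated to an App under this leaf queue.
    void allocate(long containerResource) {
        for (SchedulableQueue q = this; q != null; q = q.parent)
            q.resourceUsage += containerResource; // propagate up to ROOT
    }

    // Called when a Container of an App is released.
    void release(long containerResource) {
        for (SchedulableQueue q = this; q != null; q = q.parent)
            q.resourceUsage -= containerResource;
    }
}
```

Each allocation or release costs O(depth of the queue tree), which is tiny compared with the repeated recursive sums it replaces in the comparison function.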

Optimization effect: the time consumption of fine-grained indicators related to sorting is significantly reduced.

The metrics in the red box indicate the time spent by the per-minute scheduler in queue / job sorting. As can be seen from the figure, after optimization, the sorting time is reduced from 30G (30 seconds) per minute to less than 5G (5 seconds). Detailed code reference: YARN-5969

Optimizing job skip time

In the figure above, after optimizing the sort comparison function, the blue line rose significantly, from 2 seconds to 20 seconds. This line is the time per minute the scheduler spends skipping jobs that have no resource demand; given its share of total time, reducing it became the next optimization goal.

Code analysis shows that all queues/jobs participate in scheduling, yet many of them have no resource demand at all and need not participate. The optimization strategy is therefore to remove queues/jobs without resource demand from a queue's children before sorting.
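The filtering step can be sketched as below. The Schedulable type here is a simplified stand-in for FairScheduler's queue/app abstraction; names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Skips schedulables with no pending demand before sorting, so sorting
// and recursion only touch queues/jobs that can actually use resources.
class DemandFilter {
    static class Schedulable {
        final String name;
        final long pendingDemand; // resources requested but not yet allocated

        Schedulable(String name, long pendingDemand) {
            this.name = name;
            this.pendingDemand = pendingDemand;
        }
    }

    // Keep only children that still want resources; the scheduler then
    // sorts and recurses over this much smaller list.
    static List<Schedulable> runnable(List<Schedulable> children) {
        List<Schedulable> out = new ArrayList<>();
        for (Schedulable s : children)
            if (s.pendingDemand > 0) out.add(s);
        return out;
    }
}
```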

Optimization effect: this index has been reduced from 20 seconds to almost negligible. Detailed code reference: YARN-3547

At this point the figure shows another metric trending upward, now the largest share of total scheduling time: the time the scheduler spends allocating a Container to a job once the job to schedule has been chosen. The average execution time of this logic is within 0.02 ms and does not grow with cluster or job scale, so it was not optimized further for the time being.

Queue parallel sorting optimization

From the core scheduling process we can see that every Container allocation requires queue sorting, and sorting time grows linearly with business scale (the number of jobs and queues).

Architectural thinking: in the Fair Scheduler, sorting exists to implement the fairness policy. But resource demand changes constantly, and every change makes job resource usage unfair again; even sorting on every single Container allocation cannot achieve fairness across the entire timeline.

For example, suppose the cluster has 10 CPUs. At time T1 only one job, App1, is running, and it requests 10 CPUs, so the cluster assigns all 10 to App1. At time T2 (T2 > T1) a new job App2 arrives, but the cluster has no free resources and cannot allocate any to App2. At this point resource usage between App1 and App2 is unfair. The example shows that fairness over the timeline cannot be achieved by the scheduling allocation algorithm alone.

At present, the Fair Scheduler's fairness policy guarantees fairness of resource scheduling at a given instant; across the entire timeline, the preemption mechanism is needed to complete the fairness goal. From the timeline point of view, therefore, it is unnecessary to sort on every Container allocation.

To sum up, the optimization strategy is to parallelize the sorting process and the scheduling process. The main points are as follows:

The scheduling process no longer performs the sorting step.

A separate thread pool handles the sorting of all queues, where each thread handles the sorting of one queue.

Before sorting, the queue/job information used by the sort is deep-cloned, so the live queue/job data structures remain untouched.

The optimization results are as follows:

Queue sorting efficiency: sorting 2,000 queues with the thread pool takes under 5 milliseconds (2-5 ms), so they can be re-sorted at least 200 times per second, with no impact on the business at all.

Under a pressure of 10,000 jobs running in parallel on a 12,000-node cluster with 2,000 queues and 40-second single-Container execution time, scheduling CPS reaches 50,000, enough to fill the entire cluster's resources within one minute and keep them full.

In the figure above, at 15:26 the pending value drops to 0, indicating that all current resource demand in the cluster has been scheduled. At 15:27 resourceUsage reaches 1.0, meaning cluster resource utilization is 100% with no free resources left. The pending value then reaches 4M (a memory demand of 4 million MB): resource demand waiting because there are no free resources.

Strategies for a stable rollout

The offline stress-test results are excellent, but the business goal is only achieved once the changes are online. Going online steadily, however, is difficult for the following reasons:

The online environment and the offline stress-test environment differ greatly in business characteristics; no problems offline does not mean no problems online.

At the time there was only one YARN cluster and therefore only one scheduler. Any scheduler exception would be a disaster, making the entire cluster unavailable.

In addition to routine unit tests, functional tests, stress tests, and alarm thresholds, we designed rollout strategies for the cluster scheduling system according to the business scenario.

Online rollback strategy

The business peak of offline production is in the early morning, so that is when service failures are most likely. The efficiency of an on-call engineer woken by an alarm executing the usual rollback process (rolling back code, restarting the service) is very low, and during the restart the service is unavailable, extending the outage. We therefore put every scheduler optimization strategy behind a configuration parameter: simply modifying the parameter and running the configuration-update command changes the scheduler's execution logic and switches it back to the pre-optimization path, without restarting the service.

The key problem is that the configuration-loading thread updates a scheduler parameter while the scheduling thread is working according to that same parameter. The value may be read several times within one scheduling pass, with logic branching on it each time; if the scheduling thread observes the value changing mid-pass, the system can behave inconsistently.

The fix is to avoid the data inconsistency of multi-threaded shared state by copying: at the start of each scheduling pass, the scheduling thread first copies all current performance-optimization parameters, guaranteeing that the parameters it observes do not change during that pass.
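The copy-at-start idea can be sketched with an immutable parameter object behind a volatile reference. The field names and flags here are assumptions for illustration, not Meituan's actual configuration keys.

```java
// Immutable bundle of hot-reloadable tuning parameters: once constructed,
// it can never change, so a reference to it is a safe snapshot.
class SchedulerParams {
    final boolean parallelSortEnabled;
    final boolean skipIdleApps;

    SchedulerParams(boolean parallelSortEnabled, boolean skipIdleApps) {
        this.parallelSortEnabled = parallelSortEnabled;
        this.skipIdleApps = skipIdleApps;
    }
}

class ParamSnapshotScheduler {
    // Replaced atomically by the config-reload thread; volatile guarantees
    // the scheduling thread sees a fully constructed object.
    volatile SchedulerParams live = new SchedulerParams(true, true);

    void attemptScheduling() {
        SchedulerParams p = live; // one read: fixed for this entire pass
        if (p.skipIdleApps) { /* ... skip queues/jobs without demand ... */ }
        if (p.parallelSortEnabled) { /* ... consume pre-sorted snapshots ... */ }
    }
}
```

Because the parameter object is immutable, "copying" is just capturing the reference once per pass; a mid-pass config reload installs a new object without disturbing passes already in flight.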

Automatic data check strategy

An optimization should improve performance without changing the algorithm's output; correctness must be preserved. For a complex algorithmic optimization, guaranteeing correctness is a very difficult task.

In the research and development of "optimized sorting comparison time", the calculation method of queue resourceUsage has been changed from on-site calculation to advance calculation. So how to ensure that the resourceUsage calculated by the optimized algorithm is correct?

Even with unit tests, functional tests, and stress tests, a complex system can never be verified 100%. Moreover, future system upgrades may introduce bugs into this code.

If the new resourceUsage is miscalculated after the algorithm change, the scheduling policy will keep executing incorrectly, distorting queue resource allocation, with a huge impact on the business: for example, jobs failing to get their entitled resources, causing business delays.

The resourceUsage of all queues obtained by the original on-the-spot calculation is necessarily correct; define it as oldResourceUsage. The values produced by the optimized advance calculation are defined as newResourceUsage.

The system periodically compares oldResourceUsage with newResourceUsage. Any inconsistency shows that the optimized algorithm has a bug and newResourceUsage is miscalculated. The system then sends an alarm to the engineers and automatically replaces all miscalculated data with the correct values, so errors are corrected promptly.
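The periodic self-check can be sketched as follows: recompute usage the slow, known-correct way, compare per queue, and overwrite any mismatched fast value. Class and method names are assumptions, and the maps stand in for the real queue structures.

```java
import java.util.Map;

// Compares the incrementally maintained values (newUsage) against the
// recomputed, known-correct values (oldUsage); on mismatch, counts it
// (for alerting) and auto-corrects the fast copy in place.
class UsageVerifier {
    static int verifyAndFix(Map<String, Long> newUsage, Map<String, Long> oldUsage) {
        int mismatches = 0;
        for (Map.Entry<String, Long> e : oldUsage.entrySet()) {
            Long fast = newUsage.get(e.getKey());
            if (fast == null || !fast.equals(e.getValue())) {
                mismatches++;
                newUsage.put(e.getKey(), e.getValue()); // auto-correct
                // an alarm to the on-call engineers would fire here (omitted)
            }
        }
        return mismatches;
    }
}
```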

Summary and prospect in the future

This paper mainly introduces the performance optimization practice of Meituan Dianping Hadoop YARN cluster fair scheduler.

To do performance optimization, we must first define macro performance indicators, so as to be able to evaluate the performance of the system.

Define the fine-grained indicators that need to be observed in the pressure test in order to clearly see the bottleneck of the system.

As the saying goes, to do a good job one must first sharpen one's tools: an efficient stress-testing tool is indispensable for performance optimization.

The main ideas of the optimization algorithm are: reducing the time complexity of the algorithm; reducing repeated calculations and unnecessary calculations; parallelization.

Performance optimization is endless, so we should reasonably estimate the business pressure according to the real business, and gradually carry out the work of performance optimization.

Be careful when putting code online, and prepare a solid contingency plan.

The performance of a single YARN cluster's scheduler is always limited. We can currently support a cluster of 10,000 nodes; how do we handle 100,000 or 1,000,000 nodes in the future?

Our solution is to design, based on community ideas, a technical scheme suited to Meituan Dianping's business scenarios. The community's Hadoop 3.0 developed Global Scheduling, which completely reworks the current YARN scheduler architecture and can greatly improve single-cluster scheduling performance; we are following up on this feature. The community's YARN Federation has also gradually matured: it lets multiple YARN clusters provide a unified computing service, and since each cluster has its own scheduler, it horizontally scales the number of schedulers and thus the overall scheduling capability. Based on the community architecture and combined with Meituan Dianping's business scenarios, we are continuously improving our YARN Federation.

About the authors

Shilong and Ting Wang, R&D engineers in the big data and algorithm department of Meituan's user platform.

Team introduction

The goal of the data platform resource scheduling team is to build a super-large-scale, high-performance resource scheduling system supporting heterogeneous computing resources and multiple scenarios. At present, nearly 30,000 computing nodes are under management in a single cluster.
