How to analyze and solve Yarn shuffle OOM errors

2025-04-03 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31

This article analyzes the Yarn shuffle OOM error in detail and describes how it was resolved, in the hope of giving others who hit the same problem a simpler, workable fix.

Recently, some jobs on our cluster have frequently failed with a shuffle OOM error on the reduce side, as shown below:

2015-03-09 16 org.apache.hadoop.mapred.YarnChild 1915 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#14
	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:297)
	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:287)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:411)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:341)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)

First, a look at the basic process. When a map task finishes processing, it writes its output to a local path on the map side, and the map task keeps sending heartbeats to MRAppMaster. At the appropriate stage the reduce tasks start, and each reduce task sends heartbeats to MRAppMaster to obtain the list of completed map tasks. The reduce task then pulls the output of those completed map tasks through its Fetcher threads (this step is commonly known as the shuffle), and a sort follows. Several important parameters are involved:

public static final String SHUFFLE_INPUT_BUFFER_PERCENT = "mapreduce.reduce.shuffle.input.buffer.percent"; // default is 0.7

public static final String SHUFFLE_MEMORY_LIMIT_PERCENT = "mapreduce.reduce.shuffle.memory.limit.percent"; // default is 0.25

public static final String SHUFFLE_MERGE_PERCENT = "mapreduce.reduce.shuffle.merge.percent"; // default is 0.66

The problem shows up during the Fetcher stage. First, the parameters. SHUFFLE_INPUT_BUFFER_PERCENT is the fraction of the total heap that the shuffle may use for buffering map output in memory. Our total heap size is 1.5 GB, so the Fetcher threads get about 1.0 GB. SHUFFLE_MEMORY_LIMIT_PERCENT decides whether a single map output copied from a map task is held in memory or written straight to disk: any output larger than 1.5 GB × 0.7 × 0.25 ≈ 0.26 GB goes to disk, while smaller outputs are given a buffer in memory.
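The arithmetic above can be sketched in a few lines of Java. This is a minimal illustration, not Hadoop's own code: the class and method names are hypothetical, the heap is taken as exactly 1536 MB, and the merge threshold is assumed to be a fraction of the shuffle buffer rather than of the whole heap.

```java
public class ShuffleLimits {
    static final long MB = 1024L * 1024L;

    // Defaults quoted in the article.
    static final double INPUT_BUFFER_PERCENT = 0.70;
    static final double MEMORY_LIMIT_PERCENT = 0.25;
    static final double MERGE_PERCENT = 0.66;

    /** In-memory shuffle buffer: heap size times input.buffer.percent. */
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    /** Largest single map output kept in memory: buffer times memory.limit.percent. */
    static long maxSingleShuffleBytes(long bufferBytes, double memoryLimitPercent) {
        return (long) (bufferBytes * memoryLimitPercent);
    }

    /** Buffer usage at which a merge is triggered (assumed: buffer times merge.percent). */
    static long mergeThresholdBytes(long bufferBytes, double mergePercent) {
        return (long) (bufferBytes * mergePercent);
    }

    /** True if a map output of this size would be held in memory instead of on disk. */
    static boolean fitsInMemory(long mapOutputBytes, long maxSingleShuffleBytes) {
        return mapOutputBytes < maxSingleShuffleBytes;
    }

    public static void main(String[] args) {
        long heap = 1536L * MB; // the 1.5 GB heap from the article
        long buffer = shuffleBufferBytes(heap, INPUT_BUFFER_PERCENT);
        long single = maxSingleShuffleBytes(buffer, MEMORY_LIMIT_PERCENT);
        long merge = mergeThresholdBytes(buffer, MERGE_PERCENT);

        System.out.println("shuffle buffer (MB):       " + buffer / MB);
        System.out.println("single output limit (MB):  " + single / MB);
        System.out.println("merge threshold (MB):      " + merge / MB);
        System.out.println("200 MB output in memory?   " + fitsInMemory(200L * MB, single));
        System.out.println("300 MB output in memory?   " + fitsInMemory(300L * MB, single));
    }
}
```

With these inputs the buffer comes out at a little over 1 GB and the per-output limit at roughly a quarter of that, which matches the figures in the text.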

SHUFFLE_MERGE_PERCENT is the merge threshold. Once the data buffered in memory passes this percentage, the buffered map outputs are merged and the merged result is written to disk. After the OOM occurred, we adjusted the JVM parameters to capture a heap dump and analyzed it with MAT.

The data are as follows:

[Images: Yarn shuffle OOM error analysis and solution]

The first finding is that overall memory usage had not reached 1.5 GB. Second, looking at the distribution of objects in memory, byte arrays account for a large share. That in itself is normal, because every in-memory buffer is a byte array.

The problem appears when we look at the size of those byte arrays, which is more than 900 MB. The overall heap is 1.5 GB and the Old generation is about 1 GB. If 900 MB of byte arrays are already held and a new map output of 100 MB or more is copied in, the new array is a large allocation, so it does not go into the Young generation and is allocated directly in the Old generation. The Old generation does not have that much contiguous free space even after a full GC, so the allocation fails and an OOM error is reported.

At this point this was only a hypothesis. To test it, we adjusted the Xmn parameter to shrink the Young generation and enlarge the Old generation. The job then succeeded, which confirmed the idea.
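The reasoning about the Old generation can be checked with a rough calculation. The sketch below is only an estimate under stated assumptions: the class and method names are hypothetical, the Young generation sizes (512 MB before, 256 MB after) and the 150 MB request are illustrative values chosen to match "Old area about 1 G" and "100 MB+", and the check is optimistic because it ignores fragmentation inside the Old generation.

```java
public class OldGenEstimate {
    static final long MB = 1024L * 1024L;

    /** Old generation size when the Young generation is fixed (for example with -Xmn). */
    static long oldGenBytes(long heapBytes, long youngBytes) {
        return heapBytes - youngBytes;
    }

    /**
     * Optimistic check: does a new large array fit next to what is already live
     * in the Old generation? Real allocation also needs contiguous free space.
     */
    static boolean fitsInOldGen(long oldGenBytes, long liveBytes, long requestBytes) {
        return liveBytes + requestBytes <= oldGenBytes;
    }

    public static void main(String[] args) {
        long heap = 1536L * MB;     // 1.5 GB heap from the article
        long live = 900L * MB;      // byte arrays already held, per the heap dump
        long request = 150L * MB;   // hypothetical "100 MB+" map output

        long oldBefore = oldGenBytes(heap, 512L * MB); // assumed Young size, Old is 1 GB
        long oldAfter = oldGenBytes(heap, 256L * MB);  // assumed smaller Young size

        System.out.println("Old before (MB): " + oldBefore / MB
                + ", request fits: " + fitsInOldGen(oldBefore, live, request));
        System.out.println("Old after  (MB): " + oldAfter / MB
                + ", request fits: " + fitsInOldGen(oldAfter, live, request));
    }
}
```

Under these assumed sizes the request does not fit in a 1 GB Old generation but does fit once the Young generation is reduced, which is consistent with the outcome of the Xmn test described above.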

However, adjusting JVM parameters for every job we run is not realistic. Instead, we can lower SHUFFLE_INPUT_BUFFER_PERCENT based on experience; setting it to 0.6 solves the problem.
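One way to see why 0.6 helps is to compare the shuffle buffer size with the Old generation size. The sketch below is an interpretation, not something stated by the original analysis: the class and method names are hypothetical, and the 1 GB Old generation is the approximate figure quoted earlier.

```java
public class BufferPercentCheck {
    static final long MB = 1024L * 1024L;

    /** In-memory shuffle buffer: heap size times input.buffer.percent. */
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    /** True if the whole shuffle buffer could sit inside the Old generation. */
    static boolean bufferFitsOldGen(long heapBytes, double inputBufferPercent, long oldGenBytes) {
        return shuffleBufferBytes(heapBytes, inputBufferPercent) <= oldGenBytes;
    }

    public static void main(String[] args) {
        long heap = 1536L * MB;   // 1.5 GB heap from the article
        long oldGen = 1024L * MB; // "about 1 G" Old generation from the article

        for (double percent : new double[] {0.7, 0.6}) {
            System.out.println("percent=" + percent
                    + " buffer(MB)=" + shuffleBufferBytes(heap, percent) / MB
                    + " fits in Old gen: " + bufferFitsOldGen(heap, percent, oldGen));
        }
    }
}
```

With 0.7 the buffer can grow past the assumed 1 GB Old generation, while with 0.6 it stays below it. For jobs that parse generic options, the value can usually be passed at submission time as `-Dmapreduce.reduce.shuffle.input.buffer.percent=0.6`.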
