What to do when a shuffle file pull fails because of JVM GC


2026-02-13 Update From: SLTechnology News&Howtos shulou

Shulou (Shulou.com) 05/31 Report

This article explains what to do when pulling shuffle files fails because of JVM GC. The method introduced here is simple, fast and practical.

A very common error in Spark jobs is shuffle file not found. It is often intermittent: when it happens, Spark resubmits the stage and its tasks, and when they are executed again the error does not appear.

What does the log show?

Submit your Spark job in client mode, for example standalone client or yarn client. As soon as you submit the job, you can see the log being updated locally.

spark.shuffle.io.maxRetries 3 // When pulling a shuffle file fails, the pull is retried at most this many times. The default is 3.

spark.shuffle.io.retryWait 5s // The wait between retries of a pull. The default is 5 seconds.

Suppose the executor of the first stage is in the middle of a long full GC. An executor of the second stage tries to pull its output file, and the pull fails. With the defaults, the pull is retried three times, five seconds apart, so it waits at most 3 * 5s = 15s. If the shuffle file still cannot be pulled within 15s, shuffle file not found is reported.
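The timing above can be sketched as a small simulation. This is a simplified model, not Spark's actual fetch code: the function name `pull_with_retries` and the idea of a single fixed GC pause are assumptions made for illustration, and time is simulated rather than slept.

```python
def pull_with_retries(gc_pause_s, max_retries=3, retry_wait_s=5):
    """Simplified model of one shuffle pull against an executor stuck in GC.

    The remote executor cannot serve the file until `gc_pause_s` seconds
    have passed. Returns (succeeded, seconds_waited). Time is simulated,
    so nothing actually sleeps.
    """
    waited = 0
    # One initial attempt plus up to `max_retries` retries.
    for attempt in range(max_retries + 1):
        if waited >= gc_pause_s:
            return True, waited      # the executor is responsive again
        if attempt < max_retries:
            waited += retry_wait_s   # wait before the next retry
    return False, waited             # retries exhausted: shuffle file not found


print(pull_with_retries(gc_pause_s=8))    # GC ends inside the retry window
print(pull_with_retries(gc_pause_s=40))   # GC outlasts the 15 s window
```

In this model an 8 second pause is survived on a retry, while a 40 second pause uses up all three retries after 15 seconds of waiting and the pull fails.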

To handle this situation, tune the parameters in advance. Raise both of the parameters above to larger values, so that tasks in the second stage can almost always pull the output files of the previous stage and shuffle file not found is not reported. Otherwise the stage and its tasks may be resubmitted and executed again, which is bad for performance.

spark.shuffle.io.maxRetries 60

spark.shuffle.io.retryWait 60s
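One possible way to apply these two values is to pass them as `--conf` options when submitting in client mode. The sketch below only builds the command line. The helper `spark_submit_command` and the application file `my_job.py` are hypothetical names used for illustration, and the master and deploy mode shown are assumptions based on the yarn client setup mentioned earlier.

```python
import shlex


def spark_submit_command(app, master="yarn", deploy_mode="client", conf=None):
    """Build a spark-submit command line as a list of arguments."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]   # one --conf per setting
    args.append(app)
    return args


cmd = spark_submit_command(
    "my_job.py",  # hypothetical application file
    conf={
        "spark.shuffle.io.maxRetries": "60",
        "spark.shuffle.io.retryWait": "60s",
    },
)
print(shlex.join(cmd))  # the command as a single shell-safe string
```

The same two keys could instead be placed in the job's configuration file. The command line form is shown here only because it keeps the tuning visible next to the submission.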

With these values, a pull can tolerate failing for up to 1 hour (60 * 60s = 3600s). The point is simply to set a very large value: a full GC is very unlikely to last an hour, although nothing is absolute. This does as much as possible to avoid shuffle file not found errors caused by GC.
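The arithmetic behind the two settings can be checked with a few lines. This is a minimal sketch: `parse_seconds` and `max_retry_window_s` are hypothetical helpers, and the parser only handles the simple `ms`, `s`, `m` and `h` suffixes rather than every time format Spark accepts.

```python
def parse_seconds(value):
    """Convert a duration string such as '5s' or '60s' to seconds."""
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    # Check "ms" before "s" so that "500ms" is not read as "500m" + "s".
    for suffix in ("ms", "s", "m", "h"):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unsupported duration: {value!r}")


def max_retry_window_s(max_retries, retry_wait):
    """Longest total wait before a pull is given up: retries * wait."""
    return max_retries * parse_seconds(retry_wait)


default_window = max_retry_window_s(3, "5s")    # the default settings
tuned_window = max_retry_window_s(60, "60s")    # the tuned settings
print(default_window, "seconds with the defaults")
print(tuned_window / 3600, "hour(s) with the tuned values")
```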

At this point you should have a better understanding of what to do when a shuffle file pull fails because of JVM GC. Try it out in practice.
