How to Solve the JobTracker Heap OOM Problem

This article walks through diagnosing and fixing a java.lang.OutOfMemoryError in the Hadoop JobTracker heap, from the first web search to the root cause.
Introduction
Recently we tested a new channel, 9107. Whereas the earlier 9105 channel simply doubled the positive examples from the past 6 hours, 9107 weights the positive examples in the merged training data with an exponential function of their age relative to the current time.
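The article does not give the exact gain function, so the following is only an illustrative sketch of a typical exponential time-decay weight; the symbols t_now (current time), t_i (timestamp of positive example i), and the decay rate lambda > 0 are all assumptions, not taken from the original:

    w_i = e^{-\lambda (t_{\text{now}} - t_i)}

Under such a scheme, recent positive examples receive weights close to 1 while older ones decay smoothly toward 0, which matches the stated goal of boosting recent positives more gradually than a flat doubling.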
As for the implementation, for simplicity the ETL part of the experiment still used Hadoop for the merge processing, while the subsequent Sampling and Filtering steps were implemented as standalone Python programs. After the pipeline and its data passed testing, it went online, but the old cluster's JobTracker then kept reporting a heap OOM exception (java.lang.OutOfMemoryError: Java heap space) after running 70+ jobs.
Resolution Process
Googling the error

Someone on Stack Overflow suggested increasing the Xmx in HADOOP_CLIENT_OPTS, but trying this did not work.
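For reference, this kind of change is typically made in hadoop-env.sh. A minimal sketch of what was tried (the 2g value is illustrative; HADOOP_CLIENT_OPTS only affects client-side commands such as hadoop jar, which is presumably why it did not help a server-side JobTracker OOM):

    # hadoop-env.sh -- illustrative value; affects Hadoop *client* JVMs only
    export HADOOP_CLIENT_OPTS="-Xmx2g $HADOOP_CLIENT_OPTS"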
Noticing a java_pidxxxxx.hprof file, researching it, and finding Eclipse's MAT tool to analyze the suspected JobTracker problem
An hprof file is a binary heap dump that the JVM writes automatically when an OOM error occurs. This behavior is enabled by adding the parameter -XX:+HeapDumpOnOutOfMemoryError when starting the process.
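For a Hadoop 1.x JobTracker, the flag would typically go into hadoop-env.sh. A minimal sketch, assuming the HADOOP_JOBTRACKER_OPTS hook present in stock hadoop-env.sh (the dump path is illustrative):

    # hadoop-env.sh -- write a heap dump to /var/log/hadoop on OOM (path is illustrative)
    export HADOOP_JOBTRACKER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hadoop $HADOOP_JOBTRACKER_OPTS"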
MAT (Memory Analyzer Tool) analyzes hprof files and can be integrated into Eclipse through 'Install New Software'. One thing to note: when the generated hprof file is very large, you need to increase Eclipse's own startup Xmx accordingly, in the eclipse.ini file under the installation directory.
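A minimal sketch of the relevant eclipse.ini lines (the 4g value is illustrative; JVM options must appear after the -vmargs marker):

    -vmargs
    -Xms512m
    -Xmx4g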
Opening the generated hprof file in MAT lists several Problem Suspects; in this case, the suspect was JobTracker's excessive memory occupation causing the OOM.
However, JobTracker had been running stably for a long time and this problem rarely occurred, so we kept digging with MAT. Its 'dominator tree' view showed that most of JobTracker's memory was occupied by JobInProgress objects.
At this point the culprit was identified. But the same JobTracker had previously run thousands of jobs without this problem; why did it now appear after only 70+ jobs?
Reading Inside Hadoop Technology to understand JobTracker's main responsibilities
Google did not answer this well, so I turned to the JobTracker chapter of the book Inside Hadoop Technology. One thing in particular caught my attention: on startup, JobTracker launches several important threads and services, among them a retireJobsThread thread.
The retireJobsThread thread cleans up information about jobs that finished long ago (that is, their JobInProgress objects). JobInProgress objects are kept in memory so that queries about historical jobs can be answered, but because they consume a lot of memory, any job satisfying conditions a and b, or a and c, below is marked as expired (a config sketch for the two tunables follows the list):

a. The job has completed, i.e. its status is SUCCEEDED, FAILED, or KILLED.
b. The job completed more than 24 hours ago (adjustable via mapred.jobtracker.retirejob.interval).
c. The job's owner has more than 100 completed jobs (adjustable via mapred.jobtracker.completeuserjobs.maximum).
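A minimal mapred-site.xml sketch of how these two thresholds could be tuned (the values shown are the commonly documented Hadoop 1.x defaults, stated here as an assumption):

    <property>
      <name>mapred.jobtracker.retirejob.interval</name>
      <value>86400000</value> <!-- retire completed jobs after 24h (in ms) -->
    </property>
    <property>
      <name>mapred.jobtracker.completeuserjobs.maximum</name>
      <value>100</value> <!-- keep at most 100 completed jobs per user in memory -->
    </property>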
Clearly it is this job retire mechanism that prevents JobTracker's memory consumption from growing without bound. But if memory growth is bounded, why could JobTracker previously hold 100 JobInProgress objects and now it cannot? What is different between the 9105 and 9107 channels?
Then it hit me: perhaps each ETL job's retained state is so large that the current 2G heap can no longer hold 100 JobInProgress objects. So I raised HADOOP_HEAPSIZE in hadoop-env.sh from 2000 to 4000, and the problem disappeared. But why would an ETL job occupy more memory than a Sampling or Filtering job?
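The corresponding hadoop-env.sh change, as a minimal sketch (HADOOP_HEAPSIZE is specified in MB in Hadoop 1.x):

    # hadoop-env.sh -- raise the daemon heap from 2000 MB to 4000 MB
    export HADOOP_HEAPSIZE=4000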
Verifying with an actual test: generating an hprof file with jmap and checking the analysis against expectations
After the cluster had run 100+ jobs smoothly, I dumped a JobTracker hprof file with the jmap tool and analyzed it with MAT again. The analysis showed the following:
JobTracker's memory footprint stayed at around 1.7G.
JobInProgress occupies so much memory because each ETL job schedules far more TaskInProgress objects (typically 2K-3K) than a Sampling or Filtering job does, and each of those consumes memory.
With that, the problem was fully explained: because the 9107 channel retains only the ETL job from the 9105 pipeline (Sampling and Filtering having moved off-cluster), the accumulated ETL JobInProgress objects occupy far more memory than in the earlier 9105 experiment, which is what drove JobTracker into repeated OOM errors.
Summary
In truth, solving this problem involved some luck. In the final analysis, unfamiliarity with the underlying implementation of some of the Hadoop platform's basic components is what made the diagnosis slow and full of detours.
MAT is indeed a particularly powerful tool for JVM OOM errors and makes it easy to locate the root cause; it deserves more regular use going forward.
For a normally running Java process, the jmap -dump:format=b,file=xx pid command generates an hprof file for analyzing the process's detailed memory usage at that moment.
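A minimal usage sketch (the jps grep pattern and the output file name are illustrative):

    # Find the JobTracker's PID, then dump its heap in binary hprof format.
    JT_PID=$(jps | grep JobTracker | awk '{print $1}')
    jmap -dump:format=b,file=jobtracker.hprof "$JT_PID"
    # Open jobtracker.hprof in MAT to inspect the dominator tree.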