
Analysis of overload on Hadoop nodes


Recently we found that the load on the client machines of our Hadoop cluster often soars into the hundreds, making the machines slow to respond: users cannot submit jobs, or jobs run very slowly.

There are usually two ways to handle this situation. One is to add more client machines, put them into a pool, and switch between them automatically according to system load; this is essentially a load balancer, and we have already done it. The other is to find the root cause of the high load, because a load this high is very unusual and is typically caused by incorrect system parameters or a bug in the application.

Phenomenon

Using perf top to observe which code paths consume the most CPU time, we found that most of it was being spent in kernel functions from compaction.c.
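A minimal sketch of that observation step (exact flags may vary with your perf version; -g adds call-graph sampling, which is an assumption about how it was run):

$ sudo perf top -g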

You can capture a one-minute system-wide recording with the following command:

$ sudo perf record -a -g -F 1000 sleep 60

Here we use Brendan Gregg's FlameGraph tool to analyze the captured data.
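As a sketch of the workflow, assuming the FlameGraph scripts have been cloned from https://github.com/brendangregg/FlameGraph, the recorded perf data can be rendered as an SVG like this:

$ sudo perf script > out.perf
$ ./stackcollapse-perf.pl out.perf > out.folded
$ ./flamegraph.pl out.folded > flame.svg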

A quick Google search shows that compaction.c is kernel code related to Transparent Huge Pages (THP for short), a feature introduced in later releases of RedHat 6. It serves two purposes: one is to defragment physical memory so that an application can be given 2MB pages when it requests memory (normally pages are 4KB); the other is that memory allocated as huge pages cannot be swapped out to swap.
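You can check whether THP is currently enabled before changing anything. On RedHat 6 the files live under redhat_transparent_hugepage (on mainline kernels the path is /sys/kernel/mm/transparent_hugepage instead), and the bracketed value in the output is the one in effect:

$ cat /sys/kernel/mm/redhat_transparent_hugepage/enabled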

This feature certainly has its use cases, but it is not beneficial in all situations, so whether to turn it on is up to you.

It was clear that under such high system load, most of the CPU time was going into memory defragmentation, so we decisively disabled the feature:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Shortly after disabling THP we saw the effect: the load came down, and it went back up when we re-enabled the feature, which confirmed the diagnosis and completely solved the problem.
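Note that these sysfs writes do not survive a reboot. One common way to make the change persistent, shown here as a sketch assuming a RedHat 6 style /etc/rc.local, is to append the same commands there:

echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled' >> /etc/rc.local
echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag' >> /etc/rc.local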

Addendum

Here is the opposite scenario, one where you need to turn the THP feature on.

One day we found that an Oracle machine had nearly run out of memory, which was not the case under normal circumstances. Checking cat /proc/meminfo revealed the outrageous figure of 5GB in PageTables. Page tables map virtual memory addresses to physical memory addresses, and with 4KB pages there were simply too many entries. After turning on THP, PageTables dropped to a little over 100 MB.
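A quick way to watch this figure (the sample output below is illustrative, sized to match the 5GB reported above):

$ grep -i pagetables /proc/meminfo
PageTables:   5242880 kB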

In real-world scenarios, whether THP helps or hurts depends on the workload, so measure before deciding.
