HBase Full GC downtime

2025-07-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

We have recently been expanding the capacity of our HBase cluster, and not everything has gone smoothly:

1. One of the newly added machines restarted for no apparent reason. We handed that one straight over to the systems department.

2. After HDFS and HBase were deployed, startup went fine, but after one night the newly added HBase nodes had all gone down.

Stranger still, the nodes in the old cluster kept working normally. Only the few newly added nodes crashed, and HDFS was fine too (except on the machine that had rebooted).

So we went through all the logs.

The HBase log shows that the JVM paused for too long and could not communicate with ZooKeeper. ZooKeeper concluded that the node was down, so the RegionServer was shut down.

Was it really a Full GC? Why did the GC happen, and why did it pause the application? Why were the machines in the old cluster unaffected? Our understanding of GC was too shallow, the questions kept piling up, and there were no specific answers online, so we could only investigate, learn, and piece things together bit by bit.

2015-10-13 23:… WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 83095ms
GC pool 'ParNew' had collection(s): count=2 time=216ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=330ms
2015-10-13 23:… WARN [regionserver60020] util.Sleeper: We slept 85995ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-10-13 23:… INFO [regionserver60020-SendThread(zookeeper2:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 95659ms for sessionid 0x25053f6801406ac, closing socket connection and attempting reconnect
2015-10-13 23:… WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 89894ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-10-13 23:… WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 89894ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-10-13 23:… INFO [regionserver60020-SendThread(zookeeper3:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 89644ms for sessionid 0x1505ebc2da3010f, closing socket connection and attempting reconnect
2015-10-13 23:… FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server hregion151,60020,1444732375821: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing hregion151,60020,1444732375821 as dead server
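To make the chain of events in this log easier to see, here is a minimal Python sketch that pulls the reported pause lengths out of the messages and compares the longest one with a ZooKeeper session timeout. The timeout value is an assumption for illustration only, since the article does not say what the cluster was configured with; the function and constant names are likewise hypothetical.

```python
import re

# Assumed ZooKeeper session timeout in milliseconds. The article does not
# state the configured value; 40 s is only an illustrative placeholder.
ASSUMED_SESSION_TIMEOUT_MS = 40_000

# Message fragments taken from the log above; each captures a duration in ms.
PAUSE_PATTERNS = [
    re.compile(r"pause of approximately (\d+)ms"),
    re.compile(r"We slept (\d+)ms instead of \d+ms"),
    re.compile(r"have not heard from server in (\d+)ms"),
]


def find_pauses_ms(log_text):
    """Return every reported pause (ms) in order of appearance in the text."""
    hits = []
    for pattern in PAUSE_PATTERNS:
        for match in pattern.finditer(log_text):
            hits.append((match.start(), int(match.group(1))))
    return [ms for _, ms in sorted(hits)]


sample = (
    "Detected pause in JVM or host machine (eg GC): pause of approximately 83095ms\n"
    "We slept 85995ms instead of 3000ms, this is likely due to a long garbage collecting pause\n"
    "Client session timed out, have not heard from server in 95659ms for sessionid 0x25053f6801406ac\n"
)

pauses = find_pauses_ms(sample)
longest = max(pauses)
print("pauses (ms):", pauses)
print("longest pause exceeds assumed session timeout:",
      longest > ASSUMED_SESSION_TIMEOUT_MS)
```

Under that assumed timeout, a pause of more than 80 seconds is far longer than the session can survive, which matches the session expiry and the YouAreDeadException abort seen in the log.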

HBase's installation directory contains a gclog.0 file, which records the garbage collections that occur while HBase is running.

After checking it in various ways we still found no problems, perhaps because we did not understand GC well enough. So we looked things up and learned as we went.

2015-10-13T18:35:47.314+0800: 165.893: [GC [1 CMS-initial-mark: 32523K(63872K)] 35370K(83008K), 0.0185230 secs] [Times: user=0.01 sys=0.02, real=0.01 secs]
2015-10-13T18:35:47.333+0800: 165.912: [CMS-concurrent-mark-start]
2015-10-13T18:35:47.402+0800: 165.981: [CMS-concurrent-mark: 0.046/0.069 secs] [Times: user=0.32 sys=0.03, real=0.07 secs]
2015-10-13T18:35:47.402+0800: 165.982: [CMS-concurrent-preclean-start]
2015-10-13T18:35:47.411+0800: 165.990: [CMS-concurrent-preclean: 0.008/0.009 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2015-10-13T18:35:47.411+0800: 165.990: [CMS-concurrent-abortable-preclean-start]
2015-10-13T18:35:47.414+0800: 165.993: [GC 165.993: [ParNew: 18503K->2112K(19136K), 0.0681050 secs] 51027K->37708K(83008K), 0.0682600 secs] [Times: user=0.03 sys=0.07, real=0.06 secs]
2015-10-13T18:35:47.535+0800: 166.115: [CMS-concurrent-abortable-preclean: 0.028/0.124 secs] [Times: user=0.15 sys=0.09, real=0.13 secs]
2015-10-13T18:35:47.536+0800: 166.115: [GC [YG occupancy: 14168 K (19136 K)] 166.115: [Rescan (parallel), 0.0024340 secs] 166.117: [weak refs processing, 0.0001320 secs] [1 CMS-remark: 35596K(63872K)] 49765K(83008K), 0.0026970 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
2015-10-13T18:35:47.539+0800: 166.118: [CMS-concurrent-sweep-start]
2015-10-13T18:35:47.554+0800: 166.133: [CMS-concurrent-sweep: 0.014/0.015 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2015-10-13T18:35:47.554+0800: 166.133: [CMS-concurrent-reset-start]
2015-10-13T18:35:47.571+0800: 166.151: [GC 166.151: [ParNew: 19077K->2112K(19136K), 0.0028640 secs] 39044K->25755K(83008K), 0.0029990 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]

While reading up on JVM garbage collection, I came across the following passage:

For programs that use CMS for old-generation GC, pay special attention to whether promotion failed or concurrent mode failure appears in the GC log, because either condition may trigger a Full GC. Promotion failed occurs during a Minor GC when surviving objects do not fit in the survivor space and must be promoted to the old generation, but the old generation cannot accommodate them either.
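The check described in that passage can be sketched as a simple scan of GC log lines for the two marker strings. This is only an illustration: the function name is hypothetical, and the sample lines are shortened copies of entries from the gclog.0 file discussed in this article.

```python
# Marker strings that the passage above says to look for in a CMS GC log.
FAILURE_MARKERS = ("promotion failed", "concurrent mode failure")


def find_cms_failures(lines):
    """Return (line_number, marker) for each line containing a failure marker."""
    found = []
    for number, line in enumerate(lines, start=1):
        for marker in FAILURE_MARKERS:
            if marker in line:
                found.append((number, marker))
    return found


# Shortened sample entries in the format written by the JVM to gclog.0.
sample_lines = [
    "22958.996: [GC 22958.996: [ParNew (promotion failed): 19133K->19116K(19136K), 0.0752040 secs]",
    "23019.687: [GC 23019.687: [ParNew: 221568K->984K(249216K), 0.0259720 secs]",
]

hits = find_cms_failures(sample_lines)
for number, marker in hits:
    print(f"line {number}: {marker}")
```

Because the function accepts any iterable of strings, an open file handle for gclog.0 could be passed in directly instead of the sample list.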

Sure enough, a promotion failed entry appears at the end of HBase's gclog.0 file, as shown below.

2015-10-14T00:55:40.417+0800: 22958.996: [GC 22958.996: [ParNew (promotion failed): 19133K->19116K(19136K), 0.0752040 secs] 22959.071: [CMS: 24832559K->11737235K(28395128K), 54.6409350 secs] 24849259K->11737235K(28414264K), [CMS Perm: 48374K->48253K(80800K)], 54.7223900 secs] [Times: user=7.80 sys=1.13, real=54.72 secs]
2015-10-14T00:56:41.108+0800: 23019.687: [GC 23019.687: [ParNew: 221568K->984K(249216K), 0.0259720 secs] 11958803K->11746920K(28644352K), 0.0261620 secs] [Times: user=0.29 sys=0.02, real=0.02 secs]
Heap
 par new generation total 249216K, used 128841K [0x0000000124e00000, 0x0000000135c60000, 0x0000000135c60000)
  eden space 221568K, 53% used [0x0000000124e00000, 0x000000012c25d198, 0x0000000132660000)
  from space 27648K, 35% used [0x0000000134160000, 0x0000000134ad53b0, 0x0000000135c60000)
  to space 27648K, 0% used [0x0000000132660000, 0x0000000132660000, 0x0000000134160000)
 concurrent mark-sweep generation total 28395136K, used 11737235K [0x0000000135c60000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 80800K, used 48566K [0x00000007fae00000, 0x00000007ffce8000, 0x0000000800000000)
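To read the promotion failed entry more easily, the following Python sketch extracts the old-generation figures and the wall-clock pause from it. It is a rough illustration only: the regular expressions and the function name are hypothetical and are written just for the entry format shown here, not as a general GC log parser.

```python
import re

# The promotion failed entry from gclog.0, split across string literals.
ENTRY = (
    "2015-10-14T00:55:40.417+0800: 22958.996: [GC 22958.996: "
    "[ParNew (promotion failed): 19133K->19116K(19136K), 0.0752040 secs] "
    "22959.071: [CMS: 24832559K->11737235K(28395128K), 54.6409350 secs] "
    "24849259K->11737235K(28414264K), "
    "[CMS Perm: 48374K->48253K(80800K)], 54.7223900 secs] "
    "[Times: user=7.80 sys=1.13, real=54.72 secs]"
)

# Old generation: used before -> used after (capacity), collection time.
CMS_RE = re.compile(
    r"\[CMS:\s*(\d+)K\s*->\s*(\d+)K\s*\((\d+)K\),\s*([\d.]+) secs\]"
)
# Wall-clock time of the whole pause.
REAL_RE = re.compile(r"real=([\d.]+) secs", re.IGNORECASE)

KB_PER_GIB = 1024 * 1024


def summarize_full_gc(entry):
    """Extract old-generation sizes (GiB) and pause times (s) from one entry."""
    before_kb, after_kb, capacity_kb, cms_secs = CMS_RE.search(entry).groups()
    before_kb, after_kb, capacity_kb = int(before_kb), int(after_kb), int(capacity_kb)
    return {
        "old_gen_before_gib": before_kb / KB_PER_GIB,
        "old_gen_after_gib": after_kb / KB_PER_GIB,
        "old_gen_capacity_gib": capacity_kb / KB_PER_GIB,
        "old_gen_occupancy_before": before_kb / capacity_kb,
        "cms_collection_secs": float(cms_secs),
        "wall_clock_pause_secs": float(REAL_RE.search(entry).group(1)),
    }


summary = summarize_full_gc(ENTRY)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Read this way, the entry says the old generation was about 87% full (roughly 23.7 GiB of 27.1 GiB) when the young-generation collection failed to promote, and the collection that followed took about 54.7 seconds of wall-clock time and left roughly 11.2 GiB in use.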
