2025-04-07 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains how to troubleshoot and repair HBase. It is fairly detailed and should have some reference value; interested readers are encouraged to read on.
General criteria
Always start with the log of the master server. Usually it repeats the same information over and over; if it does not, something is wrong. You can Google the exceptions you encounter, or search for them on search-hadoop.com.
Errors rarely occur in isolation in HBase; usually something goes wrong somewhere and triggers a flood of exceptions and stack traces everywhere. The best approach to such an error is to walk back through the log and find the initial exception. For example, a region server prints some metric information as it exits; grepping that dump should reveal the original exception.
A region server killing itself is "normal": when something goes wrong, it aborts. If ulimit and xcievers have not been raised, HDFS stops working properly, and from HBase's point of view HDFS is dead. Imagine what MySQL would do if it suddenly lost access to its filesystem; the same thing happens to HBase on HDFS. Another common reason for a region server to abort is a long GC pause that exceeds the ZooKeeper session timeout.
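As a concrete illustration of the ulimit/xcievers point (the values below are commonly seen examples, not tuned recommendations): the DataNode transceiver limit lives in hdfs-site.xml, and the open-file limit is raised with ulimit (for example, ulimit -n 32768 set via /etc/security/limits.conf for the user running HBase and HDFS):

```xml
<!-- hdfs-site.xml: example value; note the property name really is
     spelled "xcievers" in the Hadoop versions this article targets -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```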
Logs
The locations of the important logs (&lt;user&gt; is the user who started the service, &lt;hostname&gt; is the name of the machine):
NameNode: $HADOOP_HOME/logs/hadoop-&lt;user&gt;-namenode-&lt;hostname&gt;.log
DataNode: $HADOOP_HOME/logs/hadoop-&lt;user&gt;-datanode-&lt;hostname&gt;.log
JobTracker: $HADOOP_HOME/logs/hadoop-&lt;user&gt;-jobtracker-&lt;hostname&gt;.log
TaskTracker: $HADOOP_HOME/logs/hadoop-&lt;user&gt;-tasktracker-&lt;hostname&gt;.log
HMaster: $HBASE_HOME/logs/hbase-&lt;user&gt;-master-&lt;hostname&gt;.log
RegionServer: $HBASE_HOME/logs/hbase-&lt;user&gt;-regionserver-&lt;hostname&gt;.log
ZooKeeper: TODO
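Given the locations above, a quick way to apply the "find the initial exception" advice is to grep each regionserver log for the first exception it records. A minimal sketch (the path argument is an example; adjust it to your install):

```shell
#!/bin/sh
# Print the first line mentioning an exception in each regionserver log.
# $1 (optional) overrides HBASE_HOME; logs are assumed to follow the
# hbase-<user>-regionserver-<hostname>.log naming shown above.
first_exception() {
  grep -m1 -n "Exception" "${1:-$HBASE_HOME}"/logs/hbase-*-regionserver-*.log
}
# Usage: first_exception /opt/hbase
```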
Log level
Enable RPC level logging
Enabling RPC-level logging on a RegionServer can often give insight into timings at the server. Once enabled, the amount of log produced is voluminous; it is not recommended to leave this logging on for more than short bursts of time. To enable RPC-level logging, browse to the RegionServer UI and click on Log Level. Set the log level to DEBUG for the package org.apache.hadoop.ipc (that's right: hadoop.ipc, NOT hbase.ipc). Then tail the RegionServer's log and analyze.
To disable, set the logging level back to INFO level.
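If you prefer setting this statically instead of through the UI (a restart is then required), the equivalent log4j.properties entry for the same package would be:

```properties
# DEBUG for hadoop.ipc, NOT hbase.ipc, as noted above
log4j.logger.org.apache.hadoop.ipc=DEBUG
```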
JVM garbage Collection Log
HBase is memory intensive, and with the default GC you can see long pauses in all threads, including the Juliet Pause, aka the "GC of Death". To help debug this, or to confirm it is happening, GC logging can be turned on in the Java virtual machine.
To enable, in hbase-env.sh add:
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hbase/logs/gc-hbase.log"
Adjust the log directory to wherever you log. Note: The GC log does NOT roll automatically, so you'll have to keep an eye on it so it doesn't fill up the disk.
At this point you should see logs like so:
64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64898.953: [CMS-concurrent-mark-start]
64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
In this section, the first line indicates a 0.0007360 second pause for the CMS to initially mark. This pauses the entire VM, all threads for that period of time.
The third line indicates a "minor GC", which pauses the VM for 0.0101110 seconds, aka about 10 milliseconds. It has reduced the "ParNew" space from about 5.5 MB to 576 KB. Later in this cycle we see:
64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs]
64901.445: [CMS-concurrent-preclean-start]
64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs]
64901.578: [CMS-concurrent-abortable-preclean-start]
64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2861230K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
64901.616: [GC [YG occupancy: 645 K (5568 K)] 64901.616: [Rescan (parallel), 0.0020210 secs] 64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
64901.621: [CMS-concurrent-sweep-start]
The first line indicates that the CMS concurrent mark (finding garbage) took 2.4 seconds. But these are concurrent seconds: Java was not paused at any point during them.
There are a few more minor GCs, then there is a pause at the second-to-last line:
64901.616: [GC [YG occupancy: 645 K (5568 K)] 64901.616: [Rescan (parallel), 0.0020210 secs] 64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
The pause here is 0.0049380 seconds (aka 4.9 milliseconds) to 'remark' the heap.
At this point the sweep starts, and you can watch the heap size go down:
64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
... lines removed ...
64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs]
At this point, the CMS sweep took 3.332 seconds, and the heap went from roughly 2.8 GB down to roughly 1.3 GB.
The key point here is to keep all these pauses low. CMS pauses are always low, but if your ParNew starts growing, you can see minor GC pauses approach 100 ms, exceed 100 ms, and reach as high as 400 ms.
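To spot such pauses without reading the whole log by eye, here is a small sketch that scans a GC log in the CMS format shown above and flags non-concurrent entries whose wall-clock (real=) time exceeds a threshold. The 0.1-second default and the log path in the usage line are examples:

```shell
#!/bin/sh
# List GC log entries whose "real=..." time exceeds a threshold (seconds).
# Concurrent CMS phases also report real time but are not pauses, so
# lines mentioning CMS-concurrent are skipped.
gc_pauses() {
  awk -v limit="${2:-0.1}" '
    !/CMS-concurrent/ && match($0, /real=[0-9]+\.[0-9]+/) {
      t = substr($0, RSTART + 5, RLENGTH - 5) + 0   # numeric pause in secs
      if (t > limit) print t " secs: " $0
    }' "$1"
}
# Usage: gc_pauses /home/hadoop/hbase/logs/gc-hbase.log 0.1
```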
This can be due to the size of ParNew, which should be relatively small. If your ParNew has grown very large after running HBase for a while (in one example a ParNew was about 150 MB), you may have to constrain its size: the larger it is, the longer collections take, but if it is too small, objects are promoted to the old generation too quickly. Below we constrain the new gen size to 64 MB.
Add this to HBASE_OPTS:
export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m"

jstack
jstack is one of the most important Java tools (besides looking at the logs); it shows what a specific Java process is doing. Use jps to find the process id, then run jstack against it. It displays the list of threads, in the order they were created, along with what each thread is doing.
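For example, a sketch assuming a JDK's jps and jstack are on the PATH (HRegionServer is the process name jps normally reports for a region server; the output path is arbitrary):

```shell
#!/bin/sh
# Find the HRegionServer pid via jps, then dump its thread stacks.
rs_pid() {
  jps 2>/dev/null | awk '/HRegionServer/ {print $1; exit}'
}
pid=$(rs_pid)
if [ -n "$pid" ]; then
  jstack "$pid" > /tmp/regionserver.jstack   # inspect this file
else
  echo "no HRegionServer process found"
fi
```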
ScannerTimeoutException or UnknownScannerException
This occurs when the RPC request from the client to the RegionServer times out. For example, if Scan.setCaching is set to 500, the scanner fetches 500 rows of data per RPC, once every 500 next() calls. Because the data is transferred to the client in large chunks, the transfer may exceed the timeout. Reducing the setCaching value is one solution, but setting it too low hurts performance.
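Alternatively, instead of lowering the caching value, the scanner timeout can be raised. A hedged example for hbase-site.xml (the property name depends on the HBase version: older releases use hbase.regionserver.lease.period, newer ones hbase.client.scanner.timeout.period; the 60000 ms value is only an illustration):

```xml
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>60000</value>
</property>
```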
Secure client cannot connect
(caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)). There can be several causes that produce this symptom.
First, check that you have a valid Kerberos ticket. One is required in order to set up communication with a secure Apache HBase cluster. Examine the ticket currently in the credential cache, if any, by running the klist command line utility. If no ticket is listed, you must obtain a ticket by running the kinit command with either a keytab specified, or by interactively entering a password for the desired principal.
Then, consult the Java Security Guide troubleshooting section. The most common problem addressed there is resolved by setting javax.security.auth.useSubjectCredsOnly system property value to false.
Because of a change in the format in which MIT Kerberos writes its credentials cache, there is a bug in Oracle JDK 6 Update 26 and earlier that causes Java to be unable to read a Kerberos credentials cache created by MIT Kerberos 1.8.1 or higher. If you have this problematic combination of components in your environment, work around it by first logging in with kinit and then immediately refreshing the credential cache with kinit -R. The refresh rewrites the credential cache without the problematic formatting.
Finally, depending on your Kerberos configuration, you may need to install the Java Cryptography Extension (JCE). Ensure the JCE jars are on the classpath on both server and client systems.
You may also need to download the unlimited strength JCE policy files. Uncompress and extract the downloaded file, and install the policy jars into /lib/security.
HBase hbck
1. Repair the hbase meta table (regenerate META from the .regioninfo files on HDFS):
hbase hbck -fixMeta
2. Reassign the regions in the meta table to regionservers (assign each region recorded in META to a regionserver):
hbase hbck -fixAssignments
When there are region holes:
hbase hbck -fixHdfsHoles (create a new region directory)
hbase hbck -fixMeta (regenerate the META record from .regioninfo)
hbase hbck -fixAssignments (assign the region to a regionserver)
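The three steps above can be sketched as a script. run_hbck here is a hypothetical wrapper that only prints the command by default (set DRY_RUN=0 to actually run hbck against a live cluster):

```shell
#!/bin/sh
# Hole-repair sequence: create the region dir, rebuild META, assign.
# DRY_RUN=1 (the default) only prints what would run.
run_hbck() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: hbase hbck $*"
  else
    hbase hbck "$@"
  fi
}
run_hbck -fixHdfsHoles     # create a new empty region directory
run_hbck -fixMeta          # regenerate the META record from .regioninfo
run_hbck -fixAssignments   # assign the region to a regionserver
```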
Duplicate region problem
View the regions in meta:
scan 'hbase:meta', {LIMIT => 10, FILTER => "PrefixFilter('INDEX_11')"}
Two duplicate regions were encountered during a data migration:
b0c8f08ffd7a96219f748ef14d7ad4f8, 73ab00eaa7bab7bc83f440549b9749a3
Delete the two duplicate regions from meta:
delete 'hbase:meta','INDEX_11,4380_2431,1429757926776.b0c8f08ffd7a96219f748ef14d7ad4f8.','info:regioninfo'
delete 'hbase:meta','INDEX_11,5479_0041431700000000040100004815E9,1429757926776.73ab00eaa7bab7bc83f440549b9749a3.','info:regioninfo'
Delete the two duplicate region directories on HDFS:
/hbase/data/default/INDEX_11/b0c8f08ffd7a96219f748ef14d7ad4f8
/hbase/data/default/INDEX_11/73ab00eaa7bab7bc83f440549b9749a3
Then restart the corresponding regionserver (just to refresh the region states reported to the HMaster).
Data loss is certain here: the data on the duplicate regions that were not online is lost.
New version of hbck
(1) Missing hbase.version file
Add the option -fixVersionFile.
(2) A region is neither in the META table nor on HDFS, but is in a regionserver's online region set
Add the option -fixAssignments.
(3) A region is in the META table and in a regionserver's online region set, but not on HDFS
Add the options -fixAssignments -fixMeta (-fixAssignments tells the regionserver to close the region; -fixMeta deletes the region's record from the META table).
(4) A region is not recorded in the META table and is not served by any regionserver, but is on HDFS
Add the options -fixMeta -fixAssignments (-fixAssignments assigns the region; -fixMeta adds the region's record to the META table).
(5) A region is not recorded in the META table, but is on HDFS and is served by a regionserver
Add the option -fixMeta to add the region's record to the META table; the region is first undeployed, then assigned.
(6) A region has a record in the META table, but is not on HDFS and is not served by any regionserver
Add the option -fixMeta to delete the record from the META table.
(7) A region is recorded in the META table and is on HDFS, the table is not disabled, but the region is not being served
Add the option -fixAssignments to assign the region.
(8) A region is recorded in the META table and is on HDFS, the table is disabled, but the region is served by some regionserver
Add the option -fixAssignments to undeploy the region.
(9) A region is recorded in the META table and is on HDFS, the table is not disabled, but the region is served by multiple regionservers
Add the option -fixAssignments, which notifies all those regionservers to close the region and then assigns it.
(10) A region is in the META table and on HDFS and should be served, but the regionserver recorded in the META table does not match the regionserver actually serving it
Add the option -fixAssignments.
(11) Region holes
Add -fixHdfsHoles to create a new empty region that fills the hole; this alone does not assign the region or add its record to the META table.
(12) A region has no .regioninfo file on HDFS
Add the option -fixHdfsOrphans.
(13) Region overlaps
Add the option -fixHdfsOverlaps.
Description:
(1) When repairing region holes, the -fixHdfsHoles option merely creates a new empty region to fill the gap; you also need -fixAssignments -fixMeta to complete the repair (-fixAssignments assigns the region, -fixMeta adds its record to the META table). Hence there is a combined option, -repairHoles, which is equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans.
(2) -fixAssignments fixes regions that are unassigned, assigned when they should not be, or assigned multiple times.
(3) -fixMeta: if a region is not on HDFS, delete its record from the META table; if it is on HDFS but missing from META, add its record to the META table.
(4) -repair turns on all repair options and is equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps.
The new version of hbck gathers a table's region information from (1) the HDFS directory, (2) META, and (3) the RegionServers, and uses that information to diagnose and repair.
These are all the contents of the article "How to troubleshoot and repair HBase". Thank you for reading! I hope the content shared is helpful to you.