This article explains the common problems encountered when operating HBase and how to diagnose them. The methods described here are simple, fast and practical, so interested readers may wish to follow along.
The main means of problem analysis
1. Monitoring system: the first stop, used to judge whether the system's metrics are normal and to establish its current state.
2. Server logs: show, for example, how a region has moved, what operations have taken place, and which client requests the server has processed.
3. GC logs: whether garbage collection is behaving normally.
4. Operating system logs and commands: at the OS level, whether the hardware is faulty and what its current status is.
5. BTrace: traces, in real time, the requests the server is currently handling and how they are processed.
6. Operations tools: functions built into the system for checking the server's real-time processing status.
Most of these methods exist in most systems, but each has its own use. Below, I work through these six means by way of common questions; as a small illustration of the first one, the sketch that follows pulls metrics straight from a RegionServer.
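The sketch below polls a RegionServer's built-in JMX servlet over HTTP. It is a minimal illustration and not part of the original article: the host name is a placeholder, 16030 is the default RegionServer info port on recent releases (older 0.9x versions used 60030), and the bean query string may need adjusting for your version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JmxMetricsProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical host; the qry parameter narrows the response to the
        // RegionServer "Server" metrics bean.
        String url = "http://rs-host.example.com:16030/jmx"
                + "?qry=Hadoop:service=HBase,name=RegionServer,sub=Server";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The servlet returns JSON; printing it is enough to eyeball counters such as
        // read/write request counts or queue lengths before wiring them into a monitor.
        System.out.println(response.body());
    }
}
```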
FAQ 1: why are individual requests slow?
Slow individual requests are the problem users encounter most often. The first step is to determine whether the cause lies with the client or the server, then analyze the server's state and capture the slow requests so they can be located precisely.
1. Look for patterns in slow requests in the client log, and try to determine the rowkey and operation type of each request on the client side (see the latency-logging sketch after this list).
2. Determine whether the slow requests are concentrated within a period of time; if so, refer to FAQ 2.
3. Check the server monitoring to see whether response times are stable and whether maxResponseTime shows a spike. If it does, the problem can tentatively be placed on the server side.
4. If client-side analysis is inconclusive, the rowkey and operation type of the slow requests can be captured on the server with the operations tools.
5. Determine which region the rowkey belongs to, and check for problems such as unreasonable table parameters (for example, too many versions kept, the wrong blockcache or bloomfilter type), too many storefiles, or a low cache hit rate.
6. Retry these requests, or analyze the HFiles directly, to see whether the returned result is too large and whether the request consumes too many resources.
7. Check HDFS monitoring and logs on the server, as well as the DataNode logs, to see whether HDFS block reads are slow or a disk has failed.
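The following sketch illustrates the client-side half of step 1. It is not from the original text: the table name, column family layout, rowkey and the 100 ms threshold are placeholders, and it assumes a reachable cluster configured via hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SlowGetProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {   // hypothetical table
            String rowkey = "user#12345";                              // hypothetical rowkey
            long start = System.nanoTime();
            Result result = table.get(new Get(Bytes.toBytes(rowkey)));
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Log rowkey, operation type and latency so slow requests can later be
            // matched against server-side logs and monitoring.
            if (elapsedMs > 100) {
                System.out.printf("SLOW GET rowkey=%s latencyMs=%d cells=%d%n",
                        rowkey, elapsedMs, result.size());
            }
        }
    }
}
```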
FAQ 2: why are there a large number of errors in client read and write requests?
Read and write errors fall into two main types: 1. a flood of server exceptions; 2. a flood of timeouts. The first comes with exception messages, which makes the problem easier to judge.
1. A large number of server exceptions is usually caused by regions not being online: a region may be taking longer than expected to split, or corrupt meta data may cause the client to look up the wrong region location. Both can be confirmed from the logs.
2. With a flood of timeouts, first rule out a full GC or an overly long young GC on the server. The former is often caused by memory fragmentation and CMS starting too late; the latter usually means the system has started using swap.
3. Use system commands and logs to check for machines with excessive load, heavy disk pressure or failed disks.
4. Check the monitoring for a callQueue backlog, which means requests are not being processed in time; the calls being handled and the process stacks can be inspected further with the call viewer or jstack.
5. Use the DataNode logs and the time HBase spends accessing DFS to determine whether the problem lies in the HDFS layer.
6. Check the monitoring for blocked updates and whether the memstore is close to the upper limit configured for the system. (A sketch of the client-side timeout settings worth reviewing in this situation follows the list.)
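When a burst of timeouts hits, the client-side timeout and retry settings are worth reviewing alongside the server checks above. The sketch below is illustrative only: the property names are standard HBase client keys, but the values are placeholders, not recommendations, and defaults vary across versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientTimeoutConfig {
    public static Configuration build() {
        Configuration conf = HBaseConfiguration.create();
        // How long a single RPC may block before it fails.
        conf.setInt("hbase.rpc.timeout", 60_000);
        // Overall budget for one operation, including all retries.
        conf.setInt("hbase.client.operation.timeout", 120_000);
        // How many times the client retries before surfacing the error.
        conf.setInt("hbase.client.retries.number", 10);
        return conf;
    }
}
```

Passing the returned Configuration to ConnectionFactory.createConnection makes every table created from that connection pick up the same settings.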
FAQ 3: why is the system getting slower and slower?
The system used to be fast, so why does it keep getting slower? Most such cases are caused by unreasonable server configuration and can be analyzed along the following lines.
1. Whether disk I/O and system load are higher than before, to form a first guess at why the system has slowed down.
2. If disk I/O has intensified, focus on whether flushes are too small and compactions too frequent, and in particular whether major compactions are really necessary. Test results show that the disk I/O generated by compactions has a large impact on system performance.
3. Whether the number of storefiles in a single region has multiplied.
4. Whether the cache hit rate is trending downward.
5. Whether uneven distribution of regions across regionservers has concentrated reads and writes on a few servers, or whether the read and write handlers are competing with each other.
6. Whether data block locality has dropped.
7. Whether any DataNode is misbehaving; check whether block read times on individual machines are conspicuously high. (A sketch that walks region-level metrics for points 3 and 6 follows the list.)
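For points 3 and 6, the sketch below walks every online region and flags a high storefile count or low data-block locality. It is a rough sketch against the HBase 2.x Admin API (method names differ in older releases), and the thresholds are arbitrary.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RegionHealthScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            for (ServerName server : admin.getClusterMetrics().getServersName()) {
                for (RegionMetrics region : admin.getRegionMetrics(server)) {
                    int storeFiles = region.getStoreFileCount();
                    float locality = region.getDataLocality();
                    // Arbitrary thresholds: tune them to your own baseline.
                    if (storeFiles > 20 || locality < 0.5f) {
                        System.out.printf("%s on %s: storefiles=%d locality=%.2f%n",
                                region.getNameAsString(), server, storeFiles, locality);
                    }
                }
            }
        }
    }
}
```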
FAQ 4: why is the data gone when it was clearly written?
Data loss is also a common HBase problem and falls into two categories: temporary and permanent. Temporary loss is usually a momentary read error caused by a correctness issue in HBase itself; permanent loss is usually caused by a log-recovery bug or a region being assigned twice.
1. First, check through hbck or the master logs whether the region holding the lost data has been double-assigned.
2. Whether any regionserver in the cluster has aborted, and whether its logs were replayed correctly.
3. Scan the storefiles to determine what data is actually present (see the raw-scan sketch after this list).
4. Scan the files under logs or oldlogs to determine whether and when the data was written, and cross-reference the regionserver log to reconstruct what the server was doing at the time.
5. Based on when the data was written, determine whether the regionserver completed the flush correctly and persisted the data to disk.
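One way to approach step 3 from the client, rather than dumping HFiles directly, is a raw scan that exposes delete markers and every cell version. The sketch below is an illustration against the HBase 2.x client API; the table name and row range are placeholders.

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RawRowInspector {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t1"))) {     // hypothetical table
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user#12345"))           // hypothetical range
                    .withStopRow(Bytes.toBytes("user#12346"))
                    .setRaw(true)            // keep delete markers visible
                    .readAllVersions();      // return every version, not just the latest
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    for (Cell cell : row.rawCells()) {
                        // Timestamps and cell types show whether the data was written,
                        // deleted or overwritten, and when.
                        System.out.printf("%s %s:%s ts=%d type=%s%n",
                                Bytes.toString(CellUtil.cloneRow(cell)),
                                Bytes.toString(CellUtil.cloneFamily(cell)),
                                Bytes.toString(CellUtil.cloneQualifier(cell)),
                                cell.getTimestamp(),
                                cell.getType());
                    }
                }
            }
        }
    }
}
```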
FAQ 5: why did a server process fail?
There are many scenarios in which a regionserver aborts. Apart from those caused by system bugs, the ones most commonly encountered online are ZooKeeper node timeouts caused by full GC, and file system exceptions.
1. Search the regionserver log for FATAL entries to determine the type of exception (a trivial log-scanning sketch follows the list).
2. Check the GC log to see whether a full GC or young GC took too long.
3. If the log stops abruptly without warning, first consider whether an OOM occurred (version 0.94 simply kill -9s the process).
4. System memory monitoring can show whether the machine's memory has been exhausted.
5. Check the DataNode for exception logs; a regionserver may abort because of a file system exception while rolling the log or flushing.
6. Rule out a manual stop.
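A trivial sketch of step 1, again not from the original article: it walks a local RegionServer log (the path and the match strings are assumptions) and surfaces the lines that usually classify an abort.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class AbortLogScan {
    public static void main(String[] args) throws IOException {
        Path log = Path.of("/var/log/hbase/hbase-regionserver.log");  // hypothetical path
        try (Stream<String> lines = Files.lines(log)) {
            lines.filter(l -> l.contains("FATAL")
                           || l.contains("OutOfMemoryError")
                           || l.contains("Session expired"))          // zk timeout symptom
                 .forEach(System.out::println);
        }
    }
}
```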
HBase health check
Whether a cluster looks healthy can be judged from the following aspects:
1. Whether the number of storefiles in a single region is reasonable.
2. Whether the memstore is used sensibly; this is related to the number and size of HLogs.
3. Whether the ratio of compaction traffic to flush traffic is reasonable; flushing only a gigabyte a day while compacting dozens of gigabytes is an obvious waste.
4. Whether splits are too frequent, and whether pre-splitting can be used to allocate regions in advance (see the pre-split sketch after this list).
5. Whether the cluster has too many regions; by default zk cannot support more than 120,000 regions, and too many regions also lengthens regionserver failover.
6. Whether read and write response times are reasonable, and whether data block read latency meets expectations.
7. Whether the flush queue, callQueue length and compaction queue meet expectations; a backlog in the first two causes system instability.
8. FailedRequest and maxResponseTime
9. GC behavior; excessive young GC or excessive CMS GC both warrant vigilance.
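For point 4, pre-splitting is done at table-creation time by supplying split keys. The sketch below uses the HBase 2.x Admin API (the builder classes differ in older releases); the table name, column family and split keys are placeholders chosen for a key space that starts with a digit.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Four split keys produce five regions up front, so early writes are
            // spread out instead of triggering a burst of splits on one region.
            byte[][] splitKeys = {
                    Bytes.toBytes("2"), Bytes.toBytes("4"),
                    Bytes.toBytes("6"), Bytes.toBytes("8")
            };
            admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("t_presplit"))
                            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f"))
                            .build(),
                    splitKeys);
        }
    }
}
```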
Operation and maintenance tools
The operability and maintainability of stock HBase is indeed quite poor. To keep the online system as safe as possible and to locate the cause of faults quickly, Ali has done a lot of constructive work.
1. A complete monitoring system has been built, with many monitoring points added based on day-to-day testing and online operating experience.
2. Monitoring granularity down to the region level.
3. Call dump and online slow-request tracing.
4. A BTrace script system; when a problem arises, the scripts can be run directly to inspect the program's internal state.
5. Log collection and alerting.
6. Online table maintenance tools, plus storefile and log analysis tools.
At this point, you should have a deeper understanding of the common problems in HBase; the best way to consolidate it is to try these steps in practice.