HBase in Practice: Read Performance Optimization Strategies


Any system runs into problems of one kind or another: some stem from the design of the system itself, others from the way it is used. HBase is no different. On real production clusters we inevitably hit a variety of issues; some point to areas where HBase still needs improvement, others simply reflect how little we know about it. Summing up, the main problems encountered are downtime caused by Full GC, RIT (Region-In-Transition) problems, low write throughput, and high read latency.

The Full GC problem has been discussed in earlier articles. At present there are two main points to pay attention to. On the one hand, check the GC logs to determine which kind of Full GC occurred and tune the JVM parameters according to that type. On the other hand, check whether the offheap mode of BucketCache is enabled; users still on LRUBlockCache are advised to switch to BucketCache as soon as possible. Of course, we are still looking forward to the additional offheap modules in the official 2.0.0 release.

As for RIT problems, I believe they occur mostly because we do not understand the mechanism well; the underlying principle has been covered elsewhere. There are currently two remedies: the preferred one is to use the official HBCK tool to repair the problem (I have long wanted to write about HBCK itself, but do not yet have enough cases; perhaps later); otherwise the files or the metadata table have to be repaired manually. As for write throughput being too low and read latency being too high, the author has discussed these with many peers as well. This article focuses on read latency optimization: it walks through the common techniques for HBase read latency optimization and the principles behind them. Hopefully, after reading it, you will be able to apply these techniques to analyze your own system.

In general, there are three scenarios in which read request latency is high:

1. Only one business has high latency, while all other businesses in the cluster are normal.

2. All businesses across the whole cluster show high latency.

3. After a certain business starts, other businesses in the cluster begin to show high latency.

These three scenarios are only symptoms. When a business reports abnormal latency, first determine which scenario applies, then troubleshoot accordingly. The figure below summarizes the approach to read optimization, which falls into four areas: client optimization, server-side optimization, column family design optimization, and HDFS-related optimization. Each point below is also classified by scenario, and the classification is summarized at the end of the article. The details follow:

HBase read optimization

HBase client optimization

Like most systems, the client, as the entry point for business reads and writes, can be used incorrectly in ways that usually cause high read latency for that business. There are some recommended usage patterns; here are four issues to pay attention to:

1. Is the scan cache set properly?

Optimization principle: before answering this question, we first need to explain what scan caching is. Generally speaking, a scan returns a large amount of data, so when a client issues a scan request, the data is not all loaded locally at once but is spread over multiple RPC requests. This design exists, on the one hand, because returning a huge amount of data in one request could consume a lot of network bandwidth and affect other businesses; on the other hand, the client could run out of memory (OOM) because of the data volume. Under this design, the client loads part of the data locally, iterates over it, loads the next part, and so on until all the data has been processed. The data loaded locally is kept in the scan cache, which holds 100 rows by default.

In general, the default scan cache setting works fine. But for large scans (a single scan may need to read tens or even hundreds of thousands of rows), 100 rows per request means a scan needs hundreds or even thousands of RPC requests, which is clearly very expensive. In that case, consider increasing the scan cache to, say, 500 or 1000. The author has run an experiment before: with a scan of 100,000+ rows, raising the scan cache from 100 to 1000 effectively reduced the overall scan latency, by roughly 25%.

Optimization suggestion: in large scan scenarios, increase the scan cache from the default 100 to 500 or 1000 to reduce the number of RPC requests.
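For illustration, here is a minimal sketch using the HBase Java client (the table name "my_table" and the row handling are hypothetical) that raises the scan cache for a large scan:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LargeScanExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) { // hypothetical table
                Scan scan = new Scan();
                // Raise the scan cache to 500 rows per RPC to cut the number of
                // client <-> RegionServer round trips on a large scan.
                scan.setCaching(500);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result result : scanner) {
                        System.out.println(Bytes.toString(result.getRow())); // placeholder row handling
                    }
                }
            }
        }
    }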

2. Can get requests be batched?

Optimization principle: HBase provides separate API interfaces for single get and batch get. Using the batch get interface reduces the number of RPC round trips between the client and the RegionServer and improves read performance. Note, however, that a batch get request either returns all of the requested data successfully or throws an exception.

Optimization recommendation: use batch get for read requests.
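A minimal sketch of a batch get with the HBase Java client, assuming an existing Connection and a hypothetical table "my_table":

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchGetExample {
        // One batch call instead of one RPC per row key.
        public static Result[] fetchRows(Connection conn, List<String> rowKeys) throws Exception {
            try (Table table = conn.getTable(TableName.valueOf("my_table"))) { // hypothetical table
                List<Get> gets = new ArrayList<>();
                for (String key : rowKeys) {
                    gets.add(new Get(Bytes.toBytes(key)));
                }
                // Note: if any sub-request fails, the whole call throws an exception.
                return table.get(gets);
            }
        }
    }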

3. Does the request explicitly specify the column family or column?

Optimization principle: HBase is a typical column-family database, which means the data of the same column family is stored together while different column families are stored in separate directories. If a table has multiple column families and a lookup is done by Rowkey alone without specifying a column family, each column family has to be retrieved independently; the performance is inevitably much worse than a lookup that specifies the column family, often by a factor of 2 to 3.

Optimization suggestion: specify the exact column family or column in the lookup whenever possible.
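A short sketch showing how a get or scan is narrowed to a specific column family or column; the family "cf1", qualifier "q1" and row key are hypothetical:

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SpecifyColumnsExample {
        public static Get narrowGet() {
            Get get = new Get(Bytes.toBytes("rowkey-0001"));          // hypothetical row key
            get.addFamily(Bytes.toBytes("cf1"));                      // read only column family cf1
            get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1")); // or narrow further to cf1:q1
            return get;
        }

        public static Scan narrowScan() {
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1")); // scan only cf1:q1
            return scan;
        }
    }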

4. Do offline bulk read requests disable block caching?

Optimization principle: an offline bulk read is usually a one-off full-table scan; the data volume is large, and the request runs only once. If the default scan settings are used in this scenario, the data read from HDFS is also put into the BlockCache. Predictably, this flood of data pushes the hot data of other real-time businesses out of the cache, forcing those businesses to reload from HDFS and causing visible read latency spikes.

Optimization recommendation: disable block caching for offline bulk read requests: scan.setCacheBlocks(false).
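A minimal sketch of the scan settings for such an offline job (the caching value is just an example):

    import org.apache.hadoop.hbase.client.Scan;

    public class OfflineScanSettings {
        public static Scan offlineScan() {
            Scan scan = new Scan();
            scan.setCaching(1000);      // large scan cache: fewer RPCs for a bulk scan (example value)
            scan.setCacheBlocks(false); // do not evict other businesses' hot data from the BlockCache
            return scan;
        }
    }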

HBase server-side optimization

In general, when a server-side problem causes high read latency, it is usually cluster-wide, that is, every business on the cluster shows high read latency. We can start from four aspects:

1. Are read requests balanced?

Optimization principle: in the extreme case where all read requests land on one Region of a single RegionServer, on the one hand the concurrency of the whole cluster cannot be exploited, and on the other hand that RegionServer's resources are bound to be heavily consumed (IO exhaustion, handler exhaustion, and so on), so every other business running on that RegionServer suffers as well. Clearly, unbalanced read requests not only hurt the business in question but also seriously affect other businesses. Unbalanced write requests cause similar problems, which is why load imbalance is a cardinal sin in HBase.

Observe and confirm: observe the QPS curves of all RegionServer read requests to confirm whether there is an imbalance in read requests.

Optimization recommendation: the RowKey must be hashed (for example with an MD5 salt), and the table must be pre-split.
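A sketch of both techniques with the HBase 2.x Java client: salting the RowKey with an MD5 prefix and pre-splitting the table on that prefix (the table name, column family and 4-character prefix length are assumptions):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.MD5Hash;

    public class SaltedTableExample {
        // Salt the business key with the first 4 hex chars of its MD5 so rows spread evenly.
        public static byte[] saltedRowKey(String businessKey) {
            String prefix = MD5Hash.getMD5AsHex(Bytes.toBytes(businessKey)).substring(0, 4);
            return Bytes.toBytes(prefix + "_" + businessKey);
        }

        // Pre-split into 16 regions matching the first hex character of the salt.
        public static void createPreSplitTable(Admin admin) throws Exception {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("my_table"))            // hypothetical table
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                    .build();
            String hex = "123456789abcdef";
            byte[][] splits = new byte[hex.length()][];
            for (int i = 0; i < hex.length(); i++) {
                splits[i] = Bytes.toBytes(String.valueOf(hex.charAt(i)));
            }
            admin.createTable(desc, splits);
        }
    }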

2. Is the BlockCache set up properly?

Optimization principle: as the read cache, BlockCache is critical to read performance. By default the configuration splits memory relatively evenly between BlockCache and Memstore (40% each), and this can be adjusted to the cluster's workload, for example by increasing the BlockCache share for read-heavy, write-light businesses. The choice of BlockCache policy also matters: different policies do not differ much in read performance, but they differ greatly in GC behavior, and the offheap mode of BucketCache in particular shows excellent GC performance. In addition, the offheap work in HBase 2.0 (HBASE-11425) is expected to improve HBase read performance by 2 to 4 times while making GC behave even better!

Observe and confirm: check the cache miss rate of all RegionServers, the related items in the configuration file, and the GC logs to confirm whether the BlockCache can be further optimized.

Optimization suggestion: if the JVM memory allocated to HBase is less than about 20G, choose the LRUBlockCache policy; otherwise choose the offheap mode of BucketCache. And look forward to the arrival of HBase 2.0!
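For reference, the relevant keys are listed below through the Java Configuration API; on a real cluster they belong in hbase-site.xml on the RegionServers, and the values shown are only examples to be tuned per cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BlockCacheConfigSketch {
        // Keys normally set in hbase-site.xml; shown here only to name them.
        public static Configuration offheapBucketCache() {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.bucketcache.ioengine", "offheap"); // enable BucketCache in offheap mode
            conf.set("hbase.bucketcache.size", "16384");       // offheap cache size in MB (example value)
            conf.set("hfile.block.cache.size", "0.2");         // heap fraction left to the on-heap LRU cache (example)
            return conf;
        }
    }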

3. Are there too many HFile files?

Optimization principle: an HBase read first looks in the Memstore and the BlockCache (recently written data and hot data) and falls back to the HFiles on disk if nothing is found. Because of HBase's LSM-like structure, each store may contain multiple HFile files, and the more files there are, the more IO operations a lookup needs and the higher the read latency. The number of files is usually governed by the Compaction strategy and is mainly related to two configuration parameters: hbase.hstore.compactionThreshold and hbase.hstore.compaction.max.size. The former is the number of files in a store beyond which a compaction is triggered, and the latter is the maximum size of a file allowed to take part in a compaction; files larger than that are excluded from merging. These two parameters must not be set too loosely (the former not too large, the latter not too small), otherwise compactions have little effect and many files never get merged, leaving a large number of HFile files behind.

Observe and confirm: check the number of storefiles at the RegionServer level and the Region level to confirm whether there are too many HFile files.

Optimization suggestion: hbase.hstore.compactionThreshold must not be set too large; the default is 3. The concrete value should be chosen according to the Region size; as a rule of thumb, hbase.hstore.compaction.max.size = RegionSize / hbase.hstore.compactionThreshold.

4. Does Compaction consume too many system resources?

Optimization principle: Compaction merges small files into larger ones to improve the random read performance of subsequent requests, but it also brings IO amplification and bandwidth consumption (remote reads and three-replica writes both consume cluster bandwidth). Under a normal configuration, Minor Compaction does not consume much of the system's resources, unless an unreasonable configuration makes Minor Compactions too frequent, or the Region size is set too large and triggers heavyweight Major Compactions.

Observe and confirm: check the usage of system IO and bandwidth, then check the length of the Compaction queue, to confirm whether Compaction is consuming too many system resources.

Optimization recommendations:

(1) Minor Compaction settings: hbase.hstore.compactionThreshold should be neither too small nor too large; a value of 5 to 6 is recommended, with hbase.hstore.compaction.max.size = RegionSize / hbase.hstore.compactionThreshold.

(2) Major Compaction settings: for latency-sensitive read businesses with large Regions (above 100G), automatic Major Compaction is not recommended; trigger it manually during off-peak hours instead (see the sketch after this list). For small Regions or latency-insensitive businesses, Major Compaction can be left enabled, but it is recommended to throttle it.

(3) Look forward to more capable Compaction strategies, such as stripe compaction, providing stable service as soon as possible.
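A sketch of the settings above plus a manually triggered Major Compaction (HBase Java client; the keys normally live in hbase-site.xml, and the table name and values are examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CompactionTuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.hstore.compactionThreshold", "6");          // compact once a store holds more files than this
            conf.set("hbase.hstore.compaction.max.size", "2147483648"); // e.g. RegionSize / threshold, in bytes (example)
            conf.set("hbase.hregion.majorcompaction", "0");             // disable periodic automatic Major Compaction

            // Trigger a Major Compaction by hand during an off-peak window instead.
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                admin.majorCompact(TableName.valueOf("my_table"));      // hypothetical table
            }
        }
    }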

HBase column family design optimization

HBase column family design is also critical to read performance. Its characteristic is that it affects only a single business and does not have much impact on the whole cluster. Column family design is mainly checked from the following aspects:

1. Is Bloomfilter set? Is the setting reasonable?

Optimization principle: a Bloomfilter is used to filter out HFile files that cannot contain the RowKey or Row-Col being looked up, avoiding useless IO. It tells you whether the KV being looked up may exist in a given HFile file; if not, the file does not need to be opened and seeked at all, so no IO is spent on it. Clearly, setting a Bloomfilter improves random read performance.

Bloomfilter has two possible values, row and rowcol, and the choice depends on the business. If most random queries use only the row key as the lookup condition, Bloomfilter should be set to row; if most random queries use row + column as the lookup condition, it should be set to rowcol. If you are not sure about the query pattern, set it to row.

Optimization suggestion: a Bloomfilter should be set for every business, usually to row, unless you have confirmed that the business's random queries are of the row + column type, in which case it can be set to rowcol.
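A minimal sketch of declaring the Bloomfilter type on a column family with the HBase 2.x Java client (table and family names are hypothetical):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.regionserver.BloomType;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BloomFilterSketch {
        public static void createTableWithBloom(Admin admin) throws Exception {
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("my_table"))          // hypothetical table
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                            .newBuilder(Bytes.toBytes("cf1"))
                            .setBloomFilterType(BloomType.ROW)          // or BloomType.ROWCOL for row+column lookups
                            .build())
                    .build());
        }
    }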

HDFS related optimization

As the storage system that ultimately holds HBase's data, HDFS usually stores HBase data files and log files with a three-replica policy. From HDFS's point of view, HBase is a client: HBase reads and writes data through the HDFS client it embeds, so HDFS optimization also affects HBase read and write performance. Here we focus on the following three aspects:

1. Is Short-Circuit Local Read enabled?

Optimization principle: by default, an HDFS read goes through the DataNode: the client sends a read request to the DataNode, which reads the file from disk and sends the data back to the client over TCP. The Short-Circuit policy allows the client to bypass the DataNode and read local data directly.

Optimization suggestion: enable the Short-Circuit Local Read feature; see the HDFS documentation for the specific configuration.
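For reference, a sketch of the two HDFS client keys involved, listed via the Hadoop Configuration API; on a real cluster they are set in hdfs-site.xml on the DataNodes and the HBase nodes, and the socket path is only an example:

    import org.apache.hadoop.conf.Configuration;

    public class ShortCircuitReadSketch {
        // Keys normally set in hdfs-site.xml; shown here only to name them.
        public static Configuration enableShortCircuit() {
            Configuration conf = new Configuration();
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket"); // example path
            return conf;
        }
    }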

2. Is Hedged Read enabled?

Optimization principle: HBase data is generally stored in HDFS with three replicas, and with Short-Circuit Local Read enabled the local replica is tried first. In some special cases, however, a local read can fail for a short time because of disk or network problems. To handle this, community developers proposed a compensating retry mechanism called Hedged Read. The basic idea is that the client first issues a local read; if it has not returned after a certain time, the client sends a request for the same data to another DataNode. Whichever request returns first wins, and the other is discarded.

Optimization suggestion: turn on the Hedged Read function.
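A sketch of the two HDFS client keys that control hedged reads, again shown through the Configuration API (the values are examples; they are normally set in the HBase/HDFS client configuration):

    import org.apache.hadoop.conf.Configuration;

    public class HedgedReadSketch {
        public static Configuration enableHedgedReads() {
            Configuration conf = new Configuration();
            conf.setInt("dfs.client.hedged.read.threadpool.size", 20);   // a value > 0 turns the feature on (example)
            conf.setLong("dfs.client.hedged.read.threshold.millis", 50); // how long to wait before hedging (example)
            return conf;
        }
    }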

3. Is the data locality rate too low?

Data locality rate: HDFS stores three replicas of each block. Suppose RegionA currently lives on Node1: when block a is written, its three replicas go to (Node1, Node2, Node3); block b's go to (Node1, Node4, Node5); and block c's go to (Node1, Node3, Node5). Every block written has one replica on the local Node1, so all the data can be read locally and the data locality rate is 100%. Now suppose RegionA is migrated to Node2: only block a has a replica on that node, and the other blocks (b and c) can only be read remotely across the network, so the locality rate drops to 33% (assuming a, b and c are the same size).

Optimization principle: a low data locality rate obviously generates a large number of cross-network IO requests and therefore high read latency, so raising it effectively improves random read performance. A low locality rate is usually caused by Region migration (automatic balancing, RegionServer failover, manual moves, and so on). It can be maintained by avoiding unnecessary Region migration; if the rate is already very low, running a major_compact can bring it back to 100%.

Optimization suggestions: avoid unnecessary Region migration, for example by turning off automatic balancing and by bringing a crashed RegionServer back up promptly and moving its drifted Regions back; and run major_compact during off-peak hours to raise the data locality rate.

HBase read performance optimization recap

At the beginning of this article, three common manifestations of high read latency were mentioned: a single business reads slowly, random reads across the whole cluster are slow, and after some business starts, the random read latency of other businesses rises sharply. Having gone through the common problems that can cause high read latency, we classify them accordingly below, so that readers can map the symptom they see to the corresponding list of checks:

Summary of HBase read performance Optimization

Performance optimization is a topic every system runs into, and every system has its own optimization angles. As a distributed KV database, HBase has quite different optimization points, combining characteristics of distributed systems and of storage systems. This article has summarized the basic angles of attack for read optimization; if anything is wrong, you are welcome to discuss and exchange views.

Conclusion

Thank you for reading. If there are any deficiencies, you are welcome to criticize and correct them.
