How to understand HBase1.x read cache BlockCache

2025-02-24 Update From: SLTechnology News&Howtos


This article is about how to understand the HBase 1.x read cache, BlockCache. I think it is very practical, so I am sharing it with you. I hope you get something out of it after reading. Let's take a look.

1. Overview

Caching is very important for any database. If conditions permitted, we would rather cache all the data in memory, eliminating disk IO entirely, but with big data that is almost impossible. According to the 80/20 rule, 80% of our business requests concentrate on 20% of the data; if we cache that 20% of the data in memory, the performance of the database will be greatly improved.

The memory of a RegionServer in HBase is divided into two parts: one serves as the Memstore, used mainly for writing, and the other serves as the BlockCache, used mainly for reading.

1) A write request is first written to the Memstore; the RegionServer provides one Memstore per HStore. When a Memstore fills to 128 MB, all the Memstores in the current HRegion are flushed to HDFS.

When the total size of all Memstores on a RegionServer exceeds hbase.regionserver.global.memstore.upperLimit (default 40% of heap), all Memstores of the HRegions on that HRegionServer are flushed to HDFS, in descending order of Memstore size, until overall Memstore usage falls below hbase.regionserver.global.memstore.lowerLimit (default 38% of heap).
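The global flush thresholds described above can be sketched as follows. This is a minimal illustration of the arithmetic only, using the default fractions quoted in the text; the class and method names are illustrative, not the HBase API.

```java
// Sketch of the region-server-wide memstore flush thresholds described above.
public class MemstoreFlushSketch {
    static final double UPPER_LIMIT = 0.40; // hbase.regionserver.global.memstore.upperLimit
    static final double LOWER_LIMIT = 0.38; // hbase.regionserver.global.memstore.lowerLimit

    // True when combined memstore usage forces a region-server-wide flush.
    static boolean mustFlush(long totalMemstoreBytes, long heapBytes) {
        return totalMemstoreBytes > (long) (heapBytes * UPPER_LIMIT);
    }

    // Flushing proceeds from the largest memstore down until usage drops
    // below the lower limit; this returns how many bytes must be released.
    static long bytesToRelease(long totalMemstoreBytes, long heapBytes) {
        long target = (long) (heapBytes * LOWER_LIMIT);
        return Math.max(0, totalMemstoreBytes - target);
    }
}
```

For a 10 GB heap, for example, a flush is forced once the combined Memstores pass 4 GB, and flushing continues until they drop back under 3.8 GB.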

2) A read request first checks the Memstore; if the data is not found there, it checks the BlockCache; if still not found, it reads from disk, and the result is put into the BlockCache. Because the BlockCache uses an LRU strategy, when it reaches its upper limit of heapsize * hfile.block.cache.size * 0.85, the eviction mechanism kicks in and evicts the oldest batch of data.

A RegionServer has one BlockCache and N Memstores, and the sum of their sizes must not reach or exceed heapsize * 0.8, otherwise HBase will not start normally.
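The startup sanity check just described amounts to a simple budget rule, sketched below. The constant and method names are illustrative, not HBase's internal check.

```java
// Sketch of the heap budget rule above: the memstore heap fraction plus the
// block cache heap fraction must stay strictly below 0.8, or the region
// server refuses to start.
public class HeapBudgetCheck {
    static final double MAX_COMBINED_FRACTION = 0.8;

    static boolean canStart(double memstoreFraction, double blockCacheFraction) {
        return memstoreFraction + blockCacheFraction < MAX_COMBINED_FRACTION;
    }
}
```

With the defaults (0.4 for Memstore, 0.25 for hfile.block.cache.size) the sum is 0.65, which leaves the remaining heap for everything else the RegionServer does.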

2. BlockCache mechanism

To serve data efficiently, HBase provides the BlockCache mechanism to cache blocks in memory. Blocks can be cached in two kinds of memory: JVM heap memory and off-heap memory. The caching strategy for the first kind is called LRUCache; the strategies for the second kind are SlabCache and BucketCache.

The BlockCache is at RegionServer level: a RegionServer has only one BlockCache, and it is initialized when the RegionServer starts. So far, HBase has implemented three BlockCache schemes in succession. LRUBlockCache is the initial and default implementation; HBase 0.92 implemented the second scheme, SlabCache (see HBASE-4027); after HBase 0.96, another alternative, BucketCache, was officially provided (see HBASE-7404).

The location of BlockCache in HBase is shown in the following figure:

Three strategies

1. LRUBlockCache

LRUBlockCache (Least-Recently-Used) is HBase's default BlockCache implementation. An LRU cache evicts the least recently used data to make room for the most recently read data, and what is read most recently is often also what is read most frequently, so an LRU cache improves the performance of the system.

LRUBlockCache divides the cache into three parts: single-access, multi-access and in-memory, which account for 25%, 50% and 25% of the total BlockCache size, respectively.

Single-access priority: a data block gets this priority when it is read from HDFS for the first time, and it is considered first when cache space needs to be reclaimed (replaced). The rationale is that blocks read by a one-off scan should be evicted before blocks that will be used again.

Multi-access priority: if a data block with single-access priority is accessed again later, it is upgraded to multi-access priority. When the content of the cache needs to be cleared (replaced), this content is only a secondary candidate.

In-memory priority: indicates that the data can stay resident in memory; it is generally used for small amounts of frequently accessed data, such as metadata. When creating a table, you can also set the column family attribute IN_MEMORY => true to put the column family into the in-memory area. The two concrete ways to do this are:

a. In Java, call HColumnDescriptor.setInMemory(true).

b. When creating or modifying a table in the hbase shell, set IN_MEMORY => 'true', for example: create 't1', {NAME => 'f1', IN_MEMORY => 'true'}
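The three priority tiers above can be sketched as a small state machine: a block enters as single-access on first read, is promoted to multi-access when read again, and blocks from IN_MEMORY column families go straight to the in-memory tier. This is a minimal illustration of the promotion rule only; the real LRUBlockCache also tracks sizes and evicts per tier, and these class names are not the HBase API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of LRUBlockCache's three priority tiers and the
// single -> multi promotion rule described above.
public class TieredCacheSketch {
    enum Tier { SINGLE, MULTI, IN_MEMORY }

    private final Map<String, Tier> tiers = new HashMap<>();

    // Record an access to a block; inMemoryFamily mirrors IN_MEMORY => 'true'.
    Tier access(String blockKey, boolean inMemoryFamily) {
        if (inMemoryFamily) {
            tiers.put(blockKey, Tier.IN_MEMORY);      // resident tier
        } else {
            Tier t = tiers.get(blockKey);
            tiers.put(blockKey, t == null ? Tier.SINGLE : Tier.MULTI);
        }
        return tiers.get(blockKey);
    }
}
```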

The downside: the LRUBlockCache mechanism, under the CMS GC policy, can produce so much memory fragmentation that it may trigger the infamous Full GC and its dreaded 'stop-the-world' pause; with a large heap, a Full GC can easily last a long time, even minutes. A Full GC pauses the whole process, so a long one greatly affects the normal read and write requests of the business.

2. SlabCache

SlabCache was discarded after version 1.0 (HBASE-11307). Internally it is divided into two areas, 80% and 20%: if a cached block is no larger than the blocksize, it is placed in the front (80%) area; if it is larger than 1x but no more than 2x the blocksize, it is placed in the back (20%) area; anything larger than 2x is not cached.
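The placement rule just described can be sketched as a simple size comparison. The names here are illustrative, not the HBase API.

```java
// Sketch of SlabCache's placement rule: blocks up to the configured
// blocksize go to the large (80%) area, blocks between 1x and 2x to the
// small (20%) area, and anything larger is not cached at all.
public class SlabPlacementSketch {
    enum Area { LARGE_80, SMALL_20, NOT_CACHED }

    static Area place(long blockBytes, long configuredBlockSize) {
        if (blockBytes <= configuredBlockSize) return Area.LARGE_80;
        if (blockBytes <= 2 * configuredBlockSize) return Area.SMALL_20;
        return Area.NOT_CACHED;
    }
}
```

With the default 64 KB blocksize, a 64 KB block lands in the 80% area, a 100 KB block in the 20% area, and a 200 KB block is not cached, which is exactly why non-default BlockSize settings defeat this scheme.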

Like LRUBlockCache, SlabCache uses the Least-Recently-Used algorithm to evict expired blocks.

Unlike LRUBlockCache, when SlabCache evicts a block it only needs to mark the corresponding byte buffer as free; a subsequent cache write can directly overwrite that memory.

In a production cluster environment, different tables and different column families may have different BlockSize settings. Clearly, the SlabCache scheme, which by default can only store blocks of two fixed sizes, cannot cover some user scenarios, such as BlockSize = 256k: SlabCache alone cannot achieve block caching there. Therefore, in the actual HBase implementation, SlabCache and LRUBlockCache are used together, which is called DoubleBlockCache. On a random read, a block loaded from HDFS is stored in both caches. On a cached read, LRUBlockCache is checked first; on a cache miss, SlabCache is checked, and on a hit there the block is put back into LRUBlockCache.

Disadvantages: actual testing showed that the DoubleBlockCache scheme has many drawbacks. For example, the fixed memory slots in the SlabCache design lead to low actual memory utilization, and caching blocks with LRUBlockCache still produces a large amount of memory fragmentation due to JVM GC. Therefore, after HBase version 0.98, this scheme was deprecated.

3. BucketCache

This strategy was designed by Alibaba and is adopted by CDH clusters. BucketCache can be configured to work in three modes: heap, offheap and file. Whichever mode it works in, BucketCache allocates many Buckets with fixed size labels. Like SlabCache, a Bucket stores data blocks of one specified BlockSize; unlike SlabCache, BucketCache allocates Buckets of 14 different sizes at initialization, and even if one kind of Bucket runs out of space, the system borrows memory from other Buckets, so memory utilization does not stay low. As for the working modes: heap mode means the Buckets are allocated from the JVM heap; offheap mode uses DirectByteBuffer to manage off-heap memory; and file mode uses a cache file, such as one on SSD, to store data blocks.
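A BucketCache-style allocator's core decision is picking the smallest bucket size class that fits a block, which can be sketched as below. The size classes listed here are illustrative stand-ins, not the exact 14 defaults HBase ships with.

```java
// Sketch of bucket selection in a BucketCache-style allocator: walk the
// ascending size classes and take the first one the block fits in.
public class BucketPickSketch {
    // Illustrative ascending size classes in KB (not HBase's exact defaults).
    static final int[] SIZES_KB = {5, 9, 17, 33, 41, 49, 57, 65, 97, 129, 193, 257, 385, 513};

    // Returns the chosen bucket size in KB, or -1 when no class fits.
    static int pickBucketKb(int blockKb) {
        for (int s : SIZES_KB) {
            if (blockKb <= s) return s;
        }
        return -1;
    }
}
```

Having many size classes, plus the ability to borrow space between classes, is what lets BucketCache keep utilization high where SlabCache's two fixed areas could not.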

The downside: in the actual implementation, HBase uses BucketCache together with LRUBlockCache, which is called CombinedBlockCache. Unlike DoubleBlockCache, the system mainly stores Index Blocks and Bloom Blocks in LRUBlockCache and Data Blocks in BucketCache. A random read therefore first looks up the corresponding Index Block in LRUBlockCache and then looks up the corresponding data block in BucketCache. BucketCache corrects the shortcomings of SlabCache through a more reasonable design and greatly reduces the actual impact of JVM GC on business requests, but it has its own problems, such as memory copies when using off-heap memory, which affect read and write performance to some extent.

The above is how to understand the HBase 1.x read cache BlockCache. There are some knowledge points here that we may see or use in our daily work; I hope you can learn more from this article.
