2025-03-26 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article discusses how to choose an appropriate value for hbase.hregion.max.filesize, along with several related HBase tuning considerations.
1 How large should hbase.hregion.max.filesize be set?
Default: 256MB
Description: Maximum HStoreFile size. If any one of a column family's HStoreFiles grows to exceed this value, the hosting HRegion is split in two.
Tuning:
The default maximum HFile size (hbase.hregion.max.filesize) in HBase is 256MB, and Google's Bigtable paper likewise recommends a maximum tablet size of 100-200MB. What is the reasoning behind this size?
As is well known, data written to HBase first lands in the memstore. When a memstore reaches 64MB, it is flushed to disk as a storefile. Once the number of storefiles exceeds 3, a compaction merges them into a single storefile, discarding expired versions along the way (e.g., superseded updates). When the merged storefile grows beyond the maximum HFile size, a split is triggered and the region is divided in two.
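The flush/compaction/split cycle described above can be sketched as a toy simulation (illustrative only, not HBase code; the thresholds mirror the text: 64MB memstore flush, compaction once storefile count exceeds 3, split when the merged file exceeds the maximum):

```python
MEMSTORE_FLUSH_MB = 64       # memstore flushes to a storefile at this size
COMPACTION_MIN_FILES = 3     # compaction runs once storefile count exceeds this

def write_to_region(total_mb, max_filesize_mb=256):
    """Simulate writing total_mb (1 MB at a time) into one region.

    Returns (storefile_sizes_in_mb, split_count).
    """
    storefiles = []   # on-disk storefile sizes in MB
    memstore = 0
    splits = 0
    for _ in range(total_mb):
        memstore += 1
        if memstore >= MEMSTORE_FLUSH_MB:            # flush: memstore -> storefile
            storefiles.append(memstore)
            memstore = 0
        if len(storefiles) > COMPACTION_MIN_FILES:   # compaction: merge into one file
            storefiles = [sum(storefiles)]
        if storefiles and storefiles[0] > max_filesize_mb:
            # split: the region divides in two; keep following one daughter region
            storefiles = [storefiles[0] // 2]
            splits += 1
    return storefiles, splits

# A smaller hbase.hregion.max.filesize produces many more splits for the
# same write volume, matching the instability observed in the stress tests:
many_splits = write_to_region(2048, max_filesize_mb=128)[1]
few_splits = write_to_region(2048, max_filesize_mb=512)[1]
```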
The author ran sustained insert stress tests with different values of hbase.hregion.max.filesize and drew the following conclusions: the smaller the value, the higher the average throughput but the less stable it is; the larger the value, the lower the average throughput, but the shorter the periods of throughput instability.
Why does this happen? The reasoning is as follows:
a. When hbase.hregion.max.filesize is small, splits are triggered more often. A split takes the region offline, so requests to that region block until the split completes (the client-side blocking time defaults to 1s). When many regions split at the same time, overall service availability suffers, so throughput and response times become unstable.
b. When hbase.hregion.max.filesize is large, an individual region splits rarely, and simultaneous splits across many regions are rarer still, so throughput is more stable than with a small HFile size. However, because a region goes a long time without splitting, it undergoes more compactions. A compaction reads the region's original data, rewrites it to HDFS, and then deletes the originals. On an IO-bound system this inevitably drags down average throughput.
Taking both cases together, hbase.hregion.max.filesize should be neither too large nor too small; 256MB appears to be a reasonable empirical value. For offline applications, 128MB is more appropriate, while online applications should not go below 256MB unless the split mechanism itself is modified.
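For reference, the parameter could be set in hbase-site.xml along these lines (a sketch only; 256MB expressed in bytes, to be adjusted for your workload):

```xml
<!-- hbase-site.xml: example only; 268435456 bytes = 256 MB -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>268435456</value>
</property>
```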
2 Autoflush=false
Both the official documentation and many blogs advocate setting autoflush=false in application code to speed up HBase writes. The author believes this setting should be used with care in online applications, for the following reasons:
a. With autoflush=false, when a client issues a delete or put, the request is cached on the client until either the buffered data exceeds 2MB (controlled by hbase.client.write.buffer) or the user calls flushCommits(); only then is it sent to the regionserver. So even if htable.put() returns successfully, the request has not necessarily reached the server. If the client crashes before the buffer threshold is reached, the unsent data is lost. That is unacceptable for online services with zero tolerance for data loss.
b. autoflush=true slows writes by roughly 2-3x, but many online applications must keep it on, which is why HBase defaults it to true. When it is true, every request is sent to the regionserver, and the first thing the regionserver does on receipt is write the HLog (WAL), which puts heavy demands on IO. To improve HBase write speed, increase IO throughput wherever possible: add disks, use RAID cards, or reduce the replication factor.
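The client-side buffering behind autoflush=false can be modeled with a small sketch (illustrative Python, not the real HBase client; BufferedTable and its methods are hypothetical stand-ins for the client's write buffer, flushCommits(), and hbase.client.write.buffer):

```python
class BufferedTable:
    """Toy model of an HBase client with autoflush=false."""

    def __init__(self, write_buffer_bytes=2 * 1024 * 1024):
        self.write_buffer_bytes = write_buffer_bytes  # hbase.client.write.buffer
        self.buffer = []   # puts cached on the client, not yet sent
        self.server = []   # puts the "regionserver" has actually received

    def put(self, row, value):
        self.buffer.append((row, value))
        if self._buffered_bytes() >= self.write_buffer_bytes:
            self.flush_commits()           # only now does data leave the client

    def _buffered_bytes(self):
        return sum(len(r) + len(v) for r, v in self.buffer)

    def flush_commits(self):
        self.server.extend(self.buffer)    # send everything to the server
        self.buffer.clear()

# A put() that "succeeded" may still be lost if the client crashes
# before the buffer fills or flush_commits() runs:
t = BufferedTable()
t.put(b"row1", b"x" * 100)
assert t.server == []        # nothing has reached the server yet
t.flush_commits()
assert len(t.server) == 1    # durable only after an explicit flush
```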
3 Choosing families and qualifiers from a performance perspective
When migrating a table from a traditional relational database to HBase, how should families and qualifiers be chosen from a performance standpoint?
Consider the two extremes: ① each column becomes its own family, or ② the table has a single family and every column is a qualifier within it. What is the difference?
From the read side:
The more families there are, the cheaper it is to fetch the data of a single cell, because less IO and network traffic is involved.
With only one family, every read pulls in all of the current rowkey's data, wasting some network and IO.
Of course, if you always fetch a fixed set of columns together, putting those columns in one family is better than giving each its own family, because a single request retrieves them all.
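A toy model of this read-side trade-off (following the article's reasoning, not actual HBase IO accounting; the column names and sizes are made up):

```python
# column -> stored value size in bytes (hypothetical example row)
row = {"name": 8, "bio": 4096, "avatar": 65536}

def bytes_read_one_family_per_column(col):
    # Each column lives in its own family/store, so a get of one
    # column only touches that column's data.
    return row[col]

def bytes_read_single_family(col):
    # With one family holding all columns, the whole row's family
    # data is read even for a single-column get.
    return sum(row.values())

small = bytes_read_one_family_per_column("name")   # just the 8-byte name
large = bytes_read_single_family("name")           # name + bio + avatar
```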
From the write side:
First, regarding memory: within a region, each family of each table gets its own Store, and each Store gets its own MemStore, so more families consume more memory.
Second, regarding flush and compaction: in the current HBase version, both operate at region granularity. When any one family reaches its flush condition, all memstores of that region's families are flushed, even those holding only a small amount of data, producing many small files. This raises the likelihood of compactions, which are also region-wide, making compaction storms easy to trigger and reducing overall system throughput.
Third, regarding splits: since HFiles are per-family, with many families the data is spread across more HFiles, lowering the probability of a split. This is a double-edged sword. Fewer splits mean larger regions, and because the balancer works on region count rather than region size, balancing may become ineffective. On the plus side, fewer splits give the system steadier online service. The downside can be mitigated by manually triggering splits and balances during off-peak hours.
So for write-heavy systems: offline workloads should generally use a single family, while online workloads should allocate families according to the application's access patterns.
4 hbase.regionserver.handler.count
The number of RPC listener instances opened on a RegionServer, i.e., the number of IO request threads the RegionServer can handle. The default is 10.
This parameter is closely tied to memory: when tuning it, memory usage should be monitored as the primary reference.
For "big put" scenarios where each request consumes a lot of memory (large single puts, or scans configured with a large cache, both qualify), or when the RegionServer is short on memory, set it relatively low.
For scenarios where each request consumes little memory and TPS (transactions per second) requirements are very high, set it relatively high.
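As with the earlier parameter, this could be set in hbase-site.xml (a sketch; the value 30 is only an example of raising it for a high-TPS, small-request workload):

```xml
<!-- hbase-site.xml: example only; raise for many small high-TPS requests,
     lower for memory-heavy big puts or large-cache scans -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value>
</property>
```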
That concludes this overview of how to choose hbase.hregion.max.filesize and the related HBase tuning parameters discussed above.