Shulou — SLTechnology News & Howtos, updated 2025-01-19
How should we understand the HFile merge (compaction) process in HBase? This article analyzes the question in detail, in the hope of helping readers who want to answer it find a simple, workable explanation.
HBase divides compaction into two categories by scope: Minor Compaction and Major Compaction.

Minor Compaction selects a few small, adjacent StoreFiles and merges them into one larger StoreFile; deleted or expired cells are not removed in the process. The result of a Minor Compaction is fewer, larger StoreFiles.

Major Compaction merges all the StoreFiles of a Store into a single StoreFile, and in the process cleans up three kinds of now-meaningless data: deleted cells, cells whose TTL has expired, and cells whose version count exceeds the configured maximum. A Major Compaction usually takes a long time and consumes a lot of system resources, so it can noticeably affect upper-layer workloads. For this reason, production deployments often disable automatic Major Compaction and trigger it manually during off-peak hours.
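The three cleanup rules of a Major Compaction can be sketched in a few lines. This is a simplified illustration, not HBase's actual code: cells are modeled as plain tuples, a delete marker is assumed to mask every version of its row (real HBase delete markers are timestamp-scoped), and the function name `major_compact` is hypothetical.

```python
import time

def major_compact(cells, ttl_seconds, max_versions, now=None):
    """Apply the three Major Compaction cleanup rules to a list of cells.

    Each cell is a tuple (rowkey, timestamp, is_delete_marker, value).
    Simplification: a delete marker masks all versions of its row.
    """
    now = now if now is not None else time.time()
    deleted_rows = {row for row, ts, is_del, val in cells if is_del}
    versions_kept = {}
    survivors = []
    # Process newest-first per row, so "keep the first max_versions" works.
    for row, ts, is_del, value in sorted(cells, key=lambda c: (c[0], -c[1])):
        if is_del or row in deleted_rows:
            continue                      # rule 1: deleted data is dropped
        if now - ts > ttl_seconds:
            continue                      # rule 2: TTL-expired data is dropped
        versions_kept[row] = versions_kept.get(row, 0) + 1
        if versions_kept[row] > max_versions:
            continue                      # rule 3: excess versions are dropped
        survivors.append((row, ts, value))
    return survivors
```

A Minor Compaction, by contrast, would skip all three `continue` branches and simply rewrite every cell into the merged file.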
Why are StoreFiles merged?

Writes first accumulate in the MemStore; when the MemStore fills up, its contents are flushed to disk as a StoreFile, and each flush produces a separate file. Once the number of StoreFiles reaches a threshold, the small files are merged into larger ones, because Hadoop handles small files poorly: fewer, larger files give better performance.
When do you merge?
There are three ways compaction is triggered: a MemStore flush, a periodic background-thread check, and manual triggering.

1. MemStore flush:

The root source of compaction is the flush operation: each MemStore flush produces a new HFile, and as files accumulate they need to be compacted. So after every flush, HBase checks the number of files in the current Store; once the count exceeds the configured threshold, a compaction is triggered. Note that compaction runs per Store, and a flush covers every Store of the Region, so several compactions may be triggered in a short period.
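The post-flush check described above can be sketched as follows. The default threshold of 3 is modeled on the `hbase.hstore.compactionThreshold` setting; the function and parameter names are illustrative, not HBase's internal API.

```python
def needs_compaction_after_flush(store_file_count, compaction_threshold=3):
    """After a flush, request compaction once the Store holds more files
    than the configured threshold (cf. hbase.hstore.compactionThreshold)."""
    return store_file_count > compaction_threshold

def on_region_flush(region_stores, compaction_threshold=3):
    """A flush covers every Store (column family) in the Region, so several
    Stores may request compaction at once -- which is why compactions can
    fire repeatedly in a short window. region_stores maps store name to
    its current StoreFile count."""
    return [name for name, count in region_stores.items()
            if needs_compaction_after_flush(count, compaction_threshold)]
```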
2. Periodic background check:

A background thread periodically checks whether compaction is needed; the check interval is configurable. The thread first checks whether the file count exceeds the threshold, and if so, triggers a compaction. Otherwise it checks whether the major-compaction condition is met: roughly, if the earliest update time of any HFile in the current Store is earlier than a cutoff value mcTime, a major compaction is triggered (by default at most once every 7 days; it can also be triggered manually). HBase relies on this mechanism to delete expired data on a regular basis.
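The background thread's decision can be sketched like this. It is a simplified model under stated assumptions: a 7-day major-compaction period, a file-count threshold of 3, and invented names (`check_compaction`, `MAJOR_PERIOD`) that do not correspond to HBase internals.

```python
import time

MAJOR_PERIOD = 7 * 24 * 3600  # assumed default: major compaction every 7 days

def check_compaction(file_count, earliest_hfile_update_ts,
                     threshold=3, now=None):
    """Return which compaction (if any) the periodic check would trigger."""
    now = now if now is not None else time.time()
    if file_count > threshold:
        return "minor"                 # too many files: compact now
    mc_time = now - MAJOR_PERIOD       # cutoff described as mcTime above
    if earliest_hfile_update_ts < mc_time:
        return "major"                 # oldest data predates the cutoff
    return None                        # nothing to do this cycle
```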
3. Manual trigger:

Manual triggering is mostly used to run a Major Compaction, typically in these situations:

Many businesses worry about the impact of an automatic major compaction on read and write performance, so they trigger it manually during off-peak hours instead.

After an alter operation, the user wants the change to take effect on existing data immediately, and triggers a major compaction by hand.

An HBase administrator notices disk capacity running low and triggers a major compaction to delete a large amount of expired data.
How is it sorted?
The MemStore keeps data sorted as it is inserted, so the data in the MemStore is always ordered. When the MemStore is flushed to disk, the resulting StoreFile is therefore also internally sorted. Compaction thus has to merge several individually sorted StoreFiles into one large, sorted StoreFile.

First, each StoreFile to be merged is wrapped in a StoreFileScanner; the scanners are collected into a List and wrapped in a StoreScanner object. When the StoreScanner is initialized, it places the StoreFileScanners into an internal queue, ordered by each scanner's current smallest rowkey. Each call to StoreScanner's next() method then returns the KV pair with the overall smallest rowkey among all the scanners, and that KV pair is appended to the merged StoreFile. Because the value taken at each step is the smallest remaining across all files, the merged StoreFile ends up written in ascending sorted order.
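The procedure above is a classic k-way merge, which can be mirrored with a heap. In this sketch, a sorted Python list of (rowkey, value) pairs stands in for an HFile, and an iterator over it plays the role of a StoreFileScanner; the heap plays the role of StoreScanner's internal queue.

```python
import heapq

def merge_storefiles(storefiles):
    """K-way merge of sorted StoreFiles, mimicking StoreScanner.next():
    the heap is keyed by each scanner's current smallest rowkey, and each
    pop yields the globally smallest KV pair."""
    heap = []
    for i, storefile in enumerate(storefiles):
        scanner = iter(storefile)          # one "StoreFileScanner" per file
        first = next(scanner, None)
        if first is not None:
            # Include index i so ties never compare the iterators themselves.
            heapq.heappush(heap, (first, i, scanner))
    merged = []
    while heap:
        kv, i, scanner = heapq.heappop(heap)   # smallest current rowkey
        merged.append(kv)                      # append to the merged file
        nxt = next(scanner, None)              # advance that scanner
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, scanner))
    return merged
```

Each input file is read sequentially exactly once, so the merge is O(N log k) for N total KV pairs across k files, and the output is sorted without ever holding all files in memory at once.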
What will be done during the merger?
Cells carrying delete markers, and old versions beyond the retention settings, are discarded during the merge:

(1) A maximum number of versions can be configured in advance; versions beyond that count are discarded.

(2) A retention period (TTL) for versions can also be configured in advance; versions older than it are discarded. The merge produces a larger StoreFile, and when a StoreFile's size exceeds a certain threshold, the current Region is split in two, and the HMaster (the HBase master node) assigns the resulting regions to different HRegionServers to achieve load balancing.
If a query happens to arrive while a merge is in progress, it can still be served: the small StoreFiles being merged are first read into memory, and if a user request comes in, the relevant data can be retrieved from there and returned. You can picture it as an independent in-memory snapshot serving queries while the merge itself proceeds in a separate memory space; when the merge completes, the snapshot space is released and everything returns to the normal state.
That is the answer to how the HFile merge process in HBase works. I hope the content above is of some help; if you still have unresolved questions, you can follow the industry information channel to learn more.
© 2024 shulou.com SLNews company. All rights reserved.