What is the use of HBase file indexing

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains in detail what HBase file indexes are used for. The editor finds the material practical and shares it here as a reference; hopefully you will take something away from it.

HBase overall structure diagram

Brief introduction of some terms

HMaster

Manages the HRegionServers that join the cluster, manages and assigns Regions, and handles Table creation, deletion, modification and similar operations.

HRegion

Each Table can be split into multiple Regions, and each Region covers one range of rows in the Table. For example, a Table whose RowKeys run from 0 to 100 can be split into two Regions, 0-50 and 51-100.

HRegionServer

Each HRegionServer manages multiple Regions and serves the reads and writes for those Regions.

HLog

Each HRegionServer has one HLog that records all operations. It is mainly used to recover data that has been damaged or lost. Physically, it is a Hadoop SequenceFile.

Store

Each HRegion manages multiple Stores, and each Store manages the data of one Family in the Table.

StoreFile

The class that represents the persisted data of a Family.

MemStore

Each Store has one MemStore, which buffers the operations on the Family. When the MemStore grows to a certain size, it is flushed to HDFS as a StoreFile.

HFile

StoreFile is only a lightweight wrapper around HFile. All Table data files stored in HDFS are in HFile format.
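The relationship between MemStore and StoreFile described above can be illustrated with a small Python sketch. It is a toy model, not HBase code: the class name, the `flush_threshold_bytes` parameter and the use of string length as a size estimate are all assumptions made for illustration.

```python
class Store:
    """Toy model of one Store: an in-memory MemStore plus flushed StoreFiles."""

    def __init__(self, flush_threshold_bytes):
        self.flush_threshold_bytes = flush_threshold_bytes  # hypothetical setting
        self.memstore = {}       # row key -> value, buffered in memory
        self.memstore_bytes = 0  # rough size estimate (string lengths only)
        self.store_files = []    # each flush appends one immutable, sorted file

    def put(self, row_key, value):
        old = self.memstore.get(row_key)
        if old is not None:
            # Overwriting a key must not count its size twice.
            self.memstore_bytes -= len(row_key) + len(old)
        self.memstore[row_key] = value
        self.memstore_bytes += len(row_key) + len(value)
        if self.memstore_bytes >= self.flush_threshold_bytes:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        # A flushed file is sorted by row key and never modified afterwards.
        self.store_files.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}
        self.memstore_bytes = 0


store = Store(flush_threshold_bytes=16)
store.put("row2", "bbbb")  # 8 bytes buffered, below the threshold
store.put("row1", "aaaa")  # 16 bytes buffered, which triggers a flush
```

After the second `put`, the MemStore is empty again and `store.store_files` holds one sorted file.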

The overall structure of the index

Viewed from the overall architecture, the indexes in HBase are distributed across the following layers.

A. ZooKeeper stores the address of the RegionServer to which the -ROOT- Table was assigned when HMaster started.

B. The -ROOT- Table stores the RegionServers that host the Regions of the .META. Table after it has split into multiple Regions.

C. The .META. Table stores the RegionServer address of every Region of each Table.

D. -ROOT- only ever has one Region, while .META. can be split into multiple Regions.

The table structure of -ROOT- and .META. is the same, as follows:

Rowkey                           Info: Regioninfo                Info: Server   Info: Serverstartcode
TableName, StartKey, TimeStamp   StartKey, EndKey, Family list   Address        Start time of the server that loaded the current Region

-ROOT- example:

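Suppose .META. has split into two Regions, distributed across two RegionServers. The Rowkey is made up of TableName, StartKey and TimeStamp; of the Info columns (Regioninfo, Server, Serverstartcode), only Server is shown.

TableName       StartKey   TimeStamp   Server
.META. Table1   Pk0        12345278    RegionServer1
.META. Table1   Pk1000     123451278   RegionServer2
.META. Table2   Pk0        123431278   RegionServer1
.META. Table2   Pk1000     123457278   RegionServer2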

Example of .META.:

The Rowkey is made up of TableName, StartKey and TimeStamp; of the Info columns (Regioninfo, Server, Serverstartcode), only Server is shown.

TableName   StartKey   TimeStamp   Server
Table1      Pk0        12345278    RegionServer1
Table1      Pk1000     123451278   RegionServer2
Table1      Pk2000     12345878    RegionServer3
……
Table2      Pk0        12345278    RegionServer1
Table2      Pk1000     12345478    RegionServer2
Table2      Pk2000     12345778    RegionServer3

Positioning process of RegionServer

When the Client supplies a TableName and RowKey for a put, get or delete on a Table, it first obtains from ZooKeeper the RegionServer that hosts -ROOT-. Using the RowKey, it then looks up in -ROOT- the RegionServer that hosts the relevant .META. Region, and finally looks up in .META. the RegionServer where the RowKey lives. Because the Client caches the locations of -ROOT-, .META. and the Regions as it goes, it needs only one location query in the best case and six in the worst case (when it has to repeat the lookup recursively, starting from the Table's Region).
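The lookup chain and the client-side cache can be sketched in Python as follows. Everything here is a simplified stand-in: the dictionaries and lists replace ZooKeeper, -ROOT- and .META., the end keys and server names are made-up sample data, and the lookup counter only counts the simulated steps of this toy model.

```python
import bisect

# Hypothetical in-memory stand-ins for the three levels.
ZOOKEEPER = {"-ROOT-": "RegionServer0"}  # level 1: which server hosts -ROOT-

# Level 2: -ROOT- rows, (first key of a .META. region, server hosting that region).
ROOT_BY_SERVER = {
    "RegionServer0": [(("Table1", "Pk0"), "RegionServer1"),
                      (("Table2", "Pk0"), "RegionServer2")],
}

# Level 3: .META. rows, ((table, start key), end key, server hosting the region).
META_BY_SERVER = {
    "RegionServer1": [(("Table1", "Pk0"), "Pk1000", "RegionServer1"),
                      (("Table1", "Pk1000"), "Pk2000", "RegionServer2"),
                      (("Table1", "Pk2000"), None, "RegionServer3")],
    "RegionServer2": [(("Table2", "Pk0"), "Pk1000", "RegionServer1"),
                      (("Table2", "Pk1000"), "Pk2000", "RegionServer2"),
                      (("Table2", "Pk2000"), None, "RegionServer3")],
}


def floor_entry(entries, key):
    """Return the entry with the greatest start key that is <= key."""
    starts = [entry[0] for entry in entries]
    i = bisect.bisect_right(starts, key) - 1
    if i < 0:
        raise KeyError(key)
    return entries[i]


class Client:
    def __init__(self):
        self.region_cache = []   # (table, start key, end key, server)
        self.remote_lookups = 0  # simulated remote steps in this toy model

    def locate(self, table, row_key):
        # A cached region that covers the row key needs no remote lookup.
        for t, start, end, server in self.region_cache:
            if t == table and start <= row_key and (end is None or row_key < end):
                return server
        root_server = ZOOKEEPER["-ROOT-"]                      # 1. ZooKeeper
        self.remote_lookups += 1
        _, meta_server = floor_entry(ROOT_BY_SERVER[root_server],
                                     (table, row_key))         # 2. -ROOT-
        self.remote_lookups += 1
        (_, start), end, server = floor_entry(META_BY_SERVER[meta_server],
                                              (table, row_key))  # 3. .META.
        self.remote_lookups += 1
        self.region_cache.append((table, start, end, server))
        return server


client = Client()
first = client.locate("Table1", "Pk1500")   # cold cache: walks all three levels
second = client.locate("Table1", "Pk1700")  # same region: answered from the cache
```

The sketch leaves out cache invalidation, which is what makes the worst case more expensive when a cached location has gone stale.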

The data storage structure of HFile:

As shown in the figure above, an HFile is variable in length; only the File Info and Trailer sections have a fixed length. The Trailer holds pointers to the starting offsets of the File Info, Data Index and Meta Index sections. The Index blocks record the starting offset of each Data and Meta block. Within a Data block, the Magic field is used to detect whether the data is corrupted, and each Data block stores multiple KeyValue entries.
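To show why this layout makes reads cheap, here is a minimal Python sketch of a file with data blocks, an index and a fixed-length trailer. It is not the real HFile binary format: the JSON encoding, the `MAGIC` value and the function names are assumptions chosen to keep the example short.

```python
import bisect
import json
import struct

MAGIC = b"BLOCKMAGIC"            # hypothetical marker at the start of each data block
TRAILER = struct.Struct(">QQ")   # fixed-length trailer: index offset, index length


def write_file(blocks):
    """blocks: list of sorted [(key, value), ...] lists. Returns the file as bytes."""
    out = bytearray()
    index = []  # one (first key, offset, length) entry per data block
    for block in blocks:
        payload = MAGIC + json.dumps(block).encode()
        index.append((block[0][0], len(out), len(payload)))
        out += payload
    index_bytes = json.dumps(index).encode()
    index_offset = len(out)
    out += index_bytes
    out += TRAILER.pack(index_offset, len(index_bytes))  # the trailer goes last
    return bytes(out)


def get(data, key):
    """Look up key by reading only the trailer, the index and one data block."""
    index_offset, index_len = TRAILER.unpack(data[-TRAILER.size:])
    index = json.loads(data[index_offset:index_offset + index_len])
    first_keys = [entry[0] for entry in index]
    i = bisect.bisect_right(first_keys, key) - 1  # last block whose first key <= key
    if i < 0:
        return None  # the key sorts before everything in the file
    _, offset, length = index[i]
    block = data[offset:offset + length]
    if not block.startswith(MAGIC):
        raise ValueError("corrupt data block")
    return dict(json.loads(block[len(MAGIC):])).get(key)


hfile = write_file([[("a", "1"), ("c", "3")], [("m", "13"), ("p", "16")]])
```

Because the trailer has a fixed size, a reader can always find it at the end of the file, follow it to the index, and then fetch a single data block.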

The above picture shows the data structure of KeyValue.

The above picture also shows the data structure of HFile.

The entire region file path looks like this:

Each column-family directory contains HFile data files. The file names are arbitrary numbers produced by Java's built-in random number generator. The code guarantees that names do not collide: if a newly generated number is already in use, it keeps generating until it finds an unused one.
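The retry-until-unused naming scheme can be sketched as below. This is an illustration in Python rather than the Java code HBase uses; the function name, the `rng` parameter and the 63-bit range are assumptions.

```python
import os
import random


def new_store_file_name(family_dir, rng=random):
    """Pick a random numeric file name that is not yet used in family_dir."""
    existing = set(os.listdir(family_dir))
    while True:
        name = str(rng.getrandbits(63))
        if name not in existing:  # on a collision, simply draw another number
            return name
```

Passing the generator in as `rng` makes the collision branch easy to exercise with a fake generator that returns a known sequence.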

Operation of Region

Once the RegionServer hosting the RowKey has been located, the corresponding Region is obtained by its RegionName; the RegionName comes from the Rowkey stored in .META.

Get:

1. The HRegion.get API first checks the Family, to make sure the Family named in the Get matches one defined in the Table.

2. Using the Family, it finds the corresponding Store, obtains a StoreScanner instance from that Store, and adds it to a queue of scanners.

3. A StoreScanner holds two kinds of scanner, MemstoreScanner and HFileScanner, which traverse the KeyValues in the MemStore and in the HFiles, respectively.

4. Because there are multiple HFiles, the HFileScanners are filtered first. Each scanner uses the DataIndex of its HFile, which stores the firstKey of every DataBlock, to seek to the StartRow. If the KeyValue is not in the current HFile, that HFileScanner is closed.

5. Note that once the RegionServer has started, the DataIndex of each HFile is kept in memory.

6. When the StoreScanner looks for a KeyValue, it first searches the MemStore with the MemstoreScanner. If the data is not there, it uses the HFileScanners to search the DataBlocks of the HFiles, where the DataIndex quickly locates the right Block.

7. Because HFiles are persisted in HDFS, each IO read against an HFile fetches only one Data block, whose position is found through the DataIndex of the HFile.

8. If Bloom Filters are configured, they can quickly tell whether a RowKey or value may be in an HFile, so HFiles that cannot contain it are skipped.
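The read order in the steps above (MemStore first, then only the HFiles that may hold the key) can be sketched in Python. This is a simplified illustration, not HBase's scanner implementation: the class names, the filter size and the SHA-256 based hashing are assumptions.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class StoreFile:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.reads = 0  # how many times this file was actually searched

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


def store_get(memstore, store_files, key):
    """Check the MemStore first, then only files whose Bloom filter may match."""
    if key in memstore:
        return memstore[key]
    for store_file in reversed(store_files):  # newest file first
        if store_file.bloom.might_contain(key):
            value = store_file.get(key)
            if value is not None:
                return value
    return None


older = StoreFile({"row1": "v1"})
newer = StoreFile({"row2": "v2"})
from_memstore = store_get({"row3": "v3"}, [older, newer], "row3")
from_file = store_get({}, [older, newer], "row1")
```

A Bloom filter can only rule a file out. When it answers "maybe", the file still has to be searched, which is why `store_get` checks the returned value before stopping.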

That concludes this article on what HBase file indexes are used for. I hope the content above is of some help and that you learn something from it. If you found the article useful, please share it so that more people can see it.
