Wednesday, 2019-2-20
Reference link
https://www.cnblogs.com/qingyunzong/p/8692430.html
HBase distributed storage mechanism (a detailed explanation of how it works)
HBase system architecture diagram
The role of each component in an HBase cluster
Client
// Contains the interfaces for accessing HBase. The client maintains caches to speed up access, such as region location information.
1. HBase has two special catalog tables:
.META.: records the Region mapping information for all user tables; .META. itself can have multiple Regions.
-ROOT-: records the Region information of the .META. table; -ROOT- has only one Region and is never split.
2. Before accessing user data, the client must first visit ZooKeeper to find the location of the -ROOT- table's Region, then read -ROOT- to locate .META., then read .META. to find the location of the user data. This takes multiple network round trips, but the client caches the results.
// After addressing, the client contacts the RegionServer directly.
Note: the -ROOT- table was removed in version 0.96; for the reasons, see HBase's addressing process in the reference above.
The locations of the -ROOT- and .META. tables are found through the information in ZooKeeper; the tables themselves are served by RegionServers.
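To make the client's role concrete, here is a minimal sketch (table, row, and column names are hypothetical) of a read through the HBase Java client; the ZooKeeper/.META. addressing and the location caching described above happen inside the client library:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client only needs the ZooKeeper quorum; it resolves region
        // locations through ZooKeeper and the meta table, then caches them.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // hypothetical hosts
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {
            Get get = new Get(Bytes.toBytes("row1"));
            get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"));
            Result result = table.get(get); // goes straight to the RegionServer
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("q1"))));
        }
    }
}
```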
HMaster
1. Assigns Regions to RegionServers.
2. Is responsible for load balancing across RegionServers.
3. Detects failed RegionServers and reassigns their Regions.
4. Garbage collection of unused files on HDFS.
5. Handles schema update requests // i.e., it manages users' requests to create, delete and alter tables.
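A minimal sketch of the schema operations the Master coordinates, using the HBase 2.x Java Admin API (table and column family names are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class SchemaExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("t1"); // hypothetical table
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                .build();
            admin.createTable(desc);   // create a table (a schema update the Master handles)
            admin.disableTable(name);  // a table must be disabled before deletion
            admin.deleteTable(name);   // delete it again
        }
    }
}
```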
RegionServer
1. A RegionServer maintains the Regions assigned to it by the Master and handles IO requests for those Regions.
2. A RegionServer is responsible for splitting Regions that become too large during operation // it is the component that actually manages the table data.
3. A RegionServer responds to clients' IO requests and reads and writes data on HDFS.
// As you can see, the client does not need the Master in order to access data in HBase
(addressing goes through ZooKeeper and the RegionServers; data reads and writes go to the RegionServers). The Master only maintains metadata about tables and Regions, so its load is very low.
Tip: RegionServers are usually co-located with HDFS DataNodes. The DataNode stores the data managed by the RegionServer; HBase data is ultimately stored on HDFS.
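A minimal sketch of a write (names hypothetical): the client sends the Put straight to the RegionServer hosting the row's Region, without involving the Master:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t1"))) {
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"),
                          Bytes.toBytes("value1"));
            // Sent directly to the RegionServer that hosts row1's Region;
            // the Master is not on the write path.
            table.put(put);
        }
    }
}
```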
ZooKeeper
1. Guarantees that there is only one active Master in the cluster at any time.
2. Stores the addressing entry point of all Regions.
3. Monitors the status of RegionServers in real time and notifies the Master when RegionServers come online or go offline.
4. Stores the HBase schema, including which tables exist and which column families each table has.
HRegion
A table is split into multiple Regions along the row dimension. The Region is the smallest unit of distributed storage and load balancing in HBase: different Regions can live on different RegionServers, but a single Region is never split across servers. Regions are split by size; each table initially has only one Region. As data is inserted, the Region grows, and when one of its column families reaches a threshold, it splits into two new Regions. Each Region is identified by:
<table name, startRowkey, creation time>. The catalog tables (-ROOT- and .META.) record each Region's endRowkey.
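To illustrate how the row-key space maps onto Regions, a minimal sketch (table name and split points are hypothetical) that pre-splits a table into three Regions at creation time, each covering a contiguous [startRowkey, endRowkey) range:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("t1"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                .build();
            // Two split keys yield three Regions by row-key range:
            // (-inf, "m"), ["m", "t"), ["t", +inf)
            byte[][] splitKeys = { Bytes.toBytes("m"), Bytes.toBytes("t") };
            admin.createTable(desc, splitKeys);
        }
    }
}
```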
MemStore and StoreFile
A Region consists of multiple Stores. Each Store holds all the data of one column family and comprises a MemStore in memory and StoreFiles on disk.
A write first goes to the MemStore. When the data in the MemStore reaches a threshold, the RegionServer flushes it to a StoreFile; each flush produces a separate StoreFile.
When a StoreFile grows past a size threshold, the current Region is split in two, and the HMaster assigns the new Regions to RegionServers to keep the load balanced.
When a client reads data, it looks in the MemStore first and falls back to the StoreFiles if the data is not found there.
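A minimal sketch of the two thresholds that drive this flush-then-split behaviour; the property names are the standard HBase ones, and the values shown are the usual defaults, not tuning advice:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushSplitConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // MemStore flush threshold per Region (default 128 MB): once a
        // MemStore reaches this size it is flushed to a new StoreFile.
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        // Maximum StoreFile size (default 10 GB): when any column family's
        // files exceed this, the hosting Region is split in two.
        conf.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024);
        System.out.println(conf.get("hbase.hregion.memstore.flush.size"));
    }
}
```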
HLog (WAL)
WAL means Write-Ahead Log (http://en.wikipedia.org/wiki/Write-ahead_logging, similar to the binlog in MySQL); it is used only for disaster recovery. The HLog records all changes to the data; if data is lost before it reaches disk, it can be recovered from the log.
Each RegionServer maintains a single HLog rather than one per Region. This mixes log entries from different Regions (and different tables), but appending to one file needs fewer disk seeks than writing many files at once, which improves write performance. The drawback is that if a RegionServer goes down, recovering its Regions requires splitting its log and distributing the pieces to other RegionServers for replay.
An HLog file is an ordinary Hadoop SequenceFile. The SequenceFile key is an HLogKey object, which records where the written data belongs: besides the table and Region names, it carries a sequence number and a timestamp. The timestamp is the write time; the sequence number starts at 0, or at the last sequence number persisted to the file system. The SequenceFile value is HBase's KeyValue object, i.e., the same KeyValue stored in the corresponding HFile (see the HFile section below).
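On the write path, the WAL behaviour can be chosen per mutation through the Java client's Durability setting. A small sketch (names hypothetical):

```java
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDurabilityExample {
    static Put durablePut() {
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
        // SYNC_WAL (the usual default): the edit is appended to the HLog
        // before the write is acknowledged, so it can be replayed after a
        // RegionServer crash.
        put.setDurability(Durability.SYNC_WAL);
        // Durability.SKIP_WAL would trade that recoverability for write speed.
        return put;
    }
}
```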
HFile
The storage format for KeyValue data in HBase. An HFile is a Hadoop binary-format file. A StoreFile is a lightweight wrapper around an HFile; in other words, the bottom layer of a StoreFile is an HFile.
(Figure: HFile format, including the layout of the Trailer section.)
An HFile is divided into six sections:
1. Data Block section - stores the table data; can be compressed.
2. Meta Block section (optional) - stores user-defined key-value pairs; can be compressed.
3. File Info section - the HFile's metadata; not compressed. Users can also add their own metadata here.
4. Data Block Index section - the index of the Data Blocks. The key of each index entry is the key of the first record in the indexed block.
5. Meta Block Index section (optional) - the index of the Meta Blocks.
6. Trailer - fixed length; stores the offset of every other section. When reading an HFile, the Trailer is read first; it records the starting position of each section (each section's magic number is used as a sanity check), and then the Data Block Index is loaded into memory. Thus, looking up a key does not require scanning the whole HFile: the block containing the key is found in memory, read in with a single disk IO, and then searched for the key. Entries in the Data Block Index are evicted by an LRU mechanism. HFile Data Blocks and Meta Blocks are usually stored compressed, which greatly reduces network and disk IO at the cost of CPU time for compression and decompression.
HFile compression supports two codecs: Gzip and LZO.
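A minimal sketch (names hypothetical) of enabling Gzip compression on a column family so that its HFiles are written compressed:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CompressionExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("t2"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf1"))
                    .setCompressionType(Compression.Algorithm.GZ) // Gzip-compressed HFiles
                    .build())
                .build();
            admin.createTable(desc);
        }
    }
}
```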
Detailed explanation of the HBase distributed storage mechanism
1. As mentioned above, all rows in a table are sorted in the lexicographic order of their row keys.
2. A table is split into multiple Regions along the row dimension.
3. Regions are split by size. Each table starts with a single Region; as data is inserted, the Region keeps growing, and when it reaches a threshold it splits into two new Regions. As the table gains rows, there are more and more Regions.
4. The Region is the smallest unit of distributed storage and load balancing in HBase. "Smallest unit" means different Regions can be distributed over different RegionServers, but one Region is never split across servers.
5. When a Region on a RegionServer grows past the threshold it splits, and the halves continue to grow and split in turn. (A common recommendation is roughly 1000 Regions per RegionServer.)
Getting into the details: by what mechanism does the data of a Region end up stored on HDFS?
6. Although the Region is the smallest unit of distributed storage, it is not the smallest unit of storage. A Region consists of one or more Stores, and each Store holds one column family. Each Store in turn consists of one MemStore and zero or more StoreFiles. StoreFiles are saved on HDFS in HFile format.
// Within a Region, a column family is the physical unit of storage; two different column families are never stored in the same file.
7. Each Store consists of one MemStore and zero or more StoreFiles. The MemStore acts as the in-memory buffer of the whole Store. On a read, the data is taken from the MemStore if it is found there; if the MemStore does not hold the data we want, the StoreFiles are searched. On a write, the data is written to the MemStore first, and after a while the MemStore is flushed to a StoreFile (this is the Region flush: writes go to the MemStore first, and the flush turns the MemStore contents into a StoreFile, i.e. an HFile, ultimately stored on HDFS). // StoreFiles are saved on HDFS in HFile format.
8. When the number of StoreFiles exceeds a configured threshold, a compaction runs, rewriting all the StoreFiles into a single StoreFile (each MemStore flush produces one StoreFile). Compaction can be deferred so that more files accumulate, but the compaction then runs longer, and updates are not flushed to disk while it runs. A long-running compaction needs enough memory to record all the updates that arrive for its duration; if it is too large, clients may time out during the compaction.
9. Compaction also happens at the HFile level: HBase automatically picks up some smaller HFiles and rewrites them into larger ones. This process is called a minor compaction: through rewriting, it uses merge sort to turn many smaller files into fewer, larger files.
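A minimal sketch (table name hypothetical) of forcing a flush and a major compaction through the Admin API, plus the property that controls when minor compactions are considered:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Minor compactions are considered once a Store has at least this
        // many StoreFiles (default 3).
        conf.setInt("hbase.hstore.compactionThreshold", 3);
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName t = TableName.valueOf("t1"); // hypothetical table
            admin.flush(t);        // force MemStore -> StoreFile flush
            admin.majorCompact(t); // rewrite all StoreFiles into one per Store
        }
    }
}
```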
Tip: some materials describe them separately: the BlockCache is the cache for read operations, holding frequently read data in memory, while the MemStore is the cache for write operations.
For now you can treat them as the same kind of thing, whose function is caching; there is no need to dig deeper here.
For an in-depth understanding of the specific differences between the MemStore and the BlockCache, see the answers to questions 1 and 2 below and these two articles:
HBase principle - the belated "data reading process" details: http://hbasefly.com/2017/06/11/hbase-scan-2/
HBase BlockCache series - walking into BlockCache: http://hbasefly.com/2016/04/08/hbase-blockcache-1/
1. The BlockCache is at RegionServer level.
2. A RegionServer has only one BlockCache, and it is initialized when the RegionServer starts.
3. To date, HBase has implemented three BlockCache schemes. LRUBlockCache was the first and is still the default. HBase 0.92 added a second scheme, SlabCache (see HBASE-4027); HBase 0.96 added another official option, BucketCache (see HBASE-7404).
For the specific differences between the MemStore and the BlockCache, see the follow-up posts linked above!
Question 1:
It is often said that an HBase read consults the MemStore, the HFiles, and the BlockCache. Why, then, does the scanner hierarchy contain only a StoreFileScanner and a MemstoreScanner, and no BlockcacheScanner?
Answer:
1. HBase data exists independently only in the MemStore and the StoreFiles.
2. The data in the BlockCache is just a subset of the StoreFile data (the hot data); everything in the BlockCache also exists in some StoreFile.
3. The MemstoreScanner and the StoreFileScanner therefore cover all the data between them.
4. On a read, the BlockCache is checked first; if the block is there, it is used directly, and if not, it is fetched from the StoreFile.
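The BlockCache's role in this read path can be controlled from the client side: a scan can opt out of populating the cache so that a one-off scan does not evict hot blocks. A small sketch (names hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCacheExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t1"))) {
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf1"));
            // Skip inserting scanned blocks into the BlockCache; point Gets
            // usually leave this at its default of true.
            scan.setCacheBlocks(false);
            try (ResultScanner rs = table.getScanner(scan)) {
                for (Result r : rs) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```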
Question 2:
An update (write) first writes the data to the MemStore and flushes it to disk later. After the flush, does the corresponding KV in the BlockCache need to be updated? If not, will readers see stale data?
Answer:
1. There is no need to update the data in the BlockCache.
2. Stale data will not be read, because the write goes into the MemStore and the flush forms a new file.
3. The new file formed from the MemStore and the data in the BlockCache are independent of each other, distinguished by version.
Region management mechanism
(1) Region assignment
At any given time, a Region can be assigned to at most one RegionServer. The Master keeps track of which RegionServers are currently available, which Regions are assigned to which RegionServers, and which Regions are still unassigned. When there is an unassigned Region and a RegionServer has spare capacity, the Master sends a load request to that RegionServer, assigning the Region to it. Once the RegionServer receives the request, it begins serving the Region.
(2) RegionServer online
The Master uses ZooKeeper to track RegionServer status. When a RegionServer starts, it first creates a file representing itself under the server directory on ZooKeeper and acquires an exclusive lock on that file. Because the Master subscribes to change notifications on the server directory, ZooKeeper notifies it in real time whenever files are added or removed there. So the moment a RegionServer comes online, the Master learns about it.
(3) RegionServer offline
When a RegionServer goes offline, its session with ZooKeeper is disconnected and ZooKeeper automatically releases the exclusive lock on the file representing that server. Meanwhile, the Master continuously polls the lock status of the files in the server directory. If the Master finds that a RegionServer has lost its exclusive lock (or the Master has repeatedly failed to communicate with the RegionServer), the Master tries to acquire that lock itself. If it succeeds, it can conclude that either:
1. the network between the RegionServer and ZooKeeper is down, or
2. the RegionServer is dead.
In either case, the RegionServer can no longer serve its Regions, so the Master deletes the file representing it from the server directory and assigns that server's Regions to other RegionServers that are still alive.
If the RegionServer lost its lock only because of a transient network problem, then after it reconnects to ZooKeeper, as long as the file representing it still exists, it keeps trying to re-acquire the lock on that file; once it succeeds, it can continue to provide service.
HMaster working mechanism
(1) HMaster online
The Master performs the following steps on startup:
1. Acquire the unique master lock from ZooKeeper, preventing any other node from becoming the Master.
2. Scan the server directory on ZooKeeper to obtain the list of currently available RegionServers.
3. Communicate with each RegionServer found in step 2 to learn the current mapping between Regions and RegionServers.
4. Scan the collection of .META. Regions, compute which Regions are currently unassigned, and add them to the list of Regions to be assigned.
(2) Master offline
Because the Master only maintains the metadata of tables and Regions and does not take part in table data IO, a Master outage only freezes metadata changes (tables cannot be created or deleted, table schemas cannot be modified, Region load balancing cannot run, Region online/offline cannot be handled, and Regions cannot be merged; the one exception is that Region splits still proceed normally, because only the RegionServers are involved). Reads and writes of table data continue as usual, so a brief Master outage has no impact on the cluster as a whole. As the startup process shows, everything the Master keeps is redundant (it can all be collected or computed from other parts of the system). For this reason, an HBase cluster generally always has one Master serving, plus one or more backup masters waiting for the chance to take over.