In-depth Analysis of HBase Architecture (1)


Background

My company uses the MapR distribution of the Hadoop ecosystem, so I read the article "An In-Depth Look at the HBase Architecture" on MapR's official website. I originally intended to translate it in full, but a literal translation would have been too cumbersome, so most of this article is written in my own words, supplemented with material from other resources and my own reading of the source code. It is therefore half translation, half original work.

Composition of HBase architecture

HBase builds its cluster on a Master/Slave architecture. It belongs to the Hadoop ecosystem and consists of the following types of nodes: the HMaster node, HRegionServer nodes, and a ZooKeeper cluster. At the bottom it stores its data in HDFS, so the NameNode, DataNode, and other HDFS components are involved as well. The overall structure is as follows:

The HMaster node is used to:

Manage the HRegionServers to achieve load balancing among them.

Manage and assign HRegions, for example assigning the new HRegions when an HRegion splits, and migrating its HRegions to other HRegionServers when an HRegionServer exits.

Implement DDL operations (Data Definition Language: creating, deleting, and modifying namespaces, tables, column families, etc.).

Manage the metadata of namespaces and tables (actually stored on HDFS).

Access control (ACL).

The HRegionServer node is used to:

Store and manage the local HRegions.

Read and write HDFS, managing the data of tables.

Serve client reads and writes directly (after the client has obtained the metadata and located the HRegion/HRegionServer where the RowKey resides).

The ZooKeeper cluster is a coordination system used to:

Store the metadata and status information of the entire HBase cluster.

Implement failover between the active and standby HMaster nodes.

The HBase client communicates with the HMaster and HRegionServers through RPC. An HRegionServer can host roughly 1000 HRegions, whose underlying table data lives in HDFS; the data served by an HRegion can be stored on the same DataNode to achieve data locality. Locality is not always maintained, however: when an HRegion is moved (for example after a split), it takes until the next compaction to restore locality.
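As a concrete illustration of the client side of this path, here is a minimal read sketch with the HBase Java client. The table name "user", column family "info", and row key are made up for the example and are not from the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.).
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // The client resolves hbase:meta via ZooKeeper, caches the region
            // location, then talks to the owning HRegionServer directly.
            Get get = new Get(Bytes.toBytes("rowkey-001"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```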

In keeping with the semi-translation spirit, here is another architecture diagram from "An In-Depth Look at the HBase Architecture":

This architecture diagram clearly shows that both the HMaster and the NameNode support multiple hot standbys, coordinated through ZooKeeper. There is nothing mysterious about ZooKeeper: a cluster is usually made up of three machines and internally uses a Paxos-style consensus algorithm, so it can tolerate one of the three servers going down; some deployments use five machines and can then tolerate two simultaneous failures, i.e. fewer than half of the nodes. As the number of machines grows, however, performance declines. HRegionServers and DataNodes generally run on the same server so that data can be kept local.

HRegion

HBase splits a table horizontally into multiple HRegions by RowKey. From the HMaster's point of view, each HRegion records its StartKey and EndKey (the StartKey of the first HRegion is empty and the EndKey of the last HRegion is empty). Because RowKeys are sorted, the client can quickly locate the HRegion each RowKey belongs to. Each HRegion is assigned by the HMaster to an HRegionServer, which is then responsible for opening and managing the HRegion, communicating with clients, and reading and writing its data in HDFS. Each HRegionServer can manage roughly 1000 HRegions at the same time. (Where does this number come from? It does not appear as a limit in the code, so is it just experience, with more than 1000 causing performance problems? It appears the figure comes from the BigTable paper, section 5 "Implementation": "Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server).")
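Since a table is partitioned into HRegions by key ranges, a table can be created pre-split into several HRegions by supplying split keys. The following is an illustrative sketch using the newer (HBase 2.x) client API; the table name, family name, and split keys are made up.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("user"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                    .build();
            // Three split keys produce four HRegions:
            // (-inf, "g"), ["g", "n"), ["n", "t"), ["t", +inf)
            byte[][] splitKeys = {
                    Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```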

HMaster

The HMaster has no single point of failure: multiple HMasters can be started, and ZooKeeper's Master Election mechanism ensures that only one HMaster is Active at any time while the others remain in hot standby. In general two HMasters are started; the non-Active HMasters communicate with the Active HMaster periodically to obtain its latest state and stay up to date, which means that starting many HMasters increases the load on the Active HMaster. As introduced earlier, the HMaster is mainly responsible for HRegion allocation and management, and for DDL (Data Definition Language: table creation, deletion, modification, etc.). It has two main responsibilities:

Coordinate the HRegionServers

Assign HRegions at startup, and reassign them for load balancing and recovery.

Monitor the status of all HRegionServers in the cluster (via heartbeats and by watching their status in ZooKeeper).

Admin functions

Create, delete, and modify table definitions.

ZooKeeper: coordinator

ZooKeeper provides coordination services for the HBase cluster. It tracks the status of the HMaster and HRegionServers (available/alive, etc.) and notifies the HMaster when they go down, so that the HMaster can fail over to another HMaster or repair the set of HRegions on a downed HRegionServer (by assigning them to other HRegionServers). The ZooKeeper cluster itself uses a consensus protocol (Paxos-style) to keep the state of its nodes consistent.

How The Components Work Together

ZooKeeper coordinates the sharing of information among all nodes in the cluster. The HMaster and HRegionServers each create an ephemeral node after connecting to ZooKeeper and use a heartbeat mechanism to keep that node alive. If an ephemeral node expires, the HMaster receives a notification and handles it accordingly.

In addition, the HMaster detects HRegionServers joining and going down by watching the ephemeral nodes in ZooKeeper (default: /hbase/rs/*). When the first HMaster connects to ZooKeeper, it creates an ephemeral node (default: /hbase/master) to represent the Active HMaster, and HMasters that join afterwards watch that node. If the current Active HMaster goes down, the node disappears and the other HMasters are notified; the one that succeeds in becoming Active creates the node itself. Before becoming Active, a standby HMaster creates its own ephemeral node under /hbase/backup-masters/.
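The following is a simplified sketch of this ephemeral-node pattern using the raw ZooKeeper Java client. HBase does all of this internally; the connection string and node path below are made up for the example and do not mirror HBase's real znodes.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (a real client would wait for the
        // connection event before issuing requests; we just sleep briefly).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {});
        Thread.sleep(2_000);

        // "Register" by creating an ephemeral node; ZooKeeper removes it
        // automatically if this process dies and its session expires.
        zk.create("/demo-regionserver-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Another process can watch the node; it is notified when the node
        // disappears, which is how the active HMaster learns of a crash.
        zk.exists("/demo-regionserver-1",
                event -> System.out.println("event: " + event.getType()));

        Thread.sleep(10_000);
        zk.close();
    }
}
```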

The first read and write of HBase

Before HBase 0.96, HBase had two special tables, -ROOT- and .META. (following the design in BigTable). The location of the -ROOT- table was stored in ZooKeeper; it held the RegionInfo of the .META. table and could only ever consist of a single HRegion, while the .META. table stored the RegionInfo of user tables and could be split into multiple HRegions. So on the first access to a user table, the client would first read the HRegionServer of the -ROOT- table from ZooKeeper, then, using the requested TableName and RowKey, read the HRegionServer of the .META. table from that server, then read the contents of the .META. table to learn the location of the HRegion the request needs, and finally access that HRegionServer to get the data. It takes three requests just to find the location within the user table; only the fourth request fetches the real data. Of course, to improve performance the client caches the -ROOT- location and the contents of -ROOT-/.META. As shown in the following figure:

But even with a client-side cache, the initial phase still needs three requests before the real location in the user table is known, which is slow. And is it really necessary to support so many HRegions? It may be for a company like Google, but for an ordinary cluster it hardly seems so. The BigTable paper says that each row of METADATA stores about 1KB of data, and with a moderate tablet (HRegion) size of about 128MB, the three-level location schema can address 2^34 tablets (HRegions). Even with the -ROOT- table removed, it can still address 2^17 (131072) HRegions; at 128MB each that is 16TB, which may not sound big enough, but the maximum HRegion size is now usually set much larger. If we set it to 2GB (for the meta region as well), the supported capacity becomes 4PB, which is enough for ordinary clusters, so the -ROOT- table was removed in HBase 0.96. What remains is a single special catalog table called the meta table (hbase:meta), which stores the locations of all user HRegions in the cluster; the ZooKeeper node /hbase/meta-region-server stores the location of this meta table directly, and like the old -ROOT- table, the meta table is never split. With this design, the client's first access to a user table becomes:

Get the location of hbase:meta (i.e. the HRegionServer hosting it) from ZooKeeper (/hbase/meta-region-server) and cache it.

Query that HRegionServer for the HRegionServer serving the user table's HRegion that covers the requested RowKey, and cache that location as well.

Read the row from the HRegionServer found in the previous step.

From this process we can see that the client caches the location information, but step two only caches the location of the HRegion covering the current RowKey, so if the next RowKey falls in a different HRegion, the client must query the HRegion hosting hbase:meta again. Over time, however, the client accumulates more and more cached locations and rarely needs to consult the hbase:meta table again, unless an HRegion has moved because of a crash or a split, in which case the client re-queries and updates its cache.
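This lookup can also be observed explicitly from the client. The sketch below uses the HBase 2.x client API to ask where a given RowKey lives; under the hood this performs exactly the meta lookup described above and caches the result. The table name and row key are made up.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRegionExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("user"))) {
            // Resolves ZooKeeper -> hbase:meta -> owning HRegionServer, then caches it.
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("rowkey-001"));
            System.out.println("region: " + loc.getRegion().getRegionNameAsString());
            System.out.println("server: " + loc.getServerName());
        }
    }
}
```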

The hbase:meta table

The hbase:meta table stores the location information of all user HRegions. Its RowKey is composed of tableName, regionStartKey, regionId, replicaId, and so on. It has only one column family, info, which contains three columns: info:regioninfo is the RegionInfo in proto format (regionId, tableName, startKey, endKey, offline, split, replicaId); info:server is the server:port of the hosting HRegionServer; info:serverstartcode is the start timestamp of that HRegionServer.
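Because hbase:meta is an ordinary catalog table, it can be scanned like any other table to inspect these columns. A minimal sketch (assuming read access to the catalog table):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanMetaExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table meta = conn.getTable(TableName.META_TABLE_NAME);
             ResultScanner scanner = meta.getScanner(new Scan().addFamily(Bytes.toBytes("info")))) {
            for (Result r : scanner) {
                System.out.println("row:    " + Bytes.toString(r.getRow()));
                System.out.println("server: " + Bytes.toString(
                        r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"))));
                // info:regioninfo is a protobuf-serialized RegionInfo and is not plain
                // text; info:serverstartcode is the server start timestamp as a long.
            }
        }
    }
}
```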

HRegionServer in detail

An HRegionServer generally runs on the same machine as a DataNode to achieve data locality. An HRegionServer contains multiple HRegions and is composed of the WAL (HLog), the BlockCache, MemStores, and HFiles.

The WAL (Write Ahead Log), called HLog in earlier versions, is a file on HDFS. As its name suggests, every write operation first ensures the data is written to this log file before the MemStore is actually updated and the data eventually reaches an HFile. This guarantees that after an HRegionServer crash we can still replay all operations from the log file without losing data. The log file is rolled periodically and old files are deleted (logs whose data has already been persisted into HFiles can be removed). WAL files are stored in the /hbase/WALs/${HRegionServer_Name} directory (before 0.94 they lived under /hbase/.logs/). Normally there is only one WAL instance per HRegionServer, which means all WAL writes on that server are serial (much like log writes in log4j), and this can become a performance bottleneck; HBASE-5699 therefore introduced parallel writes to multiple WALs (MultiWAL) in HBase 1.0, implemented with multiple HDFS pipelines, with writes partitioned by HRegion. For background on the WAL, see Wikipedia's Write-Ahead Logging article. (Incidentally, the English Wikipedia has lately been accessible without trouble; an oversight of a certain GFW, or the new normal?)
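The WAL guarantee can be relaxed per operation from the client side. The sketch below shows the durability knob on a Put; the table/family names and values are illustrative only, and the Put would still need to be submitted through a Table to take effect.

```java
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDurabilityExample {
    public static void main(String[] args) {
        Put put = new Put(Bytes.toBytes("rowkey-001"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));

        // Write and sync the WAL entry before acknowledging (the usual, safe choice).
        put.setDurability(Durability.SYNC_WAL);

        // put.setDurability(Durability.SKIP_WAL);
        // Skipping the WAL is faster, but data that only reached the MemStore is
        // lost if the HRegionServer dies before the next flush.
    }
}
```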

The BlockCache is a read cache. Based on the principle of locality of reference (which also applies to CPUs and is divided into spatial and temporal locality: spatial locality means that if some data is needed at one moment, data near it is likely to be needed soon after; temporal locality means that data accessed once has a high probability of being accessed again in the near future), it pre-loads data into memory to improve read performance. HBase provides two BlockCache implementations: the default on-heap LruBlockCache and the BucketCache (usually off-heap). The raw performance of the BucketCache is usually worse than that of the LruBlockCache, but the LruBlockCache's latency becomes unstable under GC pressure, whereas the BucketCache manages its own memory and needs no GC, so its latency is usually steadier; this is why the BucketCache is sometimes needed. The article BlockCache101 compares the on-heap and off-heap BlockCache in detail.
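Caching behaviour can also be tuned per column family from the client API, as sketched below. Whether the server uses the on-heap LruBlockCache or an off-heap BucketCache is a server-side decision made in hbase-site.xml (e.g. the hbase.bucketcache.ioengine and hbase.bucketcache.size settings), not something this client code controls; the family name here is made up.

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockCacheHintExample {
    public static void main(String[] args) {
        ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("info"))
                .setBlockCacheEnabled(true)   // cache this family's data blocks on read
                .setInMemory(true)            // hint: keep this family's blocks resident longer
                .build();
        System.out.println(cf);
    }
}
```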

An HRegion is the representation, within an HRegionServer, of one Region of a Table. A Table can have one or more Regions, which may live on the same HRegionServer or be spread across different HRegionServers; one HRegionServer hosts multiple HRegions belonging to different Tables. An HRegion consists of multiple Stores (HStores), one per Column Family of the Table in that HRegion; each Column Family is therefore a unit of centralized storage, so columns with similar IO characteristics are best placed in the same Column Family for efficient reads (data locality improves the cache hit rate). The HStore is the core of storage in HBase and implements reading and writing against HDFS. An HStore consists of one MemStore and zero or more StoreFiles.

The MemStore is a write cache (an in-memory sorted buffer). After the WAL entry has been written, all data is written to the MemStore, which flushes the data to the underlying HDFS file (HFile) according to certain rules. Normally each Column Family of each HRegion has its own MemStore.

HFiles (StoreFiles) store the actual HBase data (Cells/KeyValues). The data in an HFile is sorted by RowKey, Column Family, and Column; cells with the same key (all three identical) are sorted by timestamp in descending order.

Although the diagram above shows the latest HRegionServer architecture (though not entirely accurately), I have always preferred the following diagram, even though it depicts the architecture before 0.94.

Data write flow in an HRegionServer

When a client issues a Put request, it first looks up, from the hbase:meta table, the HRegionServer that the Put ultimately needs to reach. The client then sends the Put to that HRegionServer, which first writes the operation to the WAL log file (flushed to disk).

After the WAL entry is written, the HRegionServer finds the target HRegion from the TableName and RowKey in the Put, finds the corresponding HStore from the Column Family, and writes the Put into that HStore's MemStore. At this point the write is considered successful and the client is notified.
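A minimal client-side sketch of this write path is shown below; the table name, family, and values are made up for illustration. The call returns only after the server has appended the entry to the WAL and inserted it into the MemStore.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Put put = new Put(Bytes.toBytes("rowkey-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);   // ack arrives after WAL append + MemStore insert on the server
        }
    }
}
```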

MemStore Flush

The MemStore is an in-memory sorted buffer; each HStore has one MemStore, i.e. one instance per Column Family of each HRegion. Entries are ordered by RowKey, Column Family, and Column ascending, and by Timestamp descending.
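The toy model below illustrates this ordering with a concurrent sorted map and a comparator; it is only a sketch of the idea (HBase's real MemStore stores Cells and uses its own CellComparator), and it assumes Java 16+ for the record syntax.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListMap;

public class MemStoreOrderSketch {
    // Toy key: (row, family, qualifier, timestamp) with illustrative field names.
    record Key(String row, String family, String qualifier, long ts) {}

    public static void main(String[] args) {
        // Row, family, qualifier ascending; timestamp descending (newest first).
        Comparator<Key> order = Comparator.comparing(Key::row)
                .thenComparing(Key::family)
                .thenComparing(Key::qualifier)
                .thenComparing(Comparator.comparingLong(Key::ts).reversed());

        ConcurrentSkipListMap<Key, String> memstore = new ConcurrentSkipListMap<>(order);
        memstore.put(new Key("r1", "info", "name", 100L), "alice");
        memstore.put(new Key("r1", "info", "name", 200L), "alice-v2"); // newer ts sorts first
        memstore.put(new Key("r1", "info", "age", 100L), "30");

        memstore.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```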

Every Put/Delete request is first written to the MemStore; when the MemStore is full it is flushed into a new StoreFile (implemented as an HFile), so one HStore (Column Family) can have zero or more StoreFiles (HFiles). There are three situations that trigger a MemStore flush (the relevant settings are collected in the configuration sketch after this list), and note that the smallest flush unit is the HRegion, not a single MemStore. This is said to be one of the reasons for limiting the number of Column Families, presumably because flushing too many Column Families together causes performance problems, though the exact reason remains to be verified.

When the sum of all MemStore sizes in an HRegion exceeds hbase.hregion.memstore.flush.size (default 128MB), all the MemStores of that HRegion are flushed to HDFS.

When the global MemStore size exceeds hbase.regionserver.global.memstore.upperLimit (default 40% of heap), the MemStores of the HRegions on the current HRegionServer are flushed to HDFS in descending order of MemStore size (whether this means the sum of all MemStores in an HRegion or only its largest MemStore remains to be verified), until overall MemStore usage falls below hbase.regionserver.global.memstore.lowerLimit (default 38% of heap).

When the total WAL size on the current HRegionServer exceeds hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs, the MemStores of the HRegions on this server are flushed to HDFS in chronological order, oldest first, until the number of WAL files drops below hbase.regionserver.maxlogs. The product of these two settings is said to default to 2GB; checking the code, hbase.regionserver.maxlogs defaults to 32 and hbase.regionserver.hlog.blocksize defaults to the HDFS block size. In any case, a flush triggered by exceeding this limit is undesirable and may cause long delays, so the advice given there is: "Hint: keep hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs just a bit above hbase.regionserver.global.memstore.lowerLimit * HBASE_HEAPSIZE." Note also that the description given there is not entirely accurate (even though it is an official document).
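For reference, the flush-related settings discussed above are shown below through the Hadoop Configuration API. This is illustrative only: in practice these are server-side settings placed in hbase-site.xml, and the values merely restate the defaults mentioned in the text.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushConfigExample {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Per-HRegion flush threshold (sum of its MemStores), default 128 MB.
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);

        // Global MemStore pressure: start flushing above the upper limit,
        // keep flushing until usage drops below the lower limit.
        conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.40f);
        conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.38f);

        // WAL-driven flushes are bounded by hlog block size * max number of WAL files.
        conf.setInt("hbase.regionserver.maxlogs", 32);

        System.out.println(conf.get("hbase.hregion.memstore.flush.size"));
    }
}
```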

During a MemStore flush, some metadata is appended to the tail of the StoreFile, including the maximum WAL sequence number at flush time, which tells HBase how much of the data this StoreFile already contains and where recovery should resume. When an HRegion starts up, these sequence numbers are read and the maximum is used as the starting sequence number for subsequent updates.

HFile format

HBase data is stored in HFiles as KeyValues (Cells). HFiles are produced during MemStore flushes; because the cells in the MemStore are already kept in the same order, the flush is a sequential write. Sequential writes to disk perform very well because the disk head does not have to keep seeking.

HFile follows the design of BigTable's SSTable and Hadoop's TFile. HFile has gone through three versions in HBase: V2 was introduced in 0.92 and V3 in 0.98. Let's start with the V1 format:

A V1 HFile consists of multiple Data Blocks, Meta Blocks, a FileInfo, a Data Index, a Meta Index, and a Trailer. The Data Block is the smallest storage unit in HBase, and the BlockCache discussed earlier caches at Data Block granularity. A Data Block consists of a magic number and a series of KeyValues (Cells); the magic number marks this as a Data Block and is used to quickly check the block format and guard against corruption. The Data Block size can be set when the Column Family is created (HColumnDescriptor.setBlockSize()), with a default of 64KB: large blocks favor sequential scans, small blocks favor random reads, so it is a trade-off. Meta Blocks are optional. FileInfo is a fixed-length block recording meta information about the file, such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY, and so on. The Data Index and Meta Index record the actual offset, uncompressed size, and Key (the starting RowKey?) of each Data Block and Meta Block. The Trailer records the starting offsets of the FileInfo, Data Index, and Meta Index blocks, the number of Data Index and Meta Index entries, and so on. FileInfo and Trailer are fixed-length.
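Setting the Data Block size per column family, as mentioned above, looks like the following. The article names HColumnDescriptor.setBlockSize(); the sketch uses the newer ColumnFamilyDescriptorBuilder equivalent, and the family name is made up.

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockSizeExample {
    public static void main(String[] args) {
        ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("info"))
                // Default is 64 KB; larger blocks favor scans, smaller favor random reads.
                .setBlocksize(64 * 1024)
                .build();
        System.out.println(cf.getBlocksize());
    }
}
```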

Each KeyValue in an HFile is a simple byte array, but this byte array contains many parts with a fixed structure. Let's look at the layout:

It begins with two fixed-length values giving the length of the Key and the length of the Value. The Key follows: a fixed-length number giving the RowKey length, then the RowKey, then a fixed-length value giving the Family length, then the Family, then the Qualifier, then two fixed-length values for the Timestamp and the Key Type (Put/Delete). The Value part has no such internal structure; it is pure binary data. Across HFile versions the KeyValue (Cell) format has changed little; V3 merely added an optional array of Tags at the end.
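To make the layout concrete, here is a sketch that decodes the fields just described from a ByteBuffer: keyLen(4) valueLen(4) | rowLen(2) row famLen(1) family qualifier ts(8) type(1) | value. It mirrors the documented layout for illustration only; real code should use HBase's KeyValue/CellUtil classes rather than hand-parsing bytes.

```java
import java.nio.ByteBuffer;

public class KeyValueLayoutSketch {
    public static void parse(ByteBuffer buf) {
        int keyLen = buf.getInt();
        int valueLen = buf.getInt();

        int keyEnd = buf.position() + keyLen;
        short rowLen = buf.getShort();
        byte[] row = new byte[rowLen];
        buf.get(row);
        byte famLen = buf.get();
        byte[] family = new byte[famLen];
        buf.get(family);
        // The qualifier occupies whatever is left of the key before the
        // 8-byte timestamp and the 1-byte key type.
        byte[] qualifier = new byte[keyEnd - buf.position() - 8 - 1];
        buf.get(qualifier);
        long timestamp = buf.getLong();
        byte type = buf.get();          // e.g. a Put or a Delete marker

        byte[] value = new byte[valueLen];
        buf.get(value);

        System.out.printf("row=%s family=%s qualifier=%s ts=%d type=%d value=%s%n",
                new String(row), new String(family), new String(qualifier),
                timestamp, type, new String(value));
    }
}
```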

In practice the HFile V1 format turned out to consume a lot of memory: the Bloom Filter and the Block Index could grow very large, leading to long startup times. The Bloom Filter of a single HFile could grow to 100MB, which hurt query performance because every query had to load and probe it, and a 100MB Bloom Filter introduces significant latency; on the other hand, the Block Indexes on a single HRegionServer could grow to a total of 6GB, all of which had to be loaded at startup, further increasing startup time. To solve these problems, the HFile V2 format was introduced in version 0.92:

In this version the Block Index and Bloom Filter are interleaved with the Data Blocks, a design that also reduces the memory used during writes. In addition, to speed up startup, this version introduces lazy loading: an HFile is parsed only when it is actually used.

The V3 format changes little compared with V2: it adds support for Tag arrays at the KeyValue (Cell) level and adds two Tag-related fields to the FileInfo structure. For a detailed introduction to the evolution of the HFile format, you can refer to the reference here.

Looking at the HFile V2 format in detail, it is a multi-level B+-tree-like index; with this design, a lookup can be completed without reading the whole file:

The Cells inside a Data Block are arranged in ascending order; each block has its own Leaf Index, the last Key of each block is placed in the Intermediate Index, and the Root Index points to the Intermediate Index. At the end of the HFile there is also a Bloom Filter, used to quickly determine that a Row is not present in a Data Block, and TimeRange information, used by queries that filter on time. When an HFile is opened, this index information is loaded into memory to speed up subsequent reads.

This is the end of this article, to be continued.

Reference:

https://www.mapr.com/blog/in-depth-look-hbase-architecture#.VdNSN6Yp3qx

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

http://hbase.apache.org/book.html

http://www.searchtb.com/2011/01/understanding-hbase.html

http://research.google.com/archive/bigtable-osdi06.pdf

The original article is available at: http://www.blogjava.net/DLevin/archive/2015/08/22/426877.html
