What is the HBase architecture?

2025-07-09 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)05/31 Report--

This article explains what the HBase architecture looks like. The editor found it practical and is sharing it here for reference.

HBase is a non-relational database commonly used in Apache Hadoop clusters. It is an open source, distributed, multi-version, column-oriented database.

Its source code is at https://github.com/apache/hbase, so it is genuinely open source.

It is distributed because its data is ultimately stored on HDFS, so it inherits Hadoop's distributed nature (I am not sure whether that is the right way to understand it).

I will not go into multi-versioning here. In short, a cell can keep several versions of its value, each identified by a timestamp.

Column-oriented storage is one of the biggest differences between HBase and traditional relational databases. A value is located by its rowKey, column family, qualifier, and timestamp (if you keep multiple versions).
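To make the addressing concrete, here is a minimal sketch of a table as a nested lookup. ToyTable and its methods are hypothetical names for illustration only and are not part of the HBase client API; the only assumption taken from the text is that a value is found by rowKey, column family, qualifier, and optionally a timestamp.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of HBase cell addressing; not the HBase client API."""

    def __init__(self):
        # (row_key, column_family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row_key, family, qualifier, timestamp, value):
        self._cells[(row_key, family, qualifier)][timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self._cells.get((row_key, family, qualifier))
        if not versions:
            return None
        if timestamp is None:
            # No timestamp given: return the newest version.
            return versions[max(versions)]
        return versions.get(timestamp)


table = ToyTable()
table.put("row1", "info", "name", 1, "alice")
table.put("row1", "info", "name", 2, "alice2")
latest = table.get("row1", "info", "name")
older = table.get("row1", "info", "name", timestamp=1)
```

Asking without a timestamp returns the newest version, while passing a timestamp picks out one specific version of the same cell.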

HBase also uses a master-slave server architecture, divided into the HBase Master server and the HRegionServers.

HBase Master server: the master server, mainly responsible for managing the HRegionServers. My personal understanding is that every decision concerning an HRegionServer is made by it.

The specific functions are:

1. Handling users' create, delete, modify, and query operations on tables.

2. Load balancing across HRegionServers by adjusting the distribution of HRegions. If an HRegionServer fails, the master takes back the HRegions on the failed HRegionServer, marks them as unassigned, and then assigns them to a live HRegionServer. Of course, it first has to check whether that HRegionServer can accept them.

3. After an HRegion splits, assigning the new HRegions. The default HRegion size is given here as 64 MB (the threshold is configurable through hbase.hregion.max.filesize, and its default differs between HBase versions). When an HRegion exceeds this size, it automatically splits into two. The split is very fast because it first creates two new HRegions that only hold references to the original HRegion. Once the data has actually been divided between the two new HRegions, the references are removed and the original HRegion is deleted. At that point the split is done, but the new HRegions cannot do anything on their own. They have to be managed by an HRegionServer, so the HBase Master server assigns them again, as in item 2.

4. After an HRegionServer stops, reassigning the HRegions that were on it. The service has gone down, but the HRegions have not. An HRegion can be thought of as a storage folder: if the service hosting it dies, the HBase Master server finds another one, which then carries on saving or querying the data.
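The reassignment described in items 2 and 4 can be sketched roughly as follows. This is a toy model under my own assumptions: ToyMaster is a hypothetical class, and "assign to the live server with the fewest regions" is a simplification I chose for illustration, not HBase's actual balancing policy.

```python
class ToyMaster:
    """Toy model of the master's region bookkeeping; not HBase code."""

    def __init__(self, servers):
        # server name -> list of region names it currently hosts
        self.assignments = {name: [] for name in servers}

    def assign(self, region):
        # Pick the live server hosting the fewest regions (ties broken by name).
        target = min(self.assignments,
                     key=lambda s: (len(self.assignments[s]), s))
        self.assignments[target].append(region)
        return target

    def server_failed(self, server):
        # Take back the failed server's regions, then hand each one
        # to a server that is still alive.
        unassigned = self.assignments.pop(server)
        return {region: self.assign(region) for region in unassigned}


master = ToyMaster(["rs1", "rs2", "rs3"])
for region in ["r1", "r2", "r3", "r4"]:
    master.assign(region)
moved = master.server_failed("rs1")
```

The regions themselves are never lost in this model; only their hosting server changes, which matches the "storage folder" picture in item 4.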

The composition of HRegionServer:

PS: this point may be disputed, because some diagrams on the Internet show HLog only inside HRegion. That material is relatively old, so here I follow the book "hadoop practice". It covers version 0.92, which is probably fairly close to the current 0.96 version.

HLog keeps the log of users' operations on HBase (it may also include the master's HRegion management operations; I will check that tomorrow). Users' operations are recorded in HLog first and then saved to HRegion.

HRegion is where the actual data is stored. It contains multiple HStores.

HStore: each column family forms an HStore, which in turn consists of a MemStore and multiple HFiles.

MemStore resides in memory. When data is saved, it is first stored in MemStore and then written to HFile according to the explicit or implicit write mode set by the user. The default is implicit. I will introduce this later when I write about the client API. One more thing to note here: if the server crashes after data has reached MemStore but before it has been saved to HFile, HLog comes into play. The user's operations were recorded in HLog, so they are replayed and saved to MemStore again, and the remaining steps can then complete.
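Here is a small sketch of that crash-and-replay idea. ToyRegion, its flush threshold, and the use of plain dicts as "files" are all assumptions made for illustration; real HLog and HFile formats are far more involved.

```python
class ToyRegion:
    """Toy model of the HLog -> MemStore -> HFile write path; not HBase code."""

    def __init__(self, flush_threshold=3):
        self.hlog = []       # log of (key, value) edits not yet flushed
        self.memstore = {}   # in-memory buffer, lost on a crash
        self.hfiles = []     # flushed files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.hlog.append((key, value))   # 1. record the edit in the log first
        self.memstore[key] = value       # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the MemStore as a new file; flushed edits no longer need the log.
        self.hfiles.append(dict(self.memstore))
        self.memstore.clear()
        self.hlog.clear()

    def crash(self):
        # Everything held only in memory disappears.
        self.memstore = {}

    def recover(self):
        # Replay the log to rebuild whatever had not reached a file.
        for key, value in self.hlog:
            self.memstore[key] = value


region = ToyRegion()
region.put("a", 1)
region.put("b", 2)
region.crash()
lost = dict(region.memstore)
region.recover()
restored = dict(region.memstore)
```

After the simulated crash the MemStore is empty, and replaying the log brings both edits back, so a later flush can still write them to a file.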

HFile is responsible for storing the actual data and is the smallest storage unit in HBase. It can also be split, known as partitioning, which spreads the data out and makes reads more efficient.

Now to make up the part that was left unfinished yesterday.

The two main structures of HBase have been introduced, so let's move on to its storage flow and read flow.

Let's first talk about the role of ZooKeeper:

It stores the address of the ROOT table and the address of the HMaster. Storing the ROOT address makes it faster to find out which table holds the data and improves efficiency. The HMaster address is stored to determine which HMaster is available.

It manages the HMaster. When the HMaster fails, another HMaster can be found to take over, avoiding a single point of failure.
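The failover role can be pictured with a tiny sketch. ToyCoordinator is a hypothetical stand-in and not the ZooKeeper API; "the earliest registered master is the active one" is a simplifying assumption of mine.

```python
class ToyCoordinator:
    """Toy stand-in for ZooKeeper's role here; not the ZooKeeper API."""

    def __init__(self):
        self.candidates = []   # registered masters, in registration order

    def register(self, master):
        self.candidates.append(master)

    def active_master(self):
        # The earliest registered live master is treated as the active one.
        return self.candidates[0] if self.candidates else None

    def master_failed(self, master):
        # Drop the failed master and report whoever is active now.
        self.candidates.remove(master)
        return self.active_master()


zk = ToyCoordinator()
zk.register("hmaster-a")
zk.register("hmaster-b")
first = zk.active_master()
after_failure = zk.master_failed("hmaster-a")
```

Clients only ever ask the coordinator for the current address, so they do not need to know in advance which HMaster is alive.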

ROOT has the same structure as .meta and Region: all of them save data in the form of key-value pairs.

ROOT stores the address of each .meta together with its start and end information (for example, 1-5 indicates that the addresses of five .meta entries are stored in ROOT).

.meta likewise stores the address of each HRegion together with its start and end information.
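The two-level lookup can be sketched as two sorted catalogs, one pointing at the other. ToyCatalog, find_region, and the sample keys and addresses are all invented for illustration. I also assume each range runs from its start key up to the next start key, which is how I read "start and end information".

```python
import bisect


class ToyCatalog:
    """Sorted start keys mapped to locations; a toy stand-in for ROOT or .meta."""

    def __init__(self, entries):
        # entries: list of (start_key, location)
        self.entries = sorted(entries, key=lambda entry: entry[0])
        self.start_keys = [start for start, _ in self.entries]

    def locate(self, row_key):
        # Find the last entry whose start key is <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        if index < 0:
            raise KeyError(row_key)
        return self.entries[index][1]


meta_a = ToyCatalog([("", "region-1@rs1"), ("g", "region-2@rs2")])
meta_b = ToyCatalog([("n", "region-3@rs1"), ("t", "region-4@rs3")])
root = ToyCatalog([("", meta_a), ("n", meta_b)])


def find_region(row_key):
    meta = root.locate(row_key)    # ROOT says which .meta covers the key
    return meta.locate(row_key)    # .meta says which HRegion holds the row
```

For example, find_region("horse") first lands on meta_a through ROOT and then resolves to "region-2@rs2".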

All right, now let's analyze the flow from the perspective of a client reading data:

To read data, the client first checks its own cache (in HBase this cache holds region locations). If the entry is there, it is used directly. If not, the client goes to ZooKeeper and finds the address of the ROOT table that covers the data.

Using the address stored in the ROOT table, it finds .meta, and from .meta it finally finds the HRegion. The HRegion checks MemStore first. If the data is not there, it looks for the data in HFile and caches it in memory (in HBase this read cache is the BlockCache) for later reads.

Finally, the data is returned to the client for display.
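The read steps above can be sketched as follows. ToyReader is a hypothetical class, the locate_region function stands in for the whole ZooKeeper, ROOT, and .meta walk, and checking flushed files from newest to oldest is my own assumption about the order.

```python
class ToyReader:
    """Toy client-side read path; not the HBase client API."""

    def __init__(self, locate_region, regions):
        self.locate_region = locate_region  # stands in for the catalog walk
        self.regions = regions              # name -> (memstore, hfiles oldest first)
        self.location_cache = {}            # row key -> region name
        self.catalog_lookups = 0

    def get(self, row_key):
        if row_key not in self.location_cache:
            # Cache miss: do the (expensive) catalog walk once and remember it.
            self.catalog_lookups += 1
            self.location_cache[row_key] = self.locate_region(row_key)
        memstore, hfiles = self.regions[self.location_cache[row_key]]
        if row_key in memstore:
            return memstore[row_key]
        for hfile in reversed(hfiles):      # newest flushed file first
            if row_key in hfile:
                return hfile[row_key]
        return None


regions = {
    "region-1": ({"apple": "v3"}, [{"apple": "v1"}, {"avocado": "v2"}]),
    "region-2": ({}, [{"pear": "v1"}, {"pear": "v2"}]),
}
reader = ToyReader(lambda key: "region-1" if key < "n" else "region-2", regions)
```

Reading the same row twice triggers only one catalog walk, which is the point of the client-side cache.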

Storage data flow:

Because the default flush mode in HBase is implicit flushing, each put() is automatically saved to the HRegion. When you process data in batches, however, the data is first held in a cache on the client side. With implicit flushing turned off, your put() data stays in the client cache and does not reach the HRegion until you invoke the flush command. I will post the specific commands when I am at the office tomorrow; I have no environment at home.
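Until the real commands are posted, the difference between the two modes can be sketched like this. ToyWriter and its auto_flush flag are hypothetical names, and the server dict merely stands in for the HRegion side.

```python
class ToyWriter:
    """Toy model of implicit vs. explicit flushing; not the HBase API."""

    def __init__(self, auto_flush=True):
        self.auto_flush = auto_flush
        self.buffer = []   # puts held on the client
        self.server = {}   # stands in for the HRegion side

    def put(self, key, value):
        self.buffer.append((key, value))
        if self.auto_flush:
            self.flush()   # implicit flush: every put is sent immediately

    def flush(self):
        # Explicit flush: send everything buffered so far in one batch.
        for key, value in self.buffer:
            self.server[key] = value
        self.buffer.clear()


implicit = ToyWriter(auto_flush=True)
implicit.put("row1", "a")

explicit = ToyWriter(auto_flush=False)
explicit.put("row1", "a")
explicit.put("row2", "b")
pending = len(explicit.buffer)
explicit.flush()
```

With implicit flushing each put() costs one round trip; with it turned off, several puts travel together, at the price of being invisible to the server until the flush.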

Storage in the HRegion: data to be written is first written to HMemcache (called MemStore above) and HLog. HMemcache provides the in-memory cache, while HLog keeps the transaction log that synchronizes HMemcache and HStore. When a cache flush is initiated, the data is persisted to HStore and HMemcache is cleared.

Note the following:

When HBase writes data, it first writes to Memcache and records the operation in Log, and only finally writes to HStore. If a system exception occurs while writing to HStore, the data can be recovered from Log and rewritten into HStore.
