Understanding of HBase architecture

2025-07-09 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/01 Report--

1. HBase cluster architecture

First, HBase is a component of the Hadoop ecosystem. Hadoop has many components, and almost all of them rely on Hadoop's two core pieces: the HDFS file system and MapReduce. HBase is no exception: it stores its data on HDFS, and its tables can be read and written by MapReduce jobs.

HBase is a non-relational database system. It can be easier to understand by comparing it with a relational database such as MySQL.

(Figure omitted. The original picture is quoted from Baidu Encyclopedia.)

HBase can be set up in three ways: local (standalone) mode, pseudo-distributed mode, and cluster mode. For learning the characteristics of HBase, a pseudo-distributed setup is generally enough. The difference between pseudo-distributed mode and cluster mode is where the daemons run: in pseudo-distributed mode all daemons of the HBase system run on one physical node, whereas in cluster mode they are distributed across different physical nodes.

In both pseudo-distributed and cluster mode, the system consists of the same main parts: HMaster, HRegionServer, ZooKeeper, and the client. Whether we use HBase from the command line, the Java API, or some other means, we send commands to HBase through the client.

A brief look at the role of each part. In HDFS, the NameNode maintains the metadata of the whole system and interacts with the client to carry out overall functions such as managing and controlling the worker nodes. HMaster is different: it does not interact much with the client. In most cases the client first communicates with ZooKeeper to find out which servers to contact, and then reads and writes data on the region servers directly. Part of the metadata is likewise maintained by ZooKeeper, including the address of each region server.
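The addressing role that ZooKeeper plays for the client can be sketched as a toy model. This is only an illustration under simplifying assumptions, not HBase client code: the classes `ToyZooKeeper` and `ToyClient`, the server names, and the region start keys are all hypothetical, and the real lookup involves more steps than a single in-memory table.

```python
import bisect


class ToyZooKeeper:
    """Hypothetical stand-in for ZooKeeper: it only holds addressing data."""

    def __init__(self, region_start_keys, region_servers):
        # Sorted start key of each region and the server hosting that region.
        self.region_start_keys = region_start_keys
        self.region_servers = region_servers

    def addressing_entry(self):
        return self.region_start_keys, self.region_servers


class ToyClient:
    """Hypothetical client: asks ZooKeeper where to go, never the master."""

    def __init__(self, zk):
        self.start_keys, self.servers = zk.addressing_entry()

    def locate(self, row_key):
        # The region whose start key is the greatest one <= row_key owns the row.
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.servers[idx]


zk = ToyZooKeeper([b"", b"g", b"p"], ["rs-1", "rs-2", "rs-3"])
client = ToyClient(zk)
print(client.locate(b"apple"))  # rs-1
print(client.locate(b"zebra"))  # rs-3
```

Note that no master object appears anywhere in the sketch, which mirrors the point that ordinary reads and writes do not go through HMaster.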

Here is a simple summary of the division of labor between the two:

Zookeeper

Ensure that there is only one running master in the cluster at any time

Store the addressing entry point for all regions

Monitor the status of region servers in real time, and notify the Master whenever a region server comes online or goes offline

Store the HBase schema, including which tables exist and which column families each table has

Master

Multiple HMaster instances can be started, and ZooKeeper's Master Election mechanism ensures that there is always one Master running

Assign regions to region servers

Balance the load across region servers

Detect failed region servers and reassign the regions they were hosting
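The division of labor summarized above can be illustrated with a small simulation. It is a sketch under stated assumptions rather than how HBase implements these duties: `elect_master` and `reassign_regions` are hypothetical helpers, the "first registered candidate wins" rule is a simplification of the election, and "least-loaded server first" is just one plausible balancing policy.

```python
def elect_master(candidates):
    """Toy election (assumption): the first registered candidate becomes active."""
    return candidates[0] if candidates else None


def reassign_regions(assignments, failed_server):
    """Move the regions of a failed server onto the least-loaded live servers.

    Assumes at least one live server remains after the failure.
    """
    orphaned = assignments.pop(failed_server, [])
    for region in orphaned:
        # Pick the live server with the fewest regions; break ties by name.
        target = min(assignments, key=lambda s: (len(assignments[s]), s))
        assignments[target].append(region)
    return assignments


# Two masters are started; only one is active at a time.
masters = ["master-a", "master-b"]
active = elect_master(masters)
masters.remove(active)          # the active master goes away
standby = elect_master(masters)  # the remaining one takes over

# A region server fails and its regions are redistributed.
assignments = {"rs-1": ["r1", "r2"], "rs-2": ["r3"], "rs-3": ["r4", "r5"]}
reassign_regions(assignments, "rs-3")
print(active, standby)
print(assignments)
```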

2. Storage mode and structure of HBase

The storage mode of HBase can be described from two aspects: logical storage and physical storage. (For example, the two-dimensional tables in MySQL are the logical storage structure of a relational database, while the form in which those tables are actually laid out on the hard disk is the physical storage.)

The logical storage method of Hbase:

Tables are sparse. You can approach them by analogy with MySQL's relational tables, but they are not really the same thing. A table involves several concepts: row key, column family, column name (qualifier), and timestamp. When a table is defined, its table name and column families must be given. Every row has the same fixed set of column families, but under each column family any column name can be used when data is inserted. A row key plus a column family and column name identifies one cell in the table, and a cell can hold several values, with different versions distinguished by timestamp. All the information in the table is stored as raw bytes. HBase does not support a range of data types the way relational tables do, so if you want to use different data types, you have to encode and decode them in your own program.
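As a rough illustration of this logical model, the following sketch represents a table as nested maps: row key, then "family:qualifier", then timestamp, then a bytes value. The `ToyTable` class and the table, family, and column names are hypothetical and exist only for this example; type handling is done by the caller with `struct`, since the table itself only sees bytes.

```python
import struct
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory model of a sparse, versioned table."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # fixed when the table is defined
        # row key -> "family:qualifier" -> {timestamp: value bytes}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][f"{family}:{qualifier}"][ts] = value

    def get(self, row, family, qualifier, ts=None):
        versions = self.rows.get(row, {}).get(f"{family}:{qualifier}", {})
        if not versions:
            return None  # sparse: a missing cell simply does not exist
        if ts is None:
            ts = max(versions)  # default to the newest version
        return versions.get(ts)


table = ToyTable("users", ["info"])
# The program, not the table, decides that "age" is a 4-byte big-endian int.
table.put(b"user1", "info", "age", struct.pack(">i", 30), ts=1)
table.put(b"user1", "info", "age", struct.pack(">i", 31), ts=2)
table.put(b"user2", "info", "email", b"u2@example.com", ts=1)

latest_age = struct.unpack(">i", table.get(b"user1", "info", "age"))[0]
old_age = struct.unpack(">i", table.get(b"user1", "info", "age", ts=1))[0]
print(latest_age, old_age)                    # 31 30
print(table.get(b"user2", "info", "age"))     # None
```

Column names such as `age` and `email` are chosen freely at insert time, while writing to a family that was not declared up front raises an error, which matches the fixed-family, free-qualifier rule described above.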

The physical storage method of Hbase:

The data of HBase is managed by HMaster but actually served by the HRegionServers. A sparse table is stored sorted by row key. Each table is split by row range into multiple regions, which are hosted on different region servers, and one region server can host multiple regions. Each region corresponds to one contiguous range of rows in the table. Within a region, the rows are further divided into separate files by column family, with each column family stored in its own files, which is why HBase is called a column-oriented database. These files are ultimately stored in HDFS, that is, on the hard disk in the form of blocks.
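The two-level split described above, first by row range into regions and then by column family into separate files, can be sketched as follows. This is a simplified model with hypothetical names (`layout`, the sample cells, and the single split key `row-k`); real region boundaries and file formats are far more involved.

```python
import bisect
from collections import defaultdict


def layout(cells, split_keys):
    """Group (row, family, qualifier, value) cells by (region index, family).

    split_keys holds the sorted start keys of every region after the first,
    so N split keys produce N + 1 regions.
    """
    stores = defaultdict(list)
    for row, family, qualifier, value in sorted(cells):  # sorted by row key first
        region = bisect.bisect_right(split_keys, row)
        stores[(region, family)].append((row, qualifier, value))
    return dict(stores)


cells = [
    (b"row-m", "info", "name", b"M"),
    (b"row-a", "info", "name", b"A"),
    (b"row-a", "stats", "hits", b"\x00\x01"),
    (b"row-z", "stats", "hits", b"\x00\x09"),
]
stores = layout(cells, split_keys=[b"row-k"])
for key in sorted(stores):
    print(key, stores[key])
```

Each `(region, family)` key stands for one group of files, so reading only the `info` family of a row never touches the data stored for `stats`.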

Tables are stored in lexicographic order of the row key. In short, row keys are compared byte by byte from left to right, so the ordering follows the byte (ASCII) values rather than any numeric meaning. For this reason, when designing a row key it is best not to use data such as a timestamp directly as the row key: newly generated rows all have timestamps in the same range, so they are crowded onto a single region server. A common approach is to put a hash value in front of the timestamp, giving a row key of the form (hash value:timestamp).
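A small sketch can make both points concrete: byte-wise ordering and the effect of a hash prefix. The helper `salted_key`, the choice of MD5, and the bucket count of 4 are assumptions made for illustration; the text only prescribes the general (hash value:timestamp) shape.

```python
import hashlib


def salted_key(timestamp_ms, buckets=4):
    """Build a 'bucket:timestamp' row key (hypothetical scheme for illustration)."""
    ts = str(timestamp_ms).encode("ascii")
    # Derive a small, stable bucket number from a hash of the timestamp.
    bucket = int(hashlib.md5(ts).hexdigest(), 16) % buckets
    return b"%02d:%s" % (bucket, ts)


# Byte-wise ordering is not numeric ordering.
print(sorted([b"9", b"10", b"2"]))  # [b'10', b'2', b'9']

# 100 consecutive millisecond timestamps.
timestamps = range(1700000000000, 1700000000100)
plain_keys = [str(t).encode("ascii") for t in timestamps]
salted_keys = [salted_key(t) for t in timestamps]

# Plain keys share one leading prefix, so they sort next to each other;
# salted keys are spread over several prefixes.
plain_prefixes = {k[:2] for k in plain_keys}
salted_prefixes = {k[:2] for k in salted_keys}
print(len(plain_prefixes), len(salted_prefixes))
```

The trade-off of this design is that rows are no longer stored in time order, so a scan over a time range has to query every bucket prefix and merge the results.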
