
What is the data distribution model in HBase?


This article explains what the data distribution model in HBase looks like. The content is meant to be straightforward and easy to follow, and I hope it helps answer your questions. Let's study the HBase data distribution model together.

A brief introduction to data distribution

The root of distribution is "scale", which can be understood as growing demand for computation and storage. When a single machine can no longer carry the growing computation and storage requirements, the system has to be expanded. There are usually two ways to expand: improving single-machine capability (scale up) and adding machines (scale out, horizontal scaling). Limited by hardware technology, single-machine capability has an upper limit at any given stage, whereas horizontal scaling is theoretically unlimited, and it is also cheaper and easier to put into practice. Horizontal scaling can absorb rapid business growth simply by "adding machines", which makes it an almost indispensable capability of modern distributed systems. For explosive growth, horizontal scaling is practically the only option.

For a storage system, data that used to live on one machine must now be stored across multiple machines. Two problems then have to be solved: sharding and replication.

Sharding, also known as partitioning, splits a dataset "reasonably" into multiple shards, with each machine responsible for several of them. This breaks through the capacity limit of a single machine and also improves overall access capability. In addition, sharding reduces the blast radius of a single shard failure.

Data replication (replica), also known as "copies". Sharding alone cannot solve the problem of data loss on a single machine, so redundancy is needed to make the system highly available. The replica mechanism is also an important means of improving system throughput and dealing with hotspots.

Sharding and replication are orthogonal: we can use either one alone or both together, but usually both are used, because sharding solves the problems of scale and scalability while replicas solve reliability and availability. A production-ready system needs both.

From the user's / client's point of view, sharding and replication boil down to the same problem: request routing, that is, which machine a request should be sent to for processing.

When reading data, a mechanism is needed to ensure that an appropriate shard / replica serves the request.

When writing data, the same mechanism ensures that the write lands in the right place and that the replicas stay consistent.

Whether the client talks to servers directly (as in HBase/Cassandra) or goes through a proxy (such as gateway-based access on public clouds), request routing is a problem every distributed system must solve.

Whether sharding or replication, both are embodiments of data distribution. Let's take a look at HBase's data distribution model.

Data Distribution Model of HBase

HBase data is sharded per table, split at row granularity by rowkey range. Each shard is called a region. A cluster has multiple tables, each table is divided into multiple regions, and each server serves many regions. Hence the HBase server is called RegionServer, or RS for short. RS and tables are orthogonal: the regions of one table are distributed across multiple RS, and one RS hosts regions of multiple tables. As shown in the following figure:

"at behavioral granularity" means that a row is the smallest unit divided by region, that is, a row of data belongs to either A region or Bregion and will not be split into two region. (the way to split rows is "vertical split", which usually can only be done at the business level, while HBase is split horizontally)

HBase's replication is implemented by the underlying HDFS. Replication is therefore decoupled from sharding, and storage is separated from computation. This lets a region move flexibly between RS without any data migration, which gives HBase second-level scaling capability and great flexibility.

For a single table, a "good" data distribution means that each region holds a similar amount of data, serves a similar number of requests (throughput), and each machine hosts roughly the same number of regions. The table's data and traffic are then spread evenly across the whole cluster, yielding the best resource utilization and service quality, that is, load balancing. When the cluster scales out or in, we want this "balance" to be maintained automatically. If the distribution fails to achieve load balancing, a heavily loaded machine can easily become the bottleneck of the whole system: its slow responses may leave most client threads waiting on it, dragging down overall throughput. Load balancing is therefore an important goal of region partitioning and scheduling.

This involves three levels of load balancing:

Logical distribution of data, that is, region partition / distribution, is a mapping problem from rowkey to region.

Physical distribution of data: the scheduling problem of region on RS

The distribution of access: that is, the distribution of system throughput (requests) on each RS, involving the relationship between the amount of data and visits, access hotspots and so on.

As you can see, locating a row of data (finding the RS that serves it) involves two levels of routing: one from rowkey to region, the other from region to RS. This is the key to HBase's flexible scheduling and second-level scaling, and we will discuss it in detail later. This article covers only the first two issues; the third will be discussed in a follow-up article.

Region partition based on rowkey range

First, let's look at the logical distribution of data, that is, how a table is divided into multiple regions.

Regions are partitioned at row granularity; a region is a set of consecutive rows of the table. Since a row is uniquely identified by its rowkey, a region can be understood as a set of contiguously sorted rowkeys, which is why this approach is called partitioning by rowkey range.

The rowkey range a region is responsible for is a left-closed, right-open interval, so the start key of one region equals the end key of the previous region. Note that the first region has no start key and the last region has no end key. Together, all the regions of a table cover the entire rowkey space. As shown in the following figure:

In the figure above, region1 is the first region and has no startKey; region3 is the last region and has no endKey. The region distribution in the figure is fairly uniform, that is, each region holds the same number of rows. So how do we obtain such a distribution? In other words, how are region boundaries determined?

Generally speaking, regions come into being in three ways:

Pre-splitting at table creation: pre-partition the rowkey range based on a prediction of how regions should be laid out.

Region splitting: manual splits, or automatic splits when certain conditions are met (for example, the region size exceeds a threshold).

Region merging: manual merges.

If no region layout is specified when the table is created, HBase creates just one region, and that region can be served by only one machine (the case of one region being served by multiple RS is discussed later). The throughput ceiling of that region is then the throughput ceiling of a single machine. If, through reasonable pre-splitting, the table is divided into 8 regions spread over 8 RS, the table's throughput ceiling becomes that of 8 machines.

Therefore, to give a table good throughput and performance from the start, tables in production are usually created with pre-splitting. There are exceptions, however: when the rowkey range cannot be predicted in advance, or cannot easily be divided evenly, you can create the table with a single region, let the system split it on its own, and gradually arrive at a "uniform" region distribution.
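
For illustration, here is a minimal sketch of pre-splitting at table creation with the HBase Java Admin API (the table name "employee", the column family "cf", and the split points are all made-up examples); seven split keys yield eight regions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                TableDescriptor desc = TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("employee"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        .build();
                // 7 split keys -> 8 regions: (-inf, 10), [10, 20), ..., [70, +inf)
                byte[][] splitKeys = {
                        Bytes.toBytes("10"), Bytes.toBytes("20"), Bytes.toBytes("30"),
                        Bytes.toBytes("40"), Bytes.toBytes("50"), Bytes.toBytes("60"),
                        Bytes.toBytes("70")
                };
                admin.createTable(desc, splitKeys);
            }
        }
    }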

For example, consider a table storing employee information for multiple companies, with the rowkey composed of orgId + userId, where orgId is the company's id. Since head counts vary widely between companies, it is hard to decide how many orgId values fit into one region. In this case you can create the table with a single region and import the initial data; as the data comes in, regions split automatically and usually converge to a reasonable distribution. If company staffing later changes dramatically, regions can be split or merged at any time to regain the best distribution.
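
A hedged sketch of such follow-up maintenance through the Admin API (HBase 2.x-style calls; the table name and split point are hypothetical, and the merge assumes the two chosen regions are adjacent):

    import java.util.List;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.RegionInfo;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegionMaintenance {
        // assumes an open Connection; the table name and split point are made up
        static void splitAndMerge(Connection conn) throws Exception {
            TableName tn = TableName.valueOf("employee");
            try (Admin admin = conn.getAdmin()) {
                // manually split a region at an explicit rowkey boundary
                admin.split(tn, Bytes.toBytes("org42"));

                // manually merge two adjacent regions back together (forcible = false)
                List<RegionInfo> regions = admin.getRegions(tn);
                admin.mergeRegionsAsync(
                        regions.get(0).getEncodedNameAsBytes(),
                        regions.get(1).getEncodedNameAsBytes(),
                        false).get();
            }
        }
    }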

Dictionary order and rowkey comparison

As mentioned in the previous section, a region's rowkey range is a left-closed, right-open interval, and every rowkey falling inside it belongs to that region. Making that judgment requires comparing the rowkey against the region's start and end keys. Beyond deciding region ownership, the rowkey comparison rule is also what orders rowkeys inside a region.
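
To make the interval semantics concrete, here is a minimal sketch of such a membership check using the Bytes utility shipped with the HBase client (the boundary values are made up; an empty key stands for "no boundary", as with the first and last region):

    import org.apache.hadoop.hbase.util.Bytes;

    public class RegionRangeCheck {
        // true if rowkey falls in [startKey, endKey); an empty byte[] means "no boundary"
        static boolean inRegion(byte[] rowkey, byte[] startKey, byte[] endKey) {
            boolean afterStart = startKey.length == 0
                    || Bytes.compareTo(rowkey, startKey) >= 0;   // left-closed: start key included
            boolean beforeEnd = endKey.length == 0
                    || Bytes.compareTo(rowkey, endKey) < 0;      // right-open: end key excluded
            return afterStart && beforeEnd;
        }

        public static void main(String[] args) {
            byte[] start = Bytes.toBytes("bbb");   // hypothetical region boundaries
            byte[] end = Bytes.toBytes("ddd");
            System.out.println(inRegion(Bytes.toBytes("bbb"), start, end));   // true
            System.out.println(inRegion(Bytes.toBytes("ddd"), start, end));   // false
        }
    }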

Many people assume rowkey comparison is too simple to deserve discussion. But it is precisely this simplicity that makes it flexible and versatile, and gives HBase unlimited possibilities. The rowkey comparison rule is arguably the core of the whole HBase data model; it directly shapes request routing, the read/write path, rowkey design, and the use of scan, running through all of HBase. For users, a deep understanding of this rule and its applications helps with good table design and with writing correct and efficient scans.

An HBase rowkey is a sequence of binary data, a byte[] in Java, and it is the unique identifier of a row. Business primary keys may come in all kinds of data types, so two problems need to be solved:

Converting the various actual data types to and from byte[].

Order preservation: the sort order of the byte[] rowkeys must be consistent with the sort order of the original data.

Comparing rowkeys means comparing byte[] values, and the comparison is lexicographic (binary order). Put simply, it behaves like the memcmp function in C. The following examples illustrate this comparison rule and the data type conversions through their sort results.

(1) Comparing numbers encoded as ASCII: "1234" -> 0x31 32 33 34, "5" -> 0x35. As numbers, 1234 > 5, but in dictionary order "1234" < "5".

(2) ASCII strings with a common prefix: "1234" -> 0x31 32 33 34, "1234\0" -> 0x31 32 33 34 00. In C, a string normally ends with a 0 byte. Although the two strings share the same prefix, the second carries an extra trailing 0x00 byte, so the second one is "larger".

(3) Comparing positive and negative numbers: int 100 -> 0x00 00 00 64, int -100 -> 0xFF FF FF 9C. As numbers, 100 > -100, but in the binary representation 100 < -100.
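
A small sketch that reproduces the three comparisons above using the Bytes utility from the HBase client library; every printed line is expected to be true:

    import org.apache.hadoop.hbase.util.Bytes;

    public class DictOrderDemo {
        public static void main(String[] args) {
            // (1) "1234" vs "5": the first bytes differ (0x31 < 0x35), so "1234" sorts first
            System.out.println(Bytes.compareTo(Bytes.toBytes("1234"), Bytes.toBytes("5")) < 0);

            // (2) same prefix, but an extra trailing 0x00 byte makes the longer key "larger"
            byte[] a = {0x31, 0x32, 0x33, 0x34};
            byte[] b = {0x31, 0x32, 0x33, 0x34, 0x00};
            System.out.println(Bytes.compareTo(a, b) < 0);

            // (3) int 100 vs int -100: bytes are compared as unsigned, so 0xFF... sorts after 0x00...
            System.out.println(Bytes.compareTo(Bytes.toBytes(100), Bytes.toBytes(-100)) < 0);
        }
    }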

We can summarize this comparison rule as follows:

Bytes are compared one by one from left to right, and the result of the first differing byte decides the comparison of the two byte[] values.

Each byte is compared as an unsigned number.

"Absent" is smaller than "present": a key that is a prefix of a longer key sorts before it.

Common rowkey encoding pitfalls:

Signed numbers: in binary representation, the leading bit of a negative number is 1, so under dictionary order negative numbers sort after positive ones. If a rowkey field can hold both positive and negative values, the sign bit must be flipped so that positive values still compare greater than negative ones.

Descending order: time is usually stored as a long and often needs to be ordered newest first. If the original value is v, its descending encoding is Long#MAX_VALUE - v.

Let's get a feel for how this comparison rule is applied through a prefix-scan example.

Example: prefix scan

An HBase rowkey can be understood as a single primary-key column. If the business needs several columns to form a joint primary key (also called a multi-column, composite, or compound key), those columns have to be concatenated into one, generally by simply joining their binary encodings. For example:

rowkey layout: userId + ts

For simplicity, assume userId and ts are both fixed-length, one byte each.

Now the task is to find all data with userId = 2. This is a typical prefix-scan scenario, and we need to construct a Scan with the correct range [startRow, stopRow); like a region boundary, a scan range is a left-closed, right-open interval.

A straightforward idea is to take the smallest and largest ts, concatenate each with userId = 2, and use the result as the range, i.e. [0x02 00, 0x02 FF). But because the scan range is left-closed and right-open, the row 0x02 FF would not be returned, so this approach does not work.

A correct scan range must satisfy:

startRow: smaller than every rowkey with userId = 2, and larger than every rowkey with userId = 1.

stopRow: larger than every rowkey with userId = 2, and smaller than every rowkey with userId = 3.

So how do we use the rowkey ordering rule to "find" such a scan range?

The correct scan range is [0x02, 0x03).

0x02 is smaller than any rowkey with userId = 2, because it lacks the ts column ("absent" is smaller than "present"). Similarly, 0x03 is larger than any rowkey with userId = 2 and smaller than any rowkey with userId = 3. As you can see, to implement a prefix scan you can derive the required startRow and stopRow from the prefix value alone, without knowing anything about the columns that follow or their meaning.
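
Here is a hedged sketch of that prefix scan with the HBase client API (HBase 2.x-style Scan; the table name is hypothetical). Newer clients also provide Scan#setRowPrefixFilter, which derives this kind of stop row from the prefix automatically:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixScanDemo {
        // assumes an open Connection; returns every row whose rowkey starts with 0x02
        static void scanUser2(Connection conn) throws Exception {
            Scan scan = new Scan()
                    .withStartRow(new byte[] {0x02})   // smaller than any (userId = 2, ts) rowkey
                    .withStopRow(new byte[] {0x03});   // exclusive, larger than any userId = 2 rowkey
            try (Table table = conn.getTable(TableName.valueOf("user_events"));
                 ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toStringBinary(r.getRow()));
                }
            }
        }
    }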

Please study this example carefully, and then think about how to construct startRow and stopRow for the following scenarios (see the end of the article for the answers).

where userid = 2 and ts >= 5 and ts < 20

where userid = 2 and ts > 5 and ts < 20

where userid = 2 and ts > 5 and ts <= 20

where userid > 2 and userid < 4

There are also the following combined scenarios:

Where userid in (3,5,7,9)

Where userid = 2 and ts in (10,20,30)

By now you can feel the difficulty and pain of using scan. In the example above there are only two fixed-length columns; in real business there may be more columns, with various data types and rich query patterns, and constructing a correct, efficient scan becomes genuinely hard. Why do these problems exist? Is there a systematic solution?

Formally, this is the problem of "how to translate business query logic into HBase query logic", which is in essence a mapping from the relational table model to the KV model. HBase only provides an API at the KV level, so users have to implement the conversion between the two models themselves. That is where all the difficulties above come from. And not only HBase: every KV storage system faces the same dilemma when confronted with complex business models.

The way out of this problem is SQL on NoSQL. There are many such solutions in the industry (such as Hive and Presto); the one built on top of HBase is Phoenix. These systems introduce SQL to solve the usability problem of NoSQL. Traditional relational databases, on the other hand, have strong SQL and transaction support but limited scalability and performance: to address performance, MySQL offers KV access via Memcached; to address scalability, a variety of NewSQL products have emerged, such as Spanner/F1, TiDB, and CockroachDB. NoSQL systems are adding SQL while SQL systems are exposing KV, and one can only imagine what future storage and database systems will look like. That topic is far too large for this article, so we will leave it here.

Metadata Management and routing of region

We discussed earlier how splitting a table's rows into reasonable regions yields regions with similar data volumes, and how routine operational actions (region splits and merges) keep the distribution even as the system runs. That takes care of the logical, even division of the data. In this section we look at how regions are distributed across RS and how clients locate a region.

Because a region's rowkey range is arbitrary or even subjective (regions may be split manually), there is no mathematical formula that computes which region a rowkey belongs to (in contrast to consistent-hash sharding). Range-based sharding therefore needs a metadata table that records how each table is divided into regions and what each region's start and end rowkeys are. This metadata table is the meta table; in HBase 1.x it is named "hbase:meta" (in 0.94 and earlier there were two metadata tables, -ROOT- and .META.).

Let's walk briefly through region location using a Put operation (a client-side sketch follows the steps below):

On ZK, find the RS that hosts the meta table (cached).

Query the meta table for the region containing the rowkey and the RS hosting that region (cached).

Send the Put request to that RS; the RS executes the write against the named region.

If the RS finds that the region is not hosted there, it throws an exception and the client re-routes.
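
From the application's point of view these steps happen inside the client library. Below is a minimal sketch (the table name, column family, and rowkey are hypothetical) of a Put, together with a RegionLocator call that shows where a rowkey is routed:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutAndLocate {
        public static void main(String[] args) throws Exception {
            TableName tn = TableName.valueOf("employee");
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(tn);
                 RegionLocator locator = conn.getRegionLocator(tn)) {
                // the client resolves hbase:meta -> region -> RS internally and caches the result
                Put put = new Put(Bytes.toBytes("org01|user42"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // inspect which region / RegionServer this rowkey routes to
                HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("org01|user42"));
                System.out.println(loc.getRegion().getRegionNameAsString() + " @ " + loc.getServerName());
            }
        }
    }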

Whether reading or writing, the region-location logic is the same. To reduce pressure on the meta table, the client caches region locations and only goes back to the meta table for fresh information when the cache turns out to be wrong. HBase request routing is thus a routing-table-based solution; by contrast, consistent-hash sharding computes the location directly.

This routing-table-based approach has:

Advantages: the RS that owns a region can be changed at will; in other words, region scheduling on RS is flexible and open to manual intervention.

Disadvantages: the meta table is a single point, and its limited throughput caps the cluster size and the number of clients.

Flexible region scheduling, combined with the storage-compute-separated architecture, gives HBase extremely powerful capabilities:

Second-level scaling: newly added RS can take production traffic immediately as regions are moved onto them, without waiting for data migration (data is migrated slowly afterwards).

Manual isolation: a problematic region (hotspot, abnormal requests) can be moved by hand to a dedicated RS to quickly confine the failure domain (see the sketch after this list).
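
As an illustration of manual isolation, here is a hedged sketch using the Admin API (HBase 2.x-style move signature; the encoded region name and the target server below are made up):

    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IsolateRegion {
        // assumes an open Connection; region and server names below are hypothetical
        static void moveHotRegion(Connection conn) throws Exception {
            try (Admin admin = conn.getAdmin()) {
                byte[] encodedRegionName = Bytes.toBytes("1588230740abcdef1234567890abcdef");
                ServerName target = ServerName.valueOf("isolated-rs.example.com,16020,1700000000000");
                // only the region assignment changes; no data is migrated
                admin.move(encodedRegionName, target);
            }
        }
    }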

Neither of these can be achieved by most consistent-hash-based sharding schemes. Of course, the price HBase pays for this flexibility is a complex meta table management mechanism, the key issue being that the meta table is a single point. For example, a large number of clients requesting region locations can push the meta table to a high load, which caps the overall throughput of location lookups and in turn limits the cluster size and the number of clients.

For a cluster of several hundred machines and a few hundred thousand regions, this mechanism works well. But when the cluster grows further and the meta table's access ceiling is reached, blocked access to the meta table starts to affect service. Of course, the vast majority of business scenarios never reach this critical scale.

There are many ways to relieve the meta table, and the simplest is replication. TiDB's PD service, for example, allows a location request to be sent to any PD server.

Scheduling of region

Let's now discuss two region scheduling problems:

Load balancing of regions across RS

Scheduling the same region on multiple RS

For the first problem, HBase's default balancing strategy is to make the number of regions hosted by each RS as even as possible, calculated per table.

This strategy assumes that data volume is spread fairly evenly across regions and that each region receives a similar number of requests. Under those assumptions the strategy is very effective, and it is the most widely used one today. HBase also provides a load-based balancer (StochasticLoadBalancer) that takes many factors into account when making scheduling decisions, but production cases and data for it are still scarce.
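
For reference, a small hedged sketch of how an operator can toggle and trigger the balancer through the Admin API (HBase 2.x-style calls; selecting StochasticLoadBalancer itself is done through the hbase.master.loadbalancer.class configuration):

    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;

    public class BalancerOps {
        // assumes an open Connection
        static void runBalancer(Connection conn) throws Exception {
            try (Admin admin = conn.getAdmin()) {
                admin.balancerSwitch(true, false);   // enable the balancer, don't wait synchronously
                boolean triggered = admin.balance(); // ask the master to run one balance round
                System.out.println("balance round triggered: " + triggered);
            }
        }
    }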

For the second problem, a region is served by only one RS at any given time, which is how HBase provides strongly consistent semantics: once a write succeeds, the data can be read back immediately. The cost is that the region is scheduled at a single point: if the server hosting it jitters for any reason, the region's quality of service suffers. We can divide the events that affect region service into two categories:

Unexpected: downtime recovery, GC, network problems, disk jitter, hardware problems, etc.

Predictable (or human-initiated): region moves caused by scaling out / in, region split/merge, etc.

When these events occur, they affect the region's service to a greater or lesser degree. The crash scenario is the worst: from ZK detecting the node failure, through re-assign, log splitting, and log replay of its regions, the whole sequence usually takes more than a minute, and during that time none of the regions on the failed node can serve requests.

The remedy is once again replication: let a region be scheduled on multiple RS, with the client choosing one of them to access, a feature called "region replica". Introducing replicas inevitably brings extra cost and consistency problems. At present the implementation still falls short in areas such as reducing MTTR, memory watermark control, and stale reads, so the feature has not yet been used at scale in production.

That is the whole of "What is the data distribution model in HBase?". Thank you for reading! I hope the content has been helpful; if you want to learn more, you are welcome to follow the industry information channel.
