How does HBase determine which region the data is written to?


2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how HBase determines which region data is written to. The content is straightforward and easy to follow, and we hope it clears up any doubts you have on the topic.

In HBase, a table is divided into 1..n regions, which are hosted on RegionServers. Each region has two important attributes, StartKey and EndKey, which define the range of rowKeys the region maintains. When we read or write data, the rowKey falls within the start-end key range of exactly one region, so we locate that region and read or write the data there. It is a bit like grouping people by age: children aged 1-15, young people aged 16-39, the middle-aged 40-64, and the elderly 65 and over. (These numbers are made up purely for illustration.) When someone needs to find their group, they look up the range that contains their age.
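The range lookup described above can be sketched in a few lines. This is a minimal illustration rather than HBase's client code: the start keys are made up, `locate_region` is a hypothetical helper, and an empty start key stands for "no lower bound". It assumes a region owns every rowKey from its own start key (inclusive) up to the next region's start key (exclusive), with keys compared byte by byte.

```python
from bisect import bisect_right

# Hypothetical region start keys, sorted ascending; b"" means "no lower bound".
START_KEYS = [b"", b"g", b"n", b"t"]


def locate_region(row_key: bytes, start_keys=START_KEYS) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key strictly greater than row_key;
    # the region just before that position owns the key.
    return bisect_right(start_keys, row_key) - 1
```

With these assumed boundaries, `locate_region(b"apple")` selects the first region and `locate_region(b"zebra")` the last one, just as a person's age selects an age group.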

By default, when we create a table with HBaseAdmin and nothing more than a TableDescriptor, the table starts with a single region. That region's start and end keys are empty, so it has no boundaries and accepts any rowKey. As more data is written, the region keeps growing. Once it passes a certain size threshold, HBase decides that the region should not keep absorbing data as it is, finds a midKey, and splits the region into two. This process is called a region split, and the midKey is the boundary between the two new regions.

How is the midKey found? There is more to it than we will cover here. As a rough mental model, think of it as the rowKey of the row at position (total rows / 2) in the region, although the real procedure is somewhat more complicated.
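The split behaviour can be modelled with a toy table. This sketch follows the simplified mental model from the text, not HBase's actual split logic: it assumes a row-count threshold stands in for the real size threshold, and it takes the middle row of the region as the midKey. The class `ToyTable` and its methods are hypothetical names used only for illustration.

```python
from bisect import bisect_right, insort


class ToyTable:
    """Toy model: regions[i] is a sorted list of rowKeys starting at start_keys[i]."""

    def __init__(self, max_rows_per_region: int):
        self.max_rows = max_rows_per_region
        self.start_keys = [b""]  # a single region with no boundaries
        self.regions = [[]]

    def put(self, row_key: bytes) -> None:
        i = bisect_right(self.start_keys, row_key) - 1  # locate the region
        region = self.regions[i]
        if row_key not in region:
            insort(region, row_key)  # keep rows sorted by rowKey
        if len(region) > self.max_rows:
            self._split(i)

    def _split(self, i: int) -> None:
        region = self.regions[i]
        mid = len(region) // 2
        mid_key = region[mid]  # simplified midKey: the middle row
        # The lower half keeps the old start key; the upper half starts at midKey.
        self.regions[i:i + 1] = [region[:mid], region[mid:]]
        self.start_keys.insert(i + 1, mid_key)
```

For example, with a threshold of 4 rows, inserting the keys `a` through `e` triggers one split, and the two resulting regions meet at `c`.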

If we create tables with the defaults and keep issuing Puts, and, worse still, our rowKeys increase sequentially, the shortcomings are obvious.

The first is write hotspotting. Every new rowKey is larger than all the previous ones, and HBase keeps rows sorted in ascending rowKey order, so every write lands in the region with the largest start key.

Second, as a consequence of the write hotspot, regions that were split off earlier never receive writes again. They are left out in the cold, each only about half full, which is a poor distribution.
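Both effects show up in a small simulation. As before, this is a sketch under stated assumptions and not a model of HBase internals: a row-count threshold replaces the real size threshold, the midKey is simply the middle row, and `simulate` is a hypothetical helper.

```python
from bisect import bisect_right, insort


def simulate(keys, max_rows):
    """Insert keys into a toy table; return (rows per region, writes that hit the last region)."""
    start_keys, regions = [b""], [[]]  # one region with no boundaries
    writes_to_last = 0
    for key in keys:
        i = bisect_right(start_keys, key) - 1
        if i == len(regions) - 1:
            writes_to_last += 1
        insort(regions[i], key)
        if len(regions[i]) > max_rows:
            mid = len(regions[i]) // 2
            start_keys.insert(i + 1, regions[i][mid])  # midKey becomes a new boundary
            regions[i:i + 1] = [regions[i][:mid], regions[i][mid:]]
    return [len(r) for r in regions], writes_to_last


# Sequentially increasing, zero-padded rowKeys.
sequential = [b"%06d" % n for n in range(20)]
sizes, hits = simulate(sequential, max_rows=4)
```

With 20 sequential keys and a threshold of 4 rows, every one of the 20 writes goes to whichever region is last at the time, and each of the 8 older regions is left holding 2 rows, half of the threshold.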

Third, in write-heavy scenarios where data grows quickly, the number of splits grows quickly as well. A split is time-consuming and resource-intensive, so we do not want it to happen often.

Given these shortcomings, and to get better parallelism in a clustered environment, we want good load balance so that every node handles a roughly equal share of requests. We also want regions not to split too often, because a split makes the affected region unavailable for a period of time. How can we achieve that?

The answer is random hashing plus pre-partitioning, and the two work best together. Pre-partitioning creates a set of regions up front when the table is built, each maintaining its own start and end keys. Combined with randomly hashed rowKeys, writes are spread evenly across these pre-built regions, which addresses the shortcomings above and greatly improves performance.
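Pre-partitioning comes down to choosing the region boundaries before any data arrives. The sketch below computes evenly spaced split keys under one assumption that is not in the original text: every rowKey begins with a fixed number of lowercase hex digits (for example, a hash prefix). `hex_split_keys` is a hypothetical helper; with the real client, keys like these would be supplied when the table is created, and that call is not shown here.

```python
def hex_split_keys(num_regions: int, width: int = 2) -> list:
    """Evenly spaced split keys for rowKeys that start with `width` hex digits."""
    space = 16 ** width  # number of distinct prefixes, e.g. 256 for two digits
    keys = []
    for i in range(1, num_regions):
        boundary = space * i // num_regions
        keys.append(f"{boundary:0{width}x}".encode("ascii"))
    return keys
```

For 4 regions and two-digit prefixes this yields the boundaries `40`, `80` and `c0`, so each region covers a quarter of the prefix space.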

Sometimes hashing the rowKey in a way that suits the business scenario is all we need. A hashed rowKey that is larger than the start key of the first region but smaller than the start key of the second region is sent to the first region, and so on for the others. Hashing does not guarantee that sequentially generated rowKeys are spread perfectly evenly over all regions, but it does give a roughly uniform distribution.
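The effect of hashing can be checked with a quick count. This sketch makes several assumptions that are not in the original text: the table is pre-split at the hex boundaries `40`, `80` and `c0`, the first two hex digits of a SHA-256 digest are prepended to the original key, and the names `hashed_row_key` and `distribution` are hypothetical. Any reasonably uniform hash would serve the same purpose.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

# Assumed pre-split boundaries; b"" is the start key of the first region.
START_KEYS = [b"", b"40", b"80", b"c0"]


def hashed_row_key(original: bytes) -> bytes:
    """Prefix the key with two hex digits of its hash so neighbouring keys scatter."""
    prefix = hashlib.sha256(original).hexdigest()[:2].encode("ascii")
    return prefix + b"_" + original


def distribution(n: int) -> Counter:
    """Count how many of n sequential keys land in each region after hashing."""
    counts = Counter()
    for i in range(n):
        key = hashed_row_key(b"%010d" % i)
        counts[bisect_right(START_KEYS, key) - 1] += 1
    return counts
```

Calling `distribution(4000)` spreads the sequential keys over all four regions in roughly equal shares instead of piling them into the last one. The trade-off is that rows are no longer stored in their original order, so a range scan over consecutive original keys has to touch every region.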

That is all for "How does HBase determine which region the data is written to?" Thank you for reading! We hope the article has given you a clearer understanding of the topic.
