In-depth study of HBase hotspotting, HBase table rk (row key) design, and manual region pre-partitioning

2025-07-11 Update. From: SLTechnology News&Howtos (shulou)

Wednesday, 2019-2-20

Recorded on Friday, 2019-1-25

HBase hotspotting:

HBase hotspot resolution (pre-partitioning): https://blog.csdn.net/qq_31289187/article/details/80869906

Three ways to split in HBase and the split process: https://www.cnblogs.com/niurougan/p/3976519.html

082 HBase, several GC strategies (flush, compact, split): http://www.cnblogs.com/juncaoit/p/6170642.html

It explains these operations and how to use the corresponding hbase commands.

081 Region, 4 methods of pre-partitioning: https://www.cnblogs.com/juncaoit/p/6170510.html

What is the HBase hotspot problem

Causes of hotspotting

1. Data in HBase is sorted in lexicographic (dictionary) order of rowkey. When a large number of consecutive rowkeys are written to a single region, data is distributed unevenly across regions.

2. The table was created without pre-partitioning. By default a new table has only one region, so a large amount of data is written to that one region.

3. The table was pre-partitioned, but the rowkey design follows no rule that matches the partitions. The rowkey should be composed as regionNo+messageId.
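The regionNo+messageId composition from item 3 can be sketched as follows. This is a minimal illustration, not the original author's code: the function names, the MD5-based choice of regionNo, and the two-digit zero-padded prefix are all assumptions.

```python
import hashlib


def region_no(message_id, num_regions):
    """Derive a stable region number from the message id (assumed scheme)."""
    digest = hashlib.md5(message_id.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % num_regions


def make_rowkey(message_id, num_regions, width=2):
    """Build rowkey = zero-padded regionNo + messageId."""
    return f"{region_no(message_id, num_regions):0{width}d}{message_id}"


rowkey = make_rowkey("msg-0001", num_regions=10)
```

Because regionNo is computed from the messageId itself, a reader can rebuild the full rowkey for a Get without scanning every region.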

How to resolve hotspotting

The key to solving this problem is to design a rowkey that distributes data evenly. As with the primary key in a relational database, the rowkey is the key used to retrieve records and to access rows in an HBase table. A rowkey can be any string (the maximum length is 64KB; in practice it is usually 10-100 bytes). Inside HBase the rowkey is stored as a byte array, and rows are stored sorted in lexicographic order of rowkey.

Create table command:

create 'testTable', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'snappy', MIN_VERSIONS => '0', TTL => '15552000', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true', METADATA => {'ENCODE_ON_DISK' => 'true'}}, {SPLITS_FILE => '/app/soft/test/region.txt'}
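The SPLITS_FILE option in the command above points to a plain text file with one split key per line. A minimal sketch of generating such content, assuming rowkeys start with a zero-padded two-digit regionNo (the prefix width and the helper names are illustrative assumptions, and nothing here is taken from the original region.txt):

```python
def split_keys(num_regions, width=2):
    """Return the num_regions - 1 boundary keys between zero-padded prefixes."""
    if num_regions < 2:
        return []  # a single region needs no split key
    if num_regions > 10 ** width:
        raise ValueError("width is too small for num_regions")
    return [str(i).zfill(width) for i in range(1, num_regions)]


def splits_file_text(num_regions, width=2):
    """Render the split keys one per line, as a splits file expects."""
    return "\n".join(split_keys(num_regions, width)) + "\n"


# Four regions: (start, "01"), ["01", "02"), ["02", "03"), ["03", end)
text = splits_file_text(4)
```

With N regions there are N - 1 split keys; rows whose prefix is "00" sort before the first key and land in the first region.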

HBase series, HBase hotspotting, data skew and hash design of the rowkey: https://blog.csdn.net/weixin_41279060/article/details/78855679

Pre-partitioning and rowkey hash design: solving data skew and hotspotting

Pre-partitioning lets the table's data be distributed evenly across the cluster, instead of the default in which the table has a single region located on one node. (Number of pre-partitions = a multiple of the number of nodes, estimated from the expected data volume; if that is not enough, the regions will be split further. After pre-partitioning, the rowkeys within each region are still ordered.)

How to pre-partition an HBase table

HBase pre-partition methods: https://www.cnblogs.com/quchunhui/p/7543385.html *

Summary of HBase table design principles: https://blog.csdn.net/m0_37138008/article/details/78985946

Row key (RowKey) design

Rows in HBase are sorted lexicographically by row key. This design optimizes scans, because related rows, or rows that will be read together, can be stored near each other.

However, poorly designed row keys are a common cause of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of the cluster. This traffic may be reads, writes, or other operations. It overwhelms the single machine that hosts the region, which degrades performance and may make the region unavailable. Other regions on the same RegionServer can also be adversely affected, because the host cannot serve the requested load. It is critical to design data access patterns that use the cluster fully and evenly.

Pre-partitioning and rowkey hash design: solving data skew and hotspotting

Pre-partition

A RegionServer can manage 10 to 1000 regions. After version 0.92.x the default region size is 10GB, and the supported range runs from 256MB up to 20GB, so each RegionServer can manage roughly 2.5GB to 20TB of data.

If there are 5 nodes and the amount of data over 3 years is 5TB, the number of pre-partitions can be set as:

5000GB / 10GB = 500 regions

The regions will be distributed evenly among the nodes of the cluster (depending on each machine's performance and storage space). If disk space runs short, add disks; if performance is insufficient, add a new node (a new machine).
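The estimate above can be written as a small calculation. This is a sketch: the function name is hypothetical, and rounding up to a multiple of the node count follows the "number of pre-partitions = a multiple of nodes" rule mentioned earlier rather than anything HBase enforces.

```python
import math


def pre_split_region_count(total_gb, region_gb=10, nodes=1):
    """Estimate the region count, rounded up to a multiple of the node count."""
    regions = math.ceil(total_gb / region_gb)
    return math.ceil(regions / nodes) * nodes


# 5TB over 3 years, 10GB regions, 5 nodes
count = pre_split_region_count(5000, region_gb=10, nodes=5)
per_node = count // 5
```

For the figures in the text this gives 500 regions, or 100 per node.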

Rowkey length principle (preferably no more than 16 bytes)

A rowkey is a binary byte stream. Many developers recommend a rowkey length of 10 to 100 bytes, but the shorter the better: ideally no more than 16 bytes.

The reasons are as follows:

(1) The persistent data file, HFile, stores data as KeyValue entries. If the rowkey is too long, for example 100 bytes, then for 10 million cells the rowkeys alone occupy 100 * 10 million = 1 billion bytes, nearly 1GB, which greatly reduces the storage efficiency of HFile.

(2) MemStore caches part of the data in memory. If the rowkey field is too long, the effective utilization of memory falls, the system cannot cache as much data, and retrieval becomes less efficient. So the shorter the rowkey, the better.

(3) Current operating systems are mostly 64-bit, with 8-byte memory alignment. Keeping the rowkey at 16 bytes, an integer multiple of 8 bytes, makes the best use of this property.
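As a rough check on reasons (1) and (3), the rowkey overhead can be computed directly. The helper names below are hypothetical, and the calculation ignores compression and block encoding, which would reduce the on-disk figure.

```python
def rowkey_overhead_bytes(rowkey_len, num_cells):
    """Bytes spent on rowkeys alone when every KeyValue repeats the rowkey."""
    return rowkey_len * num_cells


def is_8_byte_aligned(rowkey_len):
    """True when the rowkey length is an integer multiple of 8 bytes."""
    return rowkey_len % 8 == 0


CELLS = 10_000_000
long_key = rowkey_overhead_bytes(100, CELLS)   # 100-byte rowkey
short_key = rowkey_overhead_bytes(16, CELLS)   # 16-byte rowkey
```

A 100-byte rowkey costs 1,000,000,000 bytes across 10 million cells, while a 16-byte rowkey costs 160,000,000 bytes.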

Rowkey hashing principle

Use a hash of the primary key as the head (prefix) of the rowkey.

Rowkey uniqueness principle

The rowkey must be designed to be unique. Rowkeys are stored sorted in lexicographic order, so the design should take full advantage of this ordering: store data that is frequently read together in adjacent rows, and keep data that is likely to be accessed soon together.

Timestamp reversal

If multiple versions of the data need to be retained, use a reversed timestamp as part of the rowkey: append Long.MAX_VALUE - timestamp to the end of the key, giving [key][reverse_timestamp]. The latest value of [key] is then the first record returned by a scan on [key], because rowkeys in HBase are ordered and the most recently written record sorts first.

The complete rowkey (the timestamp is optional, depending on the business), where + means concatenation:

rowkey = hash(primary key) + (Long.MAX_VALUE - timestamp)
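A minimal sketch of this rowkey layout, under stated assumptions: the hash is MD5 truncated to 8 bytes (truncation trades some collision resistance for length, so uniqueness must still be checked for real data), the reversed timestamp is an 8-byte big-endian long, and the function names are hypothetical. The result is 16 bytes, in line with the length principle above.

```python
import hashlib
import struct

LONG_MAX = 2 ** 63 - 1  # value of Java's Long.MAX_VALUE


def reverse_timestamp(ts_millis):
    """Newer timestamps produce smaller values, so they sort first."""
    return LONG_MAX - ts_millis


def make_versioned_rowkey(primary_key, ts_millis):
    """rowkey = 8-byte hash prefix + 8-byte big-endian reversed timestamp."""
    prefix = hashlib.md5(primary_key.encode("utf-8")).digest()[:8]
    return prefix + struct.pack(">q", reverse_timestamp(ts_millis))


older = make_versioned_rowkey("user42", 1000)
newer = make_versioned_rowkey("user42", 2000)
```

Both keys share the same 8-byte prefix, and the newer one compares lower byte by byte, so a scan over that prefix returns the latest version first.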

Author: boat824109722

Source: CSDN

Original: https://blog.csdn.net/weixin_41279060/article/details/78855679

Copyright notice: this is an original article by the blogger; please include a link to the blog post when reprinting.

Summary of rk design 1:

1. First, plan the size of the HBase table and calculate a reasonable number of regions.

2. Rk length (preferably no more than 16 bytes).

3. Rk hashing principle (use a hash of the primary key as the head of the rk; the hash here can be understood as a randomly assigned prefix added in front of the rk).

4. Rk uniqueness principle (keep frequently read data together, and keep data likely to be accessed soon in the same block).

5. Set the number of versions to 3 if outdated data is not very important.

Summary of rk design 2:

When designing row keys, try to spread writes across multiple regions at the same time rather than into a single region (avoiding HBase hotspotting). One way is to add a randomly assigned prefix to the front of the rk so that rows are distributed to different regions (salting). Sequential keys put data that arrives out of order into sorted order and concentrate the load on one machine. So try to avoid row keys such as timestamps or sequences (e.g. 1, 2, 3); that is, reduce monotonically increasing row keys and time-series keys.
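The effect of salting can be illustrated by mapping rowkeys to regions with a sorted list of split keys. This is a sketch with hypothetical split keys and helper names. It also swaps the random prefix for a deterministic one (sequence number modulo the bucket count) so that a reader can recompute the prefix; a truly random salt would force reads to check every bucket.

```python
import bisect

SPLIT_KEYS = ["1", "2", "3"]  # hypothetical boundaries, giving 4 regions


def region_index(rowkey, split_keys=SPLIT_KEYS):
    """Index of the region whose [start, end) key range contains rowkey."""
    return bisect.bisect_right(split_keys, rowkey)


def plain_key(seq):
    """Monotonically increasing key, zero-padded to 10 digits."""
    return f"{seq:010d}"


def salted_key(seq, buckets=4):
    """Prefix the key with a bucket number derived from the sequence."""
    return f"{seq % buckets}{plain_key(seq)}"


plain_regions = [region_index(plain_key(s)) for s in range(8)]
salted_regions = [region_index(salted_key(s)) for s in range(8)]
```

All eight sequential keys land in region 0, while the salted keys rotate through regions 0 to 3.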

Rules of thumb for table schema

1. Aim for a region size between 10 and 50GB.

2. A cell should not exceed 10MB, or 50MB if you use MOB (medium-sized object storage); otherwise, consider storing the cell data in HDFS and storing a pointer to it in HBase.

3. A typical schema has 1 to 3 column families per table.

4. For a table with only 1 or 2 column families, 50 to 100 regions is an appropriate number. Remember that a region is a contiguous segment of a column family.

5. Keep column family names as short as possible, because the column family name is stored with every value (ignoring prefix encoding). They should not be self-documenting and descriptive as in a typical RDBMS.

6. If you store time-based machine data or log information, and the row key is built from a device ID or server ID plus the time, you end up with a pattern in which older regions receive no further writes beyond a certain age. In this case there is a small number of active regions and a large number of old regions with no new writes. Because resource consumption comes only from the active regions, a larger number of regions can be tolerated.

Most of the time, slight inefficiencies don't make a big difference. Unfortunately, here they cannot be ignored: column families, attributes, and row keys are repeated hundreds of millions of times in the data.

1. Column family: keep the column family name as short as possible, preferably one character (e.g. "d" for data/default).

2. Attribute: a verbose attribute name (for example, "myVeryImportantAttribute") is easier to read, but it is better to store a shorter attribute name (e.g., "via") in HBase.

3. Row key length: keep row keys as short as is reasonable while still being useful for the required data access (e.g., Get vs. Scan). A short key that is useless for data access is no better than a longer key with better get/scan properties. Designing row keys requires tradeoffs.

4. Byte patterns: a long is 8 bytes. An unsigned number up to 18446744073709551615 fits in those 8 bytes. Stored as a string, at one byte per character, the same number takes nearly three times as many bytes.

Row keys never change: a row key cannot be changed. The only way to "change" it is to delete the row and then re-insert it. This is a frequently asked question, so take care to get the row keys right from the start (and/or before inserting a lot of data).
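The size difference in item 4 can be checked with a short sketch. It uses Python's struct module only to illustrate the byte counts; the variable names are illustrative, and a Java client would typically produce the 8-byte form with its own byte utilities.

```python
import struct

MAX_UNSIGNED_LONG = 2 ** 64 - 1  # 18446744073709551615

# Fixed-width binary form: always 8 bytes, big-endian.
as_binary = struct.pack(">Q", MAX_UNSIGNED_LONG)

# String form: one byte per decimal digit.
as_text = str(MAX_UNSIGNED_LONG).encode("ascii")

binary_len = len(as_binary)
text_len = len(as_text)
```

The binary form takes 8 bytes and the string form takes 20 bytes for the same value.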
