How does HBase design rowkey?

2025-07-01 Update From: SLTechnology News&Howtos

This article explains how to design rowkeys in HBase.

Rowkeys in HBase are sorted in lexicographic order, and a lookup by rowkey can respond in milliseconds even over tens of millions of rows. However, a poorly designed rowkey often leads to a very common problem: hot spots. A hot spot occurs when a large number of client requests (reads or writes) are directed at only one node in the cluster, or at a very small number of nodes.

The way to avoid hot spots is to distribute rowkeys as evenly as possible across all regions. Here are several common ways to design a rowkey:

First: add salt (salting)

Salting means prepending random data to the rowkey so that rows are spread across different RegionServers as much as possible.

Suppose you have the rowkeys below, and the table is pre-split with one region per letter: the prefix "a" is one region, the prefix "b" is another, and so on. In this table, every rowkey that starts with "f" lands in the same region. For example:

foo0001

foo0002

foo0003

foo0004

To spread them across four different regions, you can salt them with four different prefixes: a, b, c, and d. After salting, the rowkeys look like this:

a-foo0003

b-foo0001

c-foo0004

d-foo0002

(Note: since writes now go to four regions, write throughput is in theory up to four times what it was when every write went to the same region.)

When new rows are written later, each rowkey again gets a randomly chosen prefix and so lands in one of the different regions.

Disadvantages: salting largely avoids hot spots and improves write efficiency, but because the salt value is chosen at random, reads incur extra overhead. How to read salted data will be covered later.
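To make the read overhead concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: a TreeMap stands in for the sorted HBase table, and the names SaltedRowkeyDemo, salt, and read are hypothetical, not part of any HBase API. Because the writer does not record which salt it picked, a point read has to try every possible prefix.

```java
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

public class SaltedRowkeyDemo {
    // Salt prefixes from the example above, one per target region.
    static final String[] SALTS = {"a", "b", "c", "d"};

    // Prepend a randomly chosen salt, e.g. "foo0001" -> "c-foo0001".
    static String salt(String rowkey) {
        String prefix = SALTS[ThreadLocalRandom.current().nextInt(SALTS.length)];
        return prefix + "-" + rowkey;
    }

    // The salt is random, so a point read probes each prefix in turn
    // until one of the candidate keys is found.
    static String read(TreeMap<String, String> table, String rowkey) {
        for (String prefix : SALTS) {
            String value = table.get(prefix + "-" + rowkey);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: a sorted in-memory map stands in for the HBase table.
        TreeMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 4; i++) {
            String rowkey = String.format("foo%04d", i);
            table.put(salt(rowkey), "value-" + i);
        }
        System.out.println(table.keySet());          // prefixes vary per run
        System.out.println(read(table, "foo0003"));  // prints "value-3"
    }
}
```

In this sketch one logical read costs up to four lookups, one per salt. That is the trade-off salting makes: cheaper, better-distributed writes in exchange for more expensive reads.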

Second: hash (hashing)

Many hashing algorithms are available, and MD5 is probably the one most commonly used in rowkey design. Note, however, that MD5 hashes can still collide: the probability is very small, but not zero.

Therefore, when you hash a rowkey with MD5, also append a unique field. For example, with an account field named account: compute the MD5 of the account, take the first 6 characters of the result, and then concatenate the account field itself, that is:

substr(md5(account), 0, 6) + account

In addition, when rowkeys start with an MD5 hash, you can use the HexStringSplit algorithm that ships with HBase to pre-split the table when creating it.
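The formula above can be sketched with the JDK's MessageDigest. This is a minimal illustration, assuming the 6-character prefix is taken from the lowercase hex form of the digest and that accounts are UTF-8 strings; the class and method names (HashedRowkeyDemo, md5Hex, rowkey) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedRowkeyDemo {
    // Number of hex characters of the MD5 digest kept as the prefix.
    static final int PREFIX_LENGTH = 6;

    // Return the MD5 digest of the input as a 32-character lowercase hex string.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // substr(md5(account), 0, 6) + account
    static String rowkey(String account) {
        return md5Hex(account).substring(0, PREFIX_LENGTH) + account;
    }

    public static void main(String[] args) {
        String key = rowkey("account-1001");
        System.out.println(key);
        // Unlike a random salt, the prefix is deterministic, so a reader
        // can recompute the exact rowkey from the account alone.
        System.out.println(key.equals(rowkey("account-1001"))); // prints "true"
    }
}
```

Appending the full account keeps rowkeys unique even if two accounts happen to share the same 6-character prefix.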

Third: reverse (Reversing)

If the leading part of the rowkey field changes very slowly while the trailing part changes frequently, consider reversing the field. This applies especially to timestamp-like data.

Whichever way you design the rowkey, you have to apply the same transformation when you query. With hashing, for example, you hash the lookup value first and then query by the resulting rowkey; with reversing, you reverse it first.
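As a small illustration of reversing, and of applying the same transformation on write and on read, consider the sketch below. It assumes the field is a fixed-width timestamp string and uses a TreeMap as a stand-in for the sorted table; ReversedRowkeyDemo and reverse are hypothetical names.

```java
import java.util.TreeMap;

public class ReversedRowkeyDemo {
    // Reverse the field so its fast-changing tail becomes the leading part.
    static String reverse(String field) {
        return new StringBuilder(field).reverse().toString();
    }

    public static void main(String[] args) {
        // Assumption: a sorted in-memory map stands in for the HBase table.
        TreeMap<String, String> table = new TreeMap<>();
        String[] timestamps = {"1369163040570", "1369163040571", "1369163040572"};

        // Write path: store each row under the reversed timestamp.
        for (String ts : timestamps) {
            table.put(reverse(ts), "event@" + ts);
        }
        // Consecutive timestamps now start with different characters.
        System.out.println(table.keySet());

        // Read path: reverse the lookup value before querying.
        System.out.println(table.get(reverse("1369163040571")));
    }
}
```

The cost of this design is that rows are no longer stored in time order, so a scan over a time range cannot be expressed as a single contiguous rowkey range.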

Fourth: minimize rowkey and column family length

A rowkey can be any string, with a maximum length of 64KB, but it is recommended to keep it as short as possible, for the following reasons:

1. HBase stores data as key-value pairs. If the rowkey is relatively long, say 100 bytes, then for 10 million rows the rowkeys alone take 100 × 10,000,000 = 1 billion bytes, nearly 1 GB.

2. The memstore caches data in memory. A longer rowkey takes up more of that space as well.

3. It is recommended to make the rowkey length an integer multiple of 8 bytes and to keep it within 16 bytes, because most current operating systems are 64-bit and lengths that are integer multiples of 8 bytes make better use of that.

The same goes for the column family name: keep it as short as possible, preferably a single character such as f or d.
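The arithmetic in point 1 can be checked with a short sketch. The cellsPerRow parameter is an assumption added for illustration: HBase stores the rowkey with every key-value, so a row with several cells repeats the rowkey once per cell. RowkeyOverheadDemo and rowkeyBytes are hypothetical names.

```java
public class RowkeyOverheadDemo {
    // Bytes spent on rowkeys alone: length × rows × cells per row.
    static long rowkeyBytes(int rowkeyLength, long rows, int cellsPerRow) {
        return (long) rowkeyLength * rows * cellsPerRow;
    }

    public static void main(String[] args) {
        long rows = 10_000_000L; // 10 million rows, as in the text

        // 100-byte rowkey, one cell per row.
        System.out.println(rowkeyBytes(100, rows, 1)); // prints 1000000000

        // 16-byte rowkey, one cell per row.
        System.out.println(rowkeyBytes(16, rows, 1));  // prints 160000000

        // 100-byte rowkey, five cells per row (hypothetical schema).
        System.out.println(rowkeyBytes(100, rows, 5)); // prints 5000000000
    }
}
```

The figures are raw key bytes only; they ignore compression and block encoding, which can reduce the on-disk cost.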

Fifth: Byte Patterns

A long is 8 bytes. In those 8 bytes you can store an unsigned number as large as 18446744073709551615, but if you store the same number as a string, it takes almost 3 times as much space (assuming one byte per character).

Let's give an example to verify:

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

// long
//
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length); // returns 8

String s = String.valueOf(l);
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length); // returns 10

// hash
//
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length); // returns 16

String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26

The downside of this kind of binary representation is poor readability when you look up the data in the HBase shell, for example:

hbase(main):002:0> get 'table1', 'rowkey1'
COLUMN CELL
f:q timestamp=1369163040570, value=\x00\x01
1 row(s) in 0.0310 seconds
