
HBase Rowkey design specification

2025-07-09 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

1. What is a RowKey?

A RowKey can be thought of as the equivalent of a primary key in a relational database such as MySQL or Oracle: it uniquely identifies a row.

It is a unique string defined entirely by the user.

Data in HBase is always sorted by RowKey in lexicographic (dictionary) order.
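Dictionary order compares keys byte by byte, which is not the same as numeric order. The following is a minimal sketch in plain Python (no HBase client involved; the `user…` keys are made up for illustration) showing the effect and one common way around it, zero-padding to a fixed width:

```python
# RowKeys are compared byte by byte, not as numbers.
row_keys = [b"user10", b"user9", b"user1", b"user100"]

sorted_keys = sorted(row_keys)
print(sorted_keys)

# Zero-padding to a fixed width makes byte order match numeric order.
padded_keys = sorted(f"user{n:03d}".encode() for n in (10, 9, 1, 100))
print(padded_keys)
```

In the first list `user9` sorts after `user100`; in the padded list the keys come out in numeric order.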

2. The role of RowKey

RowKey is used to locate the corresponding Region when reading and writing data. To look up a row you must know its RowKey, and every write is likewise routed by RowKey.

Data in the MemStore is sorted in RowKey dictionary order. A write first goes into the MemStore, that is, memory, where it is kept sorted by RowKey.

Data in an HFile is sorted in RowKey dictionary order. The data in memory is eventually persisted to disk as HFiles, which are also sorted by RowKey.

3. The influence of RowKey on queries

Example: the RowKey is composed of uid + phone + name.

1. Scenarios that are well supported

uid = 111 AND phone = 123 AND name = abc

uid = 111 AND phone = 123

uid = 111 AND phone = 12?

uid = 111

All of these queries specify uid, the leading part of the RowKey. The first query supplies the complete RowKey, so it is the most efficient. The other three do not supply a complete RowKey, but because they fix its leading part they are still well supported.

2. Scenarios that are difficult to support

phone = 123 AND name = abc

phone = 123

name = abc
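The difference between the two groups of queries can be sketched with a sorted list standing in for the table. This is only an illustration: the `|` separator, the sample rows and the helper names `prefix_scan` and `full_scan` are assumptions, not HBase APIs.

```python
from bisect import bisect_left

# Hypothetical rows keyed as uid|phone|name, kept sorted like an HBase table.
rows = sorted([
    "111|123|abc",
    "111|124|xyz",
    "111|990|bob",
    "222|123|abc",
    "333|555|eve",
])

def prefix_scan(sorted_keys, prefix):
    """Jump to the first key >= prefix, then read until the prefix stops matching."""
    start = bisect_left(sorted_keys, prefix)
    matches, touched = [], 0
    for key in sorted_keys[start:]:
        touched += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, touched

def full_scan(sorted_keys, predicate):
    """Without a leading prefix every row must be examined."""
    matches = [key for key in sorted_keys if predicate(key)]
    return matches, len(sorted_keys)

# uid = 111 AND phone = 12?  -> served by a short range scan
print(prefix_scan(rows, "111|12"))

# phone = 123 (no uid) -> every row is read and filtered
print(full_scan(rows, lambda key: key.split("|")[1] == "123"))
```

The second value in each result is the number of rows touched: the prefix query stops as soon as the prefix no longer matches, while the query without uid reads all rows.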

In these scenarios uid, the leading part of the RowKey, is not specified; the query uses only phone and name. HBase then has to scan the full table, which greatly reduces query efficiency.

4. The effect of RowKey on Region partitioning

The data of an HBase table is distributed to different Regions according to RowKey. A poorly designed RowKey leads to hotspotting: a large number of clients access one or a few nodes of the cluster directly while the other nodes sit relatively idle, which degrades the read and write performance of the HBase table.
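As a rough illustration of how a hotspot arises, the sketch below maps keys to Regions by comparing them against a list of split points. The split keys `g`, `n`, `t` and the key formats are hypothetical, and `region_for` is a stand-in for the real Region lookup, not an HBase API.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical split points: Region 0 holds keys < "g", Region 1 holds "g".."n", and so on.
split_keys = ["g", "n", "t"]

def region_for(row_key):
    """Index of the Region whose key range contains row_key."""
    return bisect_right(split_keys, row_key)

# Keys that share a leading prefix (for example a date) all sort into the same range.
sequential = [f"20250709-{i:04d}" for i in range(1000)]
print(Counter(region_for(k) for k in sequential))

# Keys whose first character varies spread across the Regions.
varied = [f"{'ahpv'[i % 4]}-{i:04d}" for i in range(1000)]
print(Counter(region_for(k) for k in varied))
```

All 1000 date-prefixed keys fall into Region 0, while the keys with a varying first character are split evenly over the four Regions.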

5. RowKey design techniques

1. Salting

Salting places a fixed-length random prefix at the beginning of each RowKey so that it sorts differently than it otherwise would. The number of distinct prefixes should match the number of Regions you want to spread the data across. Salting is useful when a few hot RowKey patterns appear repeatedly among other, more evenly distributed RowKeys.

Example: suppose you have the following RowKeys and your table is split so that there is one Region for each letter of the alphabet: keys beginning with 'a' are in one Region, keys beginning with 'b' in another, and so on. In this table, all rows starting with 'f' are in the same Region. The original RowKeys are on the left and the salted ones on the right:

foo0001 ==> a-foo0001
foo0002 ==> b-foo0002
foo0003 ==> c-foo0003
foo0004 ==> d-foo0004

To spread these rows over 4 Regions, you can use four different salts: 'a', 'b', 'c' and 'd'. Each of these letter prefixes falls in a different Region, and the salted RowKeys look like the right-hand column above.

You can now write to four different Regions, so in theory you get four times the throughput you would have if all writes went to the same Region.
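A minimal sketch of the write side of salting, assuming the four letter salts from the example. The helper name `salted_key` and the fixed random seed are illustrative choices, not part of any HBase API.

```python
import random
from collections import Counter

SALTS = ["a", "b", "c", "d"]  # one salt per target Region, as in the example above

def salted_key(row_key, rng):
    """Prepend a randomly chosen salt; the salt cannot be derived from row_key later."""
    return f"{rng.choice(SALTS)}-{row_key}"

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
keys = [salted_key(f"foo{i:04d}", rng) for i in range(1, 10001)]

# Count how many writes land under each prefix.
distribution = Counter(key[0] for key in keys)
print(distribution)
```

Each prefix receives roughly a quarter of the writes. Because the salt is random, a later read cannot reconstruct it from the original key and has to try every prefix.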

Pros and cons: because the prefix is assigned randomly, retrieving the rows in dictionary order takes more work. Salting therefore increases write throughput at the cost of extra read overhead.

2. Hashing

Hashing computes the hash of the RowKey and then concatenates part of the hash string with the original RowKey. The hash may be MD5, SHA-1, SHA-256, SHA-512 or a similar algorithm; it is not limited to Java's built-in hash values.

For example, given the following RowKeys:

foo0001 ==> 95f18cfoo0001
foo0002 ==> 6ccc20foo0002
foo0003 ==> b61d00foo0003
foo0004 ==> 1a7475foo0004

Here we compute the MD5 hash of each RowKey and prepend its first 6 characters to the original RowKey to obtain the new RowKey.
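The construction can be sketched with Python's standard `hashlib`. The function name `hashed_key` and the choice of hex encoding for the digest are assumptions made for illustration.

```python
import hashlib

def hashed_key(row_key, prefix_len=6):
    """Prefix row_key with the first prefix_len hex characters of its MD5 digest."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + row_key

for key in ("foo0001", "foo0002", "foo0003", "foo0004"):
    print(hashed_key(key))
```

Unlike a random salt, the prefix is a pure function of the original key, so a reader who knows the original RowKey can recompute the full key and fetch the row directly.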

Pros and cons: hashing breaks up the data set to a certain extent, but it is not friendly to Scan. For example, compute the MD5 value of the RowKey and take its first few characters. A common usage is subString(MD5(device ID), 0, x) + device ID, where x is usually 5 or 6.

3. Reversing

Reversing works by reversing a fixed-length part of the RowKey or the whole RowKey.

For example, we have the following URLs as RowKeys:

flink.iteblog.com ==> moc.golbeti.knilf
www.iteblog.com ==> moc.golbeti.www
carbondata.iteblog.com ==> moc.golbeti.atadnobrac
def.iteblog.com ==> moc.golbeti.fed

These URLs all belong to the same domain, but because their leading parts differ the data is not stored together. After reversing them as shown, they share a common prefix and the data for these URLs is stored together.
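A short sketch of full-key reversal on the URLs above; `reversed_key` is a hypothetical helper name.

```python
def reversed_key(row_key):
    """Reverse the whole key so the shared domain suffix becomes a shared prefix."""
    return row_key[::-1]

urls = ["flink.iteblog.com", "www.iteblog.com", "carbondata.iteblog.com", "def.iteblog.com"]

# Sorting mimics how the reversed keys would be laid out in the table.
stored = sorted(reversed_key(u) for u in urls)
print(stored)

# Reversing again recovers the original URL when reading.
print([reversed_key(k) for k in stored])
```

After reversal every key starts with `moc.golbeti.`, so the four rows sort next to each other.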

Pros and cons: reversing effectively scatters RowKeys, but at the expense of their natural sort order.

6. Length of RowKey

A RowKey can be any string with a maximum length of 64 KB (because the row length field occupies 2 bytes). The shorter the better, for the following reasons:

HFile, the persistent data file, stores data as KeyValues, and each KeyValue carries the RowKey. If the RowKey is too long, say 100 bytes, then for 10 million rows the RowKeys alone occupy 100 × 10,000,000 = 1 billion bytes, nearly 1 GB, which greatly affects HFile storage efficiency.

The MemStore caches part of the data in memory. If the RowKey is too long, the effective utilization of memory drops and the system can cache less data, which reduces retrieval efficiency.

Current operating systems are mostly 64-bit with 8-byte memory alignment. Keeping the RowKey at 16 bytes, an integer multiple of 8 bytes, takes best advantage of this.

7. Design case analysis

1. RowKey design for transaction tables

Query the transaction records of a seller within a certain period of time:

sellerId + timestamp + orderId

Query the transaction records of a buyer within a certain period of time:

buyerId + timestamp + orderId

Query by order number:

orderNo

If a merchant sells a large number of goods, the first design can put a lot of data with the same RowKey prefix in the same Region, causing a hotspot. To avoid this while keeping lookups fast, the RowKey can be designed as salt + sellerId + timestamp, where salt is a random number.

That is, a salt such as a random number is prepended to the original structure so that the data is distributed across different Regions.
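With a salted key, a query for one seller has to be issued once per salt value and the results merged. The sketch below models this on a sorted list. The bucket count, the `|` separator, the fixed-width encoding and all function names are assumptions for illustration, and the salt is assigned round-robin rather than randomly so the output is repeatable.

```python
from bisect import bisect_left

SALT_BUCKETS = 4  # hypothetical number of salt values

def make_key(salt, seller_id, timestamp):
    # Fixed-width fields keep lexicographic order equal to numeric order.
    return f"{salt:02d}|{seller_id}|{timestamp:013d}"

def scan(sorted_keys, start, stop):
    """Half-open range scan [start, stop) over a sorted key list."""
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]

def seller_records(sorted_keys, seller_id, t_start, t_end):
    """One range scan per salt bucket; results are merged by timestamp."""
    hits = []
    for salt in range(SALT_BUCKETS):
        hits += scan(sorted_keys,
                     make_key(salt, seller_id, t_start),
                     make_key(salt, seller_id, t_end))
    return sorted(hits, key=lambda k: k.split("|")[2])

# Toy table: salt chosen round-robin here instead of randomly, for repeatability.
table = sorted(make_key(i % SALT_BUCKETS, "s42", 1000 + i) for i in range(8))
print(seller_records(table, "s42", 1002, 1006))
```

The query returns the four records with timestamps 1002 to 1005, collected from four different salt buckets.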

Scenarios that can be supported:

Full table Scan: because of the salt, the data is distributed across different Regions and the Scan goes to those different Regions, which improves concurrency and retrieval efficiency.

Query by sellerId

Query by sellerId + timestamp

2. RowKey design for financial risk control

Query the profile data of a user:

prefix + uid

prefix + idcard

prefix + tele

Here prefix = substr(md5(uid), 0, x), where x is 5 or 6. uid, idcard and tele are the user's unique identifier, ID card number and mobile phone number, respectively.

3. RowKey design for the Internet of Vehicles

Query the data of a car within a certain time range, such as engine data:

carId + timestamp

If a certain batch contains too many cars, this design causes hotspots. In that case use:

prefix + carId + timestamp

where prefix = substr(md5(carId), 0, x)

4. Reverse timestamp

Query the user's latest operation records, or the user's operation records for a certain period of time. The RowKey is designed as follows:

uid + (Long.MAX_VALUE - timestamp)

Supported scenarios

Query the user's latest operation records:

Scan [uid]: startRow [uid][00000000000], stopRow [uid][Long.MAX_VALUE - timestamp]

This lets us fetch, for example, the latest 100 records.
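A small sketch of why this key layout returns the newest records first. The `|` separator, the 19-digit zero padding and the helper names are assumptions; `LONG_MAX` is the value of Java's `Long.MAX_VALUE`.

```python
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE

def reverse_ts_key(uid, timestamp_ms):
    """uid followed by (Long.MAX_VALUE - timestamp), zero-padded to 19 digits."""
    return f"{uid}|{LONG_MAX - timestamp_ms:019d}"

def decode_ts(row_key):
    """Recover the original timestamp from a reverse-timestamp key."""
    return LONG_MAX - int(row_key.split("|")[1])

events = [1_700_000_000_000, 1_700_000_060_000, 1_700_000_120_000]
table = sorted(reverse_ts_key("u1", ts) for ts in events)

# The first rows of a forward scan are the newest records.
latest_first = [decode_ts(k) for k in table]
print(latest_first)
```

Because a larger timestamp produces a smaller stored value, a plain forward scan limited to N rows yields the N most recent records.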

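Query the user's operation records for a certain period of time:

Scan [uid]: startRow [uid][Long.MAX_VALUE - endTime], stopRow [uid][Long.MAX_VALUE - startTime]

Because the stored value decreases as the timestamp grows, the later time (endTime) gives the smaller key and is therefore the start row.

5. Secondary index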

Example: suppose there is an HBase table with the following structure and data.

Q: How do I find the users with phone = 13111111111?

HBase's native design cannot satisfy this kind of requirement. A secondary index has to be introduced: build an index table that uses phone as the RowKey and uid/name as column names.

You can code the secondary index yourself if you do not want to rely on third-party components, or create one through Phoenix or Solr.

SQL + OLTP ==> Phoenix

Full-text search + secondary index ==> Solr/ES
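For the hand-coded option, the idea can be sketched with two dictionaries standing in for the main table and the index table. The sample rows, the column names and the helper names `put_user` and `find_by_phone` are hypothetical, and a real implementation would also have to keep the two tables consistent on updates and deletes.

```python
# Main table: RowKey = uid, columns = phone, name.
main_table = {}

# Index table: RowKey = phone, one column per uid holding the name.
index_table = {}

def put_user(uid, phone, name):
    """Write the main row and its index entry together."""
    main_table[uid] = {"phone": phone, "name": name}
    index_table.setdefault(phone, {})[uid] = name

def find_by_phone(phone):
    """Look up the index by phone, then fetch the full rows by uid."""
    return {uid: main_table[uid] for uid in index_table.get(phone, {})}

# Hypothetical sample rows.
put_user("u001", "13111111111", "alice")
put_user("u002", "13222222222", "bob")

print(find_by_phone("13111111111"))
```

The lookup by phone becomes a point read on the index table followed by point reads on the main table, instead of a full scan of the main table.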
