What are the RowKey design methods of HBase?

2025-01-17 Update From: SLTechnology News&Howtos (shulou)


This article covers the main principles for designing an HBase rowkey (length, hashing, and uniqueness) and the common techniques for avoiding hotspots: salting, hashing, reversing, and timestamp reversal.

HBase stores data in sorted order along three dimensions, and any cell can be located quickly through them: the rowkey (row key), the column key (column family and qualifier), and the timestamp.

A rowkey uniquely identifies a row in HBase. A query can reach rows in the following ways:

Get: specify the rowkey to fetch exactly one row.

Scan with a range: set the startRow and stopRow parameters to match a range of rowkeys.

Full table scan: scan every row in the entire table.

(Newer HBase versions can also query by column and value, but any query that does not go through the rowkey is slower.)
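The three access paths can be pictured with a toy model. The sketch below is not the HBase client API; `SortedTable` and its methods are hypothetical names, and it only assumes what the text states: rows are kept sorted by rowkey, and a range scan covers startRow (inclusive) up to stopRow (exclusive).

```python
import bisect
from typing import Optional


class SortedTable:
    """Toy in-memory stand-in for a table whose rows are sorted by rowkey bytes."""

    def __init__(self):
        self._keys = []   # rowkeys, kept in sorted order
        self._rows = {}   # rowkey -> value

    def put(self, rowkey: bytes, value: str) -> None:
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
        self._rows[rowkey] = value

    def get(self, rowkey: bytes) -> Optional[str]:
        """Point lookup by exact rowkey."""
        return self._rows.get(rowkey)

    def scan(self, start_row: bytes = b"", stop_row: Optional[bytes] = None):
        """Rows in [start_row, stop_row); with no bounds this is a full table scan."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = len(self._keys) if stop_row is None else bisect.bisect_left(self._keys, stop_row)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for key in (b"user3", b"user1", b"user2", b"order9"):
    table.put(key, key.decode() + "-data")

print(table.get(b"user2"))              # point lookup
print(table.scan(b"user1", b"user3"))   # range scan, stop row excluded
print(len(table.scan()))                # full scan touches every row
```

The point lookup and the range scan only touch the part of the sorted key space they need, which is why a rowkey that matches the query pattern matters so much.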

Rowkey length principle

A rowkey is an arbitrary sequence of bytes, stored as a byte[]. Its maximum length is commonly quoted as 64 KB (current HBase releases cap row length at Short.MAX_VALUE, 32,767 bytes). In practice a rowkey is usually 10 to 100 bytes long and is usually designed with a fixed length.

The shorter the rowkey the better, ideally no more than 16 bytes, for the following reasons:

HFile, the persistent data file, stores data as KeyValue entries, and every entry carries the rowkey. If the rowkey is long, say 100 bytes, then for 10 million rows the rowkeys alone occupy 100 × 10,000,000 = 1 billion bytes, nearly 1 GB, which greatly reduces the storage efficiency of HFile.

MemStore caches part of the data in memory. If the rowkey is too long, less of that memory holds useful data, the system caches fewer rows, and retrieval becomes less efficient.

Most systems today are 64-bit, with memory aligned on 8-byte boundaries. Keeping the rowkey at 16 bytes, an integer multiple of 8, fits that alignment well.
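The storage arithmetic above can be checked with a few lines. This is a minimal sketch; `rowkey_overhead_bytes` and its `cells_per_row` parameter are illustrative names, added on the assumption that each stored cell repeats the full rowkey.

```python
def rowkey_overhead_bytes(rowkey_len: int, rows: int, cells_per_row: int = 1) -> int:
    """Bytes spent on rowkey copies alone, assuming every stored cell carries the rowkey."""
    return rowkey_len * rows * cells_per_row


rows = 10_000_000  # 10 million rows, as in the example above

long_key = rowkey_overhead_bytes(100, rows)   # 100-byte rowkey
short_key = rowkey_overhead_bytes(16, rows)   # 16-byte rowkey

print(f"100-byte rowkey: {long_key / 1e9:.2f} GB")
print(f" 16-byte rowkey: {short_key / 1e9:.2f} GB")
```

With several columns per row the cost multiplies again, since `cells_per_row` scales the result linearly.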

Rowkey hashing principle

If rowkeys increase with a timestamp, do not put the time at the front of the key. Instead, use the high-order bytes of the rowkey as a hash field generated by the program, and put the time field in the low-order bytes. This raises the chance that data is spread evenly across the RegionServers, which balances the load. Without a hash field, the leading field is the time itself, so all new data is concentrated on a single RegionServer. Retrieval load then concentrates on individual RegionServers as well, which creates hotspots and lowers query efficiency.

Rowkey uniqueness principle

The rowkey must be designed so that it is unique. Rowkeys are stored sorted in lexicographic (byte) order, so the design should take full advantage of that ordering: store data that is frequently read together in adjacent rows, and keep data that is likely to be accessed soon close together.

What is a hot spot?

Rows in HBase are sorted lexicographically by rowkey. This design optimizes scans, because related rows, and rows that will be read together, are stored near each other. A poorly designed rowkey, however, is the usual source of hotspots. A hotspot occurs when a large share of client traffic (reads, writes, or other operations) goes directly to one node, or a few nodes, of the cluster. That traffic can push the single machine hosting the hot region beyond its capacity, degrading performance and possibly making the region unavailable. Other regions on the same RegionServer suffer too, because the host cannot serve their requests. A good data access pattern uses the cluster fully and evenly.

To avoid write hotspots, design the rowkey so that rows that truly need to be in the same region are, while in the bigger picture data is written to multiple regions of the cluster rather than to one.
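A small simulation makes the effect visible. It is only a sketch under stated assumptions: the table is imagined as pre-split into five regions by the hypothetical split keys in `SPLITS`, and the one-digit SHA-256 prefix in `hash_prefixed` is an illustrative choice, not something HBase prescribes.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical pre-split table: 5 regions bounded by these split keys.
SPLITS = [b"2", b"4", b"6", b"8"]


def region_for(rowkey: bytes) -> int:
    """Index of the region whose [start, end) key range contains rowkey."""
    return bisect.bisect_right(SPLITS, rowkey)


def hash_prefixed(rowkey: bytes) -> bytes:
    """Prepend one digit derived from a hash of the key (illustrative choice)."""
    digit = hashlib.sha256(rowkey).digest()[0] % 10
    return str(digit).encode("ascii") + rowkey


# 1000 consecutive millisecond timestamps used directly as rowkeys.
time_keys = [str(1_700_000_000_000 + i).encode("ascii") for i in range(1000)]

plain_load = Counter(region_for(k) for k in time_keys)
hashed_load = Counter(region_for(hash_prefixed(k)) for k in time_keys)

print("timestamp first:", dict(plain_load))
print("hash first:     ", dict(hashed_load))
```

With the timestamp first, every write falls into the same region because all keys share the same leading bytes. With a hash-derived leading byte the same writes are spread over all five regions.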

Here are some common ways to avoid hotspots and their advantages and disadvantages:

Salt

Salting here has nothing to do with salting in cryptography. It means adding random data in front of the rowkey: each rowkey gets a randomly assigned prefix so that it sorts differently from how it otherwise would. The number of distinct prefixes should match the number of regions you want to spread the data across. After salting, rowkeys are scattered across regions according to their random prefixes, which avoids hotspots. The cost is on the read side: because the prefix is random, a reader does not know which prefix a given row received and has to check every one.
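As a minimal sketch of salting, assuming four prefixes and a two-digit prefix followed by a `|` separator (both arbitrary choices made for this illustration):

```python
import random

NUM_BUCKETS = 4  # assumption: one salt prefix per region to spread over


def salted_key(rowkey: bytes, rng: random.Random) -> bytes:
    """Prefix the rowkey with a randomly chosen bucket number."""
    salt = rng.randrange(NUM_BUCKETS)
    return f"{salt:02d}|".encode("ascii") + rowkey


def candidate_keys(rowkey: bytes) -> list:
    """Every key a reader must try, since the salt used at write time is unknown."""
    return [f"{s:02d}|".encode("ascii") + rowkey for s in range(NUM_BUCKETS)]


rng = random.Random(42)  # seeded only to make the demo repeatable
written = [salted_key(b"20250117-event", rng) for _ in range(8)]

print(written)
print(candidate_keys(b"20250117-event"))
```

`candidate_keys` shows the read-side cost: fetching one logical row means issuing one lookup per prefix.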

Hash

Hashing replaces the random prefix with one computed from the row itself, so a given row is always salted with the same prefix. Hashing also spreads the load across the cluster, but reads stay predictable: with a deterministic hash, the client can reconstruct the complete rowkey and use a get operation to fetch exactly that row.
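A deterministic prefix could look like the sketch below. The choice of SHA-256, the bucket count, and the `NN|` prefix layout are assumptions made for illustration only.

```python
import hashlib

NUM_BUCKETS = 4  # assumption: number of regions to spread over


def hashed_key(rowkey: bytes) -> bytes:
    """Prefix the rowkey with a bucket computed from the rowkey itself."""
    bucket = hashlib.sha256(rowkey).digest()[0] % NUM_BUCKETS
    return f"{bucket:02d}|".encode("ascii") + rowkey


# Writer and reader compute the same key independently, so a single get suffices.
key_at_write = hashed_key(b"user42")
key_at_read = hashed_key(b"user42")
print(key_at_write, key_at_write == key_at_read)
```

Unlike a random salt, nothing has to be remembered or searched: the prefix is a pure function of the original rowkey.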

Reverse

The third way to prevent hotspots is to reverse a fixed-length or numeric rowkey. This puts the part of the rowkey that changes most often (the least significant part) first. It effectively randomizes rowkeys, but at the expense of their ordering.

A typical example is a mobile phone number used as the rowkey. Storing the reversed string of the phone number as the rowkey avoids the hotspots caused by phone numbers that begin with the same fixed prefix.
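The phone number case is short enough to show directly. The numbers below are made-up sample values, and `reversed_key` is an illustrative helper name.

```python
def reversed_key(phone: str) -> bytes:
    """Use the reversed digit string as the rowkey."""
    return phone[::-1].encode("ascii")


# Sample numbers sharing the same leading digits.
phones = ["13800000001", "13800000002", "13800000003"]
keys = [reversed_key(p) for p in phones]

for phone, key in zip(phones, keys):
    print(phone, "->", key.decode("ascii"))
```

The original numbers all start with the same digits and would sort next to each other, while the reversed keys differ from the first byte on. Reversing again recovers the original number, so no information is lost, only the ordering.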

Timestamp reversal

A common data processing problem is fetching the latest version of a piece of data quickly. A reversed timestamp as part of the rowkey is very useful here: append Long.MAX_VALUE - timestamp to the end of the key, giving [key][reverse_timestamp]. Because rowkeys in HBase are sorted, a scan starting at [key] returns the most recently written record for [key] as its first result.

For example, suppose you need to store a user's operation records and read them back newest first. The rowkey can be designed as [reversed userId][Long.MAX_VALUE - timestamp], with the second field encoded at a fixed width so that byte order matches numeric order.

To query all operation records of a user, scan with startRow [reversed userId][000000000000] and stopRow [reversed userId][Long.MAX_VALUE - timestamp], where timestamp is the oldest time to include.

To query the records of a given period, note that a later time gives a smaller reversed value. The startRow is therefore [reversed userId][Long.MAX_VALUE - end time] and the stopRow is [reversed userId][Long.MAX_VALUE - start time]. This design disperses hotspots while still supporting a variety of queries.
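The key layout and the range bounds can be sketched as follows. The encoding is an assumption: the reversed timestamp is packed as 8 big-endian bytes so that byte order equals numeric order, userIds are taken to be fixed width, and the stop row is treated as exclusive. `op_key` and `time_range` are hypothetical helper names.

```python
import bisect
import struct

LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE


def op_key(user_id: str, ts_ms: int) -> bytes:
    """[reversed userId][Long.MAX_VALUE - timestamp], the latter as 8 big-endian bytes."""
    return user_id[::-1].encode("ascii") + struct.pack(">Q", LONG_MAX - ts_ms)


def time_range(user_id: str, start_ms: int, end_ms: int):
    """(startRow, stopRow) covering start_ms <= t <= end_ms; stopRow is exclusive."""
    start_row = op_key(user_id, end_ms)        # the newest wanted record sorts first
    stop_row = op_key(user_id, start_ms - 1)   # first key past the oldest wanted record
    return start_row, stop_row


def decode_ts(rowkey: bytes) -> int:
    """Recover the original timestamp from the last 8 bytes of the rowkey."""
    return LONG_MAX - struct.unpack(">Q", rowkey[-8:])[0]


# Four operations by one user; sorting mimics the stored rowkey order.
keys = sorted(op_key("u001", ts) for ts in (1000, 2000, 3000, 4000))
print([decode_ts(k) for k in keys])            # newest first

start_row, stop_row = time_range("u001", 2000, 3000)
lo = bisect.bisect_left(keys, start_row)
hi = bisect.bisect_left(keys, stop_row)
print([decode_ts(k) for k in keys[lo:hi]])     # records inside the period
```

The range helper makes the inversion explicit: the end of the period supplies the start row, and the start of the period supplies the stop row.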

Some other suggestions

Keep row and column sizes in HBase as small as possible, because a value always travels with its key: whenever a cell value moves through the system, its rowkey, column name, and timestamp go with it. If the rowkey and column names are large, especially relative to the value itself, you will run into some interesting problems. The indexes that HBase keeps on storefiles to support random access can end up consuming a large part of the memory allotted to HBase, because the keys stored in them are large. You can raise the block size so that storefile index entries occur at larger intervals, or change the table schema to shrink the rowkey and column names. Compression affects index size as well; the HBase reference guide notes that it makes for larger indexes.

Keep column family names as short as possible, preferably a single character.

Long attribute names are more readable, but shorter attribute names cost less to store in HBase.
