2025-03-26 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
1. What is a RowKey?
A RowKey can be understood as the counterpart of a primary key in a relational database such as MySQL or Oracle: it uniquely identifies a row.
It is specified entirely by the user and must not repeat within a table.
Data in HBase is always stored sorted by the lexicographic (dictionary) order of the RowKey.
2. The role of the RowKey
Reads and writes use the RowKey to locate the corresponding Region. To look up a row you must know its RowKey, and a write is likewise routed by its RowKey.
Data in the MemStore is sorted in RowKey dictionary order. A write first goes into the MemStore, that is, into memory, where it is kept sorted by RowKey.
Data in an HFile is sorted in RowKey dictionary order. The in-memory data is eventually persisted to disk as HFiles, which keep the same ordering.
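Dictionary order is easy to misjudge when a key contains numbers. The following is a minimal sketch of the ordering only, using plain Python byte strings rather than HBase itself; the sample keys are made up for illustration.

```python
# Row keys are compared as byte strings, so ordering is lexicographic,
# not numeric. The keys below are hypothetical.
row_keys = [b"row2", b"row10", b"row1", b"Row3"]

sorted_keys = sorted(row_keys)
print(sorted_keys)  # uppercase sorts before lowercase, "row10" before "row2"

# Zero-padding the numeric part makes byte order agree with numeric order.
padded = sorted(f"row{n:04d}".encode() for n in (2, 10, 1))
print(padded)
```

If the numeric part of a key should sort numerically, padding it to a fixed width, as in the second half of the sketch, is one common way to get that.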
3. The influence of the RowKey on queries
Example: the RowKey is composed of uid + phone + name.
1. Scenarios that are well supported
uid = 111 AND phone = 123 AND name = abc
uid = 111 AND phone = 123
uid = 111 AND phone = 12?
uid = 111
All of these queries specify uid, the first part of the RowKey. The first query supplies the complete RowKey, so it is the most efficient. The other three do not supply a complete RowKey, but they are still well supported because each one fixes a leading prefix of the key.
2. Scenarios that are difficult to support
phone = 123 AND name = abc
phone = 123
name = abc
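The difference between the two groups of queries can be sketched with a sorted list standing in for the table. This is an illustration only, not the HBase client API: the "|" separator, the sample rows and the helper names prefix_scan and filter_scan are all assumptions.

```python
from bisect import bisect_left

# Hypothetical rows; the RowKey is uid|phone|name, kept in sorted order.
rows = sorted([
    "111|123|abc",
    "111|124|abd",
    "111|555|xyz",
    "222|123|abc",
    "333|999|foo",
])

def prefix_scan(prefix):
    """Leading part known: seek to the prefix, stop when keys stop matching."""
    start = bisect_left(rows, prefix)
    hits, examined = [], 0
    for key in rows[start:]:
        examined += 1
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits, examined

def filter_scan(predicate):
    """Leading part unknown: every row has to be examined."""
    hits = [key for key in rows if predicate(key)]
    return hits, len(rows)

# uid = 111 AND phone = 12? touches only the neighbouring rows.
print(prefix_scan("111|12"))
# phone = 123 has no usable prefix, so all rows are read.
print(filter_scan(lambda key: key.split("|")[1] == "123"))
```

The second number in each result is how many rows were read, which is what grows with table size when the leading part of the key is missing.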
The queries in this second group do not specify uid, the first part of the RowKey; they filter only on phone and name, so no leading part of the key is known. HBase then has to scan the full table, which reduces query efficiency.
4. The effect of the RowKey on Region partitioning
The data of an HBase table is distributed across Regions according to the RowKey. A poorly designed RowKey leads to hotspotting: a large number of clients access one or a few nodes of the cluster while the other nodes stay relatively idle, which hurts the read and write performance of the table.
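To see how a shared prefix concentrates load, the sketch below assigns keys to Regions by comparing them with a list of split points. The split points, the date-prefixed keys and the region_for helper are hypothetical and only model the idea of range partitioning.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical split points giving four key ranges:
# [start, "g"), ["g", "n"), ["n", "t"), ["t", end).
split_keys = ["g", "n", "t"]

def region_for(row_key):
    """Index of the range (Region) that contains row_key."""
    return bisect_right(split_keys, row_key)

# Keys that all begin with the same date land in a single Region.
hot_keys = [f"20250326-{i:04d}" for i in range(1000)]
hot_load = Counter(region_for(key) for key in hot_keys)
print(hot_load)
```

In this model every one of the 1000 writes goes to Region 0 while the other three ranges receive nothing, which is the situation the design techniques in the next section try to avoid.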
5. RowKey design techniques
1. Salting
Salting places a fixed-length random prefix at the beginning of the row key so that it sorts differently from its original order. The number of distinct prefixes should match the number of Regions you want to spread the data across. Salting is useful when a few hot rowkey patterns recur among otherwise evenly distributed rowkeys.
Example: suppose each Region of your table corresponds to one letter of the alphabet, so that rowkeys beginning with 'a' are in one Region, those beginning with 'b' are in another, and so on. All rowkeys beginning with 'f' are then in the same Region. Such rowkeys, and their salted forms, look like this:
foo0001  ==>  a-foo0001
foo0002  ==>  b-foo0002
foo0003  ==>  c-foo0003
foo0004  ==>  d-foo0004
To spread these rows across four Regions, you can use four different salts: 'a', 'b', 'c' and 'd'. Under this scheme each salt prefix falls in a different Region, and the salted rowkeys look like the right-hand column above.
You can then write to four Regions at once, so in theory the write throughput is four times what it would be if every write went to the same Region.
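A minimal sketch of the salting step, assuming the four single-letter salts from the example and a "-" separator. The salted_key helper is a made-up name, and the keys are generated only to show how writes spread across the prefixes.

```python
import random
from collections import Counter

SALTS = ["a", "b", "c", "d"]  # assumption: one salt per target Region

def salted_key(row_key, rng=random):
    """Prepend a randomly chosen salt, e.g. 'foo0001' -> 'c-foo0001'."""
    return f"{rng.choice(SALTS)}-{row_key}"

keys = [salted_key(f"foo{i:04d}") for i in range(1, 1001)]
per_salt = Counter(key[0] for key in keys)
print(per_salt)  # roughly 250 keys per salt
```

Because the salt is random, a reader that wants 'foo0001' back does not know which prefix it was written under and has to try all four.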
Pros and cons: because the prefixes are generated randomly, reading the rows back in dictionary order takes more work. Salting therefore increases write throughput at the cost of extra read overhead.
2. Hashing
Hashing computes the hash of the RowKey and concatenates part of the hash string with the original RowKey. The hash can be MD5, SHA-1, SHA-256, SHA-512 or a similar algorithm; it is not limited to Java's built-in hash values.
For example, take the following RowKeys and their hashed forms:
foo0001  ==>  95f18cfoo0001
foo0002  ==>  6ccc20foo0002
foo0003  ==>  b61d00foo0003
foo0004  ==>  1a7475foo0004
We compute the MD5 hash of each RowKey and prepend its first 6 characters to the original RowKey to get the new RowKey shown on the right.
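The construction just described can be sketched with Python's hashlib. The hashed_key helper and its default prefix length are illustrative names, and the sketch makes no claim about reproducing the exact prefixes listed above.

```python
import hashlib

def hashed_key(row_key, prefix_len=6):
    """Prefix row_key with the first prefix_len hex characters of its MD5."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + row_key

for key in ("foo0001", "foo0002", "foo0003", "foo0004"):
    print(key, "==>", hashed_key(key))
```

Unlike a random salt, the prefix is a function of the key, so a reader who knows the original RowKey can recompute the full key and fetch the row directly.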
Pros and cons: hashing spreads the whole data set out to a certain extent, but it is not friendly to Scan, because rows that were adjacent are no longer stored together. Common usage: subString(MD5(device ID), 0, x) + device ID, where x is usually 5 or 6.
3. Reversing
Reversing works by reversing a fixed-length part of the key, or the whole key.
For example, take the following URLs as RowKeys, with their reversed forms:
flink.iteblog.com       ==>  moc.golbeti.knilf
www.iteblog.com         ==>  moc.golbeti.www
carbondata.iteblog.com  ==>  moc.golbeti.atadnobrac
def.iteblog.com         ==>  moc.golbeti.fed
These URLs belong to the same domain, but because their leading parts differ, their rows are not stored together. After reversing them as shown, they share a common prefix and the data for the domain is stored together.
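The URL example can be reproduced in a few lines. This sketch reverses the whole key character by character; reversed_key is an illustrative helper name.

```python
def reversed_key(row_key):
    """Reverse the whole key, character by character."""
    return row_key[::-1]

urls = [
    "flink.iteblog.com",
    "www.iteblog.com",
    "carbondata.iteblog.com",
    "def.iteblog.com",
]

# After reversing, every key starts with "moc.golbeti." and sorts together.
keys = sorted(reversed_key(url) for url in urls)
for key in keys:
    print(key)
```

Reversing is its own inverse, so applying reversed_key to a stored key recovers the original URL.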
Pros and cons: reversing effectively scrambles row keys, but at the expense of their original sort order.
6. Length of the RowKey
A RowKey can be any string, with a maximum length of 64 KB (because the row length field occupies 2 bytes). The shorter the better, for the following reasons:
HFile, the persistence file, stores data as KeyValues. If the rowkey is too long, say 100 bytes, then for 10 million rows the rowkeys alone occupy 100 × 10,000,000 = 1 billion bytes, nearly 1 GB, which greatly reduces the storage efficiency of HFile.
MemStore caches part of the data in memory. If the rowkey is too long, the effective utilization of memory drops and the system can cache less data, which lowers retrieval efficiency.
Most operating systems today are 64-bit with 8-byte memory alignment. Keeping the rowkey at 16 bytes, an integer multiple of 8 bytes, takes advantage of this.
7. Design case analysis
1. RowKey design for a transaction table
Query the transaction records of a seller within a certain period of time:
sellerId + timestamp + orderId
Query the transaction records of a buyer within a certain period of time:
buyerId + timestamp + orderId
Query by order number:
orderNo
If a seller sells a large volume of goods, the first design puts many rows with the same RowKey prefix into the same Region, which causes a hotspot. To avoid this, the RowKey can be designed as salt + sellerId + timestamp, where salt is a random number.
That is, a salt such as a random number is added in front of the original structure so that the data is distributed across different Regions.
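With a random salt in front, a seller's rows are spread over every salt bucket, so a query by sellerId and time range turns into one range scan per bucket. The sketch below models that with a sorted list. The bucket count, the "|" separator, the 13-digit zero-padded timestamp and all helper names are assumptions for illustration, not part of the original design.

```python
import random

NUM_SALTS = 4  # hypothetical number of salt buckets

def make_key(seller_id, timestamp, order_id, rng=random):
    """salt + sellerId + timestamp + orderId, with a fixed-width timestamp."""
    salt = rng.randrange(NUM_SALTS)
    return f"{salt}|{seller_id}|{timestamp:013d}|{order_id}"

def scan_ranges(seller_id, start_ts, end_ts):
    """One (startRow, stopRow) pair per salt bucket; stopRow is exclusive."""
    return [
        (f"{salt}|{seller_id}|{start_ts:013d}",
         f"{salt}|{seller_id}|{end_ts:013d}")
        for salt in range(NUM_SALTS)
    ]

def query(table, seller_id, start_ts, end_ts):
    """Run every per-bucket range against a sorted list of keys."""
    hits = []
    for start, stop in scan_ranges(seller_id, start_ts, end_ts):
        hits.extend(key for key in table if start <= key < stop)
    return hits

table = sorted(make_key("s1", 1000 + i, f"o{i}") for i in range(10))
print(query(table, "s1", 1003, 1006))  # orders o3, o4 and o5
```

The per-bucket scans are independent of each other, so a client could issue them concurrently and merge the results.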
Scenarios that can be supported:
Full table Scan. Because of the salt, the data is distributed across different Regions and the Scan is spread over those Regions, which improves concurrency and retrieval efficiency.
Query by sellerId.
Query by sellerId + timestamp.
2. RowKey design for financial risk control
Query the user profile data of a user:
prefix + uid
prefix + idcard
prefix + tele
Here prefix = substr(md5(uid), 0, x), where x is 5 or 6. uid, idcard and tele are the user's unique identifier, ID card number and mobile phone number, respectively.
3. RowKey design for the Internet of Vehicles
Query the data of a car, such as engine data, within a certain time range:
carId + timestamp
If one batch contains too many cars, this design causes hotspots. In that case use:
prefix + carId + timestamp
where prefix = substr(md5(uid), 0, x)
4. Reverse timestamp (reverse time)
Query a user's latest operation records, or a user's operation records within a certain period of time. The RowKey is designed as follows:
uid + (Long.MAX_VALUE - timestamp)
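Subtracting the timestamp from the largest long value flips the sort order, so newer records get smaller keys. The sketch below shows this with string keys. The 19-digit zero padding, the sample uid and timestamps, and the helper name reverse_ts_key are assumptions; a Java client would more likely append the 8 raw bytes of the long.

```python
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE

def reverse_ts_key(uid, timestamp_ms):
    """uid + (Long.MAX_VALUE - timestamp), zero-padded to 19 digits."""
    return f"{uid}{LONG_MAX - timestamp_ms:019d}"

# Hypothetical event times in milliseconds, deliberately out of order.
events = [1_700_000_000_000, 1_700_000_500_000, 1_700_000_100_000]
keys = sorted(reverse_ts_key("u42", ts) for ts in events)

# The first key in sort order belongs to the most recent event.
newest = LONG_MAX - int(keys[0][len("u42"):])
print(newest)
```

A scan that starts at the uid prefix and stops after N rows therefore returns the N most recent records without reading older ones.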
Supported scenarios:
Query the user's latest operation records:
Scan with startRow = [uid][00000000000] and stopRow = [uid][Long.MAX_VALUE - timestamp]
This returns, for example, the latest 100 records.
Query the user's operation records within a certain period of time:
Scan with startRow = [uid][Long.MAX_VALUE - endTime] and stopRow = [uid][Long.MAX_VALUE - startTime]
Because the timestamp is subtracted, the later endTime yields the smaller key and therefore belongs in startRow.
5. Secondary index
Example: an HBase table has the following structure and data.
Q: how do I find the user with phone = 13111111111?
The RowKey design alone cannot satisfy this kind of requirement. A secondary index is needed: build an index table with phone as the RowKey and uid/name as the columns.
You can code the secondary index yourself if you do not want to rely on a third-party component, or you can create it through Phoenix or Solr.
SQL + OLTP ==> Phoenix
Full-text search + secondary index ==> Solr/ES
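A hand-coded secondary index amounts to a second table that is written alongside the first. The sketch below models both tables as dictionaries. The sample users, the put_user and find_by_phone helpers and the assumption that each phone number belongs to one user are all illustrative.

```python
# Main table: RowKey is uid. Index table: RowKey is phone.
users = {}
phone_index = {}

def put_user(uid, name, phone):
    """Write the data row and its index row together."""
    users[uid] = {"name": name, "phone": phone}
    phone_index[phone] = {"uid": uid, "name": name}

def find_by_phone(phone):
    """Look up the index row first, then fetch the full row by uid."""
    entry = phone_index.get(phone)
    if entry is None:
        return None
    return users[entry["uid"]]

# Hypothetical sample data.
put_user("u001", "alice", "13111111111")
put_user("u002", "bob", "13222222222")
print(find_by_phone("13111111111"))
```

In a real deployment the two writes are separate operations, so the code that maintains the index has to decide what happens when one of them fails.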