

How to Optimize HBase




This article introduces the essentials of HBase optimization. Many people run into these situations in real-world work, so let the editor walk you through how to handle them. I hope you read carefully and come away with something useful!

1. High availability

In HBase, the HMaster monitors the lifecycle of each RegionServer and balances the load across RegionServers. If the HMaster dies, the whole HBase cluster falls into an unhealthy state and cannot keep working for long, so HBase supports a high-availability configuration for the HMaster.

Shut down the HBase cluster (skip this step if it is not running):

[atguigu@hadoop102 hbase]$ bin/stop-hbase.sh

Create a backup-masters file in the conf directory

[atguigu@hadoop102 hbase]$ touch conf/backup-masters

Configure highly available HMaster nodes in the backup-masters file

[atguigu@hadoop102 hbase]$ echo hadoop103 > conf/backup-masters

This sets hadoop103 as the backup master.

Copy the entire conf directory to the other nodes with scp:

[atguigu@hadoop102 hbase]$ scp -r conf/ hadoop103:/opt/module/hbase/
[atguigu@hadoop102 hbase]$ scp -r conf/ hadoop104:/opt/module/hbase/

Open the web UI to test and verify:

http://hadoop102:16010

The election mechanism here refers to ZooKeeper's leader election: when the active HMaster dies, ZooKeeper elects a backup master to take over.
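As a quick check from code, here is a minimal sketch (assuming the same old-style HBaseAdmin API this article uses later) that prints the active master and the configured backup masters:

import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CheckMasters {
    public static void main(String[] args) throws Exception {
        // Connect using the cluster configuration on the classpath
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        ClusterStatus status = admin.getClusterStatus();
        System.out.println("Active master: " + status.getMaster());
        for (ServerName backup : status.getBackupMasters()) {
            System.out.println("Backup master: " + backup);
        }
        admin.close();
    }
}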

2. Pre-partitioning (important)

If no pre-partitioning rule is set, region splits are left entirely to HBase: older versions split a region in two once it reaches 10 GB, while newer versions follow their own split policy. Either way, a new table starts with a single region, so until splits occur all writes land on one region and the data skews. Therefore, set up the partitions when you create the table, sized according to the volume of data and the number of machines. Several pre-partitioning techniques follow.

1. Manually set the pre-partitions:

create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']

The partitions are bounded by negative infinity at the low end and positive infinity at the high end. Keep in mind that rowkeys are compared in string (lexicographic) order: 1512123, for example, falls into the 1000 to 2000 partition. But values like 40 and 400 sort awkwardly as strings, so rowkeys should be kept at a consistent length by padding the high digits with zeros: 0040, 0400, and so on.
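A minimal sketch of such zero-padding (the width of 4 is illustrative):

public class RowKeyPad {
    // Left-pad numeric ids to a fixed width so that string comparison
    // matches numeric order.
    static String pad(long id) {
        return String.format("%04d", id);
    }

    public static void main(String[] args) {
        System.out.println(pad(40));  // 0040, sorts before...
        System.out.println(pad(400)); // 0400
    }
}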

2. Generate a hexadecimal sequence of pre-partitions:

create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

This splits the table on hexadecimal boundaries.

3. Pre-partition according to the rules set in the file

Create the splits.txt file as follows:

aaaa
bbbb
cccc
dddd

Then execute:

create 'staff3','partition3',SPLITS_FILE => 'splits.txt'

It does not matter if the contents of the file are out of order; HBase sorts the split points before creating the regions. For example, this file produces the same partitions:

aaaa
bbbb
dddd
cccc

4. Pre-partition through the API

hAdmin.createTable(tableDesc);                        // create the table with the default single region
hAdmin.createTable(tableDesc, start, end, numsplit);  // split evenly into numsplit regions between start and end
hAdmin.createTable(tableDesc, splitKeys);             // split at the given keys

PS:

What we pass in on the command line is, for example, ['100','200','300'], but the underlying HBase only understands byte arrays, so the split keys are assembled into a two-dimensional byte array, byte[][], with one inner array per key.

// Custom algorithm that generates a series of hash values, stored in a two-dimensional array
byte[][] splitKeys = someHashFunction();
// Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
// Create an HTableDescriptor instance
HTableDescriptor tableDesc = new HTableDescriptor(tableName);
// Create a pre-partitioned HBase table from the descriptor and the two-dimensional array of hash values
hAdmin.createTable(tableDesc, splitKeys);
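The "hash value function" above is left abstract. A minimal sketch of one possible implementation, assuming MD5-based keys (an illustration, not the article's own algorithm); rowkeys would need to be hashed the same way at write time so they land in these regions:

import java.security.MessageDigest;
import java.util.TreeSet;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitKeyGen {
    // Hash sequential sample ids with MD5 and use the sorted hex digests
    // as evenly spread region boundaries.
    public static byte[][] makeSplitKeys(int numRegions) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        TreeSet<byte[]> keys = new TreeSet<>(Bytes.BYTES_COMPARATOR);
        for (int i = 1; i < numRegions; i++) {
            byte[] digest = md5.digest(Bytes.toBytes(String.valueOf(i)));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            keys.add(Bytes.toBytes(hex.toString()));
        }
        return keys.toArray(new byte[0][]);
    }
}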

3. RowKey design (important)

A piece of data is uniquely identified by its rowkey, and the partition a piece of data lands in depends on which pre-partition its rowkey falls into. The main goal of rowkey design is to spread data evenly across all regions (hashing, uniqueness, and consistent length; in production a rowkey may even be 70 to 100 bytes long), which prevents data skew to a certain extent. Below are the common rowkey design schemes.

1. Generate random numbers, hashes, or hash values

The original rowkey 1001 becomes, after SHA1: dd01903921ea24941c26a48f2cec24e0bb0e8cc7
The original rowkey 3001 becomes, after SHA1: 49042c54de64a1e9bf0b33e00245660ef92dc7bd
The original rowkey 5001 becomes, after SHA1: 7b61dec07e02c188790670af43e717f0f46e8913

Before doing this, we usually sample the dataset to decide which hashed rowkeys to use as the boundary values for the partitions.
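A minimal sketch of the SHA1 hashing shown above, using the JDK's MessageDigest:

import java.security.MessageDigest;

public class Sha1RowKey {
    static String sha1(String rowKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(rowKey.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // two hex chars per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Compare with the article's example for the original rowkey 1001
        System.out.println(sha1("1001"));
    }
}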

2. String reversal (timestamp reversal)

20170524000001 becomes 10000042507102
20170524000002 becomes 20000042507102

To some extent, this also hashes data that arrives via put in sequential order.
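A minimal sketch of the reversal; the outputs match the examples above:

public class ReverseKey {
    static String reverse(String ts) {
        // The fast-changing low digits move to the front of the rowkey,
        // spreading sequential writes across partitions.
        return new StringBuilder(ts).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("20170524000001")); // 10000042507102
        System.out.println(reverse("20170524000002")); // 20000042507102
    }
}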

3. String concatenation

20170524000001_a12e
20170524000001_93i7

The principles are the same: hashing, uniqueness, and consistent length. Beyond that, the rowkey is designed around the actual business requirements.

For example, suppose we pre-create 300 partitions, with partition keys 000_, 001_, ..., 298_. For a table keyed by phone number, compute hash(phone number + year/month) % 299 to get the partition number, then build the rowkey as partitionNumber_phoneNumber_yearMonth, e.g. 000_phoneNumber_yearMonth. In other words, plan the partitions first, then design the rowkey so the important data lands in a predictable partition.
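A minimal sketch of this scheme (the helper name and sample phone number are illustrative, and hashCode stands in for whatever hash the business chooses):

public class PhoneRowKey {
    static String rowKey(String phone, String yearMonth) {
        // Partition number in [0, 298], matching the 299 partition keys
        int partition = Math.floorMod((phone + yearMonth).hashCode(), 299);
        return String.format("%03d_%s_%s", partition, phone, yearMonth);
    }

    public static void main(String[] args) {
        // All records for one phone number in one month land in the same,
        // computable partition.
        System.out.println(rowKey("13812340000", "201705"));
    }
}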

4. Memory optimization

HBase operation requires a lot of memory, since tables are cached in memory; generally 70% of the available memory is allocated to HBase's Java heap. However, a very large heap is not recommended, because if a GC pause lasts too long, the RegionServer will be unavailable for a long stretch (and RegionServer-level flushes can pile up in the meantime). Generally 16 to 48 GB of memory is fine. Also note that if the framework takes too much memory and the system itself runs out, the framework will be dragged down along with the system services.

5. Basic optimization

Allow content to be appended to HDFS files

hdfs-site.xml and hbase-site.xml property: dfs.support.append. Explanation: enabling HDFS append synchronization works well with HBase's data synchronization and persistence. The default is true.

Optimize the maximum number of files a DataNode is allowed to open

hdfs-site.xml property: dfs.datanode.max.transfer.threads. Explanation: HBase generally operates on a large number of files at the same time; set this to 4096 or higher depending on cluster size and data activity. Default value: 4096.

Optimize the waiting time for data operations with high latency

hdfs-site.xml property: dfs.image.transfer.timeout. Explanation: if a data operation's latency is very high and the socket needs to wait longer, it is recommended to raise this value (the default is 60000 milliseconds) so the socket is not dropped by a timeout.

Optimize data write efficiency

mapred-site.xml properties: mapreduce.map.output.compress and mapreduce.map.output.compress.codec. Explanation: enabling these two settings can greatly improve file write efficiency and reduce write time. Set the first property to true and the second to org.apache.hadoop.io.compress.GzipCodec or another compression codec.
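If it is more convenient, the same two properties can be set per job through the Hadoop Configuration API; a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class MapOutputCompression {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Equivalent to the two mapred-site.xml properties above
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                GzipCodec.class, CompressionCodec.class);
        System.out.println(conf.get("mapreduce.map.output.compress.codec"));
    }
}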

Set the number of RPC listeners

hbase-site.xml property: hbase.regionserver.handler.count. Explanation: the default value is 30; it specifies the number of RPC listeners and can be adjusted according to the client request volume. Increase it when there are many read and write requests.

Optimize HStore file size

hbase-site.xml property: hbase.hregion.max.filesize. Explanation: the default is 10737418240 (10 GB). If you need to run HBase MR tasks, you can reduce this value, because one region corresponds to one map task; if a single region is too large, the map task takes too long to execute. The value means that once a region's HFiles reach this size, the region is split in two.

Optimize the HBase client cache

hbase-site.xml property: hbase.client.write.buffer. Explanation: specifies the HBase client cache. Increasing it reduces the number of RPC calls but consumes more memory, and vice versa. Generally, some cache is configured in order to reduce the number of RPCs.
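A minimal sketch of raising this value from client code; the 8 MB figure is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientWriteBuffer {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Larger buffer: fewer RPCs, more client memory
        conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024);
        System.out.println(conf.getLong("hbase.client.write.buffer", -1));
    }
}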

Specify the number of rows fetched by scan.next when scanning HBase

hbase-site.xml property: hbase.client.scanner.caching. Explanation: specifies the default number of rows fetched by the scan.next method. The larger the value, the more memory consumed.
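The caching can also be set per scan from client code instead of cluster-wide; a minimal sketch, with 500 rows as an illustrative value:

import org.apache.hadoop.hbase.client.Scan;

public class ScanCaching {
    public static void main(String[] args) {
        Scan scan = new Scan();
        scan.setCaching(500); // rows fetched per scanner round trip
        System.out.println(scan.getCaching());
    }
}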

Flush, compact, split mechanism

When a MemStore reaches its threshold, flush writes the data in the MemStore to a StoreFile; compact merges the small files produced by flushes into larger StoreFile files; split divides an oversized region in two once it passes the threshold.

Properties involved:

hbase.hregion.memstore.flush.size: 134217728

That is, 128 MB is the default MemStore flush threshold.

The purpose of this parameter is to flush all of an HRegion's MemStores once the sum of their sizes exceeds the specified value. The RegionServer handles flush requests asynchronously through a queue, in a producer-consumer model. The problem is that when the queue cannot keep up and requests pile up, memory can spike sharply; in the worst case, an OOM is triggered.

hbase.regionserver.global.memstore.upperLimit: 0.4
hbase.regionserver.global.memstore.lowerLimit: 0.38

That is, when the total memory used by all MemStores reaches the value specified by hbase.regionserver.global.memstore.upperLimit, multiple MemStores are flushed to files, in descending order of size, until the memory used by MemStores drops slightly below the lowerLimit.
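A quick worked example (the heap size is assumed for illustration): with a 16 GB RegionServer heap, flushing begins once the MemStores together reach 0.4 × 16 GB = 6.4 GB and proceeds, largest MemStore first, until usage drops just below 0.38 × 16 GB, about 6.08 GB.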

That's all for "How to Optimize HBase". Thank you for reading. If you want to learn more about the industry, follow the site; the editor will keep publishing practical, high-quality articles for you!
