How to achieve performance tuning in HBase

This article explains how to tune HBase performance in some detail; readers interested in the topic should find it a useful reference.

Node management

Node offline

You can stop RegionServer by running the following script on a specific node of HBase:

$ ./bin/hbase-daemon.sh stop regionserver

The RegionServer first closes all of its regions and then shuts itself down. While it is stopping, its node in ZooKeeper expires; the Master notices that the RegionServer is gone and treats it like a crashed server, reassigning its regions to other nodes.

Stop Load Balancer before going offline

If the load balancer is running while a node is being shut down, the balancer and the Master's recovery of the RegionServer being decommissioned may compete over region placement. To avoid this, stop the load balancer first; see Load Balancer below.

One drawback of taking a RegionServer offline this way is that its regions stay offline for quite a while, since regions are closed sequentially. If the server hosts many regions, from the moment the first region goes offline until the Master confirms the server is dead and the last region comes back online, the whole process can take a long time. HBase 0.90.2 added a feature that lets a node gradually shed its load and then shut down: the graceful_stop.sh script, which is used like this:

$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop

To take a RegionServer offline, run:

$ ./bin/graceful_stop.sh HOSTNAME

Here HOSTNAME is the host of the RegionServer you want to take offline.

The HOSTNAME passed to graceful_stop.sh must be exactly the hostname that HBase uses to identify the RegionServer. Check the RegionServer's id in the Master's UI: it is usually the hostname, but it may also be an FQDN. Whichever form HBase uses, pass that to the graceful_stop.sh script; it currently cannot infer the hostname from an IP address, so if you pass an IP you will be told the server is not running and nothing will be taken offline.

The graceful_stop.sh script moves regions off the RegionServer one at a time to reduce its load: it unloads one region, assigns it to another server, then moves on to the next, until all regions are gone. Finally it stops the RegionServer. The Master notices that the RegionServer is offline, but since all regions have already been redeployed, the RegionServer ends cleanly and there are no WAL logs to split.

Load Balancer: turn off the region load balancer when executing the graceful_stop script (otherwise the balancer and the decommissioning script will conflict over region placement):

hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds

The command above turns off the balancer; to turn it back on:

hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds

Rolling restart: you can also have the script restart a RegionServer without changing the location of its regions, which preserves data locality. To do so, restart the servers one after another, like this:

$ for i in `cat conf/regionservers | sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &

Tail /tmp/log.txt to watch how the script runs. The loop above only operates on RegionServers; make sure the load balancer is turned off, and the Master needs to be updated beforehand. Here is a pseudo-procedure for a rolling restart that you can learn from:

Confirm your version and make sure the configuration has been rsynced to the entire cluster. If the version is 0.90.2, apply the patches HBASE-3744 and HBASE-3756.

Run hbck to make sure your cluster is consistent

$ ./bin/hbase hbck

If inconsistencies are found, see http://my.oschina.net/drl/blog/683885 for how to fix them.

Restart Master:

$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master

Turn off region balancer:

$echo "balance_switch false" |. / bin/hbase

Run graceful_stop.sh on each RegionServer:

$ for i in `cat conf/regionservers | sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &

If Thrift or REST servers are also running on the RegionServer, add the --thrift or --rest option as well (see the graceful_stop.sh usage above).

Restart the Master again. This clears the dead-server list and re-enables the balancer.

Run hbck again to make sure the cluster is consistent.

Configuration optimization

zookeeper.session.timeout

Default: 3 minutes (180000ms)

Description: the session timeout between a RegionServer and ZooKeeper. When it expires, the RegionServer is removed from the cluster's RS list by ZooKeeper; once the HMaster receives the removal notification, it rebalances the regions that server was responsible for so that other surviving RegionServers take them over.

Tuning: this timeout determines how quickly a RegionServer can fail over. Setting it to one minute or less shortens the failover delay caused by waiting for the timeout. However, for some online applications the time from a RegionServer outage to recovery is very short (a network blip, a crash with quick operator intervention, and so on); lowering the timeout can then do more harm than good, because as soon as the RegionServer is officially removed from the cluster, the HMaster starts rebalancing (having other RSs recover from the WALs recorded by the failed machine). When the faulty RS is then brought back manually, that balancing work is pointless; it only makes the load uneven and adds burden to the other RSs, especially in scenarios where regions are pinned to fixed servers.

hbase.regionserver.handler.count

Default value: 10

Description: the number of I/O threads a RegionServer uses to handle requests.

Tuning: this parameter is closely tied to memory. Fewer I/O threads suit scenarios where a single request consumes a lot of memory (a large single Put, or a Scan with a large cache; both count as "big puts") or where the RegionServer's memory is tight. More I/O threads suit scenarios with low per-request memory consumption and high TPS requirements. When setting this value, use memory monitoring as the main reference. Note that if a server hosts few regions and a large number of requests land on one region, the memstore fills quickly and triggers flushes, and the resulting read and write locks will hurt global TPS; a higher I/O thread count is not always better. During load testing, enable RPC-level logging so you can monitor the memory consumption and GC behaviour of each request, and adjust the thread count based on the results of several rounds of tests. As a data point, the author of "Hadoop and HBase Optimization for Read Intensive Search Applications" sets the I/O thread count to 100 on an SSD machine, for reference only.

hbase.hregion.max.filesize

Default value: 256m

Description: the maximum size of a single region on the current RegionServer. When a region grows beyond this value, it is automatically split into two smaller regions.

Tuning: small regions are friendly to split and compaction, because splitting the storefiles of a region or compacting a small region is fast and uses little memory. The downside is that splits and compactions become frequent; in particular, a large number of small regions constantly splitting and compacting makes cluster response times fluctuate, and too many regions is both a management headache and a trigger for some HBase bugs. Generally, anything under 512 MB counts as a small region.

Large regions are not suited to frequent splits and compactions, because a single compact or split causes a long pause and noticeably hurts the application's read and write performance. In addition, a large region means large storefiles, and compacting them is also a memory challenge. Large regions do have their place, though: if traffic in your application has a clear quiet period, compacting and splitting during that window completes the work while keeping reads and writes smooth most of the time.

Since splits and compactions affect performance so much, is there a way to avoid them? Compaction cannot be avoided, but split can be turned from automatic into manual. Raising this parameter to a value that is hard to reach, such as 100 GB, effectively disables automatic splitting (the RegionServer will not split regions that have not reached 100 GB). Combined with the RegionSplitter tool, you then split manually when splitting is actually needed. Manual splitting is more flexible and stable than automatic splitting, and the management overhead barely grows, so it is recommended for online real-time systems.

In terms of memory, a small region gives more flexibility in setting the memstore size, while with a large region the memstore can be neither too large nor too small: too large and frequent flushes increase the application's IO wait, too small and too many storefiles hurt read performance.

hbase.regionserver.global.memstore.upperLimit / lowerLimit

Default values: 0.4 / 0.35

upperLimit description: the parameter hbase.hregion.memstore.flush.size flushes all memstores of a region when their combined size exceeds the configured value. The RegionServer handles flush requests asynchronously through a queue, in a producer/consumer pattern. If the queue cannot keep up, requests pile up, memory can rise sharply, and in the worst case an OOM is triggered. upperLimit exists to cap this memory use: when the total memory used by the memstores of all regions on a RegionServer reaches 40% of the heap, HBase blocks all updates and flushes these regions to release the memory held by the memstores.

lowerLimit description: same as upperLimit, except that when the memory used by all memstores reaches 35% of the heap, lowerLimit does not flush all memstores. Instead it finds the region whose memstores use the most memory and flushes just that one, while write updates are still blocked. lowerLimit is a remedy applied before all regions are force-flushed and performance degrades badly. In the logs it appears as "** Flush thread woke up with memory above low water."

Tuning: this is a heap-protection parameter, and the defaults suit most scenarios. Adjusting it affects both reads and writes. If write pressure often exceeds this threshold, lower the read cache hfile.block.cache.size and raise the threshold, or leave the read cache alone if there is plenty of heap headroom. If the threshold is not exceeded even under high pressure, consider lowering it a little, verify with load tests that it does not trigger too often, and then raise hfile.block.cache.size to improve read performance with the extra heap headroom. Another possibility is that hbase.hregion.memstore.flush.size is unchanged but the RS hosts too many regions; bear in mind that the region count directly affects how much memory is consumed.

hfile.block.cache.size

Default value: 0.2

Description: the percentage of the heap used as the storefile read cache; 0.2 means 20%. This value directly affects read performance.

Tuning: bigger is of course better for reads. If writes are much rarer than reads, going up to 0.4-0.5 is fine; if reads and writes are roughly balanced, about 0.3; if writes outnumber reads, just keep the default. When setting this value, also look at hbase.regionserver.global.memstore.upperLimit, the maximum percentage of heap that memstores may use: one of the two parameters affects reads, the other writes. If the two add up to more than 80-90%, there is a risk of OOM, so set them carefully.

hbase.hstore.blockingStoreFiles

Default value: 7

Description: at flush time, if a region's Store (column family) has more than 7 storefiles, all write requests to that region are blocked and a compaction is forced to reduce the number of storefiles.

Tuning: blocked writes seriously affect the response time of the RegionServer, but too many storefiles also hurt read performance. From a practical standpoint, to obtain smoother response times you can set this value effectively to infinity; if you can tolerate large peaks and troughs in response time, keep the default or tune it to your own workload.

hbase.hregion.memstore.block.multiplier

Default value: 2

Description: when the memstores of a region occupy more than hbase.hregion.memstore.flush.size times this multiplier, all requests to that region are blocked, a flush is triggered, and memory is released. Although we cap the total memstore memory of a region (say at 64 MB), imagine that at 63.9 MB a 200 MB Put arrives: the memstore size suddenly balloons to several times the expected flush size. The job of this parameter is that once the memstore grows beyond 2x hbase.hregion.memstore.flush.size, all requests are blocked to contain the risk of further growth.

Tuning: the default is fairly safe. If you estimate that your normal workload (excluding anomalies) will not see sudden write spikes, or the write volume is under control, keep the default. If write requests regularly soar to several times the normal level, raise this multiplier and adjust other parameters (such as hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit/lowerLimit) to leave more memory headroom and prevent an OOM on the HBase server.

hbase.hregion.memstore.mslab.enabled

Default value: true

Description: reduces Full GCs caused by memory fragmentation (MSLAB, the MemStore-Local Allocation Buffer) and improves overall performance.

Tuning: see http://kenwublog.com/avoid-full-gc-in-hbase-using-arena-allocation for details

Other

Enable LZO compression

LZO compresses and decompresses faster than GZip, while GZip achieves a better compression ratio; see Using LZO Compression for details. For developers who want better HBase read and write performance, LZO is the better choice; for developers who care most about storage space, keeping the default is recommended.

Don't define too many Column Families in one table

HBase currently does not handle tables with more than two or three column families well. When one CF is flushed, its neighbouring CFs in the same region are also flushed because of the association effect, which ultimately causes more IO in the system.

Batch import

Before bulk-importing data into HBase, you can balance the load by pre-creating regions. See Table Creation: Pre-Creating Regions for details; a rough sketch follows.
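
As an illustration only, here is a minimal Java sketch of pre-creating regions with the 0.9x-era HBaseAdmin API; the table name "testtable", the family "f" and the split points are hypothetical and should be derived from your own rowkey distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("testtable");
        desc.addFamily(new HColumnDescriptor("f"));   // short, single column family
        // Hypothetical split points: the table starts with splits.length + 1 regions,
        // so a bulk import is spread across RegionServers instead of hitting one region.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"), Bytes.toBytes("8")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}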

Avoid CMS concurrent mode failure

HBase uses the CMS garbage collector, which by default starts collecting when the old generation is 90% full; the percentage is set with -XX:CMSInitiatingOccupancyFraction=N. Concurrent mode failure happens like this: CMS starts a concurrent collection when the old generation reaches 90%, while the young generation keeps promoting objects rapidly; before the concurrent marking finishes, the old generation fills up, and the tragedy occurs. Because no memory is available, CMS has to abandon the concurrent work, trigger a stop-the-world pause (all JVM threads suspended), and clean up the garbage with a single-threaded collection, which takes a long time. To avoid concurrent mode failure, it is recommended to trigger the GC before the old generation reaches 90%,

by setting -XX:CMSInitiatingOccupancyFraction=N.

The percentage can be estimated simply: if hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit add up to 60% of the heap (the defaults), you can set N to 70-80, generally about 10% higher than the sum.

HBase client optimization

AutoFlush

Set setAutoFlush to false on HTable to enable batched updates on the client: Puts are sent to the server only once the client write buffer fills up. The default is true.
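
A minimal sketch with the old HTable API (table and family names and row contents are placeholders; newer clients do the same thing with BufferedMutator):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");
        table.setAutoFlush(false);                    // buffer Puts on the client side
        table.setWriteBufferSize(6 * 1024 * 1024);    // optional: 6 MB client write buffer
        for (int i = 0; i < 10000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
            table.put(put);                           // queued locally until the buffer fills
        }
        table.flushCommits();                         // push whatever is still buffered
        table.close();
    }
}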

Scan Caching

How many rows the scanner caches per round trip, i.e. how many rows are fetched from the server at a time during a scan. The default is 1, a single row per trip.
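
For example (a fragment; the value 100 is only an illustration and should be sized to your row size and memory):

import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
scan.setCaching(100);   // fetch 100 rows per RPC instead of the default 1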

Scan Attribute Selection

When scanning, specify the column families you actually need to reduce traffic; otherwise the scan returns all the data of the entire row (all column families) by default.

Close ResultScanners

After fetching data with a scan, remember to close the ResultScanner; otherwise the RegionServer may run into problems because the corresponding server-side resources cannot be released.

Optimal Loading of Row Keys

When you scan a table and only need the row keys (no column families, qualifiers, values or timestamps) in the result, you can add a FilterList with the MUST_PASS_ALL operator to the Scan instance and put a FirstKeyOnlyFilter or KeyOnlyFilter into the FilterList. This reduces network traffic.
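
Pulling the last few scan-related points together, a hedged sketch against the 0.9x-era client API ("testtable" and family "f" are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyOnlyScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");
        Scan scan = new Scan();
        scan.setCaching(100);                        // scan caching
        scan.addFamily(Bytes.toBytes("f"));          // attribute selection: only the needed CF
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new FirstKeyOnlyFilter()); // only the first KeyValue of each row
        filters.addFilter(new KeyOnlyFilter());      // strip the values, keep the keys
        scan.setFilter(filters);
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                byte[] rowKey = r.getRow();          // ... use the row key ...
            }
        } finally {
            scanner.close();                         // always close the ResultScanner
        }
        table.close();
    }
}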

Turn off WAL on Puts

When putting data that is not critical, you can call writeToWAL(false) to squeeze out more write performance: the Put then skips writing the WAL. The risk is that if the RegionServer goes down, the data you just put may be lost and cannot be recovered.
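
Roughly like this with the 0.9x-era API (a fragment: "table" is an HTable as in the earlier sketch, and the row and value are placeholders; newer versions express the same thing as setDurability(Durability.SKIP_WAL)):

Put put = new Put(Bytes.toBytes("unimportant-row"));
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
put.setWriteToWAL(false);   // skip the WAL: faster, but lost if the RS dies before the memstore is flushed
table.put(put);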

Enable Bloom Filter

A Bloom filter trades space for time and improves read performance.
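
Bloom filters are enabled per column family when the table is created or altered. A sketch against the 0.92/0.94 API, where the enum lives in StoreFile.BloomType (newer releases moved it to org.apache.hadoop.hbase.regionserver.BloomType):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.StoreFile;

HColumnDescriptor family = new HColumnDescriptor("f");
family.setBloomFilterType(StoreFile.BloomType.ROW);   // ROW for row lookups, ROWCOL for row+column lookups
// add the descriptor to an HTableDescriptor and create or alter the table as usual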

major_compact 'testtable'

Usually a production environment disables automatic major compaction (hbase.hregion.majorcompaction set to 0 in the configuration file) and runs major_compact manually in a low-traffic window at night. If HBase updates are not too frequent, you can major_compact every table once a week. After a major_compact, keep an eye on the storefile count: when it grows back to roughly twice the count measured right after the major_compact, you can major_compact all tables again. The operation takes a long time, so try to avoid peak hours.
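
Besides the shell command above, the manual compaction can also be kicked off from Java through HBaseAdmin (a sketch; the request is asynchronous, so the call returns before the compaction finishes):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
admin.majorCompact("testtable");   // queues a major compaction for every region of the table
admin.close();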

Migration of hbase

1. CopyTable mode

bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=zookeeper1,zookeeper2,zookeeper3:/hbase 'testtable'

Versions prior to 0.92 do not support copying multiple versions, while 0.94 already does. This job also requires a conf/mapred-site.xml under the hbase conf directory; you can copy the one from hadoop.

2. Export/Import

bin/hbase org.apache.hadoop.hbase.mapreduce.Export testtable /user/testtable [versions] [starttime] [stoptime]
bin/hbase org.apache.hadoop.hbase.mapreduce.Import testtable /user/testtable

For cross-version migration this is a good choice: CopyTable does not support multiple versions, while Export does, which makes it more practical.

3. Directly copy the corresponding HDFS files

First copy the hdfs files, e.g. bin/hadoop distcp hdfs://srcnamenode:9000/hbase/testtable/ hdfs://distnamenode:9000/hbase/testtable/, and then execute bin/hbase org.jruby.Main bin/add_table.rb /hbase/testtable on the destination hbase.

After the meta information has been regenerated, restart hbase; this is the simplest way. Before the operation you can stop writes to hbase and flush all tables (as described above), then run the distcp copy. If the hadoop versions are inconsistent, you can use the hftp interface. I recommend this method, as the cost is low.

JVM adjustment:

A. Memory size: the master defaults to 1 GB and can be increased to 2 GB; a RegionServer defaults to 1 GB and can be raised to 10 GB or even more; ZK consumes few resources and does not need adjustment. Note that after increasing the RS heap you also need to adjust hbase.regionserver.maxlogs and hbase.regionserver.hlog.blocksize: the maximum amount of WAL is hbase.regionserver.maxlogs * hbase.regionserver.hlog.blocksize (by default 32 * 32 MB, roughly 1 GB), and once that value is reached a memstore flush is triggered. If you enlarge the memstore memory without adjusting these two parameters, you actually gain nothing against a large number of small files. Adjustment strategy: set hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs slightly larger than hbase.regionserver.global.memstore.lowerLimit * HBASE_HEAPSIZE.

B. Garbage collection: export HBASE_OPTS="$HBASE_OPTS -Xms30720m -Xmx30720m -XX:NewSize=512m -XX:MaxPermSize=128m -XX:PermSize=128m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/local/hbase-0.94.0/logs/gc-hbase.log"

Other tuning

1) Keep column family names and rowkeys as short as possible: every cell stores the column family name and the rowkey once, and even column names should be kept short.

2) Number of regions per RS: generally a RegionServer should not manage more than 1000 regions. Too many regions produce more small files and therefore more compactions. When there are many regions over 5 GB and the total region count on an RS reaches 1000, consider expanding the cluster.

3) When creating tables: a. if multiple versions are not needed, set VERSIONS=1; b. enable lzo or snappy compression: compression costs some CPU, but disk IO and network IO improve greatly, with a compression ratio of roughly 4-5x; c. design rowkeys sensibly, which requires a full understanding of the existing business and a reasonable prediction of future business, since a badly designed rowkey leads to very poor hbase performance; d. plan the data volume and pre-partition the table, to avoid constant splitting while the table is in use and to spread reads and writes across different RSs, making full use of the cluster; e. keep column family names as short as possible, such as "f", and use a single column family where possible; f. enable bloomfilter where the workload benefits, to optimize read performance. A sketch of some of these settings follows.
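
A hedged sketch of points a, b, e and f against the 0.9x-era API (Compression.Algorithm and StoreFile.BloomType moved to other packages in later releases); combine it with the pre-split sketch shown earlier for point d:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.regionserver.StoreFile;

HTableDescriptor desc = new HTableDescriptor("t1");          // hypothetical table name
HColumnDescriptor f = new HColumnDescriptor("f");            // e: single, short column family
f.setMaxVersions(1);                                         // a: keep only one version
f.setCompressionType(Compression.Algorithm.SNAPPY);          // b: snappy (or LZO) compression
f.setBloomFilterType(StoreFile.BloomType.ROW);               // f: bloom filter where reads benefit
desc.addFamily(f);
// pass desc plus split keys to HBaseAdmin.createTable(...) as in the pre-split sketch (point d)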

Client-side tuning

1. hbase.client.write.buffer: the client write buffer size, 2 MB by default; around 6 MB is a recommended setting (the unit is bytes). Bigger is not always better: an oversized buffer just consumes too much memory.

2. hbase.client.scanner.caching: the scan cache; the default of 1 is too small and should be configured according to the workload. It should not be too large either, to avoid using too much memory on the client and the RS; a few hundred is usually the upper bound. If individual rows are large, use a smaller value. A good rule of thumb is the number of rows a typical business query returns: for example, if the business fetches at most 50 rows at a time, set it to 50.
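
Both settings are client-side, so they can go into the client's hbase-site.xml or be set programmatically on the client Configuration; a sketch with the values suggested above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.client.write.buffer", 6 * 1024 * 1024);  // 6 MB client write buffer (bytes)
conf.setInt("hbase.client.scanner.caching", 50);             // e.g. the business reads at most 50 rows per query
// build HTable / HBaseAdmin instances from this conf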

3. Set a reasonable timeout and retry times, which will be explained in detail in the following blog.

4. Separate reads and writes in the client application, running them in different tomcat instances. Writes go first to a redis queue and are then written asynchronously to hbase; if a write fails, it is pushed back onto the redis queue. Reads first check the redis cache (note: a redis cache, not the redis queue) and fall back to hbase only when nothing is cached. When the hbase cluster, or a single RS, is unavailable, the relatively large retry count and timeout of the HBase client (which cannot be set very low without hurting normal access) mean that a single read or write may last tens of seconds or even minutes after retries and timeouts, occupying a tomcat thread for a long time. Tomcat's thread pool is limited and fills up quickly, leaving no free threads for other work. With reads and writes separated, writes go through the redis queue and are flushed to hbase asynchronously, so tomcat threads never fill up because of writes and the application can keep accepting them; for businesses such as top-ups, no revenue is lost. Reads may still occupy tomcat threads for longer, but if operations staff intervene promptly the impact on the read service is also relatively limited.

5. If org.apache.hadoop.hbase.client.HBaseAdmin is configured as a Spring bean, configure it to load lazily, so that a Master that cannot be reached at startup does not make the application fail to start and block later degradation operations.
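
In XML configuration this is lazy-init="true" on the bean; with Java-based Spring configuration the equivalent is @Lazy, roughly like this (a sketch, assuming the 0.9x HBaseAdmin constructor):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Lazy;

@Configuration
public class HBaseBeans {
    @Bean
    @Lazy   // not instantiated until first used, so a down Master cannot block application startup
    public HBaseAdmin hbaseAdmin() throws IOException {
        return new HBaseAdmin(HBaseConfiguration.create());
    }
}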

6. Scan query programming optimization:

1) tune caching appropriately;
2) for full-table scans or periodic batch jobs, call setCacheBlocks(false) on the Scan to avoid filling the block cache with data that will not be read again;
3) close the scanner when done, to avoid wasting client and server memory;
4) limit the scan scope: specify the column families, or the exact columns, you need;
5) if you only need the row keys, KeyOnlyFilter greatly reduces network traffic.

7. For online real-time systems with strict response-time requirements, wrap the hbase client api: run gets and similar operations through a thread pool, set a timeout with Future.get(), and fall back to other logic when nothing returns in time, achieving fast failure.
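
A minimal sketch of such a wrapper (pool size, timeout and fallback are placeholders; note that HTable instances are not thread-safe, so give each worker thread its own table or use HTablePool):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class FastFailGet {
    private final ExecutorService pool = Executors.newFixedThreadPool(20);   // hypothetical pool size

    public Result getWithTimeout(final HTable table, final Get get, long timeoutMs) throws Exception {
        Future<Result> future = pool.submit(new Callable<Result>() {
            public Result call() throws Exception {
                return table.get(get);
            }
        });
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);   // wait only this long
        } catch (TimeoutException e) {
            future.cancel(true);   // give up on HBase for this request
            return null;           // caller falls back to other logic: fast failure
        }
    }
}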

ZooKeeper, the coordinator that hbase depends on, and HDFS, where the data is stored, also need tuning:

ZK tuning:

1. zookeeper.session.timeout: the default is 3 minutes. Do not set it too short, or session timeouts will take hbase out of service (an online production environment configured with 1 minute had hbase stop serving twice); do not set it too long either, or when an RS dies, zk will not notice quickly and the master cannot migrate its regions in time.

2. Number of zookeeper nodes: at least 5. Give each zookeeper about 1 GB of memory and, ideally, its own disk, which shields zookeeper from other IO. On a heavily loaded cluster, do not run Zookeeper and RegionServer on the same machine, just as with DataNodes and TaskTrackers. The service stays up only while more than half of the zk nodes are up: with 5 nodes, at most 2 may be down; configuring 4 is no better than 3, since either way at most 1 may be down.

3. hbase.zookeeper.property.maxClientCnxns: the zk connection limit, 300 by default, to be adjusted according to the size of the cluster.

HDFS tuning:

dfs.name.dir: the namenode's data storage paths; they can be placed on different disks, and one can be an NFS remote filesystem, so the namenode metadata has multiple copies.

dfs.data.dir: the datanode's data storage paths; configure one path per disk to greatly improve parallel read and write capability.

dfs.namenode.handler.count: the number of RPC handler threads on the namenode; the default of 10 should be increased, for example to 60.

dfs.datanode.handler.count: the number of RPC handler threads on the datanode; the default of 3 should be increased, for example to 20.

dfs.datanode.max.xcievers: the maximum number of files a datanode handles at the same time; the default of 256 should be increased, for example to 8192.

dfs.block.size: the HDFS block size, 64 MB by default. If the stored files are large you can raise it, for example to 128 MB when using hbase. Note that the unit is bytes.

dfs.balance.bandwidthPerSec: limits the transfer rate used when balancing with start-balancer.sh; the default of 1 MB/s can be raised to tens of MB/s, for example 20 MB/s.

dfs.datanode.du.reserved: the free space reserved on each disk for non-hdfs files; the default is 0.

dfs.datanode.failed.volumes.tolerated: the number of failed disks a datanode tolerates at startup before shutting itself down; the default of 0 means a single broken disk brings the datanode down, and usually needs no adjustment.
