This article is about how to optimize HBase. The editor thinks it is very practical, so it is shared here as a reference; follow along and have a look.
First, software and hardware optimization:

1. Configure memory and CPU: HBase's LSM-tree structure, caching mechanism, and write-ahead logging all consume a lot of memory, so the more memory the better. Scenarios such as filtering, data compression, and multi-condition combination scans are CPU-intensive, so the CPU should be strong enough as well.

2. Operating system: choose a mainstream Linux distribution, and for the JVM the 64-bit Sun HotSpot is recommended, which brings out the best performance of Hadoop. Mount disks with noatime: in general, database disks without special requirements should be mounted noatime to improve performance. Close the system swap area: repeated swapping of Linux memory hurts JVM performance; a typical symptom is ZooKeeper timeouts, so vm.swappiness should be set low.

3. Network: HDFS places high demands on cluster network throughput, so the network must guarantee low latency and high throughput. Add rack awareness, which improves the data locality of Hadoop reads and writes, by configuring topology.script.file.name in core-site.xml.

4. JVM optimization: according to many mature references, a garbage-collector combination verified to work well is CMS plus ParNew.

Second, the main topic: optimizing HBase itself.

1. HBase query optimization (a read-path sketch follows these lists):
a. Set the scan cache: call setCaching on the Scan to set the cache size.
b. Specify the required columns: call addColumn on the Scan to request only the needed columns and reduce data transfer.
c. Disable block caching when doing a batch full-table scan, because each record is read only once in a full scan.
d. Optimize row-key-only queries: if a full-table scan needs only the row keys, use filters to reduce the amount of data returned by the server.
e. Access tables through HTablePool: the HTable object is not thread-safe for client reads and writes, so when multithreading, a separate HTable instance has to be created per thread. HTablePool's pooling mechanism solves the thread-safety problem while maintaining a fixed number of HTable instances.
f. Use batch reads: HTable.get(List).
g. Use a Coprocessor to count rows: see the coprocessor principle.
h. Cache query results, for frequently repeated query scenarios.

2. HBase write optimization (a write-path sketch also follows):
a. Close the WAL: if you can tolerate a certain risk of data loss, the write-ahead log can be turned off.
b. Set AutoFlush: turn this function off so that puts are buffered on the client before being submitted to the server.
c. Pre-create regions: pre-splitting prevents the performance hit of a region reaching its size threshold and splitting during writes; the principle is the same as MongoDB's pre-sharding.
d. Delay WAL flushes: if the WAL is on, you can increase the interval between WAL flushes to disk to improve performance.
e. Use bulk writes: HTable.put(List).
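To make the read-side items concrete, here is a minimal sketch assuming the classic (pre-1.0) HBase client API that the article's HTable.get(List) reference implies. The table name "user_profile", column family "cf", qualifier "name", and row keys are illustrative assumptions, not from the original.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadOptimizationSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_profile"); // hypothetical table

        Scan scan = new Scan();
        // a. Scan cache: fetch 500 rows per RPC instead of the small default.
        scan.setCaching(500);
        // b. Request only the columns actually needed, to cut data transfer.
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"));
        // c. Full-table batch scan: each block is read once, so skip the block cache.
        scan.setCacheBlocks(false);
        // d. Row-key-only scan: these filters strip cell values server-side.
        scan.setFilter(new FilterList(new FirstKeyOnlyFilter(), new KeyOnlyFilter()));

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow()));
            }
        } finally {
            scanner.close();
        }

        // f. Batch read: one round trip for many rows.
        List<Get> gets = new ArrayList<Get>();
        gets.add(new Get(Bytes.toBytes("row-1")));
        gets.add(new Get(Bytes.toBytes("row-2")));
        Result[] results = table.get(gets);
        System.out.println(results.length + " rows fetched");

        table.close();
    }
}
```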
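Similarly, a minimal write-path sketch under the same classic-API assumption; setWriteToWAL and setAutoFlush are the pre-1.0 calls (newer versions use setDurability and BufferedMutator instead). The table name "events", split keys, and row contents are illustrative assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteOptimizationSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();

        // c. Pre-create (pre-split) regions so writes never wait on region splits.
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("events"); // hypothetical table
        desc.addFamily(new HColumnDescriptor("cf"));
        byte[][] splitKeys = { Bytes.toBytes("3"), Bytes.toBytes("6"), Bytes.toBytes("9") };
        admin.createTable(desc, splitKeys);
        admin.close();

        HTable table = new HTable(conf, "events");
        // b. Buffer puts client-side instead of issuing one RPC per put.
        table.setAutoFlush(false);

        // e. Bulk write: accumulate puts and send them as one batch.
        List<Put> puts = new ArrayList<Put>();
        for (int i = 0; i < 1000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            // a. Skip the WAL only if losing these writes on a crash is acceptable.
            put.setWriteToWAL(false);
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
            puts.add(put);
        }
        table.put(puts);
        table.flushCommits(); // push the client-side buffer to the server
        table.close();
    }
}
```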
3. Optimizing HBase's core services:
a. Optimize split operations: if you write much and read little, you can increase hbase.hregion.max.filesize to reduce region splitting.
b. Optimize compaction operations: a major compaction is very resource-intensive and blocks writes while it runs, so major compactions should be carried out when the cluster is not busy.

4. HBase configuration parameter optimization (see the configuration sketch below):
a. Set the number of RegionServer handlers: if there are many write requests, you can appropriately increase hbase.regionserver.handler.count to raise write throughput. Note that raising this parameter consumes more memory.
b. Resize the block cache: hfile.block.cache.size sets the RegionServer read-cache size; the default of 0.25 means the read cache takes 25% of the heap. For read-heavy workloads, adjust it upward appropriately.
c. Set the MemStore upper and lower limits: hbase.regionserver.global.memstore.upperLimit is the ceiling on the combined MemStore size of all regions on a RegionServer; exceeding it triggers a global flush. This parameter mainly prevents the RegionServer from taking up too much memory and being killed by the OOM killer. In a read-heavy cluster, turn this parameter down while increasing the block cache; for write-heavy clusters, do the opposite.
d. Adjust the file count that triggers compaction: when a store has more files than hbase.hstore.blockingStoreFiles, a compaction starts. You can increase this value to reduce the number of compactions.
e. Adjust the MemStore flush factor: when the memory a MemStore occupies exceeds the configured multiple of hbase.hregion.memstore.flush.size, all requests to the region are blocked and a flush is started to free memory. If writes are steady and write volume does not spike, keep the default; otherwise, increase this value.
f. Resize single files: hbase.hregion.max.filesize defines the maximum size of a single HStoreFile; exceeding it triggers a region split. If regions are small, compactions and splits are fast but frequent, which makes cluster response times fluctuate; very large regions make compactions and splits cause long blocking. Choose the size according to your own workload.

5. Optimizing the distributed coordination service, ZooKeeper: there are also many tuning options for ZooKeeper. This article focuses on HBase optimization; the point here is simply that ZooKeeper tuning is also very important.

6. Table design optimization (see the table-design sketch below):
a. Turn on the Bloom filter: it reduces the number of disk reads and therefore latency. Like Redis's HyperLogLog (which we used to estimate user counts), it is a probabilistic, hash-based structure.
b. Resize column-family blocks: smaller block sizes speed up random reads but make the block index larger.
c. Set the IN_MEMORY property for frequently accessed column families, but weigh the memory consumption.
d. Adjust the maximum number of versions per column family: a large number takes up disk space and grows the cluster, so choose according to your application scenario. For example, when we build user portraits we need to track how a user's attributes change over time, so the version count is set according to our own needs.
e. Set the TTL property: cells older than the TTL are deleted automatically. This too depends on your scenario: when we build user portraits, some user actions are considered not worth storing and analyzing after a period of time, so a TTL removes them automatically.

7. Turn off MapReduce speculative execution: if MapReduce jobs access the HBase cluster, speculative execution should be disabled, otherwise the number of HBase client connections may spike and affect cluster operation.

8. Adjust the execution period of the load balancer: when the cluster is written to frequently, the period can be reduced; otherwise it can be increased.
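As a reference for item 4, here is a minimal sketch of those parameters. They normally live in hbase-site.xml on the servers; setting them on a Configuration object in Java, as below, is just a compact way to show the keys. Every value is an illustrative assumption, not a recommendation, and hbase.hregion.memstore.block.multiplier is the standard key behind the "flush factor" described in item 4e.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ConfigTuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // a. More handlers raise write throughput at the cost of memory.
        conf.setInt("hbase.regionserver.handler.count", 100);
        // b. Read cache's share of the heap; default 0.25, raise for read-heavy loads.
        conf.setFloat("hfile.block.cache.size", 0.4f);
        // c. Ceiling on combined MemStore size; lower it on read-heavy clusters.
        conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.35f);
        // d. Store files above this count trigger compaction; raise to compact less often.
        conf.setInt("hbase.hstore.blockingStoreFiles", 10);
        // e. Multiple of memstore.flush.size at which region writes are blocked.
        conf.setInt("hbase.hregion.memstore.block.multiplier", 4);
        // f. Max HStoreFile size before a region splits; tune per workload (10 GB here).
        conf.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024);
    }
}
```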
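And for item 6, a minimal table-design sketch, again assuming the classic client API. The table name "user_portrait", family name, block size, version count, and TTL are illustrative assumptions chosen to match the user-portrait examples above.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class TableDesignSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor cf = new HColumnDescriptor("profile"); // hypothetical family
        // a. Row-level Bloom filter: skip store files that cannot contain the key.
        cf.setBloomFilterType(BloomType.ROW);
        // b. Smaller blocks speed random reads but enlarge the block index (default 64 KB).
        cf.setBlocksize(16 * 1024);
        // c. Keep a hot column family pinned in the block cache.
        cf.setInMemory(true);
        // d. Keep a few historical versions, e.g. to track changing user attributes.
        cf.setMaxVersions(3);
        // e. Cells older than 30 days are removed automatically.
        cf.setTimeToLive(30 * 24 * 60 * 60);

        HTableDescriptor desc = new HTableDescriptor("user_portrait"); // hypothetical table
        desc.addFamily(cf);
        admin.createTable(desc);
        admin.close();
    }
}
```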
Thank you for reading! This is the end of this article on "how to optimize HBase". I hope the content above can be of some help to you and that you learn more from it. If you think the article is good, share it so more people can see it!