HBase Tuning (III)

Here we mainly talk about hbase tuning.

I. High availability of HMaster

In HBase, the HMaster monitors the life cycle of each RegionServer and balances the load across RegionServers. If the HMaster dies, the whole cluster falls into an unhealthy state and cannot keep working for long, so HBase supports a highly available (active/standby) HMaster configuration.

First create a file named backup-masters under $HBASE_HOME/conf (the file must use exactly this name). Its content is the hostname or IP of each standby HMaster:

vim backup-masters
bigdata122

Then copy this file to the HBase conf directory on the other nodes:

scp backup-masters bigdata122:/opt/modules/hbase-1.3.1/conf/
scp backup-masters bigdata123:/opt/modules/hbase-1.3.1/conf/

Then restart the entire hbase cluster

stop-hbase.sh
start-hbase.sh

You can then check the standby master's status at http://bigdata121:16010.

This also creates a znode under /hbase/backup-masters in ZooKeeper whose name carries the standby master's information, for example:

[zk: localhost:2181(CONNECTED) 6] ls /hbase/backup-masters
[bigdata122,16000,1564564196379]

II. Hadoop general optimization

2.1 namenode metadata storage using SSD

2.2 back up metadata on namenode regularly (not required if high availability is configured)

2.3 specify multiple metadata directories for the namenode data directory

Use dfs.name.dir (deprecated) or dfs.namenode.name.dir to specify multiple directories that hold identical copies of the metadata. This gives the metadata redundancy and robustness against a single directory or disk failure. Specific configuration can be found in the article "Working Mechanism of NameNode".

2.4 Self-recovery of NameNode directories

Set dfs.namenode.name.dir.restore to true so that the NameNode attempts to restore previously failed dfs.namenode.name.dir directories when a checkpoint is created. If multiple disks are configured, enabling this is recommended.
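As an illustration, a minimal hdfs-site.xml sketch covering 2.3 and 2.4 (the directory paths are placeholders, not values taken from this article):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data1/dfs/name,/data2/dfs/name</value>
</property>
<property>
  <name>dfs.namenode.name.dir.restore</name>
  <value>true</value>
</property>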

2.5 Ensure HDFS has enough RPC handler threads

HDFS uses RPC for its communication, so the number of RPC handler threads determines concurrency performance.

hdfs-site.xml
Property: dfs.namenode.handler.count. Explanation: the number of NameNode server threads; the default is 10, and it can be raised to 50-100 depending on the machine's available memory.
Property: dfs.datanode.handler.count. Explanation: the number of DataNode handler threads; the default is 10. If HDFS clients issue many read and write requests, it can be raised to 15-20. The higher the value, the more memory is consumed, so do not set it too high; for ordinary workloads 5-10 is fine.
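A hedged hdfs-site.xml sketch for the two handler-count properties above (the values are examples within the suggested ranges):

<property>
  <name>dfs.namenode.handler.count</name>
  <value>50</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
</property>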

2.6 number of hdfs replicas

hdfs-site.xml property: dfs.replication. Explanation: if the data volume is large and the data is not very important, it can be lowered to 2-3; if the data is very important, it can be raised to 3-5.

2.7 hdfs file block resizing

hdfs-site.xml property: dfs.blocksize. Explanation: the block size. It should be set according to the size of the individual files being stored: if most individual files are smaller than 100 MB, a 64 MB block size is recommended; if they are larger than 100 MB or reach GB scale, 256 MB is recommended. In general the value falls in the 64 MB to 256 MB range.
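A possible hdfs-site.xml sketch for 2.6 and 2.7 (illustrative values only; 268435456 bytes = 256 MB, chosen per the guidance above):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>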

2.8 Adjust the number of MapReduce JobTracker handler threads

mapred-site.xml property: mapreduce.jobtracker.handler.count. Explanation: the number of JobTracker handler threads for job tasks. The default is 10; it can be raised to 50-100 depending on the machine's available memory.

2.9 number of http server worker threads

mapred-site.xml property: mapreduce.tasktracker.http.threads. Explanation: the number of worker threads on the TaskTracker HTTP server. The default is 40; it can be raised to 80-100 for large clusters.

2.10 File sorting and merging optimization

mapred-site.xml property: mapreduce.task.io.sort.factor. Explanation: the number of data streams merged at the same time when sorting files, which is also the number of files opened simultaneously during a merge. The default is 10. Increasing this parameter significantly reduces disk IO: the more files merged in one pass, the fewer merge rounds and disk reads are needed.
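A sketch of the corresponding mapred-site.xml entries for 2.8-2.10 (the values are illustrative, not prescriptions from this article):

<property>
  <name>mapreduce.jobtracker.handler.count</name>
  <value>50</value>
</property>
<property>
  <name>mapreduce.tasktracker.http.threads</name>
  <value>80</value>
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>50</value>
</property>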

2.11 Speculative task execution

mapred-site.xml property: mapreduce.map.speculative. Explanation: whether speculative execution is allowed, i.e. a slow task attempt may be duplicated on another node and whichever attempt finishes first wins (similar in spirit to how the Thunderbolt downloader fetches the same data from several sources). For jobs with many small tasks, setting it to true can noticeably speed up execution; for tasks with very high latency, where a duplicate attempt is expensive, false is recommended.

2.12 Compression of MR output data

mapred-site.xml properties: mapreduce.map.output.compress (compresses map output) and mapreduce.output.fileoutputformat.compress (compresses the final reduce/job output). Explanation: for large clusters it is recommended to compress MapReduce output; for small clusters it is not necessary.
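A mapred-site.xml sketch for 2.11 and 2.12 (whether each switch should actually be true depends on the workload, as discussed above):

<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>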

2.13 optimize the number of Mapper and Reducer

mapred-site.xml properties: mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum. Explanation: these two properties set how many Map and Reduce tasks a single node can run at the same time. When setting them, consider the number of CPU cores, disks, and memory capacity. For an 8-core CPU running CPU-heavy business logic, setting the map slots to 4 is reasonable; if the workload is not CPU-bound, the map slots can be set to 40 and the reduce slots to 20. After changing these values, watch whether tasks wait for a long time; if so, lower the numbers to speed up execution. If the values are set too high, there will be a lot of context switching and swapping between memory and disk. There is no standard configuration; the choice depends on the workload, the hardware, and experience. At the same time, do not run too many MapReduce jobs concurrently, or memory will be exhausted and tasks will run very slowly. Set a maximum MR concurrency based on the CPU core count and memory capacity, so that a fixed amount of task data fits fully in memory, avoiding frequent memory/disk exchange, reducing disk IO, and improving performance. A rough estimation formula:

map = 2 + 2/3 * cpu_cores
reduce = 2 + 1/3 * cpu_cores

III. Linux optimization

3.1 Enable the file system read-ahead cache to improve read speed

sudo blockdev --setra 32768 /dev/sda
(ra is short for readahead)

3.2 Stop process memory from being swapped out

That is, discourage the kernel from swapping idle process memory out to disk; with vm.swappiness=0 pages are kept in RAM and swap is used only when absolutely necessary.

sudo sysctl -w vm.swappiness=0

3.3 disable atime for Linux files

atime (access time) is updated every time a file is accessed. HDFS stores each block as a file on the local file system, so the number of files is huge, and refreshing atime on every access is unnecessary. You can disable atime on the devices that store HDFS data.

vim /etc/fstab
UUID=6086522b-3bc7-44a2-83f4-fd79a8c4afa1 / ext4 errors=remount-ro,noatime,nodiratime 0 1
Adding noatime (and nodiratime) to the mount options disables atime updates.

3.4 Raise the ulimit limits; the defaults are relatively small

ulimit -n    # view the maximum number of open files
ulimit -u    # view the maximum number of processes
vi /etc/security/limits.conf    # raise the open-file limit by adding at the end:
* soft nofile 1024000
* hard nofile 1024000
hive - nofile 1024000
hive - nproc 1024000
vi /etc/security/limits.d/20-nproc.conf    # raise the per-user process limit:
# * soft nproc 4096
# root soft nproc unlimited
* soft nproc 40960
root soft nproc unlimited

3.5 enable cluster time synchronization

If the cluster time is not synchronized, there may be problems when the node heartbeat is checked.

IV. Zookeeper optimization

Optimize Zookeeper session timeout

hbase-site.xml parameter: zookeeper.session.timeout. Interpretation: in hbase-site.xml, set zookeeper.session.timeout to bound failure detection; 20-30 seconds is a good starting point. This value directly determines how long it takes the master to discover that a RegionServer is down (the default differs between HBase versions). If the value is too small, then when HBase is writing heavily and a long GC pause occurs, the RegionServer is temporarily unresponsive and sends no heartbeat to ZooKeeper, so the node is declared dead and shut down. As a rule of thumb, a cluster of about 20 nodes should run 5 ZooKeeper servers.

V. HBase optimization

5.1 Pre-splitting (pre-partitioning)

Each region maintains a startRowKey and an endRowKey; if an incoming row's rowkey falls within the range maintained by a region, the row is handed to that region. Following this principle, we can plan in advance which partitions data requests will hit in order to improve HBase performance: the goal is to spread read and write requests evenly across the regions of the different partitions.

Checking for data skew:

In the HBase web UI you can click on each table and view the state of its regions; the "Requests" column shows the number of requests each region has received, from which you can judge whether there is data skew between regions.

Pre-splitting methods:

(1) Manually specify the split points

hbase> create 'staff','info','partition1',SPLITS => ['1000','2000','3000','4000']

You can click on the corresponding table in the table details page of the HBase web UI to see how each table is partitioned.

You can also view partitions with the following command:

hbase(main):011:0> get_splits 'staff'
Total number of splits = 5
=> ["1000", "2000", "3000", "4000"]

(2) Generate split points from a hexadecimal sequence

create 'staff2','info','partition2', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

(3) Pre-split according to split points defined in a file

Create the file as follows:

splits.txt:
aaaa
bbbb
cccc
dddd

Create a table:

create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
The table is split at aaaa, bbbb, cccc and dddd, so five regions are created (from the smallest key up to aaaa, then aaaa to bbbb, bbbb to cccc, cccc to dddd, and dddd to the largest key).

5.2 Rowkey design

The unique identifier of a row is its rowkey, so which pre-split region a row is stored in depends on which range its rowkey falls into. The main purpose of rowkey design is to distribute data evenly across all regions and so prevent data skew to a certain extent. Below are the common rowkey design schemes. In general, the more random and irregular the rowkey, the more evenly rows are scattered across different regions and the better the read and write load is balanced.

5.2.1 Random numbers, hashing, hash digests
For example, the original rowkey 1001 becomes dd01903921ea24941c26a48f2cec24e0bb0e8cc7 after SHA1; 3001 becomes 49042c54de64a1e9bf0b33e00245660ef92dc7bd; and 5001 becomes 7b61dec07e02c188790670af43e717f0f46e8913. Before doing this, we usually sample the data set to decide which hashed rowkeys to use as the split points between partitions.

5.2.2 String reversal
Regular data such as phone numbers, ID card numbers, and timestamps become weakly regular, relatively random data when reversed, which to some extent also hashes incoming puts across regions.

5.2.3 String concatenation
When the rowkey itself is regular, some random characters can be prepended or appended to break the pattern and turn it into more random data.

5.3 Memory optimization

Because HBase is a memory-hungry NoSQL store, memory planning is closely tied to its performance. Memory optimization mainly covers three aspects: compaction, flush, and memory area planning.

5.3.1 compact

A compaction merges multiple StoreFiles, mainly to reclaim storage space and improve read and write speed. By scale it is divided into Minor Compaction and Major Compaction.

Minor Compaction:

A Minor Compaction selects some small, adjacent StoreFiles and merges them into one larger StoreFile; cells that are already deleted or expired are not cleaned up in the process. The result is fewer but larger StoreFiles.

Major Compaction

A Major Compaction merges all StoreFiles of a store into a single StoreFile, and in the process cleans up three kinds of meaningless data: deleted cells, TTL-expired cells, and cells whose version count exceeds the configured maximum. A Major Compaction usually runs for a long time and consumes a lot of system resources, which has a large impact on the upper-level business, so production deployments generally turn off the automatic trigger and run it manually during off-peak hours.

Minor compactions are usually triggered by MemStore flushes or by periodic checker threads. Major compactions are usually triggered manually; the automatic trigger can be disabled by setting the hbase.hregion.majorcompaction parameter to 0 (the default is 86400000 ms, i.e. once a day).
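As a sketch, disabling the automatic major compaction trigger described above would look like this in hbase-site.xml:

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>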

Manual compact operation:

Minor Compaction

compact 'namespace:table','cf'
If cf is not specified, all column families of the region are compacted.

Major Compaction

major_compact 'namespace:table','cf'
The usage is the same as compact above.

5.3.2 Flush

A flush writes MemStore data out to a StoreFile. When HBase writes data, it first appends to the WAL (HLog) and, after that succeeds, writes into the MemStore; each column family has its own MemStore. When the total size of the MemStores of all column families in a region reaches the configured flush threshold (hbase.hregion.memstore.flush.size, 128 MB by default), the region is flushed; currently, even when only a single column family's MemStore reaches the threshold, all MemStores of the region are flushed together. This non-real-time flushing exists mainly for write performance, avoiding a write to HDFS on every put. In HBase 2.0, a MemStore actually consists of one mutable segment and many immutable segments, and compaction can be done directly in memory; flushing to HFiles first and compacting afterwards would consume much more disk and network IO.
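A minimal hbase-site.xml sketch for the per-region flush threshold mentioned above (134217728 bytes = 128 MB, the usual default):

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>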

5.3.3 Memory planning

Allocating the right amount of memory to a RegionServer, and sizing the different functional areas inside it, is quite a delicate exercise. The memory can be roughly divided into the following areas:

CombinedBlockCache: the read cache, made up of LRUBlockCache and BucketCache. The former generally caches index and metadata blocks and lives on the JVM heap; the latter caches the data blocks themselves and lives off-heap.
MemStore: the write cache, holding data that can still be modified before it is flushed.
Other: objects used by HBase itself while it runs.
JVM_HEAP: the heap size, i.e. the RegionServer memory minus the BucketCache; roughly MemStore + LRUBlockCache + Other.
There is a hard requirement: LRUBlockCache + MemStore < 80% * JVM_HEAP. If this condition is not met, the RegionServer will simply fail to start.

There are two scenarios: read less and write more, read more and write less.

1. Read less and write more

The LRUBlockCache mode is normally used here, i.e. LRUBlockCache + MemStore + Other, all on the JVM heap. Take a machine with 96 GB of RAM as an example: roughly two thirds of it, 64 GB, is given to the RegionServer. Since writes dominate, MemStore should be larger than LRUBlockCache while still satisfying LRUBlockCache + MemStore < 80% * JVM_HEAP. The recommended plan: MemStore = 45% * JVM_HEAP = 64G * 45% = 28.8G, LRUBlockCache = 30% * JVM_HEAP = 64G * 30% = 19.2G. By default, MemStore is 40% * JVM_HEAP and LRUBlockCache is 25% * JVM_HEAP.

Jvm parameter configuration:

-XX:SurvivorRatio=2 -XX:+PrintGCDateStamps -Xloggc:$HBASE_LOG_DIR/gc-regionserver.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M -server -Xmx64g -Xms64g -Xmn2g -Xss256k -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseParNewGC -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -XX:-DisableExplicitGC

Here -Xmx64g -Xms64g set the heap to 64 GB.

Configuration in hbase-site.xml:

Based on the plan above, hbase.regionserver.global.memstore.upperLimit is set to 0.45 and hbase.regionserver.global.memstore.lowerLimit to 0.40. upperLimit is the upper bound on the fraction of RegionServer heap that all MemStores together may occupy; when it is exceeded, the RegionServer sorts all regions by MemStore size and flushes them from largest to smallest until the total MemStore size drops below lowerLimit. lowerLimit is generally set 5% below upperLimit. The resulting settings:
hbase.regionserver.global.memstore.upperLimit 0.45
hbase.regionserver.global.memstore.lowerLimit 0.40
hfile.block.cache.size 0.3
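The same settings written out in hbase-site.xml syntax (a sketch of the write-heavy plan above):

<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.45</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.40</value>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.3</value>
</property>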

2. Read more and write less

The BucketCache mode is generally used here: the read cache (CombinedBlockCache) consists of LRUBlockCache plus BucketCache. The memory layout is: on-heap memory = LRUBlockCache + MemStore + Other; off-heap memory = BucketCache. Again take a machine with 96 GB of RAM, of which 64 GB goes to the RegionServer (you can give it even more in this mode). The ratio of read : write : other memory is generally 5 : 3 : 2, and within the read memory LRU : BUCKET is about 1 : 9, while still satisfying LRUBlockCache + MemStore < 80% * JVM_HEAP. That gives:
CombinedBlockCache = 64G * 0.5 = 32G, of which LRUBlockCache = 32G * 0.1 = 3.2G and BucketCache = 32G * 0.9 = 28.8G
MemStore = 64G * 0.3 = 19.2G
JVM_HEAP = 64G - 28.8G = 35.2G
Checking the constraint: (LRU + MemStore) / JVM_HEAP = (3.2G + 19.2G) / 35.2G = 22.4G / 35.2G = 63.6%, well below 80%, which shows the heap is larger than necessary. So reduce JVM_HEAP to 30G, increase MemStore to 20G, and, since the heap shrank, enlarge the off-heap BucketCache to 30G. After the adjustment: (LRU + MemStore) / JVM_HEAP = (3.2G + 20G) / 30G = 23.2G / 30G = 77%.

JVM parameter settings:

-XX:SurvivorRatio=2 -XX:+PrintGCDateStamps -Xloggc:$HBASE_LOG_DIR/gc-regionserver.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M -server -Xmx40g -Xms40g -Xss256k -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseParNewGC -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -XX:-DisableExplicitGC

Hbase-site.xml configuration

MemStore related: from the definition of the upperLimit parameter and the memory plan above, upperLimit = 20G / 30G ≈ 66%, so hbase.regionserver.global.memstore.upperLimit is set to 0.66 and the lowerLimit to 0.60:
hbase.regionserver.global.memstore.upperLimit 0.66
hbase.regionserver.global.memstore.lowerLimit 0.60
BucketCache related: set the cache engine to off-heap memory and specify the off-heap cache size (in MB):
hbase.bucketcache.ioengine offheap
hbase.bucketcache.size 34816
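Written out in hbase-site.xml syntax (a sketch of the read-heavy plan above):

<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.66</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.60</value>
</property>
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>34816</value>
</property>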

For more details, see http://hbasefly.com/2016/06/18/hbase-practise-ram/, which is extremely well written.

5.4 Basic optimization

1. Allow appending to HDFS files

hdfs-site.xml and hbase-site.xml property: dfs.support.append. Explanation: enabling HDFS append synchronization works well with HBase's data synchronization and persistence. The default is true.

2. Optimize the maximum number of file openings allowed by DataNode

hdfs-site.xml property: dfs.datanode.max.transfer.threads. Explanation: HBase usually operates on a large number of files at the same time; set this to 4096 or higher depending on the cluster size and data workload. Default: 4096.

3. Optimize the waiting time for data operations with high latency

hdfs-site.xml property: dfs.image.transfer.timeout. Explanation: if a data operation has very high latency and the socket needs to wait longer, it is recommended to raise this value (the default is 60000 ms) so that the socket is not dropped by a timeout.

4. Optimize DataNode storage

Property: dfs.datanode.failed.volumes.tolerated. Explanation: the default is 0, meaning that as soon as one disk in a DataNode fails, the DataNode is considered failed and shuts down. If set to 1, a single disk failure only causes the data on that disk to be re-replicated to other healthy DataNodes while the current DataNode keeps working.
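A combined hdfs-site.xml sketch for items 1-4 above (values follow the explanations; tune them to your cluster):

<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>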

5. Set the number of RegionServer RPC listeners

hbase-site.xml property: hbase.regionserver.handler.count. Explanation: the default is 30; it specifies the number of RPC listener (handler) threads and can be adjusted to the client request volume. Increase it when there are many read and write requests.

6. Optimize HStore file size

Property: hbase.hregion.max.filesize. Explanation: the default is 10737418240 (10 GB). If you need to run MapReduce jobs over HBase, you can lower this value, because one region corresponds to one map task, and an overly large single region makes that map task run too long. The value means that once a region's HFile data grows beyond it, the region is split in two.

7. Optimize hbase client cache

hbase-site.xml property: hbase.client.write.buffer. Explanation: specifies the HBase client write buffer. Increasing it reduces the number of RPC calls but consumes more memory, and vice versa. Generally a reasonable buffer size is set in order to reduce the RPC count.

8. Specify the number of rows fetched per scan.next call

hbase-site.xml property: hbase.client.scanner.caching. Explanation: specifies the default number of rows fetched by each scan.next call. The larger the value, the more memory is consumed.
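A combined hbase-site.xml sketch for items 5-8 (the handler count and HFile size are the defaults named above; the write buffer and scanner caching values are examples, not recommendations from this article):

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
</property>
<property>
  <name>hbase.client.write.buffer</name>
  <value>2097152</value>
</property>
<property>
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>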
