
How to manage HBase and tune its performance


This article explains in detail how to manage HBase and how to tune its performance. It is intended as a practical reference; I hope you find it useful.

Java GC and HBase heap settings

HBase runs on the JVM, so the JVM's Garbage Collection (GC) settings are important for HBase to run smoothly and with high performance. In addition to guidelines for configuring the HBase heap, it is equally important to have the HBase processes write their GC logs, and to adjust the JVM settings based on what those logs show.

I will describe the most important HBase JVM heap settings, how they work, and how to read the GC logs, along with some guidelines for tuning HBase's Java GC settings.

Preparatory work

Log in to your HBase region server.

How to do it

The following are recommended for Java GC and HBase heap settings:

Give HBase a large enough heap size by editing the hbase-env.sh file. For example, the following snippet configures HBase with an 8000 MB heap:

$ vi $HBASE_HOME/conf/hbase-env.sh
export HBASE_HEAPSIZE=8000

Enable GC logging with the following setting in the same file:

export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/local/hbase/logs/gc-hbase.log"

Add the following code to start Concurrent-Mark-Sweep GC (CMS) earlier than the default:

$ vi $HBASE_HOME/conf/hbase-env.sh
export HBASE_OPTS="$HBASE_OPTS -XX:CMSInitiatingOccupancyFraction=60"

Synchronize the changes across the cluster and restart HBase.

Check that the GC log is being written to the specified log file (/usr/local/hbase/logs/gc-hbase.log). Sample lines from this log are discussed in the next section.

How does it work?

In step 1, we configured the HBase heap size. By default, HBase uses a 1 GB heap, which is too low for modern machines. A heap larger than 4 GB is good for HBase; we recommend 8 GB or larger, but less than 16 GB.

In step 2, we enabled GC logging. With this setting, you get the GC log of the region server, similar to the one referenced in step 5. Basic knowledge of JVM memory allocation and garbage collection is required to understand the log output.
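
The JVM's generational garbage collection system can be sketched as follows (a simplified text layout; the actual size of each space depends on your JVM settings):

+-----------------------------+---------------------+------------+
|      Young generation       |   Old generation    |    Perm    |
|  Eden  |   S0   |    S1     |      (tenured)      | generation |
+-----------------------------+---------------------+------------+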

There are three heap generations: the Perm (or Permanent) generation, the Old Generation (tenured), and the Young generation. The young generation consists of three separate spaces: the Eden space and two survivor spaces, S0 and S1.

Typically, objects are allocated in the young generation's Eden space. If an allocation fails (Eden is full), all Java threads are stopped and a young generation GC (minor GC) is invoked. All surviving objects in the young generation (the Eden and S0 spaces) are copied to the S1 space. If the S1 space is full, objects are copied (promoted) to the old generation. When a promotion fails, the old generation is collected (a major/full GC). The permanent generation and the old generation are usually collected together. The permanent generation is used to store class and method definitions.

Going back to step 5 of our example, a minor GC line produced by the options above has the following form:

<timestamp>: [GC [<collector>: <starting occupancy1> -> <ending occupancy1>, <pause time1> secs] <starting occupancy3> -> <ending occupancy3>, <pause time3> secs] [Times: <user time>, <system time>, <real time>]

In this output:

Timestamp is the time at which the GC occurred, relative to the application's startup time.

Collector is the internal name of the collector used for the minor collection.

Starting occupancy1 is the occupancy of the young generation before the collection.

Ending occupancy1 is the occupancy of the young generation after the collection.

Pause time1 is the pause time for the minor collection.

Starting occupancy3 is the occupancy of the entire heap before the collection.

Ending occupancy3 is the occupancy of the entire heap after the collection.

Pause time3 is the pause time for the entire garbage collection, including the major collection part.

[Times:] lists the time spent in garbage collection: user time, system time, and real (wall clock) time.

The first line of our output in step 5 shows a minor GC, which paused the JVM for 0.0764200 seconds and reduced the young generation occupancy from about 14.8 MB to 1.6 MB.

Next, let's take a look at the CMS GC log. HBase uses CMS as its default garbage collector for the old generation.

CMS GC performs the following steps:

Initial mark

Concurrent marking

Remark

Concurrent sweeping

CMS pauses the application threads only during the initial mark and remark phases. During the concurrent marking and sweeping phases, the CMS thread runs alongside the application threads.

The second line of the example shows that the CMS initial mark took 0.0100050 seconds and the concurrent marking took 6.496 seconds. Note that the application is not paused during concurrent marking.

In the GC log shown earlier, the line starting with 1441.435: [GC [YG occupancy: …] marks a pause. The pause here is 0.0413960 seconds and is used for the remark phase. After that, you can see the concurrent sweep begin. The CMS sweep took 3.446 seconds, but the heap size did not change much here (it stayed at about 150 MB).

The tuning goal here is to keep all pause times low. To keep pause times low, adjust the size of the young generation with the -XX:NewSize and -XX:MaxNewSize JVM parameters, setting them to relatively small values (for example, a few hundred MB). If the server has more CPU resources, we recommend using the Parallel New collector by setting the -XX:+UseParNewGC option. You may also want to adjust the number of parallel GC threads for the young generation with the -XX:ParallelGCThreads JVM parameter.

We recommend adding the above settings to the HBASE_REGIONSERVER_OPTS variable rather than the HBASE_OPTS variable in the hbase-env.sh file. HBASE_REGIONSERVER_OPTS affects only the region server processes, which is what we want, because the HBase master neither handles heavy tasks nor participates in data processing.
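
For example, an hbase-env.sh entry combining these settings might look like the following; the young generation size and thread count here are only illustrative assumptions that you should tune for your own hardware:

$ vi $HBASE_HOME/conf/hbase-env.sh
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:NewSize=256m -XX:MaxNewSize=256m -XX:+UseParNewGC -XX:ParallelGCThreads=4"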

For the old generation, the concurrent collector (CMS) generally cannot be sped up, but it can be started earlier. CMS starts running when the ratio of allocated space in the old generation exceeds a threshold. This threshold is calculated automatically by the collector. In some cases, especially under heavy load, if CMS starts too late, HBase may fall into a full garbage collection. To avoid this, we recommend setting the -XX:CMSInitiatingOccupancyFraction JVM parameter to specify exactly at what percentage of occupancy CMS should start, as we did in step 3. Starting at 60 or 70 percent is a good practice. When CMS is used for the old generation, the default young generation collector is set to the Parallel New collector.

There's more...

If you are using HBase 0.92, consider enabling the MemStore-Local Allocation Buffer (MSLAB) feature to prevent old generation heap fragmentation under frequent write loads:

$ vi $HBASE_HOME/conf/hbase-site.xml
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>

This feature is enabled by default in HBase 0.92.

Use Compression

Another of the most important features of HBase is compression. It matters because:

Compression reduces the number of bytes read from and written to HDFS

It saves disk space

It improves the efficiency of network bandwidth when data is fetched from a remote server

HBase supports the GZip and LZO formats. My recommendation is to use the LZO compression algorithm because it decompresses quickly with low CPU usage. If a better compression ratio is the priority for your system, you should consider GZip instead.

Unfortunately, HBase cannot ship with LZO because of licensing issues: HBase uses the Apache license, while LZO is GPL-licensed. Therefore, we need to install LZO ourselves. We will use the hadoop-lzo library, which brings splittable LZO compression to Hadoop.

In this section, we will describe how to install LZO and how to configure HBase to use LZO compression.

Preparatory work

Make sure that Java is installed on the machine on which hadoop-lzo is built. Apache Ant is required to build hadoop-lzo from source code. Install Ant by running the command:

$ sudo apt-get -y install ant

All nodes in the cluster need to have the native LZO library installed. You can install it with the following command:

$ sudo apt-get -y install liblzo2-dev

How to do it

We will use the hadoop-lzo library to add LZO compression support to HBase:

Get the latest hadoop-lzo source code from https://github.com/toddlipcon/hadoop-lzo

Build the native hadoop-lzo library from the source code. Depending on your OS, you should build either the 32-bit or the 64-bit binaries. For example, to build the 32-bit binaries, run the following commands:

$ export JAVA_HOME="/usr/local/jdk1.6"
$ export CFLAGS="-m32"
$ export CXXFLAGS="-m32"
$ cd hadoop-lzo
$ ant compile-native
$ ant jar

These commands create the hadoop-lzo/build/native directory and the hadoop-lzo/build/hadoop-lzo-x.y.z.jar file. To build the 64-bit binaries, you need to change CFLAGS and CXXFLAGS to -m64.

Copy the built package to the $HBASE_HOME/lib and $HBASE_HOME/lib/native directories of your master node:

hadoop@master1$ cp hadoop-lzo/build/hadoop-lzo-x.y.z.jar $HBASE_HOME/lib
hadoop@master1$ mkdir -p $HBASE_HOME/lib/native/Linux-i386-32
hadoop@master1$ cp -r hadoop-lzo/build/native/Linux-i386-32/* $HBASE_HOME/lib/native/Linux-i386-32/

For a 64-bit OS, change Linux-i386-32 (in the previous step) to Linux-amd64-64.

Add the configuration of hbase.regionserver.codecs to your hbase-site.xml file:

hadoop@master1$ vi $HBASE_HOME/conf/hbase-site.xml
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,gz</value>
</property>

Synchronize the $HBASE_HOME/conf and $HBASE_HOME/lib directories in the cluster.
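
One way to do this, assuming passwordless SSH from the master, identical paths on every node, and that $HBASE_HOME/conf/regionservers lists your region server hosts, is a small rsync loop such as:

hadoop@master1$ for host in $(cat $HBASE_HOME/conf/regionservers); do rsync -az $HBASE_HOME/conf/ $host:$HBASE_HOME/conf/; rsync -az $HBASE_HOME/lib/ $host:$HBASE_HOME/lib/; done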

HBase ships with a tool to test whether compression is set up correctly. Use this tool to test the LZO settings on each node in the cluster. If everything is configured correctly, you will get output reporting success:

hadoop@client1$ $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.util.CompressionTest /tmp/lzotest lzo
12/03/11 11:01:08 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 249.6m
12/03/11 11:01:08 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
12/03/11 11:01:08 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev Unknown build revision]
12/03/11 11:01:08 INFO compress.CodecPool: Got brand-new compressor
12/03/11 11:01:18 INFO compress.CodecPool: Got brand-new decompressor
SUCCESS

Test the configuration by creating a table using LZO compression and validating it in HBase Shell:

hbase> create 't1', {NAME => 'cf1', COMPRESSION => 'LZO'}
hbase> describe 't1'
DESCRIPTION                                                      ENABLED
 {NAME => 't1', FAMILIES => [{NAME => 'cf1', BLOOMFILTER =>      true
 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION
 => 'LZO', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE
 => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0790 seconds

Managing compactions

How it works

The hbase.hregion.majorcompaction property specifies the interval, in milliseconds, between major compactions of all the store files in a region. The default is 86400000, that is, one day. Setting it to 0 in step 1 disables automatic major compaction. This prevents major compaction from running during busy load times, for example when MapReduce jobs are running on the HBase cluster.
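
For reference, the setting referred to as step 1 is an hbase-site.xml entry along these lines (a minimal sketch):

$ vi $HBASE_HOME/conf/hbase-site.xml
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>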

On the other hand, major compaction is still needed to help performance. In step 4, we showed an example of how to manually trigger a major compaction on a particular region through HBase Shell. In that example, we passed a region name to the major_compact command to invoke major compaction on a single region only. It is also possible to run major compaction on all the regions of a table by passing the table name to the command. The major_compact command queues the specified table or regions for major compaction; the compactions themselves are carried out by the region servers hosting the regions and run in the background.
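
The step 4 invocation takes the following general form in HBase Shell; the region name below is only an illustration (it reuses the format from the split example later in this article), and passing just a table name compacts every region of that table:

hbase> major_compact 'hly_temp,1327118470453.5ef67f6d2a792fb0bd737863dc00b6a7.'
hbase> major_compact 'hly_temp'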

As mentioned earlier, you may want to perform major compaction manually only during periods of low load. This is easily achieved by calling major_compact from a scheduled task.
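
For example, a crontab entry along these lines would trigger a nightly major compaction; the time, table name, and installation path are assumptions to adapt to your environment:

# run at 01:00 every day, as the user that runs HBase
0 1 * * * echo "major_compact 'hly_temp'" | /usr/local/hbase/bin/hbase shell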

There's more...

Another way to invoke major compaction is to use the majorCompact API provided by the org.apache.hadoop.hbase.client.HBaseAdmin class. It is very easy to call this API from Java, so you can manage more complex major compaction scheduling in your own Java code.
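
A minimal sketch of such a call, assuming the 0.92-era client API used elsewhere in this article and that your hbase-site.xml is on the classpath (the table name is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MajorCompactExample {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Queues a major compaction for every region of the table;
    // a region name can be passed instead to compact a single region
    admin.majorCompact("hly_temp");
  }
}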

Manage region split

Usually an HBase table starts with a single region. However, as data keeps growing and the region reaches its configured maximum size, it is automatically split into two halves so that they can handle more data.
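
A simplified sketch of an HBase region split (the keys shown are placeholders):

Before the split:  region            [startKey, endKey)
After the split:   daughter region 1 [startKey, midKey)
                   daughter region 2 [midKey, endKey)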

This is the default behavior of HBase region splitting. It works well in most cases, but there are situations where it causes problems, such as split/compaction storms.

With uniform data distribution and growth, all the regions in a table eventually need to split at roughly the same time. Immediately following a split, compactions run on the daughter regions to rewrite their data into separate files. This can cause a lot of disk I/O and network traffic.

To avoid this, you can turn off automatic splitting and invoke splits manually. Because you control when splits are called, this helps spread out the I/O load. Another advantage is that manual splitting gives you finer control over your regions and helps you track down and resolve region-related problems.

In this section, I will describe how to turn off automatic region splitting and invoke it manually.

Preparatory work

Log in to your HBase master server using the user you started the cluster with.

How to do it

To turn off automatic region split and invoke it manually, follow these steps:

Add the following code to the hbase-site.xml file:

$ vi $HBASE_HOME/conf/hbase-site.xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
</property>

Synchronize these changes in the cluster and restart HBase.

With the above setting, region splits will not occur until a region reaches the configured 100 GB threshold. You will need to invoke splits explicitly on the regions of your choice.

To run a region split through HBase Shell, use the following command:

$echo "split 'hly_temp,1327118470453.5ef67f6d2a792fb0bd737863dc00b6a7.'" | $HBASE_HOME/bin/hbase shellHBase Shell; enter' help' for list of supported commands.Type "exit" to leave the HBase Shell Version 0.92.0, r1231986, Tue Jan 17 02:30:24 UTC 2012split 'hly_temp,1327118470453.5ef67f6d2a792fb0bd737863dc00b6a7.'0 row (s) in 1.6810 seconds how it works

The hbase.hregion.max.filesize property specifies the maximum region size in bytes. By default, the value is 1 GB (256 MB in versions prior to HBase 0.92). This means that when a region exceeds this size, it splits into two. In step 1, we set the maximum region size to 100 GB, which is a very high value.

Because splits will not occur until a region crosses the 100 GB boundary, we need to invoke them explicitly. In step 4, we used the split command in HBase Shell to invoke a split on a specific region.

Don't forget to split regions that have grown large. A region is HBase's basic unit of data distribution and load balancing. Regions should be split to an appropriate size during periods of low load.

On the other hand, too many splits are not good either; having too many regions on a region server degrades its performance.

After splitting regions manually, you may want to trigger major compaction and load balancing.
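
For example, from HBase Shell (the table name is a placeholder, and the balancer command assumes the balancer has not been switched off with balance_switch):

hbase> major_compact 'hly_temp'
hbase> balancer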

There's more...

Our previous setting gives the entire cluster a default maximum region size of 100 GB. Instead of changing it for the whole cluster, you can also specify the MAX_FILESIZE attribute for a specific table when creating it:

hbase> create 't1', {NAME => 'cf1', MAX_FILESIZE => '107374182400'}

As with major compaction, you can also use the split API provided by the org.apache.hadoop.hbase.client.HBaseAdmin class.
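
A minimal sketch, again assuming the 0.92-era client API (the table name is a placeholder; a region name can be passed instead to split a single region):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SplitExample {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Pass a table name to split every region of the table,
    // or a region name to split just that region
    admin.split("hly_temp");
  }
}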

This is the end of the article on how to manage HBase and tune its performance. I hope the content above has been of some help to you. If you found the article useful, please share it so that more people can see it.
