What are the features of HBase

2025-07-04 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31 report

This article explains in detail the features HBase offers. They are very practical, so they are shared here for reference. I hope you get something out of reading it.

1 Configurable block size

The HFile block size can be set at the column family level. This block is different from the HDFS block mentioned earlier. The default is 65536 bytes, or 64 KB. The block index stores the start key of each HFile block, so the block size determines the size of the block index: the smaller the blocks, the larger the index, and the more memory it consumes. In return, random lookups perform better because the blocks loaded into memory are smaller. If you need better sequential scan performance, it makes more sense to load more HFile data into memory at a time, which means setting the block size to a larger value. The index then becomes smaller, and you pay a price in random read performance.

You can set the block size when the table is created, as follows:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKSIZE => '65536'}
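The relationship between block size and index size can be illustrated with a little arithmetic. The following is a minimal sketch, assuming one index entry per block and a hypothetical 1 GiB HFile; the function name is made up for illustration and is not part of HBase.

```python
import math


def block_index_entries(hfile_bytes: int, block_bytes: int) -> int:
    """Approximate number of block index entries: one start key per block."""
    return math.ceil(hfile_bytes / block_bytes)


hfile_size = 1024 ** 3  # hypothetical 1 GiB HFile (assumption for illustration)
for block_size in (16 * 1024, 64 * 1024, 256 * 1024):
    entries = block_index_entries(hfile_size, block_size)
    print(f"{block_size // 1024:>4} KB blocks -> {entries} index entries")
```

Quartering the block size quadruples the number of index entries, which is the memory cost traded for smaller reads on random lookups.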

2 Block caching

Data can be placed in the read cache, but some workloads get no performance boost from it. For example, if a table or a column family in a table is accessed only by sequential scans, or is accessed so rarely that you don't care if a Get or Scan takes a little longer, you can turn off caching for those column families. If you perform many sequential scans, you churn the cache repeatedly and may crowd out data that would have benefited from being cached. Turning off caching avoids this and also leaves more cache available to other tables and to other column families of the same table.

Block caching is on by default. You can turn it off when you create a table or alter an existing one:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKCACHE => 'false'}
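To see why sequential scans can crowd useful data out of an LRU cache, here is a toy simulation. It is only a sketch: the `LruBlockCache` class, its capacity, and the block names are hypothetical and do not reflect HBase's actual cache implementation.

```python
from collections import OrderedDict


class LruBlockCache:
    """Toy LRU cache keyed by block id (not HBase's real implementation)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()

    def read(self, block_id: str, cache_block: bool = True) -> bool:
        """Return True on a cache hit; cache a missed block only if cache_block is set."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return True
        if cache_block:
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict the least recently used block
        return False


def hot_blocks_survive_scan(cache_scans: bool) -> bool:
    """Load two hot blocks, run a ten-block scan, and check the hot blocks remain."""
    cache = LruBlockCache(capacity=4)
    for block in ("hot-1", "hot-2"):
        cache.read(block)
    for i in range(10):
        cache.read(f"scan-{i}", cache_block=cache_scans)
    return all(block in cache.blocks for block in ("hot-1", "hot-2"))


print("scan cached:    ", hot_blocks_survive_scan(cache_scans=True))
print("scan not cached:", hot_blocks_survive_scan(cache_scans=False))
```

When the scanned blocks are cached, the hot blocks are evicted; when the scanned column family bypasses the cache, they stay resident.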

3 Aggressive caching

You can select some column families and give them a higher priority in the block cache (LRU cache). This feature comes in handy if you expect one column family to see more random reads than another. This setting is also applied when the table is created:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', IN_MEMORY => 'true'}

The default value of the IN_MEMORY parameter is false. HBase offers no additional guarantee beyond keeping this column family in the block cache more aggressively than other column families, so setting this parameter to true does not change much in practice.

4 Bloom filters

The block index provides an efficient way to find the HFile blocks that should be read when accessing a particular row. But its usefulness is limited. The default HFile block size is 64 KB, and this size cannot be adjusted very much.

If you are looking up short rows, an index on only the starting row key of each block does not give you fine-grained information. For example, if a row takes up 100 bytes of storage, a 64 KB block contains (64 * 1024) / 100 = 655.36, or roughly 650 rows, and only the first of those rows appears in the index. The row you are looking for may fall within the key range of a particular block without actually being stored in that block. This can happen in several situations: the row does not exist in the table, it is stored in another HFile, or it is still in the MemStore. In these cases, reading the block from disk incurs IO overhead and pollutes the block cache. This can hurt performance, especially with a large dataset and many concurrent readers.

A Bloom filter lets you run a negative test against the data stored in each block. When a row is requested, the Bloom filter is checked first to see whether the row is definitely not in the block. The Bloom filter answers either that the row is certainly not there or that it does not know. That is why it is called a negative test. Bloom filters can also be applied to the cells within a row, using the same negative test when a column qualifier is accessed.
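The negative test can be illustrated with a minimal Bloom filter. This is a sketch under stated assumptions, not HBase's implementation: the class, the bit-array size, the number of hash functions, and the use of SHA-256 are all choices made here for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True means "don't know".
        return all(self.bits[pos] for pos in self._positions(key))


row_filter = BloomFilter()
for row in ("row-001", "row-002", "row-003"):
    row_filter.add(row)

print(row_filter.might_contain("row-002"))        # an added key is never reported absent
print(BloomFilter().might_contain("row-002"))     # an empty filter reports every key absent
```

A row-level filter would be keyed on the row key alone, as above; a filter at the column qualifier level would be keyed on the row key and qualifier together, which means more entries and therefore more space.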

Bloom filters are not free. Storing this additional level of index takes up extra space. A Bloom filter grows with the amount of data it indexes, so row-level Bloom filters take up less space than Bloom filters at the column qualifier level. When space is not a concern, they can help you squeeze out the full performance potential of your system.

You can turn on the Bloom filter on the column family, as follows:

hbase(main):007:0> create 'mytable', {NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}

The default value of the BLOOMFILTER parameter is NONE in the HBase versions this article describes (releases since 0.96 default to ROW). A row-level Bloom filter is enabled with ROW, and a Bloom filter at the column qualifier level is enabled with ROWCOL. The row-level Bloom filter checks whether a specific row key is absent from the block, and the ROWCOL Bloom filter checks whether the combination of row key and column qualifier is absent. A ROWCOL Bloom filter costs more space than a ROW Bloom filter.

5 Time to live (TTL)

Applications often need to delete old data from the database. Traditionally, databases built in a lot of flexibility for this because it was hard to scale them beyond a certain size. For example, in TwitBase you don't want to delete any of the tweets that users generate while using the application. This is user-generated data that may one day be useful for advanced analysis. But you don't need to keep all the tweets available for real-time access, so tweets older than a certain age can be archived and stored in flat files.

HBase lets you set a TTL, in seconds, at the column family level. Data older than the specified TTL is deleted at the next major compaction. If a cell has multiple versions, the versions older than the TTL are deleted. You can disable TTL, that is, keep data forever, by setting its value to Integer.MAX_VALUE (2147483647), which is the default. You can set the TTL when you create the table, as shown below:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', TTL => '18000'}

This command sets the TTL on the colfam1 column family to 18000 seconds = 5 hours. Data in colfam1 that is more than 5 hours old will be deleted at the next major compaction.
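The TTL check itself is simple arithmetic on cell timestamps. The sketch below assumes millisecond timestamps and a hypothetical helper named `is_expired`; it only illustrates the comparison and says nothing about when HBase physically removes the data.

```python
def is_expired(cell_timestamp_ms: int, now_ms: int, ttl_seconds: int) -> bool:
    """A cell is past its TTL when its age exceeds ttl_seconds."""
    return now_ms - cell_timestamp_ms > ttl_seconds * 1000


TTL_SECONDS = 18000            # value used in the create statement above
HOUR_MS = 60 * 60 * 1000
now_ms = 1_000 * HOUR_MS       # arbitrary "current time" for illustration

print(TTL_SECONDS / 3600, "hours")
print(is_expired(now_ms - 4 * HOUR_MS, now_ms, TTL_SECONDS))  # 4 hours old
print(is_expired(now_ms - 6 * HOUR_MS, now_ms, TTL_SECONDS))  # 6 hours old
```

A cell written four hours ago is still within the TTL, while one written six hours ago is eligible for removal.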

6 Compression

HFiles can be compressed when stored on HDFS. This saves disk IO, but compressing and decompressing increases CPU utilization when reading and writing data. Compression is part of the table definition and can be set when the table is created or when the schema is changed. Unless you are sure you will not benefit from compression, we recommend turning it on. Compression should be turned off only if the data cannot be compressed or if the servers' CPU capacity is constrained for some reason.
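Whether data "cannot be compressed" is easy to check on a sample. The following sketch uses Python's standard zlib module (DEFLATE, the algorithm behind GZIP) as a stand-in for the codecs HBase supports; the sample payloads are made up for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


# Repetitive, text-like cell data versus random bytes of the same length.
repetitive = b"row-0001/colfam1:status=active;" * 2048
random_like = os.urandom(len(repetitive))

print(f"repetitive: {compression_ratio(repetitive):.3f}")
print(f"random:     {compression_ratio(random_like):.3f}")
```

Repetitive data shrinks to a small fraction of its size, while random bytes (similar to already compressed or encrypted values) do not shrink at all, so compressing them only costs CPU.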

HBase can use a variety of compression codecs, including LZO, Snappy, and GZIP. LZO [1] and Snappy [2] are the two most popular. Snappy was released by Google in 2011, and the Hadoop and HBase projects began supporting it soon afterward. Before that, LZO was the codec of choice. The LZO native libraries used by Hadoop are licensed under GPLv2 and cannot be included in any Hadoop or HBase distribution; they must be installed separately. Snappy, on the other hand, is BSD-licensed, so it is easier to bundle with Hadoop and HBase distributions. The compression ratios and compression/decompression speeds of LZO and Snappy are similar.

When you create a table, you can turn on compression on the column family, as follows:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', COMPRESSION => 'SNAPPY'}

Note that data is compressed only on disk. It is not compressed in memory (MemStore or BlockCache) or during network transfer.

Changing the compression codec should not happen often, but if you do need to change the codec of a column family, you can do so directly. Change the table definition and set the new compression codec. During subsequent compactions, all newly generated HFiles are compressed with the new codec. This process does not require creating a new table and copying data. But after changing the codec, make sure you do not remove the old codec libraries from the cluster until all the old HFiles have been compacted.

7 Cell versioning

HBase maintains three versions of each cell by default (this was the default in older releases; since 0.96 the default is one). This property is configurable. If you need only one version, it is recommended that you configure the table to keep only one. That way, the system does not retain multiple versions of updated cells. Versioning is also set at the column family level and can be specified when the table is created:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}

You can specify multiple properties for a column family in the same create statement, as shown below:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1, TTL => '18000'}

You can also specify the minimum number of versions to be kept in the column family, as follows:

hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 5, MIN_VERSIONS => '1'}

This is useful in combination with a TTL on the column family. If all the versions currently stored are older than the TTL, at least the MIN_VERSIONS newest versions are retained. This ensures that your queries still return results even when all the data is older than the TTL.
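The interaction of VERSIONS, TTL, and MIN_VERSIONS can be sketched as a small retention function. This is a simplified model of the rules described above, with a hypothetical function name and millisecond timestamps; it is not how HBase implements compaction.

```python
def surviving_versions(timestamps_ms, now_ms, max_versions, ttl_seconds, min_versions=0):
    """Return the cell version timestamps kept after applying the retention rules.

    Keep at most max_versions of the newest versions, drop those older than the
    TTL, but always retain the min_versions newest ones.
    """
    newest_first = sorted(timestamps_ms, reverse=True)[:max_versions]
    cutoff_ms = now_ms - ttl_seconds * 1000
    return [
        ts
        for index, ts in enumerate(newest_first)
        if ts >= cutoff_ms or index < min_versions
    ]


now_ms = 100_000_000
old_versions = [10_000_000, 20_000_000, 30_000_000]  # all older than an 18000 s TTL

print(surviving_versions(old_versions, now_ms, max_versions=5, ttl_seconds=18000))
print(surviving_versions(old_versions, now_ms, max_versions=5, ttl_seconds=18000,
                         min_versions=1))
```

Without MIN_VERSIONS every expired version is dropped and a read returns nothing; with MIN_VERSIONS set to 1 the newest version survives.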

This is the end of this article on "what are the features of HBase". I hope the above content is of some help to you.