Tuesday, February 19, 2019
HBase create-table advanced attributes // HBase table pre-splitting, i.e. manual partitioning; this is very important
The following shell commands play a very important role in day-to-day HBase operation, mainly in the table-creation process. See the create attributes below.
1. BLOOMFILTER. Default is NONE, i.e. no Bloom filter.
Bloom filtering can be enabled individually for each column family. In the Java API, HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) enables it for a single column family; the default is NONE, no Bloom filtering. With ROW, a hash of the row key is added to the Bloom filter on every insert. With ROWCOL, a hash of row key + column family + column qualifier is added on every insert.
Usage: create 'table', {BLOOMFILTER => 'ROW'}
Enabling a Bloom filter lets reads skip store files that cannot contain the requested row, saving disk reads and helping to improve read latency.
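For illustration, a minimal sketch (the table, family, and rowkey names here are made up). ROWCOL only pays off when reads ask for specific columns; otherwise ROW is usually the better default.
create 'user_profile', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}
get 'user_profile', 'uid001', 'cf:age'    # an exact-column read like this is what ROWCOL accelerates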
2. VERSIONS. Default is 3, meaning each cell keeps up to three versions. If our data volume is not that large, the data is updated all the time, and old versions are worthless to us, then setting this parameter to 1 saves 2/3 of the space.
Usage: create 'table', {VERSIONS => 2}
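To see the effect, a quick sketch (table and values are made up): with VERSIONS => 2, the oldest version falls away once a third one is written.
create 'demo', {NAME => 'cf', VERSIONS => 2}
put 'demo', 'row1', 'cf:q', 'v1'
put 'demo', 'row1', 'cf:q', 'v2'
put 'demo', 'row1', 'cf:q', 'v3'
get 'demo', 'row1', {COLUMN => 'cf:q', VERSIONS => 3}    # returns at most two versions: v3 and v2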
3. COMPRESSION. Default is NONE, i.e. no compression.
This parameter controls whether the column family is compressed and which compression algorithm is used.
Usage: create 'table', {NAME => 'info', COMPRESSION => 'SNAPPY'}
I suggest the SNAPPY compression algorithm. There are plenty of algorithm comparisons online; I extracted the table below from the Internet as a reference. The actual installation of Snappy will be described in a separate chapter.
The table is a set of test data released by Google a few years ago; actual Snappy tests come out close to it. Before Snappy was released (Google open-sourced it in 2011), HBase typically used the LZO algorithm, whose goal is compression and decompression that are as fast as possible with low CPU consumption. After Snappy's release it became the recommended algorithm (see "HBase: The Definitive Guide"); for a concrete choice, run a more detailed comparison of LZO and Snappy against your actual workload.
Algorithm      % remaining   Encoding    Decoding
GZIP           13.4%         21 MB/s     118 MB/s
LZO            20.5%         135 MB/s    410 MB/s
Zippy/Snappy   22.2%         172 MB/s    409 MB/s
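Before creating a table with SNAPPY, it is worth verifying that the codec actually works on the cluster. A minimal check, assuming the Snappy native libraries are installed (the test file path is arbitrary):
hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/compression_test snappy
# prints SUCCESS when HBase can write and read back a Snappy-compressed file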
What if the table was created without compression and you want to add a compression algorithm later? HBase has another command for that: alter.
4. alter
Usage, e.g. to modify the compression algorithm:
disable 'table'
alter 'table',{NAME=>'info',COMPRESSION=>'snappy'}
enable 'table'
To delete a column family:
disable 'table'
alter 'table',{NAME=>'info',METHOD=>'delete'}
enable 'table'
But after this modification the table's on-disk data stays just as large; nothing much changes. Why?
The physical rewrite only happens after you run the major_compact 'table' command.
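To see the effect on disk, a sketch assuming the default hbase.rootdir of /hbase and a table in the default namespace:
hdfs dfs -du -h /hbase/data/default/table    # on-disk size before
major_compact 'table'                        # run in the hbase shell; rewrites the HFiles with the new codec
hdfs dfs -du -h /hbase/data/default/table    # the size should shrink once compaction finishes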
5. TTL. Default is 2147483647, i.e. Integer.MAX_VALUE seconds, roughly 68 years.
This parameter sets the time-to-live of the column family's data, i.e. the data's life cycle; the unit is seconds.
Set it according to your specific requirements. Data past its TTL is no longer visible in the table, and it is deleted completely at the next major compaction.
Why the deletion happens at the next major compact will be explained later.
Note that TTL works together with MIN_VERSIONS. With MIN_VERSIONS => '0', once the TTL expires all data in the family is deleted completely. If MIN_VERSIONS is not 0, the newest MIN_VERSIONS versions are retained and all older ones are deleted; for example, MIN_VERSIONS => '1' keeps only the latest version.
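For illustration, a sketch with made-up table names (TTL is in seconds):
create 'logs', {NAME => 'cf', TTL => 604800, MIN_VERSIONS => '0'}    # 7 days; expired cells disappear entirely
create 'logs2', {NAME => 'cf', TTL => 604800, MIN_VERSIONS => '1'}   # 7 days, but the newest version is always kept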
6. describe. The command describe 'table' shows the parameters passed at create time, or the defaults.
7. disable_all. disable_all 'toplist.*' supports regular expressions and lists the currently matched tables, for example:
toplist_a_total_1001
toplist_a_total_1002
toplist_a_total_1008
toplist_a_total_1009
toplist_a_total_1019
toplist_a_total_1035
...
Disable the above 25 tables (y/n)? It asks for confirmation before proceeding.
8. drop_all. Used the same way as disable_all (the tables must be disabled before they can be dropped).
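For example, a sketch reusing the regex above:
disable_all 'toplist.*'
drop_all 'toplist.*'    # also lists the matched tables and asks for y/n confirmation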
9. HBase table pre-splitting, i.e. manual partitioning.
By default, a single region is created along with an HBase table, and while importing data every client writes to that one region until it grows large enough to split. One way to speed up bulk writes is to create empty regions in advance, so that writes are load-balanced across the cluster according to the region boundaries.
Usage: create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
You can also use the RegionSplitter utility from the command line:
hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f info
The parameters are easy to understand: test_table is the table name, HexStringSplit is the split algorithm, -c 10 pre-splits into 10 regions, and -f info names the column family.
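If the rowkey distribution is known up front, the split points can also be listed explicitly instead of letting a split algorithm pick them. A sketch (the split keys and the splits.txt file are made up; SPLITS_FILE expects one split key per line):
create 't2', 'f1', SPLITS => ['10', '20', '30', '40']    # 4 split keys, hence 5 regions
create 't3', 'f1', SPLITS_FILE => 'splits.txt'           # read the split keys from a file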
This pre-splits the table into 10 regions, avoiding the cost of automatic splitting once the data reaches the store file size limit. There is another advantage: with a reasonably designed rowkey, concurrent requests are distributed (roughly) evenly across regions, maximizing IO efficiency. Pre-splitting does, however, call for a larger file size limit. The parameter is hbase.hregion.max.filesize, which defaults to 10 G, i.e. a single region defaults to a maximum of 10 G.
This default evolved from 256 M to 1 G to 10 G across versions 0.90, 0.92 and 0.94.3; adjust it according to your own needs.
However, if the MapReduce input type is TableInputFormat with HBase as the input, note that each region gets exactly one map task. If the table holds less than 10 G, only one map runs, wasting a lot of resources. In that case, consider lowering this parameter appropriately, or pre-allocate regions and set hbase.hregion.max.filesize to a value so large it is rarely reached, say 1000 G, and assign regions manually if it ever is.
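The limit can also be overridden per table rather than cluster-wide in hbase-site.xml. A sketch of the shell syntax (the size is in bytes; 1073741824000 is the ~1000 G figure suggested above):
disable 'table'
alter 'table', MAX_FILESIZE => '1073741824000'
enable 'table'
describe 'table'    # verify the MAX_FILESIZE attribute took effect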
Compaction was mentioned earlier: why does data past its TTL disappear, how does it disappear, is it really deleted, and by which parameters? We'll cover HBase compaction later.
For example, the create statement below can be used for an HBase table in production:
hbase(main):009:0> create 'NewsClickFeedback', {NAME => 'Toutiao', VERSIONS => 1, BLOCKCACHE => true, BLOOMFILTER => 'ROW', COMPRESSION => 'SNAPPY', TTL => 2592000}, {SPLITS => ['1','2','a','b']}
Notes: NAME => 'Toutiao' names the column family Toutiao; VERSIONS => 1 keeps a single version; BLOCKCACHE => true enables the block cache (it is enabled by default);
BLOOMFILTER => 'ROW' enables the Bloom filter, which optimizes HBase random-read performance (optional values NONE | ROW | ROWCOL, default NONE, settable per column family); COMPRESSION => 'SNAPPY' enables compression; TTL => 2592000 expires data after 30 days;
{SPLITS => ['1','2','a','b']} pre-splits the table. A common pattern is hash + split: for example, md5-hash the rowkey first, then split on its first character; the four split keys above pre-allocate five regions.
Insert data:
put 'NewsClickFeedback', '1', 'Toutiao:name', 'zhangsan'
put 'NewsClickFeedback', '01', 'Toutiao:name', 'lisi'
put 'NewsClickFeedback', '100', 'Toutiao:name', 'wangwu'
put 'NewsClickFeedback', '4', 'Toutiao:name', 'liuliu'
(Rowkeys are compared as byte strings, so '01' sorts before '1' and lands in the first region, '1' and '100' land in the ['1','2') region, and '4' lands in the ['2','a') region.)
IN_MEMORY
Whether the data resides in memory; default is false. HBase provides a cache area for frequently accessed data, generally holding small volumes of data that are accessed very often, metadata storage being the common scenario. By default the in-memory area amounts to JVM heapsize * 0.2 * 0.25, so with a 70 G JVM heap it comes to roughly 3.5 G. Note that HBase's meta information is stored in this area; if a large business column family is set to true, the meta data can be evicted, degrading the performance of the whole cluster, so be extra careful with this parameter.
// IN_MEMORY means resident in memory. If a user sets this parameter to true on a column family, that family's data will reside in memory; it is generally recommended for business metadata. The hbase:meta table itself also resides in memory. As for whether hbase:meta can be split: I actually ran split 'hbase:meta' in a real environment, and the result was still a single region, not several, so for now I take it that hbase:meta cannot be split.
The post "HBase BlockCache series: Walk into BlockCache" (http://hbasefly.com/2016/04/08/hbase-blockcache-1/) also covers IN_MEMORY => true:
// The in-memory area holds data that may reside in memory; it is generally used for small volumes of frequently accessed data such as metadata. Users can place a column family in this area by setting the attribute IN_MEMORY => true when creating the table.
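A sketch with made-up names, for a small dimension table that is read constantly:
create 'dim_user', {NAME => 'meta', IN_MEMORY => 'true'}    # this family's blocks are cached in the in-memory area
Keep such families small; as noted above, the area is shared with hbase:meta.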
BLOCKSIZE
An HFile is divided into blocks of equal size; the block size of a column family can be specified at table creation with e.g. BLOCKSIZE => '65536'. The default is 64 K. Large blocks favor sequential scans and small blocks favor random reads, so a trade-off is needed.
// If the workload consists mainly of Get requests, consider setting the block size smaller; if Scan requests dominate, increase it. The default 64 K block size is a balance between Scan and Get.
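A sketch with made-up names and sizes, following the Get-versus-Scan guidance above:
create 'random_read_table', {NAME => 'cf', BLOCKSIZE => '16384'}    # 16 K blocks favor point Gets
create 'scan_heavy_table', {NAME => 'cf', BLOCKSIZE => '131072'}    # 128 K blocks favor sequential Scans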
References:
HBase create-table statement parsing: http://hbasefly.com/2016/03/23/hbase_create_table/
HBase (VIII): table structure design optimization: www.cnblogs.com/tgzhu/archive/2016/09/11/5862299.html