Detailed explanation of getting started with Hbase


1. Overview of hbase

1.1 what is hbase

Hbase is a distributed data store built on hdfs. It is a nosql database offering high reliability, high performance, column-oriented storage, scalability, and real-time reads and writes.

Hbase can store very large amounts of data while keeping query performance high: it can return results from tables with hundreds of millions of rows within seconds.

1.2 characteristics of hbase table

1. Large

Hbase tables can store huge amounts of data.

2. Schema-free

In a mysql table every row has the same fields, while each row of data in an hbase table can have a completely different set of columns.

3. Column oriented

An hbase table can have many columns; the data is physically grouped by column family and written to separate files, i.e. data is stored per column family.

4. Sparse

Columns that are null in the hbase table do not take up actual storage space.

5. Multiple versions of data

When data in an hbase table is updated, the previous value is not deleted immediately; instead multiple versions of the data are retained. Each version carries a version number, which is taken from the timestamp at the time the data was inserted (a small shell example follows at the end of this list).

6. Single data type

No matter what type the data is, it is ultimately converted into a byte array before being stored in the hbase table.
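
As an illustration of multi-versioning, here is a minimal hbase shell sketch; it reuses the t_user_info table and base_info column family created in section 5, and the values are made up:

# keep up to 3 versions on the column family
alter 't_user_info', NAME => 'base_info', VERSIONS => 3
# two writes to the same cell create two versions with different timestamps
put 't_user_info', '00001', 'base_info:age', '30'
put 't_user_info', '00001', 'base_info:age', '31'
# read back up to the 3 most recent versions
get 't_user_info', '00001', {COLUMN => 'base_info:age', VERSIONS => 3}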

1.3 logical view of the hbase table

2. The cluster structure of hbase

1. client

The client provides java interfaces for manipulating hbase tables. It maintains caches to speed up access to hbase, for example caching the region location information it has already looked up; this cache is not invalidated proactively.

2. zookeeper

A zk cluster is required for the client to manipulate hbase table data.

Functions

1. Zk stores the metadata of the hbase cluster.

It stores the schema of Hbase: which tables exist and which column families each table has.

2. Zk holds the addressing entry for all hbase tables

Later, when you manipulate hbase data through the client interface, you need to connect to the zk cluster

It stores the addressing entry of all regions, i.e. which server hosts the root table.

3. With zk in place, high availability of the whole hbase cluster is achieved.

4. Zk keeps the registration and heartbeat information of HMaster and HRegionServer.

If an HRegionServer later dies, zk senses it and informs the HMaster.

3. HMaster

It is the boss of the whole hbase cluster.

Functions

1. It accepts client requests to create and delete tables and processes schema update requests.

2. It assigns regions to the HRegionServers that will manage their data.

3. It reassigns the regions managed by a dead HRegionServer to other live HRegionServers.

4. It balances load across HRegionServers, so that no single HRegionServer manages too many regions.

4. HRegionServer

It is the "younger brother" (worker) of the whole hbase cluster.

Functions

1. It manages the regions assigned to it by the HMaster.

2. It receives read and write requests from clients.

3. It splits regions that grow too large at runtime.

5. Region

It is the smallest unit of distributed storage in the entire hbase table.

Its data is stored based on hdfs.

3. Hbase cluster installation and deployment

prerequisite

First set up zk and hadoop clusters
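
For example, a minimal sketch, assuming zookeeper and hadoop are already installed under /export/servers on every node (node names follow this article's conventions):

# on node1, node2 and node3: start zookeeper
zkServer.sh start
# on the hadoop namenode: start hdfs (hbase itself only needs hdfs)
start-dfs.sh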

1. Download the corresponding installation package

http://archive.apache.org/dist/hbase/1.2.1/hbase-1.2.1-bin.tar.gz

2. Plan the installation directory

/export/servers

3. Upload the installation package to the server

4. Extract the installation package to the specified planning directory

tar -zxvf hbase-1.2.1-bin.tar.gz -C /export/servers

5. Rename the decompression directory

mv hbase-1.2.1 hbase

6. Modify the configuration file

The hadoop configuration files core-site.xml and hdfs-site.xml, found under etc/hadoop in the hadoop installation directory, need to be copied into the conf folder under the hbase installation directory.
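
For example, a sketch assuming hadoop is installed under /export/servers/hadoop (adjust the path to your environment):

cp /export/servers/hadoop/etc/hadoop/core-site.xml /export/servers/hbase/conf/
cp /export/servers/hadoop/etc/hadoop/hdfs-site.xml /export/servers/hbase/conf/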

1. vim hbase-env.sh

# configure the java environment variable
export JAVA_HOME=/export/servers/jdk
# let an external zk cluster manage hbase instead of using hbase's built-in zk
export HBASE_MANAGES_ZK=false

2. vim hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node1:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>
</configuration>

3. vim regionservers

# specify which nodes run HRegionServer
node2
node3

4. vim backup-masters

# specify which nodes run the backup HMaster
node2

7. Configure hbase environment variables

vim /etc/profile

export HBASE_HOME=/export/servers/hbase
export PATH=$PATH:$HBASE_HOME/bin

8. Distribute hbase directories and environment variables

scp -r hbase node2:/export/servers
scp -r hbase node3:/export/servers
scp /etc/profile node2:/etc
scp /etc/profile node3:/etc

9. Let the environment variables of all hbase nodes take effect

Execute on all nodes

source /etc/profile

4. Start and stop the hbase cluster

1. Start the hbase cluster

Start the zk and hadoop clusters first

Then, from hbase/bin, run:

start-hbase.sh

Wherever this script is run, it first starts an HMaster process on the current machine (this becomes the active HMaster), then starts an HRegionServer on each node listed in the regionservers file, and starts a backup HMaster on each node listed in the backup-masters file.
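
Individual daemons can also be started or stopped on a single node with the standard hbase-daemon.sh script under hbase/bin, which is handy when only one process needs a restart, for example:

# on the affected node
hbase-daemon.sh start regionserver
hbase-daemon.sh stop regionserver
# the same works for a (backup) HMaster
hbase-daemon.sh start master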

2. Stop the hbase cluster

From hbase/bin, run:

stop-hbase.sh

Hbase Cluster web Management Interface

1. After starting the hbase cluster

Access address

http://<HMaster hostname>:16010

5. Hbase shell command line operation

Run hbase/bin/hbase shell to enter the hbase shell command-line client.

1. Create a table

# create a table with one column family
create 't_user_info', 'base_info'
# or specify the column families explicitly
create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}

2. Check which tables are available.

list (similar to show tables in mysql)

3. View the description information of the table

describe 't_user_info'

4. Modify the properties of the table

# modify the maximum number of versions kept for a column family
alter 't_user_info', NAME => 'base_info', VERSIONS => 3

5. Add data to the table

put 't_user_info', '00001', 'base_info:name', 'zhangsan'
put 't_user_info', '00001', 'base_info:age', '30'

6. Query the data of the table

# query by condition
get 't_user_info', '00001', {COLUMN => 'base_info'}
get 't_user_info', '00001', {COLUMN => 'base_info:name'}
get 't_user_info', '00001', {TIMERANGE => [start_ts, end_ts]}   # replace with millisecond timestamps
get 't_user_info', '00001', {COLUMN => 'base_info:age', VERSIONS => 3}
# full table scan
scan 't_user_info'

7. Delete data

delete 't_user_info', '00001', 'base_info:name'
deleteall 't_user_info', '00001'

8. Delete the table

disable 't_user_info'
drop 't_user_info'

6. The internal principle of hbase

All rows in a Table are sorted in row key dictionary order. A Table is split in the row direction into multiple HRegions by size (10G by default). Each table starts with only one region; as data is inserted the region keeps growing, and when it reaches the threshold the HRegion splits into two new HRegions. As the number of rows in the table grows, there will be more and more HRegions.

HRegion is the smallest unit of distributed storage and load balancing in Hbase: different HRegions can be distributed across different HRegionServers. Although HRegion is the smallest unit of load balancing, it is not the smallest unit of physical storage. An HRegion consists of one or more Stores, and each Store holds one column family. Each Store in turn consists of one MemStore and zero or more StoreFiles.

A write operation first goes to the MemStore. When the amount of data in the MemStore reaches a threshold (128m by default, or after 1 hour), the HRegionServer starts a flush process that writes the data to a StoreFile; each flush produces a separate StoreFile. When the number of StoreFiles exceeds a threshold (hbase.hstore.blockingStoreFiles=10 by default), multiple StoreFiles are merged. When the total size of all StoreFiles of all Stores in a region exceeds hbase.hregion.max.filesize=10G, the region is split into two, and the HMaster assigns the new regions to the appropriate region servers to achieve load balancing.

Each HRegionServer keeps an HLog object. HLog is a class that implements Write Ahead Log: every user write that goes into a MemStore is also written to the HLog file. The HLog periodically rolls over to a new file and deletes old files (data that has already been persisted to StoreFiles). When an HRegionServer terminates unexpectedly, the HMaster learns of it through Zookeeper. The HMaster first processes the leftover HLog files, splitting the log data belonging to different regions into the corresponding region directories, and then redistributes the invalidated regions. As the HRegionServers that receive these regions load them, they find the historical HLog data to process, replay the HLog into the MemStore, and then flush to StoreFiles to complete data recovery.
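
The thresholds mentioned above are controlled by cluster-wide settings in hbase-site.xml (the text names hbase.hstore.blockingStoreFiles and hbase.hregion.max.filesize; the memstore flush size is hbase.hregion.memstore.flush.size). They can also be overridden per table from the hbase shell; a minimal sketch, reusing the t_user_info table from section 5 (the values are simply the defaults expressed in bytes):

# split a region once its storefiles exceed 10G (10737418240 bytes)
alter 't_user_info', MAX_FILESIZE => '10737418240'
# flush a memstore once it reaches 128m (134217728 bytes)
alter 't_user_info', MEMSTORE_FLUSHSIZE => '134217728'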

7. The addressing mechanism of hbase

Looking for RegionServer

ZooKeeper -> -ROOT- (single Region) -> .META. -> user table

-ROOT- table

The -ROOT- table contains the list of regions where the .META. table is located. -ROOT- only ever has one region, and the root region is never split, which guarantees that at most three hops are needed to locate any region. The location of the -ROOT- table is recorded in Zookeeper.

.META. Table

The .META. table contains the list of all user-space regions, together with the server addresses of their RegionServers. Each row of the .META. table holds the location information of one region; its row key is encoded from the table name plus the table's last row. To speed up access, all regions of the .META. table are kept in memory.
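
Note that in the hbase 1.x series installed earlier in this article, the -ROOT- table has been removed; the catalog table is named hbase:meta and its location is stored directly in zookeeper. Its region location entries can be inspected from the shell:

scan 'hbase:meta'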

Contact regionserver to query target data

The regionserver locates the region containing the target data and issues the query against it.

Region looks for it in memstore first, and returns if hit.

If it is not found in the memstore, the storefiles are scanned (there may be many storefiles, which is where Bloom filters help). A Bloom filter can quickly tell whether the queried rowkey might be in a given storefile, with a one-sided error: if it says no, the rowkey is definitely not there; if it says yes, the rowkey may still be absent.

8. Advanced applications of Hbase

Advanced attributes when building a table

BLOOMFILTER: defaults to ROW (a row-level Bloom filter).

For ROW, a hash of the row key is added to the Bloom filter on each insert. For ROWCOL, a hash of row key + column family + column qualifier is added to the Bloom filter on each insert.

VERSIONS: defaults to 1 data version.

If there is no need to keep many versions of data that is updated all the time, and old versions are of no value, then setting this parameter to 1 can save about two thirds of the space.

COMPRESSION: defaults to NONE (no compression). Supported codecs include:

GZIP / LZO / Zippy / Snappy
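
Putting these attributes together, they can all be set when the table is created. A sketch (t_news and the cf column family are hypothetical names, and the snappy codec must be installed on the cluster for the compression setting to take effect):

create 't_news', {NAME => 'cf', BLOOMFILTER => 'ROWCOL', VERSIONS => 1, COMPRESSION => 'SNAPPY'}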

disable_all 'toplist.*'  (disable_all supports regular expressions and lists the currently matched tables before acting; drop_all works the same way)

Hbase table pre-partitioning (manual region splitting)

One way to speed up batch writes is to create some empty regions in advance, so that when data is written to HBase it is load balanced across the cluster according to the region partitions. This reduces the time spent on automatic splits when regions reach the storefile size threshold. A further advantage is that, combined with a sensible rowkey design, concurrent requests can be distributed (close to) evenly across the regions, maximizing IO efficiency. A sketch is shown below.
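
A minimal pre-split sketch (the t_pre table and its split points are hypothetical; in practice the split keys should match your rowkey distribution):

# create a table with 4 regions, split at the given rowkey boundaries
create 't_pre', 'cf', SPLITS => ['10', '20', '30']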

Row key design

Keep the number of column families as small as possible, usually 2 or 3.

Rowkey

Because rowkeys are stored in dictionary order, data that will be queried together in batches should be stored as contiguously as possible (the "spear" side of the contradiction discussed below). Assemble the query keywords into the rowkey where possible, put the most frequently queried condition toward the front, and keep the rowkey as short as possible, no more than 16 bytes.

Every cell in an HFile stores the rowkey, so an overly long rowkey hurts storage efficiency.

The MemStore caches part of the data in memory; if the rowkey is too long, the effective utilization of memory drops, the system cannot cache as much data, and retrieval efficiency suffers.

It is recommended to use the high bits of the rowkey as a hash field, randomly generated by the program, and to put the time field in the low bits. This improves the probability that data is evenly distributed across the RegionServers, achieving load balancing (the "shield" side of the contradiction).

The rowkey contradiction (spear vs. shield)

Rows in HBase are sorted by the dictionary order of their rowkeys. This design optimizes scans: related rows, and rows that will be read together, are stored near each other, which is convenient for scanning. However, a poorly designed rowkey is the source of hot spots.

Hot spot resolution

Common techniques: salting, i.e. prepending a random string to the rowkey; hashing, so that the same row always gets the same prefix; reversing a fixed-length or numeric rowkey, at the expense of the rowkey's ordering; and timestamp reversal.

You can append Long.MAX_VALUE - timestamp to the end of the key, for example [key][reverse_timestamp]. The latest value of [key] can then be obtained by scanning for [key]: because rowkeys in HBase are ordered, the first record found is the most recently written one.
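
A rough sketch of building such a key outside of hbase (bash with GNU date; the user123 business key, the t_user_log table, the underscore separator and the cf:action column are illustrative):

key="user123"                               # business key
ts=$(date +%s%3N)                           # current time in milliseconds
reverse_ts=$(( 9223372036854775807 - ts ))  # Long.MAX_VALUE - timestamp
rowkey="${key}_${reverse_ts}"
echo "$rowkey"
# insert via the hbase shell, e.g. put 't_user_log', '<rowkey>', 'cf:action', 'login'
# scanning with STARTROW => 'user123_' then returns the newest record for user123 first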

Summary

The above is the whole content of this article. I hope it provides some reference and learning value for your study or work. Thank you for your support.
