2025-02-24 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report--
This article introduces what HBase is. It has some reference value for interested readers; I hope you learn a lot from it. Let the editor take you through it below.
I. What is HBase
HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. Using HBase, large-scale structured storage clusters can be built on inexpensive PC servers.
HBase is an open-source implementation of Google Bigtable, with analogous components: just as Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS; just as Google runs MapReduce to process the massive data in Bigtable, HBase uses Hadoop MapReduce; and where Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper.
II. HBase design model
Every table in HBase can be viewed as a BigTable. A BigTable stores a series of row records, and each record has three basic elements: Row Key, Timestamp, and Column.
1. Row Key is the unique identifier of a row in the BigTable.
2. Timestamp is the timestamp associated with each data operation; it can be thought of as a version number, much like a revision in SVN.
3. Column is defined as <family>:<label>. Together, these two parts identify a unique data storage column. Defining or modifying a family requires a DDL-like operation against HBase, whereas a label can be used directly without being defined first, which provides a way to create columns dynamically. The family also plays a role in optimizing physical reads and writes: data within the same family is stored physically close together, a property worth exploiting during schema design.
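As an illustration of the logical model just described, here is a toy sketch in plain Python (not the HBase API): a table is a map from row key to columns named <family>:<label>, each holding timestamped values. The table name, keys, and values are made up for the example.

```python
# Toy sketch of HBase's logical model: row key -> "<family>:<label>" -> {timestamp: value}.
# The family part is fixed up front (schema), the label part can be invented on the fly.
table = {}

def put(row_key, family, label, value, ts):
    column = f"{family}:{label}"  # family must exist; label is free-form
    table.setdefault(row_key, {}).setdefault(column, {})[ts] = value

put("row1", "courses", "history", "A", ts=1)
put("row1", "courses", "math", "B+", ts=1)  # new label, no schema change needed
print(table["row1"]["courses:math"][1])     # B+
```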
1. Logical storage model
HBase stores data in tables. A table consists of rows and columns, and the columns are grouped into column families.
The following is a detailed parsing of the elements in the table:
Row Key
As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:
1. Access through a single row key
2. Range scan over row keys
3. Full table scan
A row key can be any string (maximum length 64 KB; in practice, 10-100 bytes is typical). Internally, HBase stores the row key as a byte array.
Data is stored sorted by the byte order of the row key. When designing keys, take full advantage of this sorted-storage property and place rows that are often read together next to each other (location correlation).
Note:
1. The lexicographic order of integers is 1, 10, 100, 11, 12, 13, ..., 19, 2, 20, 21, ..., 9, 91, 92, 93, 94, 95, 96, 97, 98, 99. To preserve the natural ordering of integers, row keys must be left-padded with zeros.
2. A single read or write of one row is an atomic operation (no matter how many columns are read or written at once).
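The zero-padding point above can be demonstrated with ordinary string sorting in plain Python, outside of HBase:

```python
# Lexicographic (byte) order scrambles integer-like row keys:
# '2' sorts after '19', because comparison is character by character.
keys = [str(i) for i in range(1, 21)]
print(sorted(keys)[:5])  # ['1', '10', '11', '12', '13']

# Left-padding with zeros restores the natural integer order.
padded = [str(i).zfill(4) for i in range(1, 21)]
print(sorted(padded) == padded)  # True: already in numeric order
```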
Column family
Each column in an HBase table belongs to a column family. Column families are part of the table's schema (while columns are not) and must be defined before the table is used. Column names are prefixed with their column family; for example, courses:history and courses:math both belong to the courses column family.
Access control, disk and memory usage statistics are all carried out at the column family level. In practical applications, control permissions on column families can help us manage different types of applications: we allow some applications to add new basic data, some applications can read basic data and create inherited column families, and some applications are only allowed to browse data (and may even not be able to browse all data for privacy reasons).
Time stamp
The storage unit determined by a row and a column in HBase is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when data is written, in which case it is the current system time accurate to milliseconds, or it can be assigned explicitly by the client. If an application wants to avoid data version conflicts, it must generate unique timestamps itself. Within each cell, versions are sorted in reverse chronological order, so the latest data comes first.
To avoid the management burden (storage and indexing) of too many data versions, HBase provides two ways to reclaim them: keep only the last n versions, or keep only recent versions (for example, from the last seven days). Either policy can be set per column family.
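The versioning behaviour described above can be sketched with a toy cell in plain Python (not the HBase API): versions are kept newest-first and trimmed to the last n.

```python
import time

# Toy model of an HBase cell that retains the newest `max_versions` values,
# indexed by timestamp, sorted in reverse chronological order.
class Cell:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, value, ts=None):
        if ts is None:
            ts = int(time.time() * 1000)  # milliseconds, like HBase's default
        self.versions.append((ts, value))
        self.versions.sort(key=lambda kv: kv[0], reverse=True)
        self.versions = self.versions[: self.max_versions]  # discard old versions

    def get(self):
        # the newest version comes first
        return self.versions[0][1] if self.versions else None

c = Cell(max_versions=2)
c.put("a", ts=1); c.put("b", ts=2); c.put("c", ts=3)
print(c.get())       # c  (latest data comes first)
print(c.versions)    # [(3, 'c'), (2, 'b')] -- the oldest version was reclaimed
```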
Cell
The unit uniquely determined by {row key, column (= <family> + <label>), version}. The data in a cell has no type and is stored as raw bytes.
2. Physical storage model
A table is split into multiple HRegions along the row dimension, and the HRegions are scattered across different RegionServers.
Each HRegion consists of multiple Stores; each Store consists of one MemStore and zero or more StoreFiles, and each Store holds one column family.
StoreFiles are stored in HDFS in HFile format.
III. HBase storage architecture
The HBase storage architecture involves HMaster, HRegionServer, HRegion, Store, MemStore, StoreFile, HFile, HLog, and so on.
Each table in HBase is divided by row key ranges into multiple child tables (HRegions). By default, an HRegion that exceeds 256 MB is split into two. This process is managed by the HRegionServer, while the assignment of HRegions is managed by the HMaster.
The role of HMaster:
1. Assign regions to Region Servers.
2. Balance the load across Region Servers.
3. Detect failed Region Servers and reassign their regions.
4. Collect garbage files on HDFS.
5. Process schema update requests.
The role of HRegionServer:
1. Maintain the regions assigned to it by the master, and handle IO requests for those regions.
2. Split regions that become too large during operation.
As you can see, the client does not need the master to access data in HBase (addressing goes through ZooKeeper and the Region Servers; data reads and writes go to the Region Servers). The master only maintains metadata about tables and regions (table metadata is stored on ZooKeeper), so its load is very low. When an HRegionServer opens a child table, it creates an HRegion object and then a Store instance for each column family of the table. Each Store has one MemStore and zero or more StoreFiles; each StoreFile corresponds to one HFile, the actual storage file. An HRegion therefore has as many Stores as the table has column families, and an HRegionServer has multiple HRegions and one HLog.
HRegion
A table is divided into multiple regions along the row dimension. The region is the smallest unit of distributed storage and load balancing in HBase: different regions can live on different Region Servers, but a single region is never split across servers.
Regions are split by size. Each table initially has only one region; as data is inserted, the region grows, and when a column family of the region reaches a threshold (256 MB by default), the region splits into two new regions.
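The size-based splitting just described can be sketched in plain Python (not HBase code); the 256 MB threshold becomes a tiny key count, and the midpoint split is a simplification:

```python
# Toy region splitting: a region covers [start, end) of the key space and
# splits at its midpoint key once it holds more than THRESHOLD entries.
THRESHOLD = 4  # stand-in for the 256 MB default

def insert(regions, key):
    for r in regions:
        if r["start"] <= key < r["end"]:
            r["keys"].append(key)
            if len(r["keys"]) > THRESHOLD:
                split(regions, r)
            return

def split(regions, r):
    r["keys"].sort()
    mid = r["keys"][len(r["keys"]) // 2]  # midpoint key becomes the new boundary
    left = {"start": r["start"], "end": mid,
            "keys": [k for k in r["keys"] if k < mid]}
    right = {"start": mid, "end": r["end"],
             "keys": [k for k in r["keys"] if k >= mid]}
    i = regions.index(r)
    regions[i:i + 1] = [left, right]

regions = [{"start": "", "end": "\xff", "keys": []}]  # a new table: one region
for k in ["a", "b", "c", "d", "e"]:
    insert(regions, k)
print(len(regions))  # 2 -- the single region has split into two
```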
A region is identified in two ways:
1. By the triple <table name, startRowkey, creation time>.
2. By the catalog tables (-ROOT- and .META.), which record each region's endRowkey.
HRegion location: the assignment of regions to Region Servers is completely dynamic, so a mechanism is needed to locate the Region Server holding a given region.
HBase uses a three-tier structure to locate region:
1. The location of the -ROOT- table is obtained from the file /hbase/rs in ZooKeeper. The -ROOT- table has only one region.
2. Through the -ROOT- table, find the region of the .META. table that contains the target entry. In effect, the -ROOT- table is the first region of the .META. table; each region of the .META. table is one row record in -ROOT-.
3. Through the .META. table, find the location of the desired user-table region. Each region of a user table has one row of records in .META.
The -ROOT- table is never split into multiple regions, which guarantees that at most three hops are needed to locate any region. The client saves and caches the location information it looks up, and the cache never expires proactively; if all of a client's caches turn out to be stale, it takes six round trips to locate the correct region, three of which are spent discovering the cache failures and the other three fetching the location information.
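The three-hop lookup with client-side caching can be sketched in plain Python (not HBase code); the server names and region names below are invented for the example:

```python
# Toy three-level region lookup: ZooKeeper -> -ROOT- -> .META. -> region server,
# with a client-side cache that short-circuits repeat lookups.
zk = {"/hbase/rs": "server-root"}              # ZooKeeper: where -ROOT- lives
root_table = {"meta-region-1": "server-meta"}  # -ROOT-: where .META. regions live
meta_table = {"user-region-A": "server-7"}     # .META.: where user regions live

cache = {}

def locate(region):
    if region in cache:               # cached: zero hops
        return cache[region]
    _ = zk["/hbase/rs"]               # hop 1: find -ROOT-
    _ = root_table["meta-region-1"]   # hop 2: find the .META. region
    server = meta_table[region]       # hop 3: find the user region's server
    cache[region] = server
    return server

print(locate("user-region-A"))  # cold lookup: three hops
print(locate("user-region-A"))  # warm lookup: served from the cache
```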
Store
Each region consists of one or more Stores (at least one). HBase puts data that is accessed together into one Store: it creates one Store per ColumnFamily, so a region has as many Stores as there are ColumnFamilies. A Store consists of one MemStore and zero or more StoreFiles. HBase uses the size of a Store to decide whether the region needs to be split.
MemStore
The MemStore lives in memory and holds modified data, i.e. keyValues. When the size of a MemStore reaches a threshold (64 MB by default), the MemStore is flushed to a file, producing a snapshot. Currently, HBase has a dedicated thread responsible for MemStore flush operations.
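The flush behaviour can be sketched in plain Python (not HBase code); the 64 MB threshold shrinks to a few bytes for the example:

```python
# Toy MemStore: buffer writes in memory, and flush them to an immutable
# snapshot (a stand-in for a StoreFile) once buffered size crosses a threshold.
FLUSH_THRESHOLD = 16  # bytes; stands in for the 64 MB default

memstore = {}
storefiles = []  # each flush produces one immutable "file"

def put(key, value):
    memstore[key] = value
    size = sum(len(k) + len(v) for k, v in memstore.items())
    if size >= FLUSH_THRESHOLD:
        storefiles.append(dict(memstore))  # snapshot the buffered data
        memstore.clear()                   # start a fresh buffer

put("row1", "aaaa")  # 8 bytes buffered, below threshold
put("row2", "bbbb")  # 16 bytes -> flush
print(len(storefiles), len(memstore))  # 1 0
```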
StoreFile
When the in-memory data of a MemStore is written to a file, that file is a StoreFile; StoreFiles are saved in HFile format.
HFile
HFile is the storage format for KeyValue data in HBase, a binary file format on Hadoop. An HFile has variable length; only two of its sections, Trailer and FileInfo, have fixed length. The Trailer holds pointers to the starting points of the other data blocks, and FileInfo records some meta-information about the file. The Data Block is the basic unit of HBase IO; to improve efficiency, the HRegionServer has an LRU-based block cache mechanism. The size of each Data Block can be specified as a parameter when creating a table (64 KB by default): large blocks favor sequential scans, while small blocks favor random queries. Apart from the Magic number at its start, each Data Block is a sequence of KeyValue pairs spliced together; the Magic content is a random number intended to detect data corruption.
The structure of an HFile is as follows:
The Data Block section stores the data in the table and can be compressed. The Meta Block section (optional) stores user-defined key-value metadata and can be compressed. The FileInfo section stores the meta-information of the HFile and cannot be compressed; users can also add their own meta-information here. The Data Block Index section holds the index of the Data Blocks, and the Meta Block Index section (optional) holds the index of the Meta Blocks. The Trailer is fixed in length and stores the offset of each section. When reading an HFile, the Trailer is read first (its Magic Number serves as a sanity check); it records the starting position of each section. The Data Block Index is then loaded into memory, so that retrieving a key does not require scanning the whole HFile: the block containing the key is found in memory, the entire block is read into memory with one disk IO, and the key is then located within it. The Data Block Index is evicted by an LRU mechanism. The Data Blocks and Meta Blocks of an HFile are usually stored compressed, which greatly reduces network IO and disk IO; the cost, of course, is CPU spent on compression and decompression. (Note: drawbacks of the Data Block Index: (a) it can consume a lot of memory; (b) it slows startup loading.)
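The benefit of the block index can be shown with a toy in-memory index in plain Python (not the HFile format itself): given the first key of each sorted block, one binary search picks the single block that needs to be read.

```python
import bisect

# First key of each data block, in sorted order -- the "block index".
block_first_keys = ["apple", "grape", "peach"]
blocks = [
    {"apple": 1, "banana": 2},  # block 0
    {"grape": 3, "lemon": 4},   # block 1
    {"peach": 5, "plum": 6},    # block 2
]

def get(key):
    # Find the last block whose first key is <= key; reading that one
    # block is the single "disk IO" the index saves us down to.
    i = bisect.bisect_right(block_first_keys, key) - 1
    if i < 0:
        return None  # key sorts before every block
    return blocks[i].get(key)

print(get("lemon"))  # 4: only block 1 is touched, no full-file scan
```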
HLog
HLog (WAL log): WAL stands for write-ahead log, and it is used for disaster recovery. The HLog records all changes to the data; if a Region Server goes down, the data can be recovered from the log.
LogFlusher
Periodically writes the buffered entries to the log file.
LogRoller
Manages and maintains (rolls) the log files.
IV. Steps for stand-alone deployment and installation of HBase
HBase needs to run in a Hadoop environment, so the prerequisite for installing HBase is that the Hadoop environment must be installed.
Hadoop version: hadoop-2.6.0-cdh6.7.0
Download address: http://archive.cloudera.com/cdh6/cdh/5/hadoop-2.6.0-cdh6.7.0.tar.gz
HBase version: hbase-1.2.0-cdh6.7.0
Download address: http://archive.cloudera.com/cdh6/cdh/5/hbase-1.2.0-cdh6.7.0.tar.gz
The installation steps for HBase are as follows:
Step 1: extract hbase-1.2.0-cdh6.7.0.tar.gz to the chosen directory (in this case, /home/hadoop/app/).
[hadoop@hadoop001 software]$ ll hbase-1.2.0-cdh6.7.0.tar.gz
-rw-rw-r--. 1 hadoop hadoop 156854981 Apr 1 2016 hbase-1.2.0-cdh6.7.0.tar.gz
[hadoop@hadoop001 software]$ tar -zxvf hbase-1.2.0-cdh6.7.0.tar.gz -C /home/hadoop/app/
[hadoop@hadoop001 software]$ cd ../app/
[hadoop@hadoop001 app]$ ll
drwxr-xr-x. 32 hadoop hadoop 4096 Sep 27 15:21 hbase-1.2.0-cdh6.7.0
Step 2: add the HBase environment variables to the ~/.bash_profile file.
[hadoop@hadoop001 app]$ vi ~/.bash_profile
export HBASE_HOME=/home/hadoop/app/hbase-1.2.0-cdh6.7.0
export PATH=$HBASE_HOME/bin:$PATH
Make the configuration file effective immediately:
[hadoop@hadoop001 app]$ source ~/.bash_profile
Step 3: modify conf/hbase-env.sh.
1) Remove the "#" before JAVA_HOME and set it to your own Java installation path:
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_45
2) Remove the "#" before HBASE_MANAGES_ZK and set its value to true (HBase then manages its own ZooKeeper, so there is no need to install ZooKeeper separately):
export HBASE_MANAGES_ZK=true
Step 4: open conf/hbase-site.xml and add the following.
hbase.rootdir must correspond to the fs.default.name value in the conf/core-site.xml file of the previously installed Hadoop:
fs.default.name is set to hdfs://hadoop001:9000/
hbase.rootdir is set to hdfs://hadoop001:9000/hbase
hbase.zookeeper.quorum defaults to localhost; it can also be set to hadoop001
hbase.tmp.dir is set to a previously created tmp directory, e.g. /usr/java/hbase/tmp
The code is as follows:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop001:9000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop001:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/home/hadoop/app/hbase-1.2.0-cdh6.7.0/hbase-tmp</value>
  </property>
</configuration>
Step 5. Start HBase (provided that Hadoop is already started).
[hadoop@hadoop001 bin]$ ./start-hbase.sh
starting master, logging to /home/hadoop/app/hbase-1.2.0-cdh6.7.0/logs/hbase-hadoop-master-hadoop001.out
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
[hadoop@hadoop001 bin]$
Step 6: after HBase starts successfully, open the URL http://hadoop001:60010 in a browser.
Thank you for reading this article carefully. I hope "What is HBase" has been helpful to you.