What is the principle and construction of HBase distributed database

2025-04-28 Update From: SLTechnology News&Howtos (Shulou.com)

This article explains in detail the principles of the HBase distributed database and how to build an HBase cluster. It is shared here for reference; after reading it you should have a working understanding of the topic.

1. Introduction

I. Overview of HBase

HBase sits in the structured storage layer of the Hadoop ecosystem, builds a distributed, column-oriented database on top of HDFS, and uses Zookeeper as a coordination service.

HDFS provides highly reliable underlying storage for HBase.

MapReduce provides high-performance computing power for HBase.

Zookeeper provides a stable coordination service and a failure recovery mechanism for HBase.

II. How HBase processes data

Although HDFS, the Hadoop distributed file system, offers high fault tolerance and high throughput, its high latency makes it unsuitable for real-time computing. HBase is a distributed database that can provide real-time access, and its data is saved on HDFS. HDFS guarantees high fault tolerance, but in a production environment, how does HBase provide real-time performance on top of Hadoop?

HBase data is stored in HDFS blocks in the form of StoreFiles (HFiles), which are binary streams. HDFS does not know what HBase is storing; it simply treats the stored files as binary files. In other words, the data stored by HBase is opaque to the HDFS file system.

III. HBase and HDFS

In the following table, we compare HDFS with HBase:

| HDFS | HBase |
| --- | --- |
| HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS. |
| HDFS does not support fast lookup of individual records. | HBase provides fast lookup in large tables. |
| HDFS provides high-latency batch processing. | HBase provides low-latency access to a single row among billions of records (random access). |
| Data in HDFS can only be accessed sequentially. | HBase internally uses hash tables and provides random access; it stores indexes, so data in HDFS files can be found quickly. |

2. HBase data model

I. HBase data structure

The columns in the example below are grouped into two column families, userInfo followed by addressInfo; each row also carries a timestamp.

| rowkey | name | age | sex | password | country | province | city | email | timestamp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | zhengshuang | 29 | 0 | 123456 | PRC | Liaoning | Shenyang | 11@1.com | 154590701 |
| 2 | lixiaolu | 28 | 0 | 123456 | PRC | Shanghai | Shanghai | 22@2.com | 154590702 |
| 3 | luozhixiang | 31 | 1 | 123456 | PRC | Taiwan | Taipei | 33@3.com | 154590703 |

Table (Table)

In HBase, data is stored in tables. The table name is a string, and a table consists of rows and columns. Unlike a table in a relational database, an HBase table is a multidimensional map.

Row (Row)

A row in HBase consists of a row key and one or more columns. Row keys have no data type and are always treated as a byte array (byte[]). Row keys are similar to primary key indexes in relational databases and are unique within an HBase table, but unlike in an RDBMS, rows are kept sorted by row key in lexicographic (byte) order. For example, if the table already contains the three row keys 1000001, 1000002, and 1000004, and you insert a row with key 1000003, the new row is not placed last but between 1000002 and 1000004. Row key design is therefore very important, and this property can be used to keep related data together. For example, if you use a website's domain name as the row key prefix, you should reverse the domain name (org.apache.www, org.apache.mail, org.apache.jira) so that all Apache domains are stored next to each other in the table instead of being scattered.
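The ordering rule above can be illustrated with a minimal Python sketch. It uses only a sorted list of byte strings to stand in for a table, so it is an illustration of lexicographic ordering rather than HBase code; the helper name `reversed_domain_key` and the sample domains are assumptions made for the example.

```python
import bisect

# Row keys are byte arrays; this sorted list stands in for an HBase table.
row_keys = [b"1000001", b"1000002", b"1000004"]

# Inserting 1000003 lands between 1000002 and 1000004, not at the end.
bisect.insort(row_keys, b"1000003")

# Ordering is by bytes, not by numeric value, so unpadded numbers sort oddly.
unpadded = sorted([b"9", b"10", b"100"])  # [b"10", b"100", b"9"]


def reversed_domain_key(domain: str) -> bytes:
    """Hypothetical helper: build a row key by reversing a domain name."""
    return ".".join(reversed(domain.split("."))).encode("utf-8")


# Sample domains (assumed for illustration only).
domains = [
    "www.apache.org",
    "news.example.com",
    "mail.apache.org",
    "example.com",
    "jira.apache.org",
]

# With plain domain names, news.example.com sorts between the Apache hosts.
sorted_plain = sorted(d.encode("utf-8") for d in domains)

# With reversed names, every org.apache.* key is adjacent.
sorted_reversed = sorted(reversed_domain_key(d) for d in domains)

for key in sorted_reversed:
    print(key.decode("utf-8"))
```

Fixed-width keys such as 1000001 avoid the unpadded-number problem shown in `unpadded`, which is one reason zero-padded or fixed-length row keys are a common choice.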

Column family (Column Family)

An HBase column family consists of multiple columns; it is effectively a way of grouping columns. There is no limit on the number of columns, and a column family can contain millions of them. Every row in the table has the same column families. Column families must be specified when the table is created and cannot easily be modified afterwards, and it is generally recommended to keep their number to no more than three. Column family names are strings.

Column qualifier (Qualifier)

Column qualifiers are the names of columns in an HBase table and are used to locate data within a column family. The format is family:qualifier, which denotes the column qualifier within the column family family.

Cell (Cell)

A cell is located by row key, column family, and qualifier. The cell contains a value and a timestamp. The value has no data type and is always treated as byte[]; the timestamp represents the version of the value and is of type long.

II. HBase data model

Because an HBase table is a multidimensional map, its rows and columns are arranged differently from those of a traditional RDBMS. A traditional RDBMS must store NULL for values that do not exist, whereas in HBase missing values are simply omitted and take up no storage space. In addition, HBase only requires the table name and column families when a table is created; columns do not need to be declared. All columns are added dynamically when data is written, whereas in an RDBMS the columns are fixed by the schema and adding or changing them requires altering the table.
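To make the "multidimensional map" idea concrete, here is a minimal in-memory sketch in Python. It is not how HBase stores data internally; the class name `SparseTable` and the `nickname` qualifier are hypothetical, chosen only to show that column families are fixed at creation, qualifiers appear on write, and missing cells occupy no space.

```python
class SparseTable:
    """Toy model: row key -> (family, qualifier) -> timestamp -> value."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # fixed when the table is created
        self.rows = {}

    def put(self, row, column, value, ts):
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are created on first write; nothing is pre-declared.
        versions = self.rows.setdefault(row, {}).setdefault((family, qualifier), {})
        versions[ts] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it was never written."""
        family, qualifier = column.split(":", 1)
        versions = self.rows.get(row, {}).get((family, qualifier), {})
        return versions[max(versions)] if versions else None


table = SparseTable("user", ["userInfo", "addressInfo"])
table.put(b"1", "userInfo:name", b"zhengshuang", 154590701)
table.put(b"1", "addressInfo:city", b"Shenyang", 154590701)
table.put(b"2", "userInfo:name", b"lixiaolu", 154590702)
# A new qualifier (hypothetical) is added dynamically, for this row only.
table.put(b"2", "userInfo:nickname", b"lulu", 154590703)

print(table.get(b"1", "addressInfo:city"))   # b'Shenyang'
print(table.get(b"2", "addressInfo:city"))   # None: nothing stored, no NULL
```

Row 2 holds only two cells, so the absent city value is not represented at all, which is the sparse behaviour the paragraph above describes.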

3. Cluster Building

    [daniel@hadoop103 software]$ tar -zxvf hbase-1.2.1-bin.tar.gz -C /opt/moudle/

    [daniel@hadoop103 hbase-1.2.1]$ vim conf/hbase-env.sh
    export JAVA_HOME=/usr/java/jdk1.8.0_211-amd64
    # Use an external Zookeeper cluster
    export HBASE_MANAGES_ZK=false

    [daniel@hadoop103 hbase-1.2.1]$ vim conf/hbase-site.xml
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop101:9000/hbase</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop101:2181,hadoop102:2181,hadoop103:2181</value>
      </property>
      <property>
        <name>hbase.master.info.port</name>
        <value>16010</value>
      </property>
    </configuration>

    [daniel@hadoop103 hbase-1.2.1]$ vim conf/regionservers
    hadoop104
    hadoop105

    [daniel@hadoop101 hbase-1.2.1]$ vim conf/backup-masters
    hadoop105

    [daniel@hadoop103 hbase-1.2.1]$ scp -r /opt/moudle/hbase-1.2.1/ daniel@hadoop104:/opt/moudle/
    [daniel@hadoop103 hbase-1.2.1]$ scp -r /opt/moudle/hbase-1.2.1/ daniel@hadoop105:/opt/moudle/

4. Common shell operations

Enter the HBase shell

    [daniel@hadoop103 hbase-1.2.1]$ bin/hbase shell

Create a user table that contains two column families, base_info and extra_info

    hbase(main):001:0> create 'user', 'base_info', 'extra_info'
    0 row(s) in 1.7180 seconds

    => Hbase::Table - user

    ## Another way to write it
    hbase(main):002:0> create 'user2', {NAME => 'base_info', VERSIONS => '3'}, {NAME => 'extra_info'}
    0 row(s) in 1.2440 seconds

    => Hbase::Table - user2

Add data operation

    hbase(main):001:0> put 'user', 'rk0001', 'base_info:name', 'zhangsan'
    0 row(s) in 0.2610 seconds
    hbase(main):002:0> put 'user', 'rk0001', 'base_info:gender', 'female'
    0 row(s) in 0.0280 seconds
    hbase(main):003:0> put 'user', 'rk0001', 'base_info:age', 20
    0 row(s) in 0.0180 seconds
    hbase(main):004:0> put 'user', 'rk0001', 'extra_info:address', 'beijing'
    0 row(s) in 0.0410 seconds

Data query

    hbase(main):005:0> get 'user', 'rk0001'
    COLUMN                 CELL
     base_info:age         timestamp=1612262295591, value=20
     base_info:gender      timestamp=1612262285794, value=female
     base_info:name        timestamp=1612262272928, value=zhangsan
     extra_info:address    timestamp=1612262304269, value=beijing
    4 row(s) in 0.0840 seconds

View the data of one column family for a row key

    hbase(main):006:0> get 'user', 'rk0001', 'base_info'
    COLUMN                 CELL
     base_info:age         timestamp=1612262295591, value=20
     base_info:gender      timestamp=1612262285794, value=female
     base_info:name        timestamp=1612262272928, value=zhangsan
    3 row(s) in 0.0240 seconds

Get the name and age columns of the base_info column family for row key rk0001 in the user table

    hbase(main):007:0> get 'user', 'rk0001', 'base_info:name', 'base_info:age'
    COLUMN                 CELL
     base_info:age         timestamp=1612262295591, value=20
     base_info:name        timestamp=1612262272928, value=zhangsan
    2 row(s) in 0.0200 seconds

Get the base_info and extra_info column families for row key rk0001 in the user table

    hbase(main):008:0> get 'user', 'rk0001', 'base_info', 'extra_info'
    COLUMN                 CELL
     base_info:age         timestamp=1612262295591, value=20
     base_info:gender      timestamp=1612262285794, value=female
     base_info:name        timestamp=1612262272928, value=zhangsan
     extra_info:address    timestamp=1612262304269, value=beijing
    4 row(s) in 0.0180 seconds

    hbase(main):009:0> get 'user', 'rk0001', {COLUMN => ['base_info', 'extra_info']}
    COLUMN                 CELL
     base_info:age         timestamp=1612262295591, value=20
     base_info:gender      timestamp=1612262285794, value=female
     base_info:name        timestamp=1612262272928, value=zhangsan
     extra_info:address    timestamp=1612262304269, value=beijing
    4 row(s) in 0.0210 seconds

    hbase(main):010:0> get 'user', 'rk0001', {COLUMN => ['base_info:name', 'extra_info:address']}
    COLUMN                 CELL
     base_info:name        timestamp=1612262272928, value=zhangsan
     extra_info:address    timestamp=1612262304269, value=beijing
    2 row(s) in 0.0200 seconds

Query by row key and column value

    hbase(main):011:0> get 'user', 'rk0001', {FILTER => "ValueFilter(=, 'binary:zhangsan')"}
    COLUMN                 CELL
     base_info:name        timestamp=1612262272928, value=zhangsan
    1 row(s) in 0.0750 seconds

Get the columns of row key rk0001 in the user table whose column qualifier contains the letter a

    hbase(main):002:0> get 'user', 'rk0001', {FILTER => "(QualifierFilter(=, 'substring:a'))"}
    COLUMN                 CELL
     base_info:age         timestamp=1612262295591, value=20
     base_info:name        timestamp=1612262272928, value=zhangsan
     extra_info:address    timestamp=1612262304269, value=beijing
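The effect of the two filtered get commands above can be mimicked in a few lines of Python. This is only a sketch of the matching logic over a plain dictionary holding the cells of rk0001; the function names are hypothetical, and the real filters run on the server side. The substring match is written case-insensitively, which is how HBase's substring comparator behaves.

```python
# Cells of row rk0001 as written by the put commands earlier in this section.
row = {
    "base_info:age": "20",
    "base_info:gender": "female",
    "base_info:name": "zhangsan",
    "extra_info:address": "beijing",
}


def value_equals(cells, expected):
    """Sketch of ValueFilter(=, 'binary:...'): keep cells whose value matches exactly."""
    return {column: value for column, value in cells.items() if value == expected}


def qualifier_contains(cells, needle):
    """Sketch of QualifierFilter(=, 'substring:...'): match on the qualifier only."""
    needle = needle.lower()
    return {
        column: value
        for column, value in cells.items()
        if needle in column.split(":", 1)[1].lower()
    }


by_value = value_equals(row, "zhangsan")
by_qualifier = qualifier_contains(row, "a")

print(sorted(by_value))       # ['base_info:name']
print(sorted(by_qualifier))   # ['base_info:age', 'base_info:name', 'extra_info:address']
```

The qualifier gender contains no a, which is why it is missing from the second result, just as in the shell output.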

Continue to insert a batch of data

    hbase(main):003:0> put 'user', 'rk0002', 'base_info:name', 'fanbingbing'
    0 row(s) in 0.1320 seconds
    hbase(main):004:0> put 'user', 'rk0002', 'base_info:gender', 'female'
    0 row(s) in 0.0270 seconds
    hbase(main):005:0> put 'user', 'rk0002', 'base_info:birthday', '2000-06-06'
    0 row(s) in 0.0160 seconds
    hbase(main):006:0> put 'user', 'rk0002', 'extra_info:address', 'shanghai'
    0 row(s) in 0.0150 seconds

Query all the information in the user table

    hbase(main):007:0> scan 'user'

Column family query

    ## A scan can optionally enable raw mode. In raw mode, the results also
    ## include data that has been marked for deletion but not yet physically removed.
    ## VERSIONS specifies the maximum number of versions to return.
    hbase(main):008:0> scan 'user', {COLUMNS => 'base_info'}
    hbase(main):009:0> scan 'user', {COLUMNS => 'base_info', RAW => true, VERSIONS => 5}

Multi-column family query

    hbase(main):023:0> scan 'user', {COLUMNS => ['base_info', 'extra_info']}

Query by column family and column name

    hbase(main):024:0> scan 'user', {COLUMNS => ['base_info:name', 'extra_info:address']}

5. Cluster architecture

The HBase architecture follows a master/slave model and is made up of three kinds of nodes: HMaster nodes, HRegionServer nodes, and a Zookeeper cluster. The HMaster node is the master and the HRegionServer nodes are the slaves. This model is similar to the NameNode and DataNode of HDFS.

The HBase client communicates with the HMaster node and the HRegionServer nodes through RPC, and the HMaster node connects to Zookeeper to obtain the status of the HRegionServer nodes and manage them.

Because HBase stores its underlying data in HDFS, NameNode and DataNode nodes are also involved. An HRegionServer is often placed on the same node as an HDFS DataNode, which favors local data access and saves network transfer time.

I. HMaster

A cluster is not limited to a single HMaster. Users can start multiple HMaster nodes, and a Zookeeper election ensures that only one HMaster node is active while the rest remain on standby.

The main functions of HMaster are:

HMaster does not store any HBase data. It manages the HRegionServer nodes and decides which HRegions each HRegionServer node manages, in order to achieve load balancing.

When an HRegionServer goes down, HMaster migrates its HRegions to other HRegionServers.

It manages table-level operations on user tables (creating, deleting, modifying, and querying tables).

It manages table metadata (the metadata mainly holds the mapping between the unique identifier of each HRegion and its HRegionServer).

It handles access control.

II. HRegion and HRegionServer

HBase uses the rowkey to automatically split a table horizontally into multiple regions called HRegions, each consisting of a contiguous range of rows from the table. At first, a table has only one HRegion. As the data grows, the HRegion is split at a row boundary into two HRegions of roughly equal size. The HMaster node then assigns the HRegions to different HRegionServer nodes, which manage them and respond to client read and write requests. Arranged in rowkey order, all the HRegions distributed across the cluster make up the table.

Each HRegion records its start row key (startkey) and end row key (endkey). The startkey of the first HRegion and the endkey of the last HRegion are empty. With these key ranges, a client can quickly locate the HRegion in which a given rowkey resides by consulting the region metadata (the hbase:meta table, whose location is kept in Zookeeper).
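Locating a region from its key range amounts to a binary search over the sorted start keys. The following Python sketch shows the idea under simplifying assumptions: the region boundaries (rk0300, rk0600) and the assignment of regions to hadoop104 and hadoop105 are invented for illustration, and a real client reads this information from region metadata rather than from a local list.

```python
import bisect

# (startkey, server) pairs sorted by startkey. The first startkey is empty,
# and the endkey of each region is the startkey of the next one.
# Boundaries and assignments here are hypothetical.
regions = [
    (b"", "hadoop104"),
    (b"rk0300", "hadoop105"),
    (b"rk0600", "hadoop104"),
]
start_keys = [start for start, _ in regions]


def locate(row_key: bytes):
    """Return (startkey, endkey, server) of the region holding row_key.

    A region covers startkey (inclusive) up to endkey (exclusive);
    an empty endkey means "until the end of the table".
    """
    index = bisect.bisect_right(start_keys, row_key) - 1
    start, server = regions[index]
    end = start_keys[index + 1] if index + 1 < len(regions) else b""
    return start, end, server


print(locate(b"rk0001"))  # (b'', b'rk0300', 'hadoop104')
print(locate(b"rk0300"))  # (b'rk0300', b'rk0600', 'hadoop105')
print(locate(b"rk9999"))  # (b'rk0600', b'', 'hadoop104')
```

Because the regions are sorted and non-overlapping, the lookup cost grows with the logarithm of the number of regions rather than with the size of the table.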

III. Store

A Store holds the data of one column family of an HBase table. Because a table is split horizontally into multiple HRegions, each HRegion contains one or more Stores, one per column family. A Store contains one MemStore and multiple HFiles.

The MemStore acts as an in-memory buffer: data is held in the MemStore before being written to disk.

When the MemStore reaches a certain size, an HFile is generated and the MemStore's data is moved into it. A StoreFile is a wrapper around an HFile. HFile is the underlying storage format of HBase, and the data is ultimately stored in HDFS in HFile format.

Note that an HFile only contains the data that was in the MemStore at the moment it was flushed, so a complete row of data may be spread across multiple HFiles.
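The flush behaviour described above can be sketched with a small Python class. This is a toy model, not HBase's implementation: the class name `StoreSketch` is hypothetical, the flush threshold counts cells instead of bytes, and HFiles are represented as immutable sorted tuples held in memory.

```python
class StoreSketch:
    """Toy Store: one MemStore plus a list of immutable 'HFiles'."""

    def __init__(self, flush_threshold=2):
        # Assumption: the threshold is a cell count; HBase uses a size in bytes.
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.hfiles = []  # oldest first

    def put(self, row, qualifier, value):
        self.memstore[(row, qualifier)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the current MemStore contents out as one sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, row, qualifier):
        """Check the MemStore first, then HFiles from newest to oldest."""
        key = (row, qualifier)
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for cell_key, value in hfile:
                if cell_key == key:
                    return value
        return None


store = StoreSketch(flush_threshold=2)
store.put("rk0001", "name", "zhangsan")
store.put("rk0001", "gender", "female")   # threshold reached: first flush
store.put("rk0001", "age", "20")
store.flush()                              # second flush

files_holding_row = sum(
    any(row == "rk0001" for (row, _), _ in hfile) for hfile in store.hfiles
)
print(len(store.hfiles), files_holding_row)
```

After the two flushes, the cells of rk0001 live in two separate files, so reading the whole row means consulting both, which is the situation the note above warns about.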

IV. HLog

HLog is the log file of HBase and records data update operations. As in an RDBMS, in order to guarantee data consistency and make recovery possible, HBase performs a WAL (write-ahead log) operation when writing data. The update is first written to the HLog and then to the MemStore of the Store; the write is considered successful only when both writes have succeeded.

Because the MemStore lives in memory, data is not written to HDFS until a certain amount has accumulated. If the server goes down before the data is written to HDFS and the contents of the MemStore are lost, the data can still be recovered from the HLog.

HLog is stored in HDFS, so it remains available even if the server crashes.
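The write-ahead ordering and the recovery path can be sketched as follows. This is a simplified Python illustration: the class name `RegionServerSketch` is hypothetical, and a plain list stands in for the HLog file that HBase keeps in HDFS.

```python
class RegionServerSketch:
    """Toy write path: append to the log first, then update the MemStore."""

    def __init__(self, hlog):
        self.hlog = hlog      # stands in for the HLog stored in HDFS (survives a crash)
        self.memstore = {}    # in memory only (lost on a crash)

    def put(self, key, value):
        self.hlog.append((key, value))   # step 1: write-ahead log
        self.memstore[key] = value       # step 2: MemStore

    @classmethod
    def recover(cls, hlog):
        """Rebuild the MemStore of a crashed server by replaying the log in order."""
        server = cls(hlog)
        for key, value in hlog:
            server.memstore[key] = value
        return server


hlog = []
server = RegionServerSketch(hlog)
server.put("rk0001/base_info:name", "zhangsan")
server.put("rk0001/base_info:age", "20")

# Simulated crash: the server object and its MemStore are gone, the log is not.
del server

recovered = RegionServerSketch.recover(hlog)
print(recovered.memstore)
```

Replaying the entries in their original order matters: if the same key was written twice, the later value must win, which the sequential loop in `recover` guarantees.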

V. Zookeeper

Each HRegionServer node registers its own ephemeral (temporary) node in Zookeeper. HMaster discovers the available HRegionServer nodes through these ephemeral nodes and uses them to track HRegionServer failures.

HBase uses Zookeeper to ensure that only one active HMaster is running.

Which HRegionServer node an HRegion should be assigned to is also determined with the help of Zookeeper.
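The role of ephemeral nodes can be illustrated with a small in-memory stand-in written in Python. It does not use the real Zookeeper API; the class `CoordinatorSketch`, the session names, and the node paths are assumptions made for the example. The point it shows is that a node disappears when its owning session ends, which gives both failure detection and single-active-master election.

```python
class CoordinatorSketch:
    """In-memory stand-in for ephemeral nodes (not the real Zookeeper API)."""

    def __init__(self):
        self.nodes = {}  # path -> id of the session that owns the node

    def create_ephemeral(self, path, session):
        """Create the node if it does not exist; return True on success."""
        if path in self.nodes:
            return False
        self.nodes[path] = session
        return True

    def expire_session(self, session):
        """When a session ends, all of its ephemeral nodes disappear."""
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}

    def children(self, prefix):
        return sorted(p for p in self.nodes if p.startswith(prefix + "/"))


zk = CoordinatorSketch()

# Each region server registers itself (paths are illustrative).
zk.create_ephemeral("/hbase/rs/hadoop104", "session-rs-104")
zk.create_ephemeral("/hbase/rs/hadoop105", "session-rs-105")

# Master election: whoever creates the node first is the active master.
first_is_active = zk.create_ephemeral("/hbase/master", "session-master-1")
second_is_active = zk.create_ephemeral("/hbase/master", "session-master-2")

# A region server fails; the master sees its node vanish.
zk.expire_session("session-rs-104")
live_servers = zk.children("/hbase/rs")

# The active master fails; the standby can now take over.
zk.expire_session("session-master-1")
standby_takes_over = zk.create_ephemeral("/hbase/master", "session-master-2")

print(first_is_active, second_is_active, live_servers, standby_takes_over)
```

In a real deployment the standby master would be notified through a watch instead of retrying by hand; the sketch leaves watches out to stay short.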

That concludes this overview of the principles of the HBase distributed database and how to build a cluster. I hope the content above is of some help to you.
