This article introduces the principles and basic architecture of HBase data storage. The material is detailed but easy to follow, and the operations it describes are simple to try out, so it should have some reference value. Let's take a look.
First: HBase introduction
HBase is a distributed, column-oriented storage system built on top of HDFS.
HBase is an important member of the Apache Hadoop ecosystem and is mainly used for storing massive amounts of structured data.
Logically, HBase organizes data by table, row, and column.
HBase features:
1. Large: a single table can have billions of rows and millions of columns.
2. Schemaless: each row has a sortable row key and an arbitrary number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns.
3. Column-oriented: storage and permission control are organized by column (family), and column families are retrieved independently.
4. Sparse: null columns take up no storage space, so tables can be designed to be extremely sparse.
5. Multi-versioned: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was written.
6. Single data type: all data in HBase is stored as untyped byte strings; the client decides how to interpret the bytes (see the sketch below).
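As a minimal client sketch of points 2, 5, and 6, assuming an HBase 2.x Java client and a pre-existing table named user with a column family info (both names are hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureSketch {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Schemaless: columns are created simply by writing them.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("nickname"), Bytes.toBytes("al")); // ad-hoc column
            table.put(put);

            // Multi-versioned: ask for up to 3 timestamped versions of each cell.
            Get get = new Get(Bytes.toBytes("row-001")).readVersions(3);
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name)); // untyped bytes: the client interprets them
        }
    }
}
```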
Comparison of HBase and HDFS:
1. Both have good fault tolerance and scalability, and both can scale out to hundreds of nodes.
2. HDFS is suited to batch scenarios: it does not support random data lookups, is not well suited to incremental data processing, and does not support updating data in place. These are exactly the gaps that HBase fills.
Row storage and column storage:
Traditional row-oriented databases:
1. Data is stored row by row.
2. Queries without an index cause large amounts of I/O.
3. Building indexes and materialized views takes considerable time and resources.
4. To satisfy query-heavy workloads, the database must be scaled up substantially.
Column-oriented databases:
1. Data is stored by column, with each column stored separately.
2. The data itself serves as the index.
3. Only the columns involved in a query are accessed, which greatly reduces system I/O.
4. Each column can be processed by its own thread, so queries are handled concurrently.
5. Within a column, data types are consistent and data characteristics are similar, which allows efficient compression.
Second: HBase data model
HBase follows the Google Bigtable model: it is, at heart, a sorted key/value system.
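Concretely, every value lives at a four-part coordinate of (row key, column family, column qualifier, timestamp). A hedged sketch that dumps these coordinates with the HBase 2.x client (the class and helper names are assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class CellDump {
    // Prints every cell as (row, family:qualifier, timestamp) -> value.
    static void dumpCells(Table table) throws IOException {
        try (ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result row : scanner) {
                for (Cell cell : row.rawCells()) {
                    System.out.printf("(%s, %s:%s, %d) -> %s%n",
                            Bytes.toString(CellUtil.cloneRow(cell)),
                            Bytes.toString(CellUtil.cloneFamily(cell)),
                            Bytes.toString(CellUtil.cloneQualifier(cell)),
                            cell.getTimestamp(),
                            Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }
}
```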
Third: HBase physical model
Each column family is stored in separate files on HDFS.
The row key and version number are duplicated in each column family's files.
Null values are not stored.
For example, a table with an Info column family and a Roles column family is physically stored as two separate sets of files, one per family.
Physical storage of data:
1. All rows in a table are sorted lexicographically by row key.
2. A table is split in the row direction into multiple Regions.
3. Regions are split by size: each table starts with a single Region, and as data grows and a Region reaches a configured threshold, it is split into two new Regions, so the number of Regions keeps increasing (Regions can also be created up front by pre-splitting; see the sketch after this list).
4. The Region is the smallest unit of distribution and load balancing in HBase; different Regions are spread across different RegionServers.
5. Although the Region is the smallest unit of distribution, it is not the smallest unit of storage:
1) a Region consists of one or more Stores, and each Store holds one column family;
2) each Store consists of one MemStore and zero or more StoreFiles;
3) the MemStore lives in memory, while StoreFiles are persisted on HDFS.
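A hedged sketch of pre-splitting at table-creation time with the HBase 2.x Admin API (the table name events, family d, and split points are all hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("events"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();
            // Pre-split into three Regions at row-key boundaries "g" and "p",
            // instead of starting with a single Region and waiting for splits.
            byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("p") };
            admin.createTable(desc, splitKeys);
        }
    }
}
```

Pre-splitting avoids the initial single-Region hotspot described in point 3, at the cost of having to choose sensible split keys up front.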
Fourth: HBase basic architecture
HBase architecture:
In a distributed production environment, HBase runs on top of HDFS, using HDFS as its underlying storage facility. An HBase cluster is composed mainly of a Master and RegionServers, coordinated by ZooKeeper.
HBase components:
Client:
Provides the interfaces for accessing HBase and maintains a cache to speed up access to HBase
ZooKeeper:
Ensures that there is only one active Master in the cluster at any time
Stores the addressing entry point for all Regions
Monitors RegionServers going online and offline in real time and notifies the Master
Stores HBase's schema and table metadata
ZooKeeper's role:
HBase depends on ZooKeeper
By default, HBase manages the ZooKeeper instances itself, e.g. starting and stopping them
The Master and the RegionServers register with ZooKeeper when they start
The introduction of ZooKeeper means the Master is no longer a single point of failure
Master:
Assigns Regions to RegionServers
Is responsible for load balancing across RegionServers
Detects failed RegionServers and reassigns the Regions that were on them
Manages users' create, delete, alter, and query operations on tables
RegionServer:
Maintains Regions and handles I/O requests to those Regions
Splits Regions that grow too large during operation
-ROOT- table and .META. table:
-ROOT- table:
Records the list of Regions of the .META. table; the -ROOT- table only ever has one Region
The location of the -ROOT- table is recorded in ZooKeeper
.META. table:
Records the list of all user-space Regions, together with the addresses of the RegionServers that serve them
Detailed explanation:
1. All Region metadata in HBase is stored in the .META. table. As the number of Regions grows, the data in the .META. table also grows and splits into multiple new Regions. To locate each Region of the .META. table, the metadata of all the .META. Regions is kept in the -ROOT- table, and finally the location of the -ROOT- table is recorded in ZooKeeper. Before a client can access user data, it first contacts ZooKeeper to get the location of -ROOT-, then reads the -ROOT- table to find the relevant .META. Region, and finally uses the information in the .META. table to determine where the user data lives.
2. The -ROOT- table is never split; it has exactly one Region, which guarantees that any Region can be located in at most three hops. To speed up access, all Regions of the .META. table are kept in memory. The client caches the location information it looks up, and the cache does not expire proactively. If the client cannot reach the data using its cached information, it asks the relevant .META. Region server for the data's location; if that also fails, it asks the -ROOT- table where the relevant .META. Region is; and if everything before that fails, it relocates the Region information through ZooKeeper. So if all of the client's caches are invalid, up to six round trips are needed to locate the correct Region.
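Newer HBase clients resolve locations against a single hbase:meta table rather than the -ROOT-/.META. chain described above, but the client-side location cache behaves the same way. A hedged sketch using the public RegionLocator API (the table name events is hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class WhereIsMyRow {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("events"))) {
            // The client resolves and caches the Region location for a row key;
            // repeated lookups are served from the client-side cache.
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("row-042"));
            System.out.println("Region: " + loc.getRegion().getRegionNameAsString());
            System.out.println("Server: " + loc.getServerName());
        }
    }
}
```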
High availability
The Write-Ahead Log (WAL) ensures the durability and availability of data.
To understand high availability you first have to understand the role of the HLog. The HLog mechanism in HBase is an implementation of a WAL, and a WAL is a common way of implementing consistency in transactional mechanisms. Each RegionServer has one HLog instance. A RegionServer first records every update operation (Put, Delete, and so on) in the WAL (that is, the HLog), and only then writes it into the MemStore of the relevant Store. When a MemStore eventually reaches a certain threshold it is written out to an HFile, which is what guarantees the reliability of HBase writes. Without the WAL, if a RegionServer crashed before the MemStore was written to an HFile, i.e. before a StoreFile was saved, the data would be lost. (One might ask what happens if an HFile itself is lost; that is guaranteed by HDFS, which by default keeps 3 replicas of the data.)
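The client API exposes this WAL trade-off per mutation. A hedged sketch (the Table handle, family d, and qualifier q are assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class WalDurability {
    static void writeWithWal(Table table) throws IOException {
        Put safe = new Put(Bytes.toBytes("row-1"));
        safe.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        safe.setDurability(Durability.SYNC_WAL);  // sync the HLog before acknowledging the write
        table.put(safe);

        Put risky = new Put(Bytes.toBytes("row-2"));
        risky.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        risky.setDurability(Durability.SKIP_WAL); // faster, but lost if the RegionServer crashes before flush
        table.put(risky);
    }
}
```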
An HFile is made up of many data blocks (Blocks) and ends with a fixed trailer block. Each data block consists of a header and a number of Key-Value pairs. The trailer block contains index information for the data, and it is through this trailing index that the system locates data within the HFile.
High availability of components
Master fault tolerance: ZooKeeper elects a new Master
While there is no Master, data reads proceed as usual.
While there is no Master, Region splitting and load balancing cannot be performed.
RegionServer fault tolerance:
Each RegionServer reports a periodic heartbeat to ZooKeeper; if no heartbeat arrives within a given period, the Master reassigns the Regions on that RegionServer to other RegionServers
The write-ahead log on the failed server is split up by the Master and distributed to the new RegionServers for replay
ZooKeeper fault tolerance: ZooKeeper itself is a reliable service
There are usually 3 to 5 ZooKeeper instances (the connection sketch below shows how a client reaches the cluster through the quorum).
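Because clients bootstrap through the ZooKeeper ensemble rather than the Master, any live quorum member is enough to reach the cluster. A hedged connection sketch (the host names are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class QuorumConnect {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Clients bootstrap through the ZooKeeper ensemble, not the Master.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```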
Read and write process
Write operation:
1) The client, coordinated through ZooKeeper, sends a write request to the RegionServer, and the data is written in a Region.
2) The data is first recorded in the HLog and then written to the MemStore of the Store, until the MemStore reaches a predefined threshold.
3) The data in the MemStore is then flushed into a StoreFile.
4) As StoreFiles accumulate and their number reaches a certain threshold, a Compact (merge) operation is triggered: multiple StoreFiles are merged into a single StoreFile, and version merging and data deletion are carried out at the same time.
5) Through repeated Compact operations, larger and larger StoreFiles are gradually formed.
6) When the size of a single StoreFile exceeds a certain threshold, a Split operation is triggered: the current Region is split into two new Regions, the parent Region is taken offline, and the two new child Regions are assigned by the HMaster to the appropriate RegionServers, so that the load of the original Region is spread across two Regions.
From this write path it can be seen that HBase's updates and deletions are actually carried out later, during compaction, so a user's write can return as soon as it reaches memory (and the WAL), which is what gives HBase its high I/O performance.
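On the client side, this write path is usually exercised in batches. A hedged sketch using BufferedMutator (HBase 2.x client; the table events and family d are hypothetical):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkWrite {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("events"))) {
            List<Put> batch = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {
                Put put = new Put(Bytes.toBytes(String.format("row-%05d", i)));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(i));
                batch.add(put);
            }
            mutator.mutate(batch);  // buffered client-side and sent in batches
            mutator.flush();        // force the remaining buffer out
        }
    }
}
```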
Read operation:
1) The client contacts ZooKeeper, looks up the -ROOT- table, and obtains the location of the .META. table.
2) From the .META. table it finds the Region that holds the target data, and thus the corresponding RegionServer.
3) The required data is then fetched through that RegionServer.
4) The memory of a RegionServer is divided into two parts: the MemStore and the BlockCache. The MemStore is mainly used for writing data, the BlockCache mainly for reading. A read request first checks the MemStore; if the data is not found there, it checks the BlockCache; if it is still not found, the data is read from the StoreFiles, and the result is placed in the BlockCache.
Read path: client -> ZooKeeper -> -ROOT- table -> .META. table -> RegionServer -> Region -> client
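A hedged read sketch exercising this path: a range scan with client-side caching hints (HBase 2.x client; the table name and row-key range are hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadPath {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("row-00000"))
                    .withStopRow(Bytes.toBytes("row-00100"))
                    .setCaching(100)          // rows fetched per RPC
                    .setCacheBlocks(false);   // don't populate the BlockCache for a one-off scan
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```

setCacheBlocks(false) tells the RegionServer not to fill the BlockCache for this scan, which is the usual choice for one-off bulk reads that would otherwise evict hot data.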
This concludes the article on the principles and basic architecture of HBase data. Thank you for reading; hopefully you now have a working understanding of how HBase stores and serves its data.