What is Hbase?

2025-03-29 Update From: SLTechnology News&Howtos

Shulou(Shulou.com) 06/01 Report--

This article introduces "what is HBase": what it is, why it exists, and how it is put together. Many people run into these questions in practice, so let the editor walk you through them. I hope you read it carefully and get something out of it!

1. What is HBase?

HBase is modeled on Google's BigTable paper and inspired by its ideas. It is now developed and maintained as a top-level Apache project in the Hadoop ecosystem, supporting structured data storage.

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. Using HBase, large-scale structured storage clusters can be built on cheap PC servers.

The goal of HBase is to store and process big data; more specifically, to handle tables of billions of rows and millions of columns using only ordinary hardware. [Don't use it unless your data really is that big.]

HBase is an open-source implementation of Google Bigtable, but with many differences. For example: Google Bigtable uses GFS as its file storage system, while HBase uses Hadoop HDFS; Google processes massive data in Bigtable with MapReduce, and HBase likewise uses Hadoop MapReduce; Google Bigtable uses Chubby as its coordination service, while HBase uses Zookeeper.

Simple, rough summary: HBase is a NoSQL database with column-oriented storage, used to store and process large amounts of data.

At its core it is a place to store data. But having already learned about HDFS and MySQL, why does HBase exist at all?

2. Why is there HBase?

Let's talk about MySQL first. MySQL is a relational database that is developed against and used very widely. The core table of a website or system is usually the user table, and when the user table reaches tens of millions or even hundreds of millions of rows, retrieving a single record can take seconds or even minutes. The real situation can be even more complicated.

Look at the following table:

If we query the user name for id=1, it is very simple: it returns zhangsan. But think about what happens during that lookup: are age and email read as well? The answer is yes. MySQL stores data row by row; it is row-oriented storage. So the problem arises: I only need zhangsan's name, but a whole row of data is read. If there are many columns, you can imagine the query efficiency.
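The difference can be seen in a toy model. The table data, function names, and in-memory layouts below are invented for illustration; this is a sketch of the row-store vs. column-store idea, not how MySQL or HBase actually lay out bytes on disk.

```python
# Hypothetical user table, stored two ways.
rows = [
    {"id": 1, "name": "zhangsan", "age": 18, "email": "zs@example.com"},
    {"id": 2, "name": "lisi", "age": 20, "email": "ls@example.com"},
]

def name_by_id_row_store(table, target_id):
    # Row-oriented: records are stored whole, so reading one field
    # still touches the entire record.
    for record in table:
        if record["id"] == target_id:
            return record["name"]  # age and email were read anyway

# Column-oriented: each column lives in its own sequence.
columns = {
    "id": [1, 2],
    "name": ["zhangsan", "lisi"],
    "age": [18, 20],
    "email": ["zs@example.com", "ls@example.com"],
}

def name_by_id_column_store(cols, target_id):
    idx = cols["id"].index(target_id)  # only "id" and "name" are touched
    return cols["name"][idx]

print(name_by_id_row_store(rows, 1))      # zhangsan
print(name_by_id_column_store(columns, 1))  # zhangsan
```

Both return the same answer; the difference is how much unrelated data each one had to read to get there.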

The speed of a query is constrained by two factors:

Tables are concurrently inserted into, updated, and deleted from.

Queries are usually not simple single-table operations but complex joins across multiple tables, possibly with group by or order by. The performance degradation then becomes obvious.

If a table has too many columns, query efficiency suffers. We call such a table a wide table. How to optimize it? Take it apart: split it vertically:

Now, when we want to find username, we only need to look at the user_basic table; there are no extra fields, and the query is fast. If a table has too many rows, query efficiency also suffers. We call such a table a tall table, and we can improve efficiency by splitting it horizontally:

A typical use of this kind of horizontal splitting is a log table: a large amount of log data is generated every day, and it can be split horizontally by month, so the tall table becomes shorter.
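Monthly splitting amounts to routing each record to a per-month partition by its timestamp. A minimal sketch (table names, field names, and sample logs are all invented):

```python
from collections import defaultdict

logs = [
    {"ts": "2024-01-03", "msg": "login"},
    {"ts": "2024-01-15", "msg": "click"},
    {"ts": "2024-02-02", "msg": "logout"},
]

partitions = defaultdict(list)
for entry in logs:
    month = entry["ts"][:7]                   # e.g. "2024-01"
    partitions[f"log_{month}"].append(entry)  # route to a monthly table

# A query for January now scans a much shorter table.
print(sorted(partitions))              # ['log_2024-01', 'log_2024-02']
print(len(partitions["log_2024-01"]))  # 2
```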

OK, this splitting approach seems to solve the wide-table and tall-table problems. But what if the company's business changes one day? For example, WeChat did not exist before, and now it does, so you need to add a WeChat field for users. What do you do when the table's structure has to change? The simplest idea is to add one more column, like this:

A little thought shows this is not ideal. For example, some early users have no WeChat, so for that column you must weigh whether to set a default value or take other measures. If you need to extend many columns, and not all users have these attributes, the extension gets even messier.

At this point you think of a JSON-formatted string: an object in string form (a bundle of optional attributes) whose fields can be extended dynamically. That gives the following approach; compare the two:
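The JSON-column trick can be sketched in a few lines (column and attribute names are invented for illustration):

```python
import json

# One real "extra" column holds a JSON string of optional attributes,
# so new attributes need no schema change.
user = {"id": 1, "name": "zhangsan",
        "extra": json.dumps({"wechat": "zs_wx"})}  # dynamic part

extra = json.loads(user["extra"])
print(extra.get("wechat"))  # zs_wx
print(extra.get("weibo"))   # None: absent attributes cost nothing
```

Adding a "weibo" attribute later is just another key in the JSON, not an ALTER TABLE.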

OK, storing data this way works, so where does HBase come in? The thing about MySQL is that once the data passes a certain threshold (usually around 5 million rows), no amount of optimization yields high performance. And data in the big data field is often at the PB level, which this kind of storage clearly cannot handle. HBase solves the problems above nicely.

3. How HBase is implemented

Setting aside the why, recall the problems raised above: tall tables, wide tables, and dynamic column extension; and the solutions mentioned: horizontal splitting, vertical splitting, and the column-extension trick, mixed together.

Suppose there is a table that threatens to become both wide and tall and needs dynamically extended columns. Then at design time, split the table apart and store the dynamic part directly in JSON format for column extension:

That solves the wide-table problem. What about the tall table? Split the table into parts, each stored separately:

Having solved tall tables, wide tables, and dynamically extended columns, what if we need to push performance further? MySQL → Redis! Cache!

Queried data is put into the cache, and the next query takes it straight from the cache. What about inserts? Same idea: I put the data to be inserted into the cache and stop worrying about it; the database pulls the data from the cache and inserts it itself. The program no longer waits for the insert to succeed, which improves parallel efficiency.

However, there is a big risk: if the server goes down before the cached data is inserted into the database, the data is lost. Borrowing from Redis's persistence strategy, you can add an operation log for inserts, persist each insert operation to the log, and recover from the log after a crash and restart.

So the design architecture looks like this:

The above solution is actually the general idea of the implementation of HBase, and the details will be discussed later.

Simple, rough summary: HBase is a non-relational database with column-oriented storage. The main differences from MySQL are:

HBase stores its data on HDFS. HDFS has high fault tolerance, is designed to run on low-cost hardware, and provides high-throughput access to application data. Being based on Hadoop means HBase is born with enormous scalability and throughput.

HBase adopts a key/value storage model, so even as data volume grows, query performance hardly declines. HBase is also a column-store database: when a table has many fields, some of them can be placed on one set of machines and the rest on another, spreading the load. The cost of this complex storage structure and distributed storage is that even tiny amounts of data are not read especially quickly.

So HBase is not fast in absolute terms; rather, it is not noticeably slow when the data volume is huge. When should you use HBase? Mainly in two situations:

A single table holds more than 10 million rows and concurrency is very high.

The need for data analysis is weak, or does not require much real-time capability or flexibility.

4. Introduction to HBase

Official website: http://hbase.apache.org

The prototype of HBase is Google's BigTable paper, which is inspired by the idea of this paper. At present, it is developed and maintained as a top-level project of Apache to support structured data storage.

Characteristics of HBase

Mass storage

HBase is suited to storing PB-scale data and, on cheap PC storage, can return results in tens to hundreds of milliseconds even at that scale. This is closely tied to HBase's extensibility: it is precisely HBase's good scalability that makes massive-data storage convenient.

Column storage

Column storage here really means column-family storage: HBase stores data by column family. A column family can contain many columns, and the column families must be specified when the table is created.

Easy scalability

The scalability of HBase shows in two aspects: one is the upper processing layer (RegionServer), the other is storage (HDFS).

By adding RegionServer machines horizontally, we can scale out, increase HBase's upper-layer processing capacity, and let HBase serve more Regions.

Note: a RegionServer's role is to manage Regions and handle business access; this is described in detail later. Storage-layer capacity is expanded by adding DataNodes horizontally, which improves HBase's data storage capacity and the read/write capability of the backend storage.

High concurrency

Since most HBase deployments run on cheap PCs, single-IO latency is not small, typically tens to hundreds of ms. "High concurrency" here mainly means that under concurrent load, HBase's single-IO latency does not rise by much, so it can provide high-concurrency, low-latency service.

Sparse

Sparseness refers mainly to the flexibility of HBase columns. Within a column family you can specify any number of columns, and when a column's value is empty it takes up no storage space.
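Sparseness is easy to picture if each row stores only the cells it actually has. A sketch with invented rows and column names (not HBase's real on-disk format):

```python
# Each row is a map of "family:qualifier" -> value.
# A column that is empty for a row simply has no entry at all.
table = {
    "row1": {"info:name": "zhangsan", "info:wechat": "zs_wx"},
    "row2": {"info:name": "lisi"},  # no wechat cell exists for row2
}

print("info:wechat" in table["row2"])               # False
print(sum(len(cells) for cells in table.values()))  # 3 cells stored, not 4
```

Contrast this with a relational row, where the empty wechat column would still occupy a slot (NULL or default) in every record.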

HBase logical structure

First, let's get a general sense of HBase's storage from a logical point of view:

Table (Table):

A table consists of one or more column families. Attributes of the data, such as name, age, and TTL (time-to-live), are all defined on column families. A table with only its column families defined is an empty table; it has no data until rows are added.

Column Family (column families):

In HBase, multiple columns can be combined into a column family. There is no need to create columns when creating a table, because columns can be added or removed very flexibly; the only thing that must be fixed is the column families, i.e., how many column families a table has is decided at the start. Many table attributes, such as data expiration, block caching, and whether to use compression, are defined on the column family, not on the table or the column. This is very different from relational databases. The point of column families is that HBase tries to keep the columns of one column family on the same machine, so if you want several columns stored together, just give them the same column family.

Row:

A row contains multiple columns, which are grouped by column family. The column families that a row's data belongs to must be chosen from those defined on the table; you cannot use a column family that does not exist. Because HBase is a column-oriented database, the data of one row may be distributed across different servers.

RowKey (row key):

The rowkey is much simpler than a MySQL primary key. Every row must have one, and it is entirely a unique, non-repeating string specified by the user. Rowkeys are sorted in dictionary order, and one rowkey corresponds to one row of data!
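Dictionary order has a practical consequence: byte strings sort differently from numbers, which is why numeric rowkeys are usually zero-padded. A small demonstration (the rowkey values are invented):

```python
# Rowkeys compare as strings, not as numbers.
rowkeys = ["row1", "row10", "row2"]
print(sorted(rowkeys))   # ['row1', 'row10', 'row2'] — 10 sorts before 2!

# Zero-padding restores the numeric order under dictionary sorting.
padded = [f"row{n:03d}" for n in (1, 10, 2)]
print(sorted(padded))    # ['row001', 'row002', 'row010']
```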

Region:

A Region is a collection of data. Recall the tall-table idea of splitting horizontally: if a table is split into two parts, two Regions are formed. Note several properties of Region:

A Region cannot span servers, and one RegionServer can host multiple Regions.

When the amount of data is small, a Region can store all the data; but when the amount of data is large, HBase will split the Region.

When HBase is doing load balancing, it is also possible to move Region from one RegionServer to another server's RegionServer.

Region storage is based on HDFS; all of its data access operations go through the HDFS client.
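The split-by-rowkey-range idea above means a read or write is routed to exactly one Region by comparing its rowkey against the Regions' start keys. A toy lookup (the three region boundaries are invented; real boundaries come from hbase:meta):

```python
import bisect

# Start keys of three regions: [""-"g"), ["g"-"p"), ["p"-end).
region_starts = ["", "g", "p"]

def region_for(rowkey):
    # Pick the rightmost region whose start key is <= rowkey.
    return bisect.bisect_right(region_starts, rowkey) - 1

print(region_for("apple"))  # 0
print(region_for("grape"))  # 1
print(region_for("zebra"))  # 2
```

Because rowkeys are kept in dictionary order, this range lookup is a simple binary search.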

RegionServer:

A RegionServer is the container in which Regions live; intuitively, it is a service process on a server, responsible for managing and maintaining Regions.

HBase physical storage

The above is the basic logical structure; the underlying physical storage structure is the real heart of the matter. See the figure below, which explains the concepts above from a different angle:

NameSpace:

Namespaces are similar to the Database concept in relational databases; each namespace contains multiple tables. HBase has two built-in namespaces, hbase and default: hbase holds HBase's internal tables, and default is the default namespace for user tables.

Row:

Each row of data in the HBase table consists of a RowKey and multiple Column (columns). The data is stored in the dictionary order of RowKey, and the data can only be retrieved according to RowKey when querying data, so the design of RowKey is very important.

Column:

Each column in HBase is qualified by Column Family (column family) and Column Qualifier (column qualifier), such as info:name,info:age. When you create a table, you only need to specify the column family, and the column qualifier does not need to be pre-defined.

TimeStamp:

A timestamp identifies different versions of the same data. If no timestamp is specified when a piece of data is written, the system automatically fills it in with the write time. When reading, by default only the version with the latest timestamp is returned. HBase also stores a Type with each cell to mark whether the data is still valid (for example, a delete marker), because HBase sits on HDFS, and HDFS supports create, append, and read but not in-place modification.

Cell:

A cell is uniquely determined by {rowkey, column family:column qualifier, timestamp}. Data in a cell has no type; everything is stored as byte arrays.
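The cell-addressing and versioning rules can be sketched with a plain dictionary keyed by that triple (the rows, timestamps, and helper function are invented; not HBase's real storage format):

```python
# Cells keyed by (rowkey, "family:qualifier", timestamp); values are bytes.
cells = {
    ("row1", "info:name", 100): b"zhangsan",
    ("row1", "info:name", 200): b"zhangsan2",  # a newer version
}

def get_latest(store, rowkey, column):
    # A plain read returns the version with the largest timestamp.
    versions = [(ts, v) for (r, c, ts), v in store.items()
                if r == rowkey and c == column]
    return max(versions)[1] if versions else None

print(get_latest(cells, "row1", "info:name"))  # b'zhangsan2'
```

Note that the older version is still present in the store; "updating" a value is really writing a new cell with a newer timestamp.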

Comparison between HBase and Relational Database

The table composition of a traditional relational database is as follows:

Each row is indivisible, which reflects the atomicity of the database's first normal form: the three columns must stay together, stored on the same server, even in the same file.

The table frame of HBase is as shown in the figure:

Each row of HBase is discrete. Because of column families, different columns of one row may even be assigned to different servers. The concept of a row is reduced to an abstraction: physically, the columns of a row are tied together only by sharing the same rowkey, which is the only remaining embodiment of the row concept in HBase.

An HBase write statement must specify exactly which cell the data goes into, a cell being determined by table, rowkey, column family, and column qualifier. In plain terms, you spell out which column of which column family of which table the data is stored in. If a row has 10 columns, you need 10 put statements to store that row.

Analysis of HBase Architecture

1. Macroscopic diagram

From this diagram you can see that an HBase cluster consists of one Master (which can also be configured as multiple for HA) and multiple RegionServers; a detailed description of the RegionServer architecture follows. The figure shows the server roles of HBase; details below:

Master:

At startup, the Master is responsible for assigning Regions to specific RegionServers and for various management operations, such as Region splitting and merging. In HBase the Master's role is much weaker than in other kinds of clusters: data reads and writes do not involve it, and even if it dies the cluster keeps running (the reasons are detailed later). But the Master cannot stay down for long, because many necessary operations depend on it: creating tables, modifying column-family configuration (mainly DDL), and, more importantly, Region splitting and merging.

RegionServer:

A RegionServer is a machine hosting multiple Regions. The data we read and write lives in Regions.

Region:

A Region is one piece of a split table. HBase is an auto-sharding database: when a table grows too tall, it is split.

HDFS:

The data storage of HBase is based on HDFS, which is the real carrier of data.

Zookeeper:

Zookeeper is responsible for storing the location of hbase:meta for the cluster; a client must read this metadata before reading or writing data.

2. RegionServer

As seen in the last RegionServer of the macro architecture diagram, its interior is a collection of multiple Regions:

Now let's zoom in on the internal architecture of the RegionServer:

WAL:

It plays the same role as the edits file (edit log) of the HDFS NameNode.

WAL is the abbreviation of Write-Ahead Log, i.e., a log written ahead of the data. You can guess its role from the name: when an operation reaches a Region, HBase first writes the operation to the WAL, then puts the data into the memory-based MemStore, and after some time flushes the data into an HFile stored on HDFS. The WAL is an insurance mechanism: data is written to the WAL before the MemStore, so that if an accident happens during writing, the data can be recovered from the WAL.
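The WAL → MemStore → HFile flow can be sketched as a toy simulation. Everything here (class name, fields, the list standing in for a file on HDFS) is invented for illustration; it shows the mechanism, not HBase's actual code:

```python
class ToyRegion:
    def __init__(self):
        self.wal = []        # durable log (stands in for a file on HDFS)
        self.memstore = {}   # in-memory buffer, sorted only at flush time
        self.hfiles = []     # immutable flushed files

    def put(self, rowkey, value):
        self.wal.append(("put", rowkey, value))  # 1. write-ahead log first
        self.memstore[rowkey] = value            # 2. then memory

    def flush(self):
        # 3. persist the MemStore as one sorted, immutable HFile
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def recover(self, wal):
        # After a crash, replay the WAL to rebuild the MemStore.
        for op, rowkey, value in wal:
            if op == "put":
                self.memstore[rowkey] = value

region = ToyRegion()
region.put("row2", "b")
region.put("row1", "a")

crashed_wal = region.wal   # the "crash" wipes MemStore, but the WAL survives
fresh = ToyRegion()
fresh.recover(crashed_wal)
print(fresh.memstore)      # {'row2': 'b', 'row1': 'a'} — nothing lost
```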

Multiple Region:

Region has been mentioned many times: it is one piece of a table's data, and each Region has a start rowkey and an end rowkey representing the range of rows it stores.

Let's zoom in on the internal structure of Region:

From the figure you can see that a Region contains multiple Stores: one Store corresponds to one column family, and the figure shows three column families. Zooming into the last Store, a Store is made up of a MemStore and HFiles.

3. WAL

The write-ahead log exists to solve operation recovery after a crash; the WAL is a persistent file on HDFS. When data reaches a Region it is written to the WAL and then loaded into the MemStore. That way, even if the Region goes down before persisting its operations, they can be loaded from the WAL and replayed on restart.

How do I enable WAL?

WAL is enabled by default. You can turn it off manually, which makes inserts, deletes, and updates faster, but at the expense of data safety, so turning it off is not recommended.

Turn off method:

Mutation.setDurability(Durability.SKIP_WAL)

Write to WAL asynchronously

If you are reluctant to buy performance by shutting WAL off entirely, consider a compromise: write to the WAL asynchronously.

Normally, when put, delete, or append operations submitted by a client reach a Region, the HDFS client is called to write to the WAL first; even a single change triggers an HDFS write. As you can imagine, this protects the data as much as possible, at the cost of frequent resource consumption.

If you don't want to turn WAL off, but also don't want to pay the cost of calling the HDFS client for every single change, you can choose to write the WAL asynchronously:

Mutation.setDurability(Durability.ASYNC_WAL)

With this set, the Region waits until a condition is met before writing accumulated operations to the WAL. The condition is a time interval: operations are written in batches, and the default interval is 1s.

What if something goes wrong during asynchronous writing? Suppose a client operation is sitting in Region memory and, because the 1-second interval has not elapsed, it has not yet been written to the WAL. If the Region fails at that moment, up to 1 second of operations is lost; there is no guarantee against that.

WAL rolling

Having learned about MapReduce's shuffle mechanism before, you might guess that the WAL is a rolling log data structure, because such a structure avoids unbounded growth in space consumption and is the most efficient to write.

By switching (rolling) WAL files, HBase avoids producing a single oversized WAL file, which makes subsequent log cleanup easier (expired log files can simply be deleted). And if logs are needed for recovery, multiple small log files can be parsed in parallel, reducing recovery time.

The WAL check interval is defined by hbase.regionserver.logroll.period; the default is one hour. The check compares the operations in the current WAL with those actually persisted on HDFS to see which have been persisted; WAL files whose operations have all been persisted are moved to the .oldlogs folder on HDFS.

A WAL instance contains multiple WAL files. The maximum number of WAL files can be configured manually through a parameter.

Other conditions that trigger rolling:

The size of the WAL exceeds a certain threshold.

The HDFS block holding the WAL file is almost full.
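The size-threshold trigger can be illustrated with a small sketch. The 20-byte threshold and the record format are invented for illustration; in real HBase the limit is tied to the HDFS block size:

```python
MAX_WAL_BYTES = 20  # invented threshold, just for the demo

active_wal, closed_wals, size = [], [], 0
for i in range(10):
    record = f"edit-{i}".encode()            # each record is 6 bytes
    if size + len(record) > MAX_WAL_BYTES:   # would exceed: roll first
        closed_wals.append(active_wal)       # close the current log file
        active_wal, size = [], 0             # start a fresh one
    active_wal.append(record)
    size += len(record)

print(len(closed_wals))  # 3 rolled logs, later candidates for .oldlogs
```

Each closed log stays small and immutable, which is what makes the later archive/delete steps cheap.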

WAL archiving and deletion

Archiving: all WAL files are created under /hbase/.logs. When a WAL file is deemed archived, it is moved to /hbase/.oldlogs.

Deletion: HBase determines whether a WAL file is no longer needed, i.e., whether no other reference still points to it.

Services that may reference the file:

TTL process: this process ensures a WAL file survives until the timeout defined by hbase.master.logcleaner.ttl is reached (default 10 minutes).

Backup (replication) mechanism: if HBase's replication feature is enabled, HBase deletes a WAL file only once it is sure the backup cluster no longer needs it. Replication here is not the number of file copies but a feature added in version 0.90 that backs up data from one cluster to another in real time. If you only have one cluster, you can ignore this factor.

Only when a WAL file is referenced by neither of the above is it finally removed by the system.

4. Store

After explaining WAL, zoom in on the internal architecture of Store:

There are two important parts of Store:

MemStore:

Each Store has one MemStore instance. After data is written to the WAL it is put into the MemStore. The MemStore is an in-memory storage object; data stays there until a trigger condition is met and it is flushed to an HFile.

HFile:

A Store contains multiple HFiles; each flush produces a new HFile on HDFS. HFile interacts directly with HDFS and is the actual carrier of the stored data.

Here is a question:

When a client operation reaches a Region, the data is first written to the WAL, and the WAL lives on HDFS. That means the data is already persisted. Why then load it from the WAL into the MemStore and later flush it into an HFile on HDFS?

Put simply: the data was already persisted before reaching the HFile, so why pass through the MemStore?

Because HDFS supports creating, appending, and deleting files, but not modifying them, and for a database the ordering of data matters a great deal. The initial WAL persistence guarantees safety but is unordered; reading the data into the MemStore is for sorting before storage. So the point of the MemStore is to keep data in rowkey dictionary order, not to act as a cache for write efficiency.
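The ordering role of the MemStore can be sketched in a few lines: writes arrive in arbitrary order, but the flushed file is sorted by rowkey, so lookups in it can binary-search instead of scanning (rowkeys and values invented for illustration):

```python
import bisect

memstore = {}
for rowkey, value in [("row3", "c"), ("row1", "a"), ("row2", "b")]:
    memstore[rowkey] = value           # arrival order is arbitrary

hfile = sorted(memstore.items())       # flush emits keys in rowkey order
keys = [k for k, _ in hfile]
print(keys)                            # ['row1', 'row2', 'row3']

i = bisect.bisect_left(keys, "row2")   # O(log n) lookup in the sorted file
print(hfile[i][1])                     # b
```

An append-only WAL could never support this: once written, its records cannot be reordered, which is exactly why sorting has to happen in memory before the flush.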

For comparison, a figure on MemStore flushing:

Analysis of HBase Architecture Diagram

From the figure you can see that HBase is composed of components such as Client, Zookeeper, Master, HRegionServer, and HDFS. For DML operations the client does not need the HMaster at all: it only needs to obtain the necessary metadata address from ZK, then read and write data directly against the RegionServers. The functions of the components:

Client

The Client contains interfaces for accessing HBase, and it also maintains a cache to speed up access, for example cached hbase:meta metadata.

Zookeeper

HBase uses Zookeeper for Master high availability, RegionServer monitoring, the metadata entry point, and cluster configuration maintenance. Concretely:

Zookeeper ensures that only one Master runs in the cluster; if the Master fails, a new Master is elected through a competition mechanism to keep providing service.

Zookeeper monitors RegionServer state and notifies the Master via callbacks when a RegionServer comes online, goes offline, or fails.

Zookeeper stores a unified entry address for the metadata.

HMaster

The main responsibilities of the master node are as follows:

Assign Region to RegionServer

Maintain the load balancing of the whole cluster

Maintain the metadata information of the cluster

Discover the failed Region and assign the failed Region to the normal RegionServer

Coordinate the splitting of the HLog of a failed RegionServer

HRegionServer

HRegionServer directly handles users' read and write requests and is the real working node. Its functions are summarized as follows:

Manage the Region assigned to it by master

Handle read and write requests from the client

Responsible for interacting with the underlying HDFS and storing data to HDFS

Responsible for splitting a Region when it grows too large

Responsible for compacting StoreFiles

HDFS

HDFS provides the underlying data storage service for HBase. HBase's data is stored on HDFS in HFile format (analogous to Hadoop's underlying storage format), and the HLog stored on HDFS gives HBase high availability. Specific features:

Provide underlying distributed storage services for metadata and table data

Multiple copies of data to ensure high reliability and high availability

That's all for "what HBase is". Thank you for reading!
