
What does Apache HBase mean?

2025-01-15 Update From: SLTechnology News&Howtos


This article introduces Apache HBase: what it is, where it fits, and the basic concepts behind its data model.

Introduction to Apache HBase

Apache HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive commodity hardware.

HBase is an open-source implementation of Google Bigtable, and the two are closely analogous: where Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS; where Google runs MapReduce to process the massive data in Bigtable, HBase uses Hadoop MapReduce; and where Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper.

HBase is a "NoSQL" database. "NoSQL" is a generic term for databases that are not RDBMSes, i.e. that do not support SQL as the primary means of access. There are many kinds of NoSQL databases: BerkeleyDB, for example, is a local NoSQL database, while HBase is a large distributed one. Technically, HBase is more of a "data store" than a "database", because it lacks many RDBMS features, such as typed columns, secondary indexes, triggers, and advanced query languages.

However, HBase has many features that support linear and modular scaling. An HBase cluster expands by adding RegionServers, which run on ordinary servers: if a cluster grows from 10 to 20 RegionServers, both its storage capacity and its processing capacity double. An RDBMS can also scale, but only up to a point, in particular up to the capacity of a single database server, and for the best performance it relies on specialized hardware and storage devices that HBase does not need.

HBase has the following characteristics:

Strongly consistent reads and writes: HBase is not an "eventually consistent" data store. This makes it well suited to tasks such as high-speed counter aggregation.

Automatic sharding: HBase tables are distributed across the cluster as regions. As data grows, regions split automatically and are redistributed.

Automatic RegionServer failover

Hadoop/HDFS integration: HBase supports HDFS as its distributed file system out of the box

MapReduce: HBase supports massively parallel processing through MapReduce

Java client API: HBase provides an easy-to-use Java API for programmatic access

Thrift/REST API: HBase also provides Thrift and REST gateways for non-Java front ends

Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume query optimization

Operations management: HBase exposes metrics via JMX and provides built-in web pages for operational insight.

1. Application scenarios of HBase

HBase is not suitable for all scenarios.

First, make sure there is enough data. If you have hundreds of millions or billions of rows, HBase is a good candidate. If you only have a few thousand or a few million rows, a traditional RDBMS may be a better choice, because all the data could be held on one or two nodes while the rest of the cluster sits idle.

Second, make sure you can do without all the extra features an RDBMS provides (for example, typed columns, secondary indexes, transactions, advanced query languages, and so on).

Third, make sure you have enough hardware: HDFS shows little of its advantage with fewer than five DataNodes.

Although HBase can run well on a single laptop, that should be treated as a development configuration only.

2. Advantages and disadvantages of HBase

Advantages of HBase:

Columns can be added dynamically, and empty columns store no data, which saves storage space.

HBase splits data automatically, so storage scales out horizontally.

HBase supports highly concurrent read and write operations

Integration with Hadoop MapReduce facilitates data analysis.

Fault tolerance

Free and open source

Very flexible schema design (no fixed schema is imposed)

Can be integrated with Hive for SQL-like queries

Automatic failover

Easy-to-use client interfaces

Row-level atomicity: a Put operation either fully succeeds or fully fails

Disadvantages of HBase:

Queries are supported only by row key; conditional queries on other columns are not supported.

Prone to a single point of failure (when only one HMaster is running)

Transactions are not supported

JOINs are not supported at the database layer and must be implemented with MapReduce

Indexing and sorting are available only on the row key, in lexicographic order

No built-in authentication or authorization

3. The difference between HBase and Hadoop/HDFS

HDFS is a distributed file system well suited to storing large files. Its documentation notes that it is not a general-purpose file system and does not provide fast lookups of individual records within files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups and updates for large tables. This can sometimes be a point of conceptual confusion. Internally, HBase puts data into indexed "StoreFiles" to provide high-speed queries, and those store files live on HDFS.

If you want to know more about HBase, it is recommended to read Lars George's "HBase: The Definitive Guide".

Basic concepts of Apache HBase

Data in HBase is stored in tables with rows and columns. These terms overlap with relational database (RDBMS) terminology, but conceptually they are not the same thing. It is easier to think of an HBase table as a multidimensional map.

1. Terminology

Table (table): an HBase table consists of multiple rows.

Row (row): each row represents one data object and consists of a row key and one or more columns. The row key uniquely identifies the data object; rows are sorted lexicographically by row key and stored in that order. Row key design is therefore very important, and one key principle is that related rows should be stored near each other. For example, if row keys are website domain names, the domains should be stored reversed (org.apache.www, org.apache.mail, org.apache.jira), so that all Apache-related domains sit close together in the table.
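The reversed-domain idea can be sketched in plain Java (an illustrative sketch, not HBase client code; the class and method names are invented here):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RowKeyDesign {
    // Reverse the dot-separated labels of a domain so related rows sort together.
    static String reverseDomain(String domain) {
        List<String> parts = Arrays.asList(domain.split("\\."));
        Collections.reverse(parts);
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        String[] domains = {"www.apache.org", "mail.apache.org", "jira.apache.org"};
        String[] keys = new String[domains.length];
        for (int i = 0; i < domains.length; i++) keys[i] = reverseDomain(domains[i]);
        Arrays.sort(keys); // lexicographic order, as HBase stores rows
        System.out.println(Arrays.toString(keys));
        // [org.apache.jira, org.apache.mail, org.apache.www]
    }
}
```

Because all the reversed keys share the prefix "org.apache.", they end up adjacent under lexicographic ordering, which is exactly how HBase stores rows.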

Column (column): a column consists of a column family and a column qualifier, separated by a colon (:), as in family:qualifier.

Column Family (column family): in HBase, a column family is a collection of columns. All column members of a family share the same prefix; for example, courses:history and courses:math are both members of the courses family. The colon (:) delimits the column family prefix from the column qualifier. The prefix must consist of printable characters, while the rest of the column name can be any byte array. Column families must be declared when the table is created, whereas columns can be created at any time. Physically, all members of a column family are stored together on the file system, and because storage optimizations are applied at the column-family level, all members of a family are accessed in the same way.

Column Qualifier (column qualifier): data within a column family is addressed by its column qualifier. Qualifiers have no data type and are stored as binary bytes. For example, a column family "content" might have qualifiers "content:html" and "content:pdf". While column families are fixed at table creation, qualifiers are mutable and may differ greatly from row to row.
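Since the family is everything before the first colon and the qualifier may itself contain further colons, parsing a full column name can be sketched like this (plain Java, with invented names):

```java
public class ColumnName {
    // Split a full column name into family and qualifier at the FIRST colon;
    // the qualifier itself may contain more colons.
    static String[] split(String column) {
        int i = column.indexOf(':');
        if (i < 0) return new String[] {column, ""};
        return new String[] {column.substring(0, i), column.substring(i + 1)};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(split("courses:history")));
        // [courses, history]
        System.out.println(java.util.Arrays.toString(split("content:html")));
        // [content, html]
    }
}
```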

Cell (cell): a cell is the combination of a row, a column family, and a column qualifier. It contains a value and a timestamp that identifies the value's version.

Timestamp (timestamp): each value carries a timestamp that identifies a specific version of the value. By default, the timestamp is the time when the data was written to the RegionServer, but a different timestamp can be specified when putting data into a cell.

2. Map

The core data structure of HBase/Bigtable is the map. Different programming languages have different names for it, such as associative array (PHP), dictionary (Python), Hash (Ruby), or Object (JavaScript).

To put it simply, a map is a collection of key-value pairs. Here is an example of expressing map in JSON format:

{"zzzzz": "woot", "xyz": "hello", "aaaab": "world", "1": "x", "aaaaa": "y"}

3. Distributed system

There is no doubt that HBase/Bigtable are built on distributed systems. HBase runs on the Hadoop Distributed File System (HDFS) or Amazon's Simple Storage Service (S3), while Bigtable uses the Google File System (GFS). One of the problems they must solve is keeping data synchronized across nodes; how that is achieved is not discussed here. HBase/Bigtable can be deployed on thousands of machines to spread the access load.

4. Sort

Unlike ordinary map implementations, the map in HBase/Bigtable is kept strictly sorted alphabetically by key. That is, the row with key "aaaab" is stored immediately next to the row with key "aaaaa", and far away from the row with key "zzzzz".

Or take the above JSON as an example, an example of ordering is as follows:

{"1": "x", "aaaaa": "y", "aaaab": "world", "xyz": "hello", "zzzzz": "woot"}

In a system with a large amount of data, sorting is very important: the row key strategy determines query performance. For example, when row keys are website domain names, the domains should be stored reversed (org.apache.www, org.apache.mail, org.apache.jira).
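This sorting behavior can be modeled with Java's TreeMap, which also keeps its keys in lexicographic order (an illustrative sketch; the class name is invented):

```java
import java.util.TreeMap;

public class SortedRows {
    // A TreeMap keeps keys in lexicographic order, just as HBase/Bigtable sorts row keys.
    static final TreeMap<String, String> rows = new TreeMap<>();
    static {
        rows.put("zzzzz", "woot");
        rows.put("xyz", "hello");
        rows.put("aaaab", "world");
        rows.put("1", "x");
        rows.put("aaaaa", "y");
    }

    public static void main(String[] args) {
        System.out.println(rows.keySet());           // [1, aaaaa, aaaab, xyz, zzzzz]
        System.out.println(rows.higherKey("aaaaa")); // aaaab: stored right next to "aaaaa"
    }
}
```

Insertion order does not matter; iteration always follows sorted key order, mirroring the sorted JSON example above.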

5. Multi-dimensional

A multidimensional map is a map nested within a map. For example:

{"1": {"A": "x", "B": "z"}, "aaaaa": {"A": "y", "B": "w"}, "aaaab": {"A": "world", "B": "ocean"}, "xyz": {"A": "hello", "B": "there"}, "zzzzz": {"A": "woot", "B": "1337"}}

6. Time version

If no timestamp is specified in a query, the most recent version of the value is returned. If a timestamp is given, the newest value whose timestamp is less than or equal to it is returned. For example, querying row/column "aaaaa"/"A:foo" returns "y"; querying row/column/timestamp "aaaaa"/"A:foo"/10 returns "m"; and querying "aaaaa"/"A:foo"/2 returns null.
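This lookup rule can be modeled with a NavigableMap whose floorEntry returns the newest version at or before a given timestamp (a plain-Java sketch of the concept, not HBase client code; all names are illustrative, and the versions match the "A:foo" example):

```java
import java.util.Map;
import java.util.TreeMap;

public class TimeVersions {
    // Versions of one cell, keyed by timestamp (ascending).
    static final TreeMap<Integer, String> versions = new TreeMap<>();
    static {
        versions.put(15, "y");
        versions.put(4, "m");
    }

    static String get(Integer timestamp) {
        if (timestamp == null) return versions.lastEntry().getValue(); // newest version
        // Newest value whose timestamp is <= the requested timestamp.
        Map.Entry<Integer, String> e = versions.floorEntry(timestamp);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        System.out.println(get(null)); // y
        System.out.println(get(10));   // m
        System.out.println(get(2));    // null
    }
}
```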

{
  // ...
  "aaaaa": {
    "A": {
      "foo": {15: "y", 4: "m"},
      "bar": {15: "d"}
    },
    "B": {
      "": {6: "w", 3: "o", 1: "w"}
    }
  },
  // ...
}

7. Conceptual view

The following table, named webtable, contains two rows (com.cnn.www and com.example.www) and three column families (contents, anchor, and people). In the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com and anchor:my.look.ca) and contents contains one column (contents:html). The row with row key com.cnn.www has 5 versions, and the row with row key com.example.www has 1 version. The contents:html qualifier holds the complete HTML of a given website, each qualifier of the anchor family holds a link to the site, and the people family represents people associated with the site.

Row Key           | Timestamp | ColumnFamily contents  | ColumnFamily anchor           | ColumnFamily people
"com.cnn.www"     | t9        |                        | anchor:cnnsi.com = "CNN"      |
"com.cnn.www"     | t8        |                        | anchor:my.look.ca = "CNN.com" |
"com.cnn.www"     | t6        | contents:html = "..."  |                               |
"com.cnn.www"     | t5        | contents:html = "..."  |                               |
"com.cnn.www"     | t3        | contents:html = "..."  |                               |
"com.example.www" | t5        | contents:html = "..."  |                               | people:author = "John Doe"

Cells shown as empty in this table take up no space, which is what makes HBase "sparse". In addition to the tabular view, the data can also be presented as a multidimensional map, as follows:

{
  "com.cnn.www": {
    contents: {
      t6: contents:html: "..."
      t5: contents:html: "..."
      t3: contents:html: "..."
    }
    anchor: {
      t9: anchor:cnnsi.com = "CNN"
      t8: anchor:my.look.ca = "CNN.com"
    }
    people: {}
  }
  "com.example.www": {
    contents: {
      t5: contents:html: "..."
    }
    anchor: {}
    people: {
      t5: people:author: "John Doe"
    }
  }
}

8. Physical view

Although in the conceptual view a table can be thought of as a sparse collection of rows, physically it is stored by column family. A new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.

The following table shows the anchor column family:

Row Key       | Timestamp | ColumnFamily anchor
"com.cnn.www" | t9        | anchor:cnnsi.com = "CNN"
"com.cnn.www" | t8        | anchor:my.look.ca = "CNN.com"

The following table shows the contents column family:

Row Key       | Timestamp | ColumnFamily contents
"com.cnn.www" | t6        | contents:html = "..."
"com.cnn.www" | t5        | contents:html = "..."
"com.cnn.www" | t3        | contents:html = "..."

It is worth noting that the blank cells in the conceptual view above are not physically stored, because there is no need to store them. So a request for the value of contents:html at timestamp t8 returns nothing; similarly, a request for anchor:my.look.ca at timestamp t9 also returns nothing. However, if no timestamp is specified, the most recent value of each column is returned. For example, a request for row "com.cnn.www" with no timestamp returns contents:html from t6, anchor:cnnsi.com from t9, and anchor:my.look.ca from t8.

9. Data model operation

The four main data model operations are Get, Put, Scan, and Delete. The operation is done through the Table instance. For the API of Table, see http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html.

Get

Get returns the properties of a specific row. Get is executed through Table.get. For the API of Get, see http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html.

Put

Put either adds a new row to the table (if the key is new) or updates the row (if the key already exists). Put is executed through Table.put (writeBuffer) or Table.batch (not writeBuffer). For the API of Put, see http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html.
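Put's insert-or-update behavior, combined with versioned cells, can be modeled in plain Java: a Put never overwrites a value in place, it simply adds a newer version, and a Get returns the latest one (a conceptual sketch, not the HBase client API; all names are invented):

```java
import java.util.TreeMap;

public class PutGetSketch {
    // Each row key maps to timestamped versions of a value.
    static final TreeMap<String, TreeMap<Long, String>> table = new TreeMap<>();

    // Put adds a newer version rather than overwriting in place.
    static void put(String rowKey, long ts, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(ts, value);
    }

    // Get returns the latest version, or null if the row does not exist.
    static String get(String rowKey) {
        TreeMap<Long, String> versions = table.get(rowKey);
        return (versions == null || versions.isEmpty()) ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        put("row1", 1L, "v1"); // new row key: behaves like an insert
        put("row1", 2L, "v2"); // existing row key: behaves like an update
        System.out.println(get("row1")); // v2
        System.out.println(get("row9")); // null
    }
}
```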

Scan

Scan allows iteration over specified attributes of multiple rows.

The following is an example of a Scan on a Table instance. Assume the table has rows with row keys "row1", "row2", and "row3", as well as rows with keys "abc1", "abc2", and "abc3". The example shows how a Scan instance can return only the rows whose keys start with "row".

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Table table = ...;  // instantiate a Table instance

Scan scan = new Scan();
scan.addColumn(CF, ATTR);
scan.setRowPrefixFilter(Bytes.toBytes("row"));
ResultScanner rs = table.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}

Note that the easiest way to specify a stop point for a scan is to use the InclusiveStopFilter class, whose API can be found at http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/InclusiveStopFilter.html.

Delete

Delete removes a row from a table. Delete is executed through Table.delete. For the API of Delete, see http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html.

HBase does not modify data in place. Instead, a delete writes a marker called a tombstone that masks the deleted values. Tombstones, along with the values they mask, are cleared during major compactions.
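The tombstone mechanism can be sketched in plain Java: a delete only writes a marker, reads treat the marker as "no value", and a major compaction physically drops the marker and everything it masks (a conceptual sketch; the marker representation and all names are invented):

```java
import java.util.Map;
import java.util.TreeMap;

public class TombstoneSketch {
    static final TreeMap<Long, String> cell = new TreeMap<>(); // timestamp -> value
    static final String TOMBSTONE = "\u0000TOMBSTONE";         // illustrative delete marker

    // A delete writes a marker rather than erasing data.
    static void delete(long ts) { cell.put(ts, TOMBSTONE); }

    // A read sees nothing if the newest entry is a tombstone.
    static String read() {
        Map.Entry<Long, String> latest = cell.lastEntry();
        return (latest == null || TOMBSTONE.equals(latest.getValue())) ? null : latest.getValue();
    }

    // A major compaction drops tombstones and the versions they mask.
    static void majorCompact() {
        Long t = null;
        for (Map.Entry<Long, String> e : cell.entrySet())
            if (TOMBSTONE.equals(e.getValue())) t = e.getKey();
        if (t != null) cell.headMap(t, true).clear();
    }

    public static void main(String[] args) {
        cell.put(1L, "v1");
        delete(2L);
        System.out.println(read());      // null: the tombstone masks v1
        majorCompact();
        System.out.println(cell.size()); // 0: tombstone and masked value are gone
    }
}
```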

Thank you for reading. I hope this article on the meaning of Apache HBase has been helpful.
