A basic introduction to HBase

2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

1. Brief introduction to HBase:

HBase is the open source implementation of Google's Bigtable, written in Java. It is the database of Apache Hadoop, built on top of HDFS, and is designed as a highly reliable, high-performance, column-oriented, scalable, multi-version NoSQL distributed storage system that serves real-time, random read and write requests over large data sets. It also makes up for two shortcomings of Hive: high latency and the lack of row-level inserts, deletes, and updates.

HBase relies on HDFS for underlying data storage

HBase relies on MapReduce for data computation

HBase relies on ZooKeeper for service coordination

2. The design ideas of HBase:

- Column-oriented: a distributed database that supports near-real-time queries.

- Index: HBase row keys are sorted in lexicographic (dictionary) order.

- Query: the lookup mechanism is implemented with the index plus Bloom filters.
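As a rough illustration of the Bloom filter part of that query mechanism, here is a toy sketch in plain Python. It is not HBase's implementation: the class name `ToyBloomFilter`, the bit-array size, and the hashing scheme are all assumptions chosen for brevity. The point is only that such a filter answers "definitely absent" or "possibly present", which lets a store skip files that cannot contain a given row key.

```python
import hashlib


class ToyBloomFilter:
    """Toy Bloom filter (illustrative only, not HBase's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: bytes):
        # Derive num_hashes positions by salting the key with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = ToyBloomFilter()
for rowkey in (b"user#0001", b"user#0002", b"user#0003"):
    bloom.add(rowkey)

print(bloom.might_contain(b"user#0002"))  # True: added keys are never missed
```

A key that was never added will usually return `False`, but false positives are possible, so a `True` answer still requires checking the actual data.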

3. Characteristics of HBase:

- It sits between NoSQL and RDBMS: data can be retrieved only by the primary key (row key) or by a range of primary keys.

- Its query capabilities are very simple. It is still a key-value database and does not support complex operations such as joins.

- It does not support complex transactions, only row-level transactions (complex operations such as multi-table joins can be done through Hive).

- It is mainly used to store structured and semi-structured, loosely organized data.

- Schemaless: each row has a sortable primary key and any number of columns. Columns can be added dynamically as needed, and different rows in the same table can have very different columns.

4. Characteristics of tables in HBase:

- Large: a table can hold a billion rows and millions of columns.

- Column-oriented: storage and access control are organized by column family, and each column family is retrieved independently (which improves query performance).

- Sparse: null columns take up no storage space, so tables can be designed to be very sparse.

- No strict schema: each row has a sortable primary key and as many columns as needed. Columns can be added dynamically, and different rows in the same table can have very different columns. (HBase does not check the type or format of values when reading or writing; it stores them as bytes.)
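The sparse, schema-free behavior described in this list can be pictured with a toy model in plain Python. This is only a sketch, not how HBase stores data on disk; the row keys, the `family:column` names, and the helper `get_cell` are made up for illustration.

```python
# Toy model: each row is a dict holding only the columns it actually has.
table = {
    "row1": {"info:name": "alice", "info:email": "alice@example.com"},
    "row2": {"info:name": "bob", "stats:logins": "42"},
}

# Adding a new column to one row needs no schema change.
table["row1"]["stats:last_seen"] = "2025-07-01"


def get_cell(table, rowkey, column):
    """Return the value, or None when the row has no such column."""
    return table.get(rowkey, {}).get(column)


# Only cells that exist are stored; missing columns cost nothing.
stored_cells = sum(len(columns) for columns in table.values())
all_columns = sorted({c for columns in table.values() for c in columns})

print(stored_cells)      # 5
print(len(all_columns))  # 4
```

A dense two-dimensional layout of these two rows and four columns would reserve 8 slots; the sparse model keeps only the 5 cells that hold data.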

5. Logical view of the table structure in HBase:

HBase stores data in tables. A table consists of rows and columns, and the columns are grouped into column families (also called column clusters).

The lookup path when querying data in HBase:

table -> row key -> column family -> column -> timestamp

Row key: sorted in lexicographic (dictionary) order

Column family: a group of columns. The column family is specified when the table is created; the columns inside it are specified when data is inserted.

Columns: a column family can contain many columns, and they can differ from row to row.

Timestamp: each column value can be stored in multiple versions. The version number is the timestamp, and versions are sorted from newest to oldest.
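One way to picture this lookup path is as a nested map. The following Python sketch is a mental model only, under the assumption that a dict of dicts is an acceptable stand-in for the real storage format; the sample data and the function name `get` are hypothetical.

```python
# table -> row key -> column family -> column -> {timestamp: value}
table = {
    "row1": {
        "info": {
            "name": {1700000002000: b"Alice B.", 1700000001000: b"Alice"},
        },
    },
}


def get(table, rowkey, family, column, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table.get(rowkey, {}).get(family, {}).get(column, {})
    newest_first = sorted(versions.items(), key=lambda item: item[0], reverse=True)
    return newest_first[:max_versions]


latest = get(table, "row1", "info", "name")
history = get(table, "row1", "info", "name", max_versions=5)

print(latest)        # [(1700000002000, b'Alice B.')]
print(len(history))  # 2
```

By default only the newest version is returned, which matches the newest-first ordering of versions described above.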

Features:

- An RDBMS can be abstracted as a two-dimensional table of rows and columns, where a row and a column together identify a unique value.

- HBase is essentially a key-value database: the key is the row key, and the value is the collection of all the actual key-value pairs in that row.

- HBase can also be abstracted as a four-dimensional table, made up of the row key (RowKey), column family (Column Family), column (Column), and timestamp (Timestamp).

- All columns of an HBase table are divided into column families.

- Each column family of each region is a store, which appears as a directory in HDFS.

6. Explanation of HBase-specific terms:

Row key (rowkey):

As in other NoSQL databases, the row key is the primary key used to retrieve records. A row key can be any string (the maximum length is 64 KB; in practice it is usually 10 to 100 bytes), preferably 16 bytes. Internally, HBase stores the row key as a byte array and sorts the data in the table by row key in lexicographic (dictionary) order.
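Because row keys are compared as bytes, numeric parts of a key do not sort numerically unless they are padded. The short Python sketch below shows the effect; the `user` prefix, the padding width, and the helper `padded_rowkey` are assumptions made for the example, not a convention required by HBase.

```python
def padded_rowkey(prefix: str, number: int, width: int = 6) -> bytes:
    """Build a row key whose numeric part is zero-padded to a fixed width."""
    return f"{prefix}{number:0{width}d}".encode("utf-8")


# Python compares bytes objects lexicographically, byte by byte.
naive = sorted(f"user{n}".encode("utf-8") for n in (1, 2, 10, 100))
padded = sorted(padded_rowkey("user", n) for n in (1, 2, 10, 100))

print(naive)   # [b'user1', b'user10', b'user100', b'user2']
print(padded)  # [b'user000001', b'user000002', b'user000010', b'user000100']
```

With fixed-width keys, lexicographic order and numeric order agree, which keeps related rows next to each other for range scans.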

There are only three ways to access rows in an HBase table:

- access by a single row key

- range scan over a range of row keys

- full table scan
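These three access paths can be sketched over a sorted list of row keys. This is a toy model in plain Python using `bisect`, not the HBase client API; the sample rows, the function names, and the half-open range (start included, stop excluded) are assumptions made for the example.

```python
import bisect

# Row keys kept in sorted order, as in the description above.
rows = [
    (b"user000001", {"info:name": b"alice"}),
    (b"user000002", {"info:name": b"bob"}),
    (b"user000010", {"info:name": b"carol"}),
    (b"user000100", {"info:name": b"dave"}),
]
keys = [key for key, _ in rows]


def get(rowkey):
    """1. Point lookup by a single row key (binary search)."""
    i = bisect.bisect_left(keys, rowkey)
    if i < len(keys) and keys[i] == rowkey:
        return rows[i][1]
    return None


def range_scan(start, stop):
    """2. Range scan: rows with start <= row key < stop."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return rows[lo:hi]


def full_scan():
    """3. Full table scan: every row, in row key order."""
    return list(rows)


print(get(b"user000002"))
print([key for key, _ in range_scan(b"user000002", b"user000100")])
print(len(full_scan()))
```

The first two paths only touch the rows they need because the keys are sorted; the full scan reads everything, which is why row key design matters so much for query performance.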

Column family (column cluster):

Every column in an HBase table belongs to a column family. The column family is part of the table's schema (individual columns are not) and must be defined before the table is used; it should be treated as fixed afterwards, since changing it means altering the table. Column names are prefixed with the column family name, and access control as well as disk and memory usage accounting are all done at the column family level.

Note: the more column families a table has, the more files take part in I/O and searching when a row is fetched, so do not create more column families than necessary (preferably just one).

Timestamp:

In HBase, a row key and a column together identify a storage unit called a cell. Each cell holds multiple versions of the same data, and the versions are indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when the data is written, in which case it is the current system time in milliseconds, or it can be set explicitly by the client. Within each cell, versions are sorted in reverse chronological order, so the latest data comes first.

To avoid the management burden of keeping too many versions, HBase provides two ways to reclaim old versions:

- keep only the last n versions of the data (by count)

- keep only the versions written within a recent period of time (by setting the data's time to live, TTL)
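The two reclamation rules can be sketched as a small pruning function. This is a toy model in Python, not HBase's actual cleanup logic; the function name `prune_versions` and its parameters are hypothetical, and timestamps are plain millisecond integers.

```python
def prune_versions(versions, now_ms, max_versions=None, ttl_ms=None):
    """Return the versions that survive, newest first.

    versions: dict mapping timestamp (ms) -> value.
    max_versions: keep at most this many of the newest versions.
    ttl_ms: drop versions older than now_ms - ttl_ms.
    """
    kept = sorted(versions.items(), reverse=True)  # newest timestamp first
    if ttl_ms is not None:
        kept = [(ts, v) for ts, v in kept if now_ms - ts <= ttl_ms]
    if max_versions is not None:
        kept = kept[:max_versions]
    return kept


cell = {1000: b"v1", 2000: b"v2", 3000: b"v3", 4000: b"v4"}

by_count = prune_versions(cell, now_ms=5000, max_versions=2)
by_ttl = prune_versions(cell, now_ms=5000, ttl_ms=1500)

print(by_count)  # [(4000, b'v4'), (3000, b'v3')]
print(by_ttl)    # [(4000, b'v4')]
```

Both rules can be combined: the TTL filter is applied first, then the count limit keeps the newest of what remains.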

Cell:

In HBase, a row key and a column together identify a storage unit called a cell. A cell is uniquely identified by {rowkey, column (= column family + column qualifier), version}. The data in a cell is untyped and is stored entirely as bytes.
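Since cell values are just bytes, the application has to decide how to encode and decode them. The Python sketch below illustrates this with a toy cell map; the key layout as a tuple, the 8-byte big-endian integer encoding, and the helper names are assumptions made for the example rather than anything prescribed by HBase.

```python
import struct


def encode_long(value: int) -> bytes:
    """Encode an integer as 8 big-endian bytes (an assumed convention)."""
    return struct.pack(">q", value)


def decode_long(raw: bytes) -> int:
    """Decode 8 big-endian bytes back into an integer."""
    return struct.unpack(">q", raw)[0]


# A cell is addressed by (row key, family, qualifier, timestamp) and holds raw bytes.
cells = {
    (b"row1", b"info", b"name", 1700000001000): "alice".encode("utf-8"),
    (b"row1", b"stats", b"logins", 1700000001000): encode_long(42),
}

raw = cells[(b"row1", b"stats", b"logins", 1700000001000)]
print(len(raw))          # 8
print(decode_long(raw))  # 42
```

Nothing in the stored bytes says that one value is text and the other an integer; the reader has to know which encoding the writer used.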

7. Comparison between HBase and Hive:

Similarities:

- Both HBase and Hive are built on Hadoop and use HDFS as the underlying data storage, and both use MapReduce for data computation.

Differences:

- Hive was built on Hadoop to reduce the complexity of MapReduce programming, whereas HBase was built to make up for Hadoop's weakness in real-time operations.

- A Hive table is purely logical: Hive itself neither stores nor computes data and depends entirely on Hadoop. An HBase table is physical: HBase provides a very large in-memory hash table that stores the index to speed up queries.

- Hive is a data warehouse. When a full table scan is needed, use Hive, because Hive is file-based storage. HBase is a database. When indexed access is needed, use HBase, because HBase is a column-oriented NoSQL database.

- Hive does not support single-row operations; its data processing relies on MapReduce, so operation latency is high. HBase supports single-row CRUD and processes requests in real time, which makes it far more efficient than Hive for such operations.

