The HBase Foundation of Big Data

2025-07-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Introduction to HBase

1.1. What is HBase?

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive commodity PC servers.

The goal of HBase is to store and process very large data sets: more specifically, to handle tables with enormous numbers of rows and columns using only ordinary hardware.

HBase is an open source implementation modeled on Google Bigtable, but the two differ in several ways.

For example:

Google Bigtable uses GFS as its file storage system, whereas HBase uses Hadoop HDFS.

Google runs MapReduce to process the massive data in Bigtable, and HBase likewise uses Hadoop MapReduce to process the massive data in HBase.

Google Bigtable uses Chubby as its coordination service, whereas HBase uses ZooKeeper as the counterpart.

1.2. Comparison with traditional databases

1. Problems encountered with traditional databases:

1) They cannot store the data once the volume becomes very large.

2) They have no good backup mechanism.

3) Once the data reaches a certain volume, performance degrades, and with very large volumes the database basically cannot cope.

2. HBase advantages:

1) Linear scalability: as the amount of data grows, it can be handled by adding nodes.

2) The data is stored on HDFS, which provides a sound backup mechanism.

3) Data lookup is coordinated through ZooKeeper, and access is fast.

1.3. Roles in the HBase cluster

1. One or more master nodes: HMaster

2. Multiple slave nodes: HRegionServer

HBase data model

2.1. HBase data model

2.1.1. Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

1. Access through a single row key

2. Access through a range of row keys (regular expression)

3. Full table scan

The row key can be any string (the maximum length is 64 KB; in practical applications it is usually 10-100 bytes). Inside HBase, the row key is stored as a byte array. Data is stored sorted by the byte order of the row key. When designing keys, take full advantage of this sorted storage and keep rows that are often read together next to each other (locality).
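
The effect of byte-order sorting on key design can be seen in a small Python sketch. This is only an illustration, not HBase code, and the key names (`user1`, `user10`, ...) are made up for the example. It shows that keys with plain numeric suffixes do not sort numerically, while zero-padded keys of fixed width do.

```python
# Toy illustration of byte-order row key sorting (not HBase API code).
# All key names here are hypothetical.

def sort_row_keys(keys):
    """Sort row keys the way a byte-ordered store would: by raw bytes."""
    return sorted(keys, key=lambda k: k.encode("utf-8"))

# Plain numeric suffixes are compared byte by byte, so "user10" lands before "user2".
unpadded = sort_row_keys(["user2", "user10", "user1"])

# Zero-padding to a fixed width makes byte order match numeric order.
padded = sort_row_keys(["user%04d" % n for n in (2, 10, 1)])

print(unpadded)
print(padded)
```

Fixed-width or otherwise carefully prefixed keys are one common way to keep related rows adjacent so that a range scan can read them together.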

2.1.2. Column Family

Column family: each column in an HBase table belongs to a column family. Column families are part of the table's schema (individual columns are not) and must be defined before the table is used. Column names are prefixed with their column family. For example, courses:history and courses:math both belong to the courses column family.
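
The difference between column families (fixed in the schema) and columns (free-form) can be sketched with a toy in-memory table. The class `ToyTable`, the row key `student1` and the stored values are hypothetical names used only for this illustration; this is not the HBase client API.

```python
# Toy model: column families are declared up front, columns are not.
# ToyTable and all names below are hypothetical, for illustration only.

class ToyTable:
    def __init__(self, families):
        self.families = set(families)  # part of the schema, fixed at creation
        self.rows = {}                 # row key -> {"family:column": value}

    def put(self, row_key, column, value):
        # The column name is prefixed with its family: "family:column".
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        self.rows.setdefault(row_key, {})[column] = value

table = ToyTable(["courses"])
table.put("student1", "courses:history", b"90")
table.put("student1", "courses:math", b"85")  # a new column needs no schema change
print(sorted(table.rows["student1"]))
```

Writing to a column whose family was never declared, such as `grades:final` here, is rejected in this sketch, which mirrors the rule that families must exist before use.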

2.1.3. Cell

A cell is the unit uniquely determined by {row key, column (column family:column), version}. The data in a cell has no type and is stored entirely as raw bytes.

Keywords: untyped, raw bytes

2.1.4. Time Stamp

The storage unit determined by a row key and a column in HBase is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when the data is written, in which case it is the current system time in milliseconds, or it can be assigned explicitly by the client. If the application wants to avoid data version conflicts, it must generate its own unique timestamps. Within each cell, the versions are sorted in reverse chronological order, so the latest data comes first.

To avoid the management burden (both storage and indexing) caused by too many versions, HBase provides two ways to reclaim old versions. One is to keep the last n versions of the data; the other is to keep only the versions from a recent time window (for example, the last seven days). Both can be set per column family.
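
The two retention rules can be illustrated with a small Python model of a single cell. This is a sketch under simplifying assumptions, not HBase's implementation: the class `ToyCell` and its parameters `max_versions` and `ttl_ms` are hypothetical names chosen for the example, and pruning is triggered by hand.

```python
# Toy model of one cell with timestamped versions (not HBase code).
# ToyCell, max_versions and ttl_ms are hypothetical names.
import time

class ToyCell:
    """One cell holding several timestamped versions, newest first."""

    def __init__(self, max_versions=3, ttl_ms=None):
        self.max_versions = max_versions
        self.ttl_ms = ttl_ms
        self.versions = []  # list of (timestamp_ms, value)

    def put(self, value, timestamp_ms=None):
        # Without an explicit timestamp, use the current time in milliseconds.
        if timestamp_ms is None:
            timestamp_ms = int(time.time() * 1000)
        self.versions.append((timestamp_ms, value))
        # Reverse chronological order: latest version first.
        self.versions.sort(key=lambda v: v[0], reverse=True)

    def prune(self, now_ms):
        """Apply both retention rules: age limit first, then version count."""
        kept = self.versions
        if self.ttl_ms is not None:
            kept = [v for v in kept if now_ms - v[0] <= self.ttl_ms]
        self.versions = kept[: self.max_versions]

    def get(self):
        """Return the newest value, or None if the cell is empty."""
        return self.versions[0][1] if self.versions else None

cell = ToyCell(max_versions=2)
cell.put(b"v1", timestamp_ms=100)
cell.put(b"v2", timestamp_ms=200)
cell.put(b"v3", timestamp_ms=300)
cell.prune(now_ms=300)
print(cell.versions)
```

With `max_versions=2`, the oldest of the three versions is dropped and a plain read returns the newest one.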

HBase commands

3.1. Entering and exiting the shell

1. HBase provides a shell terminal for user interaction.

# $HBASE_HOME/bin/hbase shell

2. To exit, run the quit command.

# $HBASE_HOME/bin/hbase shell

……

quit

3.2. Commands

Name | Command expression
Create a table | create 'table name', 'column family name 1', 'column family name 2', 'column family name N'
View all tables | list
Describe a table | describe 'table name'
Check whether a table exists | exists 'table name'
Check whether a table is enabled or disabled | is_enabled 'table name' / is_disabled 'table name'
Add a record | put 'table name', 'rowkey', 'column family:column', 'value'
View all the data under a row key | get 'table name', 'rowkey'
Count the records in a table | count 'table name'
Get a column family | get 'table name', 'rowkey', 'column family'
Get one column of a column family | get 'table name', 'rowkey', 'column family:column'
Delete a record | delete 'table name', 'rowkey', 'column family:column'
Delete an entire row | deleteall 'table name', 'rowkey'
Delete a table | The table must be disabled before it can be deleted: first disable 'table name', then drop 'table name'
Clear a table | truncate 'table name'
View all records | scan 'table name'
View all the data of one column in a table | scan 'table name', {COLUMNS => 'column family name:column name'}
Update a record | Write the value again with put to overwrite it. HBase does not modify data in place; every write is appended.

HBase's dependence on ZooKeeper

1. ZooKeeper stores the address of the HMaster and of the backup master.

HMaster:

A) Manages the HRegionServers

B) Is the node that handles creating, deleting, altering and looking up tables

C) Manages the assignment of tables to the HRegionServers

2. ZooKeeper stores the address of the -ROOT- table.

-ROOT- is HBase's default root table, used for lookups.

3. ZooKeeper stores the list of HRegionServers.

HRegionServers handle inserts, deletes, updates and queries on table data.

They interact with HDFS to store and retrieve the data.

HBase principles

5.1. System diagram

5.1.1. Write process

1. The client sends a write request to the HRegionServer.

2. The HRegionServer writes the data to the HLog (write-ahead log), which provides persistence and recovery.

3. The HRegionServer writes the data to memory (MemStore).

4. The HRegionServer reports to the client that the write succeeded.
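
The ordering of the four write steps, log first and memory second, can be sketched in a few lines of Python. This is a toy model, not HBase code: `ToyRegionServer`, its `recover` method and the sample row are hypothetical, and the log is a plain in-process list rather than a file on HDFS.

```python
# Toy model of the write path: write-ahead log, then memory, then acknowledge.
# ToyRegionServer and all names below are hypothetical.

class ToyRegionServer:
    def __init__(self):
        self.hlog = []       # write-ahead log, append-only
        self.memstore = {}   # in-memory store

    def write(self, row_key, column, value):
        # Step 2: record the edit in the log first, so it can be replayed later.
        self.hlog.append((row_key, column, value))
        # Step 3: apply the edit to the in-memory store.
        self.memstore[(row_key, column)] = value
        # Step 4: acknowledge the write to the caller.
        return True

    def recover(self):
        """Rebuild the in-memory store by replaying the log, as after a crash."""
        self.memstore = {}
        for row_key, column, value in self.hlog:
            self.memstore[(row_key, column)] = value

server = ToyRegionServer()
ack = server.write("row1", "cf:a", b"1")
server.memstore.clear()  # simulate losing the contents of memory
server.recover()
print(server.memstore)
```

Because the edit reaches the log before the acknowledgement is sent, an acknowledged write can be rebuilt from the log even if the memory contents are lost.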

5.1.2. Data flush process

1. When the data in the MemStore reaches the threshold (64 MB by default), it is flushed to disk; the data is then removed from memory, and the corresponding historical data in the HLog is deleted.

2. The flushed data is stored in HDFS.

3. A marker is recorded in the HLog.
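
The flush trigger can be modeled with a small Python sketch. It is a simplification under stated assumptions: `ToyStore` and `flush_threshold_bytes` are hypothetical names, a Python list stands in for files on HDFS, the threshold is a few bytes instead of 64 MB, and the log marker is simplified to truncating the log.

```python
# Toy model of a memstore flush (not HBase code).
# ToyStore and flush_threshold_bytes are hypothetical names.

class ToyStore:
    def __init__(self, flush_threshold_bytes):
        self.flush_threshold_bytes = flush_threshold_bytes
        self.memstore = {}
        self.memstore_bytes = 0
        self.hlog = []
        self.store_files = []  # stands in for files written to HDFS

    def write(self, key, value):
        self.hlog.append((key, value))
        self.memstore[key] = value
        self.memstore_bytes += len(key) + len(value)
        # Flush as soon as the in-memory data reaches the threshold.
        if self.memstore_bytes >= self.flush_threshold_bytes:
            self.flush()

    def flush(self):
        # Write the memory contents out as one sorted, immutable file.
        self.store_files.append(sorted(self.memstore.items()))
        # Clear memory and drop the log entries that are now persisted.
        self.memstore = {}
        self.memstore_bytes = 0
        self.hlog = []

store = ToyStore(flush_threshold_bytes=16)
store.write(b"row2", b"bbbb")  # 8 bytes so far, below the threshold
store.write(b"row1", b"aaaa")  # 16 bytes in total, triggers a flush
print(store.store_files)
```

Each flush produces a new file sorted by key, which is why the number of files grows over time and later has to be merged.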

5.1.3. Data merging process

1. When the number of data blocks reaches 4, HMaster loads the data blocks locally and merges them.

2. When the merged data exceeds 256 MB, the region is split and the resulting regions are assigned to different HRegionServers to manage.

3. When an HRegionServer goes down, the HLog on that server is split and handed to other HRegionServers to load, and .META. is updated.

4. Note: the HLog is synchronized to HDFS.
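
The merge and split steps above can be sketched as two small Python functions. This is a toy illustration, not HBase's algorithm: the function names `compact` and `split` are hypothetical, the file-count threshold of 4 is taken from the text, and a row count (`max_rows`) stands in for the 256 MB size limit.

```python
# Toy model of merging store files and splitting an oversized region.
# compact, split and max_rows are hypothetical names, for illustration only.

def compact(store_files):
    """Merge several sorted files into one; later files win on duplicate keys."""
    merged = {}
    for store_file in store_files:  # oldest file first
        for key, value in store_file:
            merged[key] = value
    return sorted(merged.items())

def split(region, max_rows):
    """Split a region in two at the middle key when it has grown too large."""
    if len(region) <= max_rows:
        return [region]
    mid = len(region) // 2
    return [region[:mid], region[mid:]]

files = [
    [(b"a", b"1"), (b"c", b"1")],
    [(b"b", b"1")],
    [(b"a", b"2"), (b"d", b"1")],
    [(b"e", b"1")],
]
COMPACTION_FILE_COUNT = 4  # threshold taken from the text
region = compact(files) if len(files) >= COMPACTION_FILE_COUNT else None
regions = split(region, max_rows=4)
print(regions)
```

Because row keys are stored in sorted order, a split produces two contiguous key ranges, each of which can be served by a different server.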

5.1.4. Read process

1. The client locates the HRegionServer through ZooKeeper and the -ROOT- and .META. tables.

2. Data from memory and from disk is merged and returned to the client.

3. Data blocks are cached.
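
Steps 2 and 3 of the read path, merging memory with disk and caching what was read, can be illustrated with a toy Python reader. This is a sketch only: `ToyReader`, `block_cache` and `disk_reads` are hypothetical names, dictionaries stand in for files, the cache works per key rather than per block, and cache invalidation is not modeled.

```python
# Toy model of a read that merges memory and disk and caches disk results.
# ToyReader and all names below are hypothetical.

class ToyReader:
    def __init__(self, memstore, store_files):
        self.memstore = memstore        # newest data, held in memory
        self.store_files = store_files  # older data, oldest file first
        self.block_cache = {}
        self.disk_reads = 0

    def _read_from_files(self, key):
        # Serve repeated reads from the cache instead of going to disk again.
        if key in self.block_cache:
            return self.block_cache[key]
        self.disk_reads += 1
        value = None
        for store_file in self.store_files:
            if key in store_file:
                value = store_file[key]  # later files override earlier ones
        self.block_cache[key] = value
        return value

    def get(self, key):
        # Memory holds the most recent writes, so it takes precedence.
        if key in self.memstore:
            return self.memstore[key]
        return self._read_from_files(key)

reader = ToyReader(
    memstore={"row1": b"new"},
    store_files=[{"row1": b"old", "row2": b"x"}, {"row2": b"y"}],
)
first = reader.get("row2")
second = reader.get("row2")
print(first, second, reader.disk_reads)
```

The second read of the same key is answered from the cache, so the simulated disk is touched only once.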

5.1.5. Duties of HMaster

1. Manages users' create, delete, alter and query operations on tables.

2. Records which HRegionServer each region is on.

3. After a region split, is responsible for assigning the new regions.

4. Manages load balancing across HRegionServers and adjusts the region distribution when new machines join.

5. After an HRegionServer goes down, is responsible for migrating the regions that were on the failed HRegionServer.

5.1.6. Duties of HRegionServer

The HRegionServer is the core module of HBase. It is mainly responsible for responding to user I/O requests and for reading and writing data in the HDFS file system.

An HRegionServer manages many table partitions, that is, regions.

5.1.7. Duties of the client

The HBase client uses HBase's RPC mechanism to communicate with the HMaster and the HRegionServers.

Management operations: the client makes RPC calls to the HMaster.

Data read and write operations: the client makes RPC calls to the HRegionServers.
