
Learning the HBase Framework


1 Background Knowledge

1.1 Problem Solved

HBase solves the problem that HDFS does not support fast lookup and update of individual records.

1.2 Applicability

Use HBase for tables with hundreds of millions of rows or more; with only a few thousand to a few million rows, an RDBMS is the more appropriate choice

Make sure your application does not need the advanced features of an RDBMS (secondary indexes, transactions, advanced query languages, etc.)

Make sure you have sufficient hardware, i.e. enough nodes: HDFS does not perform well with fewer than five DataNodes, and the same holds for HBase.

2 Design Concepts

2.1 Overview

2.1.1 Introduction

A NoSQL distributed database written in Java

Does not support some advanced RDBMS features, such as transactions, secondary indexes, and advanced query languages

Supports linear and modular scaling: performance grows roughly linearly as RegionServers are added on commodity machines

2.1.2 HBase Features

Strongly consistent reads and writes: well suited to tasks such as high-speed counter aggregation

Automatic sharding: data is stored distributed across Regions, which are split automatically as the data grows

Automatic RegionServer failover

Integrates with HDFS

Supports MapReduce for massively parallel processing

Provides a Java Client API

Provides Thrift/REST APIs

Block cache and Bloom Filters for query optimization

Visual management interface

2.1.3 Disadvantages

WAL replay is slow

Failure recovery is slow and complex

Major compaction can cause I/O storms (bursts of heavy I/O)

2.2 Design and Architecture

2.2.1 Basic Concepts

2.2.1.1 Explanation of Basic Concepts

Table: a table consists of multiple rows

Row: a row consists of a row key and one or more columns

Column: a column is named by a column family and a column qualifier; the set of columns can differ greatly from row to row

Column Family: physically stores a set of columns and their values together, largely for performance; column families must be specified when the table is created, e.g. the content in content:html

Column Qualifier: an index into the data within a column family, e.g. the html in content:html; qualifiers need not be specified at table creation and can be added at any time

Cell: the unit determined by row, column family, column qualifier, value, and a timestamp that represents the version

TimeStamp: the timestamp indicates the version of the data; it is the system time by default, or can be specified explicitly

2.2.1.2 Example

This example is taken from the official documentation.

| Row Key           | Time Stamp | ColumnFamily contents     | ColumnFamily anchor           | ColumnFamily people        |
| ----------------- | ---------- | ------------------------- | ----------------------------- | -------------------------- |
| "com.cnn.www"     | t9         |                           | anchor:cnnsi.com = "CNN"      |                            |
| "com.cnn.www"     | t8         |                           | anchor:my.look.ca = "CNN.com" |                            |
| "com.cnn.www"     | t6         | contents:html = "<html>…" |                               |                            |
| "com.cnn.www"     | t5         | contents:html = "<html>…" |                               |                            |
| "com.cnn.www"     | t3         | contents:html = "<html>…" |                               |                            |
| "com.example.www" | t5         | contents:html = "<html>…" |                               | people:author = "John Doe" |

Description:

The tabular layout is neither the only nor the most precise representation; the same data can also be expressed in JSON format (see the sketch below)

Blank cells in the table take up no physical storage space; they exist only conceptually
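For illustration, here is the example table above in the loose JSON-like notation used by the official HBase documentation (not strictly valid JSON):

```
{
  "com.cnn.www": {
    contents: {
      t6: contents:html: "<html>…"
      t5: contents:html: "<html>…"
      t3: contents:html: "<html>…"
    }
    anchor: {
      t9: anchor:cnnsi.com = "CNN"
      t8: anchor:my.look.ca = "CNN.com"
    }
    people: {}
  }
  "com.example.www": {
    contents: {
      t5: contents:html: "<html>…"
    }
    anchor: {}
    people: {
      t5: people:author: "John Doe"
    }
  }
}
```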

2.2.1.3 Operations

The relationship between each operation, its API, and versions (a minimal sketch follows the notes below):

Get: executed via Table.get; returns the attributes of the specified row. If no version is specified, the cell with the largest version value is returned (which may not be the most recently written); the number of versions returned can be changed by setting MaxVersions.

Scan: executed via Table.scan; returns all rows that match the given criteria.

Put: executed via Table.put (with write buffer) or Table.batch (without write buffer); if the key does not exist, the row is inserted. The system time is used as the version by default; a cell is overwritten only when key, column, and version are all identical, and a version can be specified explicitly when inserting.

Delete: executed via Table.delete; it can 1. delete a specified column; 2. delete all versions of a column; 3. delete all columns of a particular column family. A delete is not executed immediately: a tombstone marker is attached to the data, and the dead data and tombstones are cleared during compaction. How long deleted cells linger can be tuned by setting the hbase.hstore.time.to.purge.deletes property in hbase-site.xml.

Description:

The maximum and minimum number of versions to keep can be configured, and these limits affect the operations above

Versions (timestamps) also control how long data survives; it is best not to set them manually
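To make the four operations concrete, here is a minimal sketch using the HBase Java client API (HBase 2.x signatures); the table name "webtable" and the column names are illustrative assumptions, not part of the original text:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BasicOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {

            // Put: inserts if the key is absent; the server assigns the
            // system time as the version unless one is given explicitly
            Put put = new Put(Bytes.toBytes("com.cnn.www"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Get: without readVersions(), only the cell with the largest
            // version value comes back
            Get get = new Get(Bytes.toBytes("com.cnn.www"));
            get.readVersions(3); // return up to 3 versions instead of 1
            Result result = table.get(get);
            System.out.println(result);

            // Scan: returns every row matching the (here: empty) criteria
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }

            // Delete: only writes a tombstone; the data disappears from
            // disk at the next compaction
            Delete delete = new Delete(Bytes.toBytes("com.cnn.www"));
            delete.addColumns(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            table.delete(delete);
        }
    }
}
```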

2.2.1.4 Limitations

1) Delete operations affect Put operations: because a Delete is not executed immediately but only marks the data with a tombstone, a Put whose version is less than or equal to an existing Delete's version T is masked as well, and all masked data is cleared during the system's next major compaction. Queries will therefore not return the newly Put data (a hypothetical sketch follows this list). This cannot happen if you never set versions manually and let cells use the system default time.

2) Compaction affects queries: create three cells with versions t1, t2, and t3, and set the maximum number of versions to 2. A query for all versions then returns only t2 and t3. But after deleting t2 and t3, version t1 reappears. Once a major compaction has run, t1 is physically removed and this behavior can no longer occur.
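A hypothetical sketch of limitation 1) with explicit timestamps; the table "demo" and the row, family, and qualifier names are made-up assumptions:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteMasksPut {
    public static void main(String[] args) throws Exception {
        byte[] row = Bytes.toBytes("row1");
        byte[] fam = Bytes.toBytes("cf");
        byte[] qual = Bytes.toBytes("q");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("demo"))) {
            long t = 1000L;
            // Tombstone covering every version of cf:q with timestamp <= t
            table.delete(new Delete(row).addColumns(fam, qual, t));
            // This Put's version (t) is <= the tombstone's version, so it is
            // masked too and will be purged at the next major compaction
            table.put(new Put(row).addColumn(fam, qual, t, Bytes.toBytes("v")));
            // Prints true: the newly Put value is invisible to queries
            System.out.println(table.get(new Get(row)).isEmpty());
        }
    }
}
```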

See the official documentation for more information about the data model.

2.2.2 Architecture

2.2.2.1 Architecture Features

1) Master-Slave Architecture

2) There are three components:

| Component     | Main functions                                                                       |
| ------------- | ------------------------------------------------------------------------------------ |
| HMaster       | Responsible for Region assignment and DDL operations (creating and deleting tables)   |
| HRegionServer | Responsible for reading and writing data; communicates with clients                   |
| ZooKeeper     | Maintains the cluster's liveness state                                                |

3) The underlying storage is HDFS

2.2.2.2 Components

hbase:meta: holds the information of all Regions

1) Structure:

Key

Format: ([table], [region start key], [region id])

Values

info:regioninfo (serialized HRegionInfo instance for this Region)

info:server (server:port of the RegionServer containing this Region)

info:serverstartcode (start time of the RegionServer containing this Region)

2) Storage location: the location of hbase:meta is kept in ZooKeeper
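As a sketch of what this structure looks like in practice, the catalog table can be scanned like any ordinary table with the Java client (assuming a running cluster):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanMeta {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table meta = conn.getTable(TableName.valueOf("hbase:meta"));
             ResultScanner scanner = meta.getScanner(new Scan().addFamily(Bytes.toBytes("info")))) {
            for (Result r : scanner) {
                // The row key has the form ([table],[region start key],[region id])
                byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
                System.out.println(Bytes.toString(r.getRow()) + " -> "
                        + (server == null ? "?" : Bytes.toString(server)));
            }
        }
    }
}
```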

HMaster: controller

Assigns Regions: at startup, when re-assigning Regions from a failed RegionServer, and when a Region is split

Monitors all RegionServers in the cluster to achieve load balancing

DDL (Data Definition Language): creating, deleting, and updating tables, where updates apply to column families (see the sketch after this list)

Manages metadata for namespaces and tables

Rights Management (ACL)

Garbage file collection on HDFS
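A minimal sketch of one such DDL operation through the Java Admin API; the table name "webtable", the family "contents", and the version limit are illustrative assumptions:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Column families must be declared here; qualifiers can be added later
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("webtable"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                            .newBuilder(Bytes.toBytes("contents"))
                            .setMaxVersions(3) // keep up to 3 versions per cell
                            .build())
                    .build();
            admin.createTable(desc); // HMaster coordinates the Region assignment
        }
    }
}
```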

HRegionServer: HBase's actual reader and writer

Responds to read and write requests from clients and performs the I/O operations (clients contact the RegionServer directly, bypassing HMaster)

Interacts with HDFS to manage table data

Splits a Region when its size reaches the configured threshold

For details on this section, refer to the Region Server writeup

ZooKeeper: coordinator

Ensure that one and only one HMaster in the cluster is Active

Stores the location of hbase:meta, the table holding the location information for all Regions

Store metadata information for tables in HBase

Monitor the status of RegionServer and report the status of RS to HMaster

The ZooKeeper ensemble itself uses a consensus protocol (ZAB, a Paxos-like atomic broadcast protocol) to keep the state of its nodes consistent.

Region: the basic unit of HBase data storage and management.

For details on this section, refer to the Region writeup

2.3 Related Processes

2.3.1 First Read and Write Process

For this section, refer to the first read and write process in the detailed Region Server writeup.

2.3.2 Write Process

For this section, refer to the write process in the detailed Region Server writeup.

2.3.3 Read Process

For this section, refer to the read process in the detailed Region Server writeup.

2.4 Related Mechanisms

2.4.1 Compaction Mechanism

2.4.1.1 Minor Compaction

For this section, refer to the minor compaction section in the detailed Region Server writeup.

2.4.1.2 Major Compaction

For this section, refer to the major compaction section in the detailed Region Server writeup.

2.4.2 WAL Replay Mechanism

For this section, refer to WAL replay in the detailed Region Server writeup.

2.5 Version Updates

2.5.1 Meta Table => hbase:meta

2.5.1.1 -ROOT- and .META.

Before 0.96.x, two tables, -ROOT- and .META., maintained the metadata of the Regions.

1) Structure:

Key

.META. region key (.META.,1)

Values

info:regioninfo (serialized HRegionInfo instance of the .META. region)

info:server (server:port of the RegionServer holding the .META. region)

info:serverstartcode (start time of the RegionServer holding the .META. region)

2) The process of reading Region location information:

Read from ZooKeeper which HRegionServer hosts the -ROOT- table

From that HRegionServer, based on the requested table name and row key, read which HRegionServer hosts the relevant .META. region

Read the contents of the .META. table from that HRegionServer to obtain the location of the HRegion this request needs

Access that HRegionServer to get the requested data

2.5.1.2 hbase:meta

For this section, compare hbase:meta under 2.2.2.2 Components with the first read and write process under 2.3 Related Processes.

2.5.1.3 Purpose of the Upgrade

1) Before 0.96.x, the design followed Google's BigTable. Going from issuing a request to actually reading the data took four steps. Google designed BigTable for enormous data volumes: the multi-level catalog structure can address more Regions, but it also brings a decline in access performance.

2) Most companies have nowhere near Google's data volume, so removing the -ROOT- table, keeping only the .META. (hbase:meta) table, and increasing the Region size both meets storage needs and improves access performance.

2.5.2 HLog => WAL

Prior to 0.94.x, the WAL implementation in HBase was called HLog and was stored in the /hbase/.logs/ directory

From 0.94.x onward it was renamed WAL and is stored in the /hbase/WALs/ directory

2.6 Links to Other Frameworks

To be continued.

2.7 Performance Tuning

To be continued.

2.8 Advanced Features

To be continued.

3 Project Practice

3.1 Getting Started Guide

3.1.1 Environment Setup

For this section, refer to the HBase deployment getting-started guide.

3.1.2 Getting Started

For this section, refer to the HBase Shell exercise, the HBase Java API exercise, and using MapReduce to operate HBase.

3.2 Technical Difficulties

To be continued.

3.3 Problems Encountered in Development

To be continued.

3.4 Applications

3.4.1 OpenTSDB Development

To be continued.

4 Statement

The sections marked "to be continued" will be updated from time to time.
