This article shares some basic knowledge of HBase. It is intended as a practical reference; read on for the details.
Overview
HBase features:
Strongly consistent reads and writes: HBase is not an "eventually consistent" data store. This makes it well suited to tasks such as high-speed counter aggregation.
Automatic sharding: HBase tables are distributed across the cluster as regions. As data grows, regions are automatically split and redistributed.
Automatic RegionServer failover.
Hadoop/HDFS integration: HBase uses HDFS as its distributed file system.
MapReduce: HBase supports massively parallel processing via MapReduce, with HBase acting as both source and sink.
Java client API: HBase supports programmatic access through an easy-to-use Java API.
Thrift/REST API: HBase also supports Thrift and REST for non-Java front ends.
Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume query optimization.
Operational management: HBase provides built-in web pages for operational insight, as well as JMX metrics.
HBase is not suitable for all problems.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, HBase is a good candidate. If you only have a few thousand or a few million rows, a traditional RDBMS may be a better choice, because all of the data could fit on one or two nodes while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be ported to HBase simply by changing a JDBC driver; moving from an RDBMS to HBase is a complete redesign rather than a port.
Third, make sure you have enough hardware. HDFS does not do well with anything less than five DataNodes (due in part to HDFS block replication, which defaults to 3), plus a NameNode.
Write-Ahead Log (WAL)
Each RegionServer first records updates (Puts, Deletes) in the write-ahead log (WAL) and then applies them to the MemStore of the corresponding Store. This guarantees the durability of HBase writes. Without the WAL, if a RegionServer went down, any MemStore data that had not yet been flushed to StoreFiles would be lost. HLog is HBase's WAL implementation, and each RegionServer has one HLog instance.
WALs are stored under /hbase/.logs/ in HDFS, one log per RegionServer.
Region
Regions are the basic unit of availability and distribution for tables, and each consists of one Store per column family. The object hierarchy is as follows:
Table (HBase table)
    Region (Regions for the table)
        Store (Store per ColumnFamily for each Region for the table)
            MemStore (MemStore for each Store for each Region for the table)
            StoreFile (StoreFiles for each Store for each Region for the table)
                Block (Blocks within a StoreFile within a Store for each Region for the table)
Region Size
The size of the Region is a thorny issue, and the following factors need to be considered.
Regions are the most basic unit of availability and distribution.
HBase scales by splitting tables into regions spread across many machines. In other words, if you have 16GB of data in only 2 regions but have 20 machines, 18 of them are wasted.
Too many regions degrades performance, although much less than it used to. Still, for the same amount of data, 700 regions is better than 3000.
Too few regions hinders scalability and reduces parallelism, because the load is not spread widely enough. That is why importing 200MB of data into a 10-node HBase cluster leaves most of the nodes idle.
In terms of the indexes held by a RegionServer, there is little difference in memory footprint between 1 region and 10.
It is best to start with the default configuration; you can make hot tables smaller (or split hot regions to spread the load across the cluster). If your cells are relatively large (100KB or more), you can raise the region size to 1GB, as in the sketch below.
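As an illustration of the last point, here is a minimal sketch using the older Java admin API that raises the split threshold for a single table to roughly 1 GB via HTableDescriptor.setMaxFileSize; the table and column family names are placeholders, not anything from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RegionSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table with large cells: raise the per-table split
        // threshold to ~1 GB so regions grow bigger before they split.
        HTableDescriptor desc = new HTableDescriptor("big_cell_table"); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf"));                    // placeholder family
        desc.setMaxFileSize(1024L * 1024L * 1024L); // per-table override of hbase.hregion.max.filesize

        admin.createTable(desc);
        admin.close();
    }
}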
Store
A Store holds one MemStore and zero or more StoreFiles (HFiles). A Store corresponds to one column family of one region.
MemStore: the MemStore is the in-memory, modifiable part of a Store; the modifications are KeyValues. When a flush is triggered, the current MemStore is snapshotted and then cleared. While the snapshot is being written out, HBase keeps accepting edits into a new MemStore until the flush of the snapshot completes.
StoreFile (HFile): StoreFiles are where the data actually lives on disk.
Compaction
There are two types of compaction: minor compaction and major compaction. A minor compaction usually merges several small, adjacent StoreFiles into one larger file. It does not remove data marked for deletion, nor does it drop expired data; a major compaction does. Sometimes a minor compaction picks up all of the files in a Store, in which case it is effectively promoted to a major compaction.
After a major compaction, a Store has only one StoreFile, which usually improves read performance. Note that a major compaction rewrites all of the data in the Store, which can be very painful on a heavily loaded system.
Compaction does not merge regions.
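Because a major compaction rewrites every StoreFile in a Store, operators often prefer to trigger it explicitly during off-peak hours rather than let it run under load. A hedged sketch using the older HBaseAdmin API, with a placeholder table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MajorCompactExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Request a major compaction of the (hypothetical) table "mytable",
        // typically during an off-peak window, since it rewrites all StoreFiles.
        admin.majorCompact("mytable");
        admin.close();
    }
}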
Bulk Loading
Overview: there are several ways to load data into HBase tables. The most straightforward approaches use a MapReduce job or the normal client API, but neither is the most efficient.
The bulk load feature uses a MapReduce job to write table data in HBase's internal data format, and the resulting StoreFiles can then be loaded directly into a running cluster. Bulk loading consumes less CPU and network than simply going through the HBase API.
Bulk loading architecture
The HBase bulk loading process consists of two main steps.
Preparing data via a MapReduce job: the first step is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat. This output format writes data in HBase's internal format, so the files can later be loaded into the cluster very efficiently.
For this to work efficiently, HFileOutputFormat must be configured so that each output HFile fits within a single region. To achieve this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of the table.
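As a rough sketch of this preparation step (not the article's own code), the job setup below assumes an existing table called "mytable", placeholder input/output paths, and a toy mapper that turns "rowkey,value" CSV lines into Puts; it uses the older HTable/HFileOutputFormat API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

    // Hypothetical mapper: parses one "rowkey,value" line into a Put.
    public static class CsvToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
            context.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "bulk-load-prepare");
        job.setJarByClass(BulkLoadPrepare.class);
        job.setMapperClass(CsvToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path("/input/raw"));       // placeholder input
        FileOutputFormat.setOutputPath(job, new Path("/output/hfiles")); // placeholder StoreFile output

        // Configures the reducer, output format, and TotalOrderPartitioner
        // from the table's current region boundaries.
        HTable table = new HTable(conf, "mytable");
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}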
Completing the data load: after the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to. It then contacts the appropriate RegionServer, which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
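When the completion step is driven from Java instead of the command-line tool, the older client API exposes the same logic through LoadIncrementalHFiles. A minimal sketch, reusing the placeholder path and table name from the sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadComplete {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");            // placeholder table name

        // Moves the prepared HFiles under /output/hfiles into the regions
        // that currently own their key ranges, splitting files if boundaries moved.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/output/hfiles"), table);  // placeholder path

        table.close();
    }
}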
Table creation: pre-creating regions
By default, creating an HBase table creates a single region. When performing a bulk import, all clients write to that one region until it grows large enough to split. An effective way to speed up bulk imports is to pre-create empty regions. Be somewhat conservative, though, because too many regions can actually degrade performance.
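A minimal sketch of pre-splitting at creation time with the older HBaseAdmin API; the table name, column family, and split keys are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("import_table"); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf"));                  // placeholder family

        // Illustrative split keys: four empty regions covering the row-key space,
        // so parallel writers hit different RegionServers from the start.
        byte[][] splitKeys = new byte[][] {
            Bytes.toBytes("row-25000"),
            Bytes.toBytes("row-50000"),
            Bytes.toBytes("row-75000")
        };
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}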
Table creation: deferred log flush
The default behavior with Puts using the Write Ahead Log (WAL) is that HLog edits are written to disk immediately. With deferred log flush, WAL edits are kept in memory until the flush interval expires. The benefit is that HLog writes are batched and asynchronous; the risk is that if the RegionServer goes down, unflushed edits are lost. This is still much safer, however, than turning WAL off entirely for Puts.
Deferred log flush can be configured on a table via HTableDescriptor; the default value of hbase.regionserver.optionallogflushinterval is 1000 ms.
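A minimal sketch, assuming the older HTableDescriptor API in which deferred log flush is a per-table flag (newer releases express the same idea through durability settings); the table and family names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DeferredFlushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("metrics_table"); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf"));                   // placeholder family
        // WAL edits are buffered in memory and written out on the interval
        // (hbase.regionserver.optionallogflushinterval, default 1000 ms)
        // instead of being synced on every Put.
        desc.setDeferredLogFlush(true);

        admin.createTable(desc);
        admin.close();
    }
}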
HBase client: AutoFlush
When performing a lot of Puts, make sure setAutoFlush is turned off on your HTable instance. Otherwise each Put is sent to the RegionServer one at a time. Puts added via htable.put(Put) or htable.put(List<Put>) go into the client-side write buffer; with autoFlush = false, requests are only sent when the write buffer fills. To send the buffered requests explicitly, call flushCommits(); calling close() on the HTable instance also invokes flushCommits().
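A hedged sketch of buffered writes with the older HTable client; the table name, buffer size, and row/column names are placeholders.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "mytable");     // placeholder table name

        htable.setAutoFlush(false);                      // buffer Puts client-side
        htable.setWriteBufferSize(12L * 1024L * 1024L);  // e.g. 12 MB buffer (illustrative)

        List<Put> puts = new ArrayList<Put>();
        for (int i = 0; i < 10000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
            puts.add(put);
        }
        htable.put(puts);        // queued in the write buffer until it fills

        htable.flushCommits();   // explicitly push anything still buffered
        htable.close();          // close() also triggers flushCommits()
    }
}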
HBase client: turning off WAL on Puts
A frequently discussed option for increasing Put throughput is to call writeToWAL(false). Turning it off means the RegionServer no longer writes the Put to the Write Ahead Log, only to the MemStore. The consequence is that if the RegionServer fails, that data is lost. If you call writeToWAL(false), do so with your eyes open; you may also find it makes virtually no difference if your load is well distributed across the cluster.
In general, it is best to keep WAL enabled for Puts; if higher write throughput is needed, the bulk loading techniques described above are the better alternative.
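If, despite the warning above, WAL is skipped for particular Puts, a minimal sketch with the older Put API looks like this (newer clients express the same thing through durability settings); the names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoWalPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "mytable");   // placeholder table name

        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        put.setWriteToWAL(false);   // edit goes to the MemStore only; lost if the RegionServer dies
        htable.put(put);

        htable.flushCommits();
        htable.close();
    }
}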
Reading from HBase
Scan caching: if HBase is used as the input source of a MapReduce job, make sure the input Scan's setCaching value is larger than the default. Using the default means the map task makes a round trip to the RegionServer for every row; setting it to 500, for example, transfers 500 rows at a time. There is a trade-off, however: a larger value costs more memory on both the client and the server, so bigger is not always better.
Scan attribute selection
When a Scan is used to process a large number of rows (especially as MapReduce input), pay attention to which attributes are selected. If scan.addFamily is called, everything in that column family is returned. If only a small subset of columns is needed, specify just those columns; otherwise a lot of data is transferred for nothing, which hurts performance.
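A short sketch combining both read-side tips: a larger Scan cache and an explicit column instead of a whole family. The family and qualifier names are placeholders.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTuningExample {
    // Returns a Scan tuned for bulk reads: larger per-RPC batch, single column.
    public static Scan tunedScan() {
        Scan scan = new Scan();
        scan.setCaching(500);                                    // transfer ~500 rows per RPC instead of the default
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q")); // only the column we need, not the whole family
        return scan;
    }
}

The resulting Scan can then be handed to getScanner, as shown in the next subsection.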
Close ResultScanners
This is less about improving performance than about avoiding performance problems. If you forget to close a ResultScanner, it can cause problems on the RegionServer. Always close the ResultScanner in a try/finally block, for example:
Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
    for (Result r = rs.next(); r != null; r = rs.next()) {
        // process result...
    }
} finally {
    rs.close();  // always close the ResultScanner!
}
htable.close();

Thank you for reading! That concludes this overview of HBase basics. I hope it has been helpful; if you found the article useful, please share it so more people can see it.