What are the core knowledge points of HBase


This article shares the core knowledge points of HBase. The editor finds it very practical, so it is shared here for reference; follow along for a closer look.

I. Introduction to HBase

1. Basic concepts

HBase is a Hadoop database, often described as a sparse, distributed, persistent, multi-dimensional sorted map. It indexes data by row key, column key, and timestamp, and serves as a platform for storing and retrieving data with random access. HBase does not restrict the types of data stored, allows a dynamic and flexible data model, does not use SQL, and does not emphasize relationships between data. HBase is designed to run on a cluster of servers and scales out accordingly.

2. HBase usage scenarios and success stories

The Internet search problem: crawlers collect web pages and store them in BigTable; MapReduce jobs scan the full table to generate a search index; searches query results from BigTable and present them to users.

Capturing incremental data: for example, collecting monitoring metrics, capturing user interaction data, telemetry, targeted advertising, and so on.

Content service

Information exchange

3. HBase Shell command-line interaction:

Start the shell:

$ hbase shell

List all tables:

hbase> list

Create a table named mytable with a single column family hb:

hbase> create 'mytable', 'hb'

Insert the byte array 'hello HBase' into the cell at column 'hb:data' of row 'first' in the mytable table:

hbase> put 'mytable', 'first', 'hb:data', 'hello HBase'

Read the contents of row 'first' of the mytable table:

hbase> get 'mytable', 'first'

Read all the contents of the mytable table:

hbase> scan 'mytable'

II. Getting started

1. API

There are five HBase APIs related to data manipulation: Get (read), Put (write), Delete (delete), Scan (scan), and Increment (column value increment).

2. Table operations

The first step is to create a configuration object:

Configuration conf = HBaseConfiguration.create()

When running from Eclipse, the configuration files must also be added explicitly:

conf.addResource(new Path("E:\\share\\hbase-site.xml"))
conf.addResource(new Path("E:\\share\\core-site.xml"))
conf.addResource(new Path("E:\\share\\hdfs-site.xml"))

Obtain a table through the connection pool:

HTablePool pool = new HTablePool(conf, 1)
HTableInterface usersTable = pool.getTable("users")

3. Write operation

The command used to store data is put. To store data in a table, create a Put instance and supply the target row:

Put put = new Put(byte[] row)

Put's add method is used to add data, taking the column family, qualifier, and value respectively:

put.add(byte[] family, byte[] qualifier, byte[] value)

Finally, submit the command to the table and release the table:

usersTable.put(put)
usersTable.close()

To modify data, simply resubmit a Put with the latest value.
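Putting the pieces together, here is a minimal, self-contained write sketch. It assumes the older (0.9x-era) client API used throughout this article, a cluster configured on the classpath, and an existing users table with a column family info; the table, family, qualifier, and values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
        HTablePool pool = new HTablePool(conf, 1);
        HTableInterface usersTable = pool.getTable("users");
        try {
            Put put = new Put(Bytes.toBytes("user1"));   // row key
            put.add(Bytes.toBytes("info"),               // column family
                    Bytes.toBytes("name"),               // column qualifier
                    Bytes.toBytes("Alice"));             // cell value
            usersTable.put(put);                         // written to the WAL and the MemStore
        } finally {
            usersTable.close();                          // returns the table to the pool
        }
    }
}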

How HBase writes work:

Each time HBase performs a write, it writes to two places: the write-ahead log (also known as the HLog) and the MemStore (write buffer), which together ensure data durability. The write is considered complete only when the change has been written and confirmed in both places. The MemStore is an in-memory write buffer where HBase accumulates data before it is permanently written to disk. When the MemStore fills up, its contents are flushed to disk, generating an HFile.

4. Read operation

Create an instance of the Get command containing the row to query:

Get get = new Get(byte[] row)

Execute addColumn() or addFamily() to narrow the request.

Submitting the Get instance to a table returns a Result instance containing the data; without constraints, it holds all columns of all column families in the row.

Result r = usersTable.get(get)

Specific values can be retrieved from the Result instance:

byte[] b = r.getValue(byte[] family, byte[] qualifier)
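Continuing the write sketch above (same assumptions and illustrative names; additionally requires org.apache.hadoop.hbase.client.Get and org.apache.hadoop.hbase.client.Result), the value can be read back like this:

Get get = new Get(Bytes.toBytes("user1"));
get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));      // restrict to a single column
Result r = usersTable.get(get);
byte[] b = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
String name = Bytes.toString(b);                                  // "Alice", if the earlier Put succeeded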

Working mechanism:

The BlockCache stores frequently accessed data read from HFiles in memory, avoiding disk reads; each column family has its own BlockCache. To read a row from HBase, the server first checks the MemStore for pending modifications, then checks the BlockCache to see whether the block containing the row has been accessed recently, and finally accesses the corresponding HFile on disk.

5. Delete operation

Create an instance of Delete, specifying the row to delete:

Delete delete = new Delete(byte[] row)

A portion of a row can be deleted through the deleteFamily() and deleteColumn() methods, as in the sketch below.
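For example, under the same assumptions as the earlier sketches (requires org.apache.hadoop.hbase.client.Delete), removing a single cell of the row; omit the deleteColumn() call to delete the entire row:

Delete delete = new Delete(Bytes.toBytes("user1"));
delete.deleteColumn(Bytes.toBytes("info"), Bytes.toBytes("name")); // removes the newest version of this cell
usersTable.delete(delete);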

6. Table scan operation

Create a scan with Scan scan = new Scan(), which can specify start and stop rows.

The setStartRow(), setStopRow(), and setFilter() methods can be used to restrict the data returned.

The addColumn() and addFamily() methods can likewise restrict the scan to specific columns and column families, as in the sketch below.
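A minimal scan sketch under the same assumptions (the row-key bounds are illustrative; requires org.apache.hadoop.hbase.client.ResultScanner, and the scanner should always be closed):

Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("user0"));   // inclusive start row
scan.setStopRow(Bytes.toBytes("user9"));    // exclusive stop row
scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
ResultScanner scanner = usersTable.getScanner(scan);
try {
    for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()) + " => "
                + Bytes.toString(row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
} finally {
    scanner.close();
}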

The data model of an HBase schema includes:

Tables: HBase organizes data into tables.

Rows: within a table, data is stored in rows, and each row is uniquely identified by its row key. The row key has no data type; it is a byte array (byte[]).

Column families: data within a row is grouped by column family. Column families must be defined up front and are not easily modified. Every row in a table has the same column families.

Column qualifiers: data within a column family is located by its column qualifier, or column. Column qualifiers do not need to be defined in advance.

Cells: the data stored in a cell is called the cell value, and the value is a byte array. A cell is determined by its row key, column family, and column qualifier.

Time versions: each cell value is versioned by time; the version is a long.

An example of HBase data coordinates:
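Using the shell example from earlier, the coordinates table, row key, column family:column qualifier, and time version address exactly one cell value. The timestamp below is made up for illustration:

['mytable', 'first', 'hb:data', 1408938310043] -> 'hello HBase'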

HBase can thus be thought of as a key-value database. HBase is designed for semi-structured data, where records may have inconsistent columns, uncertain sizes, and so on.

III. Distributed HBase, HDFS and MapReduce

1. HBase in distributed mode

HBase splits tables into small data units called regions, which are distributed across multiple servers. A server hosting regions is called a RegionServer. In general, RegionServers are deployed side by side with HDFS DataNodes on the same physical hardware. A RegionServer is essentially an HDFS client that stores and accesses data there; the HMaster assigns regions to RegionServers, and each RegionServer hosts multiple regions.

Two special tables in HBase, -ROOT- and .META., are used to find where the regions of the various tables are located. -ROOT- points to the regions of the .META. table, and the .META. table points to the RegionServer hosting the region being looked up.

A client lookup walks a 3-tier structure that resembles a distributed B+ tree.

The top-level structure of HBase:

Zookeeper is responsible for tracking the RegionServers and storing the address of the root region.

The client is responsible for contacting the Zookeeper quorum and the HRegionServers.

HMaster is responsible for assigning the regions to the HRegionServers when HBase starts, including the -ROOT- and .META. tables.

HRegionServer is responsible for opening regions and creating the corresponding HRegion instances. When an HRegion is opened, it creates a Store instance for each HColumnFamily of the table. Each Store instance contains one or more StoreFile instances, which are lightweight wrappers around the actual data storage file, the HFile. Each Store has its own MemStore, and an HRegionServer shares a single HLog instance.

The basic lookup process:

A. The client asks Zookeeper for the name of the region server holding -ROOT-.

B. It queries the -ROOT- region server for the name of the region server holding the relevant .META. region.

C. It queries the .META. server for the name of the region server hosting the region that contains the row key being looked up.

D. It fetches the data from the region server hosting that row key.

The HFile structure:

The Trailer holds pointers to the other blocks; Index blocks record the offsets of the Data and Meta blocks; Data and Meta blocks store the data. The default block size is 64KB. Each block contains a Magic header and a number of serialized KeyValue instances.

KeyValue format:

The structure begins with two fixed-length numbers giving the key length and the value length. The key includes the row key, the column family name, the column qualifier, the timestamp, and so on.

Write-ahead log (WAL):

Each update is written to the log, and only a successful log write notifies the client that the operation succeeded; the server is then free to batch or aggregate the data in memory as needed.

How an edit flows to both the MemStore and the WAL:

Handling: the client sends KeyValue instances to the HRegionServer hosting the matching region through an RPC call. The instances are routed to the HRegion instance that manages the corresponding rows; the data is first written to the WAL and then placed into the MemStore of the Store that owns the record's storage files. When the data in the MemStore reaches a certain size, it is flushed asynchronously to the file system; the WAL guarantees that data is not lost during this process.

2. HBase and MapReduce

There are three ways to access HBase from MapReduce applications:

HBase can be used as the data source at the beginning of the job, HBase can be used to receive data at the end of the job, and HBase can be used to share resources during the task.

Use HBase as the data source

The map stage:

protected void map(ImmutableBytesWritable rowkey, Result result, Context context) {
}

Jobs reading from an HBase table receive key-value pairs of types ImmutableBytesWritable and Result, in [rowkey : scan result] form.

Create a Scan instance to scan all the rows in the table:

Scan scan = new Scan()
scan.addColumn(…)

Next, use the Scan instance in MapReduce.

TableMapReduceUtil.initTableMapperJob(tablename, scan, map.class,
    outputKeyType.class, outputValueType.class, job)
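As a sketch of how these pieces fit together, here is a word-count-style mapper reading the mytable table from the earlier shell example. The hb:data column, the output types, the class names, and the job wiring are illustrative, and the 0.9x-era client API used in this article is assumed:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Emits (value of hb:data, 1) for every row the scan delivers.
public class SourceMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable rowkey, Result result, Context context)
            throws IOException, InterruptedException {
        byte[] value = result.getValue(Bytes.toBytes("hb"), Bytes.toBytes("data"));
        if (value != null) {
            context.write(new Text(Bytes.toString(value)), ONE);
        }
    }

    // Job wiring, called from the driver:
    public static void configure(Job job) throws IOException {
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("hb"), Bytes.toBytes("data"));
        TableMapReduceUtil.initTableMapperJob("mytable", scan, SourceMapper.class,
                Text.class, IntWritable.class, job);
    }
}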

Use HBase to receive data

The reduce stage:

protected void reduce(ImmutableBytesWritable rowkey, Iterable<…> values, Context context) {
}

Register the reducer in the job configuration:

TableMapReduceUtil.initTableReducerJob(tablename, reduce.class, job)
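A matching sketch for the reduce side, writing the summed counts back to a hypothetical counts table (the table name, the family hb, and the qualifier total are illustrative; it pairs with the mapper sketch above):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Sums the counts for each key and writes one row per key back to HBase.
public class SinkReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));             // row key = mapper output key
        put.add(Bytes.toBytes("hb"), Bytes.toBytes("total"), Bytes.toBytes(sum));
        context.write(null, put);                                     // HBase takes the row from the Put
    }
}

// Wiring: TableMapReduceUtil.initTableReducerJob("counts", SinkReducer.class, job);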

3. How HBase achieves reliability and availability

As the underlying storage, HDFS provides a single namespace for all RegionServers in the cluster, so data written by one RegionServer can be read by all others. If a RegionServer fails, any other RegionServer can read the data from the underlying file system and, based on the HFiles stored in HDFS, take over serving the regions of the failed RegionServer.

IV. Optimizing HBase

1. Random read intensive

Optimization direction: use the cache efficiently and index more effectively.

Increase the percentage of the heap used by the block cache, configured through the parameter hfile.block.cache.size.

Reduce the percentage occupied by the MemStores, adjusted via hbase.regionserver.global.memstore.lowerLimit and hbase.regionserver.global.memstore.upperLimit.

Use smaller data blocks to make the index more granular.

Turn on Bloom filters to reduce the number of HFiles that must be read to find the KeyValue objects of a given row.

Configure aggressive caching to improve random-read performance.

Turn off block caching for column families that are not used for random reads, to improve the cache hit ratio. A sketch of these settings follows this list.
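A sketch of where these knobs might be set. The cluster parameters normally live in hbase-site.xml and the schema settings are applied when a table is created or altered; they are shown here programmatically, with made-up values and the hb family from earlier, against the older client API (descriptor method names vary between HBase versions):

// Cluster-level settings (illustrative values)
Configuration conf = HBaseConfiguration.create();
conf.setFloat("hfile.block.cache.size", 0.4f);                          // more heap for the BlockCache
conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.25f);  // less heap for MemStores
conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.2f);

// Per-column-family schema settings (old-style descriptor, matching this article's API)
HColumnDescriptor family = new HColumnDescriptor("hb");
family.setBlocksize(8 * 1024);                        // smaller blocks, finer-grained index
family.setBloomFilterType(StoreFile.BloomType.ROW);   // Bloom filter on row keys
family.setInMemory(true);                             // aggressive caching for a hot family
// ...or, for a family not used in random reads:
// family.setBlockCacheEnabled(false);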

2. Sequential read intensive

Optimization direction: reduce the use of caches.

Increase the data block size so that each disk seek retrieves more data.

Set a high scanner-caching value so that the scanner fetches more rows per RPC call when performing large sequential reads. The parameter hbase.client.scanner.caching defines the number of rows retrieved when the next method is called on a scanner.

Turn off caching of data blocks for the scan to avoid churning the cache, via Scan.setCacheBlocks(false).

Turn off block caching on the table so the cache is not churned by every scan. A short sketch follows.
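For example (the caching value is illustrative):

Scan scan = new Scan();
scan.setCaching(1000);       // rows fetched per RPC; overrides hbase.client.scanner.caching
scan.setCacheBlocks(false);  // keep one-off sequential reads out of the BlockCache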

3. Write-intensive

Optimization direction: avoid writing, compacting, or splitting too often.

Increase the maximum size of the underlying storage file (HStoreFile). Larger regions mean fewer splits during writes. Set via the parameter hbase.hregion.max.filesize.

Increase the MemStore size, adjusted via the parameter hbase.hregion.memstore.flush.size. The more data flushed to HDFS at once, the larger the HFiles produced, which reduces the number of files generated during writes and therefore the number of compactions.

Increase the proportion of heap allocated to the MemStores on each RegionServer. Set upperLimit to roughly the per-region MemStore size multiplied by the expected number of regions per RegionServer.

Garbage collection optimization, set in the hbase-env.sh file. A reasonable initial value is -Xmx8g -Xms8g -Xmn128m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70.

Turn on the MemStore-Local Allocation Buffer (MSLAB) feature, which helps prevent heap fragmentation. Set via the parameter hbase.hregion.memstore.mslab.enabled. A sketch of these settings follows.
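A sketch with made-up values; the parameters normally go in hbase-site.xml and the JVM flags in hbase-env.sh:

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024);   // 10 GB regions, fewer splits
conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);  // bigger flushes, larger HFiles
conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);          // MSLAB against heap fragmentation

// In hbase-env.sh:
// export HBASE_OPTS="-Xmx8g -Xms8g -Xmn128m -XX:+UseParNewGC \
//     -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"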

4. Mixed type

Optimization direction: try different combinations repeatedly and run tests to find the best configuration.

Other factors that affect performance include:

Compression: reduces the IO pressure on the cluster.

Good row key design.

Trigger large compactions manually when the expected cluster load is at its lowest.

Optimize the RegionServer handler count.

Thank you for reading! This concludes this article on "What are the core knowledge points of HBase". I hope the above content is helpful and leaves you with plenty to learn; if you found the article good, feel free to share it for more people to see!
