
The Reading and Writing Process and Optimization Methods of HBase

2025-01-17 Update From: SLTechnology News&Howtos


This article explains the reading and writing process of HBase and methods for optimizing it. The approach described here is simple, fast, and practical; let's walk through it step by step.

The read and write process of HBase depends on its four major components: the client, ZooKeeper, HMaster, and HRegionServer.

Reads and writes in HBase are initiated by the client. The read path works as follows: given the table name and row key supplied by the user, the client first checks its local cache; on a miss, it queries ZooKeeper, which in HBase stores the address of the ROOT table. HBase has two important catalog tables, the ROOT table and the META table: the ROOT table records the region information of the META table, while the META table records the region information of user tables. Concretely, a META row key is composed of the name of the table the region belongs to, the region's start row key, and a timestamp. The info column family defines three columns: regioninfo stores the region's start and end row keys, server stores the address of the server hosting the region, and serverstartcode stores the start code of that RegionServer.

Because the META table is itself an ordinary HBase table, it splits into multiple meta regions as its data grows, and each meta region may be managed by a different RegionServer. A table is therefore needed to record the meta region information: this is the ROOT table, which holds exactly one region. Following the same logic, a table would in theory be needed to store the ROOT table's region information as well, but that would lead to an endless chain of catalog tables. The HBase developers reasoned that the ROOT table will never hold much data and thus will never split, so no further table is needed to store its region information.

The client obtains the address of the ROOT table from ZooKeeper, connects over RPC to the RegionServer hosting the ROOT table, and uses the ROOT table to locate the META table. It then constructs a META row key from the table name and row key supplied by the user and queries the META table with it. After obtaining the values of the regioninfo and server columns in the info column family, the client establishes a connection to the RegionServer given by the server column and submits the regioninfo data with its request.
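As a sketch of the META row key described above, the following hypothetical helper (HBase builds this key internally; the helper name and the plain comma-joined format are illustrative) composes the table name, the region's start row key, and a timestamp:

```java
// Sketch of how a META row key is composed, per the description above
// (a hypothetical helper; HBase builds this key internally). The key is
// the table name, the region's start row key, and a timestamp.
class MetaRowKey {
    static String metaRowKey(String tableName, String startRowKey, long timestamp) {
        return tableName + "," + startRowKey + "," + timestamp;
    }

    public static void main(String[] args) {
        // The client would use a key like this to look up the region in META.
        System.out.println(metaRowKey("user_table", "row0000", 1700000000000L));
        // -> user_table,row0000,1700000000000
    }
}
```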

On receiving the client's query request, the RegionServer first creates a RegionScanner, which locates the HRegion; the HRegion then creates a StoreScanner for each HStore, and each StoreScanner creates a MemStoreScanner, which checks the MemStore for the requested data. If the data is not in the MemStore, multiple StoreFileScanner objects are created, each querying a different HFile. If the data is found, it is returned; otherwise null is returned.

HBase write path: when the client performs a put, the data is routed to the appropriate HRegion. Within the HRegionServer, once the target HRegion is found, the data is written to the HLog and also to the MemStore of the corresponding HStore, where it is kept sorted by row key in memory. When the data in memory reaches a threshold, a flush is triggered, writing the MemStore contents out to a StoreFile. When the number of StoreFiles in HDFS reaches a threshold, a compact (merge) operation is triggered that merges the StoreFiles into a new StoreFile; during the merge, data is sorted by row key, versions are merged, and deleted data is removed. As compactions repeat, StoreFiles grow larger, and when a StoreFile reaches a threshold a Split operation is triggered: the current region is split into two new regions, the original region goes offline, and HMaster assigns the two new regions to the appropriate HRegionServers, diverting the load of the original region. In effect, HBase only ever appends data; both updates and deletes are resolved in the compact phase. The sign that a client write has succeeded is that the data is present in both the HLog and the MemStore.
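The write path above (WAL append, sorted MemStore, flush to StoreFile, compaction with version merging) can be illustrated with a toy in-memory sketch. This is not the real HBase implementation; the class, threshold, and data structures are stand-ins chosen only to mirror the steps just described:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy sketch of the write path described above (illustrative only, not the
// real HBase implementation): a put appends to a write-ahead log and inserts
// into a row-key-sorted MemStore; at a threshold the MemStore is flushed to
// an immutable "StoreFile"; compaction merges StoreFiles, with later writes
// winning (version merge).
class WritePathSketch {
    static final int FLUSH_THRESHOLD = 3;                     // toy threshold
    final List<String> hlog = new ArrayList<>();              // stands in for HLog
    final TreeMap<String, String> memStore = new TreeMap<>(); // sorted by row key
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();

    void put(String rowKey, String value) {
        hlog.add(rowKey + "=" + value);                  // 1. append to the WAL
        memStore.put(rowKey, value);                     // 2. insert into MemStore
        if (memStore.size() >= FLUSH_THRESHOLD) {        // 3. flush at threshold
            storeFiles.add(new TreeMap<>(memStore));     //    MemStore -> StoreFile
            memStore.clear();
        }
    }

    // "Major compaction": merge every StoreFile into one; because files are
    // merged oldest-first, newer versions of a row key overwrite older ones.
    TreeMap<String, String> compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> sf : storeFiles) merged.putAll(sf);
        storeFiles.clear();
        storeFiles.add(merged);
        return merged;
    }

    public static void main(String[] args) {
        WritePathSketch region = new WritePathSketch();
        for (String[] kv : new String[][] {{"r1", "a"}, {"r2", "b"}, {"r3", "c"},
                                           {"r1", "a2"}, {"r4", "d"}, {"r5", "e"}}) {
            region.put(kv[0], kv[1]);
        }
        System.out.println("StoreFiles before compact: " + region.storeFiles.size()); // 2
        System.out.println("r1 after compact: " + region.compact().get("r1"));        // a2
    }
}
```

Note how the update of r1 survives in both the WAL and a later StoreFile until compaction resolves the versions, matching the "HBase only ever appends" behavior described above.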

The HLog is written first, but writing the MemStore first would also be safe: the MemStore's MVCC (multi-version concurrency control) read point does not roll forward until the write completes, and a Scan cannot see a change before its MVCC point advances. So even if the MemStore already holds the data before it is written to the HLog, the client cannot query it.

Optimization of HBase

1. Optimize the length of row keys, column families, and column names. HBase stores data in blocks, and each block contains many key/value pairs; every key repeats the row key, column family, column name, and timestamp. Shortening the row key, column family, and column names therefore reduces the number of blocks; otherwise the overhead grows across RegionServers, regions, indexes, memory, and query scans.

2. Process data in batches. The client first saves put data to its client-side cache. HBase enables implicit flushing by default; when implicit flushing is disabled, put data is likewise held in the client cache and is not saved to the HRegion until the flush command is called.
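The client-side buffering just described can be sketched as follows. This is an illustrative stand-in for the HBase client's write buffer, not its actual implementation; the class and field names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the client-side write buffer described above (illustrative,
// not the real HBase client): with implicit flushing (autoFlush) enabled,
// every put is sent to the server immediately; with it disabled, puts pile
// up in the client buffer until flush() sends them as one batch.
class ClientBufferSketch {
    final boolean autoFlush;
    final List<String> buffer = new ArrayList<>();        // client-side cache
    final List<String> sentToServer = new ArrayList<>();  // stands in for HRegion

    ClientBufferSketch(boolean autoFlush) { this.autoFlush = autoFlush; }

    void put(String row) {
        buffer.add(row);
        if (autoFlush) flush();   // default behavior: send right away
    }

    void flush() {
        sentToServer.addAll(buffer);  // one batched round trip
        buffer.clear();
    }

    public static void main(String[] args) {
        ClientBufferSketch buffered = new ClientBufferSketch(false);
        buffered.put("r1");
        buffered.put("r2");
        System.out.println("sent before flush: " + buffered.sentToServer.size()); // 0
        buffered.flush();
        System.out.println("sent after flush: " + buffered.sentToServer.size());  // 2
    }
}
```

Batching trades per-put latency for fewer round trips, which is why disabling implicit flushing helps bulk loads (at the cost of losing buffered data if the client crashes before flushing).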

3. Query optimization:

* Set the scan cache (the number of rows fetched per RPC).

* Specify columns explicitly when querying.

* Close the ResultScanner immediately after use.

* Use filters or coprocessors in queries to reduce the amount of data returned.

* Cache frequently queried data, for example in Redis.

* Query through the HTable utility class.

* Use batch reads: HTable.get(List<Get>).

4. Write optimization:

* Disable the WAL, or, if the WAL is kept on, lengthen the interval at which the log is written to HDFS. (An optimization that risks data loss.)

* Set AutoFlush to false. (An optimization that risks data loss.)

* Pre-split regions. Two approaches are available: the RegionSplitter that ships with HBase, or a custom implementation; generally the built-in RegionSplitter is used.

* Write through the HTable utility class.

* Use batch writes: HTable.put(List<Put>).

5. Configuration optimization:

* Tune the number of handler threads on the RegionServer, but benchmark before settling on a value.

* Adjust the memory sizes of the BlockCache and the MemStore: increase the BlockCache for read-heavy workloads and the MemStore for write-heavy workloads. The sum of the two must not exceed 80% of a RegionServer's total memory. (Affects StoreFile flushes.)

* Adjust the limit on the number of StoreFiles that triggers a merge. If the threshold is too low, frequent merges will seriously hurt performance; if it is too high, queries slow down. (Affects StoreFile compaction.)

* Set the size of a single StoreFile to tune split behavior. (Affects StoreFile splits.)

6. Row key design (maximum key length: 64 KB):

* Because HBase sorts row keys in ascending byte order, keep row keys well distributed: avoid monotonically increasing keys to prevent region hotspots.

* Because HBase indexes only the row key, row keys must be unique.

* Because row keys are immutable, the design must satisfy business requirements from the start.

* Because the row key is stored redundantly with every cell, keep it as short as possible while meeting the requirements above.
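One common way to break up monotonically increasing keys, as warned against above, is salting. The sketch below is illustrative: the bucket count and the "bucket|key" format are assumptions, not an HBase convention:

```java
// Sketch of row key salting to avoid the region hotspots noted above:
// a monotonically increasing key is prefixed with a hash-derived bucket
// so consecutive writes spread across regions. The bucket count and the
// "bucket|key" format are illustrative choices, not an HBase convention.
class SaltedRowKey {
    static final int BUCKETS = 8;

    static String salt(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS); // always in [0, BUCKETS)
        return bucket + "|" + rowKey;
    }

    public static void main(String[] args) {
        // Sequential keys no longer sort next to each other in byte order.
        for (int i = 0; i < 4; i++) {
            System.out.println(salt("order" + i));
        }
    }
}
```

The trade-off: salting spreads writes, but a scan over a logical key range must now fan out across all buckets, so the bucket count should stay small.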

7. Column family design:

* Because HBase stores data by column family, with each column family's data kept in its own disk files, columns with similar properties should be placed in the same column family.

* Avoid too many column families per table; keep it to 1-2 where possible.

* If updates are frequent and only the latest data is needed, set the number of cell versions to 1; the default is 3 (VERSIONS).

* A time-to-live (TTL) can be set on cells.

* For workloads with many random reads, enable the Bloom filter.

CAP principle

Consistency, availability, and partition tolerance cannot all hold at the same time in a large-scale distributed service system.

C (consistency) means that once an update completes and returns to the client, all distributed nodes hold the same data at the same time.

A (availability) means that read and write operations always succeed.

P (partition tolerance) means that the failure of a node does not affect the operation of the service.

HBase supports only CP.

At this point, you should have a deeper understanding of the HBase reading and writing process and its optimization methods. You might as well try it out in practice.
