Thursday, 2019-03-07: a deep dive into why HBase read (scan) performance is low
Brief description:
Compared with the write path, reading data from HBase is a more complex operation, mainly for two reasons:
First, the whole HBase storage engine is built on an LSM-tree-like design, so a range query may involve multiple regions, multiple caches, and even multiple storage files (HFiles).
Second, update and delete operations in HBase are deliberately simple. An update does not modify the existing data; it writes a new cell and relies on the timestamp attribute to keep multiple versions. A delete does not physically remove the data; it inserts a tombstone cell marked "deleted", and the data is only truly removed when the system runs a major compaction asynchronously.
Obviously, this design greatly simplifies updating and deleting data, but it shackles the read path: reads must filter cells by version and also filter out cells that have been marked as deleted.
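To make this concrete, here is a minimal Java sketch, assuming a pre-existing table "t" with a column family "cf" (all names are illustrative only, not from the original text). It shows that an update is just a newer cell, a delete only writes a tombstone, and physical removal happens at major compaction.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsAndTombstones {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t"));   // "t" is a hypothetical table
             Admin admin = conn.getAdmin()) {

            byte[] row = Bytes.toBytes("r1");
            byte[] cf = Bytes.toBytes("cf");   // hypothetical column family
            byte[] q = Bytes.toBytes("q");

            // An "update" is just another cell with a newer timestamp; v1 is not overwritten.
            table.put(new Put(row).addColumn(cf, q, 1L, Bytes.toBytes("v1")));
            table.put(new Put(row).addColumn(cf, q, 2L, Bytes.toBytes("v2")));

            // A "delete" only writes a tombstone for the cell at timestamp 2; nothing is removed yet.
            table.delete(new Delete(row).addColumn(cf, q, 2L));

            // The read path must merge versions and skip tombstoned cells.
            Result r = table.get(new Get(row).setMaxVersions(3));
            System.out.println("cells still visible: " + r.size());

            // Physical removal of deleted cells only happens during a major compaction.
            admin.majorCompact(TableName.valueOf("t"));
        }
    }
}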
At a high level, the scan read path works like this:
1. When scan is issued, the client first checks its local result cache; if data is already there, it is returned to the caller directly.
2. If not, the client sends a next request to the server, which scans row by row through the BlockCache, the HFiles, and the MemStore, and returns once it has collected 100 rows.
3. The client caches those 100 rows in memory and hands them to the upper-level business one row at a time.
So 100 rows are fetched per round trip; once the client has consumed them, it fetches the next 100 rows, and so on until all the data has been read (a sketch of this client-side flow follows).
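A minimal Java sketch of this flow, again assuming the illustrative table "t" and family "cf"; setCaching(100) is the knob behind the "100 rows" described above.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SimpleScan {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t"))) {   // hypothetical table

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf"));   // hypothetical column family
            scan.setCaching(100);                  // rows fetched per next RPC (the "100 rows" above)

            try (ResultScanner scanner = table.getScanner(scan)) {
                // Each iteration takes a row from the client-side cache; when the cache is
                // empty, the client transparently issues another next RPC to the server.
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}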
The upper-layer business keeps pulling scan results piece by piece; in practice, when the data volume is large, the HBase client keeps sending next requests to the HBase server. Some readers may ask why scan is designed around multiple next requests. Personally, I think this is based on several considerations:
1. HBase stores very large amounts of data, so in many scenarios a single scan covers a lot of it. If the size of each response were not limited, it could strain network bandwidth and destabilize the whole cluster.
2. If the size of each response were not limited, the client-side cache could easily run out of memory (OOM).
3. If the size of each response were not limited, the server might spend so long scanning that the connection between client and server would time out.
The sketch below shows the client-side knobs that bound each next request.
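A minimal sketch of the standard Scan setters that enforce such limits; the family name "cf" is again illustrative, and each setter bounds a different dimension of what one next request can carry.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedScanConfig {
    // Returns a Scan whose per-next payload is bounded in rows, cells, and bytes.
    public static Scan boundedScan() {
        Scan scan = new Scan().addFamily(Bytes.toBytes("cf"));   // hypothetical column family
        scan.setCaching(100);                     // max rows returned per next RPC
        scan.setBatch(50);                        // max cells per Result (limits very wide rows)
        scan.setMaxResultSize(2L * 1024 * 1024);  // max bytes returned per next RPC
        return scan;
    }
}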
A batched get, by contrast, is grouped by target region, and the get requests in different groups are executed concurrently. Scan is not implemented this way.
In other words, scan is not a parallel operation; it works through the table sequentially. (A sketch of a batched get follows.)
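A minimal sketch of a batched get with the standard Table.get(List<Get>) API; the table "t", family "cf", and row keys are assumptions for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchGet {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t"))) {   // hypothetical table

            List<Get> gets = new ArrayList<>();
            for (String key : new String[] {"r1", "r2", "r3"}) {      // hypothetical row keys
                gets.add(new Get(Bytes.toBytes(key)).addFamily(Bytes.toBytes("cf")));
            }

            // The client groups these Gets by region and issues the groups concurrently;
            // one Result comes back per Get, in the same order as the request list.
            Result[] results = table.get(gets);
            for (Result r : results) {
                System.out.println(r.isEmpty() ? "(missing row)"
                        : Bytes.toString(r.getRow()) + ": " + r.size() + " cells");
            }
        }
    }
}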
So, from the client's point of view, total scan time = client processing time + server scanning time. Can this be optimized?
Summary:
Based on the analysis above, the efficiency of the scan API depends largely on how much data is scanned. In general, the scan API works well for OLTP workloads that scan small amounts of data; when it is used to scan large amounts of data, scan performance cannot always be effectively guaranteed.
This raises a question: HBase is a column store, so why is its scan performance so poor? Isn't column storage supposed to favor scan operations? Parquet is also a columnar format, yet its scans are excellent; isn't the performance gap down to how the data is organized? Kudu also uses an LSM-like data structure, yet it can approach Parquet's scanning speed (Kudu is purely columnar). A Kudu column also ends up spread across many files, but that does not seem to hurt its performance.
Summary:
HBase is not exactly column storage; more precisely, it is column-family storage. A table can define column families, each family can hold any number of columns, and the data of all columns in a family is stored together. We usually recommend no more than two column families, so each family inevitably contains many columns. In that sense HBase is not really columnar; it is closer to row storage. (A subtle point, and just a personal view.)
An HBase scan is essentially a series of random reads; it cannot scan sequentially the way HDFS (Parquet) can. Imagine fetching 10 million rows one get at a time: the performance will not be good. So why doesn't HBase support sequential scanning? Because HBase supports updates and the concept of multiple versions, and that is the crux; one could say that once updates and multiple versions are supported, scan performance will never be great. The reasoning goes like this: HBase uses an LSM-like structure, where data is first written to memory and flushed into a file once memory reaches a threshold, so a single column family ends up with many files. Because of updates and multiple versions, one piece of data may live in several of those files, so a lookup across files is needed to locate it.
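To make the column-family point above concrete, here is a minimal sketch using the HBase 2.x Admin API; the table name "t" and family name "cf" are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateSingleFamilyTable {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // One family only, per the "no more than 2 column families" recommendation above;
            // every column written under "cf" is physically stored together within that family.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("t"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build());
        }
    }
}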
Therefore the HBase architecture itself is not well suited to large-scale scans. For large scans, Parquet is the better choice: one common approach is to export the HBase table to Parquet on a regular schedule and run the big scans there (a sketch follows).
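A hedged sketch of one way to do such an export, using Spark with HBase's TableInputFormat; the table name "metrics", the column "cf:value", and the output path are all assumptions for illustration, not part of the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class HBaseToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hbase-to-parquet").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Read the HBase table through TableInputFormat (a full table scan on the HBase side).
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "metrics");   // hypothetical table name

        JavaPairRDD<ImmutableBytesWritable, Result> hbaseRdd =
            jsc.newAPIHadoopRDD(conf, TableInputFormat.class,
                                ImmutableBytesWritable.class, Result.class);

        // Flatten each row into (rowkey, value); "cf:value" is a hypothetical column.
        JavaRDD<Row> rows = hbaseRdd.map(t -> {
            byte[] v = t._2().getValue(Bytes.toBytes("cf"), Bytes.toBytes("value"));
            return RowFactory.create(Bytes.toString(t._1().copyBytes()),
                                     v == null ? null : Bytes.toString(v));
        });

        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("rowkey", DataTypes.StringType, false),
            DataTypes.createStructField("value", DataTypes.StringType, true)
        });

        // Write a columnar copy that large analytical scans can read sequentially.
        spark.createDataFrame(rows, schema)
             .write().mode("overwrite")
             .parquet("/warehouse/metrics_parquet");   // hypothetical output path

        spark.stop();
    }
}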
A further question:
Personally, I do not know Kudu very well, but a Kudu expert sits next to me. My understanding is as follows:
Kudu's scan performance does not actually reach Parquet's; it sits somewhere between HBase and HDFS (Parquet). Kudu scans faster than HBase because Kudu is purely columnar (as noted above), so a scan does not incur skip reads, whereas HBase may have to seek around; that is the essential difference. However, Kudu's scan performance is still not as good as Parquet's, because Kudu has an LSM structure: it must scan several files sequentially at the same time and merge them by comparing keys, whereas Parquet only needs to scan the data within a block sequentially. That is the difference between the two.
So, compared with Parquet, these two aspects are where HBase's scan falls short.
References:
HBase Internals: Data Read Path Analysis, http://hbasefly.com/2016/12/21/hbase-getorscan/
HBase Best Practices: A Tour of Scan Usage, http://hbasefly.com/2017/10/29/hbase-scan-3/