This article explains why writing is faster than reading in HBase.
First of all, it needs to be clear that HBase writes faster than it reads, and the root cause is its LSM storage engine.
Analysis from the storage engine's point of view
The underlying storage engine of HBase is the LSM-Tree (Log-Structured Merge-Tree).
The core idea of LSM is to give up some read performance in exchange for maximized write performance. LSM-Tree stands for Log-Structured Merge-Tree, and its core idea is actually very simple: assume memory is large enough that you do not need to write data to disk every time it is updated. Instead, keep the latest data in memory until enough has accumulated, then use a merge sort to append the in-memory data to the end of the files on disk (because all of the sorted runs are ordered, they can be merged together quickly).
The design idea of the LSM tree is therefore straightforward: keep incremental modifications in memory, and write these modifications to disk in batches once they reach a specified size limit. Reads become a little more troublesome, because they must merge the historical data on disk with the most recent modifications in memory: a read first checks whether it hits memory, and otherwise has to visit one or more disk files. Write performance is thus greatly improved at the cost of reads. In the extreme case, the write performance of LSM-based HBase is an order of magnitude higher than MySQL's, and its read performance an order of magnitude lower.
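To make this trade-off concrete, here is a minimal LSM sketch in Java. All class and field names are invented for illustration, and real HBase is far more involved; the point is only that a write touches a sorted in-memory buffer, while a read may have to consult that buffer plus every flushed run:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Minimal LSM sketch (illustrative only, not HBase's implementation):
// writes go to a sorted in-memory buffer; a full buffer is "flushed"
// to an immutable sorted run; reads check memory first, then runs
// from newest to oldest.
public class TinyLsm {
    private final int flushThreshold;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> runs = new ArrayList<>(); // stand-ins for on-disk files

    public TinyLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String key, String value) {
        memtable.put(key, value);            // O(log n), no disk I/O on the write path
        if (memtable.size() >= flushThreshold) {
            runs.add(memtable);              // flush: the sorted buffer becomes an immutable run
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);        // newest data first
        if (v != null) return v;
        for (int i = runs.size() - 1; i >= 0; i--) {  // then runs, newest to oldest
            v = runs.get(i).get(key);
            if (v != null) return v;
        }
        return null;                          // a read may have to consult every run
    }
}
```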
The LSM tree principle divides one big tree into N small trees. Data is first written into a small tree in memory; as that tree grows, it is flushed to disk, and the trees on disk are periodically merged into one big tree to optimize read performance.
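That periodic merge (compaction) is essentially a k-way merge of sorted runs. Here is a hedged sketch, again with invented names, using a priority queue of cursors so that for duplicate keys the newest run wins:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Illustrative compaction sketch: k sorted runs are merged into one
// large sorted run via a k-way merge. HBase applies the same idea to
// HFiles; this is not its actual code.
public class CompactionSketch {

    // One cursor per run; runIndex breaks ties so the newest run wins.
    private record Cursor(Iterator<Map.Entry<String, String>> it,
                          Map.Entry<String, String> head, int runIndex) {}

    public static TreeMap<String, String> compact(List<TreeMap<String, String>> runs) {
        PriorityQueue<Cursor> pq = new PriorityQueue<>((a, b) -> {
            int c = a.head().getKey().compareTo(b.head().getKey());
            return c != 0 ? c : Integer.compare(b.runIndex(), a.runIndex()); // newer run first
        });
        for (int i = 0; i < runs.size(); i++) {
            Iterator<Map.Entry<String, String>> it = runs.get(i).entrySet().iterator();
            if (it.hasNext()) pq.add(new Cursor(it, it.next(), i));
        }
        TreeMap<String, String> merged = new TreeMap<>();
        while (!pq.isEmpty()) {
            Cursor c = pq.poll();
            merged.putIfAbsent(c.head().getKey(), c.head().getValue()); // first (newest) wins
            if (c.it().hasNext()) pq.add(new Cursor(c.it(), c.it().next(), c.runIndex()));
        }
        return merged;
    }
}
```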
Why are HBase reads fast?
The main reason HBase can provide real-time query service is its architecture and underlying data structure: LSM-Tree (Log-Structured Merge-Tree) + HTable (region partitioning) + Cache. The client can directly locate the HRegionServer that holds the data to be queried and then search for a match within a single region on that server, with part of the data served from cache.
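As a sketch of that routing step (names here are invented; the real client caches region locations fetched from the hbase:meta table), regions partition the sorted rowkey space into non-overlapping ranges, so the owning region can be found with a single floor lookup on the region start keys:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of client-side request routing: the region whose
// start key is the greatest key <= the rowkey owns that rowkey.
public class RegionRouterSketch {
    // region start key -> address of the RegionServer hosting that region
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String serverAddress) {
        regionsByStartKey.put(startKey, serverAddress);
    }

    public String locate(String rowkey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowkey);
        return e == null ? null : e.getValue();
    }
}
```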
As mentioned earlier, HBase keeps data in memory, and the in-memory data is sorted. When memory fills up, the data is written out to an HFile, whose contents are also sorted. Once the data has been written to an HFile, the in-memory copy is discarded.
HFile files are optimized for sequential disk reads and store data block by block. Multiple blocks are first accumulated in memory and then merged out to disk; each merge writes a new block, and eventually multiple blocks are merged into larger ones.
Why HBase reads and writes are fast
Many small files are produced after many flushes, and a background thread merges the small files into large ones, so that a disk search is limited to a small number of data storage files. HBase writes fast because a write does not actually go to a file immediately: it goes to memory first and is flushed asynchronously into an HFile later. From the client's point of view, writes are therefore very fast. In addition, random writes are converted into sequential writes on disk, so the write speed is also very stable.
Reads are fast because HBase uses an LSM-tree structure rather than a B-tree or B+-tree. Sequential disk reads are very fast, but seeking to a track is much slower. HBase's storage structure keeps the required disk seeks within a predictable range, and reading any number of records contiguous with the queried rowkey incurs no extra seek overhead. For example, with 5 store files, at most 5 disk seeks are needed. A relational database, by contrast, cannot bound the number of disk seeks even with indexes. Moreover, an HBase read first looks in the cache (BlockCache), which uses an LRU (least recently used) eviction policy. On a cache miss, it looks in the in-memory MemStore, and only if the data is found in neither place does it load blocks from HFiles. As mentioned above, reading an HFile is also very fast, because it avoids seek costs.
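The LRU behavior described above can be sketched in a few lines with Java's LinkedHashMap in access order; HBase's real implementation (for example, LruBlockCache) is considerably more sophisticated:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU block cache: LinkedHashMap in access order evicts
// the least recently used entry once the cache exceeds maxBlocks.
public class LruBlockCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxBlocks;

    public LruBlockCacheSketch(int maxBlocks) {
        super(16, 0.75f, true);  // accessOrder = true gives LRU iteration order
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxBlocks;  // evict the least recently used block
    }
}
```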
For example:
A: Fast queries (reading data from disk). HBase queries by rowkey; as long as the rowkey can be located quickly, the query is fast, mainly thanks to the following factors:
1. An HBase table is divided into multiple regions, which you can think of as partitions in a relational database.
2. Rowkeys are stored in sorted order.
3. Data is stored by column family.
First of all, we can quickly find the region (partition) where the row lives. Suppose the table has 1 billion records occupying 1 TB of space and is divided into 500 regions, so each region occupies 2 GB; at most 2 GB must be read to find the record.
Secondly, storage is by column family. Suppose the region is divided into three column families, each about 666 MB. If the data being queried lives in one column family, that column family consists of one or more HStoreFiles; assuming each HStoreFile is 128 MB, the column family holds five HStoreFiles on disk, with the rest in memory.
Thirdly, the data is sorted, so the record you want may be near the front or the end. Assuming it is in the middle, we only need to scan about 2.5 HStoreFiles, roughly 320 MB in total.
Finally, each HStoreFile (a wrapper around HFile) stores key-value pairs, so we only need to scan the keys in each data block to find a match. Keys are generally short; assuming a key-to-value size ratio of 1:19 (and ignoring the other blocks in an HFile), only about 16 MB of keys must be read to locate the record. At a disk throughput of 100 MB/s, that takes only about 0.16 s. With the block caching mechanism (LRU) on top, efficiency is even higher.
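Spelled out in code, the back-of-the-envelope arithmetic above looks like this; every figure is an assumption carried over from the example, not a measurement:

```java
// Back-of-the-envelope estimate from the worked example above.
public class SeekCostEstimate {
    public static void main(String[] args) {
        double tableBytes = 1e12;                    // 1 TB table, ~1 billion records
        double regionBytes = tableBytes / 500;       // 500 regions -> 2 GB each
        double familyBytes = regionBytes / 3;        // 3 column families -> ~666 MB each
        double storeFileBytes = 128e6;               // one HStoreFile = 128 MB
        double scannedBytes = 2.5 * storeFileBytes;  // record in the middle of 5 files -> 2.5 files
        double keyBytes = scannedBytes / 20;         // key:value = 1:19 -> keys are 1/20 of the data
        double seconds = keyBytes / 100e6;           // sequential disk read at 100 MB/s

        System.out.printf("region=%.0f MB, family=%.0f MB, scanned=%.0f MB, keys=%.0f MB, time=%.2f s%n",
                regionBytes / 1e6, familyBytes / 1e6, scannedBytes / 1e6, keyBytes / 1e6, seconds);
        // Prints roughly: region=2000 MB, family=667 MB, scanned=320 MB, keys=16 MB, time=0.16 s
    }
}
```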
B: Real-time queries
A real-time query can be thought of as a query served from memory, with a typical response time of under one second. HBase's mechanism is to write data to memory first and write it to disk only once it reaches a certain size (such as 128 MB). In memory there are no update or merge operations, only appends, so a user's write returns as soon as it reaches memory, which guarantees HBase's high write I/O performance.
A real-time query responds based on the data at the current time; the most recent data can be assumed to be in memory, which ensures real-time responsiveness.
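From the client side, the behavior described above is just an ordinary put and get with the standard HBase Java client. The table name "t1", the column family "cf", and the values below are made up for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Standard HBase client usage; table "t1" with column family "cf" is
// assumed to exist on a reachable cluster.
public class PutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {

            // The put returns once the RegionServer has the data in memory;
            // flushing to an HFile happens asynchronously later.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            table.put(put);

            // The get is served from BlockCache or MemStore when possible,
            // falling back to HFile blocks on disk otherwise.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))));
        }
    }
}
```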
At this point, the study of why writing is faster than reading in HBase is complete. Pairing the theory above with hands-on practice is the best way to learn, so go and try it!