This article explains the principles behind HBase hot and cold data separation. The editor finds it very practical and shares it here in the hope that you get something useful out of it.
Preface
HBase is a popular distributed database for massive data storage, and storing massive data inevitably raises the question of cost: how do we reduce it? A common approach is to manage data with hot and cold separation. Cold data can use a higher-compression algorithm (such as ZSTD), a lower-redundancy encoding (Erasure Coding), and cheaper storage devices (HDD, high-density storage models).
Common solutions for HBase hot and cold separation
1. Active and standby clusters
The standby (cold) cluster uses cheaper hardware, and the main cluster sets a TTL, so that as data cools down, the cold data naturally remains only in the cold cluster.
Advantages: the solution is simple and works with off-the-shelf kernel versions.
Disadvantages: high maintenance overhead, and the cold cluster's CPU is largely wasted.
Without kernel changes, this is basically the only option for HBase 1.x.
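A minimal sketch of the TTL half of this setup, shown with the HBase 2.x Java client API for brevity (the equivalent HColumnDescriptor call exists in 1.x); the table name, family name, and 30-day TTL are illustrative:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MainClusterTtlExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Keep only ~30 days of data on the (hot) main cluster; replication is assumed
                // to have shipped every write to the cheaper standby (cold) cluster already.
                admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("hot_table"))
                        .setColumnFamily(
                            ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                                .setTimeToLive(30 * 24 * 3600)   // TTL in seconds
                                .build())
                        .build());
            }
        }
    }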
2. HDFS Archival Storage + HBase CF-level Storage Policy
This requires HBase 2.x or later. It combines HDFS's tiered storage capability with storage policies set per table (at the column-family level), so that different tables in the same cluster can have their data separated into hot and cold tiers.
Advantages: hot and cold separation within a single cluster, lower maintenance overhead, and more flexible per-table policies for different workloads.
Disadvantages: disk allocation is a big problem. Different workloads have different hot/cold ratios, which makes them hard to consolidate, and once the business changes the cluster's hardware configuration cannot be adjusted.
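As a rough sketch of the column-family half of this solution (assuming HBase 2.x, where the CF-level storage policy attribute from HBASE-14061 is available; the HDFS side additionally needs DataNode volumes tagged as [ARCHIVE] in dfs.datanode.data.dir, and the table and family names are made up):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CfStoragePolicyExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // HFiles of this family are written to HDFS under the COLD policy,
                // i.e. onto the DataNode volumes tagged as ARCHIVE storage.
                admin.modifyColumnFamily(
                    TableName.valueOf("cold_table"),
                    ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                        .setStoragePolicy("COLD")
                        .build());
            }
        }
    }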
Cloud HBase hot and cold separation solution
Neither of the above solutions is ideal on the cloud. The first needs little discussion: asking customers to run two clusters cannot reduce cost at all for customers with small amounts of data. As for the second, with tens of millions of customers on the cloud running all kinds of workloads, it is very hard to tailor disk configurations to the right proportions for each of them.
To be truly cloud native, a single cluster has to support extreme elastic scaling before the feature can really go into production. The only low-cost, elastic storage on the cloud is OSS, so it is natural to arrive at an architecture that puts cold data on OSS.
The most direct way to implement such an architecture is to modify the HBase kernel itself: 1) tag the data of cold tables, and 2) add an I/O path that writes to OSS according to those tags.
The drawback of this is obvious: it is very hard for the surrounding systems (such as backup/restore and data import/export) to stay compatible with those changes. They would all need to know which cold files must be read from which OSS location and which hot files must be read from the HDFS deployed on cloud disks. That is essentially repeated work, so from an architectural point of view a layer has to be abstracted out: a layer that can read and write HDFS files, read and write OSS files, and knows which files are hot and which are cold. This layer is my final design, ApsaraDB FileSystem, which implements the Hadoop FileSystem API. HBase, backup/restore, data import/export and the other systems gain hot/cold separation simply by swapping this FileSystem in for their original implementation.
The following sections elaborate on the details and difficulties of this FileSystem design.
Core difficulty A of the ApsaraDB FileSystem design
1. OSS is not a file system
OSS is not really a file system; it is just a two-level bucket/object mapping, i.e. object storage. Suppose you see a file like this on OSS:
/root/user/gzh/file. You might think there are three levels of directories plus one file, but there is actually only a single object whose key happens to contain the / character.
One consequence is that if you want to simulate a file system on top of OSS, you first have to be able to create directories. The natural approach is to use special objects whose keys end with / as directory objects, which is exactly what the Hadoop community's open-source OssFileSystem does. With this method you can check whether a directory exists and whether a file may be created under it; otherwise files would be created out of thin air with no parent directory.
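To make the key-to-tree mapping concrete, here is a rough sketch of the directory-object trick in the spirit of the community OssFileSystem. The ObjectStoreClient interface below is a made-up stand-in, not the real OSS SDK:

    // Hypothetical minimal client; the real OSS SDK calls are named differently.
    interface ObjectStoreClient {
        void putObject(String bucket, String key, byte[] body);
        boolean objectExists(String bucket, String key);
    }

    class SimulatedDirectories {
        private final ObjectStoreClient oss;
        SimulatedDirectories(ObjectStoreClient oss) { this.oss = oss; }

        // "mkdir -p /root/user/gzh" becomes three tiny marker objects:
        //   root/   root/user/   root/user/gzh/
        // one HTTP request per level, which is why deep directories are expensive.
        void mkdirs(String bucket, String path) {
            StringBuilder prefix = new StringBuilder();
            for (String part : path.replaceAll("^/+", "").split("/")) {
                prefix.append(part).append('/');
                String marker = prefix.toString();
                if (!oss.objectExists(bucket, marker)) {
                    oss.putObject(bucket, marker, new byte[0]);   // empty "directory object"
                }
            }
        }

        // A file may only be created if its parent "directory object" already exists;
        // without this check, files would appear out of thin air with no parent directory.
        boolean canCreateFile(String bucket, String path) {
            int slash = path.lastIndexOf('/');
            String parentMarker = path.substring(0, slash + 1).replaceAll("^/+", "");
            return parentMarker.isEmpty() || oss.objectExists(bucket, parentMarker);
        }
    }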
Of course, problems remain. Besides the high overhead (creating a deep directory takes multiple HTTP requests to OSS), the most serious one is correctness. Imagine the following scenario:
Rename the directory /root/user/source to /root/user/target. Besides the directory itself, its subdirectories and the files inside them must change accordingly, e.g. /root/user/source/file => /root/user/target/file. This is easy to understand: a file system is a tree, and renaming a directory really means moving one subtree under another node. In the NameNode this is trivial to implement; only the tree structure changes.
But to do this rename on OSS, you have to iterate over /root/user/source and rename every directory object and file object underneath it, because there is no way to move a subtree in one step. The problem is: suppose the process crashes in the middle of the recursive traversal. Then half of the directories and files may have arrived at the target location while the other half have not. The rename is therefore not atomic, even though it should either succeed, with the whole directory's contents in the new place, or fail, with everything left where it was. This is a correctness problem, and it is risky for HBase, which relies on rename to move temporary data directories to their official locations when committing data.
2. An OSS rename is actually a data copy
We mentioned rename above; in a normal file system it is supposed to be a lightweight metadata operation. But OSS has no rename operation at all. In practice a rename has to be composed from two operations, CopyObject and DeleteObject: first copy the object to the target name, then delete the original object. There are two obvious problems. One is that the deep copy is expensive and directly hurts HBase performance. The other is that rename is split into two operations that cannot be wrapped in one transaction, so the copy may succeed while the delete fails. You then have to roll back by deleting the copied object, but that delete may fail as well. So even the correctness of a single-object rename is hard to guarantee.
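The fragility is easy to see in a sketch. The interface below is again a hypothetical stand-in (listKeys/copyObject/deleteObject are illustrative names, not the real SDK); the crash points marked in comments are what break atomicity:

    import java.util.List;

    // Illustrative operations only; names do not match the real OSS SDK.
    interface ObjectStoreOps {
        List<String> listKeys(String bucket, String prefix);
        void copyObject(String bucket, String srcKey, String dstKey);
        void deleteObject(String bucket, String key);
    }

    class NonAtomicRename {
        private final ObjectStoreOps oss;
        NonAtomicRename(ObjectStoreOps oss) { this.oss = oss; }

        // "Renaming" one object is really copy + delete: two requests, no transaction.
        void renameObject(String bucket, String src, String dst) {
            oss.copyObject(bucket, src, dst);   // deep copy of the data: slow and billable
            // <-- a crash here leaves BOTH src and dst behind
            oss.deleteObject(bucket, src);      // may itself fail, and the rollback delete may fail too
        }

        // "Renaming" a directory is a full walk over every key under the prefix.
        void renameDirectory(String bucket, String srcPrefix, String dstPrefix) {
            for (String key : oss.listKeys(bucket, srcPrefix)) {
                // <-- a crash inside this loop leaves half the tree under the old prefix
                //     and half under the new one: the directory rename is not atomic
                renameObject(bucket, key, dstPrefix + key.substring(srcPrefix.length()));
            }
        }
    }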
Solving core difficulty A
To solve the above two problems, we need to manage the metadata ourselves, which amounts to maintaining our own file system tree while only data files are placed on OSS. Since HDFS is still present in our environment (to hold hot data), we simply reuse the NameNode code and let the NameNode manage the metadata. The overall architecture then looks like this:
ApsaraDB FileSystem (hereinafter ADB FS) divides cloud storage into primary storage (PrimaryStorageFileSystem) and cold storage (ColdStorageFileSystem). The ApsaraDistributedFileSystem class (hereinafter ADFS) manages these two storage file systems and is responsible for knowing which files are hot and which are cold.
ApsaraDistributedFileSystem (ADFS): the main entry point, responsible for managing cold storage and primary storage and for deciding where data should be read from and written to.
Primary storage: the default implementation of PrimaryStorageFileSystem is DistributedFileSystem (HDFS).
Cold storage: the default implementation of ColdStorageFileSystem is HBaseOssFileSystem (HOFS), an OSS-based Hadoop FileSystem. It can be used on its own, simulating directory objects, or purely as cold storage for reading and writing data. Compared with the community version it contains targeted optimizations, discussed later.
Concretely, how the NameNode helps manage the metadata of cold storage is simple: ADFS creates an index file with the same name on primary storage, and the content of that file is an index pointing to the corresponding file in cold storage. The actual data sits in cold storage, so it does not matter whether cold storage has any directory structure at all; flat, top-level files are enough. Looking back at the directory-rename scenario, the same logic applies: the rename now only has to happen in the NameNode. For hot files ADFS simply proxies every operation to HDFS; for cold files it creates the index file on HDFS, writes the data file to OSS, and then links the two.
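As a rough sketch of this dispatch, the read path could look like the class below. The field layout is illustrative, and isColdFile() and readIndexTarget() are hypothetical placeholders for the real hot/cold check and index-file format:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the dispatch idea only; field and method names are illustrative.
    abstract class ApsaraDistributedFileSystemSketch extends FileSystem {
        protected FileSystem primary;   // HDFS: real data for hot files, index files for cold files
        protected FileSystem cold;      // HBaseOssFileSystem: flat data files on OSS

        @Override
        public FSDataInputStream open(Path path, int bufferSize) throws IOException {
            if (isColdFile(path)) {
                // The same-named file on HDFS is only an index; its content is the
                // location of the real data file in cold storage.
                Path coldTarget = readIndexTarget(primary, path);
                return cold.open(coldTarget, bufferSize);
            }
            // Hot files are a straight proxy to HDFS.
            return primary.open(path, bufferSize);
        }

        // Hypothetical: in the real system the hot/cold mark comes from the NameNode's storage policy.
        protected abstract boolean isColdFile(Path path) throws IOException;

        // Hypothetical: parse the small index file and return the cold-storage path it points to.
        protected abstract Path readIndexTarget(FileSystem primary, Path indexFile) throws IOException;
    }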
Core difficulty B
Once this metadata management scheme is introduced, we run into a new problem: consistency between the index files and the data files in cold storage.
We may encounter the following scenarios:
The index file exists on primary storage, but the data file does not exist in cold storage.
The data file exists in cold storage, but the index file was never saved on primary storage.
The index file on primary storage is incomplete and cannot locate the data file in cold storage.
Leaving aside bugs and manual deletion of data files, all three cases above are caused by a program crash. In other words, we need the three operations of generating the index file, writing and finishing the cold data file, and associating the two to behave as a single transaction. Only then is the whole thing atomic, guaranteeing that either the cold file is created successfully, with complete index information pointing to an existing data file, or the creation fails (including a crash partway through) and this cold file is never visible at all.
Solve the core difficulty B
The core idea is to exploit the rename operation of primary storage, because a rename on primary storage (HDFS) is atomic. We first generate the index file in a temporary directory on primary storage, and the content of that index file already points to a path in cold storage (even though no data has actually been written to that path yet). After cold storage finishes writing and the file is closed correctly, we have a complete and correct index file and data file. Then we move the index file, via rename, to the target path the user actually wants to write to.
If the process crashes, either the index file has already been renamed successfully, or it is still sitting in the temporary directory. Anything in the temporary directory is treated as an incomplete, failed write, and a cleanup thread periodically removes files there that are older than N days. So once the rename succeeds, the index file on the target path is guaranteed to be complete and to point to a fully written data file.
Why write the path information into the index file first? Because if we wrote the data file first and crashed partway through, there would be no index information pointing at that data file, causing something like a "memory leak": an orphaned object left behind in cold storage.
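Putting this together, the cold-file write path can be sketched roughly as follows, using the standard Hadoop FileSystem API but with made-up paths and an illustrative index format (the index content is simply the cold-storage path as text):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the "index first, data second, rename last" commit protocol.
    class ColdFileCommitSketch {
        private final FileSystem primary;  // HDFS
        private final FileSystem cold;     // OSS-backed file system
        private final Path tempDir;        // e.g. a temp dir on HDFS, purged after N days

        ColdFileCommitSketch(FileSystem primary, FileSystem cold, Path tempDir) {
            this.primary = primary;
            this.cold = cold;
            this.tempDir = tempDir;
        }

        void createColdFile(Path target, byte[] data) throws IOException {
            Path coldData = new Path("/cold/" + UUID.randomUUID());                     // flat name on OSS
            Path tempIndex = new Path(tempDir, target.getName() + "." + System.nanoTime());

            // 1) Index first, in the temp dir, already pointing at the (not yet written) cold path.
            //    If we crash after this point, the orphan can still be found through the index.
            try (FSDataOutputStream out = primary.create(tempIndex)) {
                out.write(coldData.toString().getBytes(StandardCharsets.UTF_8));
            }

            // 2) Write and correctly close the real data file in cold storage.
            try (FSDataOutputStream out = cold.create(coldData)) {
                out.write(data);
            }

            // 3) Atomic commit: the HDFS rename either moves the index to the target path or it does not.
            //    Anything still sitting in tempDir after N days is treated as a failed write and cleaned up.
            if (!primary.rename(tempIndex, target)) {
                throw new IOException("commit of " + target + " failed");
            }
        }
    }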
Marking files as hot or cold
On the primary storage side we need a way to mark files as hot or cold and, based on that mark, decide which kind of read/write stream to open. The NameNode can do this by setting a StoragePolicy on the file. The process is simple, so I will not go into detail beyond the small example below; the write-path design of HBaseOssFileSystem, which did need real optimization, follows.
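For illustration only, a sketch of what such a flag could look like with the plain HDFS client API (assuming Hadoop 2.6+ for setStoragePolicy and a release where DistributedFileSystem#getStoragePolicy is available; the path and the choice of the COLD policy name are assumptions, not details given above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;

    public class ColdFlagExample {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS points at an HDFS cluster.
            DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

            Path file = new Path("/hbase/data/default/cold_table/some-hfile");   // illustrative path
            dfs.setStoragePolicy(file, "COLD");                                  // the NameNode-side "cold" mark

            // On open, the file system reads the mark back and picks the HDFS or OSS stream.
            BlockStoragePolicy policy = dfs.getStoragePolicy(file);
            System.out.println(file + " cold=" + "COLD".equals(policy.getName()));
        }
    }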
HBaseOssFileSystem write optimization
Before discussing the HOFS write design, we need to understand the write design of the Hadoop community's OssFileSystem (the version community users can use directly).
Community write design: write -> OutputStream -> disk buffer (128 MB) -> FileInputStream -> OSS
The flow writes to disk first; once 128 MB has accumulated, that 128 MB block is wrapped in a FileInputStream and submitted to OSS. This design is mainly about the cost of OSS requests: OSS charges per request while intranet traffic is free, so writing 1 KB at a time would be very expensive and you have to write in large blocks. Moreover, for large files OSS allows at most 10,000 parts per upload (MultipartUpload in OSS), so if each part is too small the maximum file size you can support is also limited (with 128 MB parts, 10,000 parts allow files of roughly 1.2 TB).
So a large buffer has to be accumulated. Another reason is the API mismatch: the Hadoop FS API hands the user an OutputStream to keep writing into, whereas OSS wants an InputStream from which it can keep reading whatever you want to upload. Converting between the two requires a buffer.
The big problem here is that performance is poor. Writing to disk and then reading it back adds extra round trips and is slow, even though with the PageCache the read does not necessarily hit the disk. The obvious thought is to use memory as the buffer instead, but as mentioned above the buffer cannot be too small because of request costs, so you cannot afford 128 MB of memory per file. Worse, to keep accepting new writes while a block is being submitted to OSS you would need two 128 MB buffers rotating per file, which is simply unacceptable.
HBaseOssFileSystem write design
We have to solve both the cost problem and the performance problem while keeping memory usage low, which sounds impossible. So how is it done?
The thing to exploit is exactly that InputStream: OSS lets you hand it an InputStream and reads whatever you want to upload from it. So we can design a streaming write: when the InputStream is handed to OSS, it does not have to contain any data yet. OSS calls read to fetch data and simply blocks on the read call. Once the user actually writes data, the InputStream has data again and OSS can keep reading. When OSS has read 128 MB from the stream, the InputStream truncates itself and returns EOF, so OSS considers the stream finished and that part of the data is committed.
So in essence we only need to build such a special InputStream. The user writes data into the OutputStream provided through the Hadoop API, and every time a page (2 MB) fills up it is handed over to the InputStream and becomes readable. The OutputStream acts as the producer and the InputStream as the consumer. The memory overhead is very low: when the producer and the consumer run at similar speeds, only about two pages are in flight at any time. The whole implementation is wrapped in an OSSOutputStream class; when the user writes a cold file, what they actually get is an OSSOutputStream, which internally drives this special InputStream.
In production we additionally cap the number of pages at four per file, and those four pages are recycled to reduce GC pressure.
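A heavily simplified sketch of this producer/consumer pair is shown below. Only the 2 MB page size and 128 MB part size come from the design above; the class names, the poison-pill EOF marker, and allocating fresh pages instead of recycling them are simplifications:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal sketch: the producer is the OutputStream the user writes into through the
    // Hadoop API; the consumer is the InputStream handed to the OSS upload call.
    class PagePipeSketch {
        static final int PAGE_SIZE = 2 * 1024 * 1024;              // 2 MB pages
        static final long PART_LIMIT = 128L * 1024 * 1024;         // report EOF after each 128 MB part
        private static final byte[] END_OF_STREAM = new byte[0];   // poison pill

        // Bounded queue: with producer and consumer at similar speeds only ~2 pages are alive
        // (the real implementation caps this at 4 recycled pages per file).
        private final BlockingQueue<byte[]> pages = new ArrayBlockingQueue<>(4);

        final OutputStream producer = new OutputStream() {
            private byte[] page = new byte[PAGE_SIZE];
            private int filled = 0;

            @Override public void write(int b) throws IOException {
                page[filled++] = (byte) b;
                if (filled == PAGE_SIZE) flushPage();
            }

            @Override public void close() throws IOException {
                if (filled > 0) flushPage();
                enqueue(END_OF_STREAM);
            }

            private void flushPage() throws IOException {
                enqueue(filled == PAGE_SIZE ? page : Arrays.copyOf(page, filled));
                page = new byte[PAGE_SIZE];
                filled = 0;
            }

            private void enqueue(byte[] p) throws IOException {
                try {
                    pages.put(p);                                  // blocks when the consumer lags behind
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException(e);
                }
            }
        };

        final InputStream consumer = new InputStream() {
            private byte[] current = new byte[0];
            private int pos = 0;
            private long readInPart = 0;
            private boolean finished = false;

            @Override public int read() throws IOException {
                if (finished) return -1;
                if (readInPart >= PART_LIMIT) {                    // truncate: this 128 MB part is done,
                    readInPart = 0;                                // OSS commits it and a new part starts
                    return -1;
                }
                while (pos == current.length) {
                    byte[] next;
                    try {
                        next = pages.take();                       // blocks until the producer has data
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IOException(e);
                    }
                    if (next == END_OF_STREAM) { finished = true; return -1; }
                    current = next;
                    pos = 0;
                }
                readInPart++;
                return current[pos++] & 0xFF;
            }
        };
    }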
Performance comparison 1: community version vs Cloud HBase version
Because nothing has to be written to disk, write throughput is much higher than the community version's. The figure below shows test results on HBase 1.0: in scenarios with large KVs and high write pressure, measured throughput is close to double. In this comparison both the community version and the cloud HBase version were plugged in as the cold-storage implementation behind ADFS's rename path, so the deep-copy rename problem is avoided for both; using the community version directly, without ADFS, would be several times slower again.
Performance comparison 2: hot table vs cold table
The hot table's data is on cloud disks and the cold table's data is on OSS.
Thanks to the optimizations above, plus the fact that the cold table's WAL still goes to HDFS and that OSS is a much larger cluster than the HBase deployment (with a higher throughput ceiling), the cold table's HDFS only has to absorb the WAL write pressure. As a result the cold table's throughput is even slightly higher than the hot table's.
In any case, cold-table write performance being on par with hot-table performance is exactly what we want: it basically never slows users down when they load data. Otherwise, if using cold storage cost a lot of throughput, it would take more machines to compensate, and the feature would be pointless.
The above is how to analyze the principles of HBase hot and cold separation. The editor believes some of these points may well come up in everyday work, and hopes you have learned something more from this article; for more details, please follow the industry information channel.