This article introduces Netty's memory management and its performance implications for HBase's offheap design.
HBase Offheap Status
As a popular distributed NoSQL database, HBase is widely used across companies, and many of its workloads, such as feed and advertising businesses, have very demanding throughput and latency requirements. To minimize the performance impact of Java GC, HBase 2.0 moved the two core paths, reading and writing, off the heap: objects are allocated directly from JVM offheap memory, which the GC does not reclaim and which users must therefore release explicitly. On the write path, the request packets sent by clients are allocated in the offheap region and stay there until the data has been successfully written to the WAL and the Memstore. The ConcurrentSkipListSet that backs the Memstore does not store Cell data directly but only references to Cells; the actual data is encoded into the multiple chunks of MSLAB, which makes the offheap memory easier to manage. Similarly, on the read path, BucketCache is consulted first, and on a cache miss the corresponding Block is read from the HFile. BucketCache, which occupies the most memory, is placed offheap. Once a Block is obtained, it is decoded into Cells and returned to the user. The whole process involves almost no object allocation on the heap.
However, recent performance tests at Xiaomi showed that a 100% Get workload is still seriously affected by Young GC. In the two charts posted on HBASE-21879, the p999 latency of Get operations tracks the G1 Young GC pauses almost exactly, at around 100 ms. In theory, after HBASE-11425 all memory allocations should be offheap, with almost no allocation on the heap. After carefully combing through the code, however, it turned out that Blocks read from HFiles are still copied into the heap first, and a DataBlock on the heap is not released until BucketCache's WriterThread asynchronously flushes it to offheap. In the disk-bound stress test, the data volume was large and the cache hit ratio therefore not high (~70%), so a large number of Block reads went to disk, a large number of young-generation objects were created on the heap, and Young GC pressure rose accordingly.
The direct way to eliminate this Young GC pressure is to read DataBlocks from the HFile straight into offheap memory. This gap had been left open mainly because HDFS did not support a ByteBuffer-based pread interface; HDFS-3246 was later opened to follow up on that. A further problem surfaced afterwards: a DataBlock read on the RPC path is placed into a temporary map called RamCache once it enters BucketCache, and as soon as a Block is in that map it can be hit by other RPCs, so the DataBlock cannot simply be freed when the originating RPC exits; whether RamCache still holds it must be considered too. A mechanism is therefore needed to track whether a piece of memory is still referenced by any RPC path or by RamCache, and the memory can be freed only when neither references it. The natural idea was a reference-counting mechanism for tracking ByteBuffers, and it turned out that Netty had already implemented this quite completely, which is what led to this look at Netty's memory management.
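To make the idea concrete, here is a minimal, hypothetical sketch of that tracking scheme (the class and method names are invented for illustration; they are not HBase's actual API): a block is freed only after every holder, each RPC path and the RamCache, has dropped its reference.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the block is freed only when no RPC path and
// no cache entry references it anymore.
class RefCountedBlock {
    private final AtomicInteger refCnt = new AtomicInteger(1); // creator's reference

    void retain() {                        // e.g. RamCache or another RPC takes a reference
        refCnt.incrementAndGet();
    }

    void release() {                       // e.g. an RPC finishes, or RamCache evicts the block
        if (refCnt.decrementAndGet() == 0) {
            freeOffheapMemory();           // safe: no RPC and no cache reference left
        }
    }

    private void freeOffheapMemory() { /* return the memory to the pool */ }
}
```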
Netty Memory Management Overview
As a high-performance piece of infrastructure, Netty keeps a great deal of memory offheap to minimize GC's impact on performance. Offheap memory is allocated and released by the programmer, and forgetting to release it, or releasing it prematurely, leads to leaks and corruption, so a good memory manager is very important. First of all, what makes a memory allocator "good"?
High concurrency and thread safety. A process generally shares one global memory allocator, so concurrent allocation and release from multiple threads must be both efficient and correct.
Efficient allocation and release of memory; this goes without saying.
Convenient tracking of the life cycle of allocated memory, making it easy to locate memory leaks.
High memory utilization. Some allocators reach a point where they cannot allocate any more memory even though plenty of free fragments remain. Finer-grained management is needed to achieve higher utilization.
Physical continuity of each allocated object, as far as possible. Suppose an allocator cannot find a single contiguous 70MB region: some allocators would stitch a 70MB allocation together from multiple fragments, but an appropriately designed algorithm can guarantee higher continuity and therefore better memory access efficiency.
To reduce contention when multiple threads request memory, Netty's PooledByteBufAllocator initializes one memory pool (arena) per processor by default, and each thread is hashed to a particular pool. Even under concurrent processing on multiple cores, each processor essentially uses its own pool, which alleviates the synchronization overhead caused by contention.
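As a minimal usage sketch (assuming Netty 4.x on the classpath), allocating from and releasing back to the pooled allocator looks like this:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class PooledAllocDemo {
    public static void main(String[] args) {
        // The shared pooled allocator; the number of arenas defaults to a
        // value derived from the number of available processors.
        PooledByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;

        ByteBuf buf = alloc.directBuffer(150); // offheap request, rounded up internally
        try {
            buf.writeLong(42L);
        } finally {
            buf.release(); // pooled offheap memory must be released explicitly
        }
    }
}
```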
Netty's memory management is designed quite elaborately. Memory is first divided into 16MB Chunks, and each Chunk is composed of 2048 pages of 8KB each. Every allocation request is rounded up to a power of two: a request for 150B, for example, actually allocates 256B. Furthermore, before a Page enters the Cache (described below), it can be occupied only once; after a Page has served a 256B allocation, subsequent requests are not placed in that Page but go looking for other, completely free Pages. One might ask whether this makes memory utilization far too low, since an 8KB Page is retired after a single 256B allocation. In fact it does not, because once the Page later enters the Cache, another 31 ByteBuffers of 256B can still be carved out of it.
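A minimal sketch of the power-of-two rounding described above (Netty's real logic lives in its arena code and additionally handles sizes below 512B in multiples of 16):

```java
public class Normalize {
    // Round a requested size up to the next power of two, e.g. 150 -> 256.
    static int normalizeCapacity(int reqCapacity) {
        int n = reqCapacity - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(normalizeCapacity(150)); // prints 256
    }
}
```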
Multiple Chunks form a ChunkList, and Chunks are grouped into ChunkLists of different levels according to their memory usage ratio (used memory / 16MB * 100%). In the figure, for example, there are six levels: the Chunks in q050 have usage in the interval [50, 100). As allocation proceeds, a Chunk in q050 may see its usage reach 100 and be moved into the q075 ChunkList. Since memory is constantly allocated and released, that Chunk's usage may later fall below 75 as objects are freed, in which case it is put back into the q050 ChunkList; equally, a later allocation may drive its usage back to 100, moving it into q100. One advantage of this design is that allocation requests fall on the more idle Chunks as much as possible, improving the efficiency of memory allocation.
Continuing the earlier example: object A requested 150B, which after power-of-two rounding became a 256B allocation. When object A is released, the Pages it occupied are released as well, and to improve memory efficiency Netty puts these Pages into the corresponding Cache. Since the Pages used by object A were carved into 256B elements, they go straight into a buffer pool called TinySubPagesCaches, shown in the figure. This pool is in fact a set of queues, each queue corresponding to a different element size; queue->32B, for instance, means that the Pages cached in that queue are carved into 32B elements, and a 32B request goes directly to that queue to find a Page that is not yet full. Note that the same Page in a queue can now be allocated from multiple times, but always in elements of the same size, so the low-utilization problem described earlier does not arise and utilization stays relatively high.
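A simplified sketch of these size-keyed cache queues (all names here are hypothetical; Netty's actual caches are per-thread and bounded):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: freed pages are cached in queues keyed by the
// element size the page was carved into (32B, 64B, ..., 256B, ...).
class SubPageCaches {
    private final Map<Integer, Deque<PageHandle>> queues = new HashMap<>();

    void onFree(PageHandle page) {
        // A page carved into 256B elements goes back to the 256B queue
        // and can serve further 256B requests, keeping utilization high.
        queues.computeIfAbsent(page.elemSize, k -> new ArrayDeque<>()).push(page);
    }

    PageHandle tryAllocate(int normalizedSize) {
        Deque<PageHandle> q = queues.get(normalizedSize);
        return (q == null || q.isEmpty()) ? null : q.pop(); // null -> fall through to the ChunkLists
    }

    static class PageHandle {
        final int elemSize;
        PageHandle(int elemSize) { this.elemSize = elemSize; }
    }
}
```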
The Cache is divided into three types according to the element size a Page is carved into (called elemSizeOfPage; a Page is split into 8KB/elemSizeOfPage equal blocks). Requests smaller than 512B try TinySubPagesCaches; requests smaller than 8KB try SmallSubPagesDirectCaches; and requests smaller than 16MB try NormalDirectCaches. If the corresponding Cache has no available memory, the request goes straight to the six ChunkLists below to find a Chunk; of course those Chunks may all be full, in which case a new Chunk can only be requested directly from offheap memory.
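A small dispatch sketch mirroring the thresholds in the text (the class is illustrative, not Netty's API):

```java
public class SizeClassDemo {
    enum SizeClass { TINY, SMALL, NORMAL, HUGE }

    // Thresholds follow the article: tiny < 512B, small < 8KB, normal < 16MB.
    static SizeClass classify(int normalizedSize) {
        if (normalizedSize < 512) return SizeClass.TINY;                // TinySubPagesCaches
        if (normalizedSize < 8 * 1024) return SizeClass.SMALL;          // SmallSubPagesDirectCaches
        if (normalizedSize < 16 * 1024 * 1024) return SizeClass.NORMAL; // NormalDirectCaches
        return SizeClass.HUGE; // allocated directly from offheap, not pooled
    }

    public static void main(String[] args) {
        System.out.println(classify(256));       // TINY
        System.out.println(classify(4096));      // SMALL
        System.out.println(classify(64 * 1024)); // NORMAL
    }
}
```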
Cache Coherence of Chunk Internal Allocation
The above basically covers how memory is requested from a Chunk. Overall, Netty's memory allocation is very fine-grained; algorithmically, both allocation/release efficiency and memory utilization are reasonably well guaranteed. Here is a brief explanation of how memory is allocated inside a Chunk.
One question: when 32KB is requested from a Chunk, how should the Chunk allocate Pages so that allocation is efficient and, at the same time, the user's memory access is efficient?
A simple idea is to split the 16MB Chunk into 2048 pages of 8KB and maintain them in a queue: when a Page is requested, it is dequeued, and when it is freed, it is enqueued again. This algorithm is O(1). The problem is that a 32KB object may end up scattered across four non-contiguous Pages, which hurts the user's memory access efficiency.
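For contrast, that naive queue allocator might look like the following (illustrative only; error handling for an empty pool is omitted):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Naive page allocator: O(1) per page, but a 32KB request may receive
// four page ids that are not adjacent in physical memory.
class QueuePageAllocator {
    private final Deque<Integer> freePages = new ArrayDeque<>();

    QueuePageAllocator(int numPages) {
        for (int i = 0; i < numPages; i++) freePages.add(i);
    }

    int[] allocatePages(int n) {
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) ids[i] = freePages.remove(); // may be scattered
        return ids;
    }

    void freePage(int id) {
        freePages.add(id);
    }
}
```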
Netty's intra-Chunk allocation algorithm takes both allocation/release efficiency and user memory access efficiency into account. The way to improve access efficiency is to make every allocation, whatever its size, land on a contiguous piece of physical memory; this property is called cache coherence here.
Take a look at Netty's algorithm design:
First, the 16MB Chunk is divided into 2048 pages of 8KB, which form the leaves of a complete binary tree (similar to a heap data structure) that can be maintained with an int[] map: map[1] is the root, map[2] the root's left child, map[3] its right child, and so on; map[2048] is the first leaf node, map[2049] the second, ..., and map[4095] the last. These 2048 leaf nodes correspond, in order, to the 2048 Pages.
This tree has the property that the Pages under any subtree are contiguous in physical memory. Allocating 32KB of physically contiguous memory therefore reduces to finding a subtree whose 4 Pages are all free, which solves the access-efficiency problem and guarantees the cache coherence property.
But how are allocation and release made efficient?
The idea is not especially difficult, but Netty's implementation is hard to read after all its bit-twiddling optimizations, hence the diagram. The essence is that for every node id of the complete binary tree, map[id] stores the depth of the root of the first completely free subtree, in level-order, within the subtree rooted at id. For example, in step 3 of the diagram, for id=2 the first completely free subtree encountered in level order is the one rooted at id=5, whose depth is 2, so map[2]=2.
With the concept of map[id] in hand, the diagram is not hard to follow. It shows a 64KB Chunk (8 Pages, corresponding to the 8 leaves at the bottom of the tree) being allocated 8KB, 32KB, and 16KB in turn, and how the map is maintained. Both allocating and releasing memory are O(log N), where N is the number of Pages; with N=2048 in Netty, the cost of allocation and release can be treated as essentially constant.
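Below is a condensed sketch of this buddy-style tree, modeled on Netty's PoolChunk but simplified (names and the byte[] bookkeeping are this sketch's own; Netty packs more state into its arrays). memoryMap[id] holds the depth of the shallowest completely free subtree under node id, and depthMap[id] is the node's own depth. For the 64KB example above, maxDepth = 3 (8 leaf pages).

```java
public class BuddyTree {
    private final byte[] memoryMap; // current shallowest-free-depth per node
    private final byte[] depthMap;  // each node's own depth (fixed)
    private final byte unusable;    // marker for a fully allocated subtree

    public BuddyTree(int maxDepth) {
        this.unusable = (byte) (maxDepth + 1);
        int nodes = 1 << (maxDepth + 1);       // node ids run 1 .. nodes-1
        memoryMap = new byte[nodes];
        depthMap = new byte[nodes];
        for (int id = 1; id < nodes; id++) {
            byte d = (byte) (31 - Integer.numberOfLeadingZeros(id));
            memoryMap[id] = depthMap[id] = d;  // everything starts completely free
        }
    }

    /** Allocate a fully free subtree at depth d, i.e. 2^(maxDepth-d) contiguous pages. */
    public int allocate(int d) {
        if (memoryMap[1] > d) return -1;       // no free run big enough anywhere
        int id = 1;
        while (depthMap[id] < d) {
            id <<= 1;                          // descend to the left child...
            if (memoryMap[id] > d) id ^= 1;    // ...or to its right sibling
        }
        memoryMap[id] = unusable;              // subtree is now fully allocated
        updateParents(id);
        return id;
    }

    public void free(int id) {
        memoryMap[id] = depthMap[id];          // subtree is fully free again
        updateParents(id);
    }

    // Re-derive each ancestor's value from its two children: if both halves
    // are completely free the parent is completely free; otherwise it exposes
    // the shallower (i.e. larger) free subtree of the two.
    private void updateParents(int id) {
        while (id > 1) {
            int parent = id >>> 1;
            byte left = memoryMap[parent << 1];
            byte right = memoryMap[(parent << 1) + 1];
            byte childDepth = (byte) (depthMap[parent] + 1);
            memoryMap[parent] = (left == childDepth && right == childDepth)
                    ? depthMap[parent]
                    : (byte) Math.min(left, right);
            id = parent;
        }
    }
}
```

Tracing the 64KB example: new BuddyTree(3), then allocate(3) returns node 8 (one page), allocate(1) returns node 3 (four contiguous pages), and allocate(2) returns node 5 (two contiguous pages), each allocation and free touching only one root-to-node path.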
Through this algorithm, Netty achieves both efficient allocation and release of multi-Page blocks within a Chunk and efficient user memory access.
Reference Counting and Memory Leak Checking
As mentioned above, Netty's ByteBuf tracks the life cycle of a piece of memory through reference counting: taking a reference does refCount++, dropping one does refCount--, and once refCount reaches 0 the memory can be reclaimed into the memory pool. The idea is simple; only thread safety needs some care.
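Reference counting in action (assuming Netty 4.x): the buffer returns to the pool only after every retain() has been matched by a release().

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class RefCountDemo {
    public static void main(String[] args) {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256); // refCnt == 1
        buf.retain();                    // refCnt == 2, e.g. a cache keeps a reference
        buf.release();                   // refCnt == 1, e.g. the RPC path finishes
        boolean freed = buf.release();   // refCnt == 0: memory goes back to the pool
        System.out.println(freed);       // true
    }
}
```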
In practice, however, even with reference counting it is easy to forget the explicit refCount-- (release) call, so Netty provides a tracker called ResourceLeakDetector. When enabled, every ByteBuf handed out enters the tracker and is removed from it when the ByteBuf is reclaimed; if the number of ByteBufs in the tracker at some point is found to be abnormally large, a memory leak is assumed. Turning this feature on inevitably costs performance, so it is not enabled in production, only when a memory leak is suspected.
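The detection level can be set programmatically, as below, or via the -Dio.netty.leakDetection.level system property:

```java
import io.netty.util.ResourceLeakDetector;

public class LeakDetectDemo {
    public static void main(String[] args) {
        // PARANOID tracks every allocation and is the most expensive level;
        // production typically runs DISABLED or the sampling SIMPLE level.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
        // ... allocate ByteBufs and deliberately skip release() to see leak reports ...
    }
}
```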
Summary
Netty's memory management is very fine-grained and offers plenty of inspiration for HBase's offheap design. HBase currently has at least three kinds of memory allocators:
The offheap memory allocator on the RPC path. Its implementation is relatively simple: it hands out Pages to objects in fixed 64KB units, and when nothing can be carved out of the offheap pool it falls back to allocating on the heap (see the sketch after this list).
The Memstore MSLAB memory allocator, whose core idea is not much different from the RPC allocator's; it should be possible to merge the two.
BucketAllocator on BucketCache.
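As a rough illustration of point 1, the fallback behavior might look like the following (class and method names are hypothetical; HBase's actual allocator also pools, recycles, and tracks its buffers):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of point 1: hand out fixed 64KB offheap pages,
// and fall back to a heap allocation when the pool is exhausted.
class RpcByteBufferPool {
    static final int PAGE_SIZE = 64 * 1024;
    private final ConcurrentLinkedQueue<ByteBuffer> freePages = new ConcurrentLinkedQueue<>();

    RpcByteBufferPool(int maxPages) {
        for (int i = 0; i < maxPages; i++) {
            freePages.add(ByteBuffer.allocateDirect(PAGE_SIZE));
        }
    }

    ByteBuffer allocate() {
        ByteBuffer page = freePages.poll();
        return page != null ? page : ByteBuffer.allocate(PAGE_SIZE); // heap fallback
    }

    void release(ByteBuffer page) {
        if (page.isDirect()) {   // heap fallbacks are simply left to the GC
            page.clear();
            freePages.add(page);
        }
    }
}
```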
For points 1 and 2, trying Netty's PooledByteBufAllocator in the future seems unproblematic; after all, Netty has done a great deal of optimization around multi-core concurrency, memory utilization, and cache coherence. BucketCache, however, can be backed by memory, SSD, or even HDD, so BucketAllocator works at a higher level of abstraction and maintains (offset, len) tuples; Netty's existing interfaces cannot express that, so for now it will presumably stay as it is.
HBase 2.0 performance can be expected to keep moving in a better direction, especially as GC's impact on p999 latency shrinks further.
- end -
References:
https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/
https://netty.io/wiki/reference-counted-objects.html