
Case Analysis of PolarDB Database Performance


This article mainly explains "PolarDB database performance case analysis". The content is straightforward and easy to learn and understand; please follow the editor's line of thought and work through "PolarDB database performance case analysis" step by step.

An overview of the competition questions

The competition is divided into two stages, a preliminary round and a semi-final, and requires implementing a simplified but efficient KV storage engine.

The preliminary round requires Write and Read interfaces:

public abstract void write(byte[] key, byte[] value);
public abstract byte[] read(byte[] key);

The semi-final requires an additional Range interface on top of the preliminary round:

public abstract void range(byte[] lower, byte[] upper, AbstractVisitor visitor);

The evaluation logic is divided into two stages. 1) Recover correctness evaluation: the evaluation program writes specific data concurrently (key 8B, value 4KB) and issues kill -9 at arbitrary points to simulate an unexpected process exit (the engine must guarantee that persisted data is not lost when the process exits unexpectedly); it then reopens the DB and calls the Read and Range interfaces to verify correctness.

2) Performance evaluation:

Random write: 64 threads write concurrently; each thread writes 1 million random entries using Write (key 8B, value 4KB).

Random read: 64 threads read concurrently; each thread reads 1 million random entries using Read.

Sequential read: 64 threads read concurrently; each thread uses Range to traverse the full data set twice in increasing order. Note: stage 2.2 checks whether every kv read back matches what was written; if not, the run is terminated and the evaluation fails. Stage 2.3 not only checks whether every iterated kv matches, but also checks that keys are strictly increasing in dictionary order; if not, the run is terminated and the evaluation fails.

Language restriction: C++ & Java, ranked together.

Analysis of competition questions

I have already covered some basics of file IO in a previous article; if you have not read it yet, I suggest taking a look at it first: "Some best practices for file IO operations". Back to the topic: first, let's interpret several key phrases in the problem statement.

3.1 Key 8B, value 4KB

The key is a fixed 8 bytes, so it can be represented by a long.

The value is 4KB, which saves us a lot of work, because an integer multiple of 4KB is very disk-IO-friendly.

Another benefit of the 4KB value is that when we build the in-memory index, we can use an int instead of a long to record the logical offset of the data: LogicOffset = PhysicalOffset / 4096, which halves the memory footprint of the offsets.
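To make this concrete, here is a minimal sketch (a hypothetical helper, not the contest code) of converting between physical and logical offsets; because every value is exactly 4 KB, an int logical offset comfortably covers the roughly 256 GB of data in the contest.

public class OffsetCodec {

    private static final int VALUE_SIZE = 4096;

    // physical offsets are always multiples of 4096, so the quotient fits in an int
    public static int toLogical(long physicalOffset) {
        return (int) (physicalOffset / VALUE_SIZE);
    }

    // widen back to long before multiplying to avoid int overflow
    public static long toPhysical(int logicalOffset) {
        return (long) logicalOffset * VALUE_SIZE;
    }
}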

3.2 kill -9 must not lose data

First of all, the problem clearly states that kill -9 will be issued and data consistency will be verified, which makes it harder to keep a write buffer purely in memory. But it does not require surviving a power failure, which indirectly tells us one thing: we can use PageCache as a write cache. In my code I used PageCache as the write buffer for both data and indexes (with different strategies for each). At the same time, this effectively limits contestants' use of asynchronous approaches such as AIO.

3.3 phased evaluation

The evaluation is divided into three stages: random write, random read, and sequential read. The DB is re-opened at each stage, and random reads never arrive in the middle of the random write phase, so we do not need to maintain an in-memory index while writing; we can write everything straight to disk. In both the random read and sequential read phases the data is already on disk, and the index needs to be rebuilt in the open phase, which can be done concurrently with multiple threads.

At the same time, there are some hidden evaluation details that are not disclosed, but we can learn them through testing.

3.4 time spent emptying PageCache

Although we may use PageCache, the evaluation program runs a script to drop the PageCache after each stage and counts that time in the final score. Some contestants found it strange that the total score was worse than the sum of the times reported for the three stages; those extra seconds are in fact the time spent dropping the PageCache.

# clean pagecache (page cache)
sysctl -w vm.drop_caches=1
# clean dentries (directory cache) and inodes
sysctl -w vm.drop_caches=2
# clean pagecache, dentries and inodes
sysctl -w vm.drop_caches=3

This tells us that we cannot use PageCache without restraint, and it is precisely because of this that, to some extent, Direct IO became the silver bullet of this competition.

3.5 Key distribution

This implicit condition is the key to the competition, because it drives the architectural design of the Range part. There are 64 million keys in total, and they are uniformly distributed. In "Some best practices for file IO operations" we already discussed the benefits of data partitioning, which greatly reduces lock contention in sequential reads and writes. The uniform key distribution suggests hashing on the high n bits of the key when partitioning, which guarantees global ordering between partitions (keys inside a partition remain unordered). In practice I tried splitting the data into 1024 and 2048 partitions, and both gave good results.

3.6 Caching design of Range

The competition requires 64 threads to Range the full data set twice within a one-hour time limit, which tells us it is impossible to finish within the hour without caching the data. Our architecture should therefore be designed around Range as much as possible, while still accommodating random writes and random reads. The Range part is also where the gap between contestants opens up the most.

4 detailed explanation of the architecture

First, it should be clear that random write means the keys arrive in random order, but we can convert random writes into sequential appends to the corresponding partition files by hashing on the key.

/**
 * Uses the high ten bits of the given key to determine which file it hits.
 */
public class HighTenPartitioner implements Partitionable {
    @Override
    public int getPartition(byte[] key) {
        return ((key[0] & 0xff) << 2) | ((key[1] & 0xff) >> 6);
    }
}

After clarifying the premise of partitioning on the high bits of the key, the overall structure becomes clear.

Global perspective

Partition perspective

Memory perspective

Only the sorted key[1024][62500] arrays and offset[1024][62500] arrays are maintained in memory.

The diagrams above explain the overall architecture well. Taking advantage of the uniform key distribution, we hash the global data into 1024 partitions, and each partition stores two kinds of files: an index file and a data file. In the random write phase, the partition for a piece of data is determined from its key, and the data is appended to the end of the partition's files in arrival order, converting global random writes into local sequential writes. Thanks to the one-to-one correspondence between index and data, we do not need to persist the logical offset of the data: in the recover phase we can derive each value's logical offset from the order in which the index entries are restored.

In the Range phase, because we partitioned on the high 10 bits of the key in advance, we know that any key in partition(N) is larger than any key in partition(N-1), so we can read an entire partition into memory for the 64 visit threads to consume. This sets the overall tone: disk-reading threads load partitions into memory one by one, while 64 visit threads consume that memory, accessing it in key order and invoking the Visitor callback.

Random writing process

Having introduced the overall structure, let's go through the detailed optimization points of each stage. Some optimizations apply to several stages; to avoid repetition, an optimization that appears a second time is only mentioned in passing.

Using PageCache as the write buffer

Let's look at flushing the data first and discuss the index afterwards. In a disk-IO competition, the first step is to measure the disk's IOPS and find out what write size, with how many threads, saturates the IO. With 64 write threads fixed, writes of 16KB or larger reach the ideal IOPS, so it is natural to allocate a write cache per partition and flush four values at a time. But in this competition, to guarantee that kill -9 loses no data, we cannot simply allocate a ByteBuffer.allocate(4096 * 4) on the heap; instead, we mmap a write buffer and flush once four values have accumulated, so that after kill -9 the buffered data still sits safely in PageCache. Flushing in 16KB batches measured about 6 s faster than flushing 4KB at a time.
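As an illustration, here is a minimal sketch of the idea, with assumed file names and a simplified flush path (the real engine flushes through Direct IO): four 4 KB values are staged in an mmap'ed buffer file and appended to the data file as one 16 KB write, so a kill -9 between flushes leaves the staged bytes in the file-backed PageCache.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PartitionWriter {

    private static final int VALUE_SIZE = 4096;
    private static final int BATCH = 4;                 // flush every 4 values, i.e. one 16 KB write

    private final FileChannel dataChannel;              // append-only data file of this partition
    private final MappedByteBuffer writeBuffer;         // file-backed, so it survives kill -9
    private int buffered = 0;

    public PartitionWriter(String dataPath, String bufferPath) throws IOException {
        this.dataChannel = new RandomAccessFile(dataPath, "rw").getChannel();
        FileChannel bufferChannel = new RandomAccessFile(bufferPath, "rw").getChannel();
        this.writeBuffer = bufferChannel.map(FileChannel.MapMode.READ_WRITE, 0, (long) VALUE_SIZE * BATCH);
    }

    public synchronized void write(byte[] value) throws IOException {
        writeBuffer.put(value);                          // staged in the mmap'ed buffer first
        if (++buffered == BATCH) {
            writeBuffer.flip();
            dataChannel.write(writeBuffer, dataChannel.size());  // one 16 KB append to the data file
            writeBuffer.clear();
            buffered = 0;
        }
        // a real engine would also persist `buffered` so recovery can replay a partial batch
    }
}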

The index file setup is less controversial: each key is a fixed 8 bytes, so mmap can play to its strength of writing small amounts of data while benefiting from PageCache. Measured, mmap is about 3 s faster than FileChannel for writing the index. I believe this gap would be even larger if the PolarDB disk were replaced with an ordinary SSD.

Do not maintain an in-memory index while writing; do not write the data offset

The problem statement was not clear to me at first, and I mistakenly thought random reads would immediately follow random writes; in fact the stages are independent, so there is no need to maintain an in-memory index while writing. Second, as shown in the architecture diagrams, there is no need to write the offset alongside the key into the index file. The recover phase can derive the logical offsets from the order in which the index entries are restored, because keys and data within the same partition correspond one to one.
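A minimal sketch of that correspondence, with an assumed file layout and pre-sized arrays: the i-th key read back from the append-only index file simply gets logical offset i, because the i-th value sits at physical offset i * 4096 in the data file.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class IndexRecovery {

    /** Fills keys/offsets from one partition's index file and returns the entry count. */
    public static int recover(String indexPath, long[] keys, int[] offsets) throws IOException {
        try (FileChannel ch = new RandomAccessFile(indexPath, "r").getChannel()) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) > 0) {
                // keep reading until the whole index file is in memory
            }
            buf.flip();
            int i = 0;
            while (buf.remaining() >= 8) {
                keys[i] = buf.getLong();   // keys were appended in write order
                offsets[i] = i;            // the i-th value sits at physical offset i * 4096
                i++;
            }
            return i;
        }
    }
}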

Recovery process

The logic of the recover phase lives in the engine's open interface: when the database engine starts, the index has to be rebuilt in memory from the data on disk. There are a few detailed optimization points here as well.

With 1024 partitions, we can use 64 threads (an empirical number) to rebuild the index concurrently, sort the key[1024][62500] and offset[1024][62500] arrays with quicksort, and then compact them to deduplicate keys. Note that you should not wrap key and offset together in an object; doing so makes sorting and the later binary search much less efficient, which comes down to CPU cache lines. Readers unfamiliar with this can refer to my earlier post "CPU Cache and cache lines".

// wrong
public class KeyOffset {
    long key;
    int offset;
}
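By contrast, here is a minimal sketch (a hypothetical helper, not the contest code) of keeping two primitive arrays and co-sorting them with a quicksort that swaps both; Long.compareUnsigned is used so that the long ordering matches byte-wise dictionary order for 8-byte keys.

public class KeyOffsetSorter {

    // sorts keys[lo..hi] and keeps offsets[] in step; no per-entry object is allocated
    public static void sort(long[] keys, int[] offsets, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        long pivot = keys[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {
            while (Long.compareUnsigned(keys[i], pivot) < 0) i++;
            while (Long.compareUnsigned(keys[j], pivot) > 0) j--;
            if (i <= j) {
                swap(keys, offsets, i++, j--);
            }
        }
        sort(keys, offsets, lo, j);
        sort(keys, offsets, i, hi);
    }

    private static void swap(long[] keys, int[] offsets, int a, int b) {
        long k = keys[a]; keys[a] = keys[b]; keys[b] = k;
        int o = offsets[a]; offsets[a] = offsets[b]; offsets[b] = o;
    }
}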

The whole recover phase takes about 1 s. After talking with the C++ contestants, I found my recovery was roughly 600 ms slower than theirs, which puzzled me: loading and sorting the index should not be that slow, and in the end I did not manage to close that gap.

Random reading process

The random read process does not have many optimization points, and the room for optimization is genuinely limited. The idea is to locate the partition from the key, find the key/offset pair in the sorted key array, obtain the value's logical offset and partition number, and then read it directly. There is not much to optimize here, yet it is still 2-3 s slower than the C++ contestants, which may be a gap the language simply cannot cross.
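A minimal sketch of that lookup, with assumed field names: binary search over the recovered, sorted keys (again comparing unsigned so the ordering matches dictionary order), then converting the logical offset back to a physical one.

public class PartitionIndex {

    private final long[] sortedKeys;    // recovered and sorted during open()
    private final int[] logicalOffsets;

    public PartitionIndex(long[] sortedKeys, int[] logicalOffsets) {
        this.sortedKeys = sortedKeys;
        this.logicalOffsets = logicalOffsets;
    }

    /** Returns the physical offset of the value, or -1 if the key is absent. */
    public long physicalOffsetOf(long key) {
        int lo = 0, hi = sortedKeys.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = Long.compareUnsigned(sortedKeys[mid], key);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return (long) logicalOffsets[mid] * 4096;  // LogicOffset * 4096
            }
        }
        return -1;
    }
}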

Sequential reading process

The Range part is not only the biggest chunk of the whole competition but also the watershed that opens up the gap. As mentioned earlier, the overall idea for Range is a producer-consumer model: n producer threads read data from disk into memory (n is a knob; benchmarking settled on n = 4), while 64 consumer threads invoke the visit callback to verify data, and visiting is essentially random access to memory. About 1 GB of memory was left in the Range phase, so I allocated four 256 MB off-heap buffers to cache the data of four partitions, with one read thread per buffer responsible for loading data for the 64 consumers.

For the concrete sequential read architecture, see the figure below:

In general, four fetch threads are responsible for reading the disk: fetch thread n handles the partitions with partitionNo % 4 == n and notifies the visit threads when a partition is loaded. There is quite a lot of mutual-exclusion and waiting logic that is not reflected in the figure; roughly:

Fetch threads 1-4 load disk data into the caches concurrently.

Visit threads 1-64 access the same buffer concurrently.

Visit threads 1-64 access the buffers of different partitions strictly in partition order (this yields global order).

Loading partition N blocks visiting buffer N, and visiting buffer N blocks loading partition N+4 (equivalent to reusing the 4 caches in rotation).

Loading data into the cache in large blocks and reusing those caches as much as possible is the key to the ReadSeq part. The score for the two sequential read rounds is about 196-198 s, roughly 4 s slower than the C++ contestants.
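Here is a minimal sketch of one fetch thread's loop under those rules, using a hypothetical RangeCoordinator interface as a stand-in for the semaphore-based handshake described later: thread n walks the partitions with partitionNo % 4 == n, reuses its single 256 MB buffer, and hands each loaded partition to the 64 visit threads.

import java.nio.ByteBuffer;

public class FetchTask implements Runnable {

    // Hypothetical handshake interface standing in for the LoopQuerySemaphore-based logic shown below.
    public interface RangeCoordinator {
        void awaitBufferFree(int bufferNo) throws InterruptedException;  // wait until visitors released the buffer
        void publishBuffer(int partitionNo, ByteBuffer buffer);          // hand the loaded partition to visitors
    }

    private static final int PARTITIONS = 1024;
    private static final int FETCH_THREADS = 4;

    private final int threadNo;                  // 0..3
    private final ByteBuffer buffer;             // one 256 MB DirectByteBuffer, reused for every 4th partition
    private final RangeCoordinator coordinator;

    public FetchTask(int threadNo, ByteBuffer buffer, RangeCoordinator coordinator) {
        this.threadNo = threadNo;
        this.buffer = buffer;
        this.coordinator = coordinator;
    }

    @Override
    public void run() {
        try {
            for (int partitionNo = threadNo; partitionNo < PARTITIONS; partitionNo += FETCH_THREADS) {
                coordinator.awaitBufferFree(threadNo);   // loading partition N waits for the visit of partition N-4
                buffer.clear();
                loadPartition(partitionNo, buffer);      // sequential disk read of one whole partition
                coordinator.publishBuffer(partitionNo, buffer);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void loadPartition(int partitionNo, ByteBuffer target) {
        // read the partition's data file into `target` (Direct IO in the real engine)
    }
}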

The devil is in the details

This is where the watershed lies. Having covered the overall architecture and the implementation of the four stages, here are the concrete optimization points.

Implementing Direct IO with Java

Since the drop-cache time is counted by the evaluation program, PageCache should be avoided wherever it is not strictly needed; in other words, PageCache should not appear in any stage other than index writing. This may be a big obstacle for Java contestants, because Java does not provide Direct IO natively, so you need to wrap a set of JNA bindings, borrowing from the open-source framework jaydio. Thanks to @Di Yang for the assistance; the implementation details can be seen in the code at the end of the article. This alone probably stopped a large number of Java contestants.

There are two details to pay attention to with Direct IO:

The allocated memory needs to be aligned; the corresponding JNA-bound method is posix_memalign.

The data written needs to be aligned as well, usually to an integer multiple of the page size; in practice pread/pwrite are used on a file descriptor opened with O_DIRECT.
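For illustration only, here is a rough sketch of the kind of JNA binding involved; the flag values are Linux x86_64 assumptions, the signatures are simplified (size_t/off_t mapped to long assumes a 64-bit platform, Native.load is the JNA 5.x API), and the real contest code follows jaydio rather than this exact shape.

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
import com.sun.jna.ptr.PointerByReference;

public class DirectIoSketch {

    // Linux x86_64 open(2) flag values (O_DIRECT is octal 040000); these differ across platforms.
    private static final int O_RDWR    = 2;
    private static final int O_CREAT   = 64;
    private static final int O_DIRECT  = 16384;
    private static final int ALIGNMENT = 4096;   // block-size alignment for buffer, count and offset

    public interface CLib extends Library {
        CLib INSTANCE = Native.load("c", CLib.class);

        int open(String pathname, int flags, int mode);
        int posix_memalign(PointerByReference memptr, long alignment, long size);
        long pwrite(int fd, Pointer buf, long count, long offset);
        long pread(int fd, Pointer buf, long count, long offset);
        int close(int fd);
    }

    public static void main(String[] args) {
        int fd = CLib.INSTANCE.open("/tmp/demo.data", O_RDWR | O_CREAT | O_DIRECT, 0644);
        PointerByReference ref = new PointerByReference();
        CLib.INSTANCE.posix_memalign(ref, ALIGNMENT, ALIGNMENT);   // 4 KB buffer aligned to 4 KB
        Pointer buf = ref.getValue();
        buf.setByte(0, (byte) 42);
        CLib.INSTANCE.pwrite(fd, buf, ALIGNMENT, 0);               // aligned count and offset
        CLib.INSTANCE.close(fd);
    }
}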

Off-heap (direct) memory is better than heap memory

This is also mentioned in "Some best practices for file IO operations". The two benefits of off-heap memory are that it saves one memory copy and is GC-friendly. A Direct IO implementation should be paired with a set of off-heap memory interfaces to get the most out of it. This matters especially in the Range phase, where each cache corresponds to the size of one partition's data, 256 MB; such large blocks of memory are better loaded into a DirectByteBuffer.
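A minimal sketch of allocating those caches (sizes as assumed above): four 256 MB DirectByteBuffers, which together account exactly for the -XX:MaxDirectMemorySize=1024m in the JVM flags below.

import java.nio.ByteBuffer;

public class RangeBuffers {

    // four off-heap caches, one per fetch thread, each sized to hold a full partition
    static final ByteBuffer[] BUFFERS = new ByteBuffer[4];

    static {
        for (int i = 0; i < BUFFERS.length; i++) {
            BUFFERS[i] = ByteBuffer.allocateDirect(256 * 1024 * 1024);  // 256 MB each
        }
    }
}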

JVM tuning:

-server -Xms2560m -Xmx2560m -XX:MaxDirectMemorySize=1024m -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:-UseBiasedLocking

As is well known, NewRatio controls the size ratio between the young and old generations, and the official recommendation is -XX:NewRatio=1. Many Java contestants who do not pay attention to this parameter may overlook it and be invisibly dragged down by GC. After discussion with @Adu, the conclusion was:

If the young generation is too large, objects stay in the young generation too long and get copied back and forth many times.

If the old generation is too small, CMS GC on the old generation is triggered frequently.

In this competition this matters a great deal: -XX:NewRatio=4 enlarges the old generation and effectively reduces the number of CMS GCs, from 126 CMS GCs down to the final 5.

Pooled objects

Whether it is Apache's ObjectPool, Netty's Recycler, or the pre-allocated objects in a RingBuffer, they all convey the same idea: objects that would otherwise be new'd over and over can be pooled, because repeatedly allocating and reclaiming memory is itself a significant overhead. In the scenario of this competition there is no need for an elaborate object pool; a ThreadLocal is enough. In fact I cached the key/value buffers for both writes and reads in ThreadLocals, so no objects are ever allocated inside the hot loops.
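A minimal sketch of the ThreadLocal approach (names assumed): each thread keeps reusing its own 4 KB value buffer instead of allocating a new byte[] on every read or write.

public class ValueBuffers {

    private static final int VALUE_SIZE = 4096;

    // one lazily created 4 KB buffer per thread, reused across calls
    private static final ThreadLocal<byte[]> LOCAL_VALUE =
            ThreadLocal.withInitial(() -> new byte[VALUE_SIZE]);

    public static byte[] get() {
        return LOCAL_VALUE.get();
    }
}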

Reduce thread switching

Whether for network IO or disk IO, the time slices of the IO worker threads are especially valuable. In my architecture the Range phase has two kinds of threads: 64 visit threads randomly reading memory concurrently, and 4 IO threads reading the disk concurrently. By the bucket (weakest-link) effect, it is easy to see that the bottleneck is the four IO threads. Compared with a wait/notify model, to minimise the IO threads' loss of time slices we can poll with while(true), while the visit threads sleep briefly (about 1 us) to keep the CPU from spinning idle and dragging down overall performance. Since the evaluation machine has 64 cores, this split is reasonable. To do this I implemented a simple and crude semaphore.

public class LoopQuerySemaphore {

    private volatile boolean permit;

    public LoopQuerySemaphore(boolean permit) {
        this.permit = permit;
    }

    // for the 64 visit threads
    public void acquire() throws InterruptedException {
        while (!permit) {
            Thread.sleep(0, 1);
        }
        permit = false;
    }

    // for the 4 fetch threads
    public void acquireNoSleep() throws InterruptedException {
        while (!permit) {
        }
        permit = false;
    }

    public void release() {
        permit = true;
    }
}

Using acquireNoSleep in the IO threads and acquire in the visit threads improves performance by about 6 s compared with an ordinary blocking Semaphore.

Core binding (CPU affinity)

Jitter on the online machines is unavoidable, and avoiding IO thread switches cannot rely on while(true) polling alone. A CPU-level optimization is to reserve four cores exclusively for the IO threads so they never compete for time slices. This is not hard to do in Java: thanks to the all-powerful GitHub, we can easily use the Affinity library.

Usage:

try (final AffinityLock al2 = AffinityLock.acquireLock()) {
    // do fetch...
}

This change makes the code about 1 s faster and keeps the evaluation results stable.

Thank you for reading. That is the content of "PolarDB database performance case analysis". After studying this article, I believe you have a deeper understanding of the topic; the specifics still need to be verified in practice. The editor will keep pushing more articles on related knowledge points; welcome to follow!
