2025-04-01 Update. From: SLTechnology News & Howtos (shulou) > Servers
Shulou (Shulou.com) 06/02 Report --
The gap between CPU computing power and disk access latency keeps widening, so disk IO on users' CVMs often becomes a serious performance bottleneck, especially in cloud computing environments. To address the poor IO performance of mechanical disks, we built a self-developed cloud host IO acceleration scheme that raises peak 4K random-write performance from 300 IOPS to 45,000 IOPS, a 150-fold improvement; in other words, SSD-class performance at mechanical-disk cost. Since its launch in 2013, the solution has gone through five years of production operation and has been successfully applied to 93% of the standard CVMs on the platform, covering 127,000 instances with a total capacity of 26 PB.
I. Why IO acceleration is needed
A traditional mechanical disk must move its head to the target location when addressing, and this head movement is the main reason for its poor performance. Various system software and IO schedulers try to reduce head movement, but the improvement they bring is limited in most scenarios. In general, a SATA mechanical disk delivers only about 300 random IOPS. For most CVMs, 300 random IOPS is far from enough, not to mention that in cloud computing scenarios multiple CVMs share one physical host. Some other means of significantly improving IO performance is therefore required.
Early SSDs were expensive, so adopting them would inevitably raise costs. We therefore began to explore whether the problem could be solved technically, and after analyzing the performance characteristics of disks we developed the first-generation IO acceleration solution. Even today, with SSDs increasingly common, mechanical disks remain widely used for their low cost and stable storage, and IO acceleration lets them meet the high IO performance requirements of most application scenarios.
II. IO acceleration principle and the first-generation IO acceleration
A mechanical disk is characterized by poor random IO performance but much better sequential IO performance. For example, the 4K random IO mentioned above reaches only about 300 IOPS, while 4K sequential IO on the same disk can reach about 45,000 IOPS.
The basic principle of IO acceleration is to exploit this characteristic. The system uses two disks: a cache disk, a smaller-capacity mechanical disk that temporarily stores written data, and a target disk, a larger-capacity mechanical disk that stores the final data.
1. Reading and writing of IO
On a write, the upper layer's IO is written sequentially to the cache disk; because the write is sequential, performance is very good. Later, when the cache disk is idle, its data is flushed back to the target disk in write order, so the cache disk always keeps some free space for new writes.
To keep the upper-layer business unaware of all this, we implemented the function at the device mapper layer (the dm layer) in the host kernel. The dm layer has a clear structure and is convenient to modularize. Once implemented, it appears to the upper layer as an ordinary dm block device; the upper layer does not need to care how the block device is implemented, and can simply put a file system on it and use it directly.
With this approach, when a new IO arrives, the dm-layer module first writes it to the cache disk and later flushes it back to the target disk. An index is needed to record where each written IO lives on the cache disk and where it belongs on the target disk, so that the flush-back thread can determine the source and destination of each piece of data. We set the index size to 512 bytes, because a disk sector is 512 bytes, so each write becomes a unit of 4K data plus a 512-byte index. For performance, a copy of the index is also kept in memory.
The read path is relatively simple: the in-memory index determines whether the requested data is on the cache disk or the target disk, and the read is issued to the corresponding location.
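As an illustration, the write path, in-memory index, flush-back, and read path described above can be sketched in a few lines. This is a toy in-memory model under stated assumptions; all names are illustrative, not the actual kernel module's code:

```python
class IOAccelerator:
    """Toy model of the scheme: random 4K writes are appended sequentially
    to a cache log together with their index; reads consult the in-memory
    index copy; a background step flushes old entries to the target disk."""
    BLOCK = 4096

    def __init__(self):
        self.cache = []       # cache disk modeled as an append-only log
        self.target = {}      # target disk: lba -> 4K data
        self.index = {}       # in-memory index: lba -> position in cache log
        self.flushed = 0      # how far the flush-back thread has progressed

    def write(self, lba, data):
        assert len(data) == self.BLOCK
        self.cache.append((lba, data))            # sequential append: fast on HDD
        self.index[lba] = len(self.cache) - 1     # remember newest copy's location

    def flush_oldest(self):
        # Flush one entry back to the target disk, strictly in write order
        if self.flushed >= len(self.cache):
            return
        lba, data = self.cache[self.flushed]
        if self.index.get(lba) == self.flushed:   # skip superseded copies
            self.target[lba] = data
            del self.index[lba]
        self.flushed += 1

    def read(self, lba):
        pos = self.index.get(lba)
        if pos is not None:
            return self.cache[pos][1]             # data still on the cache disk
        return self.target.get(lba)               # already flushed back
```

Note how overwrites are handled: the index always points at the newest copy, so the flush thread silently skips stale log entries instead of writing them back.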
Written data is generally 4K, which is determined by the kernel dm layer: when a write IO is larger than 4K, the dm layer splits it by default, so a 16K write, for example, becomes four 4K IOs. If a write is not 4K-aligned, say only 1024 bytes, special handling is needed: we first check the data range covered by the IO, and if any of that range still has data on the cache disk, that data is first flushed to the target disk; the unaligned IO is then written directly to the target disk. This path is relatively complex, but in file-system scenarios most IO is 4K-aligned and only a small minority is unaligned, so it has little impact on business performance.
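The splitting behavior can be sketched as follows. This is illustrative only; the real splitting happens inside the kernel dm layer:

```python
BLOCK = 4096

def split_io(offset, length):
    """Sketch of dm-layer-style splitting: a write is cut at every 4K
    boundary, and each piece is flagged as a full aligned 4K block or not
    (unaligned pieces take the slow path described in the text)."""
    pieces = []
    end = offset + length
    while offset < end:
        block_start = offset - offset % BLOCK      # 4K boundary containing offset
        piece_end = min(block_start + BLOCK, end)  # cut at the next boundary
        aligned = (offset == block_start and piece_end - offset == BLOCK)
        pieces.append((offset, piece_end - offset, aligned))
        offset = piece_end
    return pieces
```

A 16K aligned write yields four full 4K pieces, while a 1024-byte write at offset 1024 yields a single unaligned piece.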
2. Fast recovery and backup of the index
Accidental power loss or system crashes cannot be avoided while the system is running, and a robust system must keep data reliable when they happen. On recovery the system must rebuild the in-memory index. The index records were written to the cache disk alongside the IO data, but because they are interleaved with the data at intervals, reading every index back from the cache disk would make recovery very slow.
For this reason we designed a periodic dump mechanism for the in-memory index: every hour, the in-memory index is dumped to the system disk. On startup, the dumped index is loaded first, and only the index records from the last hour are read back from the cache disk. This greatly reduces the system's recovery time.
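The recovery sequence can be sketched as follows, under the assumption (ours, for illustration) that each hourly dump records the cache-log position it covers:

```python
def recover_index(dump, cache_log):
    """Rebuild the in-memory index after a restart: load the last hourly
    dump from the system disk, then replay only the index records that
    were written to the cache disk after the dump was taken."""
    index = dict(dump["index"])               # snapshot, at most 1 hour old
    for pos in range(dump["covered"], len(cache_log)):
        lba, _data = cache_log[pos]
        index[lba] = pos                      # newer records win
    return index
```

Only the tail of the cache log is scanned, which is why startup is fast even with a large cache disk.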
Based on these principles, UCloud developed the first-generation IO acceleration scheme. It delivered remarkable gains for random writes and ran stably in production.
III. Problems in the first-generation IO acceleration scheme
But as the system kept running, we found several problems.
1) The index consumes too much memory.
The on-disk index must be at least 512 bytes because of the sector size, but the in-memory index has no such constraint; keeping it at 512 bytes wastes a large amount of memory.
2) Under high load, too much IO data accumulates on the cache disk.
IO acceleration mainly targets random IO; sequential IO needs no acceleration because the mechanical disk already handles it well. This version, however, does not distinguish sequential from random IO: all IO is written to the cache disk, leading to excessive accumulation.
3) Hot upgrades are unfriendly.
Online upgrades were not sufficiently considered in the initial design, so hot-upgrading the first-generation scheme is cumbersome.
4) Incompatibility with newer 512e mechanical disks.
Traditional mechanical disks use 512-byte physical and logical sectors, while newer 512e disks have a 4K physical sector. Their logical sector can still be 512 bytes, but sub-4K writes degrade performance severely, so the first-generation write format of 4K data plus a 512-byte index had to be adjusted.
5) Performance cannot scale.
System performance is bounded by the load on the single cache disk and cannot be extended.
IV. The second-generation IO acceleration technology
All of these problems surfaced during the online operation of the first-generation technology. Compatibility with newer mechanical disks was especially pressing: 512n disks, with 512-byte physical and logical sectors, are gradually going out of production, and without improvement the system might not meet future needs. On the basis of the first-generation scheme, we therefore began developing and iterating on the second-generation IO acceleration technology.
1. New index and index backup mechanism
The first-generation technology could no longer use the 4K + 512-byte layout because of the new disks. To solve this, we separated the data from the index and created an index file on the system disk. The system disk is mostly idle, so there is little concern about its load affecting index writes. We also shrank the index from 512 B to 64 B, while the data is still written to the cache disk, as shown in the following figure:
At the head of the index file and of the cache disk, two 4K blocks are reserved for header data, which records the start and end offsets of the current cache-disk data; the index records follow. Indexes correspond one-to-one with the 4K data blocks on the cache disk: every 4K of data has one index.
Because the index now lives on the system disk, we must also consider how to recover it if the system disk is irreparably damaged. System-disk failure is a very rare event, but if it happened the index file would be completely lost, which is clearly unacceptable. We therefore designed an index backup mechanism: for every 8 indexes written, the system merges them into a 4K index backup block, pads the remainder with zeros, and writes the block to the cache disk (the purple block in the figure above). If the system disk fails, the backup indexes on the cache disk can be used to recover the data.
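Building one backup block under the layout just described can be sketched as follows. The 64-byte record format itself is not specified in the text, so records are treated as opaque byte strings here:

```python
INDEX_SIZE = 64   # bytes per index record (second-generation format)
GROUP = 8         # records merged into one backup block
BLOCK = 4096

def pack_backup_block(records):
    """Merge up to 8 index records into one 4K backup block destined for
    the cache disk, zero-padding the unused tail as described above."""
    assert 0 < len(records) <= GROUP
    assert all(len(r) == INDEX_SIZE for r in records)
    blob = b"".join(records)
    return blob + b"\x00" * (BLOCK - len(blob))   # pad to a full 4K block
```

Eight 64-byte records occupy 512 bytes, so most of each backup block is zero padding; the cost is accepted to keep every cache-disk write 4K-aligned.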
Index and data are written at the same time. To improve write efficiency, we optimized the index write path with merged writes: multiple pending index records are merged into one 4K write buffer and written in a single operation. This avoids issuing one write per index and ensures every write is 4K-aligned.
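A minimal sketch of such a merged-write buffer, with `flush_fn` standing in for the actual system-disk write (illustrative names, not the real implementation):

```python
class IndexWriteBuffer:
    """Merged index writes: accumulate 64-byte index records and issue a
    single 4K-aligned write instead of one small write per record."""
    BLOCK, REC = 4096, 64

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn   # callback that performs the real write
        self.pending = []

    def add(self, record):
        assert len(record) == self.REC
        self.pending.append(record)
        if len(self.pending) * self.REC >= self.BLOCK:
            self.flush()                       # a full 4K accumulated: write it

    def flush(self):
        if not self.pending:
            return
        buf = b"".join(self.pending)
        pad = -len(buf) % self.BLOCK           # keep every write 4K-aligned
        self.flush_fn(buf + b"\x00" * pad)
        self.pending.clear()
```

Sixty-four 64-byte records fill exactly one 4K buffer; a partial buffer is zero-padded on an explicit flush.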
2. Sequential IO recognition
While operating the first-generation technology we found that users generate a large volume of writes during data backups, imports, and similar operations. These writes are mostly sequential and need no acceleration, but the first generation made no distinction, so they all went to the cache disk and caused excessive accumulation. We therefore added a sequential IO recognition algorithm: IO streams identified as sequential bypass the accelerator and are written directly to the target disk.
When an IO stream starts, there is no way to predict whether it will be sequential or random. The usual approach is to regard a stream as sequential once its write locations have been contiguous for a certain count, so the key is how to judge whether a stream is contiguous. Here we use a trigger method.
When an IO stream begins writing, we place a trigger on the next block after the stream's current position. If the trigger fires, that block has been written, and the trigger moves back one more block. Once a trigger has fired a certain number of times, the stream can be considered sequential. Conversely, if a trigger stops firing for a certain period of time, it is reclaimed.
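The trigger method can be sketched as follows. The threshold value is illustrative, and the timeout-based trigger reclamation mentioned above is omitted for brevity:

```python
class SequentialDetector:
    """Trigger-method sketch: after each write of a stream, arm a trigger
    on the next block; enough consecutive hits mark the stream sequential
    (timeout-based trigger reclamation is not modeled here)."""
    def __init__(self, threshold=4):
        self.threshold = threshold
        self.triggers = {}    # armed block -> consecutive hits so far

    def on_write(self, block):
        hits = self.triggers.pop(block, 0) + 1   # trigger fired, or new stream
        self.triggers[block + 1] = hits          # move the trigger forward
        return hits >= self.threshold            # True: bypass the cache disk
```

Four contiguous writes (with threshold 4) flip the stream to sequential, while an isolated random write never accumulates hits.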
3. Transparent hot upgrade
The first-generation design was unfriendly to hot upgrades: the deployed version could only be updated by hot-migrating instances and then restarting, a tedious process. When developing the second generation, we therefore designed a transparent hot-upgrade solution.
Our module sits at the dm layer in the kernel. Once initialized, it produces a virtual dm block device that is referenced by the upper file system, so the module can never be unloaded after initialization. To solve this, we designed a parent-child module pair. The parent module is a bridge between the child module and the dm layer; it contains only very simple IO forwarding logic and no complex business logic, so it never needs upgrading. The child module holds the complex business logic and can be detached from the parent, enabling a transparent hot upgrade:
In the figure above, binlogdev.ko is the parent module and cachedev.ko is the child module. For a hot upgrade, the cache disk is switched to read-only mode, so it only flushes data back and accepts no new writes. Once the cache disk has fully flushed, subsequent write IO can go straight to the target disk with no risk of clashing with stale cache-disk data, and the child module can be smoothly unplugged and replaced.
This mechanism not only enables hot upgrades but also provides a failure-avoidance path. During the gray release of the IO acceleration technology we found an occasional bug that could restart the host; we immediately set all cache disks to read-only, preventing the failure from recurring and buying time for debugging.
4. Compatibility with 512e mechanical disks
The new generation of mechanical disks is mainly 512e, and these disks must be written with 4K-aligned IO to reach full performance, so the original 4K + 512 B index format could no longer be used. We considered expanding the 512 B index to 4K, but that would inflate the index space and waste extra disk bandwidth on every write. In the end the problem was solved by moving the index to the system disk and combining it with the merged-write technique described above.
5. Performance scaling and improvement
The first generation supported only one cache disk, which limited performance under high load. In the second generation we added support for multiple cache disks, attached on demand according to system load, so random-IO acceleration capacity grows with the number of disks. With SATA mechanical disks used for both local and network cache disks, random-write performance improves substantially as cache disks are added.
With a single local cache disk, random-write performance reaches about 47,000 IOPS.
With one local cache disk plus one network cache disk, it reaches about 90,000 IOPS.
With one local cache disk plus two network cache disks, it reaches about 136,000 IOPS.
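The scaling idea can be sketched as a dispatcher spreading writes over several cache disks. The actual placement policy is not described in the text; round-robin here is purely an assumption for illustration:

```python
import itertools

class MultiCacheDispatcher:
    """Sketch: with several cache disks attached on demand, incoming random
    writes are spread across them, so accelerated write capacity grows with
    the disk count (the round-robin policy is an illustrative assumption)."""
    def __init__(self, n_disks):
        self.disks = [[] for _ in range(n_disks)]
        self._next = itertools.cycle(range(n_disks))

    def write(self, lba, data):
        d = next(self._next)                 # pick the next cache disk
        self.disks[d].append((lba, data))    # sequential append on that disk
        return d                             # which disk absorbed the write
```

Because each cache disk only ever sees sequential appends, adding disks multiplies the aggregate sequential-write bandwidth available to absorb random IO.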
We have now deployed the second-generation IO acceleration technology at scale on CVMs. Thanks to the design above, the previous pain points of excessive IO accumulation and cache-disk performance bottlenecks have been greatly alleviated. The IO accumulation problem in particular, which under the first generation was frequently triggered at high load and caused both performance loss and operational overhead, now only occasionally triggers monitoring alarms.
V. Closing remarks
The CVM IO acceleration technology greatly improves random-write throughput on mechanical disks, letting users meet business needs at a lower price. More fundamentally, the technology does not merely accelerate mechanical disks: it gives the system the ability to decouple performance from the storage medium, so IO performance is no longer limited by the medium that stores the data. For any low-level technology applied at scale in production, the design is critical; hot version upgrades, fault tolerance, and performance all need careful consideration. We hope this article helps readers better understand the characteristics and applications of such low-level technology and informs better designs in the future.
-END-
At the "UCloud User Conference and TIC Shanghai Station" on December 21, UCloud will discuss cloud host IO acceleration along with product design concepts, technical details, and future development topics.