Technical Analysis of SmartX products: SMTX distributed Block Storage-Storage engine

Note: this article is compiled from the speech given by SmartX CTO Zhang Kai at the SMTX OS 3.5 product launch.

Let's first take a look at the requirements we have for the data storage engine module.

First of all, it must be reliable. Most of our customers run core business applications on the product, so data reliability must be absolutely guaranteed; there is no room for compromise.

The second is performance. 10-Gigabit networks and SSDs, including NVMe SSDs, are now widely deployed. As hardware gets faster and faster, the performance bottleneck shifts from hardware to software, and for a storage engine in particular, performance is critical.

In addition to pursuing absolute performance, we also want to be efficient: we want no CPU instruction to be wasted, and we aim to complete each IO operation with as few CPU instructions as possible. The reason is that storage hardware keeps getting faster; a single access to the fastest storage devices can complete in as little as about 10 nanoseconds, while acquiring a lock or triggering a context switch can easily cost hundreds of nanoseconds. If the software is not efficient, today's CPUs may not be able to keep up with SSDs. Beyond the CPU, we also want to use memory and network bandwidth efficiently. And because SSDs cost more than HDDs at the same capacity, we try our best to save disk space and improve SSD space efficiency with techniques such as compression and deduplication.
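
This is why high-performance storage engines typically keep the IO fast path free of locks and hand work between cores through lock-free queues. The sketch below is only a generic illustration of that idea, not ZBS code: a minimal single-producer/single-consumer ring buffer that passes requests between two pinned threads without taking a lock or forcing a context switch.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal single-producer/single-consumer ring buffer: one core submits IO
// requests, another core consumes them, and no lock is needed on the fast
// path. Illustrative sketch only.
template <typename T, size_t N>
class SpscRing {
public:
    bool push(const T& item) {                 // called by the producer core only
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                      // ring full, caller may retry
        buf_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {                   // called by the consumer core only
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt;               // ring empty
        T item = buf_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return item;
    }
private:
    std::array<T, N> buf_{};
    std::atomic<size_t> head_{0};              // next slot to write
    std::atomic<size_t> tail_{0};              // next slot to read
};
```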

Last but not least, the storage engine needs to be easy to debug and easy to upgrade. Software engineers spend more than 50% of their working time on debugging, and for storage software engineers the proportion may be even higher. We want a product that is very easy to debug: when a problem is found, it can be located and fixed quickly. The same goes for upgrades. Software now iterates faster and faster, and we want upgrades to be easy so that users can move to new versions sooner and enjoy the new features and performance optimizations.

Next, let's look at the concrete implementation. When implementing a storage engine, many traditional storage vendors choose to put the entire IO path in Kernel Space. For example, in the image above, the upper layer is the core storage engine and the lower layers are the file system, block devices, and drivers. Since the network stack is also implemented in the kernel, putting the storage engine in the kernel can maximize performance and reduce context switches (Context Switch). But this approach has several very serious problems. First, it is difficult to debug: anyone who has done kernel development knows that debugging in the kernel is very troublesome, and the development language is limited to C. Upgrades are also very difficult when developing in the kernel: any upgrade, whether a bugfix or a new feature, may require rebooting the entire server, which is costly for a storage system. Another important factor is that the fault domain is very large: if a module in the kernel goes wrong, the whole kernel may be contaminated, resulting in a deadlock or a Kernel Panic, and it usually takes a server restart to recover.

Given all these problems, we certainly did not choose Kernel Space for our design. We chose to implement our storage engine in User Space, that is, in user mode.

In a User Space implementation, many projects choose to build the storage engine on the LSM Tree data structure, with the LSM Tree running on top of a file system. User Space is more flexible than the kernel and can be developed in a variety of languages; upgrades are very convenient, requiring only a process restart rather than a server reboot; and a failure in User Space only affects the service process itself, not the kernel. The problem with this approach is that performance is not good enough: IO still has to go through the kernel, which introduces context switches and therefore performance overhead.

Next, let's talk about the LSM Tree. We will not describe its data structure and implementation in detail here; overall, the LSM Tree is at the core of many storage engines.

The advantages of the LSM Tree are that it is relatively easy to implement, there are many open source implementations to refer to, and it handles small writes very well by merging small pieces of data in memory and writing them out to disk in batches.
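
To make that batching behaviour concrete, here is a toy sketch, an assumption-laden illustration rather than a production LSM Tree or anything from ZBS: small writes are absorbed by an in-memory sorted memtable and flushed as one sequential, sorted file once it grows large enough.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <string>

// Toy LSM-style write path: small random writes go into an in-memory sorted
// map (the "memtable"); when it exceeds a threshold it is written out as one
// sequential, sorted file (an "SSTable"). Illustrative sketch only.
class ToyLsm {
public:
    void put(const std::string& key, const std::string& value) {
        memtable_[key] = value;
        approx_bytes_ += key.size() + value.size();
        if (approx_bytes_ >= kFlushThreshold) flush();
    }
private:
    void flush() {
        std::ofstream out("sstable_" + std::to_string(next_id_++) + ".dat",
                          std::ios::binary);
        for (const auto& [k, v] : memtable_)   // keys come out already sorted
            out << k << '\t' << v << '\n';
        memtable_.clear();
        approx_bytes_ = 0;
        // A real LSM Tree would also write a WAL before acknowledging puts and
        // compact SSTables in the background -- which is exactly where read
        // and write amplification come from.
    }
    static constexpr size_t kFlushThreshold = 4 * 1024 * 1024;  // 4 MB
    std::map<std::string, std::string> memtable_;
    size_t approx_bytes_ = 0;
    int next_id_ = 0;
};
```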

However, the LSM Tree is not a silver bullet. Its biggest problem is the "read amplification" and "write amplification" caused by its data structure. How serious is the problem? We can look at this chart, which shows the results of a read/write amplification test. As the figure shows, writing 1 GB of data eventually results in about three times that amount being written to disk, i.e. a write amplification of 3x. Writing 100 GB amplifies to 14x, meaning that 100 GB of data actually generates 1.4 TB of write traffic on disk. Read amplification is even more serious, reaching more than 300x in this scenario. This goes against our original goal of making the hardware more efficient.
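
The amplification factor is simply the ratio of disk traffic to application traffic. A trivial calculation, using only the figures quoted above, makes the definition explicit:

```cpp
#include <cstdio>

int main() {
    // Write amplification = bytes actually written to disk / bytes written by
    // the application. With the figures quoted above, 100 GB of user writes
    // turning into 1.4 TB of disk traffic is a 14x write amplification.
    double user_gb = 100.0;
    double disk_gb = 1400.0;   // 1.4 TB expressed in GB
    std::printf("write amplification = %.0fx\n", disk_gb / user_gb);
    return 0;
}
```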

Although the LSM Tree has many benefits, we will not use it as our data storage engine because of its serious read/write amplification. Instead, we borrow the good ideas of the LSM Tree and, combined with our own requirements, implement our own storage engine, covering data allocation, space management, IO and other logic.

Next, notice that there is also a file system in this figure. The file system is implemented in the kernel, on top of the block device; common examples are ext4, xfs and btrfs, and many storage engines are built on top of a file system. But we need to ask whether we really need a file system at all.

First of all, the file system provides far more functionality than a storage engine needs. For example, it offers ACLs, attributes and multi-level directory trees, none of which a dedicated storage engine requires. These extra features often bring performance overhead, especially global locks, which seriously hurt performance.

Secondly, most file systems are designed for a single disk rather than multiple disks, whereas a storage server typically has 10 or more disks, which may be all SSD, all HDD, or a hybrid of the two.

Third, many file systems do not support asynchronous IO well. They expose asynchronous IO interfaces, but in practice these calls occasionally block, which is another weak point of file systems.
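
Linux native AIO (libaio) is a common example: io_submit() is only truly asynchronous under fairly narrow conditions, typically O_DIRECT on a fully allocated file, and otherwise it can silently block the submitting thread. The minimal example below only demonstrates the interface; the file path is a placeholder and error handling is simplified.

```cpp
#include <libaio.h>   // Linux native AIO; link with -laio
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    io_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0) { std::fprintf(stderr, "io_setup failed\n"); return 1; }

    // Without O_DIRECT (or on a file with unallocated blocks), io_submit()
    // may quietly become synchronous and block the caller.
    int fd = open("/tmp/aio_test.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { std::perror("open"); return 1; }

    void* buf = nullptr;
    posix_memalign(&buf, 4096, 4096);          // O_DIRECT requires aligned buffers
    std::memset(buf, 0xAB, 4096);

    struct iocb cb;
    struct iocb* cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, 0);     // queue a 4 KB write at offset 0
    if (io_submit(ctx, 1, cbs) != 1) { std::fprintf(stderr, "io_submit failed\n"); return 1; }

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, nullptr);     // wait for the completion event
    std::printf("write completed, res = %ld\n", (long)ev.res);

    close(fd);
    std::free(buf);
    io_destroy(ctx);
    return 0;
}
```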

Finally, to keep data and metadata consistent, file systems rely on journaling, and journaling itself introduces write amplification. Moreover, if multiple file systems are mounted on a server, the journaling of each file system cannot provide atomicity across file systems.

In the end, when designing our storage engine, we chose to abandon both the file system and the LSM Tree. We set out to build an ideal storage engine: remove the functionality we do not need, avoid write amplification as much as possible, and implement exactly the features we want directly on the block device.
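
Concretely, working directly on the block device means opening the device and issuing aligned reads and writes at offsets the engine manages itself. The sketch below only illustrates that access pattern; the device path and offset are placeholders, not the actual ZBS on-disk layout.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    // Open the raw block device and bypass the page cache. The storage engine
    // itself decides which offset holds which piece of data; there is no file
    // system, directory tree, or journaling underneath.
    int fd = open("/dev/sdb", O_RDWR | O_DIRECT);    // placeholder device
    if (fd < 0) { std::perror("open"); return 1; }

    const size_t kBlock = 4096;
    void* buf = nullptr;
    posix_memalign(&buf, kBlock, kBlock);            // O_DIRECT needs aligned memory
    std::memset(buf, 0, kBlock);

    off_t offset = 1 * 1024 * 1024;                  // an engine-managed offset
    if (pwrite(fd, buf, kBlock, offset) != (ssize_t)kBlock)
        std::perror("pwrite");
    fsync(fd);                                       // flush the device write cache

    std::free(buf);
    close(fd);
    return 0;
}
```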

We do not want to implement the Block Layer ourselves, because in the Linux kernel the Block Layer is a very thin layer and the algorithms inside it are very simple. These algorithms also have tunable parameters and can be turned off, so they do not add much extra performance overhead.
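
Those tunables are exposed through sysfs. For example, a device's IO scheduler can be switched to "none" so that the block layer adds as little per-request work as possible; a minimal sketch follows, where the device name is a placeholder and writing the attribute requires root.

```cpp
#include <fstream>
#include <iostream>

int main() {
    // Switch the kernel IO scheduler of a device (placeholder: sdb) to "none",
    // minimising the scheduling work the block layer does for each request.
    std::ofstream sched("/sys/block/sdb/queue/scheduler");
    if (!sched) {
        std::cerr << "cannot open scheduler attribute (run as root?)\n";
        return 1;
    }
    sched << "none" << std::endl;
    return 0;
}
```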

The figure on the left shows how ZBS is currently implemented. The biggest problem with this approach is performance: both the Block Layer and the driver run in Kernel Space, so IO from the User Space storage engine has to pass through Kernel Space, causing context switches. In the future, we will move to the design on the right: a User Space driver provided by the SSD vendor, combined with a PMD (Poll Mode Driver) engine, to deliver better performance.
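
A Poll Mode Driver avoids interrupts and the associated context switches by letting a dedicated thread spin on the device's completion queue. The sketch below only illustrates that loop shape; the Device, submit and poll_completions names are hypothetical stand-ins, not the vendor driver's real API.

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Hypothetical user-space device handle. A real poll-mode NVMe driver exposes
// similar "submit" and "poll for completions" primitives; the stubs here exist
// only so the sketch compiles.
struct Completion { int request_id; int status; };
struct Device {
    bool submit(int /*request_id*/) { return true; }
    int  poll_completions(std::vector<Completion>* /*out*/) { return 0; }
};

void pmd_loop(Device& dev, std::atomic<bool>& running,
              const std::function<void(const Completion&)>& on_done) {
    std::vector<Completion> done;
    // The polling thread is pinned to a core and never sleeps: instead of
    // waiting for an interrupt (and paying a context switch), it repeatedly
    // asks the device which requests have completed.
    while (running.load(std::memory_order_relaxed)) {
        done.clear();
        if (dev.poll_completions(&done) > 0)
            for (const auto& c : done) on_done(c);
        // No sleep by design: a PMD trades one busy core for lower latency.
    }
}
```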

Next, let's take a look at the specific implementation of ZBS's User Space storage engine.

The IO Scheduler is responsible for receiving IO requests from the upper layer, building a Transaction, and submitting it to a designated IO Worker. The IO Worker executes the Transaction. The Journal module persists Transactions to disk and recycles journal space. The Performance Tier and Capacity Tier are responsible for managing the free space on their disks and for persisting data to the corresponding disks.
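
Based only on that description, the flow between the modules could be sketched roughly as follows. Every type here is a hypothetical stand-in for the modules named above, and the exact ordering (journal first, then write to a tier) is an assumption made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the modules described above; not ZBS source code.
struct IoRequest   { uint64_t offset; uint32_t length; bool is_write; };
struct Transaction { std::vector<IoRequest> ops; };

struct Journal {
    void persist(const Transaction&) { /* append the transaction record to disk */ }
    void recycle()                   { /* reclaim journal space once applied    */ }
};

struct Tier {                                  // Performance (SSD) or Capacity (HDD) tier
    uint64_t next_free = 0;
    uint64_t allocate(uint32_t length) {       // manage free space on this tier's disks
        uint64_t addr = next_free; next_free += length; return addr;
    }
    void write(uint64_t /*addr*/, const IoRequest&) { /* persist data to the disk */ }
};

struct IoWorker {
    Journal* journal;
    Tier* tier;
    void execute(const Transaction& txn) {
        journal->persist(txn);                 // 1. make the transaction durable
        for (const auto& op : txn.ops)
            tier->write(tier->allocate(op.length), op);   // 2. place the data
        journal->recycle();                    // 3. journal space can be reused
    }
};

struct IoScheduler {
    std::vector<IoWorker>* workers;
    void submit(const std::vector<IoRequest>& reqs) {
        Transaction txn{reqs};                 // build a Transaction from upper-layer IO
        (*workers)[0].execute(txn);            // hand it to a chosen IO Worker
    }
};
```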

At present, the SmartX R&D team is developing the next-generation User Space storage engine. Interested readers are welcome to discuss in the comments section, and you are also welcome to send your resume to jobs@smartx.com.

For more information, you can also visit SmartX's official website: https://www.smartx.com
