UFS (UCloud File System) is a distributed file storage product developed in-house by UCloud. The previously launched capacity-based UFS, with its flexible online expansion and proven stability and reliability, provides a shared storage solution for many public cloud, bare-metal and managed cloud users, and a single file system can scale to 100 PB.
To serve IO-intensive scenarios with strict performance requirements, such as data analysis, AI training and high-performance web services, the UFS team has also launched a performance-based UFS built on NVMe SSD media to meet the demand for shared storage under high IO load. Its 4K random write latency stays below 10 ms, and its 4K random read latency stays below 5 ms.
The performance gains come not only from the upgraded storage media but also from architectural improvements. This article describes the technical details of the performance-based UFS redesign from the perspectives of protocol, indexing and storage design.
Protocol improvement
The capacity-based UFS supported only NFSv3, chosen because its stateless interfaces keep failure-recovery logic simple, and because NFSv3 is widely supported on both Linux and Windows, making cross-platform use easy. However, the extra latency caused by NFSv3's design shortcomings is unacceptable in high-IO scenarios, so the performance-based UFS supports only the better-performing, more modern NFSv4 protocol.
Compared with NFSv3, NFSv4 adds features such as stateful lock semantics and the COMPOUND mechanism for batching operations. In particular, COMPOUND allows multiple NFS operations to be completed in a single round trip (RTT), which addresses the inefficiency of NFSv3. A typical open-for-write sequence looks like this on NFSv3 and NFSv4, respectively:
[Figure: NFSv3 vs. NFSv4 interaction sequence for an open-for-write operation]
As the figure shows, on the critical IO path NFSv4 needs only half as many interactions as NFSv3, which significantly reduces IO latency. Beyond the protocol, the core of the performance-based UFS consists of two parts: the business index and the underlying storage. Because the underlying IO is now much faster, both parts required deep modification to adapt to this change. The following sections describe how each was reworked.
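To make the COMPOUND idea concrete, here is a minimal conceptual sketch in Go. It is not the real NFS wire format or the UFS implementation; it only illustrates why batching several operations into one request saves round trips. The op names are standard NFSv4 operations, but the execution model is deliberately simplified.

```go
// Conceptual sketch of NFSv4 COMPOUND semantics: the client packs several
// operations into one request, and the server executes them in order,
// stopping at the first failure, so the whole sequence costs one network
// round trip instead of one RTT per operation (as NFSv3 would need).
package main

import (
	"errors"
	"fmt"
)

type op struct {
	name string
	run  func() error
}

// execCompound executes the batched ops in order; processing stops at the
// first op that fails, mirroring COMPOUND's stop-on-error behaviour.
func execCompound(ops []op) (done int, err error) {
	for i, o := range ops {
		if err := o.run(); err != nil {
			return i, fmt.Errorf("%s failed: %w", o.name, err)
		}
	}
	return len(ops), nil
}

func main() {
	ok := func() error { return nil }
	openForWrite := []op{
		{"PUTFH", ok}, {"OPEN", ok}, {"WRITE", ok}, {"GETATTR", ok},
	}
	n, err := execCompound(openForWrite) // one RTT covers all four ops
	fmt.Println(n, err)

	bad := []op{{"PUTFH", ok}, {"OPEN", func() error { return errors.New("ENOENT") }}, {"WRITE", ok}}
	n, err = execCompound(bad)
	fmt.Println(n, err) // stops at OPEN; WRITE never runs
}
```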
Business index
The index service is one of the core components of a distributed file system. File storage must provide richer indexing semantics than other storage services such as object storage, so the index has a larger impact on performance.
The functional modules of the index service mirror the design of a single-machine file system and are divided into two parts:
Directory index: implements the tree-structured directory hierarchy, recording the file and subdirectory entries in each directory
File index: records file metadata, including block storage information, access permissions, etc.
Each module of the index service has a clear role, and together they mainly solve two classes of problems:
Business requirements: besides implementing the operations required by file-system semantics, the service must guarantee the external consistency of the index data, so that under all kinds of concurrency no racing modification of the index leads to data loss or corruption.
Distributed-system requirements: scalability, reliability and related concerns, so that the system can handle all kinds of node and data failures while remaining highly available and elastic.
Although their functions differ, the directory index and the file index are similar in architecture, so only the file index (FileIdx) architecture is described below. Guided by the goals above, FileIdx adopts a stateless design and relies on a lease mechanism between the index nodes and the master to manage nodes, achieving fault tolerance and elasticity.
Lease mechanism and pessimistic lock
The master module maintains a routing table, which can be understood as a consistent hash ring made up of virtual nodes, with each FileIdx instance responsible for a subset of those virtual nodes. The master checks the liveness of each instance node through heartbeats, and uses the lease mechanism to tell the FileIdx instances and each NFSServer which instance handles which virtual nodes. If a FileIdx instance fails, the master only needs to reassign the virtual nodes that instance owned to other instances once the current lease expires.
When an NFSServer needs to request a specific operation from the file index service (for example, allocating an IO block), it hashes the file handle involved in the request to determine which FileIdx instance owns the file's virtual node and sends the request to that instance. On each instance, a processing queue is maintained per file handle and drained in FIFO order, which in effect forms a pessimistic lock: when operations on one file become highly concurrent, they queue up on a specific node and queue, minimizing the conflicts caused by concurrent modification.
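The following is a minimal Go sketch of this routing and queuing, under assumptions not spelled out in the article (the ring size, instance names and queue depth are illustrative, and lease handling is omitted): file handles hash onto virtual nodes, the master's routing table maps virtual nodes to FileIdx instances, and each file handle gets its own FIFO queue so operations on one file are serialized, which is the pessimistic-lock behaviour described above.

```go
// Routing table (virtual-node ring) plus a per-handle FIFO queue.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numVNodes = 1024 // size of the virtual-node ring (illustrative)

// routingTable is the master's view: virtual node -> FileIdx instance ID.
type routingTable struct {
	mu    sync.RWMutex
	owner [numVNodes]string
}

// lookup hashes a file handle onto a virtual node and returns its owner.
func (rt *routingTable) lookup(handle string) string {
	h := fnv.New32a()
	h.Write([]byte(handle))
	vnode := h.Sum32() % numVNodes
	rt.mu.RLock()
	defer rt.mu.RUnlock()
	return rt.owner[vnode]
}

// fileQueue serializes operations on one file handle in FIFO order.
type fileQueue struct{ ops chan func() }

func newFileQueue() *fileQueue {
	q := &fileQueue{ops: make(chan func(), 128)}
	go func() {
		for op := range q.ops { // one worker per handle => strict FIFO
			op()
		}
	}()
	return q
}

func main() {
	rt := &routingTable{}
	for i := range rt.owner {
		rt.owner[i] = fmt.Sprintf("fileidx-%d", i%3) // three FileIdx instances share the ring
	}
	fmt.Println("handle 0x1234 is served by", rt.lookup("0x1234"))

	q := newFileQueue()
	done := make(chan struct{})
	q.ops <- func() { fmt.Println("op 1: allocate IO block") }
	q.ops <- func() { fmt.Println("op 2: update file metadata"); close(done) }
	<-done // both ops ran in submission order on this handle's queue
}
```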
Update protection
Although the lease mechanism largely guarantees the safety of concurrent file-index operations, in extreme cases leases alone cannot maintain absolute mutual exclusion and ordering of concurrent operations. We therefore apply CAS and MVCC techniques when updating the index in the index database, ensuring that concurrent updates cannot break the external consistency of the index data.
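Here is a minimal sketch of the version-check (CAS-style) update idea; the record layout and the in-memory store are stand-ins for the real index database, whose schema the article does not detail. An update succeeds only if the version it read is still the one stored, so a writer that lost the race must re-read and retry.

```go
// Compare-and-swap on a versioned index record.
package main

import (
	"fmt"
	"sync"
)

type fileMeta struct {
	version int64
	size    int64
}

type indexStore struct {
	mu      sync.Mutex
	records map[string]fileMeta
}

// compareAndSwap installs next only if the stored version still equals
// expectedVersion; otherwise the caller must re-read and retry.
func (s *indexStore) compareAndSwap(key string, expectedVersion int64, next fileMeta) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.records[key]
	if !ok || cur.version != expectedVersion {
		return false
	}
	next.version = expectedVersion + 1
	s.records[key] = next
	return true
}

func main() {
	s := &indexStore{records: map[string]fileMeta{"fh1": {version: 1, size: 0}}}

	// Two writers both read version 1; only the first CAS wins.
	fmt.Println(s.compareAndSwap("fh1", 1, fileMeta{size: 4096})) // true
	fmt.Println(s.compareAndSwap("fh1", 1, fileMeta{size: 8192})) // false: stale version
	fmt.Println(s.records["fh1"])                                 // version 2, size 4096
}
```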
IO block allocation optimization
In the performance-based UFS, the sharp drop in the IO latency of the underlying storage brings higher IOPS and throughput, which in turn challenges the allocation performance of the index modules, IO blocks in particular. Frequent IO-block requests make the index contribute a larger share of the latency on the whole IO path, hurting performance. On the one hand, we separate index reads from writes and introduce caching and batched updates to speed up individual IO-block allocation.
At the same time, we enlarge the IO block size: larger IO blocks reduce how often data blocks must be allocated and fetched, amortizing the allocation overhead. Later we will make the key index operations asynchronous, removing IO-block allocation from the IO critical path to minimize the impact of index operations on IO performance.
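A small sketch of the batching idea follows, with hypothetical names (blockAllocator, batchSize) that are not from the article: one request to the index service returns a batch of block IDs that are then handed out locally, so the allocation round trip is amortized across many IOs.

```go
// Local block-ID cache refilled in batches from the index service.
package main

import "fmt"

const batchSize = 64 // block IDs fetched per index request (illustrative)

type blockAllocator struct {
	next  uint64   // simulates the index service's ID counter
	cache []uint64 // locally cached, not-yet-used block IDs
}

// refill models one request to the index service that returns batchSize IDs.
func (a *blockAllocator) refill() {
	for i := 0; i < batchSize; i++ {
		a.cache = append(a.cache, a.next)
		a.next++
	}
}

// allocate returns a block ID from the local cache, refilling only when empty.
func (a *blockAllocator) allocate() uint64 {
	if len(a.cache) == 0 {
		a.refill()
	}
	id := a.cache[0]
	a.cache = a.cache[1:]
	return id
}

func main() {
	a := &blockAllocator{}
	for i := 0; i < 3; i++ {
		fmt.Println("allocated block", a.allocate()) // one refill serves 64 allocations
	}
}
```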
Underlying storage
Design concept
The storage function is the heart of a storage system, and its design and implementation determine the system's ultimate performance and stability. After analyzing UFS's requirements for data storage and data operations, we concluded that the underlying storage (named nebula) should meet the following requirements:
Simple: a simple, understandable system is easier to maintain later on
Reliable: distributed requirements such as high availability and high reliability must be guaranteed
Easy to expand: cluster expansion, data rebalancing and similar operations must be handled
Support random IO: so that high-performance storage media can be fully utilized
Nebula: append-only and centralized index
Based on these goals, we designed the underlying storage system nebula as an append-only (immutable) store. Append-only writing keeps the storage logic simple and effectively reduces the fault-tolerance complexity of keeping multi-replica data consistent. More importantly, because appending is essentially a log-style record, the full history of IO is preserved; data snapshots and rollback are easy to build on top of it, and data recovery is easier when failures occur.
Existing storage systems can be divided, by how data is addressed, into decentralized and centralized-index designs; typical representatives of the two are Ceph and Google File System. A decentralized design removes the index as a point of failure and avoids data-addressing overhead, but it complicates functions such as data migration and data-placement management. In keeping with our goal of a simple, reliable system, we ultimately chose the centralized-index design, which also makes scale-out operations such as cluster expansion easier.
Block Management: extent-based concept
The performance bottleneck of a centralized index lies mainly in data-block allocation, and single-machine file systems offer a useful comparison here. The inodes of early file systems managed data blocks in a block-based fashion: every IO applied for blocks to write, and the typical block size was 4KB. This leads to two problems:
1. 4KB blocks are small, so large writes require frequent block-allocation operations, which works against sequential IO.
2. Representing a large file block by block consumes a lot of inode metadata space and also limits the maximum file size that can be represented.
More advanced file systems such as Ext4 and XFS instead design the inode to be extent-based. An extent is no longer bound to a fixed block size; it can describe a variable-length range of disk space, as shown in the following figure:
Clearly, this lets an IO obtain larger and more contiguous disk space, which helps exploit the sequential-write capability of disks and greatly reduces block-allocation overhead, improving IO performance; just as importantly, it combines well with an append-only storage system. The extent-based idea appears not only in single-machine file systems but also in distributed systems such as Google File System and Windows Azure Storage. Our nebula is likewise modeled on this concept.
Storage architecture
Stream data flow
Data in the nebula system is organized into streams. Each stream is a data flow composed of one or more extents; every write to a stream appends blocks to its last extent, and only the last extent may be written. Block lengths vary and can be chosen by the upper-layer business to suit its scenario. Each extent logically forms a replica group, which is physically maintained as multiple replicas on the storage nodes according to the redundancy policy. The IO model of a stream is as follows:
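A minimal sketch of this stream model, assuming simple in-memory structures that stand in for the real on-disk ones: a stream is a list of extents, each write appends a variable-length block to the last extent, and only the last extent is writable while older extents are sealed.

```go
// Stream = ordered extents; appends go only to the last, unsealed extent.
package main

import (
	"errors"
	"fmt"
)

type extent struct {
	sealed bool
	blocks [][]byte // variable-length blocks, append-only
}

type stream struct {
	extents []*extent
}

// appendBlock writes one block to the stream's last extent.
func (s *stream) appendBlock(data []byte) error {
	if len(s.extents) == 0 {
		s.extents = append(s.extents, &extent{})
	}
	last := s.extents[len(s.extents)-1]
	if last.sealed {
		return errors.New("last extent is sealed; a new extent must be created")
	}
	last.blocks = append(last.blocks, data)
	return nil
}

// sealAndExtend seals the current extent and opens a new one, e.g. when the
// extent is full or its replica group becomes unhealthy.
func (s *stream) sealAndExtend() {
	if n := len(s.extents); n > 0 {
		s.extents[n-1].sealed = true
	}
	s.extents = append(s.extents, &extent{})
}

func main() {
	var st stream
	_ = st.appendBlock([]byte("block-1"))
	st.sealAndExtend()
	_ = st.appendBlock([]byte("block-2")) // lands in the new, writable extent
	fmt.Println("extents:", len(st.extents))
}
```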
Streamsvr and extentsvr
Based on this model, the storage system is divided into two main modules:
streamsvr: maintains the mapping from each stream to its extents and metadata such as the replica locations of each extent, and is responsible for data scheduling, balancing, etc.
extentsvr: one extentsvr service process per disk; it stores the actual extent data, handles IO requests from the front end, and performs multi-replica operations and repair of extent data
Within a storage cluster, all disks are presented as one large storage pool through the extentsvrs. When a request to create an extent arrives, streamsvr uses its global view of the cluster to pick the extentsvrs that will hold the extent's replicas, weighing load, data balance and other factors; the IO requests themselves are then completed by the client talking directly to those extentsvr nodes. When a storage node fails, the client only needs to seal the extent currently being written and create a new extent to continue writing, so node failover completes within the latency of a single RPC to streamsvr. This is another reflection of the simplicity that the append-only design brings to the system.
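The failover path can be sketched as follows, with hypothetical interfaces (streamSvr, extentClient, writeWithFailover) that only stand in for the real RPCs: when an append to the current extent fails (for example, a replica node is down), the client asks streamsvr to seal that extent and allocate a fresh one, then retries the write there, so no data repair sits on the IO path.

```go
// Seal the broken extent, create a new one, retry the append.
package main

import (
	"errors"
	"fmt"
)

type streamSvr interface {
	SealExtent(extentID uint64) error
	CreateExtent(streamID uint64) (newExtentID uint64, err error)
}

type extentClient interface {
	Append(extentID uint64, data []byte) error
}

// writeWithFailover performs one append, switching to a new extent on failure.
func writeWithFailover(ss streamSvr, ec extentClient, streamID, extentID uint64, data []byte) (uint64, error) {
	if err := ec.Append(extentID, data); err == nil {
		return extentID, nil
	}
	// Current extent's replica group is unhealthy: seal it and move on.
	if err := ss.SealExtent(extentID); err != nil {
		return 0, err
	}
	newID, err := ss.CreateExtent(streamID)
	if err != nil {
		return 0, err
	}
	if err := ec.Append(newID, data); err != nil {
		return 0, fmt.Errorf("retry on new extent failed: %w", err)
	}
	return newID, nil
}

// Toy fakes so the sketch compiles and runs.
type fakeSS struct{ next uint64 }

func (f *fakeSS) SealExtent(uint64) error             { return nil }
func (f *fakeSS) CreateExtent(uint64) (uint64, error) { f.next++; return f.next, nil }

type fakeEC struct{ failOn uint64 }

func (f *fakeEC) Append(id uint64, _ []byte) error {
	if id == f.failOn {
		return errors.New("replica node down")
	}
	return nil
}

func main() {
	id, err := writeWithFailover(&fakeSS{next: 10}, &fakeEC{failOn: 7}, 1, 7, []byte("data"))
	fmt.Println(id, err) // 11 <nil>: the write landed on the newly created extent
}
```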
As a result, the architecture of each module in the storage layer is as follows:
At this point, data can be written to the extentsvr nodes through the cooperation of these modules. How the data is laid out on a specific disk is the job of the single-disk storage engine.
Single disk storage engine
The storage architecture above describes how the work of an IO is divided up inside the storage layer. To guarantee the performance of the performance-based UFS, we also made several optimizations in the single-disk storage engine.
Thread model optimization
The leap in storage-media performance places new demands on storage-engine design. On the SATA media of the capacity-based UFS, disk throughput is low and latency is high; a storage machine's overall throughput is bounded by disk throughput, and a single-threaded / single-process service can already saturate the disks. As the media's processing capability improves, the IO bottleneck gradually shifts from the disks to the processor and network bandwidth.
NVMe SSDs use a multi-queue parallel design, so a single-threaded model can no longer exploit the disk's performance, and system interrupts and NIC interrupts become new CPU bottlenecks. We therefore moved the service to a multi-threaded model to take full advantage of the parallelism of the media's multiple queues. To this end we rewrote the programming framework: the new framework adopts a one-loop-per-thread threading model and lock-free designs to extract the most performance from the disks.
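The following is a minimal sketch of the one-loop-per-thread idea, not the team's actual framework: a fixed pool of workers, each with its own request queue, with requests sharded onto a worker by hashing the target (here, an extent ID), so a worker never shares hot-path state with another and no locks are needed while handling requests.

```go
// One loop per worker; requests are sharded by extent ID, so each loop
// owns its data exclusively and the hot path needs no locking.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type request struct {
	extentID string
	work     func()
}

type eventLoopPool struct {
	queues []chan request
	wg     sync.WaitGroup
}

func newPool(nLoops int) *eventLoopPool {
	p := &eventLoopPool{queues: make([]chan request, nLoops)}
	for i := range p.queues {
		q := make(chan request, 256)
		p.queues[i] = q
		p.wg.Add(1)
		go func() { // one loop per "thread": drains only its own queue
			defer p.wg.Done()
			for req := range q {
				req.work()
			}
		}()
	}
	return p
}

// submit routes a request to a fixed loop chosen by hashing its extent ID,
// so all IO for one extent is handled by the same loop.
func (p *eventLoopPool) submit(r request) {
	h := fnv.New32a()
	h.Write([]byte(r.extentID))
	p.queues[h.Sum32()%uint32(len(p.queues))] <- r
}

func (p *eventLoopPool) close() {
	for _, q := range p.queues {
		close(q)
	}
	p.wg.Wait()
}

func main() {
	pool := newPool(4)
	for i := 0; i < 8; i++ {
		id := fmt.Sprintf("extent-%d", i)
		pool.submit(request{extentID: id, work: func() { fmt.Println("handled", id) }})
	}
	pool.close()
}
```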
Block addressing
Consider a question: after a client writes a data block, how is the block's data located when it is read back? One approach is to assign each block a unique blockid and address it through a two-level index translation:
Level 1: query streamsvr to find which extent the blockid belongs to
Level 2: locate a replica of that extent, look up the blockid's offset within the extent, and read the data
This implementation faces two problems: (1) the first-level translation forces streamsvr to record a huge number of index entries, and the extra query interaction adds IO latency and lowers performance; (2) the second-level translation is typical of Facebook's Haystack system, where each extent is a separate file on a local file system, and extentsvr records each block's offset within the extent file, loading the full index into memory at startup to reduce query cost. In a multi-threaded framework, querying that index requires mutual exclusion, which inevitably adds query latency and is therefore undesirable in high-performance scenarios. Moreover, going through a file system makes the IO path of the whole storage stack too long, makes performance tuning harder to control, and gets in the way of introducing SPDK.
To avoid these drawbacks, our storage engine runs on the raw disk, dividing a physical disk into several core regions:
Superblock: records the segment size, the start position of the segment area, the positions of the other index blocks, etc.
Segment: the unit of data allocation; apart from the superblock and the index regions below, the rest of the disk is segment space. Each segment has a fixed length (128MB by default), and each extent consists of one or more segments.
Extent index / segment meta region: records the list of segments belonging to each extent, as well as the status of each segment (e.g. whether it is available) and other information
With this design, block addressing becomes pure computation with no lookup. When a block is written, the block's offset within the entire stream is returned; when the client later requests the block, it only needs to pass that offset to the extentsvr. Because segments have a fixed length, the extentsvr can easily compute where the offset lands on disk and read the data there, eliminating the query overhead of data addressing.
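A sketch of this "addressing by pure computation" idea, under the simplified layout described above (the segment-area start position and the segment slot numbers are illustrative): segments are fixed-size slots laid out after the superblock, an extent's index is just the ordered list of slots it owns, and an offset within the extent maps to a disk position with one division and one modulo, with no index query.

```go
// Offset -> disk position via fixed-size segment arithmetic.
package main

import "fmt"

const (
	segmentSize  = 128 << 20 // 128 MB, the default fixed segment length
	segmentStart = 4096      // disk byte where the segment area begins (illustrative)
)

// extentSegments lists, in order, which physical segment slots an extent uses.
type extentSegments []int64

// diskOffset converts an offset inside the extent into an absolute disk offset.
func diskOffset(segs extentSegments, offsetInExtent int64) (int64, error) {
	idx := offsetInExtent / segmentSize   // which segment of the extent
	intra := offsetInExtent % segmentSize // position inside that segment
	if idx >= int64(len(segs)) {
		return 0, fmt.Errorf("offset %d is beyond the extent", offsetInExtent)
	}
	return segmentStart + segs[idx]*segmentSize + intra, nil
}

func main() {
	// An extent built from physical segment slots 3 and 7 on this disk.
	ext := extentSegments{3, 7}
	pos, err := diskOffset(ext, int64(segmentSize)+512) // 512 bytes into the 2nd segment
	fmt.Println(pos, err)                               // segmentStart + 7*segmentSize + 512
}
```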
Random IO support: FileLayer middle tier
We made the storage system append-only for the sake of simplicity and reliability, but the nature of file storage means we must also support random IO scenarios such as overwriting.
We therefore introduce a middle layer, FileLayer, to support random IO, implementing random writes on top of the append-only engine in the spirit of a Log-Structured File System; the LSM-Tree used by LevelDB and the FTL inside SSD controllers take similar approaches. Overwritten data is superseded only indirectly, at the index level, rather than being overwritten in place or copied (COW). This achieves overwriting at low cost while preserving the simplicity of append-only writes underneath.
In FileLayer, the unit on which IO operations take place is called a dataunit; the block touched by each read or write is processed on some dataunit. Logically, a dataunit consists of the following parts:
A dataunit is composed of multiple segments (note: not the same concept as the underlying storage's segments). Because an LSM-Tree-style design eventually needs compaction, splitting a dataunit into multiple segments is analogous to LevelDB's multi-level SSTs: the lower segments are read-only and only the topmost segment accepts writes. This makes compaction easier and safer to execute, and even to roll back, and because the data range involved in each compaction is well defined, it is also easy to verify the compaction invariant: the valid data within the range must be identical before and after collection.
Each segment consists of an index stream and a data stream, both stored on the underlying storage system nebula. Every IO write synchronously writes the data stream; to improve IO performance, the index stream is written asynchronously, and a pure in-memory index is maintained to speed up queries. To make this safe, every piece of data written to the data stream is self-contained, meaning that even if the index stream is missing data or is corrupted, the entire index can be rebuilt from the data stream alone.
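A sketch of what "self-contained" can look like, with a hypothetical record layout not taken from the article: every record written to the data stream carries its own header (fid, offset within the file, payload length), so scanning the data stream alone is enough to rebuild the in-memory index if the asynchronously written index stream is incomplete.

```go
// Self-describing data-stream records and index rebuild by scanning them.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type recordHeader struct {
	Fid    uint64
	Offset uint64 // offset of this payload within the file identified by Fid
	Length uint32
}

// appendRecord writes header+payload to the data stream.
func appendRecord(w io.Writer, h recordHeader, payload []byte) error {
	if err := binary.Write(w, binary.LittleEndian, h); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// rebuildIndex scans the data stream and reconstructs fid -> list of records.
func rebuildIndex(r io.Reader) (map[uint64][]recordHeader, error) {
	idx := make(map[uint64][]recordHeader)
	for {
		var h recordHeader
		if err := binary.Read(r, binary.LittleEndian, &h); err == io.EOF {
			return idx, nil
		} else if err != nil {
			return nil, err
		}
		// Skip the payload; only the metadata is needed to rebuild the index.
		if _, err := io.CopyN(io.Discard, r, int64(h.Length)); err != nil {
			return nil, err
		}
		idx[h.Fid] = append(idx[h.Fid], h)
	}
}

func main() {
	var stream bytes.Buffer
	_ = appendRecord(&stream, recordHeader{Fid: 42, Offset: 0, Length: 5}, []byte("hello"))
	_ = appendRecord(&stream, recordHeader{Fid: 42, Offset: 5, Length: 5}, []byte("world"))

	idx, _ := rebuildIndex(&stream)
	fmt.Println(idx[42]) // two records for fid 42, recovered without any index stream
}
```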
Clients write to a dataunit at file granularity, and the dataunit assigns each file a globally unique fid. The fid serves as the data handle stored in the business index (as FileIdx's block handle).
Each dataunit is served by a fileserver service process, and one fileserver can host multiple dataunits; a coordinator schedules and recovers dataunits across the instances according to the load on each node. The overall architecture of FileLayer is as follows:
With that, the storage system meets our file-storage requirements as designed. Let's now look at how the modules work together to complete one file IO.
The Big Picture: the whole process of writing IO in one file
At a high level, writing one IO to a file proceeds roughly as follows:
① A user initiates an IO operation on the host; it is intercepted by nfs-client at the VFS layer in the kernel (taking Linux as the example) and sent over the isolated VPC network to the access layer of the UFS service.
② The access layer parses and translates the NFS protocol, decomposing the operation into index operations and data operations.
③ The indexing module converts the byte range of the file touched by this operation into a range expressed in file-system blocks (fixed size, 4MB by default); a sketch of this split appears after the walkthrough.
④ The NFSServer obtains the bids of the blocks to operate on and issues the IO to FileLayer (each bid corresponds to one file inside FileLayer).
⑤ The NFSServer sends the request to the fileserver responsible for the file identified by the bid. The fileserver extracts the dataunit number encoded in the bid, appends the data to that dataunit's current data stream, then updates the index to record where the new data landed; at that point the IO is complete and a response is returned to the NFSServer. Meanwhile, when the append IO issued by the fileserver reaches the owning extentsvrs, each extentsvr determines the on-disk location of the stream's last extent, appends the data to disk, and returns once multi-replica synchronization completes.
At this point, one write IO to the file is complete.
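As promised in step ③, here is a sketch of the block split, assuming only the 4MB default stated above (the names splitIntoBlocks and blockRange are illustrative): the file's byte range is cut into fixed-size file-system blocks, producing for each block its index within the file plus the intra-block offset and length that this IO touches; each block index is then resolved to a bid by the index service, which is not modeled here.

```go
// Split a (fileOffset, length) IO into per-4MB-block ranges.
package main

import "fmt"

const fsBlockSize = int64(4 << 20) // 4 MB default file-system block size

type blockRange struct {
	BlockIndex  int64 // which 4 MB block of the file
	InnerOffset int64 // where the IO starts inside that block
	Length      int64 // how many bytes of that block this IO covers
}

// splitIntoBlocks converts a byte range of the file into per-block ranges.
func splitIntoBlocks(fileOffset, length int64) []blockRange {
	var out []blockRange
	for length > 0 {
		idx := fileOffset / fsBlockSize
		inner := fileOffset % fsBlockSize
		n := fsBlockSize - inner // bytes left in this block
		if n > length {
			n = length
		}
		out = append(out, blockRange{BlockIndex: idx, InnerOffset: inner, Length: n})
		fileOffset += n
		length -= n
	}
	return out
}

func main() {
	// A 6 MB write starting 3 MB into the file spans blocks 0, 1 and 2.
	for _, br := range splitIntoBlocks(3<<20, 6<<20) {
		fmt.Printf("block %d: offset %d, length %d\n", br.BlockIndex, br.InnerOffset, br.Length)
	}
}
```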
Performance data
After the above design and optimization, the actual performance data of the performance-based UFS are as follows:
Summary
Starting from the requirements of the performance-based UFS product, this article has described in detail the design decisions and optimizations made in the protocol, business architecture, storage engine and other areas when building a distributed file system on high-performance storage media, and how these optimizations landed in the final product. The launch of the performance-based UFS broadens the product line, and business scenarios with demanding IO-latency requirements, such as big-data analysis and AI training, will be better served.
Going forward, we will keep improving the UFS experience on several fronts: the product will add SMB protocol support to improve file storage for Windows hosts; the underlying storage will introduce technologies such as SPDK and RDMA together with even higher-performance storage media; and Erasure Coding and similar techniques will be introduced for cold-storage scenarios, so that users can enjoy the performance and price dividends of more advanced technology.
Latest offer: the performance-based UFS is originally priced at 1.0 RMB/GB/month; it is now discounted to 0.6 RMB/GB/month in the Fujian availability zone and 0.8 RMB/GB/month in other domestic availability zones. You are welcome to contact your account manager to apply for a trial.
If you have questions about this article, feel free to contact the author. WeChat ID: cheneydeng