
The Design and Optimization of Distributed Storage

2025-03-26 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/03 Report --

As informatization deepens, global data volumes grow by the day. Faced with today's PB-scale storage requirements, traditional storage systems hit bottlenecks in scaling both capacity and performance. Cloud storage has won broad industry recognition for its strong scalability, high price-performance ratio, and good fault tolerance, and many forward-looking enterprises treat it as their first step into cloud computing. As key technologies within cloud storage, the distributed file system and distributed block storage have become important cornerstones of cloud storage's development.

Most IT practitioners who focus on cloud computing itself may not have an in-depth understanding of distributed file systems and distributed block storage. To that end, at UCan afternoon Tea - Wuhan Station, we invited technical experts in distributed file systems, distributed block storage, and cloud storage to talk about distributed storage.

Analysis of Distributed File System Product Architecture -- Deng Jin, UCloud

Distributed storage products are essential infrastructure for all kinds of business. Understanding their design ideas and usage scenarios lets users build their own business logic on top of them more effectively. Focusing on the design concept and development practice of UFS, UCloud's distributed file system, Deng Jin, a file storage R&D engineer at UCloud, shared how UFS meets the storage demands of diverse businesses, how it resolves the limitations of the previous generation of the product, and how it avoids the bottlenecks seen in comparable open-source products.

Deng Jin believes a distributed file system is an extension of the traditional file system: through distributed technology and the scale effect of the public cloud, users gain capabilities a traditional file system cannot offer: 1) scale-out: linear or near-linear growth in capacity and performance; 2) fault tolerance: hardware failures are masked, improving data reliability and system availability; 3) lower TCO and pay-as-you-go pricing: a feature unique to cloud computing products, giving application-layer users a lower total cost of ownership.

UFS (UCloud File System) is a highly available, highly reliable file storage service developed by UCloud for public cloud business. Early in the design, the R&D team used the open-source software GlusterFS to quickly validate a product prototype in the public cloud environment, but in operation GlusterFS showed many pain points in a multi-tenant public cloud: scalability bottlenecks (high peering overhead), a limited node count, no multi-cluster or grayscale management, index operations that easily generate heavy IO and hurt data-operation performance, and very poor performance for small-file access and large-directory operations. Given these problems, UCloud ultimately decided to design and build its own product.

Targeting the pain points observed while operating the open-source solution, UCloud first separated the index from the data and defined its own index structure and semantics, making it easier to later add non-NFS protocols; it then independently designed storage services supporting set management, grayscale releases, and similar strategies; in addition, the design supports large directories with millions of entries and TB+ file sizes, and uses QoS to isolate user access in multi-tenant scenarios; finally, data encryption and sharding strategies guarantee data security. The figure below shows the solution architecture of UFS 1.0.

In general, a mature architecture goes through a cycle of finding problems -> transforming in practice -> finding new problems -> redesigning and upgrading; through constant iteration it eventually stabilizes, and the UFS architecture is no exception. In operation, the UCloud storage R&D team found that UFS 1.0 still had limitations: the storage model suited small-shard scenarios and was not general enough, the fixed-size underlying storage shards wasted some space, the storage layer supported only a limited number of files, and random-write support was weak. The team therefore carried out a new round of architecture upgrades on top of UFS 1.0.

The new architecture optimizes the storage layer with an append-only model. As shown in the figure below, a Stream represents a file stream and can only be appended to at the tail; an Extent is a data shard within a stream, its size is not fixed, and each extent lands on disk as a file. In the data layer, streamsvr maintains the index and routing information for streams and extents, while extentsvr maintains the block index inside each extent and serves read and write requests from directly connected clients.
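
To make the stream/extent model concrete, the following is a minimal Go sketch of how such an append-only index might be modeled. The type names follow the talk's terminology (Stream, Extent, with streamsvr persisting the index), but the fields, the sealing rule, and the size threshold are illustrative assumptions, not UFS internals.

```go
// A minimal sketch of the append-only stream/extent model described above.
package main

import "fmt"

// Extent is one variable-sized data shard of a stream; each extent lands on
// disk as a single file on an extentsvr node.
type Extent struct {
	ID     uint64
	Length uint64 // bytes written so far
	Sealed bool   // sealed extents are immutable, which simplifies replication
}

// Stream is a file stream: an ordered list of extents, appendable only at the tail.
type Stream struct {
	Name    string
	Extents []Extent
}

// Append writes data at the tail of the stream, opening a new extent when the
// current one is sealed (streamsvr would persist this index/routing change).
func (s *Stream) Append(data []byte, maxExtent uint64) {
	n := len(s.Extents)
	if n == 0 || s.Extents[n-1].Sealed {
		s.Extents = append(s.Extents, Extent{ID: uint64(n)})
		n++
	}
	e := &s.Extents[n-1]
	e.Length += uint64(len(data))
	if e.Length >= maxExtent {
		e.Sealed = true
	}
}

func main() {
	s := &Stream{Name: "demo"}
	for i := 0; i < 5; i++ {
		s.Append(make([]byte, 4096), 8192)
	}
	fmt.Printf("%d extents, tail sealed=%v\n", len(s.Extents), s.Extents[len(s.Extents)-1].Sealed)
}
```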

The plug-in engine design reduces write latency spikes and makes full use of the memory buffer to reduce read latency spikes. In addition, to work around the underlying storage engine's unfriendliness to random writes, the system adopts a FileLayer design that caches hot data and relieves pressure on the storage layer.
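
The talk does not detail FileLayer's internals, but the general idea of absorbing hot data in a memory cache in front of an append-only store can be sketched as below; the LRU policy, block granularity, and all names are assumptions for illustration.

```go
// A minimal LRU sketch of a hot-data cache, assuming (hypothetically) that
// blocks are keyed by (stream, block index).
package main

import (
	"container/list"
	"fmt"
)

type blockKey struct {
	stream string
	block  uint64
}

type HotCache struct {
	cap   int
	order *list.List                 // front = most recently used
	items map[blockKey]*list.Element // key -> position in the LRU order
	data  map[blockKey][]byte
}

func NewHotCache(capacity int) *HotCache {
	return &HotCache{cap: capacity, order: list.New(),
		items: map[blockKey]*list.Element{}, data: map[blockKey][]byte{}}
}

// Put absorbs a written block; the coldest block is evicted when full
// (a real hot-data layer would flush evicted blocks to the storage layer).
func (c *HotCache) Put(k blockKey, v []byte) {
	if el, ok := c.items[k]; ok {
		c.order.MoveToFront(el)
		c.data[k] = v
		return
	}
	c.items[k] = c.order.PushFront(k)
	c.data[k] = v
	if c.order.Len() > c.cap {
		old := c.order.Remove(c.order.Back()).(blockKey)
		delete(c.items, old)
		delete(c.data, old)
	}
}

// Get serves a read from memory, keeping hot ranges off the storage layer.
func (c *HotCache) Get(k blockKey) ([]byte, bool) {
	el, ok := c.items[k]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return c.data[k], true
}

func main() {
	c := NewHotCache(2)
	c.Put(blockKey{"s1", 0}, []byte("a"))
	c.Put(blockKey{"s1", 1}, []byte("b"))
	c.Put(blockKey{"s1", 2}, []byte("c")) // evicts ("s1", 0)
	_, hit := c.Get(blockKey{"s1", 0})
	fmt.Println("block 0 still cached:", hit)
}
```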

Data Distribution Algorithms in Distributed Storage -- Li Mingyu, Oz Data

Data distribution algorithms are one of the core technologies of distributed storage. They must balance not only the uniformity of data distribution and the efficiency of addressing, but also the cost of data migration when scaling out or in, and the consistency and availability of replicas. Li Mingyu, founder and CTO of Oz Data, analyzed the strengths and weaknesses of several typical data distribution algorithms and shared problems encountered in implementing them.

Consistent hashing can locate data without table lookups or communication; its computational complexity does not grow with data volume, and it offers high efficiency, good uniformity, and little data migration when nodes are added or removed. In practice, however, its inherent limitations raise many challenges. In the "storage blockchain" scenario, it is nearly impossible to obtain a global view of the system, which is not stable even for a moment; in the enterprise IT scenario, data must be stored reliably in multiple replicas, and the cost of data migration is huge.
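
To make those properties concrete, here is a minimal Go sketch of a consistent hash ring with virtual nodes. It is the generic textbook construction rather than any specific product's implementation; node names and the virtual-node count are arbitrary.

```go
// Consistent hashing: lookup is a local hash plus a binary search over a
// sorted ring, with no table lookup against a central directory.
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

type Ring struct {
	points []uint64          // sorted hash positions of virtual nodes
	owner  map[uint64]string // hash position -> physical node
}

func hash(s string) uint64 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint64]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			h := hash(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, h)
			r.owner[h] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Locate returns the node owning a key: the first virtual node clockwise from
// the key's hash. Adding or removing one node only remaps the keys on the arcs
// adjacent to its virtual nodes, which is why migration stays small.
func (r *Ring) Locate(key string) string {
	h := hash(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"}, 100)
	fmt.Println(ring.Locate("volume-42/chunk-7"))
}
```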

So-called storage blockchain can be understood as distributed (P2P) storage plus blockchain: token incentives encourage people to contribute storage resources and take part in building a worldwide distributed storage system. Because huge numbers of users must be motivated to join spontaneously, this involves addressing and routing across hundreds of millions or even billions of nodes. The main industry solutions today are Chord, Kademlia, and the like. Chord's basic successor-by-successor lookup, however, is inefficient and produces high latency; a finger table, which records not only the node's immediate successor but also the nodes that follow positions n + 2^i around the ring, reduces the lookup complexity and ultimately the latency.
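
The finger-table idea can be sketched as follows on a small m-bit identifier ring. This is a simplified, centralized rendering of Chord's lookup structure (a real Chord node builds and repairs its table through network queries): entry i points at the first node at or after n + 2^i, so each hop at least halves the remaining distance and lookups take O(log N) hops.

```go
// A minimal sketch of a Chord finger table on an m-bit identifier ring.
package main

import (
	"fmt"
	"sort"
)

const m = 16 // identifier bits; real deployments use 160 (SHA-1)

// successor returns the first node ID clockwise from id on the ring.
func successor(nodes []uint64, id uint64) uint64 {
	i := sort.Search(len(nodes), func(i int) bool { return nodes[i] >= id })
	if i == len(nodes) {
		return nodes[0] // wrap around
	}
	return nodes[i]
}

// fingers builds node n's finger table from a sorted list of node IDs:
// entry i is the successor of (n + 2^i) mod 2^m.
func fingers(nodes []uint64, n uint64) []uint64 {
	ft := make([]uint64, m)
	for i := 0; i < m; i++ {
		target := (n + (1 << i)) % (1 << m)
		ft[i] = successor(nodes, target)
	}
	return ft
}

func main() {
	nodes := []uint64{100, 8000, 21000, 40000, 52000} // sorted node IDs
	fmt.Println(fingers(nodes, 100))
}
```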

In the enterprise IT scenario, data distribution algorithms include Dynamo, Ceph's CRUSH, Gluster's Elastic Hashing, and Swift's Ring. These algorithms share similar characteristics. First, they are based on, or borrow from, consistent hashing, so little data migrates when nodes are added or removed. Second, they introduce a model of the data center's physical topology (a Cluster Map) and distribute data replicas or EC fragments across fault domains and availability zones. In addition, they can weight nodes so that data distribution matches capacity and performance, which also assists expansion.

Broadly speaking, both families of schemes build on consistent hashing; different requirements simply drive different directions of improvement. Enterprises care more about distributing replicas across fault domains, while P2P storage cares more about keeping data addressable within a bounded time as nodes leave and join at will.

Architecture Upgrade and Performance Improvement of Cloud Disk -- Ye Heng, UCloud

As the basic storage product of cloud computing, the cloud disk provides highly available, highly reliable, persistent block-level random-access storage for cloud servers, so its performance and data reliability are especially important. Drawing on past operational experience, UCloud redesigned the underlying architecture of its cloud disks over the past year, improving the performance of ordinary cloud disks and adding support for NVMe high-performance storage. Ye Heng, a block storage R&D engineer at UCloud, focused on the architecture upgrade and performance improvement of UCloud's cloud disk.

After analyzing the problems and requirements at this stage, the UCloud storage R&D team first set out the goals of the architecture upgrade: 1) remove the limitation that the original software architecture could not fully exploit the hardware's capabilities; 2) support SSD cloud disks with QoS guarantees, fully exploiting the IOPS and bandwidth of the back-end NVMe physical disks so that a single cloud disk can reach 24K IOPS; 3) support larger cloud disks, 32 TB or more; 4) substantially reduce the IO traffic caused by hotspots; 5) support creating thousands of cloud disks and mounting thousands of cloud disks concurrently; 6) support online migration of old cloud disks to the new architecture, and online migration of ordinary cloud disks to SSD cloud disks.

Guided by these goals, UCloud settled on five directions of transformation: IO path optimization, metadata sharding, thread model design, an anti-overload strategy, and online migration.

IO path optimization: in the old architecture, the IO path had three layers: the Client side on the host, the Proxy side, and the storage Chunk layer. To reduce latency, the optimized scheme splits up the Proxy's functions and hands route acquisition to the Client, so that IO reads and writes go from the Client directly to the storage Chunk layer. The IO path thus becomes two layers, a read IO reaches the back-end storage node in a single network request, and average latency drops by 0.2-1 ms.

Metadata sharding: the old architecture used 1 GB shards, but in special scenarios (for example, when business IO hotspots concentrate in a small range), 1 GB shards make ordinary SATA disks perform very poorly. In the new architecture, UCloud shrank the metadata shards to support data shards of 1 MB, and uses a unified set of rules to compute routes, saving overhead on the IO path and ensuring that metadata distribution and mounting remain unobstructed at 1 MB granularity.
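
A minimal sketch of what rule-based routing over fixed 1 MB shards could look like follows; the hash choice, names, and the plain modulo placement are assumptions for illustration, not UCloud's actual routing rules.

```go
// With a uniform shard size, the client can compute which shard an offset
// falls in and hash (disk, shard) to a storage node directly, avoiding a
// per-IO metadata lookup.
package main

import (
	"fmt"
	"hash/fnv"
)

const shardSize = 1 << 20 // 1 MB shards, per the new architecture

// route maps a (diskID, byte offset) pair to one of the storage nodes.
func route(diskID string, offset uint64, nodes []string) (shard uint64, node string) {
	shard = offset / shardSize // which 1 MB shard the IO touches
	h := fnv.New64a()
	fmt.Fprintf(h, "%s/%d", diskID, shard)
	return shard, nodes[h.Sum64()%uint64(len(nodes))]
}

func main() {
	nodes := []string{"chunk-0", "chunk-1", "chunk-2"}
	shard, node := route("disk-42", 5<<20|4096, nodes) // offset inside the 6th MB
	fmt.Printf("shard %d -> %s\n", shard, node)
}
```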

Thread model design: the traditional architecture used single-threaded transmission, and with a single thread topping out around 60K write IOPS and 80K read IOPS, it could not keep up with the hundreds of thousands of IOPS and 1-2 GB/s of bandwidth of a back-end NVMe disk. To exploit the NVMe disks' performance, UCloud adopted a multi-threaded transmission model and reduced CPU consumption through software optimizations such as the IO path and route acquisition.

Anti-overload strategy: with multiple threads working in parallel, UCloud simulated a scenario in which hotspots concentrate on one thread and found that thread's CPU essentially saturated at 99-100% while the other threads sat idle. To solve this, threads periodically report their CPU and disk load status; when one thread stays busy while another is idle, the IO of some disk shards is switched over to the idle thread to avoid thread overload.
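
The reported mechanism can be sketched roughly as follows; the thresholds, the load-reporting structure, and all names are assumptions rather than UCloud's implementation.

```go
// IOs are normally hashed to a fixed worker thread per disk shard, but shards
// can be reassigned when one worker reports sustained high CPU while another
// sits idle.
package main

import "fmt"

type Balancer struct {
	workers int
	load    []float64      // periodically reported per-worker CPU utilization
	pinned  map[uint64]int // shard -> worker overrides made by rebalancing
}

func NewBalancer(workers int) *Balancer {
	return &Balancer{workers: workers, load: make([]float64, workers),
		pinned: map[uint64]int{}}
}

// WorkerFor picks the worker that handles a shard's IO.
func (b *Balancer) WorkerFor(shard uint64) int {
	if w, ok := b.pinned[shard]; ok {
		return w
	}
	return int(shard % uint64(b.workers)) // default static hashing
}

// Rebalance moves one hot shard from a saturated worker to the idlest one.
func (b *Balancer) Rebalance(hotShard uint64) {
	src := b.WorkerFor(hotShard)
	dst, min := src, b.load[src]
	for w, l := range b.load {
		if l < min {
			dst, min = w, l
		}
	}
	if b.load[src] > 0.99 && min < 0.5 { // one busy thread plus an idle thread
		b.pinned[hotShard] = dst
	}
}

func main() {
	b := NewBalancer(4)
	b.load = []float64{1.0, 0.1, 0.2, 0.1}
	b.Rebalance(8) // shard 8 hashes to worker 0, which is saturated
	fmt.Println("shard 8 now on worker", b.WorkerFor(8))
}
```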

Online migration: ordinary cloud disks performed poorly under the old architecture, and some ordinary-cloud-disk users' businesses were growing fast, so they hoped to migrate to SSD cloud disks to meet higher demands. Facing these requests, UCloud implemented online migration from the periphery of the system, which allowed it to deliver the capability quickly.

Reportedly, compared with ordinary cloud disks, SSD cloud disks deliver 13x the IOPS and 3x the stability, with average latency cut to one tenth. Since the new architecture launched, it has served more than 3,400 cloud disk instances for production users, with 800 TB of total storage and a cluster-wide average of 310,000 IOPS.

Improvement and Optimization Based on CephFS -- Lu Bo, Sangfor Technologies

According to an IDC survey, 80 percent of enterprise data is unstructured, and that data is growing exponentially at 60 percent per year. Distributed file storage, with its flexible scaling and rapid deployment, is increasingly favored by users in government, education, healthcare, and other industries. With the development of OpenStack, Ceph has become the star of distributed storage. Lu Bo, a storage R&D expert from Sangfor Technologies, drew on EDS, the distributed storage system Sangfor develops, to share improvements and optimizations made to Ceph file storage, along with some thoughts on the future.

Ceph has a layered architecture: at the bottom is RADOS, a distributed object store based on CRUSH (a hash-style placement algorithm), and on top of it Ceph provides object storage (RadosGW), block storage (RBD), and a file system (CephFS). CephFS began with Sage Weil's doctoral research; its goal is distributed metadata management capable of supporting EB-scale data. The figure below shows the overall structure of CephFS.

ceph-mds: caches file system metadata and persists it to RADOS; it runs on top of RADOS and does not store data itself.

Open: the client obtains file metadata from the MDS.

Write: the client writes data directly through RADOS.

All data operations go directly from the client to RADOS, data access from multiple clients is coordinated by the OSDs, and metadata and data can live in different storage pools. As this shows, CephFS is a distributed system requiring cross-network access, so in practice its IO path is long, leading to high latency and limited performance. To solve this, Sangfor made a series of improvements on top of this architecture.

Global cache: the cache is shared globally across the whole system; as long as a file's striped data is cached on any node, every other node can hit it in the first-level cache when it receives another access request for that data. The global cache merges data and exploits the high throughput of the underlying KV storage, greatly improving overall system performance.

FusionStorage: FusionStorage block storage handles IO intelligently according to its size. For small IO, multiple replicas are written into the distributed EC cache, where stripes are aggregated; large IO bypasses the distributed EC cache and is erasure-coded directly to the back-end hard disks. Because large IO goes straight to disk, the system frees the precious cache resources that large IO would otherwise occupy, caches more random small-block IO, indirectly raises the cache hit rate for random small blocks, and improves the system's random small-IO performance. For large sequential IO, HDD write performance is not far behind SSD, and with multiple HDDs processing in parallel the system still achieves good write bandwidth in large-sequential-IO scenarios, balancing overall system performance.
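
The size-based dispatch reduces to a simple branch; in the sketch below, the 128 KB threshold and the function names are illustrative assumptions.

```go
// Small writes go through a replicated cache where they are aggregated into
// full stripes, while large writes bypass the cache and are erasure-coded
// straight to disk.
package main

import "fmt"

const largeIOThreshold = 128 << 10 // bytes; the boundary is deployment-tuned

func writeToCache(data []byte)  { fmt.Println("cache: aggregate", len(data), "bytes into a stripe") }
func writeECDirect(data []byte) { fmt.Println("EC: stripe", len(data), "bytes straight to HDDs") }

// dispatchWrite routes an IO by size so large sequential writes do not evict
// the small random blocks that benefit most from the cache.
func dispatchWrite(data []byte) {
	if len(data) < largeIOThreshold {
		writeToCache(data) // small IO: absorb in the replicated cache
	} else {
		writeECDirect(data) // large IO: bypass cache, submit EC stripes directly
	}
}

func main() {
	dispatchWrite(make([]byte, 4<<10)) // 4 KB random write
	dispatchWrite(make([]byte, 1<<20)) // 1 MB sequential write
}
```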

Owing to limited space, this article compiles only part of the content of the live talks. Interested readers can follow the link to download the speakers' slide decks for more technical details.

Download address (available after submitting the form): PPT download address
