Following the debut of POLARDB, the first domestically developed cloud-native database, at ICDE 2018, the paper on its core storage platform, "PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database", was accepted by the top database conference VLDB 2018. In August, the Alibaba Cloud database team presented the work at VLDB 2018 in Rio de Janeiro, Brazil, where it drew a very positive response from the industry.
VLDB (Very Large Data Bases), together with SIGMOD and ICDE, makes up the three top conferences in the database field. Founded in 1975 in Framingham, MA, USA, the VLDB conference is the annual premier international forum for database researchers, vendors, practitioners, application developers, and users.
VLDB is organized around four tracks: Core Database Technology, Infrastructure for Information Systems, Industrial Applications and Experience, and Experiments and Analyses.
Data from 2009 onward shows that VLDB's overall acceptance rate is fairly low: roughly 16.7% for Core Database Technology, 17.9% for Infrastructure for Information Systems, 18% for Industrial Applications and Experience, and about 19% for Experiments and Analyses. Getting a paper accepted by VLDB is clearly not easy; only highly innovative work with substantial contributions makes it in.
This paper focuses on the system design and implementation of PolarFS.
Background
Just as Oracle has its companion OCFS2, POLARDB, as a database with separated storage and compute, relies on PolarFS to fully exploit its architecture. PolarFS is a distributed file system with ultra-low latency and high availability. It is built on a lightweight user-space network and I/O stack, bypassing the corresponding kernel stacks, in order to fully exploit the potential of emerging hardware such as RDMA and NVMe SSD and sharply reduce the end-to-end latency of distributed non-volatile data access. At present, the end-to-end latency of a cross-node 3-replica write in PolarFS is very close to that of a local PCIe SSD on a single machine, which allows POLARDB to deliver excellent performance under a distributed multi-replica architecture.
Design motivation
Designing a distributed file system for a database brings the following benefits:
Compute nodes and storage nodes can use different server hardware and can be customized independently. For example, computing nodes do not need to consider the ratio of storage capacity to memory capacity, which is heavily dependent on application scenarios and is difficult to predict.
Storage resources on multiple nodes can form a single storage pool, which can reduce the risk of storage space fragmentation, load imbalance between nodes and space waste, and the storage capacity and system throughput can be easily expanded horizontally.
The persistent state of database applications can be moved down to the distributed file system, and distributed storage provides high data availability and reliability. Therefore, the high availability processing of the database can be simplified, and it is also conducive to the flexible and rapid migration of database instances on computing nodes.
In addition, cloud database services will also bring additional benefits:
Cloud database can be deployed in virtual computing environments such as KVM, which is more secure, easier to expand and easier to upgrade and manage.
Some key database features, such as one-writer-multiple-readers (read-only) instances and database snapshots, can be supported more effectively through data sharing, checkpointing, and other techniques of the distributed file system.
System structure
System components
The PolarFS system is mainly divided into two layers of management:
Virtualization management of storage resources, which is responsible for providing a logical storage space for each database instance.
The management of file system metadata, which is responsible for the file management on the logical storage space, and is responsible for the synchronization and mutual exclusion of file concurrent access.
The system structure of PolarFS is shown in the figure:
libpfs is a user-space file system library through which the database accesses PolarFS.
PolarSwitch runs on the compute node and forwards I/O requests issued by the database.
ChunkServer is deployed on the storage nodes and handles I/O requests and the allocation of storage resources within a node.
PolarCtrl is the control plane of the system. It consists of a group of master processes implemented as microservices, with corresponding agents deployed on all compute and storage nodes.
Before going any further, let's take a look at how PolarFS storage resources are organized:
The storage resource management unit of PolarFS is divided into three layers: Volume, Chunk and Block.
Volume
Volume is an independent logical storage space for each database, on which a specific file system is established to be used by this database, with sizes ranging from 10GB to 100TB, which can fully meet the capacity requirements of typical cloud database instances.
Metadata of a specific file system instance is stored on its Volume. The file system metadata includes objects such as inodes, directory entries, and free resource blocks. Because POLARDB uses a shared file storage architecture, file system metadata consistency is enforced at the file level. Besides the data files created by the DB, each file system also has a Journal file and a Paxos file for metadata updates: updates to file system metadata are first recorded in the Journal file, and mutually exclusive write access to the Journal file by multiple instances is implemented with the Disk Paxos algorithm over the Paxos file.
Chunk
Each Volume is divided into multiple Chunks. A Chunk is the minimum granularity of data distribution, and each Chunk is stored on only a single NVMe SSD disk of one storage node, which simplifies managing data with high reliability and high availability. The typical Chunk size is 10GB, much larger than in comparable systems, such as GFS's 64MB.
The advantage is that this effectively reduces the size of the first-level mapping metadata of a Volume (for example, a 100TB Volume contains only 10K mapping entries). On the one hand, the global metadata is easier to store and manage; on the other hand, the metadata can easily be cached in memory, which avoids extra metadata accesses on the critical I/O path.
The potential drawback is that when the upper-layer database has a hot spot within a region, the hot spot inside a single Chunk cannot be dispersed further. However, because the number of Chunks on each storage node is usually far larger than the number of nodes (the Chunk-to-node ratio is on the order of 1000:1) and PolarFS supports online Chunk migration while serving a large number of database instances, hot spots of different instances, and of the same instance across Chunks, can be spread over different nodes to achieve overall load balancing.
Block
Within a ChunkServer, a Chunk is further divided into multiple Blocks, whose typical size is 64KB. Blocks are dynamically mapped into a Chunk to achieve on-demand allocation. The Chunk-to-Block mapping information is managed and stored by the ChunkServer itself; besides data Blocks, each Chunk also contains some extra Blocks used to implement the Write Ahead Log. The ChunkServer also caches all of its local mapping metadata in memory, so user data accesses can proceed at full speed.
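To make the two-level address translation above concrete, here is a minimal Python sketch, with hypothetical names and a toy allocator, of how a volume offset might be resolved first to a Chunk (10GB granularity) and then, inside the ChunkServer's in-memory map, to an on-demand allocated 64KB Block. It is only an illustration of the scheme described above, not the actual PolarFS code.

```python
# Illustrative sketch of the two-level mapping described above: a volume offset
# is first mapped to a Chunk (handled by PolarSwitch's cached metadata), then to
# a Block (allocated on demand, tracked in ChunkServer memory). Names are hypothetical.

CHUNK_SIZE = 10 * 1024 ** 3      # 10 GB per Chunk
BLOCK_SIZE = 64 * 1024           # 64 KB per Block


class ChunkServerMap:
    """In-memory Chunk -> Block mapping with on-demand Block allocation."""

    def __init__(self):
        self.block_map = {}        # chunk-local block index -> physical block id
        self.next_free_block = 0   # toy free-block allocator

    def resolve(self, offset_in_chunk):
        block_index = offset_in_chunk // BLOCK_SIZE
        if block_index not in self.block_map:          # allocate on first access
            self.block_map[block_index] = self.next_free_block
            self.next_free_block += 1
        physical_block = self.block_map[block_index]
        return physical_block, offset_in_chunk % BLOCK_SIZE


def locate(volume_offset):
    """Split a volume offset into (chunk index, offset inside that chunk)."""
    return volume_offset // CHUNK_SIZE, volume_offset % CHUNK_SIZE


if __name__ == "__main__":
    chunk_idx, chunk_off = locate(volume_offset=25 * 1024 ** 3 + 123)
    server_map = ChunkServerMap()
    print("chunk", chunk_idx, "-> block", server_map.resolve(chunk_off))
```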
Let's describe the various system components of PolarFS in detail below.
Libpfs
libpfs is a lightweight user-space library. PolarFS takes the approach of compiling libpfs into the database and having it replace the standard file system interface, so that the whole I/O path stays in user space and data processing is completed in user space with as little data copying as possible. This avoids the overhead of messaging between kernel space and user space in traditional file systems, especially the cost of data copies, which matters greatly for the performance of low-latency hardware.
It provides a POSIX-like file system interface, so the database can be ported onto it at very little modification cost.
PolarSwitch
PolarSwitch is a daemon deployed on the compute node that maps I/O requests to specific back-end nodes. The database sends I/O requests through libpfs to PolarSwitch; each request carries the database instance's Volume ID, starting offset, and length. PolarSwitch splits the request by the Chunks it covers and sends each piece to the ChunkServer that owns that Chunk.
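The following small Python sketch illustrates the splitting just described: a request given by (offset, length) is cut at Chunk boundaries, and each piece is routed using a locally cached mapping from Chunk index to the ChunkServer that serves it. The function and variable names are illustrative assumptions, not PolarFS APIs.

```python
# Hypothetical sketch of the request splitting PolarSwitch performs: an I/O
# described by (offset, length) is cut at Chunk boundaries and each piece is
# routed to the ChunkServer recorded in the locally cached metadata.

CHUNK_SIZE = 10 * 1024 ** 3  # 10 GB


def split_request(offset, length, chunk_owners):
    """Yield (owner_address, chunk_index, offset_in_chunk, sub_length) tuples."""
    end = offset + length
    while offset < end:
        chunk_index = offset // CHUNK_SIZE
        offset_in_chunk = offset % CHUNK_SIZE
        sub_length = min(end - offset, CHUNK_SIZE - offset_in_chunk)
        yield chunk_owners[chunk_index], chunk_index, offset_in_chunk, sub_length
        offset += sub_length


if __name__ == "__main__":
    owners = {0: "chunkserver-a:9000", 1: "chunkserver-b:9000"}
    # A write that straddles the boundary between Chunk 0 and Chunk 1.
    for sub in split_request(CHUNK_SIZE - 4096, 8192, owners):
        print(sub)
```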
ChunkServer
ChunkServer is deployed on the back-end storage nodes. A storage node can run multiple ChunkServers; each ChunkServer is bound to one CPU core and manages a separate NVMe SSD disk, so ChunkServers do not compete with each other for resources.
ChunkServer is responsible for resource mapping and for reads and writes within a Chunk. Each Chunk includes a WAL; modifications to the Chunk are written ahead to the log to ensure atomicity and durability of the data. ChunkServer uses a mix of 3D XPoint SSD and ordinary NVMe SSD for the WAL buffer, and the log is preferentially stored on the faster 3D XPoint SSD.
ChunkServer replicates write requests to the corresponding Chunk replicas (other ChunkServers). We use our own ParallelRaft consensus protocol to ensure that data is synchronized correctly between Chunk replicas under various failure conditions and that committed data is never lost.
PolarCtrl
PolarCtrl is the control core of PolarFS cluster. Its main responsibilities include:
Monitoring the health of ChunkServers and deciding which ChunkServers belong to the PolarFS cluster
Creating Volumes and managing Chunk layout (that is, deciding which ChunkServers a Chunk's replicas are placed on)
Maintaining the Volume-to-Chunk mapping metadata
Pushing metadata cache updates to PolarSwitch
Monitoring the performance of Volumes and Chunks
Periodically initiating CRC data validation within and between replicas
PolarCtrl uses a relational database cloud service to manage the above metadata.
Distributed management with central control and local autonomy
There are two paradigms in distributed system design: centralized and decentralized. Centralized systems, such as GFS and HDFS, contain a single central point responsible for maintaining metadata and managing cluster membership. Such systems are relatively simple to implement, but from the standpoint of availability and scalability the single center can become a bottleneck for the whole system. Decentralized systems such as Dynamo take the opposite approach: nodes are peers, and metadata is partitioned and replicated across all nodes. Decentralized systems are considered more reliable, but they are harder to design and implement.
PolarFS strikes a balance between the two, adopting central control with local autonomy: PolarCtrl is a centralized master responsible for administrative tasks such as resource management and control-plane requests like creating a Volume, while ChunkServer is responsible for managing the mapping inside a Chunk and for data replication between Chunk replicas. When ChunkServers interact with each other, failures are handled and leader elections are initiated automatically through the ParallelRaft consensus protocol, without PolarCtrl's involvement.
Because the PolarCtrl service does not sit on the highly concurrent I/O path between PolarSwitch and ChunkServer and its state is updated relatively infrequently, it can use a typical multi-node high-availability architecture to keep the PolarCtrl service durable. When PolarCtrl recovers after a brief outage caused by a crash, the PolarSwitch cache, the local metadata management of the ChunkServer data plane, and autonomous leader election allow PolarFS to keep serving most data normally.
I/O flow
Let's illustrate how the components interact by walking through an I/O flow.
The process by which PolarFS executes a write I/O request is shown in the figure above:
POLARDB sends a write request through libpfs, which passes it to PolarSwitch via a ring buffer.
PolarSwitch sends the request to the primary node of the corresponding Chunk based on its locally cached metadata.
When the new write request arrives, the RDMA NIC on the primary node places it into a pre-allocated buffer and adds it to the request queue. An I/O polling thread continuously polls the request queue and starts processing as soon as it sees a new request.
The request is written to the log block of the hard disk through SPDK and sent to the replica node through RDMA. These operations are called asynchronously and the data transfer is carried out concurrently.
When the replica request reaches the replica node, the RDMA NIC of the replica node will also put it into the pre-split buffer and join the replication queue.
The I/O polling thread on the replica node is triggered and writes the request to the Chunk's log asynchronously through SPDK.
When the write request of the replica node is successfully called back, a reply response is sent to the primary node through RDMA.
After the primary node receives a successful return from most of the nodes in a replication group, the primary node applies the write request to the data block through SPDK.
The primary node then returns success to PolarSwitch via RDMA.
PolarSwitch marks the success of the request and notifies the upper layer of POLARDB.
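As a rough summary of steps 3 to 10 from the Chunk primary's point of view, the sketch below writes the log entry locally, replicates it to the followers, waits for a majority of the three replicas, applies the data, and reports success. For brevity the replication loop is serialized and the SPDK/RDMA operations are replaced by placeholder callables; the real system issues these operations asynchronously and in parallel, as the steps above describe.

```python
# Condensed, hypothetical sketch of the primary's write handling: log locally,
# replicate, wait for a majority of the 3 replicas, apply, then acknowledge.
# write_local_log / send_to_replica / apply_to_data_block stand in for the
# SPDK and RDMA operations and are not real PolarFS APIs.

REPLICAS = 3
MAJORITY = REPLICAS // 2 + 1


def handle_write(entry, followers, write_local_log, send_to_replica,
                 apply_to_data_block):
    acks = 0
    write_local_log(entry)                     # step 4: log write via SPDK
    acks += 1                                  # the primary counts itself
    for follower in followers:                 # step 4: replicate via RDMA
        if send_to_replica(follower, entry):   # steps 5-7: follower logs and acks
            acks += 1
        if acks >= MAJORITY:                   # step 8: majority reached
            apply_to_data_block(entry)         # apply to the data Block
            return "success"                   # steps 9-10: reply upstream
    return "failed"


if __name__ == "__main__":
    ok = handle_write(
        entry={"lba": 0, "data": b"x" * 4096},
        followers=["replica-1", "replica-2"],
        write_local_log=lambda e: None,
        send_to_replica=lambda f, e: True,
        apply_to_data_block=lambda e: None,
    )
    print(ok)
```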
Data copy consistency model
ParallelRaft protocol design motivation
A production-level distributed storage system needs to ensure that all committed changes are not lost under various boundary conditions. PolarFS introduces a consistency protocol at the Chunk level to ensure the reliability and consistency of file system data. At the beginning of the design, considering the maturity of the engineering implementation, we chose the Raft algorithm, but for the ultra-low latency and high concurrent storage system we built, we quickly encountered some pitfalls.
For simplicity and understandability, Raft uses a highly serialized design. Logs may not have holes on either the leader or the followers, which means all log entries are acknowledged by followers, committed by the leader, and applied to all replicas in sequence. Consequently, when many concurrent write requests execute, they commit in order: requests at the tail of the queue must wait for all preceding requests to be persisted to disk and acknowledged before they can commit and return, which increases average latency and lowers throughput. We observed that when the I/O depth grows from 8 to 32, the I/O throughput is cut in half.
Raft is also not well suited to high-concurrency, multi-connection environments. In practice the leader and followers commonly use multiple connections to ship logs. When one connection blocks or slows down, log entries arrive at the follower out of order, that is, later entries may arrive before earlier ones. However, a Raft follower must accept log entries in order, so even entries that are already on disk can only be acknowledged after all the missing earlier entries have arrived. And if most followers are blocked on some missing entries, the leader stalls as well. We wanted a protocol better suited to this situation.
The database transaction processing systems that run on PolarFS have their own concurrency control at the database logic level, which lets transactions execute interleaved or out of order while still producing serializable results. These applications naturally tolerate the out-of-order completion allowed by standard storage semantics and enforce data consistency themselves when needed. We can therefore exploit this property and, within the bounds of storage semantics, relax some of Raft's constraints in PolarFS to obtain a consensus protocol better suited to high I/O concurrency.
On the basis of Raft, we provide an improved consistency protocol ParallelRaft. The structure of ParallelRaft is the same as that of Raft, except that it loosens its strict ordering constraints.
Out of order log replication
Raft ensures serialization in two ways:
When the leader sends a log entry to a follower, the follower's ack confirms that this entry has been received and recorded, and also implicitly indicates that all earlier entries have been received and saved.
When the leader commits a log entry and broadcasts this to all followers, it likewise confirms that all earlier entries have been committed. ParallelRaft removes both restrictions and allows these steps to proceed out of order.
Therefore, the most fundamental difference between ParallelRaft and Raft is that when an entry has been committed successfully, it does not mean that all earlier entries have been committed. So we must ensure that:
The correctness of the storage semantics is not violated by the state of any single replica in this situation.
No committed entry is ever lost under any boundary condition.
With these two guarantees, combined with the tolerance that databases and other applications already have for out-of-order completion of storage I/O, they can run correctly on PolarFS and rely on the data reliability PolarFS provides.
The disorderly execution of ParallelRaft follows the following principles:
When the storage ranges of the written Log items do not overlap each other, it is considered that the Log items can be executed out of order without conflict.
Otherwise, conflicting Log entries will be completed in writing order.
It is easy to see that I/O completed according to these principles does not violate the correctness of traditional storage semantics.
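The conflict rule can be captured in a few lines. The sketch below (illustrative names, LBA ranges given as (start, length) pairs) checks whether a log entry's write range overlaps any earlier pending entry; only non-overlapping entries are allowed to complete out of order.

```python
# Small sketch of the rule just stated: two log entries may complete out of
# order only if the LBA ranges they write do not overlap.

def ranges_overlap(a, b):
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len


def can_execute_out_of_order(entry, earlier_pending_entries):
    """An entry may run ahead only if it conflicts with no earlier pending entry."""
    return not any(ranges_overlap(entry, e) for e in earlier_pending_entries)


if __name__ == "__main__":
    print(can_execute_out_of_order((4096, 4096), [(0, 4096)]))   # True: disjoint ranges
    print(can_execute_out_of_order((2048, 4096), [(0, 4096)]))   # False: ranges overlap
```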
Next, let's look at how the ack, commit, and apply stages of log handling are optimized while remaining consistent.
Out-of-order acknowledgement (ack): a Raft follower does not ack a log entry received from the leader until that entry and all earlier entries have been persisted. In ParallelRaft, by contrast, any log entry can be acknowledged immediately once it has been persisted, which improves the system's average latency.
Out-of-order commit: a Raft leader commits log entries serially, and an entry can only be committed after all earlier entries are committed. A ParallelRaft leader can commit an entry as soon as a majority of its replicas have acknowledged it. This matches the semantics of the storage system: for example, the NVMe SSD driver does not inspect the LBAs of read and write commands to enforce ordering between parallel commands, and there is no guarantee on the order in which those commands complete.
Out-of-order apply: in Raft, all log entries are applied in strict order, so the data files of all replicas are consistent. In ParallelRaft, however, out-of-order acknowledgement and commit mean that each replica's log may have holes at different positions. The challenge is: how can a log entry be applied safely when some earlier entries are missing?
ParallelRaft introduces a new data structure, look behind buffer, to solve the problems in apply.
Each ParallelRaft log entry carries a look behind buffer, which stores a summary of the LBAs modified by the preceding N log entries.
The look behind buffer acts like a bridge built over holes in the log. N is the width of the bridge, that is, the maximum length of a single hole that can be spanned. The value of N can be tuned statically according to the likelihood of consecutively missing log entries in the network, so that the bridge over the log stays continuous.
Using the look behind buffer, a follower can determine whether a log entry conflicts, that is, whether some missing earlier entry modifies an LBA range that overlaps with it. Entries without conflicts can be applied safely; conflicting entries are added to a pending list and are only applied after the missing conflicting entries have been applied.
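Here is a hypothetical sketch of that apply-side check: each entry carries a look behind buffer summarizing the LBA ranges of its N predecessors, and an entry is either applied over the hole or parked on the pending list depending on whether a missing predecessor overlaps it. The data layout and names are assumptions for illustration only.

```python
# Illustrative apply-side conflict check: an entry may be applied despite holes
# in the log as long as no missing predecessor (per its look behind buffer)
# overlaps the LBA range this entry writes.

def ranges_overlap(a, b):
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def try_apply(entry, applied_indices, pending, apply_fn):
    """entry = {'index': int, 'lba_range': (start, length),
                'look_behind': {prev_index: prev_lba_range, ...}}"""
    for prev_index, prev_range in entry["look_behind"].items():
        missing = prev_index not in applied_indices
        if missing and ranges_overlap(entry["lba_range"], prev_range):
            pending.append(entry)          # conflict with a hole: defer
            return False
    apply_fn(entry)                        # safe to apply over the hole
    applied_indices.add(entry["index"])
    return True


if __name__ == "__main__":
    pending, applied = [], {1}
    entry = {"index": 3, "lba_range": (8192, 4096),
             "look_behind": {1: (0, 4096), 2: (4096, 4096)}}   # entry 2 is missing
    ok = try_apply(entry, applied, pending, apply_fn=lambda e: None)
    print("applied immediately:", ok)   # True: the hole (entry 2) does not overlap
```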
With asynchronous ack, asynchronous commit, and asynchronous apply of Chunk log entries, PolarFS avoids the extra waiting imposed by strict ordering, which effectively reduces the average latency of highly concurrent 3-replica writes.
Correctness of ParallelRaft protocol
In the design of ParallelRaft, we ensure that the key features of Raft protocol are not lost, thus ensuring the correctness of the new protocol.
The design of ParallelRaft protocol inherits the Election Safety, Leader Append-Only and Log Matching features of the original Raft protocol.
Conflicting log entries are committed in strict order, so the State Machine Safety property of the protocol is ultimately preserved.
We add a Merge phase to leader election to fill the holes in the new leader's log, which guarantees the Leader Completeness property of the protocol.
PolarFS design choices closely tied to POLARDB
High-speed multi-replica writes in the file system: ultra-high single-instance TPS and high data reliability for the database
PolarFS employs the following techniques in its design to extract maximum performance:
PolarFS uses a single-threaded finite state machine bound to a CPU core to handle I/O, avoiding the context-switching overhead of a multi-threaded I/O pipeline (see the sketch after this list).
PolarFS optimizes memory allocation, using a memory pool to reduce the cost of constructing and destroying memory objects and huge pages to reduce the cost of paging and TLB updates.
Through the structure of central control with local autonomy, PolarFS caches all metadata in memory in each part of the system, which essentially eliminates extra metadata accesses on the I/O path.
PolarFS adopts a full user-space I/O stack, including RDMA and SPDK, avoiding the overhead of the kernel network and storage stacks.
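As an illustration of the first technique in the list, the sketch below pins a worker to one CPU core and drives each request through a small run-to-completion state machine by polling, so no per-request threads or blocking waits are involved. The stage names and handlers are placeholders, and core pinning uses os.sched_setaffinity, which is Linux-specific.

```python
# Minimal sketch, under assumed names, of a single worker pinned to one core
# that polls a queue and runs each request to completion through a small
# finite state machine, avoiding per-request context switches.

import os
from collections import deque


def run_io_worker(core_id, request_queue, handlers):
    """Poll the queue on one pinned core and drive each request to completion."""
    if hasattr(os, "sched_setaffinity"):          # core pinning (Linux only)
        os.sched_setaffinity(0, {core_id})
    while request_queue:                          # the real loop would poll forever
        req = request_queue.popleft()
        state = "LOG_WRITE"
        while state != "DONE":                    # run-to-completion state machine
            state = handlers[state](req)          # each handler returns the next state


if __name__ == "__main__":
    handlers = {
        "LOG_WRITE": lambda r: "REPLICATE",       # placeholder stage handlers
        "REPLICATE": lambda r: "APPLY",
        "APPLY": lambda r: "DONE",
    }
    run_io_worker(0, deque([{"lba": 0}]), handlers)
    print("all requests completed on one core without context switches")
```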
On the same hardware, the latency of a 3-replica block write in PolarFS is close to that of a single-replica write to a local SSD. This greatly improves the single-instance TPS of POLARDB while guaranteeing data reliability.
The following figure shows a preliminary Sysbench comparison between POLARDB on PolarFS and Alibaba's MySQL cloud service RDS under different workloads: OLTP read-only, write-only (update:delete:insert = 2:1:1), and mixed read-write (read:write = 7:2). The database test data set is 500GB.
It can be found that POLARDB has achieved good performance under PolarFS, and PolarFS supports both the high TPS of POLARDB and the high reliability of data.
Shared file system access: strong QPS scaling for one-writer-multi-reader databases and database instance failover
PolarFS is a shared-access distributed file system; each file system instance has a corresponding Journal file and a Paxos file. The Journal file records the modification history of the metadata and is the hub for metadata synchronization between the sharing instances. Logically it is a fixed-size circular buffer, and PolarFS reclaims its space according to a watermark. The Paxos file implements a distributed mutex based on Disk Paxos.
Since the journal is critical to PolarFS, modifications to it must be protected by the Paxos mutex. A node that wants to append entries to the journal must first acquire the lock in the Paxos file using the Disk Paxos algorithm. Normally the lock holder releases the lock as soon as its records are persisted, but in some failure cases the holder never releases it. For this reason a lease is attached to the Paxos mutex, and when the lease expires other contenders can restart the contention. When PolarFS synchronizes metadata modified by other nodes, it scans from the last scanned position to the end of the journal and applies the new entries to its in-memory cache.
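The lease behaviour can be sketched as follows (the Disk Paxos protocol itself is omitted and the class is a toy for illustration): the journal writer must hold the mutex before appending, normally releases it right after persisting its records, and if it crashes the lease expiry lets another node take over.

```python
# Toy sketch of the lease on the journal mutex described above; the Disk Paxos
# acquisition itself is not modeled. Names are illustrative.

import time


class LeasedMutex:
    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id, now=None):
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires_at:   # free, or lease lapsed
            self.holder = node_id
            self.expires_at = now + self.lease_seconds
            return True
        return False

    def release(self, node_id):
        if self.holder == node_id:
            self.holder = None


if __name__ == "__main__":
    mutex = LeasedMutex(lease_seconds=1.0)
    assert mutex.try_acquire("node1")
    assert not mutex.try_acquire("node2")        # node1 still holds a valid lease
    time.sleep(1.1)                              # node1 "crashed" without releasing
    assert mutex.try_acquire("node2")            # lease expired: node2 takes over
    print("node2 now owns the journal mutex")
```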
The following figure shows the process of updating and synchronizing file system metadata.
Node 1 allocates block 201 to file 316, then requests and acquires the mutex.
Node 1 starts recording the transaction to the journal. The last written entry is marked as a pending tail; once all entries are recorded, the pending tail becomes the valid tail of the journal.
Node 1 updates the superblock to record the modified metadata. Meanwhile, node 2 tries to acquire the mutex held by node 1, fails, and retries.
Node 2 obtains the lock after node 1 releases it, but the new entries appended to the journal by node 1 show that node 2's local metadata is out of date.
Node 2 scans the new entries and then releases the lock. It rolls back the unrecorded transaction, updates its local metadata, and finally retries the transaction.
Node 3 synchronizes metadata automatically; it only needs to load the increments and replay them locally.
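The incremental synchronization performed by nodes 2 and 3 can be sketched as a scan-and-replay loop over the journal, shown below with hypothetical entry and cache layouts: each node remembers the last position it has scanned, reads the newer entries up to the valid tail, and replays them into its in-memory metadata cache.

```python
# Illustrative sketch of incremental metadata synchronization: scan the journal
# from the last scanned position and replay new entries into the local cache.
# The entry and cache layouts are assumptions, not PolarFS formats.

def sync_metadata(journal, last_scanned, metadata_cache):
    """journal: list of dicts like {'file': 316, 'block': 201, 'op': 'alloc'}."""
    for position in range(last_scanned, len(journal)):
        entry = journal[position]
        key = (entry["file"], entry["block"])
        if entry["op"] == "alloc":              # replay the metadata change locally
            metadata_cache[key] = True
        elif entry["op"] == "free":
            metadata_cache.pop(key, None)
    return len(journal)                          # new last-scanned position


if __name__ == "__main__":
    journal = [{"file": 316, "block": 201, "op": "alloc"}]
    cache = {}
    last = sync_metadata(journal, last_scanned=0, metadata_cache=cache)
    print(cache, "scanned up to", last)
```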
This sharing mechanism of PolarFS suits POLARDB's typical one-writer-multiple-readers scaling model very well. In this mode there is no lock contention overhead for reads: read-only instances can obtain Journal information lock-free via atomic I/O, so POLARDB can deliver near-linear QPS scaling.
Because PolarFS provides the underlying guarantee of consistency even with multiple writers, when a writable instance fails POLARDB can easily promote a read-only instance to a writable one without worrying about inconsistency in the underlying storage, which makes database instance failover straightforward.
File-system-level snapshots: instantaneous logical backup for POLARDB
Database snapshots are a must-have feature for backing up 100TB-scale database instances.
PolarFS uses its own patented snapshot technology to build a unified real-time image of the file system on Volume based on local snapshots of multiple ChunkServer at the bottom. Using the logs of its own database, POLARDB can quickly build a database snapshot at this specific time based on this file system image, so as to effectively support the needs of database backup and data analysis.
It is clear that POLARDB's outstanding characteristics, such as high performance, strong scalability, and easy operation and maintenance, are inseparable from the close cooperation of PolarFS, which plays a powerful enabling role.
Conclusion
PolarFS is a distributed file system designed for cloud databases, which can support high reliability across nodes while providing extreme performance. PolarFS uses emerging hardware and advanced optimization technologies, such as OS-bypass and zero-copy, to make the write performance of block 3 replicas in PolarFS close to the latency performance of single-replica native SSD. PolarFS implements POSIX compatible interface in user space, so that database services such as POLARDB can be modified as little as possible to obtain the advantage of high performance brought by PolarFS.
It is clear that a file system purpose-built for databases is an indispensable link in keeping database technology in a leading position. The evolution of database kernel technology and the enabling of a dedicated file system complement each other, and the two will become ever more tightly coupled as system technology advances.
In the future, we will explore new hardware such as NVM and FPGA, in order to further optimize the performance of POLARDB database through the deep combination of file system and database.
Authors: Ming Song, Hong ran, Xuan, Xuwei, Ning Jin, Wen Yi, Han Yi, Yiyun