Amazon Aurora: design considerations for High Throughput Cloud Native Relational Database


Translator: Aceking, partner at Extreme Yunzhou Technology and database kernel R&D expert, responsible for the enterprise-grade cloud native database ArkDB and other core database products. He holds a graduate degree from the database department of Huazhong University of Science and Technology and was formerly a Dameng database R&D engineer responsible for Dameng kernel development. He has long been committed to database internals and source-level development and is proficient in C++ (a running joke within the company: he no longer dares to claim he is proficient in C++).

Abstract

Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). This paper describes the considerations behind Aurora's architecture. We believe the central constraint in high-throughput data processing has moved from compute and storage to the network. Aurora presents a novel architecture to address this constraint, most notably by offloading redo log processing to a multi-tenant, scale-out storage service purpose-built for Aurora. We describe how doing so not only reduces network traffic, but also allows for fast crash recovery, failover to replicas without loss of data, and fault-tolerant, self-healing storage. We then describe how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols. Finally, having operated Aurora as a production service for over 18 months, we share lessons learned from our customers about what modern cloud applications expect of their database tier.

Keywords

Databases; distributed systems; log processing; quorum models; replication; recovery; performance; OLTP

01 introduction

More and more IT workloads are moving to the public cloud. Major reasons for this industry-wide transition include the flexibility of public cloud providers to provision capacity on demand and a model that pays only for operating expenses rather than up-front capital costs. Many IT workloads require a relational OLTP database, so providing a database that is equivalent to, or better than, on-premises databases is critical to supporting this transition.

Increasingly, modern distributed cloud services achieve resilience and scalability by decoupling compute from storage and by replicating storage across multiple nodes. Doing so lets them perform operations such as replacing misbehaving or unreachable hosts, adding replicas, failing over from a writer to a replica, and scaling database instances up or down. In this environment, the IO bottleneck faced by traditional database systems changes: since IOs can be spread across many nodes and many disks in a multi-tenant fleet, individual disks and nodes are no longer hot spots. Instead, the bottleneck moves to the network between the database tier requesting IOs and the storage tier performing them. Beyond basic bottlenecks such as packets per second (PPS) and bandwidth, a high-performance database amplifies network traffic by issuing writes to the storage fleet in parallel, and the performance of an outlier storage node, disk, or network path can dominate response time.

Although most operations in a database can overlap with one another, some situations require synchronous operations, which cause stalls and context switches. One such case is a read that misses in the buffer cache: the reading thread cannot proceed until its disk read completes. A cache miss may also force a dirty page to be evicted and flushed to disk to make room for the new page. Background processing such as checkpointing and dirty-page writing can reduce the need for this forced behavior, but it still causes stalls, context switches, and resource contention.

Transaction commits are another source of interference: a stall while committing one transaction can block others from making progress. Handling commits with multi-node synchronization protocols such as two-phase commit (2PC) is even more challenging in a cloud-scale (cloud-scale) distributed system. These protocols are intolerant of failure, yet high-scale systems are continually subject to a "background noise" of hardware and software failures, and they suffer high latency because such systems are usually distributed across multiple data centers.

In this paper, we introduce Amazon Aurora, a new database service that addresses these problems through bold and aggressive use of the redo log in a highly distributed cloud environment. We present a service-oriented architecture (figure 1) with a multi-tenant, scale-out storage service that is loosely coupled to a fleet of database instances and abstracts a virtualized, segmented redo log. Although each database instance retains most of a traditional database kernel (query processing, transactions, locking, buffer caching, access methods, and undo management), several functions (redo logging, durable storage, crash recovery, and backup/restore) are offloaded to the storage service.

Our architecture has three significant advantages over traditional approaches. First, by building storage as an independent fault-tolerant, self-healing service spanning multiple data centers, we protect the database against performance variance and transient or permanent failures at either the networking or storage tiers. We observe that a failure in durability can be modeled as a long-lasting availability event, and an availability event can be modeled as a long-lasting performance variation; a well-designed system can treat each of these uniformly. Second, by writing only redo log records to storage, we reduce network IOPS by an order of magnitude. Once we removed this bottleneck, we were able to aggressively optimize numerous other points of contention, obtaining a throughput improvement well beyond the MySQL base code from which we started. Third, we moved some of the most important and complex functions (such as backup and redo recovery) from one-time expensive operations in the database engine to continuous asynchronous operations amortized across a large distributed fleet. This yields near-instant crash recovery without checkpointing, as well as inexpensive backups that do not interfere with foreground processing.

This paper makes three contributions:

How to reason about durability at cloud scale and how to design quorum systems that are resilient to correlated failures. (Section 2)

How to leverage smart storage by offloading the lower quarter of a traditional database to it. (Section 3)

How to eliminate multi-phase synchronization, crash recovery, and checkpointing in distributed storage. (Section 4)

We then show how the three ideas come together in the overall design of Aurora (Section 5), review our performance results (Section 6), present our operational experience (Section 7), survey related work (Section 8), and offer concluding remarks (Section 9).

02 scalable persistence

If a database does nothing else, it must satisfy this contract: once written, data must be readable. Not all systems do. In this section we discuss the rationale behind the Aurora quorum model, why we segment storage, and how the two, taken together, provide not only durability and availability but also reduced jitter, and help us solve the operational problems of running a large storage fleet.

2.1 replication and correlated failures

Instance lifetime does not correlate well with storage lifetime: instances fail, users shut them down, and users resize them up and down according to their load. These properties argue for decoupling the compute tier from the storage tier.

Once you decouple them, those storage nodes and disks can still fail, so some form of replication is needed for resilience. In a large-scale cloud environment there is a constant low-level background noise of node, disk, and network path failures, each with a different duration and blast radius: for example, a node temporarily unable to reach the network, a temporary outage on a restart, or a permanent failure of a disk, a node, a rack, a leaf or spine network switch, or even an entire data center.

In replicated systems, fault tolerance is typically achieved with a quorum-based voting protocol. If a replicated data item has V copies, each copy is assigned one vote; a read must obtain a read quorum of Vr votes and a write must obtain a write quorum of Vw votes. To achieve consistency, the quorums must obey two rules. First, each read must see the most recent write, formulated as:

Vr+Vw > V

This rule ensures that the set of nodes read intersects the set of nodes written, so the read quorum contains at least one copy with the latest version of the data. Second, each write must be aware of the most recent write, to avoid conflicting writes. The rule is formulated as:

Vw > V/2

A common way to tolerate the loss of a single node is to replicate the data item 3 ways (V=3), with a write quorum of 2/3 (Vw=2) and a read quorum of 2/3 (Vr=2).
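The two quorum rules are easy to check mechanically. The following minimal sketch (not Aurora code; the function name is illustrative) verifies them for the common 3-copy configuration described above.

```python
# Minimal sketch: checking the two quorum rules
#   Vr + Vw > V   (read and write sets intersect)
#   Vw > V / 2    (two write quorums always overlap, avoiding conflicting writes)

def quorum_ok(v: int, vr: int, vw: int) -> bool:
    """Return True if the read/write quorum sizes satisfy both rules."""
    return (vr + vw > v) and (2 * vw > v)

# The common 2/3 configuration discussed above: V=3, Vw=2, Vr=2.
assert quorum_ok(v=3, vr=2, vw=2)
# A configuration that violates rule 1: a read could miss the latest write.
assert not quorum_ok(v=3, vr=1, vw=2)
```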

We believe a 2/3 quorum is inadequate. To see why, first consider the AWS concept of an Availability Zone (AZ). An AZ is a subset of a Region connected to the other AZs in that Region through low-latency links, but isolated from them for most faults such as power, networking, software deployments, and flooding. Distributing data replicas across AZs ensures that typical large-scale failure modes affect only one replica, so simply placing three replicas in different AZs provides tolerance to large-scale events in addition to the smaller individual failures.

However, in a large storage fleet, the background noise of failures means that at any given moment some subset of disks or nodes may have failed and be under repair. These failures may be spread independently across availability zones A, B, and C. Yet if AZ C fails due to a fire, flood, roof collapse, and so on, while at the same moment one of the replicas in AZ A or B is broken (the failure background noise), the quorum breaks: in the 2/3 read quorum model, two copies have been lost and it is no longer possible to determine whether the third copy is up to date. In other words, while the individual replica failures in each availability zone are uncorrelated with one another, the failure of an availability zone is a correlated failure of all the nodes and disks inside it. A quorum system needs to tolerate a background-noise failure occurring at the same time as the failure of an entire availability zone.

In Aurora, we have chosen a design that tolerates two kinds of failure: (a) losing an entire availability zone plus one additional node (AZ+1) without losing data, and

(b) losing an entire availability zone without losing the ability to write data.

With three availability zones, we achieve this by replicating each data item 6 ways across the three AZs, with 2 copies in each AZ. We use a quorum model with 6 votes (V=6), a write quorum of 4/6 (Vw=4), and a read quorum of 3/6 (Vr=3). With this model, (a) losing an entire AZ plus one additional node (a failure of 3 nodes) does not lose read availability or data, and (b) losing any two nodes, including an entire AZ (2 nodes), still preserves write availability. As long as we retain a read quorum, we can rebuild the write quorum by adding additional replica copies.
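A small sketch (illustrative, not Aurora code) makes the failure arithmetic of the 6-copy, Vw=4 / Vr=3 model concrete: two copies per AZ, then count the survivors under the two failure modes just described.

```python
# Illustrative check of the 6-copy quorum model: 2 copies in each of 3 AZs.
from itertools import product

COPIES = [(az, i) for az, i in product("ABC", range(2))]  # 6 copies total

def surviving(lost_az=None, lost_nodes=()):
    """Copies that remain after losing a whole AZ and/or individual nodes."""
    return [c for c in COPIES if c[0] != lost_az and c not in lost_nodes]

# (a) AZ+1: lose all of AZ "C" plus one extra node in AZ "A" -> 3 copies remain,
#     still enough for a read quorum (Vr=3), so no data is lost.
assert len(surviving(lost_az="C", lost_nodes=[("A", 0)])) >= 3

# (b) Lose an entire AZ -> 4 copies remain, still enough for a write quorum
#     (Vw=4), so write availability is preserved.
assert len(surviving(lost_az="C")) >= 4
```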

2.2 segmented storage

Let us now consider whether AZ+1 provides sufficient durability in this model. To provide sufficient durability, we must ensure that the probability of a second independent failure occurring within the time it takes to repair the first (Mean Time to Repair, MTTR) is sufficiently low; if that double-fault probability is high enough, we may then also see an AZ failure and break quorum. It is difficult to reduce the Mean Time to Failure (MTTF) of independent failures, even a little. Instead, we focus on reducing MTTR, shrinking the window of vulnerability to a double fault. We do so by partitioning the database volume into small fixed-size segments, currently 10GB each. Each data item is replicated 6 ways into a protection group (PG), so each PG consists of six 10GB segments spread across three availability zones, with two segments in each AZ. A storage volume is a concatenated set of PGs, physically implemented by a large fleet of storage nodes provisioned as virtual machines with attached SSDs using Amazon Elastic Compute Cloud (EC2). The PGs that make up a volume are allocated as the volume grows. We currently support volumes that can grow up to 64TB on an unreplicated basis.

Segments are the unit of independent background-noise failure and repair. We monitor and automatically repair faults as part of our service. With a 10Gbps network link, a 10GB segment can be repaired in about 10 seconds. We would need to see two segment failures in the same 10-second window, plus the failure of an availability zone that contains neither of those two segments, before quorum is lost. At the failure rates we observe, that is vanishingly unlikely, even at the number of databases we manage for our customers.
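The 10-second repair window quoted above follows from simple arithmetic, sketched below for clarity.

```python
# Back-of-the-envelope check of the repair window: a 10 GB segment over a
# 10 Gbps link.
segment_bytes = 10 * 10**9        # 10 GB
link_bits_per_s = 10 * 10**9      # 10 Gbps
repair_seconds = segment_bytes * 8 / link_bits_per_s
print(repair_seconds)             # 8.0 seconds, i.e. on the order of 10 s
```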

2.3 operational benefits of elasticity

Once a system has been designed to be naturally resilient to long failures, it is also resilient to shorter ones. A system that can handle the long-term loss of an availability zone can also handle brief outages due to power events or rollbacks caused by bad software deployments. A system that can handle a multi-second loss of quorum-member availability can also handle brief network congestion or overload on a single storage node.

Because our system has high fault tolerance, we can exploit it by deliberately taking segments offline for maintenance. For example, heat management is straightforward: we mark one of the segments on a hot disk or node as bad, and the quorum system repairs it by migrating it to a cooler node in the fleet. Operating-system and security patching is likewise a brief unavailability event for the segment being patched. Even software upgrades to our storage fleet are done this way: at any one time we upgrade one AZ while ensuring that no more than one member of any PG is being patched simultaneously. This lets us use agile methodologies and rapid deployments in our storage service.

03 the log is the database

This section explains why a traditional database running on the segmented, replicated storage system described in Section 2 would still suffer the performance burden of network IO and synchronous stalls. We then explain our approach of offloading log processing to the storage service and show experimentally how it can dramatically reduce network IO. Finally, we describe techniques used inside the storage service to minimize synchronous stalls and unnecessary writes.

3.1 the burden of amplified writes

Our model of segmented storage volumes with 6-way replication and a 4/6 write quorum gives us high resilience. Unfortunately, it would make the performance of a traditional database such as MySQL untenable, because each application write generates many different real IOs in the storage tier. The high IO volume is amplified by replication, imposing a heavy packets-per-second (PPS) burden, and the IOs create synchronization points that stall pipelines and dilate latency. Although chain replication [8] and its alternatives can reduce network cost, they still suffer from synchronous stalls and additive latencies. Let us look at how writes work in a traditional database. A system like MySQL writes data pages to known objects (for example, heap files and B-trees) while also writing redo log records to a write-ahead log (WAL). Each redo log record describes the change from the before-image of a page to its after-image; applying the log record takes the page from its before-image to its after-image.

In reality, other data must be written as well. For instance, consider the synchronous mirrored MySQL configuration shown in figure 2, which achieves high availability across data centers. AZ1 hosts the active MySQL instance on Amazon Elastic Block Store (EBS); AZ2 hosts a standby MySQL instance, also on EBS. Writes made to the primary EBS volume are synchronized to the standby EBS volume via software mirroring.

Figure 2 shows the types of data the engine needs to write: the redo log, the binlog, the modified data pages, a second temporary write of the data page (the double-write), and the metadata (FRM) files. The figure also shows the order of the actual IO flow:

Steps 1 and 2: writes are issued to EBS, which in turn issues them to its EBS mirror in the same availability zone, and an acknowledgement is returned when both complete.

Step 3: the write is staged to the standby instance using block-level software mirroring.

Steps 4 and 5: the writes are performed on the standby's EBS volume and on its EBS mirror.

This mirrored MySQL model is undesirable not only because of how data is written but also because of what data is written. First, steps 1, 3, and 5 are sequential and synchronous: latency is additive because many of the writes are sequential, and jitter is amplified because, even for asynchronous writes, the system must wait for the slowest operation, leaving it at the mercy of outlier nodes. Viewed as a distributed system, this model acts like a 4/4 write quorum, which is fragile in the face of failures or outlier performance. Second, the user operations of an OLTP application generate many different types of writes, often representing the same information written multiple ways. For example, the writes to the double-write buffer exist chiefly to prevent torn pages in the storage layer, yet their content duplicates the normal page writes.

3.2 Offloading log processing to storage

When a traditional database modifies a data page, it generates a redo log record and invokes a log applicator that applies the record to the in-memory before-image of the page to produce its after-image. Transaction commit requires the log to have been written, but the data page write may be deferred. In Aurora, the only writes that cross the network are redo log records. No pages are ever written from the database tier: no background writes, no checkpointing, and no writes on cache eviction. Instead, the log applicator is pushed down into the storage tier, where it generates database pages in the background or on demand. Of course, generating each page from the complete chain of its changes from scratch would be prohibitively expensive, so we continually materialize pages in the background to avoid regenerating them on every fetch. From a correctness perspective, background materialization is entirely optional: as far as the engine is concerned, the log is the database, and any pages the storage tier materializes are simply a cache of log applications. Also unlike checkpointing, only pages with long chains of modifications need to be re-materialized: checkpointing is governed by the length of the entire redo log chain, whereas Aurora page materialization is governed by the chain length of an individual page.
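The per-page materialization rule can be sketched as follows. This is a hedged illustration, not Aurora internals; the names, the threshold, and the representation of a page as a list of applied records are all assumptions made for clarity.

```python
# Hedged sketch: the storage tier keeps, per page, the chain of redo records
# not yet folded into a materialized image, and re-materializes a page only
# when *that page's* chain grows long (not when the total log grows long).

CHAIN_LIMIT = 16  # illustrative threshold, not an Aurora constant

def apply_redo(image, record):
    # Placeholder for the real redo applicator; here a page image is modeled
    # simply as the list of records applied to it.
    return image + [record]

class PageState:
    def __init__(self):
        self.image = []        # last materialized image
        self.pending = []      # redo records not yet folded into the image

    def append_redo(self, record):
        self.pending.append(record)
        if len(self.pending) > CHAIN_LIMIT:   # driven by this page's chain only
            self.materialize()

    def materialize(self):
        for record in self.pending:
            self.image = apply_redo(self.image, record)
        self.pending = []

    def current_version(self):
        # The materialized image is just a cache of log application: a reader
        # can always reconstruct the page from image + pending records.
        image = self.image
        for record in self.pending:
            image = apply_redo(image, record)
        return image
```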

Although replication amplifies writes, our approach dramatically reduces network load and gives us headroom for both performance and durability. The storage service can scale out IO in an embarrassingly parallel fashion without affecting the write throughput of the database engine. For example, figure 3 shows an Aurora cluster with one primary and multiple replicas deployed across multiple availability zones. In this model, the primary writes log records to the storage service and also ships those log records, along with metadata updates, to the replica instances. The IO flow batches log records by destination (a logical segment, i.e., a PG), in full order; each batch is delivered 6 ways to the replicas of the PG, and the database engine waits for acknowledgements from 4 or more of the 6 to satisfy the write quorum and consider the log records in question durable (hardened). The replicas use the log records to update the pages in their own buffer caches.

To measure network IO, we ran the write-only sysbench [9] workload on a 100GB data set against the two configurations described above: a MySQL mirroring configuration across multiple availability zones, and Aurora (on RDS), each running on r3.8xlarge EC2 instances for 30 minutes.

The results are shown in Table 1. Over the 30-minute period, Aurora completed 35 times more transactions than mirrored MySQL, and despite amplifying writes six-fold, it performed 7.7 times fewer IOs per transaction than mirrored MySQL; this does not even count the chained replication within EBS or the cross-AZ writes of MySQL. Each storage node, being only one of the six replicas, sees unamplified writes, resulting in 46 times fewer IOs to process at that tier. Writing less data lets us conserve the network, aggressively replicate for durability and availability, and issue requests in parallel to minimize the impact of jitter.

Moving log processing into the storage service also minimizes crash-recovery time, improving availability, and eliminates the jitter caused by background processes such as checkpointing, background page writing, and backups. Consider crash recovery. A traditional database, after a crash, must start from its most recent checkpoint and replay the log to ensure that all persisted redo records have been applied. In Aurora, durable redo record application happens at the storage tier, continuously, asynchronously, and distributed across the fleet. Any read request for a page that is not current applies the necessary redo records on demand. As a result, crash recovery is spread across all normal foreground processing, and nothing special is required at database startup.

3.3 Design points for storage services

The core design tenet of our storage service is to minimize the latency of the foreground write request, so we move the majority of storage processing to the background. Given the natural variation between peak and average foreground request rates at the storage tier, there is ample time to perform these tasks outside the foreground path, and we also have the opportunity to trade CPU for disk. For example, there is no need to run garbage collection (GC) of old page versions while a storage node is busy serving foreground requests, unless the disk is approaching capacity. In Aurora, background processing is therefore negatively correlated with foreground processing, unlike a traditional database, where background page writing and checkpointing are positively correlated with foreground load. If a backlog builds up on a storage node, we throttle foreground activity toward it to prevent a long queue from forming. Because segments are placed across storage nodes with high entropy, throttling one storage node is handled naturally by the 4/6 quorum system, with the slow node simply treated as an outlier.

Now look at the various activities of the storage node in more detail. As shown in figure 4, several steps are included:

1. Receive log records and append them to an in-memory queue.

2. Persist the records to disk and acknowledge receipt.

3. Organize the records and identify gaps in the log, since some batches may be lost.

4. Fill in the gaps by gossiping with peers.

5. Coalesce log records into new data pages.

6. Periodically back up new data pages and logs to S3.

7. Periodically garbage collect old page versions.

8. Periodically verify the CRC codes on the pages.

Note that not all of these steps are asynchronous: steps 1 and 2 are in the foreground path and can potentially add latency.
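The split between the two foreground steps and the six background steps can be summarized in a condensed sketch (illustrative only, not Aurora code; the class and method names are assumptions).

```python
# Condensed sketch of the storage-node flow in figure 4: only steps 1-2 sit on
# the foreground latency path; gap detection, gossip, coalescing, backup, GC,
# and CRC verification all run in the background.

class StorageNodeSketch:
    def __init__(self):
        self.in_memory_queue = []   # step 1 target
        self.durable_log = []       # stand-in for the on-disk log

    def receive(self, record) -> str:
        self.in_memory_queue.append(record)   # step 1 (foreground)
        self.durable_log.append(record)       # step 2: persist (pretend fsync)
        return "ACK"                          # the ack counts toward the 4/6 quorum

    def background_loop(self):
        # Steps 3-8 are asynchronous and never block receive().
        self.identify_gaps()                  # step 3
        self.gossip_fill_gaps()               # step 4
        self.coalesce_into_pages()            # step 5
        self.backup_to_s3()                   # step 6
        self.garbage_collect_old_versions()   # step 7
        self.verify_page_crcs()               # step 8

    # Background steps are left as stubs in this sketch.
    def identify_gaps(self): pass
    def gossip_fill_gaps(self): pass
    def coalesce_into_pages(self): pass
    def backup_to_s3(self): pass
    def garbage_collect_old_versions(self): pass
    def verify_page_crcs(self): pass
```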

04 the log marches forward

In this section, we describe how the log generated by the database engine is used so that the durable state, the runtime state, and the replica state are always consistent; in particular, how consistency is achieved efficiently without an expensive 2PC protocol. First, we show how to avoid expensive redo processing during crash recovery. Next, we explain normal operation and how we maintain runtime and replica state. Finally, we provide details of the crash recovery process.

4.1 solution Sketch: asynchronous processing

Because we model the database as a redo log stream, and the log is an ordered sequence of changes, we can exploit that ordering. In practice, each log record has a Log Sequence Number (LSN) associated with it, a monotonically increasing value generated by the database.

This lets us simplify the consensus protocol used to maintain state: rather than a protocol such as 2PC, which is chatty and intolerant of failures, we use an asynchronous approach. At a high level, we maintain points of consistency and durability and continually advance them as we receive acknowledgements for outstanding storage requests. Because any individual storage node may have missed one or more log records, the nodes gossip with the other members of their PG to find the gaps and fill them. The runtime state maintained by the database lets us read from individual storage segments rather than issuing read quorums, except during crash recovery, when that runtime state has been lost and must be rebuilt.

The database may have many independent outstanding transactions, which can complete (reach a durable, finished state) in an order quite different from the order in which they were initiated. If the database crashes or restarts, the decision of whether to roll back each of these in-flight transactions is made independently. The logic for tracking and undoing partially completed transactions stays in the database engine, just as it would with a simple disk. On restart, however, before the database is allowed to access the storage volume, the storage service performs its own recovery, concerned not with user-level transactions but with ensuring that the database sees a single, uniform view of storage even though storage is in fact distributed.

The storage service determines the highest LSN for which it can guarantee that all earlier log records are available: the Volume Complete LSN (VCL). During storage recovery, every log record with an LSN above the VCL must be truncated. The database can constrain this further by tagging a subset of the log records as Consistency Point LSNs (CPLs), and we define the Volume Durable LSN (VDL) as the highest CPL that is less than or equal to the VCL; everything above the VDL is then truncated. For example, even if storage is complete up to LSN 1007 (the VCL), the database may have marked CPLs only at 900 and 1000; in that case we must truncate at 1000: we are complete to 1007 (VCL) but durable only to 1000 (VDL).
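The definition of the VDL, and the 900/1000/1007 example just given, can be expressed directly as a small calculation (illustrative sketch, not Aurora code).

```python
# VDL is the largest CPL that does not exceed the VCL.
def compute_vdl(vcl: int, cpls: list[int]) -> int:
    return max(lsn for lsn in cpls if lsn <= vcl)

# Storage is complete up to LSN 1007 (VCL), but the database marked CPLs only
# at 900 and 1000, so recovery truncates everything above 1000 (the VDL).
assert compute_vdl(vcl=1007, cpls=[900, 1000]) == 1000
```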

Completeness and durability are therefore different things, and a CPL can be thought of as delineating a limited form of storage-system transaction that must be accepted in order. If the client has no use for this distinction, it can simply mark every log record as a CPL. In practice, the database interacts with storage as follows:

Each database-level transaction is broken up into multiple mini-transactions (MTRs) that are ordered and must be performed atomically (indivisibly).

Each mini-transaction is composed of multiple contiguous log records (as many as needed).

The final log record of a mini-transaction is a CPL.

During crash recovery, the database asks the storage service for the durable point of each PG, uses these to establish the VDL, and then issues commands to truncate the log records above the VDL.

4.2 normal operation

We now describe the "normal operation" of the database, focusing on write, read, commit, and replication in turn.

4.2.1 write

In Aurora, the database continuously interacts with the storage service, maintaining state to establish quorums, advance volume durability, and register transactions at commit. For instance, in the normal forward path, as the database receives acknowledgements that establish the write quorum for each batch of log records, it advances the current VDL. At any given time the database may have a large number of concurrent transactions, each generating its own log records. The database allocates a unique, ordered LSN to each record, subject to the constraint that no LSN may be greater than the sum of the current VDL and a constant called the LSN Allocation Limit (LAL, currently set to 10 million). This limit ensures that the database does not get too far ahead of the storage system, and it provides back-pressure to throttle incoming write requests if storage or the network cannot keep up.
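The LAL back-pressure rule reduces to a single comparison per LSN allocation, sketched below (illustrative, not Aurora code).

```python
# LSN Allocation Limit (LAL) back-pressure: a new LSN may not run more than
# LAL ahead of the current VDL.
LAL = 10_000_000  # the limit quoted in the text

def may_allocate_lsn(next_lsn: int, current_vdl: int) -> bool:
    """Return True if the writer may allocate next_lsn now, False to stall."""
    return next_lsn <= current_vdl + LAL

assert may_allocate_lsn(next_lsn=5_000_000, current_vdl=1_000_000)       # proceed
assert not may_allocate_lsn(next_lsn=12_000_001, current_vdl=1_000_000)  # wait for VDL to advance
```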

Note that each segment of each PG sees only a subset of the volume's log records: those that affect the pages residing on that segment. Each log record contains a backlink to the previous log record for that PG. These backlinks can be used to track the point of completeness of each segment's log records and establish a Segment Complete LSN (SCL): the greatest LSN below which all log records for that PG have been received. The storage nodes use the SCL when they gossip with one another to find and exchange the log records they are missing.
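One way to picture how backlinks expose gaps is the following sketch (a hedged illustration under my own simplified encoding of backlinks, not Aurora internals).

```python
# Compute a Segment Complete LSN (SCL): the highest LSN below which the
# segment holds an unbroken backlinked chain of records for its PG.
def compute_scl(records: dict[int, int]) -> int:
    """records maps LSN -> backlink (LSN of the previous record for this PG,
    0 for the first). Returns the SCL, or 0 if nothing is complete."""
    scl, expected_backlink = 0, 0
    for lsn in sorted(records):
        if records[lsn] != expected_backlink:
            break            # gap: a record in between was never received
        scl, expected_backlink = lsn, lsn
    return scl

# Records 10 and 20 arrived, 30 is missing, 40 arrived (its backlink is 30):
# the SCL stays at 20 until gossip with a peer fills in record 30.
assert compute_scl({10: 0, 20: 10, 40: 30}) == 20
```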

4.2.2 commits

In Aurora, transaction commits are completed asynchronously. When a client commits a transaction, the thread handling the commit records the transaction's "commit LSN" on a separate list of transactions waiting to be committed and moves on to other work. This is the equivalent of the WAL protocol: a commit is complete if and only if the latest VDL is greater than or equal to the transaction's commit LSN. As the VDL advances, the database identifies the waiting transactions that now qualify, and a dedicated thread sends commit acknowledgements to the waiting clients. Worker threads never pause for commits; they simply pick up other pending requests and continue processing.
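The asynchronous commit protocol amounts to a deferred acknowledgement list keyed by commit LSN, as in the following sketch (illustrative only; the names are assumptions).

```python
# Asynchronous commit: a worker records the commit LSN and moves on; the
# acknowledgement is sent later, once the VDL has advanced past that LSN.
pending_commits = []   # (commit_lsn, client_id) pairs awaiting durability

def commit(commit_lsn: int, client_id: str) -> None:
    # The worker does not block here; it goes back to serving other requests.
    pending_commits.append((commit_lsn, client_id))

def on_vdl_advanced(new_vdl: int) -> list[str]:
    """Called when write-quorum acks advance the VDL; returns the clients
    whose commits are now durable and can be acknowledged."""
    global pending_commits
    ready = [cid for lsn, cid in pending_commits if lsn <= new_vdl]
    pending_commits = [(lsn, cid) for lsn, cid in pending_commits if lsn > new_vdl]
    return ready

commit(1050, "client-a")
commit(1200, "client-b")
assert on_vdl_advanced(1100) == ["client-a"]   # only the first commit is durable
```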

4.2.3 read

In Aurora, as in most databases, pages are served from the buffer cache, and a storage IO request is issued only if the page is not present in the cache.

If the buffer cache is full, the system finds a victim page to evict. In a traditional database, if the victim is a dirty page, it is flushed to disk before being replaced, which guarantees that a subsequent fetch of the page always returns the latest data. Aurora, by contrast, does not write out pages on eviction; instead it enforces a similar guarantee: a page in the buffer cache is always the latest version. The guarantee is implemented by evicting a page only if its "page LSN" (the LSN of the log record for the latest change to the page) is greater than or equal to the VDL. This protocol ensures that (a) all changes to the page have been hardened in the log, and (b) on a cache miss, it is sufficient to request the version of the page as of the current VDL to obtain its latest durable version.

The database does not normally need to establish consensus using a read quorum. When a page is read from disk, the database establishes a read point, representing the VDL at the time the read request was issued. It can then select a storage node that is complete with respect to that read point, knowing it will receive an up-to-date version. The page returned by the storage node must be consistent with the expected semantics of the mini-transactions (MTRs) in the database. Because the database manages feeding log records to the storage nodes and tracks their progress (for example, the SCL of each segment), it knows which segments can satisfy a read (those whose SCL is at or above the read point) and issues the request directly to a segment that has sufficient data.

Since the database knows which reads are outstanding, it can compute, at any time and on a per-PG basis, the Minimum Read Point LSN. If there are read replicas, the writer gossips with them to establish the per-PG minimum read point across all nodes; this value is called the Protection Group Min Read Point LSN (PGMRPL). The PGMRPL is the "low water mark" below which all log records for the PG are unnecessary: each database node guarantees that it will never issue a page read request with a read point lower than the PGMRPL. Each storage node learns the PGMRPL from the database, so it can coalesce the older log records, materialize the affected pages to disk, and then safely garbage collect those log records.
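The PGMRPL is simply the minimum of the in-flight read points across all database nodes for a given PG, as in this sketch (illustrative names and data only).

```python
# PGMRPL "low water mark": the smallest read-point any node may still use
# against this PG; redo records at or below it become eligible for garbage
# collection once the pages they touch have been materialized.
def compute_pgmrpl(read_points_per_node: dict[str, list[int]]) -> int:
    """read_points_per_node maps each database node (writer or replica) to the
    read-points of its in-flight reads on this protection group."""
    return min(rp for points in read_points_per_node.values() for rp in points)

# No future read will use a read-point below 950 on this PG.
assert compute_pgmrpl({"writer": [980, 1010], "replica-1": [950]}) == 950
```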

The concurrency control protocols themselves are executed entirely in the database engine, organized just as they would be in a traditional database using local storage.

4.2.4 replicas (replication nodes)

"in Aurora, there are up to 15 read replications in a single write on the same storage volume." Therefore, read replication does not add additional cost in terms of storage consumption, nor does it add additional disk writes. To minimize latency, the log flow generated by the write is sent not only to the storage node, but also to all read replication nodes. In the read node, the database checks the daily log records of the log flow in turn. If the page referenced by the log is in the cache of the read node, the log application applies the specified log to the page in the cache. Otherwise, simply discard the log record. Note that from the point of view of the write node, the replication node consumes the log flow asynchronously, while the write node responds to the user's submission independently of the replication node. Replication nodes must follow the following two important rules when applying logs: (a) only logging with LSN less than or equal to VDL can be applied. (B) only logging that belongs to a single mini-transaction (MTR) can be applied (mtr incomplete logging cannot be applied) to ensure that the replication node sees a consistent view of all database objects. In fact, each replication node is delayed only for a short period of time (20ms or less) after the write node.

4.3 recovery

Most databases use a recovery protocol such as ARIES, which depends on a write-ahead log (WAL) that can represent the precise contents of every committed transaction. Such systems also periodically checkpoint the database, establishing durability points in a coarse-grained way by flushing dirty pages to disk and writing checkpoint records to the log. On restart, any given page may be missing committed data or may contain uncommitted data. During crash recovery, the system therefore replays the log from the most recent checkpoint, using the log applicator to apply each record to the relevant database page, and then rolls back the transactions that were in flight at the time of the crash by executing the corresponding undo records, bringing the database to a consistent state as of the point of failure. Crash recovery is an expensive operation; shortening the checkpoint interval reduces its cost, but at the expense of interference with foreground transactions. Traditional databases must trade the two off against each other; Aurora does not.

A great simplifying principle of a traditional database is that the same log applicator is used both to advance database state on the forward path and for crash recovery, where it runs synchronously in the foreground while the database is offline. Aurora follows the same principle, except that the log applicator has been decoupled from the database and runs in the storage service, always, in parallel, and in the background. When the database starts up, it performs volume recovery in collaboration with the storage service. As a result, an Aurora database recovers very quickly (generally in under 10 seconds), even if it crashed while processing over 100,000 statements per second.

The Aurora database does still need to re-establish its runtime state after a crash. To do so, it contacts each PG: any segment whose data satisfied the write quorum, combined with a read quorum of segments, is sufficient to guarantee that the data is discovered. Once the database has established a read quorum for every PG, it can recalculate the VDL and generate a truncation range that annuls every log record above the VDL, up to an end LSN that the database can prove is at least as high as any outstanding log record, so that incomplete logs (those not ending a complete MTR) are never made visible. The database then infers the upper bound on LSN allocation, that is, how far beyond the VDL LSNs may be allocated (the 10 million limit described earlier). The truncation range is versioned with an epoch number and written durably to the storage service, so there is no confusion about which log records have been truncated, whether recovery is interrupted or restarted.

The database still needs to perform undo recovery to roll back the transactions that were in flight at the time of the crash. However, undo can proceed while the database is online, after it has built the list of in-flight transactions from the information in the undo segments.

05 putting it all together

In this section, we describe the building blocks of Aurora, as shown in figure 5.

The Aurora database engine is a fork of community MySQL/InnoDB and differs primarily in how it reads and writes data to disk. In community InnoDB, a write modifies pages in the buffer pool, and the corresponding redo log records are written, in LSN order, into the WAL buffer. On transaction commit, the WAL protocol requires only that the transaction's log records be written to disk. The modified buffer pages are eventually written to disk using double-write techniques, which avoid partial page writes. These page writes happen in the background, on eviction from the cache, or during checkpointing. Besides the IO subsystem, InnoDB also includes a transaction subsystem, a lock manager, a B+ tree implementation, and the "mini-transaction" (MTR) construct: a group of operations that InnoDB must execute atomically (for example, splitting or merging B+ tree pages).

In Aurora's InnoDB variant, the redo log records that each MTR must execute atomically are organized into batches, sharded by the PGs to which the affected pages belong, and written to the storage service, with the final record of each MTR tagged as a consistency point. On the writer, Aurora supports exactly the same isolation levels that community MySQL does (the ANSI standard levels, snapshot isolation, and consistent reads). Aurora read replicas obtain continuous information about transaction starts and commits from the writer and use it to support snapshot isolation for local, read-only transactions. Note that concurrency control is implemented entirely in the database engine, with no impact on the storage service; the storage service presents a unified view of the underlying data that is logically equivalent to what community InnoDB gets by writing data to local storage.

Aurora uses Amazon Relational Database Service (RDS) as its control plane. RDS includes an agent on each database instance that monitors cluster health to determine whether the instance needs to fail over or be replaced. Each instance is part of a cluster consisting of one writer and zero or more read replicas. All the instances of a cluster are in a single geographic region (for example, us-east-1 or us-west-2) but are placed in different availability zones, and they connect to a storage fleet in the same region. For security, communication between the database, the applications, and storage is isolated. In practice, each database instance communicates over three Amazon Virtual Private Cloud (VPC) networks: the customer VPC, through which customer applications interact with the database instance; the RDS VPC, through which the database nodes interact with one another and with the control plane; and the storage VPC, through which the database interacts with storage.

The storage service is deployed on a fleet of EC2 virtual machines that span at least three availability zones per region. Together they are responsible for provisioning storage volumes for multiple customers, and for reading, writing, backing up, and restoring the data in those volumes. The storage nodes operate the local SSDs and interact with peer nodes, database instances, and the backup/restore services, which continuously back up changed data to S3 and restore it from S3 on demand. The storage control plane uses Amazon DynamoDB to durably store cluster and storage-volume configuration, volume metadata, and descriptions of the data backed up to S3. To orchestrate long-running operations, such as recovering a database volume or repairing (re-replicating) a failed storage node, the control plane uses Amazon Simple Workflow Service. Maintaining high availability requires proactive, automated, early detection of real or potential problems before they affect customers, so all critical aspects of storage operation are continuously monitored by metric-collection services, and alarms are raised whenever key performance or availability metrics show anomalies.

06 performance results

In this section, we share results observed since Aurora became "generally available" (GA) in May 2015. We present industry-standard benchmark results as well as performance results from our customers.

Standard benchmark test results

Here we show performance comparisons of Aurora and MySQL in several experiments using industry-standard benchmarks such as variants of sysbench and TPC-C. Unless otherwise noted, the MySQL instances ran with storage attached as an EBS volume provisioned with 30K IOPS, on r3.8xlarge EC2 instances with 32 vCPUs and 244GB of memory, equipped with Intel Xeon E5-2670 v2 (Ivy Bridge) processors; the buffer cache on the r3.8xlarge was set to 170GB.

07 lessons learned

We see a wide variety of applications run by our customers, from small Internet companies to sophisticated organizations operating large numbers of Aurora clusters. While many of their use cases are standard, we focus on scenarios and expectations that are common among cloud-service applications and that have led us in new directions.

7.1 Multi-tenancy and database consolidation

Many of our customers operate software-as-a-service (SaaS) businesses, some exclusively cloud-native and others moving legacy enterprise-deployed systems to SaaS. We find that these customers often rely on an application they cannot easily change, so they typically consolidate their different end customers onto a single instance using a schema or database per tenant. This approach reduces cost: they avoid paying for a dedicated instance per customer, since their customers are unlikely to all be active at once. For example, some SaaS customers report having more than 50,000 customers of their own.

This pattern differs from the well-known multi-tenant model of Salesforce.com, which keeps all tenants in unified tables within a single schema and tags tenancy at the row level. As a result, customers doing this kind of database consolidation end up with very large numbers of tables; production instances holding many small databases with more than 150,000 tables in total are quite common. This puts pressure on components that manage metadata, such as the dictionary cache. These customers need to (a) sustain high throughput and many concurrent user connections, (b) provision and pay for only as much storage as their data actually uses, since usage is hard to predict in advance, and (c) reduce jitter so that load spikes from a single tenant have minimal impact on other tenants. Aurora supports these attributes and fits these SaaS applications well.

7.2 Highly concurrent, auto-scaling workloads

Internet workloads must handle spikes in traffic caused by sudden events. One of our major customers had a special appearance on a highly popular national television show and experienced a spike far above its normal peak throughput without stressing the database. To support such spikes, it is important for a database to handle very large numbers of concurrent user connections. This is feasible in Aurora because the underlying storage system scales so well. Several of our customers run at over 8,000 connections per second.

7.3 Schema upgrades

Modern web application frameworks such as Ruby on Rails ship with object-relational mapping (ORM) tools. As a result, it is easy for application developers to change the database schema, but hard for DBAs to manage how the schema gets upgraded. In Rails applications, we have heard first-hand from DBAs that database migrations are either "a week of work" or are avoided through workarounds that make future migrations less painful. MySQL, meanwhile, offers liberal schema-change semantics and implements most changes with a full table copy. Since frequent DDL is a pragmatic reality, we implemented an efficient online DDL that (a) versions schemas on a per-page basis and uses the schema history to decode individual pages on demand, and (b) lazily upgrades individual pages to the latest schema using a modify-on-write primitive, as sketched below.
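The following is a heavily hedged sketch of that idea; the schema history, data layout, and function names are all hypothetical and not Aurora internals, but they illustrate per-page schema versioning with lazy, modify-on-write upgrades.

```python
# Hypothetical illustration: each page records the schema version it was
# written under; readers decode rows against the schema history on demand,
# and a page is upgraded to the latest schema only when it is modified.

schema_history = {1: ["id", "name"], 2: ["id", "name", "email"]}  # hypothetical

def decode_row(page_schema_version: int, raw_row: list) -> dict:
    row = dict(zip(schema_history[page_schema_version], raw_row))
    # Columns added by later schema versions read as NULL until upgrade.
    for col in schema_history[max(schema_history)]:
        row.setdefault(col, None)
    return row

def modify_on_write(page: dict) -> dict:
    """Upgrade a stale page to the latest schema only when it is modified."""
    latest = max(schema_history)
    if page["schema_version"] < latest:
        page["rows"] = [decode_row(page["schema_version"], r) for r in page["rows"]]
        page["schema_version"] = latest
    return page

page = {"schema_version": 1, "rows": [[1, "alice"], [2, "bob"]]}
print(modify_on_write(page))  # rows now carry an email column defaulting to None
```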

7.4 availability and software upgrades

Our customers have strong expectations that a cloud-native database will resolve the tension between keeping clusters running and patching or upgrading servers. Customers who rely on Aurora primarily to back their OLTP services find any disruption painful, so many have little tolerance for our database software updates, even when these amount to only 30 seconds of downtime every six weeks. Accordingly, we recently released the Zero-Downtime Patch (ZDP) feature, which lets us patch a customer's engine without affecting in-flight database connections.

As shown in figure 5, ZDP works by looking for an instant when there are no active transactions and, at that moment, spooling application state to local ephemeral storage, patching the database engine, and then reloading the application state. Throughout the process, user sessions remain active and are unaware that the engine has changed underneath them.

08 related work

In this paragraph, we discuss the contributions of others, which are related to the approach taken by Aurora.

Decoupling storage from the engine: although traditional databases are built as monolithic daemons, there has been prior work on decomposing the database kernel into separate components. Deuteronomy, for example, splits the system into a Transaction Component (TC), which provides concurrency control and crash recovery, and a Data Component (DC), which provides access methods on top of LLAMA, a latch-free, log-structured cache and storage manager. Sinfonia and Hyder abstract transactional access methods over scale-out services, and database systems can be implemented on top of those abstractions. The Yesquel system implements a multi-version distributed balanced tree and separates concurrency control from query processing. Aurora decouples storage at a lower level than Deuteronomy, Hyder, Sinfonia, or Yesquel: in Aurora, query processing, transactions, concurrency, buffer caching, and access methods are decoupled from logging, storage, and crash recovery, which are implemented as a scale-out service.

The trade-off between correctness and availability for distributed systems in the face of partitions has long been known: one-copy serializability cannot be achieved in the presence of network partitions. More recently, Brewer's CAP theorem established that a highly available system cannot provide "strong" consistency under network partitions. These results, together with our experience at cloud scale, motivate our consistency goals even in the presence of partitions caused by an AZ failure.

Bailis et al. studied the problem of Highly Available Transactions (HATs). They showed that neither partitions nor high network latency need cause unavailability. Serializability, snapshot isolation, and repeatable read are not HAT-compliant, but other isolation levels are achievable with high availability. Aurora provides all of these isolation levels by making the simplifying assumption that, at any time, there is only a single writer generating log updates, with LSNs allocated from a single ordered domain.

Google's Spanner provides externally consistent reads and writes, as well as globally consistent reads across the database at a timestamp. These features enable Spanner to perform consistent backups, consistent distributed query processing, and atomic schema updates at global scale, even in the presence of ongoing transactions. As explained by Bailis, Spanner is highly specialized for Google's read-heavy workloads and relies on two-phase commit and two-phase locking for read/write transactions.

Weak consistency and weak isolation models for concurrency control are well understood in distributed databases and have led to optimistic replication techniques and eventually consistent systems. Other approaches in centralized systems range from classic pessimistic schemes based on locking, to optimistic schemes such as the multi-version concurrency control in Hekaton, to sharding approaches such as VoltDB, and to timestamp ordering as in Deuteronomy and Hyper. Aurora provides the database engine with the abstraction of a durable local disk and lets the engine handle isolation and concurrency control on its own.

Log-structured storage systems were introduced by LFS in 1992. More recent work on Deuteronomy, LLAMA, and the Bw-tree applies log-structured techniques in a variety of ways across the storage engine stack and, like Aurora, reduces write amplification by writing deltas rather than whole pages. Both Aurora and Deuteronomy implement pure redo logging and track the highest durably persisted LSN to acknowledge commits.

While traditional databases rely on recovery protocols based on ARIES, some recent systems have chosen other paths for performance. For example, Hekaton and VoltDB rebuild their in-memory state after a crash using some form of update log. Systems like Sinfonia avoid crash recovery altogether using techniques such as process pairs and replicated state machines. Graefe describes a system with per-page log record chains that enables on-demand, page-by-page redo and can speed up crash recovery. Like Aurora, Deuteronomy does not require redo recovery; however, it achieves this by delaying transactions so that only the changes of committed transactions are written to durable storage, and therefore, unlike Aurora, the size of the transactions Deuteronomy can handle is limited.

09 conclusion

We have designed Aurora, a high-throughput OLTP database that compromises neither availability nor durability in a cloud-scale environment. The big idea was to move away from the monolithic architecture of traditional databases and decouple storage from compute. In particular, we moved the lower quarter of the database kernel into an independent, scalable, distributed service that manages logging and storage. Since all IO is written over the network, our fundamental constraint is now the network, so we focused on techniques that relieve network pressure and improve throughput. We rely on quorum models that tolerate the correlated failures of large-scale cloud environments without suffering outlier performance penalties, on log processing that reduces the aggregate IO burden, and on asynchronous consensus that eliminates chatty, expensive multi-phase commit protocols, offline crash recovery, and checkpointing in distributed storage. Our approach has led to a simplified architecture with reduced complexity that is easy to scale and lays the foundation for future advances.

Personal interpretation of the article

In the Data Plane shown in figure 1, the database engine box contains Caching, and the storage service box also contains Caching, which gives much food for thought. One can guess that the cache is shared memory accessible to both the database engine and the storage service, so it is part of both. Most of the paper discusses how to reduce the network burden of write requests but says little about whether read requests burden the network. One can therefore guess that a write applies the log not only to disk storage but also to pages already in the cache, so most read requests need no actual storage IO unless the page misses in the cache. From the Aurora tests in Section 6 we can see that an Aurora instance's cache is large enough that the database working set of a typical application fits in memory. Figure 3 also shows the AZ1 primary instance sending logs and metadata to the replica instances in AZ2 and AZ3. Now that log processing has been decoupled from the database engine, who applies these shipped logs? In my view, when AZ2 and AZ3 receive the logs, it is still the storage service that applies them, with the cache shared between the storage service and the database engine.

Following this guess, consider the implementation of the zero-downtime patch. How does ZDP keep user connections alive while spooling? In my view, there is a proxy between the applications and the database: the proxy maintains the connections to the applications and re-establishes connections to the updated database instance. For user sessions to be unaffected, the cache must be shared memory: because the cache holds most of the database's runtime state, the restarted database instance can continue serving the user sessions' queries. This resembles PostgreSQL's use of shared memory, where you can kill certain PostgreSQL processes without affecting the queries of a user session.

In addition, I personally suspect that an Aurora compute node may also be a storage node. Although storage and compute are logically separated, Aurora can still enjoy the benefits of co-location: for example, the storage service can apply logs directly to the cache, so the database engine does not need real IO for most read requests.

Does Aurora have a checkpoint?

Aurora's PGMRPL can be thought of as a checkpoint: data pages whose LSNs are below this point have been materialized, pages above it may or may not be, and log records below it can be discarded. PGMRPL is logically equivalent to a storage-engine checkpoint, in effect a per-PG checkpoint.

Aurora and PolarDB

Both Aurora and PolarDB use an architecture that separates storage from compute, with one writer, multiple readers, and shared distributed storage. Clearly, PolarDB does not optimize away write amplification, and its read nodes must reconcile the primary's dirty pages with the shared storage and their own private dirty pages, a tension that is hard to resolve; as a result, PolarDB's read nodes can affect the write node. Aurora can be viewed as having no dirty pages at all: the log is the database and can be regarded as a slower form of storage, while both the storage service and the cache act as caches of log application.

There is replication lag between the replica nodes and the primary, so Aurora's storage keeps multiple versions of data pages; the paper explicitly mentions that old page versions are garbage collected in storage. A replica reads the page version specified by its read-point LSN, and Section 4.2.4 notes that the lag between the writer and a read replica is under 20ms, so there should not be many old page versions that cannot be reclaimed.

Temporary tables

Although read nodes only execute queries, a query may create temporary tables to hold intermediate results. Generating temporary-table data produces no log records, but it still performs writes. My guess is that Aurora writes such data directly to local storage, so it places no burden on the network.
