The Choice of "Consistency" in the Design of Modern Distributed Storage Systems

The CAP theorem has had less influence on the design of modern distributed database systems (DDBSs) than is commonly assumed. Instead, the trade-off between consistency and latency has a more direct impact on several mainstream DDBSs. The PACELC formulation presented in this article combines this trade-off with CAP so that the design of distributed data systems can be reasoned about more systematically and comprehensively.

Although academia began studying distributed database systems decades ago, DDBSs have only recently come into wide industrial use. There are two main drivers of this trend. First, modern applications require ever-increasing data and transaction throughput, which creates a need for elastic, scalable database systems. Second, globalization and accelerating business expansion require placing data near customers all over the world. Examples of DDBSs built in the past ten years to achieve high scalability or worldwide accessibility (or both) include SimpleDB/Dynamo/DynamoDB, Cassandra, Voldemort (http://projectvoldemort.com), Sherpa/PNUTS, Riak (http://wiki.basho.com), MongoDB (www.mongodb.org), VoltDB/H-Store, and Megastore.

DDBSs are complex, and building them is difficult. Any tool that helps designers understand the trade-offs involved in creating a DDBS is therefore useful. In particular, the CAP theorem has proved valuable for helping designers reason about a proposed system's capabilities and for exposing the exaggerated marketing hype of many commercial DDBSs. However, since it was first formally proved, CAP has become increasingly misunderstood and misapplied. In particular, many designers have incorrectly concluded that the theorem imposes restrictions on a DDBS during normal system operation and have therefore built in unnecessary limitations. In fact, CAP only constrains the system in the face of certain types of failures; it imposes no constraints on system capabilities during normal operation.

However, a different fundamental trade-off does constrain DDBS capabilities during normal operation and has shaped the design choices of mainstream systems: the trade-off between consistency and latency. Arguably, this trade-off has had a greater impact on DDBS design than CAP. Both trade-offs matter; unifying CAP and the consistency/latency trade-off into a single formulation, PACELC, leads to a deeper understanding of modern DDBS design.

CAP IS FOR FAILURES

CAP states that, when building a DDBS, designers can choose at most two of three desirable properties: consistency (C), availability (A), and partition tolerance (P). Therefore, only CA systems (consistent and highly available, but not partition-tolerant), CP systems (consistent and partition-tolerant, but not highly available), and AP systems (highly available and partition-tolerant, but not consistent) are possible.

Many modern DDBSs (including SimpleDB/Dynamo, Cassandra, Voldemort, Sherpa/PNUTS, and Riak) do not guarantee consistency as CAP defines it by default. (Although consistency became tunable in some of these systems after their initial releases, the focus here is on their original designs.) In their proof of CAP, Seth Gilbert and Nancy Lynch use the atomic/linearizable definition of consistency: "there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time."

Given that early DDBS research focused on consistent systems, it is natural to assume that CAP was the major influence on modern system architects, who built more and more systems implementing reduced consistency models in the period after the theorem was proved. The reasoning behind this assumption is that, because any DDBS must tolerate network partitions, according to CAP the system must choose between high availability and consistency. For mission-critical applications in which high availability is essential, there is no choice but to sacrifice consistency.

However, this logic is flawed and inconsistent with what CAP actually says. It is not partition tolerance alone that forces a trade-off between consistency and availability; rather, it is the combination of:

Partition tolerance

The existence of a network partition itself

The theorem simply states that a network partition forces the system to choose between reduced availability and reduced consistency. The probability of a network partition depends heavily on the details of the system's implementation: Is it distributed over a wide area network (WAN), or only within a local cluster? What is the quality of the hardware? What processes ensure that changes to network configuration parameters are performed carefully? What is the level of redundancy? In general, however, network partitions are rare and usually occur less frequently than other serious failure events in a DDBS.

Reducing the consistency of a DDBS on the basis of CAP when no partition has occurred is a mistake.

Since CAP says nothing about baseline operation, it cannot justify a DDBS that reduces consistency when no partitions are present. In fact, CAP allows a system to provide full ACID (atomicity, consistency, isolation, and durability) guarantees together with high availability when there are no partitions. Therefore, the theorem does not fully justify the default configurations of DDBSs that reduce consistency (and usually other ACID guarantees as well).

CONSISTENCY/LATENCY TRADEOFF

To understand modern DDBS design, it is important to know the context in which these systems were built. Amazon originally designed Dynamo to serve data to the core services of its e-commerce platform, such as the shopping cart. Facebook built Cassandra to power its Inbox Search feature. LinkedIn created Voldemort to handle online updates from various write-intensive features on its website. Yahoo built PNUTS to store user data that can be read or written on every page view, to store listings data for Yahoo's shopping pages, and to store data serving its social networking applications.

In each case, the system typically serves data for pages that are built on the fly and shipped to users of an active website, and it accepts online updates. Studies show that latency is a critical factor in online interactions: an increase as small as 100 milliseconds can dramatically reduce the probability that a customer will continue to interact or return in the future.

Unfortunately, there is a fundamental trade-off between consistency, availability, and latency. (Note that availability and latency are arguably the same thing: an unavailable system essentially provides extremely high latency. For this discussion, I consider a system with latency longer than a typical request timeout, for example a few seconds, to be unavailable, and latency shorter than a request timeout but still approaching several hundred milliseconds to be "high latency." Eventually, however, I drop this distinction and allow the low-latency requirement to cover both cases. The trade-off is therefore really just between consistency and latency, as the title of this article suggests.)

This trade-off exists even when there is no network partition, so it is completely separate from the trade-off described by CAP. Nevertheless, it is a key factor in the design of the systems mentioned above. (Whether a single machine failure is treated as a special type of network partition is irrelevant to this discussion.)

The reason for the trade-off is that a high-availability requirement implies that the system must replicate data. If the system runs long enough, at least one component will eventually fail. When that happens, all data controlled by that component becomes unavailable, unless the system replicated another copy of the data before the failure. Therefore, even in the absence of an actual failure, the mere possibility of failure means that availability requires some degree of data replication during normal system operation. Note the important difference between this trade-off and the CAP trade-off: whereas the occurrence of a failure causes the CAP trade-off, the possibility of failure by itself causes this one.

To achieve the highest possible availability, a DDBS must replicate data over a WAN to survive the failure of an entire data center due to events such as hurricanes, terrorist attacks, or even a simple network configuration error, as in the well-known Amazon EC2 cloud outage of April 2011. The five reduced-consistency systems mentioned above are designed for extremely high availability and are typically used with replication over a WAN.

DATA REPLICATION

Once a DDBS replicates data, a trade-off between consistency and latency arises. This happens because there are only three alternatives for data replication: the system sends data updates to all replicas at the same time, to an agreed-upon master node first, or to a single (arbitrary) node first. The system can implement each of these cases in various ways; however, every implementation comes with a consistency/latency trade-off.

(1) Data updates sent to all replicas at the same time

If updates do not first pass through a preprocessing layer or some other agreement protocol, replica divergence is possible, which is a clear lack of consistency (assuming multiple updates are submitted to the system concurrently, for example from different clients), because each replica might choose a different order in which to apply the updates. (Even if all updates are commutative, so that every replica eventually reaches the same state despite applying the updates in different orders, the system is still not consistent under Gilbert and Lynch's strict definition. However, generalized Paxos can provide consistent replication at the cost of one round trip.)
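To make the divergence problem concrete, the following is a minimal sketch (my own illustration, not from the original article) of two replicas that receive the same pair of non-commutative updates in different orders and end up in different states.

```python
def apply_updates(initial, updates):
    """Apply a list of update functions to a starting value, in order."""
    value = initial
    for update in updates:
        value = update(value)
    return value

# Two concurrent updates to the same data item, submitted by different clients.
set_to_100 = lambda v: 100    # client A overwrites the value
add_10 = lambda v: v + 10     # client B increments the value

# Replica 1 happens to apply A's update first; replica 2 applies B's first.
replica_1 = apply_updates(0, [set_to_100, add_10])   # -> 110
replica_2 = apply_updates(0, [add_10, set_to_100])   # -> 100

print(replica_1, replica_2)   # 110 100: the replicas have diverged
```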

On the other hand, if updates first pass through a preprocessing layer, or if all nodes coordinate during writes to decide the order of operations, then all replicas can be made to agree on the order of updates. However, this leads to increased latency. In the case of an agreement protocol, the protocol itself is a source of latency.

In the case of preprocessing, there are two sources of latency. First, routing updates through an additional system component (the preprocessor) increases latency. Second, the preprocessor consists of either multiple machines or a single machine. In the former case, an agreement protocol is still needed to decide the order of operations across those machines. In the latter case, the system forces all updates, no matter where they originate, potentially anywhere in the world, to travel all the way to the single preprocessor first, even if another replica of the data is closer to the update's origin.

(2) Data updates sent to an agreed-upon location first

I call this agreed-upon location the "master node" (different data items can have different master nodes). The master node resolves all requests to update a data item, and the order in which it chooses to apply those updates determines the order in which all replicas apply them. After the master node resolves the updates, it replicates them to all replica locations.

There are three replication options:

a. Replication is synchronous: the master node waits until all updates have reached the replicas. This ensures that the replicas remain consistent, but synchronous actions across independent entities, especially over a WAN, increase latency, because messages must pass between those entities and because latency is limited by the slowest entity.

b. Replication is asynchronous: the system treats an update as complete before it has been replicated. Typically, the update has at least been persisted somewhere before the initiator learns that it has completed (in case the master node fails), but there is no guarantee that the system has propagated the update. In this case, the consistency/latency trade-off depends on how the system handles reads:

i. If the system routes all reads to the master node and serves them from there, consistency is not reduced. However, this approach has two latency problems:

1. Even if a replica is close to the initiator of the read request, the system must still route the request to the master node, which may be physically much farther away.

2. If the master node is overloaded with other requests or has failed, reads cannot be served from other nodes; instead, the request must wait for the master node to become free or to recover. In other words, the lack of load-balancing options increases latency.

ii. If the system can serve reads from any node, read latency is much better, but reads of the same data item can be inconsistent, because different locations hold different versions of the item while the system is still propagating an update, and a read may be sent to any of those locations. Although the degree of reduced consistency can be bounded by tracking update sequence numbers and using them to implement sequential/timeline consistency or read-your-writes consistency, these are still reduced consistency models. In addition, if the master node for writes is geographically far from the write requester, write latency can be high.

c. A combination of (a) and (b) is possible: the system sends updates to a subset of the replicas synchronously and to the rest asynchronously. In this case, the consistency/latency trade-off once again depends on how the system handles reads:

i. If the system routes reads to at least one node that has been synchronously updated, for example when R + W > N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas, then consistency can be preserved. However, the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present, albeit to a lesser degree, because the number of nodes involved in the synchronization is smaller and more than one node can serve read requests. (A short sketch of this quorum condition appears after case ii below.)

ii. If the system can serve reads from nodes that have not been synchronously updated, for example when R + W ≤ N, then inconsistent reads are possible, as in (b)(ii). (Technically, a quorum protocol alone is not sufficient to guarantee consistency as Gilbert and Lynch define it. However, the additional protocols needed to achieve full consistency are not relevant here, because the latency is inherent in the quorum protocol even without them.)
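To make the quorum condition concrete, here is a minimal sketch (a toy in-memory model of my own, not any real system) showing why R + W > N guarantees that a quorum read overlaps the most recent quorum write on at least one replica.

```python
import random

N, W, R = 5, 3, 3   # R + W = 6 > N = 5, so every read set intersects every write set

replicas = [{"version": 0, "value": None} for _ in range(N)]

def quorum_write(version, value):
    """Synchronously write to W replicas chosen arbitrarily."""
    for i in random.sample(range(N), W):
        replicas[i] = {"version": version, "value": value}

def quorum_read():
    """Read from R replicas chosen arbitrarily and return the newest version seen."""
    sampled = [replicas[i] for i in random.sample(range(N), R)]
    return max(sampled, key=lambda r: r["version"])

quorum_write(1, "x")
print(quorum_read())  # always reports version 1; with R + W <= N a stale read is possible
```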

(3) Data updates sent to an arbitrary location first

The difference between this case and (2) is that the node to which the system first sends updates for a particular data item is not always the same. For example, two different updates for the same data item can be initiated at two different locations at the same time. The consistency/latency trade-off again depends on two options:

a. If replication is synchronous, the latency problems of (2)(a) are present. In addition, the system may incur extra latency to detect and resolve cases in which the same data item was updated simultaneously at two different locations (a sketch of one common detection mechanism appears after this list).

b. If replication is asynchronous, consistency problems like those of (1) and (2)(b) arise.
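The extra work in case (3)(a), detecting that the same item was updated simultaneously at two locations, is often done with per-node version vectors. The following is a minimal sketch of that idea (my own illustration; the systems named in this article use their own, more elaborate mechanisms).

```python
def dominates(a, b):
    """True if version vector `a` is at least as recent as `b` for every node."""
    return all(a.get(node, 0) >= b.get(node, 0) for node in set(a) | set(b))

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a supersedes b"
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent: conflict must be detected and resolved"

# Two updates to the same data item initiated at different nodes at the same time.
version_at_node1 = {"node1": 2, "node2": 1}
version_at_node2 = {"node1": 1, "node2": 2}

print(compare(version_at_node1, version_at_node2))  # concurrent: conflict ...
```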

TRADEOFF EXAMPLES

No matter how a DDBS replicates data, it clearly must trade off consistency against latency. For carefully controlled replication over short distances, reasonable options exist, such as (2)(a), because the latency of network traffic within a local data center is small; but for replication over a WAN, there is no way around the consistency/latency trade-off.

To understand this trade-off more fully, consider how four DDBSs designed for extremely high availability, Dynamo, Cassandra, PNUTS, and Riak, handle it. Because these systems were designed for low-latency interaction with active Web clients, each sacrifices consistency to reduce latency.

Dynamo, Cassandra, and Riak use a combination of (2)(c) and (3). In particular, the system sends updates to the same node and then propagates them synchronously to W other nodes, that is, case (2)(c). Reads are sent synchronously to R nodes, and R + W is typically set to a value less than or equal to N, which makes inconsistent reads possible. However, the system does not always send updates for a particular data item to the same node; this can happen, for example, in various failure scenarios or because of load-balancer rerouting. This leads to the situation described in (3) and to potentially larger consistency gaps. PNUTS uses option (2)(b)(ii), achieving better latency at a cost in consistency.

A study by Jun Rao, Eugene Shekita, and Sandeep Tata further illustrates the consistency/latency trade-off in the baseline implementations of these systems. The researchers experimentally evaluated two points in Cassandra's consistency/latency trade-off. The first option, weak reads, allows the system to serve a read from any replica, even if that replica has not received all outstanding updates to the data item. The second option, quorum reads, requires the system to explicitly check multiple replicas for inconsistencies before reading the data. The second option clearly increases consistency at the cost of latency relative to the first; the latency difference between the two can be a factor of four or more.
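As a rough illustration of the knob that Rao, Shekita, and Tata were measuring, the sketch below expresses a weak read and a quorum read using the DataStax Python driver for Cassandra. The cluster address, keyspace, table, and key are placeholders, and the mechanism in the Cassandra version used in the study differed from today's driver API.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])            # placeholder contact point
session = cluster.connect("my_keyspace")    # placeholder keyspace

query = "SELECT value FROM items WHERE id = %s"

# Weak read: any single replica may answer, even one that missed recent updates.
weak_read = SimpleStatement(query, consistency_level=ConsistencyLevel.ONE)

# Quorum read: a majority of replicas must respond before a result is returned,
# trading extra latency for stronger consistency.
quorum_read = SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM)

fast_but_possibly_stale = session.execute(weak_read, ["item-42"])
slower_but_more_consistent = session.execute(quorum_read, ["item-42"])
```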

Another study, by Hiroshi Wada and colleagues, seems to contradict this result. These researchers found that requesting consistent reads in SimpleDB did not significantly increase latency compared to the default (and potentially inconsistent) read option. However, they ran their experiments within a single Amazon region, and they speculated that SimpleDB uses master-slave replication, which can provide consistent reads at a modest latency cost if replication occurs over a short distance. In particular, Wada and his colleagues concluded that SimpleDB forces all consistent reads to be served by the master. As long as the read request originates from a location physically close to the master, and as long as the master is not overloaded, the additional latency of consistent reads is not visible (both conditions held in their experiments).

If SimpleDB replicated data across Amazon regions and the read request came from a region other than the master's, the latency cost of consistent reads would be far more visible. Even without cross-region replication (which SimpleDB currently does not support), Amazon's official documentation warns users of increased latency and reduced throughput for consistent reads.

All four DDBSs allow users to change the default parameters to trade higher latency for better consistency, for example by making R + W greater than N in the quorum-style systems. However, the consistency/latency trade-off arises during normal system operation, even in the absence of network partitions, and it is magnified when data is replicated over a WAN. The natural conclusion is that the reduction in consistency can be attributed to runtime latency, not to CAP.

PNUTS provides the strongest evidence that CAP is not the main reason these systems reduce consistency. In PNUTS, a master node owns each data item; updates to that item are routed to its master node and then propagated asynchronously to replicas over the WAN. PNUTS can serve reads from any replica, which puts it in category (2)(b)(ii): it reduces consistency to achieve better latency. However, when a network partition occurs and the master node lands in a minority partition, the system's default behavior is to make the data item unavailable for updates. In other words, the default PNUTS configuration is actually CP: in the event of a partition, the system chooses consistency to avoid having to resolve conflicting updates initiated at different master nodes.

Consequently, the choice to reduce consistency at baseline is more accurately attributed to the ever-present consistency/latency trade-off than to CAP's consistency/availability trade-off, which arises only during network partitions. Of course, PNUTS's reduced baseline consistency has a side benefit during network partitions: data in the unavailable partition can still be read.

CAP may have had a larger influence on the other three systems. If a network partition occurs, Dynamo, Cassandra, and Riak switch more fully to data replication option (3) and use special reconciliation code, which runs when replica divergence is detected, to deal with the resulting consistency problems. It is therefore reasonable to assume that the designs of these systems took the possibility of network partitions into account: because they are intended to be AP systems, the reconciliation code and the ability to switch to (3) had to be built in from the start. But once that code exists, it is easy to reuse it for a flexible consistency model and to choose a point along the baseline consistency/latency trade-off. This reasoning is more plausible than the claim that the designers of these systems reduced consistency entirely because of CAP, ignoring the latency factor.

In short, CAP is only one of two major reasons that modern DDBSs reduce consistency. Ignoring the consistency/latency trade-off of replicated systems is a serious oversight, because it is present at all times during system operation, whereas CAP is relevant only during the arguably rare event of a network partition. Indeed, the former trade-off may be the more influential, because it has a more direct effect on the system's baseline operation.

PACELC

A more complete description of the potential consistency trade-off space for a DDBS can be obtained by rewriting CAP as PACELC (pronounced "pass-elk"): if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does it trade off latency (L) and consistency (C)?

Note that the latency/consistency trade-off (ELC) applies only to systems that replicate data. Otherwise, the system suffers availability problems upon any kind of failure or overloaded node. Because such problems are just instances of extreme latency, the latency part of the ELC trade-off can be taken to include the choice of whether or not to replicate data.

In their default configurations, Dynamo, Cassandra, and Riak are PA/EL systems: if a partition occurs, they choose availability over consistency, and under normal operation they give up consistency for lower latency. Giving up both Cs in PACELC simplifies the design: once a system is built to handle inconsistency, it is natural to give up consistency for both better availability and lower latency. These systems do, however, provide user-adjustable settings to shift the ELC trade-off; for example, by increasing R + W they achieve higher consistency at a latency cost (although they cannot achieve full consistency as defined by Gilbert and Lynch, even when R + W > N).

Fully ACID systems such as VoltDB/H-Store and Megastore are PC/EC: they refuse to give up consistency and pay the availability and latency costs required to achieve it. BigTable and related systems such as HBase are also PC/EC.

MongoDB can be classified as a PA/EC system. At baseline, the system guarantees consistent reads and writes. However, MongoDB uses data replication option (2), and if the primary node fails or is partitioned away from the rest of the system, it stores all writes that were sent to the primary node but not yet replicated in a local rollback directory. Meanwhile, the rest of the system elects a new primary so that it remains available for reads and writes. The state of the old primary therefore becomes inconsistent with the state of the new primary until the system repairs the failure and uses the rollback directory to reconcile the two states, a process that is currently manual. (Technically, when a partition occurs, MongoDB is not available by CAP's definition of availability, because the minority partition is unavailable. Nevertheless, in the context of PACELC it can be classified as PA/EC, because partitions cause more consistency problems than availability problems.)

PNUTS is a PC/EL system. In normal operation, it gives up consistency for better latency; however, if a partition occurs, it gives up availability to preserve consistency. This is admittedly confusing: read literally, PACELC seems to say that PNUTS becomes more consistent under network partitions. But PC/EL should not be interpreted that way. PC does not mean the system is fully consistent; rather, it means that when a network partition occurs, the system does not reduce consistency below its baseline level; instead, it reduces availability.
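As a compact summary of the classifications above, the following sketch simply tabulates the default PACELC categories this article assigns to each system.

```python
# Default PACELC classifications as described in this article:
# first letter = choice under a Partition (A or C), second = choice Else (L or C).
PACELC_DEFAULTS = {
    "Dynamo":          ("A", "L"),
    "Cassandra":       ("A", "L"),
    "Riak":            ("A", "L"),
    "VoltDB/H-Store":  ("C", "C"),
    "Megastore":       ("C", "C"),
    "BigTable/HBase":  ("C", "C"),
    "MongoDB":         ("A", "C"),
    "PNUTS":           ("C", "L"),
}

for system, (partition_choice, else_choice) in PACELC_DEFAULTS.items():
    print(f"{system:15} P{partition_choice}/E{else_choice}")
```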

The trade-offs involved in building a distributed database system are complex, and neither CAP nor PACELC can capture them all. Nevertheless, incorporating the consistency/latency trade-off into modern DDBS design considerations is important enough to warrant bringing that trade-off closer to the forefront of architectural discussions.
