How to solve the consistency of distributed systems in cross-regional scenarios

Updated: 2025-07-13. Source: SLTechnology News&Howtos (Shulou.com), report dated 06/03.

This article explains how to achieve consistency in distributed systems that are deployed across regions.

I. Cross-region requirements and challenges

1 Requirements

The cross-region problem is a challenge brought about by rapid business growth under the group's globalization strategy. Businesses such as Taobao's unitized deployment or AliExpress's regionalized deployment face an unavoidable problem: consistency of data reads and writes across regions.

Its core requirements can be summarized as follows:

Cross-region business scenarios

Cross-region configuration synchronization and service discovery are two common business requirements for a cross-region consistency coordination service. Cross-region deployment lets clients access the nearest region, which reduces service latency. Depending on the business scenario, the requirement can be multi-region writes or simplified single-region writes, and strongly consistent reads or eventually consistent reads. Cross-region session management and cross-region distributed locks also need mature solutions.

Expansion of services and resources

When the service capacity of a data center in one region reaches its upper limit and cannot be expanded, the consistency system needs to scale out across multiple data centers within the region and also across regions.

Cross-region disaster recovery capability

When a catastrophic failure hits a data center or an entire region, the consistency system must be deployed across regions so that business can be migrated quickly from one region to another, completing the disaster-recovery escape and achieving high availability.

2 Challenges

Considering network latency together with the business requirements, the challenges a cross-region consistency system must address are:

Latency: network latency reaches tens of milliseconds

The core problem caused by multi-region deployment is high network latency. Take a cross-region cluster we run in production as an example. Its machines belong to data centers in four regions: Hangzhou, Shenzhen, Shanghai and Beijing. Measured latency from Hangzhou to Shanghai is about 6 ms, while latency to Shenzhen and Beijing is close to 30 ms. Network latency within one data center, or between data centers in the same region, is generally within milliseconds, so cross-region access is roughly an order of magnitude slower.
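To make the effect of these numbers concrete, here is a minimal sketch of how long a leader waits for a majority. It assumes, purely for illustration, one voting node per region and a leader in Hangzhou; the function name `commit_latency_ms` is hypothetical and the real cluster layout is not given in the text.

```python
def commit_latency_ms(follower_rtts_ms):
    """Time until a leader holds acknowledgements from a majority of voters."""
    voters = len(follower_rtts_ms) + 1          # followers plus the leader
    majority = voters // 2 + 1
    needed_acks = majority - 1                  # the leader counts itself
    if needed_acks == 0:
        return 0
    # The commit completes when the needed_acks-th fastest follower replies.
    return sorted(follower_rtts_ms)[needed_acks - 1]


# Assumed layout: leader in Hangzhou, followers in Shanghai, Shenzhen, Beijing.
cross_region = commit_latency_ms([6, 30, 30])
# Assumed 1 ms RTT between three nodes in the same region.
same_region = commit_latency_ms([1, 1])
```

Under these assumptions the cross-region majority is bounded by a 30 ms link, because the single 6 ms neighbour is not enough to form a majority of four voters.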

Horizontal expansion: limited scale of Quorum Servers

A distributed consistency system based on Paxos or its variants inevitably runs into replication overhead as nodes are added. A Quorum generally has fewer than 9 nodes, so we cannot simply deploy consistency-system nodes in every region. The system must nevertheless keep scaling horizontally to meet the expansion needs of services and resources.

Storage limit: limited data capacity per node and slow failover recovery

Whether it is MySQL or a Paxos-based consistency system, each node maintains and loads a full image of the data, so total capacity is bounded by what a single node can hold. In addition, if a node's data version lags far behind during failover recovery, pulling the image from another region leaves the node unavailable for a long time.

II. Our exploration

1 Industry solutions

The industry has many designs for cross-region consistency systems. This survey mainly draws on the paper [1] and some open-source implementations. Common approaches include:

Cross-region deployment

Figure 1 Direct cross-region deployment

With direct cross-region deployment, read requests are served by the node in the local region, which is fast. Consistency and availability are guaranteed by Paxos, and there is no single point of failure. The disadvantage is equally obvious: it runs into the horizontal expansion problem described in the first part, namely replication overhead as the Quorum grows. As the number of Quorum nodes increases under very high cross-region network latency, each majority agreement takes a long time and write efficiency is low.

Single-region deployment + Learner role

Figure 2 Introducing the Learner role

The voting latency of direct multi-node deployment can be avoided by introducing a Learner role (for example, the Observer in ZooKeeper or the raft learner [2] in etcd), that is, a role that only synchronizes data and does not take part in majority voting, and by forwarding write requests to one region (such as Region A in Figure 2). This approach solves the horizontal expansion and latency problems, but because all voting members are deployed in one region, the write service becomes unavailable when that region's data center suffers a catastrophic failure. This is the deployment mode adopted by Otter [3].
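The trade-off of the Learner approach can be sketched in a few lines. The `Node` and `Cluster` classes below are hypothetical and do not model any real protocol: they only show that Learners receive committed entries without counting toward the majority, and that writes stop when the voting region is lost.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    region: str
    voter: bool                      # False means Learner / Observer
    log: list = field(default_factory=list)


class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def quorum_size(self):
        # Only voters count toward the majority.
        voters = sum(1 for n in self.nodes if n.voter)
        return voters // 2 + 1

    def write(self, entry, reachable):
        """Commit `entry` only if a majority of voters is reachable."""
        acks = [n for n in self.nodes if n.voter and n.name in reachable]
        if len(acks) < self.quorum_size():
            return False
        # Committed entries are shipped to every reachable node,
        # Learners included.
        for n in self.nodes:
            if n.name in reachable:
                n.log.append(entry)
        return True


cluster = Cluster([
    Node("a1", "A", True), Node("a2", "A", True), Node("a3", "A", True),
    Node("b1", "B", False), Node("c1", "C", False),
])
ok = cluster.write("x=1", {"a1", "a2", "a3", "b1", "c1"})
# Region A is lost: Learners keep their data, but no write can commit.
ok_after_failure = cluster.write("x=2", {"b1", "c1"})
```

Adding Learners in regions B and C leaves the quorum size at 2 of 3, which is why this layout scales reads without slowing votes.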

Multi-service + Partition & single-region deployment + Learner

Figure 3 Multiple services, each handling its own Partitions

The data is divided into Partitions according to some rule. Each region runs one Quorum that provides service, and the Quorums in different regions are responsible for different Partitions. Across regions, Learners synchronize the other Partitions' data and forward write requests, so a failure in one region only affects the availability of that region's Partitions. However, this scheme has a correctness problem: operations no longer satisfy sequential consistency [4] (see the paper [1]).
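A minimal sketch of the routing this scheme implies. The text only says "according to the rules", so the hash-based rule, the region names and the function names here are all assumptions.

```python
import zlib

REGIONS = ["A", "B", "C"]  # hypothetical region names


def owner_region(key: str) -> str:
    """Map a key to the region whose Quorum owns its Partition."""
    # crc32 is stable across processes, unlike the built-in hash().
    return REGIONS[zlib.crc32(key.encode()) % len(REGIONS)]


def route_write(key: str, local_region: str) -> str:
    """Describe where a write issued in `local_region` is committed."""
    owner = owner_region(key)
    if owner == local_region:
        return f"commit in local Quorum {owner}"
    return f"forward from Learner in {local_region} to Quorum {owner}"


owner = owner_region("service/payments")
other = next(r for r in REGIONS if r != owner)
local_route = route_write("service/payments", owner)
remote_route = route_write("service/payments", other)
```

Because each Partition commits in its own Quorum, nothing in this routing orders a write to one Partition relative to a write to another, which is the source of the correctness problem mentioned above.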

In practice there are many variations depending on the business scenario, each optimized and traded off to compensate for these defects. The most common solution in the industry is single-region deployment + Learner, which achieves high availability and efficiency through multi-active deployment within one city plus Learner-based synchronization of data across regions. The other solutions have their own optimizations. Direct cross-region deployment can reduce latency and bandwidth costs by cutting inter-region communication when reaching agreement, as in TiDB's Follower Replication [5]. The correctness of multi-service + Partition & single-region deployment + Learner can be restored by adding a sync operation before reads, at the expense of some read availability, as described in the paper [1].

The conclusions are summarized as follows; the key items are explained in detail later:

2 Cross-region trade-offs

From the requirements and challenges summarized in the first part and the survey of industry solutions above, the core trade-offs of a Paxos-based distributed consistency system in cross-region scenarios are:

Write operations: a cross-region consistency protocol is too slow to reach agreement.

Availability: multi-active deployment within a single region cannot provide availability in extreme cases.

Scalability: the system must have horizontal expansion, a core capability of any distributed system.

To solve these three problems, we designed a cross-region consistency system in which logs are decoupled from images.

3 Cross-region log-image decoupling

Figure 4 Schematic diagram of log-image decoupling

As shown in Figure 4, our system is divided into a back-end log synchronization channel and front-end full state machines, an architecture in which logs are decoupled from images. The back-end cross-region global log synchronization channel guarantees strong consistency of the request log across regions. A front-end full state machine is deployed in each region to handle client requests. It also interacts with the back-end log service and provides a globally strongly consistent metadata access service. Its interface can be adapted quickly to business requirements by modifying the state machine.
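The split between a global log and per-region images can be illustrated with a small in-memory sketch. `LogChannel` and `Frontend` are hypothetical stand-ins: a real back end would replicate the log with Paxos across regions, whereas here it is just a Python list.

```python
class LogChannel:
    """Stand-in for the back-end global log: an ordered, append-only list."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries)          # 1-based index of the new entry

    def read_from(self, index):
        return self.entries[index:]


class Frontend:
    """Per-region full state machine, built by replaying the log."""

    def __init__(self, channel):
        self.channel = channel
        self.applied = 0                  # number of log entries applied
        self.image = {}                   # local key-value image

    def sync(self):
        for op, key, value in self.channel.read_from(self.applied):
            if op == "put":
                self.image[key] = value
            elif op == "delete":
                self.image.pop(key, None)
            self.applied += 1

    def put(self, key, value):
        self.channel.append(("put", key, value))
        self.sync()                       # reply only after applying own write

    def get(self, key):
        return self.image.get(key)        # local read, may be stale


channel = LogChannel()
hangzhou = Frontend(channel)
shenzhen = Frontend(channel)
hangzhou.put("config/timeout", "30s")
stale = shenzhen.get("config/timeout")    # None until Shenzhen syncs
shenzhen.sync()
fresh = shenzhen.get("config/timeout")    # "30s"
```

Only the log crosses regions in this sketch. Each `Frontend` owns its image, so changing what a log entry means requires changing only the state machine.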

With global logs separated from local images, beyond the operability and scalability gains that decoupling itself brings, we can also solve many problems of the non-decoupled architecture. The following analyzes how this architecture addresses the major problems identified earlier:

Write operation efficiency

In terms of deployment, this looks like direct multi-region, multi-node deployment with a Learner role added in each region. It combines direct multi-node deployment with the Learner approach and inherits the advantages and disadvantages of both. The biggest difference is that our logs and images are decoupled: the cross-region part only synchronizes logs, so it is lightweight and efficient, and because each region has only one back-end node, cross-region bandwidth is saved (similar to TiDB's Follower Replication). The back-end log synchronization channel can also support multiple Groups: the data is divided into Partitions, and each consistency Group is responsible for a different Partition.

In most business scenarios reads are served from local data, so the approaches differ little for reads and the analysis focuses on writes. The following analyzes the latency of write operations (or strongly consistent reads):

RTT (Round-Trip Time) can be understood simply as the time from sending a request to receiving its response. Because cross-region network latency dominates, RTT below refers to cross-region RTT.

(1) Direct cross-region deployment

For a typical leader-based consistency protocol, requests fall into two cases:

Access from the region where the Leader resides: 1 RTT (ignoring the small intra-region latency for now)

Client-> Leader-> Follower-> Leader-> Client

Access from a region where a Follower resides: 2 RTT

Client-> Follower-> Leader-> Follower-> Leader-> Follower-> Client

(2) Single-region deployment + Learner synchronization

With multi-active deployment within one region and Learner synchronization between regions, the latency is as follows:

0 RTT in local region

Client-> Quorum-> Client

1 RTT between regions

Client-> Learner-> Quorum-> Learner-> Client

(3) Multi-service Partition, single-region deployment + Learner synchronization (same result as (2))

Write to a Partition in the local region: 0 RTT

Write to a Partition in another region: 1 RTT

(4) Log-image decoupling architecture (same result as (1))

Write to a Partition in the local region: 1 RTT

Client-> Frontend-> LogChannel (local)-> LogChannel (peer)-> LogChannel (local)-> Frontend-> Client

Write to a Partition in another region: 2 RTT (Paxos two-phase commit, or forwarding to the Leader)

Client-> Frontend-> LogChannel (local)-> LogChannel (peer)-> LogChannel (local)-> LogChannel (peer)-> LogChannel (local)-> Frontend-> Client

From this comparison, any cross-region write through the consistency protocol costs at least one RTT. However, if the Paxos Quorum is deployed in only a single region, availability cannot be guaranteed in extreme cases. We can therefore trade off availability against write efficiency according to business needs. The log-image decoupling architecture guarantees availability in extreme cases and correctness in multi-region deployments. Its efficiency is somewhat lower than that of single-region deployment + Learner, but it is lighter and more efficient than direct multi-region deployment, because the Quorum does not grow with horizontal expansion and voting efficiency is unaffected. Compared with the multi-service Partition scheme, it has no efficiency advantage, but it is better in maintainability, correctness and availability.
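The RTT counts from the four cases can be collected into a small lookup for back-of-the-envelope estimates. The scheme labels, the "local"/"remote" case names and the function name are hypothetical; the counts are the ones listed in the comparison, and intra-region hops are ignored as in the text.

```python
# Cross-region RTT counts per write, taken from cases (1) to (4).
# "local": the request lands where the Leader / Quorum / Partition owner is.
# "remote": the request lands in any other region.
WRITE_RTT = {
    "direct cross-region": {"local": 1, "remote": 2},
    "single-region + Learner": {"local": 0, "remote": 1},
    "multi-service Partition": {"local": 0, "remote": 1},
    "log-image decoupling": {"local": 1, "remote": 2},
}


def write_latency_ms(scheme, case, cross_region_rtt_ms):
    """Estimated write latency, counting cross-region round trips only."""
    return WRITE_RTT[scheme][case] * cross_region_rtt_ms


# With the roughly 30 ms cross-region RTT measured earlier:
decoupled_remote = write_latency_ms("log-image decoupling", "remote", 30)
learner_remote = write_latency_ms("single-region + Learner", "remote", 30)
```

The table captures latency only. It says nothing about the availability and correctness differences that drive the final choice.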

Consistency

Direct cross-region deployment and single-region deployment + Learner both satisfy strong consistency. This is well documented for ZooKeeper and etcd, so it is not repeated here. The multi-service Partition scheme does not satisfy sequential consistency, mainly because multiple services cannot guarantee the commit order of write operations across them, as shown in the following figure:

Figure 5 Sequential consistency

As the figure shows, when two Clients modify x and y at the same time, sequential consistency cannot be guaranteed under highly concurrent writes.

Sequential consistency means that the operations of all Clients can be arranged in a single valid order that respects each Client's own order. In the example in Figure 5:

set1(x, 5) => get1(y) -> 0 => set2(y, 3) => get2(x) -> 5

set2(y, 3) => get2(x) -> 0 => set1(x, 5) => get1(y) -> 3

Both orderings satisfy sequential consistency.
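The definition can be checked mechanically for this small example. The sketch below enumerates every interleaving of the two Clients' operations that keeps each Client's program order, and collects the read results that some single order can explain. The initial value 0 for x and y is taken from the reads in the example; all names are illustrative.

```python
# Each Client runs its two operations in program order.
CLIENT1 = [("set", "x", 5), ("get", "y")]
CLIENT2 = [("set", "y", 3), ("get", "x")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a:
        yield list(b)
    elif not b:
        yield list(a)
    else:
        for rest in interleavings(a[1:], b):
            yield [a[0]] + rest
        for rest in interleavings(a, b[1:]):
            yield [b[0]] + rest


def legal_outcomes():
    """Return all (get1(y), get2(x)) pairs reachable by some single order."""
    tagged1 = [(1, op) for op in CLIENT1]
    tagged2 = [(2, op) for op in CLIENT2]
    outcomes = set()
    for order in interleavings(tagged1, tagged2):
        state = {"x": 0, "y": 0}
        reads = {}
        for client, op in order:
            if op[0] == "set":
                state[op[1]] = op[2]
            else:
                reads[client] = state[op[1]]
        outcomes.add((reads[1], reads[2]))
    return outcomes


outcomes = legal_outcomes()
```

The pair (0, 0), where each Client misses the other's write, is absent from `outcomes`. That is the result independent per-Partition Quorums can produce, since neither write is ordered against the other.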

The consistency of the log-image decoupling architecture can be understood simply as that of direct cross-region deployment + Learner. A write operation has a sync option: success is returned only after the back-end log has been committed and the corresponding log entry has been pulled by the front end. By then, the front end must also have pulled the log entries of all writes by other Clients that precede this operation, so sequential consistency holds.

Availability

Availability is similar to that of direct cross-region multi-node deployment. A front-end state machine can forward requests when the back-end node in its region is down, and it can still serve reads when the back-end global log service is unavailable. This provides highly available reads and writes even in extreme cases.

Because the image is stored in the state machine of each region, clients can switch to another frontend when one frontend state machine fails, and a recovering frontend can restore its data directly from the back-end log. Only when its image lags too far behind does it need to pull an image, and it pulls from another frontend in the local region. No cross-region image transfer is needed, so the frontend is unavailable only for a very short time.
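The recovery choice described here can be written as one decision function. This is a sketch under an assumption the text does not state: that the back end retains only a suffix of the log, so "lags too far behind" means the frontend's next entry has already been truncated. The function and its parameters are hypothetical.

```python
def recovery_plan(applied_index, log_start_index, log_end_index):
    """Decide how a restarted frontend catches up.

    applied_index:   last log index reflected in the frontend's local image.
    log_start_index: oldest index still retained by the back-end log.
    log_end_index:   newest committed index in the back-end log.
    """
    if applied_index + 1 >= log_start_index:
        # The next entry we need is still available: replay the log.
        return ("replay_log", applied_index + 1, log_end_index)
    # Too far behind: fetch an image from a frontend in the same region,
    # then replay whatever log entries follow that image.
    return ("pull_image_from_local_peer", None, None)


slightly_behind = recovery_plan(95, 50, 100)
far_behind = recovery_plan(10, 50, 100)
```

In both branches the bulk transfer stays inside the region. Only incremental log entries cross regions.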

Horizontal scalability

Horizontal scalability is a core capability of distributed services. Among the solutions above, direct cross-region deployment scales very poorly. The approaches that rely on Learners also solve horizontal expansion, but their decoupling is not as clean as in the log-image decoupling design.

The key issues above are summarized and compared as follows:

III. More possibilities across regions

Once the back-end log is decoupled from the front-end image, our exploration of cross-region scenarios splits into two parts: back-end log synchronization that is lightweight and efficient, and a front-end state machine that is flexible and rich.

Lightweight, in terms of architecture: synchronizing logs puts minimal storage pressure on the back end, which holds only lightweight incremental logs.

Efficient, in terms of the back-end consistency protocol: being lightweight, the back end only needs to handle voting and election logic and can focus on log synchronization efficiency, without spending resources on other business logic.

Flexible, in terms of architecture: the front end can define its own log entries, so CAS, transactions and similar operations can be packaged into logs and then parsed and processed by the front end.

Rich, mainly in terms of the front-end state machine: the flexibility of the log leaves a lot of room to explore and build, and the state machine can be extended to handle all kinds of complex transactions as required.

A new architecture brings new problems. This part explores how to absorb the strengths of existing systems and use the lightweight, flexible nature of log-image decoupling to make the consistency protocol efficient and the state machine rich in cross-region scenarios. It also lays out our thinking and plans for the future of the cross-region consistency system. The overall goal is to keep the back-end consistency protocol small and refined and to make the front-end state machine bigger and stronger.

1 Efficient back-end consistency protocol

Following the earlier discussion of write efficiency, when the same data is written from multiple regions, latency can at best be held to 2 RTT. In cross-region scenarios latency is dominated by cross-region network communication, and both forwarding to a leader and a leaderless two-phase Paxos commit cost 2 RTT. However, a leaderless protocol such as EPaxos [6], a Paxos variant, can improve cross-region write efficiency as far as possible. Its latency falls into two cases, Fast Path and Slow Path: 1 RTT on the Fast Path and 2 RTT on the Slow Path.

To quote an article on EPaxos:

If there is no conflict between the concurrent proposed logs, EPaxos only needs to run the PreAccept phase to commit (Fast Path), otherwise it needs to run the Accept phase to commit (Slow Path).

Compared with the Partition approach, choosing EPaxos as the back-end consistency protocol guarantees availability in extreme cases and brings latency down to 1 RTT in most cases. This is the advantage of a leaderless consistency protocol in cross-region scenarios, mainly because the RTT spent forwarding to the Leader is eliminated. Our system currently uses the most basic Paxos implementation, whose latency in multi-region write scenarios is theoretically no different from that of a leader-based protocol. We expect to adopt EPaxos in later development to speed up writes in cross-region scenarios.

Because it does not need to implement business logic, efficiency is the main requirement of the back-end consistency protocol, although correctness and stability are of course essential. The front-end state machine, by contrast, has a rich range of scenarios to design for.
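The Fast Path / Slow Path rule quoted above can be turned into a toy latency model. This is not an implementation of EPaxos: it only encodes "1 RTT without conflicts, 2 RTT with conflicts", and it assumes a simple conflict rule (same key, at least one write). The command format and function names are hypothetical.

```python
def interferes(cmd_a, cmd_b):
    """Assumed conflict rule: same key, and at least one command writes."""
    same_key = cmd_a["key"] == cmd_b["key"]
    return same_key and "write" in (cmd_a["op"], cmd_b["op"])


def commit_rtts(cmd, concurrent):
    """Cross-region round trips needed to commit `cmd` in this toy model."""
    if any(interferes(cmd, other) for other in concurrent):
        return 2      # Slow Path: PreAccept, then Accept
    return 1          # Fast Path: PreAccept only


write_x = {"op": "write", "key": "x"}
write_y = {"op": "write", "key": "y"}
read_x = {"op": "read", "key": "x"}

no_conflict = commit_rtts(write_x, [write_y])
conflict = commit_rtts(write_x, [read_x, write_y])
```

When regions mostly write different keys, as in a partitioned workload, nearly every command in this model commits in one round trip without a fixed Leader.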

CAS operation

CAS operations are natural to implement under this architecture. Since the back end holds only a consistent log, every CAS request naturally has a commit order. For example:

Two clients write the same key at the same time:

Figure 6 Schematic diagram of the CAS operation

Initially the value of key is 0, and Client 1 and Client 2 concurrently issue CAS(key, 0, 1) and CAS(key, 0, 2). When the two operations are committed at the same time, the back-end Quorum has reached agreement, so the Replication Log is totally ordered and the two concurrent CAS operations are naturally converted to sequential execution. When a Frontend synchronizes the log entries for the two operations, it applies them to the local state machine in order. If CAS(key, 0, 1) is ordered first, it succeeds and updates key to 1, while CAS(key, 0, 2) fails. The Frontend then returns the success or failure of each CAS request to the corresponding Client.

The principle is to turn concurrent operations into a serial, sequentially executed process, which avoids locking in cross-region scenarios. By contrast, if the back end maintained key-value data itself, it would also need a cross-region distributed lock to complete this operation, which is more cumbersome and gives no efficiency guarantee. By synchronizing only logs and moving complex computation to the Frontend, we can build front-end state machines flexibly and implement CAS or more complex transactional functions (see Pravega's StateSynchronizer [7] for this architecture).
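The CAS example can be reproduced in a few lines. The log entry layout `(request_id, key, expected, new)` and the function name are assumptions made for this sketch; the point is that every frontend replaying the same ordered log reaches the same result without any lock.

```python
def apply_cas_log(log, state):
    """Apply CAS entries in commit order and return per-request results."""
    results = {}
    for request_id, key, expected, new in log:
        if state.get(key, 0) == expected:
            state[key] = new
            results[request_id] = True
        else:
            results[request_id] = False
    return results


# The back end has already fixed the order of the two concurrent requests.
log = [("client1", "key", 0, 1), ("client2", "key", 0, 2)]

frontend_a = {"key": 0}
frontend_b = {"key": 0}
results_a = apply_cas_log(log, frontend_a)
results_b = apply_cas_log(log, frontend_b)
```

Both frontends report success for `client1` and failure for `client2`. Had the back end ordered the entries the other way round, both would agree on the opposite outcome.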

Global ID

A Global ID is a common requirement: a distributed system must generate unique IDs. Common approaches use UUIDs, the Snowflake algorithm, or solutions based on a database, Redis or ZooKeeper.

Similar to using ZooKeeper's znode data version to generate Global IDs, in this log-image separation architecture you can use the CAS API to maintain a key as the Global ID and update it atomically on each allocation. Given the CAS design above, no locking is required under cross-region concurrency, and usage is similar to atomic operations on a key in Redis.
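A minimal sketch of ID allocation on top of a CAS interface. `CasStore` is a hypothetical in-memory stand-in for the front-end CAS API, whose real signature is not given in the text, and `next_id` uses an ordinary read-then-CAS retry loop.

```python
class CasStore:
    """Hypothetical stand-in for the front-end state machine's CAS API."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key, 0)

    def cas(self, key, expected, new):
        if self.data.get(key, 0) != expected:
            return False
        self.data[key] = new
        return True


def next_id(store, key="global_id", max_retries=10):
    """Allocate the next ID; retry when another client wins the CAS."""
    for _ in range(max_retries):
        current = store.get(key)
        if store.cas(key, current, current + 1):
            return current + 1
    raise RuntimeError("too much contention on " + key)


store = CasStore()
ids = [next_id(store) for _ in range(3)]
```

A failed CAS means another client took that value, so the loser re-reads and tries again. No two callers can ever receive the same ID.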

2 Watch operation

Subscription is an indispensable part of a distributed coordination service and one of the most common business requirements. Here are the results of our survey of ZooKeeper and etcd:

The relatively mature distributed coordination systems in the industry that implement subscription and notification include etcd and ZooKeeper. We take these two systems as examples and explain their respective solutions.

etcd keeps multiple historical versions of the data (MVCC) and uses a monotonically increasing version number to distinguish older from newer versions. As long as the client passes in the historical version it cares about, the server can push all subsequent events to the client.

ZooKeeper does not keep multiple historical versions of the data, only the current state. A client cannot subscribe from a historical version and can only subscribe to changes after the current state. Subscription is therefore coupled with a read: the server sends the current data to the client and then pushes subsequent events. To prevent subscribing to stale data and events in abnormal scenarios such as failover, the client refuses to connect to a server whose data is older than what it has seen (this relies on the server returning its current global transaction ID, the zxid, for each request).

Of the two, etcd is closer to our interface design. etcd v3 uses HTTP/2 multiplexing over a single TCP connection, which improved watch performance. Because our architecture is also a log plus a state machine, we mainly follow etcd v3 and borrow two of its features: subscribing to multiple keys and returning all historical events. To implement etcd-style subscription, when the front-end state machine synchronizes and parses a log entry that is a write, it updates both the key-value state machine Store and the watchableStore state machine that serves the watch interface. The implementation can closely follow etcd: all historical events after the subscribed version are returned to the client according to the log version number. Subscribing to multiple keys uses a segment tree as the watcher's storage structure for key ranges, which enables watcher notification for watched key ranges.
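The two borrowed features, range subscription and replay of historical events, can be sketched with a simple in-memory store. This `WatchableStore` is a hypothetical illustration, not etcd's implementation: it filters with a linear scan instead of a tree over key ranges, and "after the subscribed version" is read as strictly greater than `from_revision`.

```python
class WatchableStore:
    def __init__(self):
        self.revision = 0
        self.kv = {}
        self.events = []                  # (revision, key, value), in order

    def apply_put(self, key, value):
        """Called when the state machine parses a write from the log."""
        self.revision += 1
        self.kv[key] = value
        self.events.append((self.revision, key, value))

    def watch(self, start_key, end_key, from_revision):
        """Events for keys in [start_key, end_key) after from_revision."""
        return [
            e for e in self.events
            if e[0] > from_revision and start_key <= e[1] < end_key
        ]


store = WatchableStore()
store.apply_put("/svc/a", "1")
store.apply_put("/svc/b", "2")
store.apply_put("/cfg/x", "3")
store.apply_put("/svc/a", "4")

# Subscribe to every key with prefix "/svc/" from revision 1 onward.
history = store.watch("/svc/", "/svc0", from_revision=1)
```

The range end `"/svc0"` is the prefix with its last byte incremented, a common way to express a prefix as a half-open key range.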

3 Lease mechanism

Implementing an efficient Lease mechanism in a leaderless system is a big challenge. There is no Leader, so any node can maintain Leases, and Leases are distributed across the nodes. When a node becomes unavailable, its Leases must switch smoothly to other nodes. The difficulty is that as the number of Leases grows, a large number of Lease maintenance messages must be kept out of the back-end consistency protocol, where they would hurt system performance. Ideally, Lease maintenance messages are processed locally in the front end without passing through the back end. Our idea is therefore to aggregate the Leases between clients and the front end into Leases between the front end and the back end, so that Lease maintenance messages can be handled locally at the front end.
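One possible reading of this aggregation idea, sketched with hypothetical names: the frontend tracks every client Lease in local memory and answers keep-alives itself, while a single frontend-to-backend Lease is renewed through the consistency protocol. The text describes only the idea, so the class and its methods are assumptions.

```python
class FrontendLeases:
    """Client Leases are tracked locally; the back end sees one aggregate."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}                  # lease_id -> local expiry time
        self.backend_messages = 0         # messages sent through the log

    def grant(self, lease_id, now):
        self.expiry[lease_id] = now + self.ttl

    def keep_alive(self, lease_id, now):
        # Handled entirely in the frontend, with no back-end round trip.
        if lease_id in self.expiry and self.expiry[lease_id] > now:
            self.expiry[lease_id] = now + self.ttl
            return True
        return False

    def renew_backend_lease(self):
        # One message covers every client Lease held by this frontend.
        self.backend_messages += 1

    def expired(self, now):
        return sorted(k for k, t in self.expiry.items() if t <= now)


leases = FrontendLeases(ttl=10)
leases.grant("a", now=0)
leases.grant("b", now=0)
renewed = leases.keep_alive("a", now=5)   # extends "a" to time 15
leases.renew_backend_lease()
gone = leases.expired(now=12)             # "b" was never renewed
```

With this layout, back-end traffic grows with the number of frontends rather than the number of client Leases. Handing Leases over when a frontend fails is not covered by this sketch.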
