

Common Solutions for Distributed Transactions


This article analyzes and describes the common solutions for distributed transactions from a practical point of view. I hope you get something out of it.

Solution 1: Global transactions (the DTP model)

Global transactions are based on the DTP model. DTP, the X/Open Distributed Transaction Processing Reference Model, was proposed by the X/Open organization. It specifies three roles required to implement a distributed transaction:

AP (Application): the application system, i.e. the business system we develop. During development, we use the transaction interfaces provided by the resource manager to implement distributed transactions.

TM (Transaction Manager): the transaction manager.

Distributed transactions are implemented by the transaction manager, which exposes distributed transaction operation interfaces, called TX interfaces, for the business system to call.

The transaction manager also manages all resource managers, coordinating them through the XA interfaces they provide in order to implement distributed transactions.

DTP is only a specification; it does not define how distributed transactions must be implemented. The TM may use 2PC, 3PC, Paxos, or other protocols.

RM (Resource Manager): the resource manager.

Anything that provides data services can act as a resource manager, such as databases, message middleware, and caches. In most scenarios, the database is the resource manager in a distributed transaction.

The resource manager provides single-database transaction capabilities. Through the XA interface, it exposes the database's commit and rollback capabilities to the transaction manager, helping it implement distributed transaction management.

XA is the interface defined by the DTP model through which the resource manager (the database) exposes capabilities such as commit and rollback to the transaction manager.

Again, DTP is only a specification; the concrete implementation of the RM side is provided by the database vendors.

Practical solution: two-phase commit based on the XA protocol

XA is a distributed transaction protocol that originated with Tuxedo and was standardized by X/Open. XA roughly involves two parts: the transaction manager and the local resource manager. The local resource manager is usually implemented by the database; commercial databases such as Oracle and DB2 all implement the XA interface, while the transaction manager, acting as the global scheduler, is responsible for committing and rolling back the local resources. The principle behind XA is two-phase commit: in the prepare phase the transaction manager asks every resource manager to prepare its local branch and vote on the outcome; in the commit phase it commits globally only if every participant voted yes, and otherwise rolls all branches back.

Overall, the XA protocol is relatively simple, and because commercial databases already implement it, the cost of using such distributed transactions is low. However, XA has a fatal drawback: performance. On trading and ordering paths, where concurrency is often very high, XA cannot meet high-concurrency requirements. XA support in commercial databases is reasonably good, but it is less ideal in MySQL: MySQL's XA implementation does not log the prepare phase, so a master/slave switchover can leave the master and slave databases inconsistent. Many NoSQL stores also do not support XA, which makes XA's applicable scenarios quite narrow.
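To make the flow concrete, here is a minimal sketch of how a transaction manager could drive two XA-capable databases through two-phase commit using the standard javax.transaction.xa API. The data sources, table, and account IDs are illustrative assumptions, and real transaction managers also handle recovery, XA_RDONLY votes, and heuristic outcomes, which are omitted here.

```java
import javax.sql.XAConnection;
import javax.sql.XADataSource;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;

// Minimal Xid implementation: same global transaction id, different branch qualifiers.
final class SimpleXid implements Xid {
    private final byte[] gtrid, bqual;
    SimpleXid(String gtrid, String bqual) {
        this.gtrid = gtrid.getBytes(StandardCharsets.UTF_8);
        this.bqual = bqual.getBytes(StandardCharsets.UTF_8);
    }
    @Override public int getFormatId() { return 1; }
    @Override public byte[] getGlobalTransactionId() { return gtrid; }
    @Override public byte[] getBranchQualifier() { return bqual; }
}

public class XaTwoPhaseCommitSketch {
    // Transfers 80 yuan across two databases as one global transaction.
    public static void transfer(XADataSource sourceDb, XADataSource targetDb) throws Exception {
        XAConnection xc1 = sourceDb.getXAConnection();
        XAConnection xc2 = targetDb.getXAConnection();
        XAResource rm1 = xc1.getXAResource();
        XAResource rm2 = xc2.getXAResource();
        Xid branch1 = new SimpleXid("gtx-0001", "branch-1");
        Xid branch2 = new SimpleXid("gtx-0001", "branch-2");

        // Do the local work of each branch inside an XA association (start ... end).
        rm1.start(branch1, XAResource.TMNOFLAGS);
        try (Connection c = xc1.getConnection()) {
            c.createStatement().executeUpdate(
                "UPDATE account SET balance = balance - 80 WHERE id = 'A'");
        }
        rm1.end(branch1, XAResource.TMSUCCESS);

        rm2.start(branch2, XAResource.TMNOFLAGS);
        try (Connection c = xc2.getConnection()) {
            c.createStatement().executeUpdate(
                "UPDATE account SET balance = balance + 80 WHERE id = 'C'");
        }
        rm2.end(branch2, XAResource.TMSUCCESS);

        try {
            // Phase 1: every resource manager prepares and votes.
            int vote1 = rm1.prepare(branch1);
            int vote2 = rm2.prepare(branch2);
            // Phase 2: commit only if every vote is XA_OK, otherwise roll everything back.
            if (vote1 == XAResource.XA_OK && vote2 == XAResource.XA_OK) {
                rm1.commit(branch1, false);   // false = two-phase commit
                rm2.commit(branch2, false);
            } else {
                rm1.rollback(branch1);
                rm2.rollback(branch2);
            }
        } catch (XAException e) {
            rm1.rollback(branch1);
            rm2.rollback(branch2);
            throw e;
        } finally {
            xc1.close();
            xc2.close();
        }
    }
}
```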

Solution 2: Distributed transactions based on a reliable message service (transactional message middleware)

This approach implements distributed transactions through message middleware. Suppose there are two systems, A and B, which handle task A and task B respectively. System A has a business process that needs to process task A and task B in the same transaction. The following steps describe how this kind of distributed transaction is implemented with message middleware; a code sketch follows the steps.

Before system A processes task A, it first sends a message to the message middleware.

The message middleware persists the message after receiving it, but does not deliver it yet. At this point, downstream system B does not yet know the message exists.

After the message is persisted successfully, the message middleware returns an acknowledgement to system A.

After system A receives the acknowledgement, it starts processing task A.

When task A is finished, system A sends a Commit request to the message middleware. Once the request is sent, the transaction is over as far as system A is concerned, and it can move on to other work. However, the Commit request may be lost in transit, in which case the message middleware never delivers the message to system B and the systems become inconsistent. This problem is handled by the message middleware's transaction check-back mechanism, described below.

After receiving the Commit instruction, the message middleware delivers the message to system B, which triggers the execution of task B.

When task B is completed, system B returns an acknowledgement to the message middleware, telling it that the message has been successfully consumed; at this point the distributed transaction is complete.
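The step sequence above maps directly onto the transactional-message API of a broker such as RocketMQ (mentioned again under Solution 3). Below is a minimal sketch of the upstream (system A) side using RocketMQ's Java client; the producer group, topic, name-server address, and the doTaskA/queryTaskAStatus helpers are illustrative assumptions.

```java
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class SystemAProducer {
    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("system-a-tx-group");
        producer.setNamesrvAddr("127.0.0.1:9876");   // assumed name server address
        producer.setTransactionListener(new TransactionListener() {
            // Called after the half message has been persisted by the broker (steps 1-3).
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    doTaskA();                                     // step 4: process task A
                    return LocalTransactionState.COMMIT_MESSAGE;   // step 5: Commit -> deliver to system B
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // task A failed -> discard the message
                }
            }
            // Transaction check-back: invoked by the broker when Commit/Rollback was lost.
            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                switch (queryTaskAStatus(msg.getTransactionId())) {
                    case COMMITTED:   return LocalTransactionState.COMMIT_MESSAGE;
                    case ROLLED_BACK: return LocalTransactionState.ROLLBACK_MESSAGE;
                    default:          return LocalTransactionState.UNKNOW; // still in progress, keep waiting
                }
            }
        });
        producer.start();

        // Step 1: send the half message before task A starts; the broker persists it
        // but does not deliver it to system B yet.
        Message msg = new Message("TASK_B_TOPIC", "taskB request".getBytes());
        producer.sendMessageInTransaction(msg, null);
    }

    enum TaskStatus { COMMITTED, ROLLED_BACK, IN_PROGRESS }
    static void doTaskA() { /* business logic of task A (assumed) */ }
    static TaskStatus queryTaskAStatus(String txId) { return TaskStatus.IN_PROGRESS; } // assumed lookup
}
```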

From the above process we can draw the following conclusions:

Message middleware plays the role of distributed transaction coordinator.

After system A completes task A, there is a time lag before task B finishes. During this window the system as a whole is in a temporarily inconsistent state, but this is acceptable: after a short time the system becomes consistent again, which satisfies the BASE theory.

In the above process, if task A fails, then you need to enter the rollback process.

If system A fails to process task A, a Rollback request is sent to the message middleware. As with sending a Commit request, system A can assume that the rollback is complete and it can do something else.

After receiving the rollback request, the message middleware discards the message directly and does not deliver it to system B, so that task B of system B is not triggered.

At this point, the system is in a consistent state because neither Task A nor Task B is executed.

The Commit and Rollback flows described above are the ideal case; in a real system, both the Commit and the Rollback instruction may be lost in transit. When that happens, how does the message middleware ensure data consistency? The answer is the timeout query mechanism.

In addition to implementing the normal business process, system A also needs to provide a transaction query interface for the message middleware to call. When the message middleware receives a transactional message, it starts a timer; if it has not received a Commit or Rollback instruction from system A when the timer expires, it actively calls the query interface to ask for the transaction's current state. The interface can return three results:

If the returned status is "commit", the message is delivered to system B.

If the returned status is "rollback", the message is discarded.

If the returned status is "in progress", the middleware continues to wait.

The timeout query mechanism of the message middleware prevents the inconsistency that would be caused by losing a Commit/Rollback instruction in transit, and it also reduces the blocking time of the upstream system: once the upstream system has issued the Commit/Rollback instruction, it can handle other tasks without waiting for an acknowledgement. The lost instruction is compensated for by the timeout query, which greatly reduces upstream blocking and improves the concurrency of the system.
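A minimal sketch of the transaction query interface that system A might expose for this check-back, assuming system A records each transaction's state in a local table named tx_record; the table, its columns, and the three status values are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class TransactionStatusService {
    public enum TxStatus { COMMIT, ROLLBACK, IN_PROGRESS }

    private final DataSource dataSource;

    public TransactionStatusService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Called by the message middleware when it has not received Commit/Rollback in time.
    public TxStatus query(String txId) {
        String sql = "SELECT status FROM tx_record WHERE tx_id = ?"; // assumed local table
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, txId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return TxStatus.IN_PROGRESS;                 // no record yet: still in progress
                }
                return TxStatus.valueOf(rs.getString("status")); // COMMIT / ROLLBACK / IN_PROGRESS
            }
        } catch (Exception e) {
            return TxStatus.IN_PROGRESS;                         // on error, let the middleware ask again later
        }
    }
}
```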

Now let's look at the reliability guarantees of the delivery process. Once the upstream system has completed its task and submitted the Commit instruction to the message middleware, it can move on to other work and consider the transaction finished; from then on, the message middleware must ensure that the message is successfully consumed by the downstream system. How is that done? It is guaranteed by the message middleware's delivery process.

After delivering the message to the downstream system, the message middleware blocks and waits; the downstream system processes the task and returns an acknowledgement when it is done. Once the acknowledgement is received, the message middleware considers the transaction finished.

If the message is lost during delivery, or the acknowledgement is lost on its way back, the message middleware redelivers the message after the acknowledgement timeout, and keeps doing so until the downstream consumer returns a successful consumption response. Of course, message middleware generally lets you configure the retry count and interval, for example: after the first delivery fails, retry every five minutes, up to 3 times in total. If delivery still fails after 3 retries, the message requires human intervention.
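On the downstream side, this acknowledge-or-redeliver behaviour corresponds to the consumer's return value. Below is a minimal sketch using RocketMQ's push consumer; the consumer group, topic, name-server address, and the handleTaskB helper are illustrative assumptions. Returning RECONSUME_LATER asks the broker to redeliver the message later, up to its configured retry limit.

```java
import java.util.List;

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class SystemBConsumer {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("system-b-group");
        consumer.setNamesrvAddr("127.0.0.1:9876");   // assumed name server address
        consumer.subscribe("TASK_B_TOPIC", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            for (MessageExt msg : msgs) {
                try {
                    handleTaskB(msg.getBody());      // execute task B
                } catch (Exception e) {
                    // Processing failed: ask the broker to redeliver later.
                    return ConsumeConcurrentlyStatus.RECONSUME_LATER;
                }
            }
            // Acknowledgement: tells the middleware the message was consumed successfully.
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }

    static void handleTaskB(byte[] body) { /* business logic of task B (assumed) */ }
}
```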

Some readers may ask: after a delivery failure, why keep redelivering the message instead of rolling it back?

This comes down to the implementation cost of the whole distributed transaction system. As noted, once system A has sent the Commit instruction to the message middleware, it moves on to other work. If a delivery failure required a rollback, system A would have to provide a rollback interface in advance, which adds development cost and makes the business system more complex. The design goal of a business system is to keep complexity as low as possible while meeting performance requirements, so as to reduce its operation and maintenance costs.

You may have noticed that upstream system A submits Commit/Rollback instructions to the message middleware asynchronously: after submitting, it can move on to other work, leaving the commit or rollback entirely to the message middleware and fully trusting it to complete the transaction correctly. By contrast, the message middleware delivers messages to the downstream system synchronously: after delivering a message it blocks and waits, and only stops waiting when the downstream system has successfully processed the task and returned an acknowledgement. Why are the two designed differently?

First, the asynchronous communication between the upstream system and the message middleware exists to improve concurrency. The business system interacts directly with users, where user experience matters most, so asynchronous communication greatly reduces user waiting time. Compared with synchronous communication, there is no long blocking wait, so system concurrency increases significantly. The downside is that Commit/Rollback instructions may be lost, which is compensated for by the middleware's timeout query mechanism.

So why should synchronous communication be used between message middleware and downstream systems?

Asynchrony improves system performance but increases system complexity, while synchrony reduces concurrency but costs less to implement. Therefore, when the concurrency requirement is not high, or server resources are abundant, we can choose synchronous communication to keep the system simple. Message middleware is third-party middleware independent of the business systems: it is not directly coupled to any business system or to users, it is usually deployed on its own server cluster, and it scales well, so there is little need to worry about its performance; if throughput is insufficient, more machines can be added. Moreover, even if the message middleware introduces some processing delay, that is acceptable: as the BASE theory introduced earlier tells us, we pursue eventual consistency rather than real-time consistency, so temporary inconsistency caused by middleware delay is acceptable.

Solution 3: Best-effort notification (periodic reconciliation), also known as the local message table

Best-effort notification, also known as periodic reconciliation, is already implicit in Solution 2; it is introduced separately here mainly for completeness of the knowledge system. This solution also relies on message middleware.

After completing its task, the upstream system synchronously sends a message to the message middleware, ensuring that the message is persisted successfully, and then the upstream system can move on to other work.

After receiving the message, the message middleware is responsible for delivering the message synchronously to the corresponding downstream system and triggering the task execution of the downstream system.

When the downstream system processes the task successfully, it returns an acknowledgement to the message middleware, which can then delete the message; the transaction is complete.

The above is the idealized flow, but in real scenarios the following unexpected situations often occur:

The message middleware fails to deliver the message to the downstream system.

The upstream system fails to send the message to the message middleware.

For the first case, the message middleware has a retry mechanism: we can configure the retry count and interval in the middleware. When delivery fails because of network instability, a few retries are usually enough to succeed. If delivery still fails after the retry limit is reached, the middleware stops delivering the message and records it in a failure-message table. The middleware needs to provide a query interface for these failed messages, and the downstream system periodically queries and consumes them; this is called "periodic reconciliation".
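A minimal sketch of the "periodic reconciliation" job on the downstream side, assuming the middleware's failure-message table is reachable through a query client; the FailedMessageClient interface, its methods, and the schedule are illustrative assumptions rather than any particular middleware's API.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicReconciliationJob {
    // Assumed client for the middleware's failure-message query interface.
    public interface FailedMessageClient {
        List<byte[]> listFailedMessages(String topic, int limit);
        void markConsumed(byte[] message);
    }

    private final FailedMessageClient client;

    public PeriodicReconciliationJob(FailedMessageClient client) {
        this.client = client;
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Every 10 minutes, pull messages whose delivery exceeded the retry limit and consume them.
        scheduler.scheduleAtFixedRate(() -> {
            for (byte[] msg : client.listFailedMessages("TASK_B_TOPIC", 100)) {
                try {
                    handleTaskB(msg);          // same business logic as normal consumption
                    client.markConsumed(msg);  // remove it from the failure table
                } catch (Exception e) {
                    // Still failing: leave it for the next round or for human intervention.
                }
            }
        }, 0, 10, TimeUnit.MINUTES);
    }

    private void handleTaskB(byte[] body) { /* business logic of task B (assumed) */ }
}
```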

If redelivery and periodic reconciliation cannot solve the problem, there is usually a serious error in the downstream system, and human intervention is required.

For the second case, a message retransmission mechanism needs to be built into the upstream system. A local message table can be created in the upstream system, and the task processing and the insertion of the message into the local message table are completed in one local transaction. If inserting the message into the local message table fails, a rollback is triggered and the task's results are undone; if every step succeeds, the local transaction completes. A dedicated message sender then keeps sending the messages in the local message table, retrying on failure. Of course, an upper limit should also be set on the sender's retries; in general, if the sender still fails after reaching the retry limit, it means there is a serious problem with the message middleware, and only human intervention can resolve it.
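A minimal sketch of the upstream side of the local message table approach, assuming a business table and a local_message table live in the same database so that one local JDBC transaction covers both; the table names, columns, and the sendToBroker helper are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class LocalMessageTableSketch {
    private final DataSource dataSource;

    public LocalMessageTableSketch(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Task processing and message insertion succeed or fail together in one local transaction.
    public void doTaskAndRecordMessage(String bizId, byte[] payload) throws Exception {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try {
                try (PreparedStatement ps = conn.prepareStatement(
                        "UPDATE business_task SET state = 'DONE' WHERE id = ?")) {
                    ps.setString(1, bizId);
                    ps.executeUpdate();
                }
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO local_message (biz_id, payload, status) VALUES (?, ?, 'PENDING')")) {
                    ps.setString(1, bizId);
                    ps.setBytes(2, payload);
                    ps.executeUpdate();
                }
                conn.commit();                 // both writes become visible together
            } catch (Exception e) {
                conn.rollback();               // message insert failed: undo the task as well
                throw e;
            }
        }
    }

    // A dedicated sender periodically scans PENDING messages and publishes them, retrying on failure.
    public void resendPendingMessages() throws Exception {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT biz_id, payload FROM local_message WHERE status = 'PENDING'");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                String bizId = rs.getString("biz_id");
                if (sendToBroker(rs.getBytes("payload"))) {   // publish to the message middleware
                    try (PreparedStatement upd = conn.prepareStatement(
                            "UPDATE local_message SET status = 'SENT' WHERE biz_id = ?")) {
                        upd.setString(1, bizId);
                        upd.executeUpdate();
                    }
                }
                // On failure the row stays PENDING and is retried on the next scan,
                // up to whatever retry limit the operator configures.
            }
        }
    }

    private boolean sendToBroker(byte[] payload) { return true; } // assumed middleware call
}
```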

For message middleware that does not support transactional messages, this approach can be used to implement distributed transactions. It achieves distributed transactions through a retry mechanism plus periodic reconciliation, but compared with Solution 2 it takes longer to reach data consistency, and the upstream system must also implement a message retry/publish mechanism to ensure that messages are successfully published to the middleware, which adds development cost and makes the business system less pure. This extra business logic also consumes the business system's hardware resources and can affect its performance.

Therefore, whenever possible, choose message middleware that supports transactional messages to implement distributed transactions, such as RocketMQ.

Solution 4: TCC (two-phase, compensation-based)

The need for atomicity in business operations that span applications is actually quite common. For example, consider the combined payment scenario in third-party payment: after shopping on an e-commerce site, a user pays for the order with both their balance and a red packet. The balance system and the red packet system are different application systems, and when the payment system calls both of them, it must guarantee that the balance deduction and the red packet usage either both succeed or both fail.

TCC transactions emerged to solve the atomicity problem of cross-application business operations caused by application splitting. And because the performance of regular XA transactions (2PC, two-phase commit) is unsatisfactory, TCC transactions are also used to deal with database splitting (such as account splitting), which is described in more detail later in this article.

Therefore, from the point of view of the whole system architecture, different schemes of distributed transactions have a hierarchical structure.

The mechanism of TCC

As you can probably guess, TCC is an acronym of three English words: Try, Confirm, and Cancel.

The business implications of these three operations are as follows:

Try: reserve business resources

Confirm: confirm and execute the business operation

Cancel: cancel the business operation

If you compare the three operations of a relational database transaction, DML, Commit, and Rollback, you will find they are analogous to TCC. In a cross-application business operation, the Try operations first reserve and lock business resources in the participating applications, laying the foundation for the subsequent confirmation; similarly, DML operations lock database rows and hold database resources.

The Confirm operation is executed after the Try operations of all the applications involved have succeeded; it uses the reserved business resources and is similar to Commit.

Cancel, on the other hand, is used when not all of the applications' Try operations succeed; the ones that did succeed need to be cancelled (that is, rolled back).

Confirm and Cancel are a pair of inverse business operations.

In short, if you think of each application as a resource manager, TCC is 2PC (two-phase commit) at the application layer.

In detail, what TCC needs to do for each operation is as follows:

1. Try: try to execute the business.

Complete all business checks (consistency)

Reserve necessary business resources (quasi-isolation)

2. Confirm: confirm the execution of the business.

Really execute the business

Do not do any business inspection

Use only business resources reserved during the Try phase

3. Cancel: cancel the execution of the business.

Release business resources reserved during the Try phase

A complete TCC transaction involves three types of participants:

Main business service: the initiator of the whole business activity; in the combined payment scenario mentioned earlier, the payment system is the main business service.

Slave business service: responsible for providing the TCC business operations; it is the operator of the business activity. A slave business service must implement three interfaces, Try, Confirm, and Cancel, for the main business service to invoke. Because Confirm and Cancel may be called repeatedly, these two interfaces must be idempotent. In the combined payment scenario above, the balance system and the red packet system are both slave business services.

Business activity manager: manages and controls the entire business activity, including recording and maintaining the state of the TCC global transaction and the state of each slave service's sub-transaction, invoking the Confirm operations of all participants when the business activity is committed, and invoking their Cancel operations when the business activity is cancelled.

As you can see, the whole TCC transaction is transparent to the main business service; the business activity manager and the slave business services each do their part of the work.
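To make the division of labour concrete, here is a minimal sketch of a slave-service interface and a business activity manager that drives it. The interface shape and class names are illustrative assumptions; a real framework would also persist the transaction state and retry Confirm/Cancel (relying on their idempotency), which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

public class TccSketch {
    // Interface every slave business service exposes to the main business service / framework.
    public interface TccParticipant {
        boolean tryReserve(String txId);   // Try: check and reserve business resources
        void confirm(String txId);         // Confirm: execute using reserved resources (idempotent)
        void cancel(String txId);          // Cancel: release reserved resources (idempotent)
    }

    // Business activity manager: records participants and drives Confirm-all or Cancel-all.
    public static class BusinessActivityManager {
        private final List<TccParticipant> participants = new ArrayList<>();

        public void enlist(TccParticipant participant) {
            participants.add(participant);
        }

        public boolean execute(String txId) {
            List<TccParticipant> reserved = new ArrayList<>();
            for (TccParticipant p : participants) {
                if (p.tryReserve(txId)) {
                    reserved.add(p);
                } else {
                    // Some Try failed: cancel every participant whose Try succeeded.
                    for (TccParticipant r : reserved) {
                        r.cancel(txId);
                    }
                    return false;
                }
            }
            // Every Try succeeded: confirm all participants.
            for (TccParticipant p : participants) {
                p.confirm(txId);
            }
            return true;
        }
    }
}
```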

Advantages and limitations of TCC

The advantages of TCC transactions are as follows:

It solves the atomicity problem of cross-application business operations and is very practical in scenarios such as combined payment and account splitting.

TCC essentially moves two-phase commit from the database layer up to the application layer; each database only performs a one-phase (local) commit, which avoids the poor performance of 2PC at the database layer.

The main drawbacks of TCC transactions are:

The Try, Confirm, and Cancel operations must be implemented by the business itself, so the development cost is high.

Of course, whether this counts as a flaw of TCC transactions is a matter of opinion.

Understanding TCC through a case

To be honest, TCC's theory can be a little confusing, so let's walk through an account-splitting example to describe the flow of a TCC transaction, which should help in understanding TCC. A code sketch follows the example.

The account-splitting business scenario is as follows: accounts A, B, and C live in three different sub-databases, and A and B together transfer a total of 80 yuan to C.

1. Try: try to execute the business.

Complete all business checks (consistency): check that the status of accounts A, B, and C is normal, that the balance of account A is at least 30 yuan, and that the balance of account B is at least 50 yuan.

Reserve the necessary business resources (quasi-isolation): increase the frozen amount of account A by 30 yuan and the frozen amount of account B by 50 yuan. This ensures that no other concurrent process can deduct from these two accounts' balances and leave accounts A and B with insufficient available balance for the subsequent real transfer.

2. Confirm: confirm the execution of the business.

Really execute the business: if the Try phase found the status of accounts A, B, and C normal and the balances of accounts A and B sufficient, then perform the transfers: account A transfers 30 yuan to account C, and account B transfers 50 yuan to account C.

Do not do any business checks: no checks are needed at this point, since they were already completed in the Try phase.

Use only the business resources reserved during the Try phase: only the amounts frozen on accounts A and B in the Try phase are used.

3. Cancel: cancel the execution of the business.

Release the business resources reserved during the Try phase: if the Try phase only partially succeeded, for example account A's balance was sufficient and the corresponding amount was frozen successfully, but account B's balance was insufficient and its freeze failed, then account A's Cancel must be invoked to unfreeze account A's frozen amount.
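Here is a minimal sketch of what the Try/Confirm/Cancel of one account branch might look like, assuming an account table with balance and frozen_amount columns; the table, columns, and method names are illustrative assumptions, and the idempotency required of Confirm and Cancel is only hinted at in comments.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;

// One slave business service managing a single sub-database (e.g. the one holding account A).
public class AccountTccService {
    private final DataSource dataSource;

    public AccountTccService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Try: check the account and freeze the amount (e.g. 30 yuan for account A).
    public boolean tryFreeze(String accountId, long amount) throws Exception {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                 "UPDATE account SET frozen_amount = frozen_amount + ? " +
                 "WHERE id = ? AND status = 'NORMAL' AND balance - frozen_amount >= ?")) {
            ps.setLong(1, amount);
            ps.setString(2, accountId);
            ps.setLong(3, amount);
            return ps.executeUpdate() == 1;   // 0 rows updated: check failed, Try is rejected
        }
    }

    // Confirm: really transfer the money out, using only the frozen amount (must be idempotent).
    public void confirmTransferOut(String accountId, long amount) throws Exception {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                 "UPDATE account SET balance = balance - ?, frozen_amount = frozen_amount - ? " +
                 "WHERE id = ?")) {
            ps.setLong(1, amount);
            ps.setLong(2, amount);
            ps.setString(3, accountId);
            ps.executeUpdate();
            // A real implementation would record the txId so repeated Confirm calls become no-ops.
        }
    }

    // Cancel: unfreeze the amount reserved in Try (must be idempotent).
    public void cancelFreeze(String accountId, long amount) throws Exception {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                 "UPDATE account SET frozen_amount = frozen_amount - ? WHERE id = ?")) {
            ps.setLong(1, amount);
            ps.setString(2, accountId);
            ps.executeUpdate();
        }
    }
}
```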

These are the common solutions for distributed transactions. If you have similar questions, the analysis above should help you work through them.
