
Best practices | Technical Innovation of OceanBase transaction engine

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Yan Ran, a senior technical expert on Ant Financial Services Group's OceanBase team and one of the founding members of OceanBase, is currently responsible for research and development of the transaction engine and for performance optimization.

The following is a transcript of the speech:

First of all, let's talk about a topic that everyone is very interested in: what is the biggest difference between OceanBase and existing classic databases such as Oracle and SQL Server? OceanBase is a cloud-native database. All the software we build rests on hardware, so we first need to understand where today's hardware stands.

As shown in the figure above, IBM's solution is the most familiar one. It runs from the System/360 of more than 50 years ago to the new IBM z14 last year, and many organizations still use it today.

We talk about cloud computing every day, but what exactly is cloud computing? It did not appear out of nowhere; its origins go back to DEC's minicomputers. These minicomputers were contemporaries of the IBM mainframe and met storage and computing needs at a lower price. Compared with the IBM mainframe, they were cheaper and easier to obtain.

DEC was later displaced by chip-based solutions such as Sun workstations, which offered better performance, lower prices and more versatile platforms. After that, the PC server took over the baton: the CPU, memory and storage can come from different manufacturers, yet together they still provide a standardized computing and storage platform.

Compared with IBM mainframes, data centers built on PC servers provide computing and storage through an industry supply chain, which is more cost-effective and easier to expand. The IBM mainframe, however, is designed as an integrated software and hardware system, so it can meet many requirements with hardware-level solutions. For example, to meet the high-reliability demands of financial business, mainframes provide reliability guarantees at the hardware level: storage, memory and even CPUs are all backed by redundancy. A single PC server, by contrast, has a relatively high failure rate, so a completely different approach is needed to meet business requirements for high reliability and high availability.

Here is a quote from Yang Zhenkun, founder of the OceanBase team: "Computers are naturally not suited to databases, yet databases naturally chose computers." Why are computers naturally not suited to databases? Because the key issues in handling data are reliability, security and consistency, while computer hardware can fail in many ways. Whether it is a power outage, a software bug, an operating system problem, or an entire data center going down, computers will inevitably make mistakes. Yet the problem we have to solve is one where mistakes are not allowed, so making a database reliable, secure and consistent on top of computers is a great challenge.

Achieving more reliable database transaction services on less reliable machines is the serious challenge we face today.

OceanBase transaction

Of course, where there are challenges, there are opportunities. What has OceanBase done in the face of these challenges? In a cloud environment built on PC servers, OceanBase not only scales elastically, but also delivers highly reliable and highly available database transactions without relying on high-end hardware.

The hardware that OceanBase depends on is an ordinary cloud environment. On this hardware we can still provide database transactions, achieve high performance, and even support financial-level reliability.

What is financial-level reliability? Everyone has an intuitive feel for it. For example, if you publish a post and it gets lost, you may be annoyed, but not seriously upset. In another scenario, you transfer 100 yuan to a friend, but your friend receives only 98 yuan; now you would certainly be alarmed. In financial scenarios, therefore, users must be given adequate protection, and this is what we call financial-level reliability. The difficulty of financial-level reliability lies in handling the details. The devil is in the details, and when we build database software we need very strong control over them.

The points highlighted here correspond to the key properties of database transactions.

Implementation of transaction ACID under the OceanBase architecture:

Durability: the transaction log is synchronized to multiple replicas using Paxos
Atomicity: two-phase commit ensures the atomicity of cross-machine transactions
Isolation: a multi-version mechanism is used for concurrency control
Consistency: uniqueness constraints are guaranteed

OceanBase storage engine

The architecture of OceanBase is based on the LSM Tree. Why is it based on the LSM Tree, and what characteristics does it have?

In a classic database, all data is stored in blocks on persistent storage such as disk or SSD. To read data, the block must first be loaded into memory. To modify data, the block is likewise loaded into memory and changed there. When memory fills up, the dirty pages are flushed back to storage.

OceanBase is based on the LSM Tree, which lets modifications accumulate in an in-memory structure. The foreground keeps modified data in memory for as long as possible, and the background periodically merges it into persistent storage. OceanBase has a mechanism called daily merge: if memory can hold a full day of modifications, the background merge runs only once a day, at night.
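The idea of accumulating changes in memory and merging them later can be sketched as follows. This is a minimal illustration, not OceanBase's storage engine: the class `TinyLSM` and its method names are hypothetical, and a plain dictionary stands in for the persistent baseline data.

```python
class TinyLSM:
    """Toy LSM-style store: writes go to memory, a merge folds them into the baseline."""

    def __init__(self):
        self.baseline = {}   # stands in for the persistent, read-only data on disk
        self.memtable = {}   # recent modifications, kept only in memory

    def put(self, key, value):
        # A write never touches the baseline; it only updates the in-memory structure.
        self.memtable[key] = value

    def get(self, key):
        # Reads check the newest data first, then fall back to the baseline.
        if key in self.memtable:
            return self.memtable[key]
        return self.baseline.get(key)

    def merge(self):
        # Background merge (for example, once per night): fold all in-memory
        # changes into a new baseline, then start again with an empty memtable.
        self.baseline = {**self.baseline, **self.memtable}
        self.memtable = {}


store = TinyLSM()
store.put("han", 100)
store.put("han", 150)            # overwrites in memory, the baseline is untouched
before_merge = store.get("han")
store.merge()
after_merge = store.get("han")
```

Reads return the same value before and after the merge; only the place where the value lives changes.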

OceanBase schedules the merge within the system. While the first machine flushes dirty pages in the background, business traffic runs on the other machines, so foreground business is not affected. When the first machine finishes, traffic is switched back to it, and the second machine then performs its own flush. With this "round-robin merge", we solve the dirty-page flushing problem by staggering the work across the different machines and services in the cluster.
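The round-robin merge can be pictured as a simple schedule. The sketch below is only an illustration under the assumption that every machine can serve the traffic of any other; the function name `round_robin_merge` and the machine names are hypothetical.

```python
def round_robin_merge(machines):
    """Merge one machine at a time; return (merging machine, machines serving traffic)."""
    schedule = []
    for merging in machines:
        # Traffic is moved away from the machine that is about to merge.
        serving = [m for m in machines if m != merging]
        schedule.append((merging, serving))
        # Once its merge finishes, the machine serves traffic again,
        # which is why it reappears in the serving list of the next step.
    return schedule


plan = round_robin_merge(["m1", "m2", "m3"])
```

At every step exactly one machine is merging, and it never appears among the machines serving traffic.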

The design of the OceanBase transaction engine also takes advantage of the LSM Tree structure. All execution state of a transaction is kept in memory, and the data a transaction modifies is not persisted until the transaction commits. So there is no need for undo when implementing transaction atomicity, and OceanBase can implement database transactions in a more concise way. This is the overall logic of the OceanBase storage engine.
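Why no undo is needed can be shown with a small sketch. It simplifies heavily: each transaction keeps its changes in a private in-memory buffer, which is an assumption made for illustration and not a description of OceanBase's internal structures. The names `MemStore` and `Txn` are hypothetical.

```python
class MemStore:
    def __init__(self):
        self.committed = {}          # data visible to other transactions

    def begin(self):
        return Txn(self)


class Txn:
    def __init__(self, store):
        self.store = store
        self.pending = {}            # uncommitted changes live only here, in memory

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        # Changes reach the shared state only at commit time.
        self.store.committed.update(self.pending)
        self.pending = {}

    def rollback(self):
        # Nothing was ever written to the shared state, so there is nothing
        # to undo: rolling back just discards the private changes.
        self.pending = {}


store = MemStore()
t1 = store.begin()
t1.write("balance", 100)
t1.rollback()                        # the store never saw this write
t2 = store.begin()
t2.write("balance", 98)
t2.commit()
```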

OceanBase memory transaction engine

Transaction execution in OceanBase takes place in memory. One of the biggest differences between memory and hard disk is that memory is randomly addressable at byte granularity, and a great benefit of this is that much richer data structures become possible.

Take the data in the table above as an example. Suppose the first row is updated: Han's pay is increased by 500,000. We need to write a new value for the corresponding column of Han's row, and in memory this is represented as a linked list. If, after the raise, Han decides to change departments to take on a new challenge, we need another update. We simply link a new update node into the list to change his department: he has moved from the R&D department to the investment department, so the node records the change to that column.

This is the multi-version concurrency control used inside the database. It records the time of each update and ensures that modifications and reads do not affect each other, because even after a row has been changed, its historical data can still be read directly. Linked-list data structures are also very friendly to programmers.
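The linked list of updates described above can be sketched as a version chain. This is a minimal illustration: the class names, the version numbers 10, 20 and 30, and the salary figures are all hypothetical, chosen only to mirror the raise and the department transfer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    version: int                 # commit version of this update
    changes: dict                # only the columns this update modified
    older: Optional["Version"]   # link to the previous update of the same row


class Row:
    def __init__(self, version, values):
        self.head = Version(version, dict(values), None)

    def update(self, version, changes):
        # An update is linked in front of the chain; older versions stay untouched.
        self.head = Version(version, dict(changes), self.head)

    def read(self, snapshot):
        # Walk from newest to oldest and keep, for each column, the newest
        # value whose version is not later than the snapshot.
        result = {}
        node = self.head
        while node is not None:
            if node.version <= snapshot:
                for column, value in node.changes.items():
                    result.setdefault(column, value)
            node = node.older
        return result


row = Row(10, {"name": "Han", "dept": "R&D", "salary": 1_000_000})
row.update(20, {"salary": 1_500_000})      # the raise
row.update(30, {"dept": "Investment"})     # the department transfer

old_view = row.read(15)
new_view = row.read(30)
```

A reader holding snapshot 15 still sees Han in R&D with the old salary, while a reader at snapshot 30 sees both changes; neither reader blocks the writer.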

By contrast, a classic disk-based database reads and writes fixed-length data blocks, and it also represents information in terms of blocks. One of the biggest benefits of OceanBase's approach is that information is expressed much more simply, which makes processing far more efficient. This is one of the important reasons why OceanBase can provide stronger transaction processing capability on today's hardware.

Let's review the architecture of OceanBase. After the data is partitioned, OceanBase can spread it across multiple machines in multiple clusters for storage, which provides linear scalability.

As shown in the figure above, there are multiple copies of the same data. For example, P0 is on three machines, which may be in the same data center or in different data centers. They all serve the same piece of data, but only one of them is the current master, which is responsible for executing database operations and synchronizing transaction changes to the other machines. The data organization and distribution described here exists to solve the database reliability problem discussed next.

Reliability vs availability

High reliability and high availability are concepts that every database product talks about. So what exactly do reliability and availability mean?

In traditional database theory, ACID has no concept of availability, only durability. However, not losing data does not guarantee continuity of service, and the ability to fail over is very important in real systems. So all commercial databases offer some availability solution. The classic one is master-standby synchronization: when the master fails, the standby continues to provide service, which addresses availability.

OceanBase uses the Paxos protocol to solve both reliability and availability. For any database transaction to be durable, its transaction log must be persisted on multiple replicas. With three replicas, we consider the transaction successful as soon as its log has been persisted to disk on two of them. In other words, if any one machine breaks down, at least one copy of the log still survives.

Looking at it the other way around, is it better to synchronize to more machines? Wouldn't it be more reliable to require all three replicas to be synchronized? This is where availability comes in. If one of the machines has a network failure, or is too heavily loaded to respond, should the transaction count as successful?

If all three were required to succeed, the transaction could not be considered successful, because one machine is not responding. The Paxos protocol only requires the log to be synchronized to two machines; even if one does not answer, we still consider the transaction successful because the majority succeeded.

This guarantees that the failure of any one of the three replicas does not affect the continuous service of the system, which is a good balance between reliability and availability. If you need stronger protection, you can choose a five-replica deployment, which keeps the system reliable and available even when two machines fail.
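The majority rule and the failure tolerance of three and five replicas come down to simple arithmetic, sketched below. The function names are hypothetical; the only fact used is that a strict majority of replicas must persist the log.

```python
def majority_reached(acks, replicas):
    """True once a strict majority of replicas has persisted the log."""
    return acks >= replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas may fail while a majority can still be formed."""
    return (replicas - 1) // 2


# Three replicas, one of them not responding: two acknowledgements are enough.
two_of_three = majority_reached(acks=2, replicas=3)
one_of_three = majority_reached(acks=1, replicas=3)

tolerance = {n: tolerated_failures(n) for n in (3, 5)}
```

With three replicas one failure is tolerated, and with five replicas two failures are tolerated, which matches the description above.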

So we add another A to ACID: we want transactions to be reliable, and we also want the transaction processing capability to stay available.

Distributed transaction two-phase commit protocol

If a database writes its log on a single machine, it only writes the log file on that machine: if the write succeeds the transaction succeeds, and if it does not, the transaction fails. But with multiple machines, what should we do if machine A succeeds and machine B fails?

The two-phase commit protocol exists to solve this problem. A commit can no longer succeed in a single step; multiple machines have to reach agreement, and the transaction is considered successful only when every machine has succeeded.
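The textbook form of two-phase commit can be sketched in a few lines. This is the generic protocol, not OceanBase's optimized variant; the class `Participant`, the flag `can_commit` and the machine names are hypothetical, and networking, logging and failures are left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. A "yes" vote is a promise to be able to commit later.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's decision.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect every vote
    decision = all(votes)                         # commit only if all voted yes
    for p in participants:                        # phase 2: broadcast the decision
        p.finish(decision)
    return decision


good = [Participant("A"), Participant("B")]
bad = [Participant("A"), Participant("B", can_commit=False)]
good_result = two_phase_commit(good)
bad_result = two_phase_commit(bad)
```

When machine B cannot commit, machine A is rolled back as well, so the two machines never disagree about the outcome.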

In fact, the two-phase commit protocol is rarely used in practice. Why? Mainly because it is very complex: the theory is elegant, but there are a great many details to get right. In OceanBase's business scenarios, however, we must use the two-phase commit protocol.

OceanBase's starting point is that every participant is highly available. Because OceanBase uses the Paxos protocol to make each partition highly available, the failure of any single machine does not stop the service, which is a very important premise. In addition, Paxos synchronization adds latency across networks and data centers, so the original two-phase commit protocol, which writes logs several times, becomes more expensive. One thing OceanBase does is to have the coordinator write no log at all and keep only in-memory state. A very important benefit is low commit latency. At the same time, because all participants are highly available, we avoid the problems that usually arise in two-phase commit, such as those caused by a coordinator outage.
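One way a coordinator can get by without its own log is to re-derive the outcome from the participants' durable states after a restart. The sketch below illustrates that idea only. It assumes that participants are always reachable (the high-availability premise above) and that a coordinator never aborts once every participant has prepared; both are assumptions for this example, and the function `recover_decision` and the state names are hypothetical, not OceanBase's actual recovery procedure.

```python
def recover_decision(participant_states):
    """Re-derive the transaction outcome from the participants' durable states."""
    if "committed" in participant_states:
        # The decision had already been commit; the rest must follow.
        return "commit"
    if all(state == "prepared" for state in participant_states):
        # Every vote was yes. Under the stated assumption that the coordinator
        # never aborts after a unanimous prepare, the outcome must be commit.
        return "commit"
    # Someone aborted or never prepared, so the transaction cannot commit.
    return "abort"


all_prepared = recover_decision(["prepared", "prepared"])
partly_committed = recover_decision(["committed", "prepared"])
one_aborted = recover_decision(["prepared", "aborted"])
```

Because the answer depends only on state the participants have already persisted, the coordinator's in-memory state can be lost without leaving the transaction undecided.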

OceanBase transaction isolation

Transaction isolation is about how concurrency control is done, and OceanBase uses multi-version concurrency control. A read request takes the system's current publish version as its snapshot version. When a transaction commits, a version number is generated for it; version numbers increase continuously, and this number becomes the version of all data the transaction modified.

In the single-machine case, a single log entry determines whether a transaction is finally committed. The position of that log entry determines the transaction's version number, and version numbers increase continuously. Once distributed transactions are involved, things change. For example, log 230 may be a prepare log, which does not mean that its transaction can be committed, while log 232 may be the commit log of another transaction. To commit that other transaction, version 232 must become readable, yet 230 still belongs to an unresolved transaction that holds row locks. This calls for an additional control: for a transaction that is between the prepare and commit phases of two-phase commit, the rows it modified stay locked and cannot be read.

The impact is small. There is no user interaction between prepare and commit, so this window lasts only milliseconds; the row is locked for just a brief moment during the commit process. During that moment, a read waits for the transaction to finish committing and then decides whether it should see that row's data.
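The interaction between snapshot reads and rows locked between prepare and commit can be sketched as follows. This is a simplified model, not OceanBase's implementation: the class `LockedRow`, the function `read` and the "wait" result are hypothetical, and the rule that only prepare versions at or below the snapshot force a wait is an assumption added for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LockedRow:
    committed: list = field(default_factory=list)   # (version, value), oldest first
    prepared_version: Optional[int] = None          # set between prepare and commit


def read(row, snapshot):
    # A prepared transaction with a version at or below the snapshot might
    # still commit and become visible, so the reader cannot decide yet.
    if row.prepared_version is not None and row.prepared_version <= snapshot:
        return ("wait", None)
    visible = [value for version, value in row.committed if version <= snapshot]
    return ("ok", visible[-1] if visible else None)


row = LockedRow(committed=[(220, "R&D")], prepared_version=230)
during_prepare = read(row, 232)        # must wait for transaction 230 to resolve
older_snapshot = read(row, 225)        # unaffected by the pending transaction

# The pending transaction commits: its change becomes a regular version.
row.committed.append((230, "Investment"))
row.prepared_version = None
after_commit = read(row, 232)
```

The reader at snapshot 232 waits only for the short window in which the row is prepared but not yet committed, and then sees the new value.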

Summary

Based on these technical innovations, OceanBase delivers highly reliable and highly available transactions in the cloud environment, together with good performance. We hope OceanBase can help more businesses meet their data storage and query needs, so that they no longer suffer from the high price and poor scalability of traditional commercial databases.
