The principle of database architecture of Alipay supporting 213.5 billion turnover 07/12 Update SLTechnology News&Howtos

The principle of database architecture of Alipay supporting 213.5 billion turnover

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

SQL Optimizer and distributed parallel execution of OceanBase

Abstract: this paper mainly introduces the distributed architecture of the general relational database OceanBase,OceanBase independently developed by Ant Financial Services Group, which achieves better reliability, availability and expandable rows on the ordinary PC server cluster through technological innovation. In this article, Pan Yi (nickname: Boze), a senior technical expert of Ant Financial Services Group's OceanBase team, introduces OceanBase and shares the SQL optimizer, distributed transaction execution logic and other contents, showing you the technical innovation of the underlying transaction engine of OceanBase.

A brief introduction to OceanBase

OceanBase is a general relational database independently developed by Ant Financial Services Group. It adopts a distributed architecture and currently supports all the core link systems of Ant Financial Services Group.

Why develop OceanBase?

As the basic software, database takes a long time to develop and costs a lot of money, so why can't Ant Financial Services Group adopt off-the-shelf solutions, such as commercial databases or open source databases? Once upon a time, Alibaba had the largest Oracle cluster in Asia, but with the development of Internet business, especially in the past decade, Ali's traffic increased exponentially every year, and surged on certain dates, which is also the difference between the Internet industry and traditional banking, telecommunications and other industries in database applications. Traditional industries can make plans for the next few years, such as customer size, business volume, etc., which are predictable to a certain extent. But the Internet industry is different, the traffic in the Internet industry is changing very fast, on the one hand, it is difficult to use commercial databases to expand rapidly to cope with the rapid growth of Alibaba; on the other hand, the reliability and high availability of traditional databases need to rely on extremely expensive hardware, and the cost will be very high. At the same time, Alibaba has a huge difference in traffic at the peak and peace peak, so it will cause a serious waste of resources to achieve high traffic support on a special date through hardware. To sum up, the existing commercial database can not well support Alibaba's entire Internet industry. And the use of open source database will also lead to a series of problems, take MySQL as an example, the first point, the Internet industry business traffic is large, and high concurrency, every query needs to be executed in a very short time, so for the general database, it is necessary to master the core code to ensure stable business requirements. Second, the financial industry needs a database with strong analysis and query capabilities, while the ability of MySQL in analysis and query is relatively weak, unable to meet business needs. For the above reasons, Ant Financial Services Group needs to develop a database to meet his business needs.

II. OceanBase architecture

This part will introduce the architecture design of OceanBase from three parts: cluster topology, partition-distributed protocol and storage engine.

1. Cluster topology

Multiple replicas: generally deployed into three subclusters, each subcluster consists of multiple machines, and the data is stored in different clusters.

Fully peer-to-peer nodes: each node is functionally equivalent and manages different data partitions.

Do not rely on shared storage: shared storage is expensive, so local storage is adopted.

2. Partition-distributed protocol

Data partitioning: data partitioning is supported. Each data partition is used as a management group, and each partition selects the master independently to write the log.

High availability & strong consistency: data (logs) are synchronized to most machines through the PAXOS protocol, and the master is automatically cut off in case of failure.

3. Storage engine

The storage engine adopted by OceanBase is the classical LSM-Tree architecture, and the data is mainly divided into two parts, namely, the baseline data stored in the hard disk (SSTable) and the incremental data stored in memory (MenTable). In this storage engine, all data additions, deletions and modifications are carried out in memory to form incremental data, and the incremental data and baseline data are merged at regular intervals to avoid random writes to SSD. Because the operation for the database is a full memory operation, the effect of the DML operation is very good, but if there are too many data modification operations in a certain period of time, the amount of data is too large, resulting in memory overflow, in this case, OceanBase provides the corresponding solution, that is, the dump operation, which will move the data to the hard disk, but will not merge the data, and the incremental data will be merged at the same time in the subsequent merge. The baseline data is merged with the dump data. It can be seen that this architecture will face a great challenge, that is, after the addition, deletion and modification operation, the query operation may need to merge the baseline data and incremental data, so under this architecture, the read operation may be punished for a certain amount of time, which is also a problem that the SQL optimizer needs to consider. In fact, if the optimizer is not developed based on its own architecture and business requirements, it may not work well, which is another reason to develop its own database.

III. SQL optimizer

The goal of SQL optimization is to minimize the execution time of SQL (plan generation + plan execution). This part mainly introduces SQL optimization from two aspects: OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing). For queries with a large amount of data, planned execution often takes up most of the time, but for many queries with small amount of data, more consideration should be given to the optimization scheme generated by the plan.

1. Cost Optimizer based on LSM-Tree structure

The OceanBase optimizer is based on the classical System R model and mainly carries out two stages of optimization. The first stage is to generate a local optimal plan based on all relationships, which mainly considers the cost of CPU and IO; the second stage is to perform optimization in parallel, considering the cost of CPU,IO,Network and Overhead. At the same time, the cost model also needs to consider the characteristics of LSM-Tree structure, such as data fusion of MemTable and SSTable, the cost of different tables may be different, so the cost needs to be calculated by dynamic sampling; the system is a distributed share nothing system; indexing table operation will have additional cost, using logical rowkey instead of physical rowid, so the time consumption of table return will increase, which are all factors that the optimizer needs to consider.

two。 Basic capabilities of optimizer

The optimizer mainly includes two types, namely, the logical optimizer and the physical optimizer. The logical optimizer mainly optimizes query rewriting and other operations, such as rule-based and cost-based optimization. The physical optimizer mainly optimizes the join order, join algorithm and access path, while taking into account Meta Data, such as statistics and table partition information. When the statistical information and the cost model are available, the execution cost of the model can be estimated, and then the model with the best cost can be selected for corresponding operation. The plan management module is more important for OLTP and can better optimize the short query.

3. Statistics for the optimizer

The OceanBase optimizer implements very complete statistics, including tables (avg rowlen, # of rows), columns (column NDV, null value, histogram, min/max), and partition / row-level statistics. To prevent the introduction of additional overhead, statistics are performed only when large versions of the data are merged. The storage engine can provide accurate Cardinality estimates of the amount of data for some predicates, for example, the start and end intervals of the scanned index can be calculated through the predicate, each block in SStable has the number of metadata rows, and the metadata about insert,delete,update operations can be counted in MemTable. In OceanBase, if the data is modified during the merging process, the data will not be accurate enough. At this time, dynamic sampling will be used to solve this problem.

4. Access path

Because the SQL query plan of OLTP is relatively simple, generally speaking, it is often a single-table, single-index query, so the access path is very important for OLTP. Therefore, it is necessary to make appropriate choices when doing SQL queries, such as primary key scan or index scan, single-column index or multi-column index. The selection criteria are mainly based on the rule model and the cost model. The rule model includes decisive rules (such as primary key scanning for query if the primary key is fully covered) and pruning rules (using skyline pruning rules to choose a better index by comparing multiple dimensions). After that, the optimal model is selected for query through the comparison of cost models. The main factors considered in the model include: scanning range, whether to return to the table, filter conditions, Interesting order and so on.

5. Plan caching

Plan caching means that after a plan is generated, if the same query or similar query appears later, the existing plan can be used without regenerating the plan. The plan cache matches the input query statements through an efficient lexical parser and uses parameterized query optimization to match. The following is an example of a planned query.

As you can see, fuzzy matching of parameters is performed for query statements in the plan cache, but a qualification is added to the parameters with a specific meaning. For example, parameter 3 in order by 3 represents the third column, and the modification of this parameter may lead to the inapplicability of the plan cache, so the qualification @ 3 = 3 is added when storing the plan cache.

6. Adaptive sharing schedule

Is it possible to match the plan cache for a query statement as long as the parameters are similar? The answer is no, for example, in a query statement, salesman will be queried by full table scan because of its large amount of data, while for president, because the amount of data is very small, the cost of querying through index may be better than the cost of full table scan, so in this case, the corresponding qualification should also be added. However, the regenerated plan may be the same as the original plan, and when this happens, the original qualification will be modified to facilitate the subsequent query statement to match the plan, so as to achieve adaptive plan sharing.

7. Hint/Outline

If you are not satisfied with the automatically generated plan in OceanBase, you can also bind the custom plan by creating Outline, that is, making the plan generation through Hint. The types of Hint are very rich, including: access path, join order, join method, parallel distribution and so on. Here are two examples of Outline binding.

8.SQL Program Management and Evolution

Many users, especially enterprise users, have very high requirements for stability, so OceanBase will automatically evolve the traffic after system upgrade, statistics update, and hardware update, that is, running both the new plan and the old plan at the same time. When it is determined that the new plan has no performance fallback compared to the old plan, the old plan will be gradually replaced with the new plan.

9. Zoning and zoning tailoring

OceanBase supports multiple partition types, including Range,Hash,Key,List. Range/Range, Range/Hash, Range/List, Hash/Hash, Hash/Range and so on are supported for secondary partitions. For static or dynamic partition tailoring supports inlist, function expressions, join filter and other ways.

10. Query rewriting

Query rewriting mainly includes rule-based rewriting and cost-based rewriting. Rule-based rewriting mainly includes view merging, subquery expansion, filter condition push-down, join condition push-down, equivalence condition derivation, outer join elimination and so on. Cost-based rewriting mainly includes OR-expansion,Window function, and query rewriting has a very good optimization effect on OLAP. The following figure shows the rewriting framework based on the cost model.

IV. SQL execution engine

The optimizer and the execution engine complement each other, and the plan that the optimizer can optimize depends on the execution plan of the execution engine.

1. Parallel execution

The concept of parallel execution is divide-and-conquer, which includes vertical divide-and-conquer (such as split plan as sub-plan unit) and horizontal divide-and-conquer (such as GI (Granule Iterator) to obtain scanning tasks). Parallel execution is mainly used in OLAP scenarios to solve the problem of querying RT time, which is necessary in many online analysis scenarios. RT time is an important indicator for RDBMS queries, and the performance of traditional Map-Reduce can not meet the performance requirements of OLAP, so how to find efficient execution plans and data pipelining is very important. In OceanBase, two-level scheduling, adaptive sub-plan scheduling framework and independent task segmentation of each node are adopted to execute in parallel. For data redistribution, OceanBase supports most common data distributions, including Hash, Broadcast/Replicate,Round Robin,Merge Sort, etc.

two。 Two-level scheduling

Two-level scheduling is formed in OceanBase in the manner described below. An execution node (SQC) is created on each machine involved in the query to let the master node (QC) control SQC, where QC is machine-level control and SQC is thread-level control. QC carries on the global scheduling, allocates each node's sub-plan parallelism according to the total parallelism, and SQC carries on the local scheduling, in which each node independently determines the horizontal parallel granularity and execution.

3. Planning dynamic scheduling

Planned dynamic scheduling refers to reducing the intermediate result cache and constructing the execution pipeline of two or more plan subtrees (vertical parallelism) within the allowable range of resource use according to user-specified or system resource adaptability. this way can effectively avoid materialization and reduce the adverse effects of materialization operators on parallel execution. This feature is under development and testing.

4. Parallel execution plan

OceanBase has parallel execution methods of all the major operators, including nested-loop join, merge join, hash join,aggregation, distinct, group by,window function, count, limit, etc., and supports rich redistribution methods and a variety of candidate plans, such as partition-wise join, partial partitionwise join,broadcast, hash, sort (for distributedorder by) and so on.

In fact, the optimization technology of parallel query gives rise to new problems under the MPP architecture, such as the partition join requires that the partitions of each table are logically and physically the same, which is also a direction that needs to be considered and optimized.

5. Compilation execution

Traditional execution methods such as type checking, polymorphisms (virtual functions) are inefficient for memory operations. OceanBase considers using LLVM to dynamically generate execution codes, SQL expressions can support dynamic generation of execution codes, and stored procedures are compiled and executed directly to improve performance.

Finishing: Yang Deyu

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.