
In Depth | What Is the Underlying Logic of POLARDB Leading Domestic Databases to the World?


POLARDB is the next-generation cloud-native distributed database developed independently by Aliyun. It is 100% compatible with open-source databases such as MySQL and PostgreSQL and highly compatible with Oracle syntax. Customers using RDS can migrate to POLARDB with one click, without modifying application code, and gain larger capacity, higher performance, lower cost, and greater elasticity. POLARDB is currently Aliyun's fastest-growing database product, widely used in Internet finance, government service projects, new retail, education, gaming, social networking and live streaming, and other industries.

As a new-generation cloud-native database built on a compute-storage separation architecture, POLARDB's compute nodes handle SQL parsing and optimization, parallel query execution, and lock-free, high-performance transaction processing, and they synchronize in-memory state among themselves through a high-throughput physical replication protocol. The storage layer is built on the distributed file system PolarFS, which achieves strong consistency across multiple data replicas through the ParallelRaft consensus algorithm. Multi-version page management for the storage engine is carried out in the storage layer, supporting the Snapshot Isolation level across all compute nodes in the cluster. Building on this separation of compute and storage, operators such as filters and projections are pushed down from the compute layer to the storage layer through a smart interconnect protocol that understands database semantics. To keep transaction and query latency low and to reduce the delay of state synchronization between compute nodes, compute and storage nodes communicate over a high-speed 25Gb RDMA network using a kernel-bypass, user-space network protocol stack. With this architecture, POLARDB can scale elastically from 1 compute node (2 CPU cores) to 16 compute nodes (up to 1,000 cores), and the storage capacity of a single instance expands elastically with usage from 10GB to 100TB.
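As a rough illustration of why operator pushdown pays off, here is a hedged Go sketch: StorageNode, Row, and Predicate are invented names, and POLARDB's smart interconnect protocol is far richer than a function call, but the principle is the same, since only filtered rows cross the network.

```go
package main

import "fmt"

// Row is a hypothetical record; Predicate is a filter the compute
// layer would otherwise apply after fetching whole pages.
type Row struct {
	ID    int
	Value int
}
type Predicate func(Row) bool

// StorageNode sketches a storage layer that understands enough of the
// query's semantics to apply a filter before shipping rows back.
type StorageNode struct{ rows []Row }

// ScanWithPushdown returns only the matching rows, so the network
// carries the filtered result instead of every page.
func (s *StorageNode) ScanWithPushdown(p Predicate) []Row {
	var out []Row
	for _, r := range s.rows {
		if p(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	node := &StorageNode{rows: []Row{{1, 5}, {2, 50}, {3, 500}}}
	// The compute node pushes "Value > 10" down instead of reading all rows.
	hits := node.ScanWithPushdown(func(r Row) bool { return r.Value > 10 })
	fmt.Println(hits) // [{2 50} {3 500}]
}
```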

01. Real-Time Horizontal Scalability

The separation of compute nodes from storage nodes gives POLARDB real-time horizontal scalability. Because a single database instance has limited compute power, the traditional approach is to build multiple database replicas to share the load and scale out. That approach, however, requires storing multiple full copies of the data, and the frequent synchronization of log data incurs heavy network overhead. Moreover, on a traditional database cluster, adding a replica means synchronizing all incremental data, which drives up synchronization latency. POLARDB instead stores database files and log files such as the redo log on a shared storage device, so the primary instance and all replicas share the same full and incremental data and logs. Nodes only need to synchronize metadata held in memory, and the MVCC mechanism guarantees read consistency across nodes. This neatly solves the problem of data synchronization between the primary instance and its replicas, saves a great deal of cross-node network overhead, and reduces replication lag between replicas.

02. Kernel-Level Optimizations That Improve Transaction Performance

To improve transaction performance, POLARDB makes extensive optimizations at the kernel level. A series of performance bottlenecks have been reworked with lock-free (lockless) algorithms and various parallel optimizations to reduce, or even eliminate, lock conflicts and greatly increase the system's scalability. Drawing on experience with large-scale, high-concurrency scenarios such as the Singles' Day shopping festival, POLARDB also optimizes operations on hot records such as inventory counters. For simple, repetitive queries, POLARDB can fetch results directly from the storage engine, avoiding optimizer and executor overhead. The already efficient physical replication has been optimized further: for example, adding some metadata to the redo log cuts the CPU cost of log parsing, and this simple change alone reduces log parsing time by 60%. Some data structures are reused to reduce memory-allocator overhead. POLARDB also applies a series of algorithms to optimize redo log application, for example applying logs only to data pages already in the buffer pool, and the page cleaner and doublewrite buffer have been optimized to greatly reduce their cost. Together these optimizations let POLARDB far outperform MySQL, reaching up to 6x MySQL's performance on write-heavy benchmarks such as sysbench oltp_insert.
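The lock-free direction can be shown with a toy Go example (a generic illustration, not POLARDB's kernel code): a hot counter touched by every transaction, such as a sequence number, updated with a single atomic instruction instead of a mutex, so concurrent threads never contend on a lock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// A hot counter (e.g., a transaction sequence number) updated by
	// many threads. The atomic version needs no mutex at all.
	var seq atomic.Uint64
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				seq.Add(1) // one atomic instruction, no lock acquired
			}
		}()
	}
	wg.Wait()
	fmt.Println(seq.Load()) // 800000
}
```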
03. Parallel Query Strengthens Subqueries and Complex Queries Such as Joins

POLARDB's query processor supports parallel query, which executes a single query on multiple, or all, available CPU cores at the same time. Parallel query splits one query task (currently only SELECT statements) into multiple subtasks that are processed in parallel, using a leader-worker concurrency model. The leader thread generates the parallel query plan and coordinates the other components of parallel execution. A parallel execution plan includes sub-actions such as parallel scans, multi-table parallel joins, parallel sorting, parallel grouping, and parallel aggregation. A message queue forms the communication layer between the leader thread and the worker threads: workers send data to the leader through the queue, and the leader sends control information to the workers through it. Worker threads actually perform the tasks. The leader parses the query statement, generates the parallel plan, and then starts multiple workers to process the parallel tasks simultaneously. To execute efficiently, a worker does not re-optimize the query; it directly copies the plan fragment generated by the leader, which requires that every node on the execution plan tree support copying. After scanning, aggregation, sorting, and other operations, the workers return their intermediate result sets to the leader, which collects all the data sets from the workers, performs any necessary second-stage processing (such as merge sorting or a secondary GROUP BY), and finally returns the result to the client.
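A minimal sketch of this leader-worker model, with Go channels standing in for the message queues between leader and workers (the Task type, the partial-sum aggregation, and the 4-worker split are illustrative assumptions, not POLARDB internals):

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical plan fragment copied from the leader;
// each worker scans its own slice of the data.
type Task struct{ data []int }

func worker(tasks <-chan Task, results chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	for t := range tasks {
		sum := 0 // e.g., a partial aggregation over this fragment
		for _, v := range t.data {
			sum += v
		}
		results <- sum // intermediate result back to the leader
	}
}

func main() {
	data := make([]int, 1000)
	for i := range data {
		data[i] = i
	}
	tasks := make(chan Task)
	results := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // leader starts 4 workers
		wg.Add(1)
		go worker(tasks, results, &wg)
	}
	go func() { // leader splits the scan into sub-tasks
		for i := 0; i < len(data); i += 250 {
			tasks <- Task{data: data[i : i+250]}
		}
		close(tasks)
		wg.Wait()
		close(results)
	}()
	total := 0
	for partial := range results { // leader merges partial results
		total += partial
	}
	fmt.Println(total) // 499500
}
```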

The parallel scan layer exploits the data-structure characteristics of the storage engine to balance the workload. The goal of data partitioning is to divide the data to be scanned into multiple partitions so that all worker threads do roughly equal work. For a storage engine built on a B+ tree, partitioning starts from the root: if the root does not yield enough partitions (at least the degree of parallelism), partitioning continues at the next level down. For example, if six partitions are needed but the root node can be divided into at most four, the next level is searched to partition further, and so on. In the actual implementation of parallel query, to spread the scan segments evenly across worker threads, the B+ tree is split into as many partitions as possible; if a worker finishes its current partition early, because a highly selective filter made it cheap, it automatically attaches to the next partition, and this automatic attachment balances the load across all threads.
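The automatic-attach behavior amounts to workers claiming the next unscanned partition from a shared cursor; a minimal sketch (the atomic cursor and PartitionRange are illustrative, not the engine's real structures):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// PartitionRange is an illustrative slice of a B+ tree scan.
type PartitionRange struct{ lo, hi int }

func main() {
	// More partitions than workers, so a worker that finishes a cheap
	// (highly filtered) partition automatically attaches to the next one.
	parts := []PartitionRange{{0, 10}, {10, 20}, {20, 30}, {30, 40}, {40, 50}, {50, 60}}
	var next atomic.Int64
	var wg sync.WaitGroup
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				i := int(next.Add(1)) - 1 // claim the next partition
				if i >= len(parts) {
					return // no partitions left; this worker is done
				}
				fmt.Printf("worker %d scans [%d,%d)\n", id, parts[i].lo, parts[i].hi)
			}
		}(w)
	}
	wg.Wait()
}
```

Because claiming a partition is a single atomic increment, no worker ever idles while unscanned partitions remain, which is exactly the load-balancing property the text describes.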

04. A New-Generation Cost-Based Optimizer

Cloud customers' workloads are diverse, and a wrong execution plan leads to slow queries. To solve such problems systematically, POLARDB introduced a new-generation cost-based optimizer. POLARDB implements a new histogram, the Compressed Histogram, which automatically detects high-frequency values and describes them precisely, factoring both value frequency and the value space into selectivity calculations to handle the data skew common in real applications. POLARDB bases many cost estimates on the improved histogram, such as estimating the size of a table and of a join between tables, which is the decisive factor in join cost and join-order optimization. MySQL can only make rough estimates from empirical formulas, whether via rows_per_key on an indexed column or a default parameter value without an index; the estimation error is large and gets magnified across multi-table joins, producing inefficient execution plans. POLARDB merges the overlapping portions of histograms and adapts different estimation algorithms to different histogram types, which greatly improves estimation accuracy and helps the optimizer choose a better join order. In tests on randomly generated normally distributed data, multi-table join queries ran 2.4 to 12 times faster after optimization; in the TPC-H benchmark, the join order of several queries changed and performance improved by 77% to 332%. POLARDB also uses histograms to optimize the logic of records_in_range: for indexed filter conditions, MySQL uses index dives to estimate the number of records in a range, which accounts for a high share of CPU time in short OLTP queries. After replacing index dives with histogram-based estimation, the response time of most queries in Taobao's core e-commerce workload was cut in half.
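To make the idea concrete, here is a toy selectivity estimator that keeps exact counts for detected high-frequency values and falls back to buckets for the tail (illustrative structures, not POLARDB's actual Compressed Histogram implementation):

```go
package main

import "fmt"

// Histogram is a toy compressed histogram: exact frequencies for hot
// values plus equi-width buckets for the remaining value space.
type Histogram struct {
	total    int
	hotFreq  map[int]int // high-frequency values tracked exactly
	buckets  []int       // row counts per bucket for the long tail
	lo, step int         // bucket layout for the tail
}

// SelectivityEq estimates the fraction of rows with column == v.
func (h *Histogram) SelectivityEq(v int) float64 {
	if f, ok := h.hotFreq[v]; ok {
		return float64(f) / float64(h.total) // exact for skewed hot keys
	}
	b := (v - h.lo) / h.step
	if b < 0 || b >= len(h.buckets) {
		return 0
	}
	// Assume uniformity only inside a single tail bucket.
	return float64(h.buckets[b]) / float64(h.step) / float64(h.total)
}

func main() {
	h := &Histogram{
		total:   1000,
		hotFreq: map[int]int{42: 400}, // 40% of rows share one hot key
		buckets: []int{200, 200, 200},
		lo:      0, step: 100,
	}
	fmt.Println(h.SelectivityEq(42)) // 0.4: skew captured exactly
	fmt.Println(h.SelectivityEq(7))  // 0.002: tail estimated per bucket
}
```

A plain equi-width histogram would average the hot key into its bucket and misestimate it badly; tracking hot values separately is what keeps skewed joins from being mis-ordered.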

05. PolarFS, a Self-Developed Distributed File System

POLARDB's storage layer uses PolarFS, a distributed file system developed independently by Aliyun and co-designed with the database for high reliability and high availability. PolarFS is the first domestically developed low-latency, high-performance distributed storage system designed for database applications with a full user-space I/O stack (see the VLDB 2018 paper "PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database"). It offers the low-latency, high-performance I/O of a local SSD architecture while providing the capacity and scalability of a distributed cluster. As storage infrastructure deeply co-designed with POLARDB, PolarFS's core competitiveness lies not only in performance and scalability but also in a long-accumulated series of high-reliability, high-availability, database-co-designed storage technologies, built up against demanding POLARDB customer workloads and large-scale public cloud development and operations.

To let POLARDB distribute queries across multiple compute nodes while maintaining global Snapshot Isolation semantics, PolarFS stores the multiple page versions (multi-version pages) dynamically generated by the B+ tree of the POLARDB storage engine. To reduce read-write conflicts, modern databases generally build concurrency control on MVCC, providing transaction isolation levels such as RC, SI, and SSI. Under MVCC, each page of the B+ tree dynamically maintains a series of versions, and transactions executing in parallel may access different versions of the same page. In a POLARDB cluster, cross-node replication lag means each compute node's view of a B+ tree page may differ, so multi-version storage can serve each node its corresponding version. When a compute node writes a page to PolarFS, it supplies the page's version information (an LSN), and PolarFS stores the version metadata along with the page; when a compute node reads a page, it likewise supplies version information and receives the corresponding (historical) version of the page from storage. The POLARDB database layer periodically sends PolarFS the low-water mark of the version numbers across all compute nodes in the cluster, and PolarFS uses it to clean up historical versions that are no longer needed.
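A greatly simplified sketch of LSN-keyed multi-version page storage (MVStore and its methods are illustrative names; PolarFS's real layout differs): writes append a version tagged with the writer's LSN, reads fetch the newest version at or below the reader's view, and the low-water mark reclaims old versions.

```go
package main

import (
	"fmt"
	"sort"
)

// pageVersion pairs a page image with the LSN the compute node
// supplied when writing it. Versions are appended in LSN order.
type pageVersion struct {
	lsn  uint64
	data string
}

// MVStore keeps multiple versions per page, like PolarFS's
// multi-version pages (greatly simplified).
type MVStore struct{ pages map[int][]pageVersion }

// Write appends a new version carrying the writer's LSN.
func (s *MVStore) Write(pageID int, lsn uint64, data string) {
	s.pages[pageID] = append(s.pages[pageID], pageVersion{lsn, data})
}

// Read returns the newest version with lsn <= the reader's view, so a
// replica lagging behind the primary still sees a consistent page.
func (s *MVStore) Read(pageID int, viewLSN uint64) (string, bool) {
	vs := s.pages[pageID]
	i := sort.Search(len(vs), func(i int) bool { return vs[i].lsn > viewLSN })
	if i == 0 {
		return "", false
	}
	return vs[i-1].data, true
}

// Truncate drops versions below the cluster-wide low-water mark,
// keeping the latest version at or below it.
func (s *MVStore) Truncate(lowWaterLSN uint64) {
	for id, vs := range s.pages {
		i := sort.Search(len(vs), func(i int) bool { return vs[i].lsn > lowWaterLSN })
		if i > 0 {
			s.pages[id] = vs[i-1:]
		}
	}
}

func main() {
	s := &MVStore{pages: map[int][]pageVersion{}}
	s.Write(1, 100, "v1")
	s.Write(1, 200, "v2")
	fmt.Println(s.Read(1, 150)) // v1 true: a lagging replica's view
	s.Truncate(200)
	fmt.Println(s.Read(1, 150)) // "" false: old version reclaimed
}
```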

Guaranteeing data reliability is the bottom line of every POLARDB design. In real distributed systems, bugs in the hardware, firmware, or software of disks, networks, and memory can corrupt data, posing all kinds of reliability challenges. On the storage side, the problems are silent errors (lost writes, misdirected writes, block corruption, and so on); in the network and memory, they mainly come from bit flips and software bugs. To keep data reliable under all kinds of anomalies (hardware faults, software faults, and operator error), POLARDB and PolarFS provide end-to-end, full-link data verification. On writes, data is checked along the whole path from the compute node's storage engine, through the intermediate links, to the PolarFS storage node, preventing anomalous data from being written at all. On reads, both PolarFS and the POLARDB storage engine checksum the data read, precisely identifying silent disk errors and preventing them from spreading. During off-peak periods, a background data-consistency scan continuously checks that each single copy's checksum is correct and that all replicas agree. Verification during data migration is just as important: whenever POLARDB migrates data in any form, it verifies not only the checksums of the data itself but also the consistency of the multiple replicas, and the data is migrated to the target only when both checks pass. This prevents, as far as possible, single-copy data errors from spreading through migration, and thereby avoids data corruption.

PolarFS also supports fast physical snapshot backup and restore for POLARDB. Snapshotting is a popular backup scheme for storage systems. In essence it uses a Redirect-On-Write mechanism: by recording metadata changes of the block device, writes are redirected to a newly replicated storage volume, so the data can later be restored to the point in time of the snapshot. A snapshot is a typical post-processing mechanism based on time and write load: creating a snapshot copies no data at that moment; instead, the backup work is spread over the time window in which writes actually occur after the snapshot is created, which makes backup and restore respond quickly. Combining the underlying storage system's snapshot mechanism with incremental redo log backup, POLARDB restores user data to a point in time more efficiently than the traditional method of full data plus incremental logical logs.
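A minimal sketch of the Redirect-On-Write idea (illustrative in-memory types, not PolarFS's on-disk format): taking a snapshot copies only the block map, and later writes leave the snapshot's view untouched, so the backup cost lands on the writes that happen afterwards.

```go
package main

import "fmt"

// Volume sketches Redirect-On-Write: Snapshot copies only metadata
// (the block map); subsequent writes go to the live view, so the
// snapshot keeps reading the original data.
type Volume struct {
	current  map[int]string // logical block -> contents (live view)
	snapshot map[int]string // frozen block map captured at snapshot time
}

// Snapshot records the current block map; no block data is copied yet.
func (v *Volume) Snapshot() {
	v.snapshot = make(map[int]string, len(v.current))
	for b, d := range v.current {
		v.snapshot[b] = d
	}
}

// Write updates only the live view; the snapshot's mapping still
// points at the pre-snapshot contents of this block.
func (v *Volume) Write(block int, data string) {
	v.current[block] = data
}

func main() {
	v := &Volume{current: map[int]string{1: "A", 2: "B"}}
	v.Snapshot()
	v.Write(1, "A'")                         // post-snapshot write, redirected
	fmt.Println(v.current[1], v.snapshot[1]) // A' A
}
```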
06. High Oracle Compatibility at a Tenth the Cost of a Commercial Database

Besides 100% compatibility with MySQL and PostgreSQL, the two most popular open-source database ecosystems, POLARDB is also highly compatible with Oracle syntax, giving traditional enterprises moving to the cloud a solution at roughly one tenth the cost of a commercial database. DMS replaces Oracle's GUI management tool OEM, and POLARDB Plus replaces the command-line tool SQL*Plus, following Oracle DBAs' usage habits. The client SDK can be switched from OCI and the Oracle JDBC Driver to libpq and the JDBC Driver; only the .so and .jar packages need replacing, and the application code itself does not change. POLARDB supports Oracle's general DML syntax and almost all of its advanced constructs, such as CONNECT BY, PIVOT, and LISTAGG, and provides full coverage of PL/SQL stored procedures and the built-in function libraries they use. For some advanced features (such as security management and AWR), it provides exactly the same format, layout, and operating syntax. Overall, POLARDB achieves comprehensive compatibility with, and replacement of, Oracle's operating methods, usage habits, ecosystem tools, SQL syntax, and report formats; combined with the migration assessment tool ADAM, applications need little or no modification.

07. Looking Ahead: More New Technologies and Enterprise-Grade Features

Beyond the technologies above, POLARDB has many new technologies and enterprise-grade features scheduled for release in the second half of 2019. They will comprehensively improve POLARDB's availability and performance while lowering its cost:

1) From elastic storage to elastic memory: warm buffer pool. POLARDB will soon support a "warm" buffer pool decoupled from the compute-node process. This greatly reduces the impact on user workloads when a compute node restarts, making instance-type changes and specification upgrades or downgrades (serverless) far less disruptive. Decoupled memory also makes on-demand dynamic expansion and contraction possible.

2) Several-fold performance gains: better DDL support (FAST DDL). POLARDB will soon support parallel DDL, which greatly reduces table-level DDL latency. Pushing parallelization to the extreme can cut the time of DDL operations such as index creation by nearly 10x. POLARDB has also heavily optimized DDL at the replication level, so DDL can be replicated across regions in bulk, faster and with less resource consumption.

3) Cross-region global database (Global Database). POLARDB supports cross-region, long-distance physical replication to help users deploy a global database. Through physical replication, data is copied in real time to data centers around the world, so queries from global users are answered by the local data center and respond faster.

4) Support for partitioned tables. POLARDB supports 100TB of storage capacity, but as a table grows, its index hierarchy deepens and data lookups slow down, while certain physical locks on a single table cap parallel DML throughput, so sensible partitioning becomes more urgent. In the past, many users relied on middleware outside the database to relieve pressure on a single table; with advances such as parallel query, POLARDB can now implement this sharding functionality more effectively inside the database, in the form of partitioned tables.
Effective partitioning not only supports larger tables but also reduces global physical lock conflicts on some database indexes, improving overall DML performance. It also better supports hot-cold data separation, storing data of different "temperatures" on different storage media to cut storage cost while preserving access performance. POLARDB enhances a series of partitioned-table capabilities, including global indexes (Global Index), foreign keys on partitioned tables (Foreign Key Constraint), and auto-extending interval partitions (Interval Partition), so that it can better handle very large tables.

5) Row-level compression. POLARDB will soon launch row-level compression. The usual industry practice is to compress at the data-page level with general-purpose algorithms (such as LZ77 or Snappy), but page-level compression brings excessive CPU overhead: changing one row requires decompressing the whole page, modifying it, and recompressing it. Moreover, in some scenarios pages bloat after compression, causing multiple index page splits. POLARDB instead uses fine-grained row-level compression, with compression methods specific to each data type. Data stays compressed both on disk and in memory, and a row is decompressed only when it is actually queried, rather than decompressing the entire page. Because data is stored compressed everywhere except at query time, the logs also record compressed data, further shrinking log volume and the pressure of shipping data and logs over the network, and the corresponding indexes likewise store only compressed data. The reduction in overall data volume is enough to offset the extra cost of decompression, so this compression greatly reduces storage without degrading performance.

6) In-memory HTAP. In the traditional database field, analytical databases are separate from online transaction processing, so online transaction data usually has to be imported into a data warehouse at the end of the day, together with earlier analytical data, to generate reports. An HTAP database saves the time and operating cost of large-scale data transfer, meets most enterprise application needs in one stop, and can produce analytical reports the same day trading closes. To meet this need, POLARDB is implementing an in-memory column store that synchronizes with POLARDB's row store directly through the physical/logical log, so that analysis-oriented operators can run real-time big-data analytics over the column data and users get analytical results in one stop.

7) X-Engine, a storage engine with hot-cold tiering. Data volumes keep growing, but not all data is accessed equally often; in practice, access shows a clear hot-cold distribution. Based on this, X-Engine designs a tiered hot-cold storage architecture that divides data into multiple tiers by access frequency and, for each tier's access characteristics, designs a matching storage structure and writes to the appropriate storage device.
Unlike traditional B-tree technology, X-Engine uses an LSM-tree as the foundation of its tiered storage and applies multiple transaction queues and pipeline processing to reduce thread context-switch overhead; by balancing the task ratio of each pipeline stage, the whole pipeline stays fully utilized and transaction performance improves substantially. Data-reuse technology lowers the cost of data merging and reduces the performance jitter caused by cache eviction, and FPGA hardware is further used to accelerate the compaction process, raising the system's ceiling higher still. Compared with other storage engines of similar architecture, such as RocksDB, X-Engine improves transaction performance by more than 10x. For the details, see the SIGMOD 2019 paper "X-Engine: An Optimized Storage Engine for Large-scale E-Commerce Transaction Processing".

Today POLARDB not only supports Alibaba Group businesses such as Taobao, Tmall, and Cainiao but is also widely used in government, retail, finance, telecommunications, manufacturing, and other fields. So far, 400,000 databases have been migrated to Aliyun. Running on POLARDB, Beijing's public transport system quickly and smoothly schedules more than 20,000 buses across the city, serving 8 million trips a day, and ZhongAn Insurance uses it to process insurance policy data, raising efficiency by 25%.

This article is original content from the Yunqi community and may not be reproduced without permission.
