In this article I'd like to share how PostgreSQL implements parallel query. I hope you learn something from it; let's look at it together!
Background of parallel query
With SSDs and other fast storage now widespread, and servers routinely carrying hundreds of GB of memory, performance problems at the I/O level have been greatly alleviated. To improve a database's ability to scale, we can pursue Scale Out, adding machines and moving toward a distributed architecture, or pursue Scale Up, adding hardware components and making full use of each hardware resource to maximize single-machine performance. By comparison, Scale Up is a low-cost expansion scheme: it accelerates performance through software, relying on optimization at the software level.
Besides stronger disks and more memory, modern servers also come with powerful multi-CPU configurations. Database operations such as Join and aggregation consume a lot of memory, and much of their time is spent on data exchange and caching, so CPU utilization is not high. Concurrent execution is therefore a common CPU-oriented acceleration strategy.
Query performance is the core metric for judging the quality of an OLAP database product. Parallel query focuses on data reading and computation, achieving parallelism by decomposing operations such as Join, aggregation, and sorting into multiple sub-operations.
The challenge of parallel query is to keep the system overhead of data slicing, inter-process or inter-thread communication, and concurrency control from failing to improve performance, or even degrading it. On the implementation side, planning a parallel plan in the optimizer is also very difficult for many databases.
PostgreSQL's parallel query feature was developed mainly by Robert Haas, a core developer in the PostgreSQL community. According to Robert Haas's personal blog, the community's development timeline for the parallel query feature was as follows:
In October 2013, two pieces of infrastructure were added to the implementation framework: Dynamic Background Workers and Dynamic Shared Memory.
In December 2014, Amit Kapila submitted a simple version of the parallel sequential scan patch.
In March 2015, the official version of the parallel sequential scan patch was submitted.
In March 2016, support for parallel joins and parallel aggregation was added.
In April 2016, it was released as a new feature of PostgreSQL 9.6.
PostgreSQL's parallel query is most effective for Join and Merge workloads over large data volumes (intermediate results at the GB level or above). Because of the system overhead, the resources invested do not scale linearly with the performance gain: adding 4 workers, for example, may improve performance by roughly 2x rather than 4x. TPC-H test results show that ad-hoc query scenarios generally see an acceleration effect.
Description of parallel query function
Currently, the following three parallel scenarios are supported:
Parallel sequential scan
Parallel join
Parallel aggregation
For safety reasons, the following four scenarios do not support parallelism:
Scans of common table expressions (CTEs)
Scans of temporary tables
Scans of foreign tables (unless the foreign data wrapper provides an IsForeignScanParallelSafe API)
Access to an InitPlan or SubPlan
Parallel queries are also subject to the following restrictions:
The query must be strictly read-only and must not change any database state
Execution cannot be suspended partway through the query
The isolation level cannot be SERIALIZABLE
The query cannot call any PARALLEL UNSAFE function
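On the last point, whether a function may run as part of a parallel query is declared when the function is created. The following is a minimal sketch (the function name and body are hypothetical examples) of marking a function PARALLEL SAFE; using a PARALLEL UNSAFE function in a query forces a serial plan:

```sql
-- Hypothetical example: a pure function can be declared PARALLEL SAFE
-- so that queries calling it remain eligible for parallel plans.
CREATE FUNCTION add_tax(amount numeric) RETURNS numeric AS $$
    SELECT amount * 1.1;
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

-- Functions default to PARALLEL UNSAFE; an existing function can be relabeled:
ALTER FUNCTION add_tax(numeric) PARALLEL SAFE;
```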
Whether a query actually runs in parallel is decided by cost-based strategies; for example, when the data volume is small, the query executes normally (serially) by default. Among PostgreSQL's configuration parameters, several relate to parallel query. To test parallelism, we generally set the following two parameters:
force_parallel_mode: forces parallel mode to be switched on
max_parallel_workers_per_gather: sets the number of worker processes used for a parallel query
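As a minimal sketch of how these might be set for a test session (the values are illustrative, not recommendations):

```sql
-- Session-level settings for experimenting with parallel query.
SET max_parallel_workers_per_gather = 2;  -- allow up to 2 workers under each Gather node
SET force_parallel_mode = on;             -- testing aid: prefer a parallel plan even if not estimated cheaper
```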
Consider a simple two-table Join query scenario. Using parallel query mode, the query plan takes the following shape:
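The plan below is an illustrative sketch rather than the original output; the table names big_table and small_table, the join columns, and the worker count are assumptions chosen to match the description that follows:

```sql
-- Illustrative only: a parallel hash join plan in PostgreSQL 9.6 typically
-- has a Gather node on top, a Parallel Seq Scan on the large (probe) table,
-- and a hash table built from the small table inside each worker.
EXPLAIN (COSTS OFF)
SELECT *
FROM   big_table b
JOIN   small_table s ON s.id = b.sid;

--                 QUERY PLAN
-- ---------------------------------------------
--  Gather
--    Workers Planned: 2
--    ->  Hash Join
--          Hash Cond: (b.sid = s.id)
--          ->  Parallel Seq Scan on big_table b
--          ->  Hash
--                ->  Seq Scan on small_table s
```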
When parallel query is enabled, the planner generates a Gather + Partial style execution plan, which means that the Partial part of the plan is handed to the Executor layer to be executed in parallel.
As the execution plan shows, the parallel query creates two additional worker processes, which together with the original master process gives three processes in total. The Join's driving-table data is split evenly into three shares, the I/O is spread out by the parallel scan, and each process then performs the Join against the large table's data.
Implementation of parallel query
PostgreSQL implements parallelism with multiple processes. Each such process is internally called a worker, and workers can be created and destroyed dynamically. PostgreSQL does not parallelize the SQL parsing and query-plan generation phases; it is in the Executor module that the sub-tasks carved out of the plan are executed concurrently by multiple workers. Even when a query plan is executed in parallel, the existing backend process takes part as a worker in completing the parallel sub-tasks; we can call it the master process. In addition, n worker processes are started, according to the number specified by the configuration parameter, to execute the other sub-plans.
PostgreSQL continues to use its shared-memory mechanism here: shared memory is allocated for the workers at initialization, and the workers use it to obtain plan data and to buffer intermediate results. There is no complex communication mechanism among the workers themselves; the main process performs only simple communication with them to start them up and drive execution of the plan.
The parallel execution model in PostgreSQL is shown in figure 1.
Figure 1 Framework of PostgreSQL parallel query
Taking the Hash Join scenario above as an example, the executor-level execution flow of the parallel query is shown in figure 2.
Figure 2 execution flow of parallel query
The workers collaborate to perform the task in the following ways:
First, each worker performs the same kind of task. Because this is a Hash Join, each worker uses the table with the smaller amount of data as the driving table to build a hash table. Every worker maintains such a hash table, while the large table's data is divided evenly and joined against it.
The lowest level of parallelism is the parallel scan of the disk: each worker process fetches the blocks it is responsible for scanning from the disk blocks.
The data produced by the Hash Join is a subset of the full result. For an aggregate function such as count(), the computation can be performed separately on each subset and the partial results then merged, and the result is still guaranteed to be correct.
After the data is gathered back together, an ordinary, final aggregation is performed (see the plan sketch below).
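As a minimal illustration of this partial-then-final aggregation (again using the hypothetical big_table from the earlier sketch), a count() query might produce a plan of the following shape:

```sql
-- Illustrative only: each worker computes a Partial Aggregate over its share
-- of the parallel scan, and the leader combines them in a Finalize Aggregate.
EXPLAIN (COSTS OFF)
SELECT count(*) FROM big_table;

--              QUERY PLAN
-- ------------------------------------------------
--  Finalize Aggregate
--    ->  Gather
--          Workers Planned: 2
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on big_table
```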
How are worker processes created and run? Let's first look at the creation logic of a worker (see figure 3).
Figure 3 worker creation of PostgreSQL
PostgreSQL's parallel processing is based on dynamically created workers. A worker is initialized by the main process, which first specifies the worker's entry function in its context.
For parallel queries, the entry function is ParallelWorkerMain. Inside ParallelWorkerMain, after a series of signal-handling and related setup steps are completed, ParallelQueryMain is called to execute the query. ParallelQueryMain creates a new executor context and recursively executes the parallel sub-query plan.
A worker process used for a parallel query is driven by signals from the main process: once the signal to create the process is sent, the worker process starts and runs the ParallelWorkerMain function, which in turn runs ParallelQueryMain. Each worker process executes its sub-plan independently and stores its results in shared memory. After all the processes have finished, the master process collects the result data (tuples) from shared memory and integrates it.
Improvements to parallel query
Since the parallel query feature was released, there has been no shortage of evaluations of its parallelism and of plans for follow-up improvements. The community's parallel query developers have mentioned in their blogs a plan to build a large shared hash table, so that the parallelism of Hash Join operations can be improved further.
Figure 4 create a large Hash table to share data
In addition, several database products forked from PostgreSQL implemented parallel query before PostgreSQL itself did, so they can serve as references:
Distributed Framework of Postgres-XC
MPP architecture of GreenPlum
Distribution of CitusDB
Multithreaded parallelism of VitesseDB
Parallelism of Fujitsu Enterprise PostgreSQL
Among them, the parallel architecture of the open-source database GreenPlum is especially worth studying. GreenPlum's parallel query design includes a dedicated scheduler that coordinates the allocation of query tasks, something PostgreSQL lacks. Briefly, GreenPlum's implementation framework has the following three-layer structure:
QD: the scheduler sends the optimized query plan to the executors (QE) on all data nodes (Segments). The scheduler is responsible for managing task execution, including executor creation and destruction, error handling, task cancellation, status updates, and so on.
QE: after receiving the query plan sent by the scheduler, an executor runs the part of the plan it is responsible for. Typical operations include data scanning, hash joins, sorting, aggregation, and so on.
Interconnect: responsible for data transfer between the nodes in the cluster
GreenPlum broadcasts or redistributes data according to its data distribution, an approach that PostgreSQL's parallel model could borrow.
With only a large shared hash table, data access still incurs serial overhead, so worker parallelism remains limited. As shown in figure 5, in Join scenarios with one large and one small table, GreenPlum's data broadcast mechanism can be borrowed: the driving table's data is prepared for every worker process, which is equivalent to broadcasting a copy of the data to each of them. In this way the data is highly shared and the parallel effect is better.
Beyond the PostgreSQL ecosystem, Oracle, the big brother of relational databases, has accumulated 30 years of experience with parallel query, which is also worth drawing on. Its parallel query mechanism is described in Oracle's official manual.
Figure 5 improve the parallel effect by drawing lessons from the broadcast mechanism of GreenPlum
Oracle can slice the data to a high degree within each operation; see the parallel flow of a Hash Join shown in figure 6.
Figure 6 parallel flow of Hash Join operations for Oracle
In its internal parallel control, once the data has been grouped, several groups of workers can divide and conquer the grouped data, whether the operation is a scan or a sort.
In other words, Oracle achieves Operator-level parallelism: within each operation, the data is sliced and then processed by a dynamic parallel operation. Because every stage of an operation can divide and conquer the data, the degree of parallelism is very high. This also places high demands on data flow: data and operations must be able to divide and conquer both horizontally and vertically.
PostgreSQL currently implements task-level parallelism, vertically dividing the original execution plan into several independent sub-tasks. This parallel implementation is simple, but with large data volumes the degree of parallelism is insufficient, the load on shared-memory access increases, and the performance improvement is not obvious.
Figure 7 dynamic parallel operations within Oracle
Following Oracle's approach, after the improvement shown in the figure above, a worker no longer performs a single fixed task but can be called upon to perform an operation at any time. Data is layered, sliced, and broadcast according to the operation, and the worker processes serve the data operations rather than the data serving the workers. In this way, in very-large-data scenarios, the driving-table side acts as the producer, partitioning the data, while the outer side acts as the consumer, performing the operator work. Combining multiple sets of such operations makes parallel computation freer and leaves much more room for performance, and this is what our team is currently trying to do.
Figure 8 improve the parallelism of PostgreSQL through data grouping and worker grouping
After reading this article, I believe you have gained some understanding of how PostgreSQL implements parallel query. If you would like to learn more, please follow the industry information channel. Thank you for reading!