
How to understand the parallel execution of Oracle



This article explains how to understand the parallel execution of Oracle. Many people have doubts about how Oracle parallel execution actually works, so the following sections walk through a series of simple, hands-on examples; I hope they help answer those questions. Please follow along and study them with me.

Terminology:

S: time unit, seconds.

K: quantity unit, one thousand.

M: quantity unit, one million; or time unit, minutes.

DoP: Degree of Parallelism, the degree of parallelism of a parallel execution.

QC: Query Coordinator of a parallel query.

PX process: Parallel Execution Slave.

AAS: Average Active Sessions, the average number of active sessions during parallel execution.

Distribution: the PQ distribution method, that is, how data is redistributed during parallel execution, including replicate, broadcast, hash and adaptive distribution; adaptive distribution is a new feature introduced in 12c. I will explain each of these in this article.

Left side of hash join: the driving table, the build side of the hash join, usually the smaller table.

Right side of hash join: the driven table, the probe side of the hash join, usually the larger table.

Bloom filter: an in-memory data structure used to determine whether an element belongs to a set.

Test environment and data

The Oracle version is 12.1.0.2.2, running as a two-node RAC on Exadata X3-8 hardware.

This is a typical star model. The fact table lineorder has 300 million rows; the dimension tables part and customer contain 1.2m and 1.5m rows respectively. None of the three tables is partitioned, and lineorder is close to 30GB in size.

select owner seg_owner, segment_name seg_segment_name, segment_type, round(bytes/1048576,2) SEG_MB
from dba_segments
where owner = 'SID'
and segment_name in ('LINEORDER','PART','CUSTOMER')
/

OWNER   SEGMENT_NAME   SEGMENT_TYPE   SEG_MB
------- -------------- -------------- ---------
SID     LINEORDER      TABLE          30407.75
SID     CUSTOMER       TABLE          168
SID     PART           TABLE          120

For all the tests in this article, unless otherwise noted, the 12c adaptive plan features are turned off: the parameter optimizer_adaptive_features is set to false. Adaptive-related features such as cardinality feedback, the adaptive distribution method and adaptive join are therefore not enabled. If you check the outline data of the execution plans, you will find that the optimizer-related hidden parameters are set to off. In fact, the 12c optimizer is much more complex than previous versions because of the introduction of the adaptive plan features; I find analyzing the new 12c optimizer features very challenging, and may attempt it in another article.
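
The original does not show the exact statement used to disable these features; as a minimal sketch, assuming the test sessions simply set the 12.1 parameter named above, it could look like this:

-- hedged sketch: turn off the 12.1 adaptive optimizer features for the current session
alter session set optimizer_adaptive_features = false;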

select * from table(dbms_xplan.display_cursor('77457qc9a324kb'));
...
Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.1.0.2')
      DB_VERSION('12.1.0.2')
      OPT_PARAM('_optimizer_use_feedback' 'false')
      OPT_PARAM('_px_adaptive_dist_method' 'off')
      OPT_PARAM('_optimizer_dsdir_usage_control' 0)
      OPT_PARAM('_optimizer_adaptive_plans' 'false')
      OPT_PARAM('_optimizer_strans_adaptive_pruning' 'false')
      OPT_PARAM('_optimizer_gather_feedback' 'false')
      OPT_PARAM('_optimizer_nlj_hj_adaptive_join' 'false')
      OPT_PARAM('optimizer_dynamic_sampling' 11)
      ALL_ROWS
      ...
      END_OUTLINE_DATA
  */

A first taste of parallel execution: serial execution

The following SQL joins customer and lineorder and calculates the total revenue of all orders. It is executed serially, without a parallel hint:

select /*+ monitor */ sum(lo_revenue) from lineorder, customer where lo_custkey = c_custkey;

When executed serially, the SQL execution time is 1.5 minutes and db time is also 1.5 minutes. The execution plan has five lines: a single user process completes all the operations of scanning the two tables customer and lineorder, the hash join, the aggregation, and returning the data. At this point AAS (average active sessions) is 1, and SQL execution time equals db time. Almost all of the db time is db cpu; 72% of the cpu is spent on the hash join operation in line 2. Because the test machine is an Exadata X3-8, processing the 30GB of IO requests takes very little time. A Cell Offload Efficiency of 87% means that the storage nodes perform the scan, filter out unneeded columns, and return to the compute node only 13% of the 30GB of data.
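
The timing and percentage figures quoted here come from SQL monitor report screenshots that are not reproduced in this text. As a hedged sketch, a text version of such a report could be pulled with DBMS_SQLTUNE (the sql_id value is a placeholder):

-- hedged sketch: fetch the SQL monitor report of a monitored statement as text
select dbms_sqltune.report_sql_monitor(
         sql_id       => '&sql_id',   -- placeholder: sql_id of the statement above
         type         => 'TEXT',
         report_level => 'ALL') as report
from   dual;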

Parallel execution

Using the hint parallel(4) to specify DoP=4, execute the same SQL in parallel:

select /*+ monitor parallel(4) */ sum(lo_revenue) from lineorder, customer where lo_custkey = c_custkey;

The SQL execution time is 21s and db time is 1.4 minutes; DoP=4, executed across the two instances. The execution plan grows from 5 lines to 9 lines, adding four operations from the bottom up: 'PX BLOCK ITERATOR', 'SORT AGGREGATE', 'PX SEND QC (RANDOM)' and 'PX COORDINATOR'.

The operations in lines 3 to 8 are processed in parallel, and the SQL executes in the following order: each PX process scans the dimension table customer (line 6), scans 1/4 of the fact table lineorder by block address ranges (lines 7-8), performs the hash join (line 5), pre-aggregates the joined data (line 4), and finally sends its result to QC (line 3). After receiving the data (line 2), QC performs a further aggregation (line 1) and finally returns the data (line 0).

The SQL execution time is four times shorter than before, because the most time-consuming operations, the full table scan of lineorder, the hash join and the aggregation, are handled by four processes in parallel, so the final SQL execution time is 1/4 of the serial execution. On the other hand, db time did not decrease significantly: 1.4 minutes in parallel versus 1.5 minutes serially. From the system's point of view, the two executions consume roughly the same system resources.

With DoP=4, because no data distribution is involved, QC only needs to allocate one set of PX processes: the four PX processes are p000/p001 on instances 1 and 2. We can view these four PX processes from the system. Each PX process consumes roughly the same amount of db time, CPU and IO resources. AAS=4, which is the ideal scenario: each PX process does the same amount of work and remains active the whole time, with no serial point and no parallel skew.

When viewing the activity information, AAS=4. To display the activity information more clearly, check the "CPU Cores" check box.

The four PX processes can be seen on the Linux system:

[oracle@exa01db01 sidney]$ ps -ef | egrep "p00[01]_SSB"
oracle   20888      1   4  2014 ?   18:50:59 ora_p000_SSB1
oracle   20892      1   4  2014 ?   19:01:29 ora_p001_SSB1
[oracle@exa01db01 sidney]$ ssh exa01db02 'ps -ef | egrep "p00[01]_SSB"'
oracle   56910      1   4  2014 ?   19:01:03 ora_p000_SSB2
oracle   56912      1   4  2014 ?   18:53:30 ora_p001_SSB2
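
Besides the OS-level view above, the same slaves can also be checked from inside the database; here is a hedged sketch using the GV$PX_PROCESS view:

-- hedged sketch: list the allocated PX slaves and their OS process ids on both instances
select inst_id, server_name, status, pid, spid, sid
from   gv$px_process
order  by inst_id, server_name;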

Summary

In the example in this section, with DoP=4 the parallel execution allocates four PX processes and brings a 4x performance improvement. The SQL monitor report contains both the overall information and the details of the parallel execution, such as the QC, the DoP, the instances on which the parallel execution ran, the resources consumed by each PX process, and the AAS while the SQL executed.

Producer-consumer model

In the parallel execution example above, each PX process scans the dimension table customer and then scans the fact table lineorder for the hash join. No data needs to be distributed; only one set of PX processes needs to be allocated. This behavior of replicating the dimension table is a new feature of 12c, controlled by the parameter _px_replication_enabled.

More often, when executing in parallel, QC needs to allocate two sets of PX processes that cooperate as producers and consumers to complete the parallel execution plan. The architecture is shown in Figure 1:

Broadcast distribution: one data distribution

To demonstrate how two sets of PX processes cooperate, set _px_replication_enabled to false. QC then allocates two sets of PX processes, one set as producers and one set as consumers.
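
The text does not show how the parameter was changed; a hedged sketch of a session-level setting for this test (hidden parameters should only be touched in test environments):

-- hedged sketch: disable dimension table replication so that broadcast distribution is used instead
alter session set "_px_replication_enabled" = false;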

As shown in the figure below, the SQL execution time slows slightly to 23s, while db time is still 1.5 minutes.

The biggest change is in the execution plan, which now has 12 lines. It adds a parallel scan of customer ('PX BLOCK ITERATOR', line 8), a distribution ('PX SEND BROADCAST') and a receive ('PX RECEIVE'). There are now two sets of PX processes in the execution plan: in addition to the previous blue multi-person icon, there is now a red one. The SQL now executes in the following order:

The four red PX processes act as producers: they scan the dimension table customer and distribute the data by broadcast to each blue PX process acting as a consumer. Because DoP=4, each scanned record is copied into 4 copies: line 9 of the SQL monitor report shows the full table scan of customer returning 1.5m rows, while the distribution at line 8 and the receive at line 7 become 6m rows. Each blue consumer PX process holds a complete copy of all customer records and prepares the build table for the hash join at line 5.

The four blue PX processes, as consumers, scan the fact table lineorder by block address ranges (lines 10-11); at the same time, they hash join the scanned rows with the customer data they already hold (line 5), and then pre-aggregate the rows that satisfy the join condition (line 4). Because the goal of the query is to sum all lo_revenue, each PX process afterwards only needs to output one total.

The four blue PX processes then act in turn as producers, sending the aggregated data to QC, the consumer (lines 3 and 2). QC performs the final aggregation on the four received rows and returns the result to the user.

With broadcast distribution, only the customer data needs to be broadcast to each consumer; the lineorder data does not need to be redistributed. Because the data volume of lineorder is much larger than that of customer, distributing lineorder should be avoided, and this execution plan is very suitable for star-model data.

Observe the information under the Parallel tab of the SQL monitor report. The red PX processes are the p002/p003 processes on instances 1 and 2, and the blue PX processes are the p000/p001 processes. Because the blue PX processes are responsible for scanning the fact table lineorder, the hash join and the aggregation, they consume almost all of the db time.

How the producer-consumer model works

After the parallel query, you can verify the execution described above through the view V$PQ_TQSTAT.

The p002/p003 processes on instances 1 and 2, as producers, each scan roughly 1/4 of customer's records and broadcast every record to the four consumer PX processes through table queue 0 (TQ_ID=0); the total number of records sent is 6m. Each p000/p001 process, as a consumer, receives the full 1.5m rows of customer records, and the sum of received records is also 6m.

The p000/p001 processes on instances 1 and 2, as producers, each send one aggregated result row through table queue 1 (TQ_ID=1) to QC, the consumer. QC, as the consumer, receives four rows.

SELECT dfo_number, tq_id, server_type, instance, process, num_rows
FROM V$PQ_TQSTAT
ORDER BY dfo_number DESC, tq_id, server_type desc, instance, process;

DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS
---------- ----- ----------- -------- ------- --------
         1     0 Producer           1 P002     1461932
         1     0 Producer           1 P003     1501892
         1     0 Producer           2 P002     1575712
         1     0 Producer           2 P003     1460464
         1     0 Consumer           1 P000     1500000
         1     0 Consumer           1 P001     1500000
         1     0 Consumer           2 P000     1500000
         1     0 Consumer           2 P001     1500000
         1     1 Producer           1 P000           1
         1     1 Producer           1 P001           1
         1     1 Producer           2 P000           1
         1     1 Producer           2 P001           1
         1     1 Consumer           1 QC             4

13 rows selected.

So, what do the DFO_NUMBER and TQ_ID columns mean in the above output?

DFO stands for Data Flow Operator, an operation in the execution plan that can be executed in parallel. A QC corresponds to one DFO tree, which can contain multiple DFOs; the DFO_NUMBER of all parallel operations in the same QC is the same, and in this case all DFO_NUMBER values are 1. It is rare to see an execution plan that contains multiple QCs, for example a statement using union all, where each branch of the union all is an independent DFO tree and different DFO trees can be executed in parallel. This article only discusses cases where there is a single QC in the execution plan.

TQ stands for table queue, used for communication between PX processes or with QC. In the execution plan above, table queue 0 is the connection between the two sets of PX processes, and table queue 1 is the connection between the PX processes and QC. Producers distribute data through a table queue and consumers receive data from it; different table queue numbers represent different data distributions. Through table queues, we can understand the essence of Oracle's producer-consumer model for parallel execution:

There are at most two sets of PX processes in the same DFO tree. Each producer process has a connection to each consumer process, and each PX process has a connection to QC. Assuming DoP=n, the total number of connections is n*n + 2*n, so as n grows the total number of connections explodes. When Oracle designed parallel execution around the producer-consumer model, it allocated at most two sets of PX processes per DFO precisely because of this connection complexity. With DoP=100, the total number of connections between the two sets of PX processes is 10000. If three sets of PX processes were allowed to cooperate on a parallel execution plan, the number of connections among the three sets would be on the order of 1 million, and maintaining that many connections would be an impossible task.
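
As a quick check of this formula with the numbers used in this article: for DoP=4, the two sets of PX processes need 4*4 = 16 connections between them plus 2*4 = 8 connections to QC, 24 in total; for DoP=100 the count is 100*100 + 2*100 = 10200, which is why adding a third set of PX processes is not an option.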

In the same DFO tree, only one data distribution between the two sets of PX processes can be active at a time. If the execution path is long and the data needs to be distributed multiple times, the two sets of PX processes swap the roles of producer and consumer and cooperate to complete all parallel operations; each data distribution uses a different table queue number. An active data distribution requires both sets of PX processes: one set sends data as producers and the other receives it as consumers. Because there are at most two sets of PX processes in a DFO, only one data distribution between PX processes can be active at a time. If the PX processes need to distribute data several times in the execution plan, Oracle may need to insert blocking points into the plan, such as BUFFER SORT and HASH JOIN BUFFERED, to ensure the previous data distribution is complete before the next one starts. I will explain the impact of these blocking points in later chapters. In this example, table queue 0 and table queue 1 can work at the same time because table queue 0 links the two sets of PX processes while table queue 1 links the PX processes to QC; the two are independent of each other, so they can be active simultaneously.

Between PX processes, or between a PX process and QC, there is at least one message buffer (up to three on a single node and up to four in a RAC environment) for inter-process data exchange. The message buffers are allocated from the large pool by default (or from the shared pool if the large pool is not configured). Multiple buffers are used to achieve asynchronous communication and improve performance.

The size of each message buffer is controlled by the parameter parallel_execution_message_size, which defaults to 16KB.
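
A hedged sketch for checking this setting on your own system:

-- hedged sketch: check the message buffer size (and a related limit) on this database
select name, value, isdefault
from   v$parameter
where  name in ('parallel_execution_message_size', 'parallel_max_servers');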

When both processes are on the same node, data is exchanged by passing and receiving message buffers in the large pool (or the shared pool if no large pool is configured). When the two processes are on different nodes, data is exchanged over the RAC interconnect, and the receiving side needs to cache the data in its local large pool (or shared pool if no large pool is configured).

Summary

To illustrate how the producer-consumer model works in parallel execution, I used broadcast distribution: QC allocates two sets of PX processes, one as producers and one as consumers. The two sets of PX processes distribute data through table queues, between themselves and with QC, and cooperate to complete the whole parallel execution plan. The view V$PQ_TQSTAT records how the data was distributed during parallel execution. Through the description of DFOs and table queues, I have explained how the producer-consumer model works and how the communication happens. Some of these descriptions may seem abrupt for now; don't worry, I will use more examples in later chapters to help you understand.

How to read parallel execution plans

The table queue numbers represent the order in which data is distributed in a parallel execution plan. To understand how the parallel operations in an execution plan are performed, the principle is simple: follow the order of the table queues.

To determine the execution order of the SQL from the SQL monitor report, you need to check the table queue names in the Name column, such as :TQ10000 (DFO=1, table queue 0) and :TQ10001 (DFO=1, table queue 1), together with the color of the PX processes.

The following example is the output of dbms_xplan.display_cursor. For parallel execution plans, there are three extra columns (see the sketch after this list for how to produce such output):

1. TQ column: values such as Q1,00 or Q1,01, where Q1 represents the first DFO and 00 or 01 is the table queue number.

A. The TQ column of the operations with ID 7~9 is Q1,00. This set of PX processes executes first as producers and sends the data to the consumers through broadcast distribution.

B. The TQ column of the operations with ID 10~11 and 3~6 is Q1,01. After receiving the customer data as consumers, scanning lineorder, hash joining and aggregating, this set of PX processes, now acting as producers, sends the data to QC through table queue 1.

2. IN-OUT column: indicates the flow and distribution of the data.

PCWC: parallel combine with child.

PCWP: parallel combine with parent.

P->P: parallel to parallel.

P->S: parallel to serial.

3. PQ Distrib column: how the data is distributed. In this execution plan, the broadcast method is used; I will discuss the other distribution methods in the following sections.
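
As referenced above, here is a hedged sketch of producing such a plan display; the sql_id is a placeholder, and the PARALLEL format keyword controls the TQ, IN-OUT and PQ Distrib columns:

-- hedged sketch: display a cursor's parallel plan with the TQ, IN-OUT and PQ Distrib columns
select *
from   table(dbms_xplan.display_cursor(
               sql_id          => '&sql_id',   -- placeholder
               cursor_child_no => 0,
               format          => 'TYPICAL +PARALLEL'));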

Hash distribution: two data distributions

Besides broadcast distribution, the other common parallel distribution is hash. To observe how the SQL executes with hash distribution, I use the pq_distribute hint:

select /*+ monitor parallel(4) leading(customer lineorder) use_hash(lineorder) pq_distribute(lineorder hash hash) */ sum(lo_revenue)
from lineorder, customer
where lo_custkey = c_custkey;

With hash distribution, the SQL execution time is 29s and db time is 2.6 minutes. Compared with the broadcast approach, both the execution time and db time increase by about 40%.

The execution plan is as follows: it now has 14 lines and adds a hash distribution of lineorder. 'PX SEND HASH' in line 11 distributes 300 million rows through the hash function, and 'PX RECEIVE' in line 10 receives those 300 million rows through table queue 1; these two operations consume 38% of the db cpu, which is why the SQL execution time and db time are longer. The SQL now executes in the following order:

The red PX processes, as producers, scan customer in parallel (lines 8-9), apply the hash function to the join key c_custkey of each row and, based on the hash value, send the row through table queue 0 to one of the four blue consumers (line 7). Hash distribution does not copy the data: the Actual Rows column of lines 6-9 in the SQL monitor report shows 1.5m.

The red PX processes, as producers, also scan lineorder in parallel (lines 12-13), apply the same hash function to the join key lo_custkey and send each row through table queue 1 to one of the four blue consumers (line 11). Using the same hash function ensures that the same join key values of customer and lineorder are sent to the same consumer, which guarantees a correct hash join result. Because 300 million rows must go through the hash function and then be distributed (this is inter-process communication, possibly across the RAC interconnect), this huge extra overhead is the reason for the 38% increase in db cpu.

The four blue PX processes, as consumers, receive 1.5m rows from customer (line 6) and 300 million rows from lineorder (line 10), perform the hash join (line 5) and pre-aggregate (line 4).

The four blue PX processes then act in turn as producers, sending the aggregated data to QC, the consumer, through table queue 2 (lines 3 and 2). QC performs the final aggregation on the four received rows and returns the result to the user (lines 1 and 0).

Observe the information under the Parallel tab of the SQL monitor report. The red PX processes are the p002/p003 processes on instances 1 and 2, and the blue PX processes are the p000/p001 processes. The red PX processes, as producers, are responsible for scanning the fact table lineorder and hash-distributing its 300 million rows, and account for more than half of the db time.

Because the distribution and receipt of 300 million rows is involved, the red producer PX processes and the blue consumer PX processes need to be active at the same time. The activity information in the SQL monitor report shows that, most of the time, AAS exceeds the parallelism of 4, which means the two sets of PX processes are working simultaneously. This is unlike replicate or broadcast distribution, where AAS is 4 and only one set of PX processes stays active.



After the parallel query, the execution process described above can be further verified through the view V$PQ_TQSTAT. This parallel execution involves three table queues, 0/1/2, and V$PQ_TQSTAT contains 21 rows of records.

1. The p002/p003 processes on instances 1 and 2, as producers, each scan roughly 1/4 of customer's records and send them through table queue 0 (TQ_ID=0) to the consumer p000/p001 processes. The sums of customer records sent and received are both 1.5m.

Number of records sent: 1500000 = 365658 + 364899 + 375679 + 393764

Number of records received: 1500000 = 374690 + 374924 + 375709 + 374677

2. The p002/p003 processes on instances 1 and 2, as producers, each scan roughly 1/4 of lineorder's records and send them through table queue 1 (TQ_ID=1) to the consumer p000/p001 processes. The sums of lineorder records sent and received are both 300005811.

Number of records sent: 300005811 = 74987629 + 75053393 + 74979748 + 74985041

Number of records received: 300005811 = 74873553 + 74968719 + 75102151 + 75061388

3. The p000/p001 processes on instances 1 and 2, as producers, each send one aggregated result row through table queue 2 (TQ_ID=2) to QC, the consumer. QC, as the consumer, receives four rows.

SELECT dfo_number, tq_id, server_type, instance, process, num_rows
FROM V$PQ_TQSTAT
ORDER BY dfo_number DESC, tq_id, server_type desc, instance, process;

DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS
---------- ----- ----------- -------- ------- ---------
         1     0 Producer           1 P002       365658
         1     0 Producer           1 P003       364899
         1     0 Producer           2 P002       375679
         1     0 Producer           2 P003       393764
         1     0 Consumer           1 P000       374690
         1     0 Consumer           1 P001       374924
         1     0 Consumer           2 P000       375709
         1     0 Consumer           2 P001       374677
         1     1 Producer           1 P002     74987629
         1     1 Producer           1 P003     75053393
         1     1 Producer           2 P002     74979748
         1     1 Producer           2 P003     74985041
         1     1 Consumer           1 P000     74873553
         1     1 Consumer           1 P001     74968719
         1     1 Consumer           2 P000     75102151
         1     1 Consumer           2 P001     75061388
         1     2 Producer           1 P000            1
         1     2 Producer           1 P001            1
         1     2 Producer           2 P000            1
         1     2 Producer           2 P001            1
         1     2 Consumer           1 QC              4

21 rows selected.

Summary


We have now observed the parallel execution of the SQL with hash distribution. The biggest difference between hash distribution and broadcast is that it distributes both sides of the hash join. In this example, the hash distribution of lineorder adds significant db cpu. In the next section, I will use another example to illustrate the scenarios where hash distribution is appropriate.

The choice of replicate, broadcast and hash

We have tested three distribution methods: replicate, broadcast and hash.

Replicate: each PX process repeatedly scans the left side of the hash join (each process scans it in full), and the buffer cache is used to cache that small left-side table, reducing the physical reads required for the repeated scans. Compared with broadcast distribution, replicate needs only one set of PX processes. However, replicate cannot always replace broadcast distribution, because replicate is limited to cases where the left side of the hash join is a single table; if the left side of the hash join is a result set produced by other operations, such as a join or a view, replicate cannot be used. (An explicit-hint sketch for requesting replicate follows this list.)

Broadcast distribution: the PX processes acting as producers broadcast the result set on the left side of the hash join to every PX process acting as a consumer. It is generally suitable for scenarios where the result set on the left side of the hash join is much smaller than the right side, such as a star model.

The essence of hash distribution: both sides of the hash join (the two data sources) are redistributed through the same hash function and cut into N work units (assuming DoP=N) before being joined, in order to reduce the amount of data each PX process has to handle when performing the join. The cost is that both sides of the hash join must be distributed. For the example of customer joined to lineorder, because the dimension table customer holds far less data than the fact table lineorder, replicate or broadcast distribution of customer is clearly the better choice, since lineorder then does not need to be redistributed. If two large tables are joined, the join operation becomes the bottleneck of the whole execution plan and hash distribution is the only appropriate method: to reduce the cost of the join, the cost of hash-distributing both sides of the hash join is acceptable.
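
As mentioned in the replicate item above, and based on the outline data shown later in the data-skew example, a hedged sketch of requesting replicate explicitly for the customer/lineorder query could look like this; whether the optimizer honors it still depends on its own checks:

-- hedged sketch: request broadcast of the build side of the join to lineorder, converted to replicate
select /*+ monitor parallel(4) leading(customer lineorder) use_hash(lineorder)
           pq_distribute(lineorder broadcast none) pq_replicate(lineorder) */
       sum(lo_revenue)
from   lineorder, customer
where  lo_custkey = c_custkey;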

Hash distribution is sometimes the only reasonable choice

We use a self-join of lineorder to demonstrate why hash distribution is sometimes the only reasonable choice. The SQL tested is as follows:

select /*+ monitor parallel(4) */ sum(lo1.lo_revenue)
from lineorder lo1, lineorder lo2
where lo1.lo_orderkey = lo2.lo_orderkey;

The SQL execution time is 2.4 minutes and db time is 10.5 minutes.

The optimizer chooses hash distribution by default. The execution plan has 14 lines, with the same structure as the previous hash distribution example. The difference is that the hash join in line 5 consumes 73% of the db time and uses 9GB of temporary tablespace, and the IO against the temporary tablespace accounts for 12% of the db time. About 15% of db time is used for the two hash distributions and receives of lineorder; compared with the 38% in the previous example, the overall cost of these two hash distributions has more than doubled.

The red PX processes are the p002/p003 processes on instances 1 and 2, and the blue PX processes are the p000/p001 processes. The red PX processes, as producers, account for about 15% of the total db time.

When SQL execution begins, during the two hash distributions of lineorder, AAS is greater than 4; after the distribution completes, only the blue PX processes perform the hash join, and AAS=4.

From the V$PQ_TQSTAT view, it can be confirmed that lineorder is distributed twice, and that the amounts of data received by the four consumer PX processes through table queue 0 and table queue 1 are the same, which ensures the redistribution does not affect the correctness of the join result. Each blue PX process joins 75m rows on the left with 75m rows on the right: through hash distribution, the 300 million rows joined to 300 million rows are evenly split across four independent PX processes.

SELECT dfo_number, tq_id, server_type, instance, process, num_rows
FROM V$PQ_TQSTAT
ORDER BY dfo_number DESC, tq_id, server_type desc, instance, process;

DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS
---------- ----- ----------- -------- ------- ---------
         1     0 Producer           1 P002     75055725
         1     0 Producer           1 P003     74977459
         1     0 Producer           2 P002     74995276
         1     0 Producer           2 P003     74977351
         1     0 Consumer           1 P000     74998419
         1     0 Consumer           1 P001     74995836
         1     0 Consumer           2 P000     74976974
         1     0 Consumer           2 P001     75034582
         1     1 Producer           1 P002     74986798
         1     1 Producer           1 P003     74985268
         1     1 Producer           2 P002     74984883
         1     1 Producer           2 P003     75048862
         1     1 Consumer           1 P000     74998419
         1     1 Consumer           1 P001     74995836
         1     1 Consumer           2 P000     74976974
         1     1 Consumer           2 P001     75034582
         1     2 Producer           1 P000            1
         1     2 Producer           1 P001            1
         1     2 Producer           2 P000            1
         1     2 Producer           2 P001            1
         1     2 Consumer           1 QC              4

21 rows selected.

Using broadcast distribution: poor performance

What happens if we use broadcast distribution for the lineorder self-join? Let's test it:

select /*+ monitor parallel(4) leading(lo1 lo2) use_hash(lo2) pq_distribute(lo2 broadcast none) */ sum(lo1.lo_revenue)
from lineorder lo1, lineorder lo2
where lo1.lo_orderkey = lo2.lo_orderkey;

With broadcast distribution, the SQL execution time is 5.9 minutes and db time is 23.8 minutes. Both execution time and db time increase by nearly 1.5 times compared with hash distribution.

The red PX processes, as producers, scan lineorder in parallel and broadcast the 300 million rows through table queue 0 to the four blue consumer PX processes (lines 6-9), which is equivalent to making four copies: each blue PX process receives 300 million rows. This broadcast distribution consumes 11% of db time; because every row is transferred to every blue PX process, it consumes more db cpu than the two hash distributions in the hash distribution example.

The temporary tablespace consumed by the hash join in line 5 rises to 27GB, and the temporary tablespace IO accounts for 38% of db time, because the amount of data each blue PX process has to hash join becomes larger: 300 million rows on the left side of the hash join and 1/4 of the 300 million rows on the right side.

The blue PX processes, responsible for the hash join as consumers, consume significantly more db time.

During hash join, the temporary tablespace read wait event 'direct path read temp' has increased significantly.

The output of V$PQ_TQSTAT shows that the p000/p001 processes on instances 1 and 2, as consumers, each received 300 million rows of data, which makes the subsequent hash join dramatically slower. The mechanism of broadcast distribution, broadcasting the left side of the hash join, determines that it is not suitable when both sides of the hash join are large tables.

SELECT dfo_number, tq_id, server_type, instance, process, num_rows
FROM V$PQ_TQSTAT
ORDER BY dfo_number DESC, tq_id, server_type desc, instance, process;

DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS
---------- ----- ----------- -------- ------- ----------
         1     0 Producer           1 P002     299928364
         1     0 Producer           1 P003     299954384
         1     0 Producer           2 P002     300188788
         1     0 Producer           2 P003     299951708
         1     0 Consumer           1 P000     300005811
         1     0 Consumer           1 P001     300005811
         1     0 Consumer           2 P000     300005811
         1     0 Consumer           2 P001     300005811
         1     1 Producer           1 P000             1
         1     1 Producer           1 P001             1
         1     1 Producer           2 P000             1
         1     1 Producer           2 P001             1
         1     1 Consumer           1 QC               4

13 rows selected.

Summary: the pitfalls of broadcast and hash distribution

The examples in the previous section and this one show that choosing an unreasonable distribution method causes SQL performance to degrade significantly during parallel execution.

For broadcast distribution: only the left side of the hash join is distributed, but broadcast distribution does not reduce the amount of data on the left side of the hash join. If the left side of the hash join contains a lot of data, parallelism brings only limited improvement to the hash join performance, and broadcasting a large amount of data also consumes extra db cpu, as in the lineorder self-join example in this section. Replicate has the same limitation.

For hash distribution: both sides of the hash join are distributed so that, when each PX process performs the hash join, the amount of data on each side is 1/N of the original (N being the parallelism). The potential pitfalls of hash distribution are:

Two distributions, especially for large tables, can incur significant additional overhead, such as the example in the previous section where customer connects to lineorder. Using Partition wise join eliminates the need for distribution, which is illustrated later.

If the data is skewed and a small number of values on the join key account for most of the data, hash distribution sends the records with the same key value to the same PX process, so one PX process ends up handling the hash join for most of the data, causing parallel execution skew. I will explain this situation and its solutions in later chapters.

When parsing the SQL, the optimizer chooses the distribution method based on information such as the estimated cardinality of the left and right sides of the hash join and the parallelism. Maintaining correct statistics is essential for the optimizer to produce a reasonable parallel execution plan.
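
A hedged sketch of refreshing statistics on the fact table used in these examples (the schema name is taken from the dba_segments query earlier; the parameter values are illustrative):

-- hedged sketch: gather statistics for the fact table, itself in parallel
begin
  dbms_stats.gather_table_stats(
    ownname => 'SID',
    tabname => 'LINEORDER',
    degree  => 4);
end;
/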

Partition Wise Join to eliminate the extra overhead of distribution

Whether for broadcast or hash distribution, data must be transmitted through inter-process or inter-node communication; the more data is distributed, the more db cpu is consumed. Data needs to be distributed during parallel execution essentially because Oracle uses a share-everything centralized storage architecture in which any data is visible to the PX processes of every instance. To divide and conquer the hash join operation, splitting it into N independent work units (assuming DoP=N), the data must be redistributed in advance, and this data distribution is the extra overhead brought by parallelism.

Using full or partial partition wise join, the extra overhead of distribution can be completely eliminated or minimized. If one side of the hash join is hash-partitioned on the join key, the optimizer can choose not to distribute that partitioned table, because the hash partitioning has already split the data; only the other side of the hash join then needs to be hash-distributed. This is partial partition wise join. If both sides of the hash join are hash-partitioned on the join key, each PX process can handle a pair of matching hash partitions independently and no data needs to be distributed at all; this is full partition wise join. With hash partitioning, the unit of work of the hash join is the amount of data contained in a pair of matching hash partitions, so the size of each partition should be controlled; it then becomes possible to avoid using temporary tablespace during the hash join and to greatly reduce the amount of PGA required.

Partition wise join: no data distribution is required

Hash-partition lineorder on the column lo_orderkey into 32 partitions; each partition is close to 1GB in size.
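
The DDL is not shown in the original; a hedged sketch of how such a hash-partitioned copy could be created:

-- hedged sketch: build a 32-way hash-partitioned copy of lineorder on the join key
create table lineorder_hash32
partition by hash (lo_orderkey) partitions 32
as select * from lineorder;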

SEGMENT_NAME      SEG_PART_NAME  SEGMENT_TYPE     SEG_MB
----------------- -------------- ---------------- -------
LINEORDER_HASH32  SYS_P3345      TABLE PARTITION  960
LINEORDER_HASH32  SYS_P3344      TABLE PARTITION  960
...
LINEORDER_HASH32  SYS_P3315      TABLE PARTITION  960
LINEORDER_HASH32  SYS_P3314      TABLE PARTITION  960
                                                  -------
                                                  30720

32 rows selected.

When joining on lo_orderkey, lineorder_hash32 does not need to be redistributed. Let's continue to use the self-join SQL to demonstrate full partition wise join.

select /*+ monitor parallel(4) */ sum(lo1.lo_revenue)
from lineorder_hash32 lo1, lineorder_hash32 lo2
where lo1.lo_orderkey = lo2.lo_orderkey;

This time the SQL execution time is 1.6 minutes and db time is 6 minutes; with hash distribution of the non-partitioned table, the execution time was 2.4 minutes and db time 10.5 minutes, so partition wise join is about 1/3 faster. There is only one set of blue PX processes in the execution plan, and no data needs to be distributed, because the 300 million rows of lineorder_hash32 have been split into 32 partitions. Although the degree of parallelism is 4, when each PX process performs the hash join, the unit of work is one pair of matching hash partitions, and the amount of data on each side is 1/32 of 300 million rows. The smaller unit of work reduces the temporary tablespace consumed by the whole hash join to 448MB. Each PX process handles 8 pairs of hash partitions. It can be predicted that if we increase the parallelism to 8, 16 or 32, the number of partition pairs handled by each PX process drops to 4, 2 and 1 respectively, and the SQL execution time should decrease close to linearly.

The blue PX processes are the p000/p001 processes on instances 1 and 2. The db time consumed by the PX processes is evenly distributed; each PX process handles 8 pairs of partition scans and hash joins.

AAS is 4 most of the time.

The only data distribution is table queue 0: each PX process sends one row to QC.

SELECT dfo_number, tq_id, server_type, instance, process, num_rows
FROM V$PQ_TQSTAT
ORDER BY dfo_number DESC, tq_id, server_type desc, instance, process;

DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS
---------- ----- ----------- -------- ------- --------
         1     0 Producer           1 P000           1
         1     0 Producer           1 P001           1
         1     0 Producer           2 P000           1
         1     0 Producer           2 P001           1
         1     0 Consumer           1 QC             4

5 rows selected.

When DoP is greater than the number of partitions, Partition Wise Join will not occur

When the DoP of the parallel execution is greater than the number of hash partitions, partition wise join does not happen; instead, the optimizer uses broadcast local distribution. Execute the same SQL with DoP=64:

select /*+ monitor parallel(64) */ sum(lo1.lo_revenue)
from lineorder_hash32 lo1, lineorder_hash32 lo2
where lo1.lo_orderkey = lo2.lo_orderkey;

With DoP=64, the query execution time is 15 seconds and db time is 11.3 minutes.

Two sets of PX processes appear in the execution plan. The optimizer chooses to distribute the right side of the hash join by broadcast local; if the left side of the hash join were the smaller one, the broadcast local would happen on the left side instead. Because the DoP is twice the number of partitions, each partition of lineorder_hash32 on either side of the hash join is handled by two PX processes, and the two blue PX processes and two red PX processes that handle a pair of matching partitions are on the same instance. Data is only transferred between PX processes on the same instance and is not transferred across instances, which reduces the cost of the data distribution; this is the meaning of broadcast local. The SQL executes in the following order:

The blue PX processes scan lineorder_hash32 on the left side of the hash join in parallel by block address ranges (line 7). Because the DoP is twice the number of partitions, each partition is scanned by two blue PX processes on the same instance; each blue PX process scans about half of each partition it works on, roughly 4.7m rows, and prepares the build table for the hash join at line 5.

The red PX processes scan lineorder_hash32 on the right side of the hash join in parallel; each red PX process scans about 4.7m rows and then, through table queue 0, distributes them broadcast local to the two blue PX processes on its own instance that handle the matching partition. (When data is distributed only to some PX processes of the local instance to avoid cross-node transfers, this is called slaves mapping; besides broadcast local there are also distribution methods such as hash local and random local.) Through the broadcast local distribution, the amount of data grows from 300m rows to 600m rows.

Each blue PX process receives about 9.4m rows through table queue 0, which is the amount of data of the whole matching partition, then performs the hash join followed by the aggregation. For each blue PX process's hash join, the amount of data on the left is 1/64 of lineorder_hash32 (= 1/DoP) and the amount on the right is 1/32 of lineorder_hash32 (= 1/number of partitions). If you keep raising the DoP, only the amount of data on the left side of the hash join decreases, while the amount on the right side does not; at the same time, more PX processes handling the same partition increases the cost of the broadcast distribution. Therefore, when the DoP is greater than the number of partitions, the scalability of the parallel execution as DoP increases is poor.

Looking at one blue PX process, the p005 process of instance 1, its execution information confirms that the left side of the hash join is 1/64 of lineorder_hash32 and the right side is 1/32 of lineorder_hash32.

Summary

To achieve the best performance when designing a data warehouse, partition wise join and parallel execution should be combined: do hash partitioning on the most commonly used join keys of large tables, which gives the optimizer the opportunity to choose partition wise join for hash joins. Range-hash or list-hash is a common composite partitioning strategy: the first-level partitions use a time range or a list to make an initial segmentation of the data according to business characteristics, while the second-level subpartitions use hash. At query time, after pruning the first-level partitions, the optimizer can still choose partition wise join.
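
A hedged sketch of such a range-hash composite design; the table and column names here are illustrative, not the article's SSB schema:

-- hedged sketch: range partition by date for pruning, hash subpartition on the join key
create table sales_rh (
  order_key  number,
  cust_key   number,
  order_date date,
  revenue    number
)
partition by range (order_date)
subpartition by hash (cust_key) subpartitions 32
(
  partition p2013 values less than (date '2014-01-01'),
  partition p2014 values less than (date '2015-01-01'),
  partition pmax  values less than (maxvalue)
);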

When designing for partition wise join, you should increase the number of hash partitions as much as possible and control the size of each partition. In a partition wise join, each pair of matching partitions is processed by one PX process; if a partition holds too much data, the join may spill to temporary space, which hurts performance. On the other hand, if the number of partitions is too small, partition wise join stops happening once DoP exceeds the number of partitions, and the performance gain from a larger DoP is very limited.

The influence of data skew on different Distribution methods

Data skew means that most of the data in a column consists of a few hot values (popular values). In a hash join, if the data of the join key on the right side of the hash join is skewed, the data distribution causes one PX process to handle all the hot data, which prolongs the SQL execution time; this is called parallel execution skew. If the optimizer chooses hash distribution, both sides of the join are distributed by hash: records with the same key value have the same hash value and are distributed to the same PX process for the hash join. The work is then unevenly distributed, one unlucky PX process has to do most of the work and consumes far more db time than the other PX processes, and the SQL execution time increases significantly. With replicate or broadcast distribution there is no risk of execution skew, because the right side of the hash join (usually the large table) does not need to be distributed: the PX processes scan it evenly using block-address-range or partition-based granules before performing the join.

To demonstrate the relationship between data skew and the different distribution methods, create two new tables: customer_skew contains one record with c_custkey=-1, and in lineorder_skew 90% of the records, 270 million rows, have lo_custkey=-1.
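
The original does not show how these tables were built; a hedged sketch of one way to construct them (the expression used to skew lo_custkey is purely illustrative):

-- hedged sketch: copy the tables, give one customer the key -1, and remap ~90% of lineorder's keys to -1
create table customer_skew as select * from customer;
update customer_skew set c_custkey = -1 where rownum = 1;
commit;

create table lineorder_skew as select * from lineorder;
update lineorder_skew set lo_custkey = -1 where mod(lo_orderkey, 10) <> 0;
commit;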

sid@SSB> select count(*) from customer_skew where c_custkey = -1;

  COUNT(*)
----------
         1

sid@SSB> select count(*) from customer_skew;

  COUNT(*)
----------
   1500000

sid@SSB> select count(*) from lineorder_skew where lo_custkey = -1;

  COUNT(*)
----------
 270007612

sid@SSB> select count(*) from lineorder_skew;

  COUNT(*)
----------
 300005811
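
If you did not already know which value is hot, a hedged sketch of finding the most frequent join-key values on the skewed table:

-- hedged sketch: list the most frequent join-key values to spot skew
select lo_custkey, count(*) cnt
from   lineorder_skew
group  by lo_custkey
order  by cnt desc
fetch first 5 rows only;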

Replicate mode: not affected by data skew

The test sql is as follows:

select /*+ monitor parallel(4) */ sum(lo_revenue)
from lineorder_skew, customer_skew
where lo_custkey = c_custkey;

The SQL execution time is 23 seconds and db time is 1.5 minutes. The optimizer's default execution plan chooses replicate, allocating only one set of PX processes, similar to the broadcast distribution case. Each blue PX process scans customer_skew in full, and when scanning lineorder_skew in parallel it uses block-address-range granules as the scan unit, as shown by the 'PX BLOCK ITERATOR' in line 7.

The db time consumed by the four blue PX processes is evenly distributed; in replicate mode, the data skew in lineorder_skew does not cause execution skew among the four PX processes.

That the optimizer used replicate can be confirmed by the PQ_REPLICATE hint in the outline data of the execution plan. In the following dbms_xplan.display_cursor output, the plan itself is omitted and only the outline data is shown.

select * from table(dbms_xplan.display_cursor('77457qc9a324kwax'));

Plan hash value: 4050055921
...
Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('12.1.0.2')
      DB_VERSION('12.1.0.2')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "CUSTOMER_SKEW"@"SEL$1")
      FULL(@"SEL$1" "LINEORDER_SKEW"@"SEL$1")
      LEADING(@"SEL$1" "CUSTOMER_SKEW"@"SEL$1" "LINEORDER_SKEW"@"SEL$1")
      USE_HASH(@"SEL$1" "LINEORDER_SKEW"@"SEL$1")
      PQ_DISTRIBUTE(@"SEL$1" "LINEORDER_SKEW"@"SEL$1" BROADCAST NONE)
      PQ_REPLICATE(@"SEL$1" "LINEORDER_SKEW"@"SEL$1")
      END_OUTLINE_DATA
  */

Hash distribution: data skew causes execution skew

Forcing hash distribution through hints, the test SQL is as follows:

select /*+ monitor parallel(4) leading(customer_skew lineorder_skew) use_hash(lineorder_skew) pq_distribute(lineorder_skew hash hash) */ sum(lo_revenue)
from lineorder_skew, customer_skew
where lo_custkey = c_custkey;

With hash distribution, the SQL execution time is 58 seconds and db time is 2.1 minutes; with replicate, the SQL execution time was 23 seconds and db time 1.5 minutes. Interestingly, the db time consumed by the whole SQL increased by only 37 seconds, while the execution time increased by 35 seconds, which means the extra db time was not spread evenly across the PX processes. If the extra db time had been averaged over the PX processes with no skew in the parallel execution, the SQL execution time should have increased by about 37/4, roughly 9 seconds, not the actual 35 seconds. The red PX processes, as producers, scan customer_skew and lineorder_skew in parallel and distribute them to the blue PX processes by hash through table queues 0 and 1; the distribution of lineorder_skew accounts for 45% of db cpu.

The blue PX process p001 on instance 2 consumed 57.1 seconds of db time, while the SQL execution time was 58 seconds: this PX process was active for the entire SQL execution. Predictably, all of the lo_custkey=-1 data of lineorder_skew was distributed to this process. The red PX processes, as producers, are responsible for scanning lineorder_skew and distributing it, and their workload is even.

Most of the time AAS=2: only the P001 process of instance 2 keeps receiving data from the four producers and performing the hash join.

From the V$PQ_TQSTAT view, we can confirm that during the distribution of the right side of the hash join, the P001 process of instance 2, as a consumer, received more than 270 million rows through table queue 1. This is why that PX process remained active throughout the SQL execution.

SELECT dfo_number, tq_id, server_type, instance, process, num_rows
FROM V$PQ_TQSTAT
ORDER BY dfo_number DESC, tq_id, server_type desc, instance, process;

DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS
---------- ----- ----------- -------- ------- ---------
         1     0 Producer           1 P004       375754
         1     0 Producer           1 P005
         1     0 Producer           2 P003       393069
         1     0 Producer           2 P004       365767
         1     0 Consumer           1 P002       375709
         1     0 Consumer           1 P003       374677
         1     0 Consumer           2 P001       374690
         1     0 Consumer           2 P002       374924
         1     1 Producer           1 P004     75234478
         1     1 Producer           1 P005     74926098
         1     1 Producer           2 P003     74923913
         1     1 Producer           2 P004     74921322
         1     1 Consumer           1 P002      7497409
         1     1 Consumer           1 P003      7467378
         1     1 Consumer           2 P001    277538575
         1     1 Consumer           2 P002      7502449
         1     2 Producer           1 P002            1
         1     2 Producer           1 P003            1
         1     2 Producer           2 P001            1
         1     2 Producer           2 P002            1
         1     2 Consumer           1 QC              4

21 rows selected.

The SQL monitor report in 12c is enhanced: when parallel execution skew occurs, it includes sampling information for the most expensive PX process. On the Plan Statistics page, selecting 'Parallel Server 3 (instance 2, P001)' in the drop-down menu confirms that the P001 process of instance 2 received more than 270 million rows at line 10 of the execution plan, 'PX RECEIVE'; the Actual Rows column shows 278m.

Summary

In real applications, dealing with data skew is a complex topic. For example, when filtering on a skewed column with bind variables, bind peeking may cause execution plan instability. This section discussed the impact of data skew on the different distribution methods:

In general, replicate or broadcast distribution is not affected by data skew.

For hash distribution, the hottest values of the join keys on both sides of the hash join are distributed to the same PX process for the join, which easily leads to obvious parallel execution skew.

The adaptive distribution introduced in 12c can solve the problem of parallel execution skew for hash distribution. I will demonstrate this new adaptive distribution feature in my next article, "In-depth understanding of Oracle's parallel execution skew (part two)".

HASH JOIN BUFFERED: the blocking point in the execution plan with consecutive hash distributions

So far, all the tests have involved joins of only two tables. If more than two tables are involved, at least two hash joins are needed, the data is distributed more times, the roles of producers and consumers may be swapped back and forth, and the execution plan inevitably becomes more complex. As the execution path gets longer, to keep the parallel execution working correctly, the execution plan may insert blocking points that cache the rows satisfying the join condition to temporary space during the hash join and suspend the data distribution. In this section, I use a three-table join SQL to illustrate how the different distribution methods behave with consecutive hash joins.

Using broadcast distribution: no blocking point

The SQL for testing the three-table join is as follows: the part table is added, and hints make the optimizer use broadcast distribution for both hash joins. A replicate version of the query performs similarly.

select /*+ monitor parallel(4)
           LEADING(CUSTOMER LINEORDER PART)
           USE_HASH(LINEORDER) USE_HASH(PART)
           SWAP_JOIN_INPUTS(PART)
           PQ_DISTRIBUTE(PART NONE BROADCAST) NO_PQ_REPLICATE(PART)
           PQ_DISTRIBUTE(LINEORDER BROADCAST NONE) NO_PQ_REPLICATE(LINEORDER) */
       sum(lo_revenue)
from lineorder, customer, part
where lo_custkey = c_custkey
and lo_partkey = p_partkey;

The SQL execution time is 42 seconds and db time is 2.6 minutes.

AAS = (SQL db time) / (SQL execution time) = (2.6 * 60) / 42 ≈ 3.7, close to 4, indicating that the four PX processes stayed active almost the whole time.

The execution plan is a perfect right-deep tree, which is the typical shape of an execution plan when querying a star model. The producers broadcast-distribute the two dimension tables; after receiving the data, the consumers prepare the build tables of the two hash joins, and finally the fact table is scanned and the hash joins are performed. We read this execution plan by following the table queue order.

The red PX processes, as producers, scan part in parallel and broadcast it through table queue 0 to each blue consumer PX process (lines 7-9). Each blue PX process receives the complete part data (line 6), 1.2m rows, and prepares the build table for the hash join at line 5.

The red PX processes, as producers, scan customer in parallel and broadcast it through table queue 1 to each blue consumer PX process (lines 12-14). Each blue PX process receives the complete customer data (line 11), 1.5m rows, and prepares the build table for the hash join at line 10.

The blue PX processes scan the fact table lineorder in parallel. Each of the 300 million rows that passes the scan filters (if the SQL statement contained filter conditions on lineorder) goes through the hash join at line 10, and each row that survives that join goes through the hash join at line 5, followed by the aggregation. The Timeline column of the SQL monitor report shows that the scan of lineorder and the two hash joins happen at the same time. There is no blocking point in the execution plan: the data flowing along the execution path never has to stop and wait. Most of the db cpu is consumed by the two hash join operations. For an execution plan of this type, the less data passes through each hash join the better, so you need to make sure the optimizer places the join that filters the most data closest to the fact table.

Consecutive hash distributions: a blocking point in the execution plan

Use the following hints to force SQL to use hash distribution.

select /*+ monitor parallel(4)
           LEADING(CUSTOMER LINEORDER PART)
           USE_HASH(LINEORDER) USE_HASH(PART)
           SWAP_JOIN_INPUTS(PART)
           PQ_DISTRIBUTE(PART HASH HASH)
           PQ_DISTRIBUTE(LINEORDER HASH HASH) */
       sum(lo_revenue)
from lineorder, customer, part
where lo_custkey = c_custkey
and lo_partkey = p_partkey;

The SQL execution time is 1.5 minutes and db time is 8.1 minutes. Compared with the broadcast version, 14GB of IO is added.

With two consecutive hash-distributed hash joins, both sides of each hash join must be distributed, so four data distributions occur between the PX processes. The most noteworthy part of the execution plan is the HASH JOIN BUFFERED in line 12, which is a blocking operation. Next, we again read the execution plan following the table queue order, and explain why the blocking operation HASH JOIN BUFFERED appears instead of an ordinary HASH JOIN.

1. The blue PX processes, as producers, scan customer and distribute it by hash through table queue 0 to the red PX processes acting as consumers (lines 14-16). Each red PX process receives 1/4 of the customer data (line 13), roughly 370k rows, and prepares the build table for the 'HASH JOIN BUFFERED' at line 12. Unlike with broadcast distribution, the execution plan starts at line 16, scanning customer, the table nearest to lineorder, rather than at the first operation without a child (line 9, the scan of part). This is one way hash distribution differs from serial execution plans and from broadcast distribution.

2. The blue PX processes, as producers, scan lineorder and distribute it by hash through table queue 1 to the red PX processes acting as consumers (lines 18-20). Each red PX process receives 1/4 of the lineorder data (line 17), roughly 75m rows. While receiving data through table queue 1, each red PX process performs the hash join at line 12 and caches the join result set in PGA; the data does not continue to flow upward for now. If the result set is too large, it has to be spilled to temporary space, as in our case, which uses 7GB of temporary space. You can think of the join result set as being stored temporarily in a temporary table. So why does the execution plan need to insert a blocking point here to stop the data from flowing upward?

This touches the core of the producer-consumer model: there can be at most two sets of PX processes in the same DFO tree, and one data distribution requires both sets of PX processes to cooperate. This means at most one data distribution between the two sets of PX processes can be active at a time, with one set sending data as producers and the other receiving as consumers; each PX process can play only one role at a time, not both. While the blue PX processes distribute the lineorder data to the red PX processes through table queue 1, the red PX processes are receiving the lineorder data and performing the hash join: looking at the Timeline column, lines 12, 17 and 20 are active at the same time. But at this point the red PX processes cannot in turn act as producers and distribute the hash join results to the blue processes, because of two restrictions:

The blue PX processes, as producers, are busy scanning lineorder at this time and cannot, in turn, act as consumers and receive data from the red PX processes.

The build table for the hash join operation on line 5 is not ready; the table part has not even been scanned yet.

So Oracle needs to insert a blocking point at the hash join on line 12, turning it into a HASH JOIN BUFFERED operation that caches the join result set. When the blue PX processes finish scanning and distributing lineorder, the red PX processes complete the hash join on line 12 and store the result completely in temporary space, and the data distribution over table queue 2 begins.

3. The red PX processes, as producers, scan part and distribute it to the blue PX processes, the consumers, through table queue 2 (lines 7-9). Each blue PX process receives 1/4 of the part data (line 6), roughly 300K rows, and builds the build table for the hash join on line 5.

4. The red PX processes, as producers, read the result set of the customer-lineorder join, stored in temporary space by the HASH JOIN BUFFERED operation on line 12, and distribute it to the blue PX processes through table queue 3 (lines 11-12). The HASH JOIN BUFFERED operation used 7GB of temporary space: 7GB of write IO plus 7GB of read IO, 14GB of IO in total.

5. Each blue PX process, as a consumer, receives roughly 75M rows. For the data received through table queue 3, it performs the hash join on line 5 and aggregates the joined rows in the operation on line 4. When the data distribution over table queue 3 finishes, each blue PX process completes its hash join and aggregation and distributes its aggregate result, a single row, to QC through table queue 4 (lines 3-5). QC performs the final aggregation and returns the result to the client.
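
As a sketch (not part of the original test), the row counts flowing through each table queue described above can be checked from the session that ran the query, immediately after it completes, via v$pq_tqstat:

-- Run in the same session, right after the parallel query finishes.
-- Each row shows how many rows a producer sent or a consumer received
-- on a given table queue (TQ_ID) of a given DFO.
select dfo_number, tq_id, server_type, process, num_rows, bytes
from v$pq_tqstat
order by dfo_number, tq_id, server_type desc, process;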

Summary

Because the tests use a star schema, broadcast distribution or replicate makes more sense for this example. In real applications, two consecutive hash distributions do not necessarily create a HASH JOIN BUFFERED blocking point; if the tables involved in the query are small, HASH JOIN BUFFERED generally does not appear. Even when blocking operations such as BUFFER SORT or HASH JOIN BUFFERED show up in the execution plan, that does not mean the plan is not optimal. If the sql performance is unsatisfactory and the HASH JOIN BUFFERED operation consumes most of the CPU and a large amount of temporary space, the sql monitor report helps you judge whether this is reasonable:

Check the estimated rows and actual rows columns of the operations on the left and right sides of the hash join, to determine whether the optimizer's cardinality estimates deviate from reality and therefore led it to choose hash distribution.

Also check the estimated rows and actual rows columns of the hash join operation itself, to see whether the optimizer's estimate of the join result cardinality is reasonable. The optimizer treats the two sides of the hash join as independent events, so its estimate of the join result cardinality may be too conservative (estimated rows too small). A typical star-schema case: when several dimension tables participate in the join, the execution path is long; the early dimension tables are distributed by broadcast and the fact table does not need to be distributed, but after a few joins the result set cardinality drops quickly, the estimated rows on both sides of the later hash joins become close, and the optimizer then chooses hash distribution.

Check how much data each join filters out, to determine whether the optimizer performs the most selective join first, so that the amount of data flowing along the execution path is as small as possible.
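
All of the checks above come from the sql monitor report; as a sketch, a text report for a completed statement can be produced like this (the sql_id value is a substitution-variable placeholder to replace with your own):

-- Generate a text SQL Monitor report for the statement under investigation.
select dbms_sqltune.report_sql_monitor(sql_id => '&sql_id', type => 'TEXT') as report
from dual;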

Hash join and Bloom filtering

Bloom filtering is commonly used in parallel execution plans, and I will explain this data structure and its role in this section. Starting with version 11.2, serially executed sql can also use Bloom filtering.

About Bloom filtering

A Bloom filter is an in-memory data structure used to determine whether an element belongs to a set. The working principle of a Bloom filter is illustrated in the diagram below:

Quote from Wikipedia: http://en.wikipedia.org/wiki/Bloom_filter

As shown in the figure, a Bloom filter is a simple bit array, for which two parameters need to be defined:

1. m: the size of the bit array; in this example, m = 18.

2. k: the number of hash functions; in this example, k = 3.

In an empty Bloom filter all bits are 0. When an element is added, it is passed through the three hash functions to produce three values, and the three corresponding positions in the array are set to 1. In the diagram, the three elements of the set {x, y, z} set nine positions of the array to 1 through their hash computations. To determine whether an element belongs to the set, for example w in the diagram, w is hashed three times to produce three positions; one of those positions is 0 (a miss), so w is not in {x, y, z}. Because of hash collisions, the judgement of a Bloom filter is overly optimistic (false positives are possible): an element that does not belong to {x, y, z} may happen to hit three positions that are all 1 and be wrongly judged a member. By sizing the array m appropriately for the number of elements in the set, the probability of a false positive can be kept within a very small range.
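
As a reference point (a standard result for Bloom filters, not something taken from the execution plans above): with n elements inserted into an array of m bits using k hash functions, the false positive probability is approximately p ≈ (1 − e^(−k·n/m))^k. For the toy example in the diagram, m = 18, k = 3 and n = 3 give p ≈ (1 − e^(−0.5))^3 ≈ 0.06, i.e. about a 6% chance of a false positive.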

How Bloom filtering improves hash join performance

The advantage of a Bloom filter is that with very little memory it can filter out most of the data. If the left side of the hash join carries a filter condition, the optimizer may choose to build a Bloom filter from the data set on the left side of the hash join and use it as a filter condition while scanning the right side of the hash join, excluding most of the rows that cannot satisfy the join condition at the earliest possible point. This reduces the amount of data handled by the data distribution and the join operation and improves performance.
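
If the optimizer does not choose a Bloom filter on its own, it can usually be requested explicitly with the PX_JOIN_FILTER hint, the counterpart of the NO_PX_JOIN_FILTER hint used later in this section. A minimal sketch against the test schema (not one of the original test statements):

-- Ask the optimizer to create and use a Bloom filter on lineorder.
select /*+ monitor parallel(4) PX_JOIN_FILTER(LINEORDER) */ sum(lo_revenue)
from lineorder, customer
where lo_custkey = c_custkey and c_nation = 'CHINA';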

Performance when using Bloom filtering

This test adds the filter condition c_nation = 'CHINA' on customer, so only the revenue from the orders of Chinese customers is summed. We compare the performance with and without Bloom filtering.

select /*+ monitor parallel(4) */ sum(lo_revenue)
from lineorder, customer
where lo_custkey = c_custkey and c_nation = 'CHINA';

The SQL execution time is 1 second and the db time is 7.9 seconds. The optimizer chooses replicate by default, and JOIN FILTER CREATE and JOIN FILTER USE operations appear in the execution plan. The SQL executes as follows: each PX process scans the entire customer table (line 7) and, for the 60K (240K/4) rows that satisfy c_nation = 'CHINA', builds a Bloom filter on the c_custkey column, BF0000 (line 6, JOIN FILTER CREATE), then uses this Bloom filter while scanning lineorder (line 8, JOIN FILTER USE). Although lineorder has 300M rows and the SQL has no filter condition on it, the scan returns only 28M rows thanks to the Bloom filter; the other 272M rows are filtered out. During the hash join, each PX process only has to join 60K customer rows with 7M (28M/4) lineorder rows, which greatly reduces the cost of the join. On Exadata, Smart Scan supports offloading the Bloom filter to the storage nodes: while scanning lineorder, the storage nodes use the Bloom filter to exclude the 272M rows and strip unneeded columns from the qualifying rows. Cell Offload Efficiency = 98% means that only 2% of the 30GB is returned from the storage nodes to the PX processes. Without the Bloom filter, the Cell Offload Efficiency would not be as high as 98%, as we will see in the next example. On non-Exadata platforms there is no Smart Scan, the filtering has to be done by the PX processes themselves, and the effect of the Bloom filter is less pronounced. Database In-Memory, a new feature of 12c, supports using Bloom filters when scanning data stored in the in-memory column store.

In the execution plan, the scan of LINEORDER on line 10 uses the Bloom filter as a predicate: SYS_OP_BLOOM_FILTER(:BF0000, "LO_CUSTKEY")
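
As a rough way to quantify this while the statement is executing (a sketch; the view is only populated for active Bloom filters), you can compare the number of rows probed against the number filtered out:

-- Rows probed against the Bloom filter vs. rows it eliminated,
-- per query coordinator session, while the parallel query is running.
select qc_session_id, active, probed, filtered
from v$sql_join_filter;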

Performance without Bloom filtering

Next, we use the NO_PX_JOIN_FILTER hint to disable Bloom filtering and observe the sql performance.

select /*+ monitor parallel(4) NO_PX_JOIN_FILTER(LINEORDER) */ sum(lo_revenue)
from lineorder, customer
where lo_custkey = c_custkey and c_nation = 'CHINA';

The SQL execution time is 9 seconds and the db time is 33.7 seconds; compared with the run that used a Bloom filter, the performance degradation is obvious. The optimizer still chooses replicate, but there are no JOIN FILTER CREATE and JOIN FILTER USE operations in the execution plan. The reasons why db time roughly quadruples:

When the PX processes scan lineorder, all 300M rows are returned; without the Bloom filter as a predicate, each PX process has to receive 75M rows from the storage nodes.

When performing the hash join on line 5, each PX process has to join 60K customer rows with 75M lineorder rows; the cost of the join increases significantly.

Due to the absence of Bloom filtering, Cell Offload Efficiency dropped to 83%.

Generation, transmission, merging and use of Bloom filters during HASH distribution

We force the use of hash distribution through hints and observe how the Bloom filter is generated and used in this execution plan.

select /*+ monitor parallel(4) leading(customer lineorder) use_hash(lineorder) pq_distribute(lineorder hash hash) */ sum(lo_revenue)
from lineorder, customer
where lo_custkey = c_custkey and c_nation = 'CHINA';

This time the sql execution time is 4 seconds and the db time is 19.4 seconds. Line 6 of the execution plan is JOIN FILTER CREATE and line 13 is JOIN FILTER USE. In this example the PX processes are spread across the two RAC instances, so the hash distribution involves the generation, transmission, merging and use of Bloom filters, which is more complex. The specific process is as follows:

Generation of the Bloom filters: the four blue PX processes, as consumers, receive the customer data hash-distributed by the red PX processes through table queue 0; each blue PX process receives about 15K rows. While receiving the customer rows, the two blue PX processes on instance 1 jointly build a Bloom filter in the SGA, call it B1, and the two blue PX processes on instance 2 jointly build a Bloom filter in the SGA, call it B2. Because B1 resides in the SGA, it is visible to both red PX processes of instance 1; likewise, B2 is visible to both red PX processes of instance 2.

Transmission of the Bloom filters: when the red PX processes finish scanning customer, the left side of the hash join, the transmission of Bloom filters B1/B2 is triggered: the red PX processes of instance 1 send B1 to the blue PX processes of instance 2, and the red PX processes of instance 2 send B2 to the blue PX processes of instance 1.

Merging of the Bloom filters: the blue PX processes of instance 1 merge B1 with the received B2; the blue PX processes of instance 2 merge B2 with the received B1. After merging, instances 1 and 2 hold the same Bloom filter.

Use of the Bloom filter: the four red PX processes on instances 1 and 2, as producers, use the merged Bloom filter while scanning lineorder. After filtering, 49M rows remain, so the Bloom filter is not as effective here as in the replicate case. Cell Offload Efficiency is 97%.

If the parallel execution runs on only one instance, the red PX processes do not need to transmit the Bloom filter and the blue PX processes do not need to merge it.

Because the cost of the hash join has been greatly reduced, the hash distribution of the 49M lineorder rows now accounts for the largest share of the cost, 53% of db time.

Summary

This section described how a Bloom filter works and its typical application in Oracle: improving hash join performance. The essence of Bloom filtering is to apply the hash join's join condition early: while scanning the right side of the hash join, most of the rows that cannot satisfy the join condition are filtered out as soon as possible, greatly reducing the cost of the subsequent data distribution and hash join. The generation and use of the Bloom filter differ slightly between distribution methods:

For broadcast distribution and replicate, each PX process holds the complete left side of the hash join and generates a complete Bloom filter on the join key, which is used while scanning the right side of the hash join. If the sql involves several dimension tables and all of them are distributed by broadcast, the optimizer may generate a Bloom filter for each dimension table data set and use them all at the same time while scanning the fact table.

For hash distribution, after the PX processes acting as consumers receive the left side of the hash join, each PX process generates a Bloom filter for its own data set and broadcasts it to the PX processes acting as producers, which use it while scanning the right side of the hash join.

In the real world, the optimizer decides automatically whether to use a Bloom filter, based on the statistics and the sql filter conditions. Using a Bloom filter usually improves performance, but in some extreme cases it causes degradation. Two such scenarios:

When the data set on the left side of the hash join is very large, say millions of rows, and the join key has many distinct values, the optimizer may still choose to use a Bloom filter. The resulting filter is too large to fit entirely in the CPU cache, so when the filter is applied, every row on the right side of the hash join requires a read of the Bloom filter from main memory, which causes performance problems.

If the join itself does not filter any data, most rows on the right side of the hash join will hit the Bloom filter. The optimizer may not realize that the join cannot filter data and still choose to use the Bloom filter; if the data set on the right side of the hash join is large, the Bloom filter may consume significant extra cpu.

Typical serial points in parallel execution plans

In the real world, improper use can prevent operations from running in parallel, or make the parallel execution plan inefficient, so that the expected performance improvement is not achieved. This section gives several typical examples.

Using rownum in sql leads to a PX SEND 1 SLAVE operation, because all data has to be distributed to a single PX process so that each row can be assigned a unique rownum value, and to blocking operations such as BUFFER SORT.

Using a user-defined pl/sql function that is not declared parallel_enable prevents any sql that calls it from being parallelized.

Running parallel DML without enable parallel dml at the session level prevents the DML operation from running in parallel.

Rownum leading to an inefficient parallel execution plan

In the section "impact of data skew on different distribution methods", we create a new table lineorder_skew to change the value of 90% of the lo_custkey column of lineorder to-1. Because the lo_custkey is uniformly distributed, we can modify 90% of the data to-1 by modulating the lo_custkey column or by modulating the rownum. Use the following case when statement:

1. case when mod(lo_orderkey, 10) > 0 then -1 else lo_orderkey end lo_orderkey

2. case when mod(rownum, 10) > 0 then -1 else lo_orderkey end lo_orderkey

Test the execution performance of the two forms with the following create table statement, at a degree of parallelism of 16.

create table lineorder_skew1 parallel 16 as
select case when mod(lo_orderkey, 10) > 0 then -1 else lo_orderkey end lo_orderkey,
       lo_linenumber, lo_custkey, lo_partkey, lo_suppkey, lo_orderdate,
       lo_orderpriority, lo_shippriority, lo_quantity, lo_extendedprice,
       lo_ordtotalprice, lo_discount, lo_revenue, lo_supplycost, lo_tax,
       lo_commitdate, lo_shipmode, lo_status
from lineorder;
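
The rownum variant used in the second test is not shown in the original text; under the same assumptions it would look roughly like this (lineorder_skew2 is just an illustrative name):

create table lineorder_skew2 parallel 16 as
select case when mod(rownum, 10) > 0 then -1 else lo_orderkey end lo_orderkey,
       lo_linenumber, lo_custkey, lo_partkey, lo_suppkey, lo_orderdate,
       lo_orderpriority, lo_shippriority, lo_quantity, lo_extendedprice,
       lo_ordtotalprice, lo_discount, lo_revenue, lo_supplycost, lo_tax,
       lo_commitdate, lo_shipmode, lo_status
from lineorder;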

Without rownum, the create table takes 1 minute and the db time is 15.1 minutes. QC allocates only one set of PX processes; each blue PX process scans its share of the lineorder table by block address ranges in parallel, collects statistics, and loads the data into the lineorder_skew1 table. No data has to be distributed, and every PX process stays active; this is the most efficient execution path.

Most of the time, AAS=16

With rownum, the create table takes 22.3 minutes and the db time is 38.4 minutes; the execution time is 22 times longer than when using lo_orderkey.

Two sets of PX processes appear in the execution plan, together with the PX SEND 1 SLAVE and BUFFER SORT operations that did not appear in the previous test. Following the order of the table queues, let us read this execution plan:

The blue PX processes scan lineorder in parallel and distribute all the data to one red PX process through table queue 0 (lines 10-12). Because rownum is a pseudo column, assigning a unique row number to every row can only be done by a single process, so assigning the rownum column becomes the serial point of the whole parallel execution plan. This is why the PX SEND 1 SLAVE operation causes such a dramatic performance degradation. In this example the only active red PX process is p008 on instance 1: the 300M rows of lineorder have to be sent to that process for the rownum assignment and are then redistributed to the 16 blue PX processes for the parallel insert.

The p008 process on instance 1 receives the data distributed by the 16 blue PX processes and, after assigning the rownum values (the count operation on line 8), needs to distribute the data onward to the blue PX processes through table queue 1. But because the distribution through table queue 0 is still in progress, the execution plan inserts a blocking point, BUFFER SORT (line 7), which caches the data after the rownum assignment in temporary space; its size is 31GB.

After the distribution over table queue 0 finishes, p008 on instance 1 reads the 31GB of data back from temporary space and distributes it to the 16 blue PX processes through table queue 1 for statistics collection and insertion.

Of the red PX processes, only p008 on instance 1 is active; it consumed 16.7 minutes of db time. For the whole execution plan, the two data distributions also consume a large amount of db cpu: 300M rows are distributed from 16 blue PX processes to one red PX process through table queue 0, and 300M rows are distributed from that one red PX process to 16 blue PX processes through table queue 1.

Although DoP = 16, the actual AAS = 1.5, which means the execution plan is inefficient.

In the real world, rownum should be avoided in applications: generating rownum creates a serial point in the parallel execution plan and adds unnecessary data distributions. For sql that uses rownum, increasing the degree of parallelism usually does not improve performance, and there is no remedy other than modifying the sql code.

A custom PL/SQL function without parallel_enable cannot be parallelized

Rownum causes a serial point in the parallel execution plan; a user-defined pl/sql function that is not declared parallel_enable goes further and forces the sql to run serially, even when the sql specifies a parallel hint. Let us test this: create a package pk_test containing a function f that simply returns its input parameter. The function declaration contains no parallel_enable, so it does not support parallel execution.

create or replace package pk_test authid current_user as
  function f(p_in number) return number;
end;
/
create or replace package body pk_test as
  function f(p_in number) return number as
    l_n1 number;
  begin
    select 0 into l_n1 from dual;
    return p_in - l_n1;
  end;
end;
/

The following example uses the function pk_test.f in the where clause; using pk_test.f in the select list would also cause the execution plan to become serial.

select /*+ monitor parallel(4) */ count(*) from customer where c_custkey = pk_test.f(c_custkey);

The query takes 54 seconds and db time is also 54 seconds. Although we requested parallel execution with DoP = 4, the execution plan is actually serial.

Declare the function with parallel_enable, indicating that it supports parallel execution, and run the sql again.

create or replace package pk_test authid current_user as
  function f(p_in number) return number parallel_enable;
end;
/
create or replace package body pk_test as
  function f(p_in number) return number parallel_enable as
    l_n1 number;
  begin
    select 0 into l_n1 from dual;
    return p_in - l_n1;
  end;
end;
/

This time the query takes 12 seconds and the db time is 46.4 seconds. Parallel execution occurs as expected, with a degree of parallelism of 4.

Unless there is a specific reason not to, a custom pl/sql function should be declared parallel_enable when it is created. Loss of parallelism caused by a missing parallel_enable declaration is a common problem I have seen on several customer systems. In 11g, when this happens, the PX COORDINATOR FORCED SERIAL operation may appear in the execution plan, which is an obvious hint; otherwise you need to identify the problem through the sql monitor report. Checking the execution plan with dbms_xplan.display_cursor alone is not enough: in this case the Note section of the plan will still show DoP = 4.
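
A quick sanity check, offered as a sketch rather than part of the original tests: while the statement is running, v$px_session lists the PX slaves working for each query coordinator, together with the requested and actual degree of parallelism; if the statement has silently gone serial, no rows appear for it at all.

-- Compare requested (req_degree) and actual (degree) parallelism per QC.
select qcsid, server_group, server_set, degree, req_degree
from v$px_session
order by qcsid, server_group, server_set;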

Parallel DML without enable parallel dml prevents DML operations from running in parallel

This is a common problem in ETL applications: without enable or force parallel dml at the session level, dml operations cannot run in parallel. We demonstrate with the 1.5M rows of the customer table.

Create an empty table customer_test:

create table customer_test as select * from customer where 1 = 0;

We use a parallel direct-path insert statement as the example and execute the insert twice, the first time without enable parallel dml. The insert statement is as follows:

insert /*+ append parallel(4) */ into customer_test select * from customer;

The insert statement takes 9 seconds. Although the parallelism of the statement is 4, the direct-path insert operation LOAD AS SELECT on line 2 of the execution plan runs serially.

The Note section of the execution plan shows that PDML is not enabled:

Note-PDML is disabled in current session

Once parallel dml is enabled, re-execute the insert statement.

Alter session enable parallel dml
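
To confirm the session state before re-running the insert (a sketch, not part of the original test), the PDML status can be read from v$session:

-- PDML_STATUS should now show ENABLED for the current session.
select pdml_status, pddl_status, pq_status
from v$session
where sid = sys_context('userenv', 'sid');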

This time the insert statement takes 3 seconds, and the LOAD AS SELECT operation on line 3 of the plan runs in parallel.

At this point, the study on "how to understand the parallel execution of Oracle" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!
