
Oracle Parallel Fundamentals (Part 2)

Author: Wei Xinghua, senior database technology expert at Woqu Technology

Limitations of the producer-consumer model

As you already know from the above, a parallel operation usually involves two sets of PX slave processes: one set acts as the producer and the other as the consumer. Producers send data through table queues and consumers receive data from them. The producer-consumer model carries a significant limitation: a DFO unit can use at most two sets of PX slave processes. This restriction probably exists to keep the parallel execution code simple: there is a communication channel (table queue) between each pair of PX slave processes, and between each PX slave and the QC, so allowing more sets of PX slaves would make the number of channels explode. For example, for a parallel operation with a DOP of 5, the number of channels between the two sets of PX slaves is 5 × 5 = 25, the number of channels between the slaves and the QC is 2 × 5 = 10, so a total of (5 + 2) × 5 = 35 channels is needed. Imagine how many connections would have to be maintained if Oracle allowed three sets of PX slaves within one parallel operation: assuming the server is running at a DOP of 50, three sets of PX slave processes would generate 50 × 50 × 50 = 125,000 channels, not counting the channels between the PX slaves and the QC.

Parallel_execution_message_size

A large number of channels passing messages between processes might not be a big problem if they consumed no database resources, but that is not the case. The memory consumed by these inter-process message channels is controlled by the parameter parallel_execution_message_size. In 11gR2 the value of this parameter is 16K; in earlier versions the value differed (it has grown with each release). In a non-RAC environment the maximum size of each channel is 3 × parallel_execution_message_size; in a RAC environment it is 4 × parallel_execution_message_size.
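To check the value in effect on your own system, you can query v$parameter (the 16K default mentioned above is the 11gR2 one):

select name, value, isdefault from v$parameter where name = 'parallel_execution_message_size';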

For example, for a query with a DOP of 20, the maximum memory occupied by the channels in a non-RAC environment may be:

channel memory between PX processes + channel memory between PX processes and the QC = 20 × 20 × 16K × 3 + 2 × 20 × 16K × 3 = 21120K, close to 21M of memory.

Channel memory is allocated from the large pool by default, or from the shared pool if no large pool is configured.
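You can see where the message buffers were allocated by querying v$sgastat; in my experience the relevant entry is called 'PX msg pool', though the entry name may vary by version:

select pool, name, round(bytes/1024/1024) mb from v$sgastat where name = 'PX msg pool';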

The formula for calculating channel memory:

Single node: (N² + 2N) × 3 × 16K

RAC: (N² + 2N) × 4 × 16K

(where N is the DOP)
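As a sanity check, the DOP 20 example above can be reproduced with a trivial query (16 stands for the 16K parallel_execution_message_size; results are in KB):

select (20*20 + 2*20)*3*16 as single_node_kb, (20*20 + 2*20)*4*16 as rac_kb from dual;

The first column returns 21120, matching the calculation above.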

Hash join buffered

It's not really a flaw in hash join.

We have introduced the producer-consumer model. It has one big "flaw": within a parallel operation there can be at most two sets of PX slaves, and the two sets transmit messages and exchange data through table queues, so while one set of slaves is reading from a table queue, it cannot write into another table queue at the same time. Hard to picture?

Let's describe it with an example:

select /*+ parallel(6) pq_distribute(b hash hash) */ * from hash_t3 a, hash_t1 b where a.id = b.id;

In this example, I force the data to be distributed by hash via the hint pq_distribute(b hash hash). Pay attention to the operation at row source ID 3: there is a very strange operation, hash join buffered. Don't be misled by the name buffered; it means the data cannot flow upward yet and must be stored here temporarily. The statement executes in the following order:

First, the red producer PX slaves scan the hash_t3 table and write the scanned records into table queue TQ10000 according to the HASH distribution.

The blue consumer PX slaves receive data from table queue TQ10000 and build the hash table.

Once the above is finished, the red producers continue by scanning the hash_t1 table and write the scanned records into table queue TQ10001, again according to the HASH distribution.

The blue consumer PX slaves receive data from table queue TQ10001 and probe the HASH TABLE built above, but the result cannot be written into table queue TQ10002 yet; it is cached temporarily instead (the origin of hash join buffered).

Only after the HASH distribution is complete (that is, once the two sets of PX slaves are no longer busy with it) does one set of PX slaves send the result set to the QC via table queue TQ10002.
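If you want to verify this plan shape yourself, dbms_xplan can display the cursor's plan with the PARALLEL format option, which adds the TQ, IN-OUT and PQ Distrib columns (a sketch, to be run in the same session right after executing the query):

select * from table(dbms_xplan.display_cursor(null, null, 'BASIC +PARALLEL'));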

Why does Oracle do this? At first glance it doesn't seem to make sense.

This is because hash distribution requires the right side of the hash join to be distributed as well. The distribution operation involves both sets of PX slave processes: one set is responsible for scanning and writing the scanned data into a table queue, and the other set is responsible for reading the data out of the table queue. While that is happening, no further distribution can take place, because the join result set cannot be written into the other table queue, TQ10002.

If the result set is large, this buffering may consume a lot of temporary tablespace, causing a large amount of disk read and write I/O and, in turn, degraded performance.

If this does happen, you can avoid it by switching to broadcast distribution, because broadcast distribution does not require the right side of the hash join to be distributed:

select /*+ parallel(6) pq_distribute(b broadcast none) */ * from hash_t3 a, hash_t1 b where a.id = b.id;

After the change to broadcast, the hash join buffered operation disappears.

Bloom filtering

Bloom filtering deserves an introduction, because it plays a very important role in parallel execution in 11gR2 and later versions. The Bloom filter is not an Oracle invention; Oracle did not yet exist when the technique appeared, as it was devised by Burton H. Bloom in 1970. What does a Bloom filter do?

A Bloom filter is a data structure that can quickly determine whether a value belongs to a set. A hash join itself is resource-intensive and slow; Bloom filtering is much faster than a hash join.

For a detailed introduction to Bloom filtering, please refer to: http://www.cnblogs.com/haippy/archive/2012/07/13/2590351.html

A Bloom filter is based on an array of m bits. For example, the array in the figure above is 18 bits long, and all bits are 0 at initialization. To understand how a Bloom filter works, you must know under what circumstances these flag bits are set to 1. In the figure above, {X, Y, Z} represents a set with three values (elements). If you look carefully, three lines extend from each value: each value is fed through 3 HASH functions, and the computed values fall in the range 0 to 17 (the length of the array). For example, X is hashed three times, producing the values 1, 3 and 13, and the corresponding flag bits are set to 1; Y and Z likewise have their corresponding bits set to 1. Once every element of the set {X, Y, Z} has been hashed and the corresponding flag bits set to 1, we can test a value from another set, W: it is run through the same three HASH functions and the corresponding positions are checked. If every corresponding position is 1, then W may (only may) belong to the set; if any position is not 1, then W definitely does not belong to the set. Because the Bloom filter does not match values exactly (whereas HASH JOIN requires exact matching), some values that do not belong to the set may pass through the Bloom filter.
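As a toy illustration of the multi-hash idea (this is not how Oracle's internal Bloom filter is implemented), ORA_HASH with three different seeds can map a value into three positions of an 18-slot array, i.e. the range 0 to 17:

select ora_hash('X', 17, 1) pos1, ora_hash('X', 17, 2) pos2, ora_hash('X', 17, 3) pos3 from dual;

Testing W for membership repeats the same three calls with 'W' and checks whether all three returned positions are set to 1; a single unset position proves W is not in the set.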

The Bloom filter has the following characteristics:

Building the Bloom filter's bit array requires very little memory, and the array can often fit entirely in the CPU cache. Of course, the larger the array, the less likely the Bloom filter is to misjudge.

Since no exact matching is required, the Bloom filter is very fast, but some values that should not pass may still slip through it.

PX Deq Credit: send blkd and PX Deq Credit: need buffer

What is Bloom filtering good for here? If you are an experienced DBA, you are probably familiar with the wait events PX Deq Credit: send blkd and PX Deq Credit: need buffer. After the introduction above we already have the necessary background: table queues, the producer-consumer model, and so on. One set of producer PX slaves writes into a table queue while the other set reads from it, completing the data transfer between processes. Inevitably, there are situations where a set of producer PX slaves tries to write data into a table queue and finds its memory full, with no room left to write. Most of the time this means that the consumer PX slaves are consuming data from the table queue too slowly, and the most likely reason for that is that the consumers have to spill the data they read from the table queue to disk. Writing data from memory to disk is a very slow operation, so in this situation you will encounter PX Deq Credit: send blkd and PX Deq Credit: need buffer waits. How do you optimize this? This is where Bloom filtering comes into play.
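To check whether a session is accumulating these waits, you can query v$session_event (a sketch; add a filter on the SID you are interested in):

select sid, event, total_waits, time_waited from v$session_event where event in ('PX Deq Credit: send blkd', 'PX Deq Credit: need buffer');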

Suppose the optimizer estimates that table X returns 1,000 records and that table Y must scan 100 million records, 90% of which would be thrown away by the HASH JOIN anyway. In this case a Bloom filter built before the HASH distribution can pre-filter the rows. After the Bloom filter has eliminated a large number of records, the result set on the right side of the HASH JOIN becomes very small, so the amount of data distributed by HASH becomes very small, which greatly reduces the probability of PX Deq Credit: send blkd and PX Deq Credit: need buffer waits. Without the Bloom filter, one set of processes has to pass a large amount of data to the other set, consuming more memory and CPU and increasing the contention between the two sets of processes.

Don't expect the Bloom filter to be perfect: it can eliminate most of the unneeded rows, but not all of them, so some unwanted data will still pass through the Bloom filter and reach the second set of processes.
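On recent versions you can also gauge how effective a Bloom filter was at runtime through the v$sql_join_filter view, which, as I understand it, reports how many rows were tested against an active filter and how many it eliminated:

select qc_session_id, filtered, probed from v$sql_join_filter;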

Parallelism degradation

Whether you specify the DOP manually or use 11g's AUTO DOP, the runtime DOP may not be what you expected: it may be downgraded. Parallelism can be reduced for many reasons: for example, the parallel processes currently available in the system can no longer satisfy the required DOP, or you have used Oracle Resource Manager to limit the degree of parallelism, and so on.

The best tool for monitoring parallelism degradation is SQL MONITORING in Oracle 12.1, for example:

As shown in the figure above, in the [General Information] section, hover the mouse over the little blue person icon in the Execution Plan area and some parallelism information appears. In the figure, the runtime DOP is 4: 10 parallel server processes were requested, but only 4 were actually allocated, so the parallelism was downgraded by 60%.

To find out why the statement was downgraded, click the other columns of the PX COORDINATOR row source in the "Plan Statistics" section, as marked with a red box in the figure below:

After clicking, the following appears:

Here are the codes that describe the downgrade reasons:

350 DOP downgrade due to adaptive DOP

351 DOP downgrade due to resource manager max DOP

352 DOP downgrade due to insufficient number of processes

353 DOP downgrade because slaves failed to join

In my case, the allocation of parallel resources failed because there were not enough parallel processes available in the system.

If EMCC is not convenient for you, parallelism degradation can also be observed through a view, although no view exposes the reason for the downgrade yet (or at least I am not aware of one; please let me know if you are):

The DEGREE column shows the actual degree of parallelism, and REQ_DEGREE shows the degree of parallelism requested.
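Judging from the figure, the view in question is v$px_session; a sketch of the query, returning one row per parallel process:

select qcsid, sid, req_degree, degree from v$px_session order by qcsid, sid;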

There are ways to avoid parallelism degradation. For example, on Oracle 11g you can enable the automatic parallelism management feature and then specify the degree of parallelism at the statement level. Because enabling automatic parallelism also enables parallel statement queuing, a statement that finds too few parallel processes available at runtime will wait in the queue until enough parallel processes are free.
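A minimal sketch of that approach, reusing the hash_t1 table from the earlier examples (the hint value 8 is illustrative):

alter session set parallel_degree_policy = auto;

select /*+ parallel(8) */ count(*) from hash_t1;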

Multiple DFO units

Some statements can contain multiple DFO units. Since each DFO unit can use up to two sets of PX slaves, a statement with multiple DFO units can use more than two sets of PX slaves. Whether a statement uses multiple DFO units can be seen in its execution plan:

select count(*) from (select /*+ parallel(a 4) */ count(*) from hash_t1 a

union

select /*+ parallel(b 4) */ count(*) from hash_t2 b);

The row sources with ID 6 and 12 both carry the coordinator identifier, which means this statement uses two DFO units.

SQL MONITORING also shows that this statement has two parallel groups. In theory, the DFO units could be active in parallel at the same time, but in our example the two DFO units execute sequentially: DFO unit 1 first, then DFO unit 2. You can see this in the [Schedule] column: the first DFO unit becomes active first, and only after it finishes does the second DFO unit become active.

The figure above also shows that DFO unit 2 reuses the parallel processes of DFO unit 1 instead of spawning new ones, as the parallel process numbers reveal. Isn't SQL MONITORING super easy to use?

V$pq_sesstat view

By querying the v$pq_sesstat view, you can see how many DFO units the statement used (DFO Trees), the number of slave sets (Slave Sets), the number of server processes (Server Threads), and the degree of parallelism actually used (DOP). As follows:
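For example, a sketch using the statistic names listed above:

select statistic, last_query, session_total from v$pq_sesstat where statistic in ('DFO Trees', 'Slave Sets', 'Server Threads', 'DOP');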

This article has already grown rather long, and it barely touches on 12c's new features, parallel execution skew, or the transmission and more advanced aspects of Bloom filtering. I will cover those topics in the next article.
