Series of related principles of CBO 04/11 Update SLTechnology News&Howtos

Series of related principles of CBO

2025-04-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

The related principles of CBO

CBO is introduced in oracle7 to calculate the execution cost of the execution plan based on the statistics of the data object (including the number of rows in the dataset, the number of unique values, and so on). With the evolution of the version, CBO gradually perfected and began to use system statistics (system statistics, which appeared to estimate the CPU consumption of SQL in 9i). However, CBO still has some defects, and by understanding some relevant principles of CBO, it is easy for everyone to understand its defects, so that in many cases, the statistics that CBO relies on are collected 100% accurate, and the reason why they will choose the wrong execution plan.

The choice of execution plan

After CBO generates an execution plan, it calculates its cost, and then compares it with the lowest cost of the generated execution plan. This comparison stops when one of the following conditions is met:

1. All execution plans have been calculated.

The number of join permutations of the query block exceeds the value specified by the OPTIMIZER_MAX_PERMUTATIONS (10g and later _ OPTIMIZER_MAX_PERMUTATIONS) parameter. The default is 2000.

We can do a simple calculation, such as the following SQL:

There are seven tables in a query block. The possible join order of these seven tables is as follows:

1. A1-> a2-> A3-> A4-> A5-> A6-> A7

2. A1-> a2-> A3-> A4-> A5-> A7-> A6

3. A1-> a2-> A3-> A4-> A6-> A5-> A7

The number of all possible permutations is 7 percent 5040, far exceeding the default value of OPTIMIZER_MAX_PERMUTATIONS. In this case, CBO does not calculate all possible join sequences. It is possible to miss out on the lowest cost implementation plan. The reason for this design is to prevent too much comparison of execution plan costs from spending too much time on SQL parsing.

The basic concept of costing 1 cardinality (cardinality)

Cardinality refers to the number of rows in the result set of a row source. For example, in the following query, all rows of the emp table are returned, and the cardinality is the number of rows of the table 14. 5.

Another example is:

Its cardinality is the number of rows returned by the emp table after predicate filtering (job='CLERK').

2 selectivity (selection rate)

Selection rate, also known as selectivity, is closely related to cardinality. The formula for calculating the selection rate is as follows:

For example, if the emp table has 14 rows and empno is the primary key, then the frequency of each value is 1max 14. So the filter condition selection rate of the following sql is 1max 14.

We know that CBO takes the selection rate into account when choosing whether to access the full table or the index at some point in the execution plan, and as you can see from the above formula, you need to know two data to get the selection rate. The following is still an example of the 1.cardinality section to explain how CBO calculates the selection rate based on statistics.

You can see that this sql actually returns 4 entries, but how is the value of the rows section 3. 3 calculated?

First, CBO obtains the total number of rows of the emp table from the statistics as 14; then, according to the unique key value (num_distinct) on the column job, it is concluded that the selection rate of the equivalent condition on the column is 1 1/num_distinct (that is, in the case of no histogram, CBO believes that the column values have no data skew, and the data distribution is uniform, so each value in the column appears with the same frequency of 1/num_distinct). In this way, the result set that should be calculated is 141Universe 2.8CBO. In the algorithm, we have to ceil the result, that is, the result is ceil (141Universe 5) = 3.

For example, there are several white balls and black balls in a black cloth bag. If you want to guess how many balls of each color there are, you can only assume that the number of them is about the same without opening the bag to count.

It is conceivable that continuing to use this algorithm on a column with data skew (that is, the number of rows corresponding to different unique values) may result in an incorrect execution plan.

Let's create a table with data skew

Now, if we query the data of the rows of gender='M', it is obvious that if there is an index in the gender column, it is most efficient to access the index to obtain rowid and then return to the table. However, according to the previous explanation, in the case of collecting statistics instead of histograms, CBO will consider that the amount of data returned by gender='M' is 50% of the total data and choose a full table scan.

You can see that the corresponding value of rows is 65537, which is indeed the total number of rows of the table (rounded up).

In the actual application scenario, there may be multiple filter conditions for the table, and there are and or or connections between the filter conditions. The calculation of the selection rate in these two cases is very similar to the and or operation in the knowledge of senior high school.

First, for the case where and is used between conditions:

For example: select... From a where a.col1=value1 and a.col2=value2 . In this case, how does CBO calculate the selection rate? Let's add a filter condition to the previous example:

You can see that the rows part of the estimate is 1, there are actually two pieces of data. We record the two filtering conditions as I and j respectively, and the occurrence frequency is recorded as P (I) and P (j). In the absence of multiple columns of statistical information, CBO believes that the frequency at which I and j are established at the same time is P (I) P (j). According to the previous explanation, we know that P (I) = 1 1/15.emp, then the total number of rows of the P (I) P (j) = table is 14, then the result set produced by these two filtering conditions is ceil (14 * (1max 15)) = 1.

In the case of more than 2 filtering conditions, there is a similar algorithm. For example, if there is a filter condition i1, i2, and i3. In, then the algorithm for the final selection rate is:

For cases where there is or between the filter conditions, the algorithm is (knowledge of high school probability):

It can be seen that if there are too many filtering conditions, the result of selection rate calculation is likely to be greatly distorted.

For example, for sql:

The selection rate calculated from the above formula is likely to be very close to 0, and the cardinality calculated from this is close to 1. 5%. In fact, there are likely to be multiple results returned.

3 Transitivity (transitive)

Transitivity means that CBO does some equivalent transformations to filter or join conditions, so that the original filter or join conditions that only act on table A can act on table B that does JOIN with A. For example:

Can be converted to:

For this transformation, if there is a selective index on the col1 column of table b, CBO can choose to access the index. This conversion is not done in RBO mode.

In addition to the above situation, there is also the delivery of join:

Convert to

Defects of CBO

From these previous introductions, we can draw a conclusion about the shortcomings of CBO:

1. For complex SQL, it may not be possible to cover all possible execution plans, thus ignoring the best execution plan

two。 Without collecting the histogram, CBO believes that the values of the columns are uniformly distributed. For tables with skewed data, this assumption will be greatly distorted.

3. In the absence of multi-column statistics and extended statistics, CBO believes that columns are isolated, and the calculated selection rate is likely to be distorted in the case of SQL filtering conditions containing multiple columns or join between tables.

We often hear that it is false to use explain,autotrace to get the execution plan from plan table, or that rows is not correct. This is the reason. But the advantage of oracle is that it can constantly improve CBO. As mentioned above, oracle introduced histogram, multi-column statistical information, extended statistical information and other technologies to make up for the shortcomings of the original algorithm. The use of these technologies will also be described in a separate article.

For this article, there are errors or one-sided expression, please point out more; there are ambiguities, you can also tell me, in the next article to do the next explanation.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.