
An analysis of a SQL problem with a 500-plus-line execution plan: from emergency handling to root-cause analysis (Technical Life Series, Issue 18)


Prologue & getting into the role

At 7:00 a.m. on April 24, Old K arrived to support the operations team of a large state-owned bank on the first business day after a production release. Past experience says that each release batch involves dozens of operations such as system software/environment version changes and system migrations, and some problems always surface once business opens on the first working day afterwards. As a critical component of the various business systems, the Oracle databases are usually the focus of management's attention.

At about 8 o'clock, the application maintenance team reported a problem: before opening for business, one system has to send a report to a national regulatory agency, and the report is generated mainly by executing SQL statements in the database. In the past the report took only about 2 minutes to generate, but this time it had been running for 20 minutes with no sign of finishing, and re-submitting it several times made no difference. The front-end counter business opens at 08:30, so the problem had to be solved immediately to avoid delaying the opening.

A quick on-site check showed nothing abnormal in the system as a whole, but the report SQL was taking far too long, spending its time mostly on CPU, which suggested excessive logical reads. While analyzing, Old K also learned what had changed for this system during the release.

Key information:

The database was upgraded and migrated: original version 10.2.0.5, new version 11.2.0.4.

The migration was done with Data Pump export and import.

The migration took place on Saturday afternoon, April 22.

For various reasons, the host where the original database resided had already been shut down.

After the migration, the database had already run some other batch jobs the previous day without incident.

At the moment the report SQL is essentially the only thing running on the system; there is no other online business SQL.

Although this report SQL had run fine before, this is a new environment and the old one no longer exists, so there is nothing to compare against; the analysis has to start from scratch.

Old K first took a snapshot with dbms_workload_repository.create_snapshot, then generated an awrsqrpt report to analyze the statement.
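For reference, a minimal sketch of those two steps in SQL*Plus (the snapshot range and SQL_ID that the script prompts for are specific to the actual incident):

exec dbms_workload_repository.create_snapshot;
-- then generate the per-statement AWR report; the script prompts for
-- the begin/end snapshot IDs and the SQL_ID of the report SQL
@?/rdbms/admin/awrsqrpt.sql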

The statement's logical reads were indeed huge:

You can see:

The statement had been executing for nearly 1,300 seconds and still had not completed.

Its logical reads had already reached 450 million.

Generally speaking, for a SQL performance problem you only need to understand the SQL's business logic, analyze its execution plan, and get a full picture of the data distribution of the relevant tables in order to come up with a solution. This time, however, the time left to Old K was short and the pressure was on.

No matter; follow the routine and look at the statement text and the execution plan first. The thing is, though:

At this point Old K was somewhat stumped: the execution plan alone ran to 568 lines, and the SQL text was so dense that the SQL tool on the operations terminal could not format it at all; it was essentially unreadable. A quick scan of the execution plan did reveal some characteristics:

Features:

The most striking feature is that the statement is abnormally long.

The execution plan contains a large number of UNION ALL branches, so the SQL is evidently a union of a series of fairly simple multi-table joins.

Many tables are involved, roughly 30-40 of them, ranging in size from tens of MB to several GB.

The tables are joined in all sorts of ways: FILTER, NESTED LOOPS, HASH JOIN, SORT MERGE JOIN, and even MERGE JOIN CARTESIAN.

If it were you, how would you analyze it further?

Make a decision quickly & solve the problem

By the time Old K had started checking the problem, talked to the application team, skimmed the statement, and looked over the execution plan, ten minutes had passed and it was already 08:10; there was not much time left. Apart from the execution plan itself, the conclusions so far were as follows:

1. The statement itself should be fine. It has been executed before and ran quickly, which means a good execution plan exists for it.

2. A change of execution plan is generally attributed to inaccurate statistics, changed optimizer parameters, or a changed database version.

3. Given the data center's standards, it can be assumed that the optimizer parameters have not changed; the database version has definitely changed, and the statistics may have changed or may be wrong.

So it seems that what can actually be acted on right now is the statistics. When there is no way to analyze the SQL and its execution plan within a short time, regathering statistics is a scheme worth trying. But that raises a new problem: hundreds of tables are involved, the larger ones tens of gigabytes. Gathering statistics for them one by one and re-submitting the report SQL after each attempt would take far longer than the time available. So what needs to be done is to find the table most likely to be the culprit, gather statistics on it, and then re-submit the batch; with time this tight, it has to work on the first try. Old K closed his eyes and thought quietly for a minute. Sure enough, after that minute he came up with an idea worth trying, and it paid off.

As usual, dear reader, you can picture yourself in the scene described above: what approach would you take to solve this problem? Before reading on, it may be worth scrolling back up to see which pieces of information could help you.

OK, back to the story. Old K's idea was this: the statement's logical reads are huge, but it used to run fast, and normally its logical reads would not be anywhere near this large. If the statistics of some table really are inaccurate, then the logical reads on that table, or on a table joined to it, should be disproportionately large (this sentence matters). And since this SQL is practically the only thing running during this period, why not look at the AWR report for the interval and see which table or index has the largest logical reads?
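The "Segments by Logical Reads" section of the AWR report gives this directly; a roughly equivalent query against the AWR history views might look like the sketch below (the snapshot IDs are placeholders for the interval in question):

select o.owner, o.object_name, o.object_type,
       sum(s.logical_reads_delta) as logical_reads
  from dba_hist_seg_stat s
  join dba_objects o on o.object_id = s.obj#
 where s.snap_id between 1001 and 1002      -- placeholder snapshot range
 group by o.owner, o.object_name, o.object_type
 order by logical_reads desc;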

As you can see, the XXDEALS table accounts for 92.35% of the logical reads of the entire database.

The logical reads on the index XXDEALS_IDX7 also reached 16 million.

And the owner is the same user that runs the report SQL.

Further checking showed that the table does appear in the execution plan. The table is only about 700 MB, which is not large, and its statistics were last gathered on April 7, quite a while ago. So just do it: gather statistics on the table with parallel degree 8. The gathering finished within a minute, and the batch job was re-submitted. The execution plan changed, and the whole batch completed in about 5 minutes.
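The gather itself is a one-liner with DBMS_STATS; a sketch, with a hypothetical schema name:

begin
  dbms_stats.gather_table_stats(
    ownname => 'RPTUSER',    -- hypothetical schema name
    tabname => 'XXDEALS',
    degree  => 8,            -- gather with 8-way parallelism
    cascade => true);        -- refresh the index statistics as well
end;
/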

In the end the report was submitted in time, the business opened essentially on schedule, and everyone was happy.

Looking back at the handling of this problem, Old K played one small trick. When the SQL statement and its execution plan are so long that the problematic step cannot be located directly within the available time, he used the fact that logical reads were extremely high during SQL execution and, with the help of the top-segments-by-logical-reads breakdown in the AWR report, roughly located the table whose statistics were likely misleading the optimizer, gathered statistics on it, and thereby solved the problem. Oracle provides many tools and methods for pinpointing different problems; the key is whether we can exploit the information already at hand and spot the telltale features in it.

Q: can this method find the table with bad statistics accurately every time?

A: not necessarily. You also have to consider the relationship between the driving table and the driven table; for reasons of length this is not expanded on here.

A perfect explanation & a meaningful conclusion

The problem was solved, but because it very nearly delayed the opening of business, the leadership still wanted an explanation, or rather a team to take responsibility. The key questions:

Why was the report SQL never a problem in the past, yet it broke right after the upgrade? What is the cause?

The application maintenance team stated that there had been no large data changes during the release.

Will the same problem recur tomorrow, or later?

By now, however, it was past 08:30; the business was open, and minor issues were popping up one after another in the newly deployed and upgraded systems, all needing database-side checks. For the time being an explanation could only be given verbally, with possibly no time to verify it.

Generally speaking, when a problem is fixed by regathering statistics, the cause is simply that the statistics were stale.

But one thing stale statistics cannot explain: why was there never a problem before, yet one appeared the moment we moved to the new environment? Was something done during the import that should not have been?

Here we need to take a closer look at when the table's statistics were gathered:
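A sketch of the kind of query that exposes this history, using the standard statistics history view (the schema name is a placeholder):

-- when were the table's statistics changed (gathered, imported or restored)?
select table_name, stats_update_time
  from dba_tab_stats_history
 where owner = 'RPTUSER'
   and table_name = 'XXDEALS'
 order by stats_update_time;

-- and which version is current right now?
select table_name, last_analyzed
  from dba_tables
 where owner = 'RPTUSER'
   and table_name = 'XXDEALS';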

You can see that, apart from the statistics we had just gathered, the XXDEALS table had been analyzed twice before: once on April 7 and once on April 22 (the day the data was imported). Yet before our gathering, the table's current statistics were the ones collected on April 7. In other words, the succession of the table's statistics runs:

April 22 -> April 7 -> April 24

Odd, isn't it? Old K was puzzled at first, but after checking the data import log and confirming that the 18:04 April 22 timestamp falls right in the middle of the dump import, the doubt cleared up.

Q: Why?

A: the general order when importing a table is: table data first, then indexes and so on, and the statistics last.

Q: that doesn't answer the question. Does the order matter?

A: it makes all the difference. The first thing to note is that April 22 is a Saturday and falls inside the default window for automatic statistics collection. During the full-database import there is a long gap between the moment the XXDEALS table data finishes loading and the moment its statistics are imported. Within that gap, because a large volume of rows had just been inserted, the automatic statistics job gathered statistics on the table. When the statistics from the dump were finally imported, they overwrote the table's statistics, so the freshly gathered April 22 statistics became history and the April 7 statistics became current.
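Whether the Saturday import really fell inside the automatic statistics window can be checked from the scheduler views; a sketch, assuming the default 11g window names:

-- is the auto stats task enabled in the maintenance windows?
select window_name, optimizer_stats
  from dba_autotask_window_clients;

-- when do the weekend windows open and how long do they stay open?
select window_name, repeat_interval, duration
  from dba_scheduler_windows
 where window_name in ('SATURDAY_WINDOW', 'SUNDAY_WINDOW');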

Having confirmed this statistics timeline, Old K had grounds for the following neat conjecture:

Suppose two tables, A and B, had their statistics gathered at about the same time in the old database, say both around April 7.

Both A and B are imported into the new database while the automatic statistics collection window is open (the April 22 Saturday window).

Table A is imported early: the automatic job gathers fresh statistics on it first, and its old statistics are imported afterwards, so table A ends up with the April 7 statistics.

Table B finishes importing later: its dump statistics (dated April 7) are imported first, and only then does the automatic job gather statistics on it, so table B ends up with statistics dated after the import.

So although the statistics of A and B were consistent in the original database, in the new database they may now deviate from each other considerably.

As a result, a join between A and B in a SQL statement may well get a different execution plan in the new database than in the old one.

Checking, via script, the latest statistics timestamps of the tables involved in the SQL, we found that some tables were analyzed after the import completed and others before it; so the conjecture above is entirely plausible. On the strength of that, we restated to the application maintenance team our operational requirement for every Oracle database data import, which is also our conclusion:

After an import completes, the statistics of the imported tables must be regathered (per table or per schema), and the method_opt parameter is recommended to be set to "repeat" (FOR ALL COLUMNS SIZE REPEAT).
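A sketch of that post-import gathering for a whole schema (the schema name is hypothetical):

begin
  dbms_stats.gather_schema_stats(
    ownname    => 'RPTUSER',                        -- hypothetical schema name
    method_opt => 'FOR ALL COLUMNS SIZE REPEAT',    -- keep the existing histogram definitions
    degree     => 8,
    cascade    => true);
end;
/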

After reporting to the leadership, the explanation held up nicely. Old K rushed off to deal with other problems; verification would have to wait.

So far, the handling of the problem can be summarized as:

The SQL took too long to execute and its logical reads were excessive.

The AWR report quickly pointed to the objects with the largest logical reads.

Statistics were gathered for the objects with the large logical reads.

The SQL was re-executed and completed normally.

During the investigation we learned that the import coincided with the system's statistics collection window, and that this can change the relative statistics of multiple tables in the new environment, which in turn changes execution plans.

Turn complexity into simplicity & simple verification

Once some time freed up, and with such a neat conjecture in mind, Old K set about verifying it. We now have the SQL's execution plans from before and after, so we can look for the cause by comparing them. The statement is still far too long to read directly, but it was confirmed to be a large UNION ALL of fairly simple multi-table joins.

We had gathered statistics only on XXDEALS, so we can focus on how the XXDEALS-related parts of the execution plan changed. But there are dozens of XXDEALS references among the 568 lines of the plan, so how do we quickly locate the part responsible for the explosion in logical reads?

First of all, the XXDEALS table is only about 700 MB, so a single full scan of it costs only about 90,000 logical reads (assuming the default 8 KB block size); even a scan driven through an index would cost at most around 200,000 logical reads for table plus index.

By the same token, if XXDEALS is the driving table, then even if its index usage changed, the extra logical reads from a single scan could not be that large.

The only way a single query block can inflate logical reads on this scale is if XXDEALS changed from being the driving table to being the driven (inner) table of a nested-loops join, or if the index used to probe XXDEALS as the driven table of a nested-loops join changed.

The poor execution plan:

The better execution plan:

Analysis of the execution plans is consistent with the reasoning above: with a hash join, each of the two tables in principle needs to be scanned only once, whereas with nested loops, a wrong cardinality estimate can multiply the logical reads on the driven table enormously.

Searching the problem SQL for the text corresponding to this part of the execution plan gives the following:

By adding hints to the statement so that it runs under both execution plans, and comparing the logical reads and plans, we verified that this piece of SQL is indeed what caused the dramatic drop in overall execution efficiency. In this way, analysis of the execution plan reduced an intractably complicated SQL to a short analysis of the join between two tables.
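A sketch of that hint-based comparison; the second table's name and the join column are stand-ins for the real statement:

-- force the nested-loops shape (the bad plan)
select /*+ gather_plan_statistics use_nl(d) */ count(*)
  from small_tab b, xxdeals d
 where d.deal_id = b.deal_id       -- hypothetical join column
   and d.flagofdeal = 'O';
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

-- force the hash-join shape (the good plan) and compare Buffers / A-Rows
select /*+ gather_plan_statistics use_hash(d) */ count(*)
  from small_tab b, xxdeals d
 where d.deal_id = b.deal_id
   and d.flagofdeal = 'O';
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));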

However, a careful look at the two execution plans shows that the cardinality estimate for the driving table did not change at all, the estimated number of rows being 1 in both. So can our neat conjecture actually be verified here?

Not really. In both execution plans, table B is the driving table and its estimated row count is 1, which affects neither the choice of join method against the driven table XXDEALS (nested loops versus hash join) nor the choice of index used on XXDEALS. At the same time, we confirmed that table B, like XXDEALS, was imported from the original database and that its analysis time is before the import. So the conjecture is only based on experience; it looks like a reasonable explanation, but it is not the final fact.

Based on the above, we can simplify further: the position of the driving table and its estimated rows are unchanged in both plans, so the only thing to focus on is why the poor plan chose nested loops.

In other words, all we need to look at is the table XXDEALS itself: under the relevant maturitydate and flagofdeal predicates, why does one plan use the index while the other does not, and what does that have to do with the statistics?

Pinpointing the problem

To pinpoint the cause, the problem needs to be reproduced, which is fairly easy here. Old K restored the table's earlier statistics with dbms_stats.restore_table_stats and imported two copies of the table definition, new and old (including statistics but without the data), into an 11.2.0.4 test environment, naming them DEALS_0407 and DEALS_0424. The two tables carry the statistics of April 7 and April 24 respectively, and so broadly reflect the table's data distribution on those two dates.
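A sketch of those preparation steps (the schema name and timestamp are placeholders; the year is not given in the original):

-- roll the table's statistics back to the pre-gather (April 7) version
begin
  dbms_stats.restore_table_stats(
    ownname         => 'RPTUSER',    -- hypothetical schema name
    tabname         => 'XXDEALS',
    as_of_timestamp => to_timestamp('2017-04-24 08:00:00', 'yyyy-mm-dd hh24:mi:ss'));  -- placeholder: just before the emergency gather
end;
/

-- move just the definitions and statistics (no rows) to the 11.2.0.4 test system,
-- e.g. with Data Pump:  expdp ... tables=RPTUSER.XXDEALS content=metadata_only ...
-- then keep the two copies as DEALS_0407 and DEALS_0424 in the test schema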

A simple test ruled out any influence from the maturitydate column; the only remaining difference is that the two tables get very different row estimates for the predicate flagofdeal = 'O'.

So how are the figures 317 and 455K arrived at? Let's see whether a 10053 trace can help.
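A sketch of producing the 10053 trace for one of the test tables (the trace file identifier is arbitrary):

alter session set tracefile_identifier = 'flag_10053';
alter session set events '10053 trace name context forever, level 1';

explain plan for
select * from deals_0424 where flagofdeal = 'O';

alter session set events '10053 trace name context off';
-- the optimizer's costing detail is written to the session trace file
-- under the diagnostic trace directory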

Key costing information from the 10053 trace for DEALS_0407:

Key costing information from the 10053 trace for DEALS_0424:

We can see that the total row counts (Original) of the two tables differ very little and the densities (Density) of the flagofdeal column differ somewhat, but the most important difference is the number of buckets (Bkts) in the frequency histogram on flagofdeal: DEALS_0407 has only 5 buckets while DEALS_0424 has 6.

Old K did a quick calculation: the final estimate of 317 for DEALS_0407 equals 1017318 (Original) * 0.000311 (NewDensity). From that it is obvious that the value 'O' is not among the popular values of the histogram.

For DEALS_0424, the final estimate of 455K appears to come from 'O' being treated as a popular value. If so, it could almost be concluded that the data changed: the DEALS_0407 table contained no 'O' rows, while a large number of 'O' rows were inserted or updated in DEALS_0424.

This time, however, with time to spare, Old K was not going to jump to conclusions.

First, let's verify that the estimate of DEALS_0424 is accurate:

Here get_external_value is a custom function that converts endpoint_value into the actual string; it has been published on the "Zhongyi Antu" WeChat public account, and anyone who needs it can follow the account and reply "histogram function" to get it.
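The author's get_external_value function is not reproduced here; the sketch below is a simplified decode along the lines of widely published histogram scripts, and only recovers (approximately) the first six bytes of each endpoint value:

select endpoint_number,
       endpoint_value,
       utl_raw.cast_to_varchar2(hextoraw(
         substr(to_char(endpoint_value,
                        'fmxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'), 1, 12))) as approx_value
  from user_tab_histograms
 where table_name  = 'DEALS_0424'
   and column_name = 'FLAGOFDEAL'
 order by endpoint_number;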

We see that the endpoint_number for the value 'O' is 14164.

So the estimate is 1021005 (Original) * (14164 - 7839) / 14177 ≈ 455K,

which is consistent with the 10053 trace and the execution plan.

On the other hand, the DEALS_0407 histogram has 5 buckets; what values does it contain? Let's take a quick look to see whether the 'O' value really is missing:

The result, however, is not what we expected. DEALS_0407 does indeed have 5 buckets, but one of them is the 'O' value, and estimating the number of 'O' rows by the same method gives about 465K:

So the data did not actually change that dramatically during the migration and upgrade.

Then where is the problem?

Further positioning

Considering that the original SQL executed without problems in the old environment, shouldn't the row estimate for flagofdeal = 'O' have been accurate there? Let's verify that as well:

Indeed: if we add the hint OPTIMIZER_FEATURES_ENABLE('10.2.0.5') to the SQL so that the database uses the 10.2.0.5 optimizer, the number of rows with the 'O' value is estimated correctly.

Likewise, even the 11.2.0.3 optimizer estimates the number of 'O' rows correctly:
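A sketch of that check, run against the test copies described above:

explain plan for
select /*+ optimizer_features_enable('10.2.0.5') */ *
  from deals_0407
 where flagofdeal = 'O';

select * from table(dbms_xplan.display);

-- the same experiment with '11.2.0.3' also yields the correct row estimate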

At this point Old K was even more puzzled: does the problem exist only in 11.2.0.4, and if so, why?

We still have to come back to the row estimate. Why does the 11.2.0.4 optimizer believe there is no 'O' value in the histogram when we can plainly see an 'O' endpoint in the histogram records? Could the get_external_value function be at fault? At that point Old K remembered that string handling is a classic source of optical illusions, namely spaces, so he queried the histogram information once more:

Yes, it is spaces. After converting the values to strings, the two turn out to differ; comparing carefully, the endpoint_value entries in the histograms of the two tables really are different. By comparing the histogram information of other columns in the same tables, we confirmed that normally, when the actual value is the same, the recorded endpoint_value should also be the same:

Finally, after repeated tests, it was pinned down: whenever a column is of type CHAR and its defined length is less than 8, the histogram endpoint_value recorded by 11.2.0.3 (and earlier) is inconsistent with the one recorded by 11.2.0.4. The test process is not shown here.

Locating the bug

At this point it could basically be identified as a change in Oracle, at least a change between 11.2.0.3 and 11.2.0.4; the next task was to find Oracle's official explanation. Old K had searched MOS several times before, but perhaps because the problem had not yet been located precisely enough, the keywords were never "critical" enough: each search returned piles of results that could not all be read, and nothing came of it. This time, searching MOS with the keywords "char endpoint_value different" quickly turned up the article "Bug 18550628: AFTER UPGRADE TO 11.2.0.4 QUERIES USING CHAR FIELDS CAN PERFORM POORLY", which matched our findings very well. Opening it, however, showed no real content; it looks like a bug note created from an SR. Reading it closely, though, you can see that it points to another article:

Looking up 18255105 leads to the article "Patch for upgrade scripts to identify histograms affected by fix of bug 15898932".

The description is as follows:

What did you take away?

Ladies and gentlemen, what did you take away from reading this? Old K, for one, got a great deal out of it, and being tired will say no more.

This article is reproduced from the "Zhongyi Antu" public account.
