2025-03-29 Update From: SLTechnology News & Howtos > Database
Shulou (Shulou.com) 06/01 Report
Preface
Before I knew it, the "Me and the Data Center" series in Tech Life has reached its fifth installment. Xiao y is back!
Previous installments mainly shared TroubleShooting cases. In fact, Xiao y's specialty is performance optimization, so starting with this installment, Xiao y will gradually share more database performance optimization cases.
To get to the point: if your end-of-day batch jobs (approval, clearing, reports, and so on) are sometimes fast and sometimes slow, or have been slow since a certain day, what would you do as an operations DBA or developer? And beyond solving the problem, how do you answer the question leaders care about most: "Why is there a problem now, when there was no problem before?"
What Xiao y shares today is the process of analyzing and solving exactly this kind of performance problem.
Your likes and retweets are the motivation for Xiao y to keep sharing.
More hands-on Oracle database experience is published first on the "Zhongyi Antu" official account. Welcome to follow it.
In addition, some friends recently asked whether Xiao y's team can provide third-party Oracle services externally, and the answer is YES!
Interested friends can add Xiao y on WeChat (ID: shadow-huang-bj). Xiao y hopes to make more friends and help more people in need.
Part 1
Here comes the problem.
"Xiao y, do you have a minute? Can you look at the performance of a report with us later?"
A SQL statement had started, about a week earlier, to deteriorate sharply: its execution time went from under 10 minutes to more than 10 hours.
Xiao y had just finished an Oracle training session at the customer site, so the problem came at the perfect time to show the customer how theory translates into practice! Xiao y's first thought was that the SQL statement's execution plan had changed; usually you can quickly find clues in the statistics or in the CBO's cardinality estimates, pinpoint the cause, and solve it.
As it turned out, Xiao y had thought too simply at first. The customer could not fix this problem by re-gathering statistics or by restarting the database. Fortunately, Xiao y switched into research mode in time, found the root cause within an hour, and the fix then followed naturally.
Environment:
Operating system: Red Hat Linux, 64-bit
Database: Oracle 11.2.0.3, 2-node RAC
Part 2
Analysis process
2.1 The complete SQL statement
Xiao y has masked sensitive information and simplified the SQL. As you can see:
The SQL joins two tables and then does a GROUP BY.
Of the two joined tables, one is a small table of 80 MB and the other is a larger table of 3.5 GB, with 1.6 million and 8 million rows respectively.
The SQL statement uses hints to tell the optimizer to join the tables with a hash join and to access the small table with a full table scan.
For such a SQL, in Xiao y's experience, as long as the small table is chosen as the driving table, the execution time of the whole HASH JOIN is basically the sum of the single-table access times of the two tables. The two tables total less than 4 GB, which can usually be finished within 5 minutes. This matches the previous execution times the customer described.
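As a rough sanity check of that expectation, assume a sequential scan rate of about 30 MB/s (the rate observed later in section 2.6 while the query was still healthy; purely an illustrative figure):

```python
# Rough sanity check of the "about 5 minutes" expectation for this hash join.
# Assumption (illustrative, not a measurement): ~30 MB/s multiblock-read rate.
small_table_mb = 80
big_table_mb = 3.5 * 1024  # 3.5 GB expressed in MB
scan_rate_mb_per_s = 30    # assumed scan throughput

scan_seconds = (small_table_mb + big_table_mb) / scan_rate_mb_per_s
print(f"pure scan time: ~{scan_seconds / 60:.1f} minutes")
# With CPU time for hashing, probing, and the GROUP BY added on top of the
# scans, a total of around 5 minutes is plausible for a healthy hash join.
```

The point is simply that a well-behaved hash join over less than 4 GB of data is a minutes-scale job, nowhere near 10 hours.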
Here, by the way:
Many developers write incomplete hints. This one, for example, specifies only the join method and the access path of one table; it says nothing about the join order, so it does not fully pin down the execution plan.
Next, Xiao y checked whether the execution plan had changed and whether it was correct.
2.2 The execution plan
You can see:
The execution plan Oracle actually chose is indeed consistent with the hints:
The table join method is a hash join.
The single-table access path is a full table scan (TABLE ACCESS FULL).
The join order uses the small table as the driving table (the in-memory hash build table).
This is a perfect, optimal execution plan. The only minor blemish is that the plan's steps estimate that some temporary tablespace will be used for the hash join and the hash group by, but that is only the CBO's estimate and does not mean it will actually happen.
Compared with the earlier execution plan, it is identical.
Since the execution plan is correct and has not changed, the next step is to break down the SQL's execution time and see whether it goes to CPU, IO, cluster waits, concurrency contention, and so on.
2.3 Statistics related to the SQL execution
You can see:
1. Each execution takes 39,615 seconds, more than 10 hours.
2. Each execution performs only 45,276 logical reads (blocks).
3. Each execution performs 451,421 physical reads (blocks).
4. The time is spent almost entirely on CPU, reaching 38,719 seconds (more than 10 hours), while the time spent on IO, cluster, application (lock), and concurrency waits is tiny.
2.4 First brainstorming
At this point, experienced DBAs may already have spotted some strange phenomena in this CASE.
Still, to take care of everyone, let's first answer a few questions that may be on readers' minds.
2.4.1 Are there any abnormal wait events?
Seeing this, someone may ask:
are there any abnormal wait events during the execution of this SQL?
The answer is NO!
The execution time of the entire SQL is spent almost entirely on CPU, reaching 38,719 seconds (more than 10 hours), while the time spent on IO, cluster, application (lock), and concurrency waits is tiny (less than 100 seconds in total). When a SQL is running on CPU, wait events yield no clues. The time breakdown is shown in the following figure.
2.4.2 Is a one-pass/multi-pass hash join causing the slowness?
Some may ask:
temp tablespace usage appears in the execution plan; is a one-pass or multi-pass hash join making the SQL slow?
The answer is NO!
First, the temp tablespace usage shown in the execution plan (hash join one-pass/multi-pass) is only the CBO's pre-execution estimate; the actual execution may well not touch temp at all.
Second, if temp tablespace really were used and became the bottleneck of the whole SQL, we would see a lot of direct path read temp / direct path write temp waits. Those wait events are counted in the IO class, so the SQL's execution time would then be dominated by IO rather than CPU.
2.4.3 Xiao y's doubts
At this point, Xiao y began to feel that this case deserved more focused attention!
Execution time spent mostly on CPU usually means that most of the required data is already in memory.
As a rule of thumb, when the required blocks are in memory, a CPU can process 100,000 or even several hundred thousand logical reads per second!
But this SQL spent 10 hours of CPU time processing only about 450,000 logical reads!
Those 450,000 logical reads correspond to roughly 4 GB, the combined size of the two tables.
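The mismatch can be made concrete with quick arithmetic (Oracle's default 8 KB block size and the 100,000-reads-per-second rule of thumb are assumptions carried over from the text):

```python
# Sanity-check the figures above, assuming Oracle's default 8 KB block size.
block_size_kb = 8
logical_reads = 450_000

total_gb = logical_reads * block_size_kb / 1024 / 1024
print(f"~{total_gb:.1f} GB")  # roughly the combined size of the two tables

# At the rule-of-thumb rate of 100,000 logical reads per second, this many
# reads should take only seconds of CPU time, not 10 hours:
expected_seconds = logical_reads / 100_000
print(f"expected ~{expected_seconds:.1f} s of CPU, observed ~38,719 s")
```

Seconds of expected CPU work versus ten hours of observed CPU time: the gap is four orders of magnitude, which is exactly what made Xiao y suspicious.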
Something strange was clearly going on. Xiao y needed to:
compare against a breakdown of historical execution times, and
run the SQL again to collect more clues.
2.5 Historical performance comparison and confirmation
You can see:
At first it processed about 440,000 logical reads per hour, but then it slowed down.
For most of the time, only 1,000-3,000 logical reads were processed per hour.
The execution time is indeed entirely on CPU!
Unfortunately, since AWR reports are retained for only 7 days, the time breakdown of the original fast executions could not be obtained, so normal and abnormal periods could not be compared directly. Fortunately, this is a SELECT statement that can be run directly to reproduce the problem, so Xiao y could observe more clues live.
2.6 Real-time clues from reproducing the problem
Xiao y ran the SQL again and watched from other sessions. For the first minute or so everything was normal: the small table and then the large table were being read, with IO at almost 30 MB per second. Then IO dropped sharply, and the session showed as ON CPU.
Xiao y immediately checked the SQL's progress: in v$session_longops, the scan of SMALL_TABLE had finished, but the full table scan of BIG_TABLE was stuck at 82%! Looking closely, it was still advancing, just very slowly, as shown in the following figure.
For the full table scan of BIG_TABLE, SOFAR increased by roughly 1 every 5 seconds!
At this rate it would take (442460 - 362690) * 5, about 400,000 seconds, well over 10 hours! This is consistent with the section "Historical performance comparison and confirmation"!
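The extrapolation from those v$session_longops figures is simple arithmetic:

```python
# v$session_longops showed SOFAR advancing by ~1 block every 5 seconds,
# with TOTALWORK = 442460 and SOFAR = 362690 at the time of observation.
totalwork = 442460
sofar = 362690
seconds_per_block = 5  # observed rate

remaining_seconds = (totalwork - sofar) * seconds_per_block
print(f"~{remaining_seconds:,} seconds remaining, i.e. well over 10 hours")
```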
One hint here: the slow progress has nothing to do with IO performance. As analyzed above, the time is being spent on CPU.
Readers, you can pause here for a moment, summarize the phenomena above, and think about them for a few minutes.
If you were the one handling this CASE, how would you proceed?
↓
↓
↓
↓
↓
↓
↓
↓
↓
Don't go away; there's more to come.
2.7 How to proceed when wait events offer no clues
Since the SQL execution is on CPU, wait events will yield no clues. And when the time is on CPU, you have to look at the call stack; this is a habit Xiao y has kept for many years.
Using oradebug short_stack, Xiao y took three samples a few seconds apart, as shown in the following figure:
By this point, Xiao y had gathered all the evidence he needed.
Time to step outside for a cigarette and let all the clues settle in his head.
Fortunately, one cigarette later, Xiao y had figured everything out: every observed phenomenon could now be explained, and Xiao y also had an answer to the question leaders care about most, "Why is there a problem now, when there was no problem before?"
Readers are advised to pause here and think for a few minutes to see whether you have found the cause.
If you have already found the answer at this point, consider sending in a resume: it shows you have strong analytical skills and experience! You are welcome to join the Oracle service team at Zhongyi Technology. Please send resumes to 51994106@qq.com
2.8 Second brainstorming
During the short smoke break, Xiao y kept turning over several questions:
Why were only a few thousand logical reads processed per hour?
What was the CPU time of this SQL execution actually being spent on?
Why did the problem appear now and not before?
The function qerhjWalkHashBucket, shown in the figure below, explains everything. qerhjWalkHashBucket means that the process is traversing the rows in a hash bucket while performing a hash join.
So Xiao y walked through the Hash Join principle again, taking a join of two tables A and B as an example:
SELECT * FROM A, B
WHERE A.ID = B.ID
Inside ORACLE, the execution process can be simplified as follows:
SCAN A (scan table A, the driving table)
HASH(A.ID): hash each row into a bucket; the buckets stay in the PGA hash area waiting to be matched
SCAN B (scan table B)
HASH(B.ID): locate the corresponding bucket, compare the actual values of the join columns, and return or discard the row
The purpose of the HASH is to pre-distribute the data into individual buckets. Every algorithm has its strengths and weaknesses.
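The steps above can be sketched in a few lines of Python (an illustrative toy, not Oracle's actual implementation):

```python
# Minimal hash-join sketch: build a hash table on A's join key, probe with B.
from collections import defaultdict

def hash_join(table_a, table_b, key):
    # Build phase: hash the driving (smaller) table A into in-memory buckets.
    buckets = defaultdict(list)
    for row in table_a:
        buckets[hash(row[key])].append(row)
    # Probe phase: hash each row of B, walk the matching bucket, and compare
    # the actual join-column values (collisions make this check necessary).
    result = []
    for row_b in table_b:
        for row_a in buckets.get(hash(row_b[key]), []):
            if row_a[key] == row_b[key]:
                result.append({**row_a, **row_b})
    return result

a = [{"id": 1, "a_val": "x"}, {"id": 2, "a_val": "y"}]
b = [{"id": 2, "b_val": "p"}, {"id": 3, "b_val": "q"}]
print(hash_join(a, b, "id"))  # only the id=2 rows match
```

Note the inner loop of the probe phase: that walk over a bucket's entries is exactly what qerhjWalkHashBucket does inside Oracle.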
So what is the weakness of HASH JOIN?
Did we hit exactly that weakness?!
2.9 The truth surfaces
Obviously, when one in-memory bucket built from the driving table holds a large number of rows, every probe row that hashes to that bucket must traverse all of the rows in it, much like a nested loop, so a single value comparison takes a very long time!
One block of the probed table holds dozens to hundreds of rows, and every one of those rows has to be compared against a bucket containing a huge number of entries. Naturally only a few thousand logical reads get processed per hour, which is also why v$session_longops.SOFAR rose so slowly!
At the same time, this massive number of join-column comparisons burns CPU (the driving table has already been read into the PGA, so walking the bucket stays entirely in PGA memory).
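The degradation can be reproduced in miniature. The sketch below is a simplification (fixed bucket count, integer keys) and only counts the comparisons a probe phase would perform, but it shows how a single hot key inflates the work:

```python
from collections import defaultdict

def probe_comparisons(build_keys, probe_keys, n_buckets=1000):
    # Build phase: distribute the build (driving) side into hash buckets.
    buckets = defaultdict(list)
    for k in build_keys:
        buckets[hash(k) % n_buckets].append(k)
    # Probe phase: each probe row must walk its entire bucket; here we only
    # count those comparisons instead of performing them.
    return sum(len(buckets[hash(k) % n_buckets]) for k in probe_keys)

# Same probe side in both runs; only the build-side distribution changes.
probe = [0] * 45_000 + list(range(1, 5_001))
uniform_build = list(range(100_000))                  # evenly spread keys
skewed_build = [0] * 90_000 + list(range(1, 10_001))  # 90% share key 0

print(probe_comparisons(uniform_build, probe))  # 5,000,000
print(probe_comparisons(skewed_build, probe))   # ~4,050,950,000: one hot bucket
```

Same row counts on both sides, yet the skewed build side forces roughly 800 times more comparisons, all of it pure CPU work inside one bucket.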
So why did the problem not appear before? Because the join column of the driving table used to be evenly distributed! From a certain day on, its distribution became skewed!
Xiao y issued a SQL to verify this, as follows:
As you can see, the number of rows with id=0 in the driving table small_table reaches 170,000. That means one bucket holds at least 170,000 rows, which runs completely counter to hash join's design intent of spreading the data across buckets so that each bucket holds only a handful of rows (usually no more than a few).
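The check Xiao y ran is essentially a GROUP BY on the join column, something along the lines of SELECT id, COUNT(*) FROM small_table GROUP BY id ORDER BY 2 DESC. The same idea in Python, with made-up data mimicking the case (the 170,000 and 1.6 million figures are from the article; the data itself is hypothetical):

```python
# Skew check on the join column: group by value and look at the top counts.
from collections import Counter

# Hypothetical data shaped like the case: 170,000 rows with id=0
# inside a 1.6-million-row table.
join_keys = [0] * 170_000 + list(range(1, 1_430_001))
top = Counter(join_keys).most_common(3)
print(top[0])  # (0, 170000): one value dominates the distribution
```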
At this point, all the questions have been satisfactorily answered!
2.10 Further verification
Adding the filter condition small_table.id != 0 to the SQL reduces the small_table row count only slightly, from 1.6 million to 1.43 million; yet with this change the SQL finishes in about 3 minutes!
This confirms the weakness: hash join is a poor fit when the join column of the driving table is heavily skewed!
2.11 Solutions
Once you know the cause, many solutions are possible!
Since hash join is a poor fit when the join column of the driving table is skewed, the options include:
1) Use a use_merge hint instead of use_hash. If the program cannot be modified, the execution plan can be pinned with a SQL profile. Neither table is large, so a sort-merge join is fast.
2) Investigate, confirm, and clean up the small_table.id=0 data in the driving table: find out why a large amount of id=0 data suddenly appeared on a certain day, and whether it can be deleted.
……
2.12 Lessons from experience
You can see:
Mastering the principles is essential.
Its architecture, algorithms, and storage structures determine what a system can do well and what it cannot.
But have you thought about its weaknesses? If not, Xiao y suggests giving it a try; you will gain a lot from it.
This article is reprinted from Yi Antu.
© 2024 shulou.com SLNews company. All rights reserved.