
Replaying a historical Oracle fault: a case study -- Technical Life Series No. 37 -- stories of me and the data center


On Friday night, November 25, I was watching TV at home when the phone rang. It was a senior DBA at a bank customer, who is also a good friend and brother of mine. It seemed he was in trouble.

"Yuanbang, during the routine approval run on the evening of November 20, there were several online transaction timeouts in one minute! At that time, the CPU was not high, the memory was very abundant, and there was no page change. The abnormal SQL_ID is 15vru8xqgdkjf, which is the transaction message table inserted by our online trade fair, which has nothing to do with the batch. The most serious stroke, is 2016-11-20 20:44:04-20:44:34, the time reached 30 seconds. What these transactions do is simply INSERT INTO VALUES, insert the transaction message table. I have sent the relevant information about ASH dump to your email, please help analyze the reason as soon as possible. If you need any information, you can call me again. "

Careful readers may have noticed a lot of information packed into this passage:

The fault occurred five days ago. Can it still be investigated?

The fault lasted less than one minute. Can it still be investigated?

How can a simple INSERT INTO ... VALUES take 30 seconds?

In Oracle, the answer is YES!

What enables us to complete this verification is the ASH dump the customer mentioned.

The root cause analysis of this case may overturn the operations "experience" of many people: "I didn't expect that doing this would have such a big impact on a highly concurrent system." There may be plenty to gain in it, so let's read on together.

Starting the analysis journey

one

A quick ASH primer

Opening the email, I found not only the AWR and ASH reports,

but also a dump named ash_1120_2040, that is, an ASH DUMP. How sweet!

The dump is effectively a copy of DBA_HIST_ACTIVE_SESS_HISTORY, which records the SQL and wait events of active sessions at a 10-second granularity and is not lost on restart. By contrast, gv$active_session_history samples every second, but because it lives in memory it is lost on restart and holds only a limited number of samples.

This is the ASH feature: Active Session History.

For example:

Suppose a session runs a SQL statement from second 4 to second 34, 30 seconds in total.

Because the persisted ASH samples at 10-second intervals, the session might be captured 4 times (at seconds 4, 14, 24 and 34) or only 3 times (at seconds 5, 15 and 25; the run ends at second 34, so the next sample at second 35 no longer sees the session).
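For reference, a minimal sketch of how the two ASH sources are usually queried (standard view names; column availability varies slightly by Oracle version):

-- In-memory ASH: 1-second samples, lost on instance restart
SELECT sample_time, session_id, sql_id, event
  FROM gv$active_session_history
 WHERE sample_time > SYSDATE - 1/24;

-- Persisted ASH (AWR): roughly 1 sample in 10 is kept, survives restart
SELECT sample_time, session_id, sql_id, event
  FROM dba_hist_active_sess_history
 WHERE sample_time > SYSDATE - 1;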

two

A first look at ASH: the active session trend

After importing the ash_1120_2040 dump into my own environment,

I immediately issued the following query to get the trend of the number of active sessions over time.
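A sketch of such a query, assuming the dump was loaded into a table named ash_1120_2040 with the same columns as DBA_HIST_ACTIVE_SESS_HISTORY:

-- count active sessions per sample time to see the trend
SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS sample_time,
       COUNT(*)                           AS active_sessions
  FROM ash_1120_2040
 GROUP BY TO_CHAR(sample_time, 'hh24:mi:ss')
 ORDER BY 1;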

The result is drawn as shown in the following figure:

You can see:

The roughly 30 seconds between the two red vertical bars is the period when some online transactions timed out.

However, a few minutes before the problem, in the part framed in red, the number of active sessions had already climbed. The customer confirmed that this is when the batch job kicked off, and the batch is processed concurrently by multiple processes.

three

What were the processes waiting for during the transaction timeouts?

Next, we issue the following query to get the wait events during the transaction timeout window.
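A sketch of this query against the same imported table, with the time window taken from the customer's description:

-- wait events during the 30-second timeout window
SELECT event, COUNT(*) AS cnt
  FROM ash_1120_2040
 WHERE sample_time BETWEEN TIMESTAMP '2016-11-20 20:44:04'
                       AND TIMESTAMP '2016-11-20 20:44:34'
 GROUP BY event
 ORDER BY cnt DESC;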

The results are as follows:

As you can see, they are all waits related to the RAC global cache. To be clear, not knowing RAC does not stop you from reading and understanding this case. Through this case, Xiao y mainly wants to pass on the methods and ideas for investigating difficult historical problems, as well as the courage to keep challenging your own common sense.

You can see:

The first is gc buffer busy acquire, which means that when a node in the cluster, say node 1, requests a BLOCK from node 2, another session on node 1 has already requested the same BLOCK ahead of it, so even placing the request is busy: busy acquire.

What does this wait mean?

Two or more sessions requested the same BLOCK at the same time and collided.

Common reasons:

It is usually a matter of SQL efficiency. If the SQL is efficient enough, it does not need to request so many BLOCKs, and it will not collide with other sessions requesting the same BLOCK at the same time.

But here it is obviously not a question of SQL efficiency, because this is just a simple INSERT INTO ... VALUES statement; there is no such thing as a bad execution plan for it.

Obviously, the session that requested the BLOCK from node 2 ahead of us ran into some exception along the way, and that is what led to the problem.

If it were you, how would you analyze it next?

four

Root cause analysis: finding the blocking source

When we do root cause analysis, a common approach is to sort out the relationships between processes. After finding the source of the blocking, we analyze what it was doing and why it was so much slower than usual that it caused the problem.

At this point, some readers may ask, "Can I find out which session on that node requested the BLOCK before me? I want to see what the session in front of me went through."

The answer is YES!

The blocking_session column in ASH identifies "which session requested the BLOCK before me".

Next, we issue the following query to find out, for the hundreds of sessions waiting on gc buffer busy acquire at that moment, which BLOCK (p1/p2) they were requesting and which session was blocking them (blocking_session).
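Something along these lines, again against the imported table (for gc buffer busy acquire, p1 is the file number and p2 the block number):

-- who is requesting which block, and who is blocking them
SELECT p1 AS file#, p2 AS block#,
       blocking_inst_id, blocking_session,
       COUNT(*) AS waiting_sessions
  FROM ash_1120_2040
 WHERE event = 'gc buffer busy acquire'
   AND sample_time BETWEEN TIMESTAMP '2016-11-20 20:44:04'
                       AND TIMESTAMP '2016-11-20 20:44:34'
 GROUP BY p1, p2, blocking_inst_id, blocking_session
 ORDER BY waiting_sessions DESC;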

As you can see, at the moment things started to go wrong, i.e. 20:44:04,

node 1 had 82 sessions requesting the same BLOCK and waiting on gc buffer busy acquire, all blocked by the session with SID 3200 on node 1. In other words, the session with SID 3200 on node 1 put in its request to node 2 ahead of those 82 processes.

five

Many people fall on the way to digging for the treasure...

Pressing our advantage, let's see what the session with SID 3200 on node 1 was waiting for and who was blocking it.

And so on recursively, until we find the source: who, in the end, is the one holding the pass that ten thousand men cannot force open...
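A sketch of the recursive chase, starting from SID 3200 on node 1 (session_serial# is omitted for brevity; in practice you would include it):

-- what is SID 3200 on node 1 waiting for, and who blocks it?
SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS t,
       event, p1 AS file#, p2 AS block#,
       blocking_inst_id, blocking_session
  FROM ash_1120_2040
 WHERE instance_number = 1
   AND session_id      = 3200
   AND sample_time BETWEEN TIMESTAMP '2016-11-20 20:44:04'
                       AND TIMESTAMP '2016-11-20 20:44:34'
 ORDER BY sample_time;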

The results are as follows:

You can see:

There is an anomaly! The session with SID 3200 on node 1 requested file number 47, block number 1540116, and could not get it quickly. The wait event was gc current block busy, and the request lasted more than 20 seconds, from 20:44:04 to 20:44:24. But here comes the tragedy: there is no way to tell who its blocker was.

Yes, ASH is a treasure, but many people fall on the way to digging it out.

So how do we proceed? Where exactly is the path to the treasure?

Sorry that Xiao y has led everyone into this dead end. In reality, many friends really are stuck repeating this dead end. If you do not want to repeat it and would like a guide, you are welcome to join our series of training courses on the road to becoming an expert.

six

Follow Xiao y and start again.

In fact, as IT technicians, we sometimes overcomplicate things when handling problems: we focus too much on technique, pay too much attention to what the official documents say, and without noticing walk ourselves into a dead end. The simplest way to investigate a problem is to follow your own train of thought, as if you were talking it through. When the official documents cannot answer a question, life itself can often give us the answer. Having no fixed technique can beat having techniques.

- Zhongyi Technology Xiao y (Huang Yuanbang)

Xiao y likes to keep complex things simple. Since the customer mentioned that an online transaction inserting into the message table ran from 20:44:04 to 20:44:34 and took 30 seconds, let's start from there.
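A sketch of this starting point, i.e. pulling every sample of the customer's SQL_ID out of the imported dump:

SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS t,
       session_id, sql_exec_id, event,
       blocking_inst_id, blocking_session
  FROM ash_1120_2040
 WHERE sql_id = '15vru8xqgdkjf'
 ORDER BY sample_time, session_id;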

The results are as follows

You can see:

SQL_ID = '15vru8xqgdkjf' was sampled 4 times in ASH, indicating an execution time of over 30 seconds, exactly matching the customer's description. And then? If it were you, how would you keep looking? It is worth pausing to think about it for a few dozen seconds.

seven

Fall down again?

Next, let's follow this slowest execution through sql_exec_id, the unique identifier of a single execution of a SQL statement:
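A sketch of this step; :exec_id stands for the sql_exec_id of the slowest execution found above (the concrete value is not given here):

SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS t,
       session_id, event, p1, p2,
       blocking_inst_id, blocking_session
  FROM ash_1120_2040
 WHERE sql_id      = '15vru8xqgdkjf'
   AND sql_exec_id = :exec_id   -- the slowest execution identified in the previous step
 ORDER BY sample_time;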

The results are as follows:

Indeed, the longest execution was waiting on "gc current retry" between seconds 04 and 24.

"It's over, I've never come across this wait event." Well, neither have I. If it were you, how would you look it up? It is worth stopping for a few minutes and trying Baidu, Google or MOS to see whether you can find a way forward. In the end, did you fall down here again? Please leave a message and tell Xiao y...

The answer comes later. When to turn back to it is up to you.

eight

Asking life for directions, asking life for answers

In fact, Xiao y did not know either who the long transaction's "gc current retry" was waiting for, because "gc current retry" turns up no explanation on Baidu, Google or METALINK.

At times like this, Xiao y is used to asking life for the answer.

"gc current retry", read literally, means requesting the current BLOCK, failing to get it, and retrying. Since we are requesting a BLOCK from node 2, cannot get it, and have to retry, we need to find out who on node 2 is responsible for supplying that BLOCK. That's right: the LMS processes of node 2!

Next, it is only reasonable that we check what the LMS processes were doing during the problem window.

You can simply issue the following query
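For example, something like this, assuming the PROGRAM column of the dump identifies the LMS background processes as it does in DBA_HIST_ACTIVE_SESS_HISTORY:

-- what were node 2's LMS processes doing during the problem window?
SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS t,
       program, session_state, event
  FROM ash_1120_2040
 WHERE instance_number = 2
   AND program LIKE '%(LMS%'
   AND sample_time BETWEEN TIMESTAMP '2016-11-20 20:44:04'
                       AND TIMESTAMP '2016-11-20 20:44:34'
 ORDER BY sample_time, program;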

The result is shown in the following figure

You can see that the LMS processes were all waiting on gcs log flush sync. That is, before an LMS process can ship a BLOCK to another node, the changes to that BLOCK in the log buffer must first be flushed to the online redo log on disk; unfortunately, this did not complete for a long time. So BLOCKs could not be shipped to node 1 quickly, and node 1 kept retrying. The blocker behind "gcs log flush sync" is obviously our old friend the LGWR process, the writer of the log buffer.

nine

Smooth sailing from here

Having crossed this hurdle, the rest of the road is obviously wide open.

Next, let's see whether the LGWR process of node 2 was too busy to attend to the LMS processes' log flush requests. Issue the following query:
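A sketch, using the same PROGRAM-based filter for the LGWR background process:

-- what was node 2's LGWR waiting for, and who blocked it?
SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS t,
       event, blocking_inst_id, blocking_session
  FROM ash_1120_2040
 WHERE instance_number = 2
   AND program LIKE '%(LGWR)%'
   AND sample_time BETWEEN TIMESTAMP '2016-11-20 20:44:04'
                       AND TIMESTAMP '2016-11-20 20:44:34'
 ORDER BY sample_time;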

The results are as follows:

You can see:

The LGWR process of node 2 was waiting on enq: CF - contention, i.e. it needed to write the control file (control files are written during a log switch, for example), but someone had got there first: session 1369 of node 1. In the meantime, the LGWR process just kept waiting for the control file enqueue and could not get it for dozens of seconds.

Pressing on, we issue the following query to see why session 1369 of node 1 held the control file enqueue for so long.
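A sketch of this query; the PROGRAM column tells us which process session 1369 is:

-- who is session 1369 on node 1, and what is it doing?
SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS t,
       program, event, p1, p2, p3
  FROM ash_1120_2040
 WHERE instance_number = 1
   AND session_id      = 1369
   AND sample_time BETWEEN TIMESTAMP '2016-11-20 20:44:04'
                       AND TIMESTAMP '2016-11-20 20:44:34'
 ORDER BY sample_time;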

The results are as follows:

As you can see, sid=1369 is the CKPT checkpoint process of node 1.

CKPT was reading and writing the control file, and therefore held the control file enqueue.

Its wait events were control file reads and writes (control file sequential read / control file parallel write); it was not blocked, and the event kept changing. Normally, if the IO is fast, the control file enqueue is held only very briefly.

ten

Root cause analysis: checking the source of the IO

Further check the performance of IO

As can be seen from the OSWatcher data, during the problem window many disks were 100% busy, with average response times above 200 ms. The IO was mostly writes, although not every disk was at 100%; hdiskpower11, for example, still had a small amount of reads and writes.

Checking the source of the heavy write IO

Check for wait events in ASH
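A sketch, grouping the window's samples by program and event to see where the write IO comes from:

-- which processes are doing the writing?
SELECT program, event,
       COUNT(DISTINCT session_id) AS sessions
  FROM ash_1120_2040
 WHERE sample_time BETWEEN TIMESTAMP '2016-11-20 20:44:04'
                       AND TIMESTAMP '2016-11-20 20:44:34'
   AND event LIKE '%write%'
 GROUP BY program, event
 ORDER BY sessions DESC;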

The results are as follows:

You can see that 24 processes were waiting on db file parallel write at the problem moment. These are clearly the DBWR processes: the database has 24 DBWR processes in total, and all of them were active, indicating that some action had triggered a massive flush of dirty blocks.

eleven

The truth comes out: who triggered the massive dirty block writes?

Checking the alert log, we find the command alter system archive log all (current), which inevitably triggers DBWR to flush a large number of dirty blocks, making the disks extremely busy for a short period; the HBA card bandwidth may also have been saturated.

At this point, you might as well stop and think: who issued the archive log all (current) command that triggered the full checkpoint?

Think about it for dozens of seconds...

The answer is the RMAN backup script that we are all familiar with!

To be clear, the truth is not what you might think! It is not the IO generated by the backup itself that made the disks busy!

Because, as we saw earlier, the 100% busy disks were busy with writes, not reads, and RMAN mainly reads! Moreover, if it were RMAN's read IO that made the disks busy, the impact could not have lasted only about 30 seconds!

It was the alter system archive log all (current) command at the beginning of the RMAN script that triggered DBWR to flush a large number of dirty blocks, making the disks extremely busy. With the control files and data files sitting on the same disks, the CKPT process became very slow after acquiring the control file enqueue, which in turn blocked the LGWR process; LGWR then could not respond to the LMS processes, and finally the GC waits stretched to tens of seconds, which is exactly the large number of online transaction timeouts. Once the full checkpoint completed, transactions returned to normal!

How can we verify this further, that is, the impact of alter system archivelog..?

Next, issue the following command to verify:
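A sketch of the verification, looking in the dump for sessions whose PROGRAM or MODULE identifies RMAN around the problem window (how RMAN shows up in your ASH data may vary):

-- was RMAN already running before 20:44:04, and still running at 20:44:34?
SELECT TO_CHAR(sample_time, 'hh24:mi:ss') AS t,
       session_id, program, module, event
  FROM ash_1120_2040
 WHERE (LOWER(program) LIKE '%rman%' OR LOWER(module) LIKE '%rman%')
   AND sample_time BETWEEN TIMESTAMP '2016-11-20 20:43:00'
                       AND TIMESTAMP '2016-11-20 20:45:00'
 ORDER BY sample_time;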

As you can see, RMAN was launched just before second 04, and at 20:44:34 the RMAN process was still running, yet the transactions had already returned to normal. In other words, it was the full checkpoint triggered at the start of the RMAN script, superimposed on the large number of dirty blocks produced by the batch's parallel writes, that caused the problem!

Why was it all right before?

It was confirmed that on that day the backup had been moved earlier and overlapped with the batch. But the problem was not caused by the read IO of the RMAN backup. It was the full checkpoint triggered by the alter system archivelog.. command at the beginning of the RMAN backup script, superimposed on the large number of dirty blocks produced by the batch's parallel writes, that caused roughly 30 seconds of trouble, during which the INSERT online transactions we examined timed out. That is the truth of the problem and its root cause!

twelve

It's not over. Are you sure this is the truth of the problem?

Are you sure this is the truth of the problem?

You are welcome to leave a message with your thoughts. You can also add Xiao y on WeChat (shadow-huang-bj) to join the study group. The answer will be revealed in the next issue of the Technical Life series.

Cause of the problem and summary of experience

Summary of causes of failure:

The failure happened because a large number of dirty blocks were produced during the batch run.

The CKPT process acquired the control file enqueue and then started its IO on the control file,

but during this period the RMAN backup issued the alter system archive log command, which triggered the DBWR processes to write data files heavily, driving the disks to 100% busy and preventing the control file reads and writes from completing quickly.

Because of a log switch, the LGWR process needed to acquire the control file enqueue, so the CKPT process blocked LGWR. LGWR sat in the enq: CF - contention state and could not respond to the log flush requests of the LMS processes, which handle GC between nodes. As a result, GC stalled and the INSERT statements timed out.

Data center operations and maintenance recommendations:

* Separate the control files from the data files at the underlying RAID group level and place them on different disk groups, so that control file IO is not slowed by DBWR write IO or RMAN read IO.

* The alter system archive log all command in the RMAN backup script makes the database perform a full checkpoint, triggering heavy DBWR IO and dragging down overall performance.

* Re-evaluate alternatives to alter system archive log all.

* Stagger the RMAN backup window and the batch window.

From this case, Xiao y hopes you take away at least the following:

Who would have thought that the alter system archivelog.. command in the RMAN backup script could have such a big impact...

Technique is very important; it is the foundation!

But the Way matters just as much; it can take you farther.

When technology has no answer, we can ask life. Heaven does not mistreat those who are serious, careful and thoughtful.
