Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Classic cases of Systemstate Dump Analysis (part two)

2025-01-17 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)06/01 Report--

Preface

Follow up the previous issue: (reading method of the previous issue: reply to '007' after following the official account of "Zhongyi Antu")

4.3.4

Analysis of library cache lock in SSD

Follow up on the previous issue:

At this stage, the seemingly clueless problems in front seem to be sorted out, and a large number of cursor:pin S wait on X managers have made it clear that all the spearheads are directed at sid 859.

We are only one step away from the truth. We only need to analyze the source of library cache lock to explain the whole mystery. Lao K has already mentioned the method of analyzing library cache lock waiting events. Now let's combine the trace file to see how to locate the blocking relationship of library cache lock.

Well, let's take a look at sid 859:

In this conversation message, we can see:

> > the session is waiting for the library cache lock wait event with a waiting time of 4429 seconds.

> > session requests library cache objects (request=S) with 700000209bb9d80 handle in S mode

> > the library cache object whose handle is 700000209bb9d80 is that SYS.C_OBJ#_INTCOL#, is a cluster (clustering) object.

We can see that session 859 is waiting by requesting library cache lock on 700000209bb9d80 in S mode, so we can confirm that there must be another session in the system that holds the library cache lock; on 700000209bb9d80 in X mode. Similarly, we search the trace file for the keyword "700000209bb9d80" and filter it to see the following entry:

After we locate the message, we confirm the session to which the message belongs, and confirm that the session information is as follows:

See here, do you have the feeling that the session holding the library cache lock on 700000209bb9d80 is 624, while the session 624 is waiting for the "cursor:pin S wait on X" event, and the waiting object is bbcee4f7;. Now let's improve the waiting chain diagram above:

In the end, we found that a deadlock was formed between session 859 and session 624, as follows:

> > session 859 holds mutex on bbcee4f7 and requests library cache lock on 700000209bb9d80

> > session 624 holds library cache lock on 700000209bb9d80 and requests mutex on bbcee4f7

> > other sessions generate a large number of cursor:pin S wait on X wait events, all due to the fact that the mutex on the bbcee4f7 has not been released for a long time.

At this point, have all the mysteries been solved? Will our analysis be completed?

The answer is: NO

Part 5

Root cause analysis

5.1 third brainstorming

Old K, who often does root cause analysis, still has doubts at this time:

> > if the session on the chain is deadlocked without a restart at that time, will the problem be solved?

> > what are session 859 and session 624 doing, and why are they deadlocked?

> > how can a single session hold a mutex of cursor, which makes the system in a tamped state?

5.2 conversation of dark willows and bright flowers 859

Now Lao K focuses on how to solve the above two doubts and continues to analyze trace.

First, take a look at session 859 and intercept the information in the trace file, as follows:

From the information at the yellow mark, we know that this is a background process of the database; we can view the process information of this session in trace as follows:

Answer the second question first: what is conversation 859 doing?

The background process is CJQ0, which is used by ORACLE to schedule job. We know that if a session requests library cache lock on an object in S mode, the session usually needs to find the information about the object involved in the statement in library cache when parsing a statement or compiling a package. Look for the keyword "oper EXCL" in the trace file of PROCESS 24, and we find the following three records:

Look for the keyword "oper EXCL" in the trace file of PROCESS 24, and we find the following three records:

Mutex 7000001e7d04898 (859,0) idn bbcee4f7 oper EXCL

Mutex 7000001e5fbe4e0 (859,0) idn fb52493f oper EXCL

Mutex 7000001e8faa990 (859,0) idn a8bbc174 oper EXCL

Someone might ask? How can there be three cursor in a session at the same time?

Don't forget that there are recursive calls in the ORACLE database, that is, the front end initiates a simple sql,ORACLE background that actually produces a series of recursive calls, which are actually collections of sql. Continue to find and refine the IDN value. The three sql statements currently being parsed are:

Let me explain here: in fact, the recursive call relationship to sql in SSD is not reflected, but the call relationship can be inferred from the above three sql statements.

The CJQ0 process is used to schedule oracle job. From three statements, we can roughly see that a8bbc174 is the sql used to query the job related information in the system, fb52493f is the sql used to query the information of an object through the object number, and bbcee4f7 is the sql used to query histogram information.

A8bbc174 calls fb52493f,fb52493f to call bbcee4f7. If there is a problem with bbcee4f7, the other two will definitely not be able to execute properly. Here, it is precisely because the bbcee4f7 cannot complete the parsing, which leads to the blocking of fb52493f and a8bbc174 in the upper layer.

New doubt: what is the internal relationship between the three SQL and C_OBJ#_INTCOL#?

The session is waiting while requesting library cache lock on C_OBJ#_INTCOL#, and from the three sql texts, it seems that none of them has anything to do with the object C_OBJ#_INTCOL#. How do you explain this? Careful readers may have noticed that the old K specially pointed out that C_OBJ#_INTCOL# is a cluster (clustering) object, and the cluster object is not a table, but is used to store the common columns of multiple tables. Here we can notice whether the histgrm$ object in the lowest-level sql is related to C_OBJ#_INTCOL#. Let's take a look at the definition of histgm$:

Another puzzle solved is that histgrm$ does use the C_OBJ#_INTCOL# cluster object, so when parsing sql statements that use the histgrm$ table, you need to get the library cache lock on the C_OBJ#_INTCOL#.

5.3 conversation of dark willows and bright flowers 624

Next, let's take a look at session 624, just like analyzing session 859, extract the trace of a single process, and let's intercept some of the information as follows:

From here, this is a called job process with PROCESS number 42

This is not a database background process, so we can see more information than the 859 process we saw before. We roughly know that this process is a job task called by the database to collect statistics, waiting for the "cursor:pin S wait on X" event, and the parsing object is bbcee4f7.

Because it holds the library cache lock of the object C_OBJ#_INTCOL# in X mode and blocks the critical session 859, let's see why it holds this library cache lock;. We searched for the keyword oper EXCL in the process information of PROCESS 42 and found two related messages as follows:

Similarly, we can guess the recursive invocation relationship between the two from the semantics of the sql statement.

Session 624 holds the library cache lock of C_OBJ#_INTCOL# and I_OBJ#_INTCOL#, where I_OBJ#_INTCOL# is the index of CLUSTER. Now all the mysteries have been solved. You can relax and get your mind straight from the beginning.

Part 6

Scene reproduction

Finally get a panoramic view of what the database is doing when it fails.

In this scenario, at T1 moment, the database automatically collects statistics. J000 process is scheduled to collect statistics of the entire database. When collecting cluster objects, the database can only use analyze to analyze.

T2, because the statistics of C_OBJ#_INTCOL# objects are updated, and because of the association between histgrm$ and C_OBJ#_INTCOL#, the sql (including bbcee4f7) related to histgrm$ also needs to be reparsed.

T3 moment, J000 first collects C_OBJ#_INTCOL# statistics, and then continues to use analyze to collect statistics for its index I_OBJ#_INTCOL#

T4 moment, when the CJQ0 process periodically queries the system JOB, it needs hard parsing, and parses it when recursively calling bbcee4f7.

In the process of parsing, you need to request the library cache lock that holds histgrm$ and its related objects (that is, C_OBJ#_INTCOL# and its index I_OBJ#_INTCOL#) in S mode

At the moment T4, the J000 process is indexing library cache lock on I_OBJ#_INTCOL# in the analyze index, which is in X mode.

In the process of J000 using analyze, the relevant recursive sql also needs to be executed and hard parsing is required, so the key sql bbcee4f7 mentioned above is called.

So it ended up with a deadlock.

Part 7

Problem positioning

In fact, in the process of the above analysis, we can basically judge as bug, (MOS): 1628214.1 Database Appears to Hang with Deadlock Involving SYS.C_COBJ# or C_OBJ#_INTCOL# While Statistics Maintenance Job is Running

Part 8

Write at the end

So far, we have located the bug and got the solution during and after the event, but Lao K is more concerned about whether you can get something from this CASE. Here is a good way to ask yourself:

> > can I check the cursor:pin S wait on X in Systemstatedump?

> > can I check the library cache lock in Systemstatedump?

> > will I do it if I want to simulate my database tamping?

Is there another unanswered question? If I encounter the same problem next time, can I solve the problem by killing the process in the deadlock chain?

If your answer is yes, then Lao K thinks this sharing is very valuable.

If you have any questions about the above, you can communicate and discuss them through Wechat or QQ.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report