Look at the essential skills of engineers-Technical Life Series No. 51-the story of me and the data center 07/01 Update SLTechnology News&Howtos

Look at the essential skills of engineers-Technical Life Series No. 51-the story of me and the data center

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

Foreword:

In the twinkling of an eye, time has come to June, and the series of articles on technical life have not been updated for half a year. In the past six months, the authors of our series of articles on technical life have been working hard on the front line of work, and do not forget to accumulate and summarize at the same time. Provide more profound details for our articles Today, the technology life series will formally return to the classroom, still rigorous and profound, vividly reproducing the customer's on-site handling of problems, hoping to bring more in-depth thinking while passing on more knowledge points.

Here, first of all, I would like to thank the friends who have been supporting the technology life series, your support and forwarding is the driving force for us to continue to share.

I. here comes the problem.

On a leisurely Friday night, Lao K, who was thinking about his life at the window, received a customer's question. The DBA at the customer site had been dealing with it for some days and was ready to restart the time window on the weekend to solve the problem. However, the customer cautiously advised that the problem may not reappear after restart, and we need to find the trigger cause of the problem before restart to avoid this kind of problem.

So what is the specific problem? The customer describes the following information:

ORA-01410 errors often occur when applications execute some SQL, but such errors occur only occasionally and may not be encountered manually, because there may be multiple tables involved. After previous analysis, it is suspected that it is caused by memory disorder. It is expected to restart the operating system next weekend to solve this problem. At present, we can try to analyze the problem from the database level. See if you can find anything.

From the communication with the customer, it seems that I have been tortured by this problem for a long time, but now I don't have much hope for analyzing the cause of the problem, but just try to analyze the problem.

The customer just handed over the problem to us with the mentality of giving it a try, but as a service provider, we must try our best to solve the customer's problem.

2. Find out the situation

To solve the problem as soon as possible, we can only analyze it remotely, so I contacted the customer's interface DBA and remotely connected to the customer's server to understand the relevant environment and analyze the problem. The details are as follows:

Operating system: REDHAT 6

Database: oracle 10.2.0.5 single node

Specific phenomenon: the access of some tables is reported to ORA-01410, obviously for a SQL statement, if the table is removed from the SQL, the whole SQL will not report an error; in addition, there are many tables in this situation; this phenomenon will not occur all the time, occasionally

First of all, I communicated with the customer DBA, got several SQL with this problem, and executed these statements directly. After many experiments, I did encounter several errors, especially the SQL which took a long time to execute.

Select one of the simpler statements, and its statement and execution plan are as follows:

In this SQL statement, the table with the problem is XXXstatelog, which is relatively small, only 60m

And what is the ORA-1410 error?

[oracle@test10g ~] $oerr ora 1410

01410, 00000, "invalid ROWID"

/ / * Cause:

/ / * Action:

The ORA-1410 error is due to an invalid ROWID,ORACLE that does not indicate the possible cause of this type of situation

III. Communication and search

Now I know the basic situation, but the customer DBA has been analyzing it for some time, and they have a better understanding of the overall application, hardware and related changes. When taking over the on-site problems, I am used to communicating more with the customer DBA. On the one hand, we can learn from their existing experience to avoid reinventing the wheel, on the other hand, onlookers are clear, maybe we can point out some flaws for them. Discover the key to the problem In short, make full use of the existing resources to quickly analyze the problem, not to mention the network is intermittent, the relevant operations also need the help of customers to implement and collect.

Q: have you made any changes recently?

A: the database has not been adjusted, and we don't know much about the adjustment of the application version.

Q: apart from this error, does the app have any other problems, such as wrong result?

A: there is no such feedback.

Q: is there any rule in reporting errors like this?

Answer: there are only five or six tables that know that such a situation has occurred, but they have not found any special rules in terms of time or frequency. Usually, if you perform a failed operation again, it seems that there will be no problem.

There doesn't seem to be any more information, so think about it first; then take a look at this error:

[oracle@test10g ~] $oerr ora 1410

01410, 00000, "invalid ROWID"

/ / * Cause:

/ / * Action:

Two key messages: ROWID and invalid

First of all, when will ROWID be used? generally speaking, ROWID is a pseudo column that does not exist in the data block of the table. It seems that only the leaf block of the index will store ROWID to locate the data.

In addition, INVALID seems to mean that there is an exception in the rowid stored in the index, or is there confusion after reading it into memory?

In the absence of more information, you might as well learn from the experience on MOS to see what causes ORA-1410, and find an article explaining this kind of error:

The trigger of ORA-1410 is described:

From the series of descriptions given here, the main possible causes include: data block corruption, index block corruption, memory corruption, exceptions caused by DDL operations, and various bug

From the situation we have encountered, the corruption of the data block / index block does not seem to exist, because multiple execution does not always make an error.

Without considering bug, there are only DDL operations or memory anomalies, so is it the problem caused by the DDL operation?

IV. Thinking and inference

The MOS document I have seen so far is quite comprehensive, so I first sent it to the customer DBA for discussion

However, the customer DBA said that they have read this document carefully, and they have communicated with the application. There should be no DDL operations during the time when there is a problem, but only some TRUNCATE operations in a fixed period of time in the evening. In addition, they have also tested it in their own environment. Truncate does not cause ORA-1410 to report errors, and based on past experience, query errors caused by truncate are reported by ORA-8103;. So, customers here think it may be caused by memory problems or ORACLE's bug. Customer DBA seems to have a lot of experience.

I kind of agree with the customers' views:

1. I seem to have seldom encountered ORA-1410. I have encountered cases with abnormal indexing before.

two。 In addition, when truncate causes SQL to report an error, it usually reports ORA-8103, and the object does not exist.

3. If you are really not aware of the truncate operation, then the front-end business may have been affected

This doesn't seem to be caused by truncate, but how does the phenomenon come about? If it is really an exception in memory, how is it good? What is its trigger condition?

Taking advantage of the on-and-off network with my customers, I began to think alone.

According to the description of the mos document, there are several main causes of problems:

1. There are frequent DDL statements in the database.

two。 A corruption occurs in an index or data block, that is, a problem caused by a bad block

3. Hardware level, for example, there is a memory problem

4. Bug of Oracle

Take a closer look at the phenomenon of this problem and think about the possible reasons one by one:

1. Caused by the DDL statement, but the user has confirmed that the DDL statement will only be executed at a specific time.

two。 A data block or index block is damaged, which can be judged by the following statement

SQL > analyze table validate structure cascade

After confirming with the user, it is found that the command does not return the information related to the bad block, which can basically rule out the possibility of the bad block.

3. Memory-level problems, if they are memory-level problems, are more likely to occur when accessing some data blocks of a table. And the hardware engineer also confirmed that there is no problem with memory, so it can basically rule out the possibility of memory problems.

4. Oracle's bug, after doing some searches on mos, did not find a similar problem.

5. Test and simulation

After thinking about the above, I still think that the truncate operation is more likely to cause exceptions. If such case can be reproduced, I believe it will be very convincing. Well, this reproduction should be relatively simple, so it is better to reproduce it directly.

So I started to recreate it in my own test environment, which is also a 10g database, Redhat linux environment, and the implementation is as follows:

1. Create tables, generate more data, create indexes

Create table test0611 as select * from dba_objects

Insert into test0611 as select * from test0611;-- executes multiple times

Create index T0611_IDX1 on test0611 (object_id)

two。 Construct statements that access data through an index

Because rowid is only used in indexes, we use hint here to let SQL access the index first, and then return the table through the index. This is a poor execution plan, and the SQL execution time will be relatively long.

3. During the above query, we perform the trauncate operation in another session

Truncate table test0611

4. Observe the query operation in step 2 to confirm whether an error has been reported

The above simulated case tests were performed several times, but failed to reproduce the errors or other errors of ORA-1410, and the query completed normally. The problem seems to have reached an impasse. Is the truncate statement really not the cause of the problem? Let's think about it on the other hand, if we want to overturn our analysis, the most direct way is to get more evidence to prove that our analysis is wrong. Then let's gather more information.

VI. Accidental discovery and solution of problems

If we fail to reproduce the problem quickly in our own environment, it seems that we have to find other clues in the production environment in order to gather enough information before restarting.

In the process of viewing, it was unexpectedly found that OGG (oracle goldengate) was deployed on this environment, and it synchronized the entire SCHEMA table to the target ORACLE instance as the source side, which seemed to bring a turnaround to the location problem: if it is an exception at the memory level, it will not theoretically be transmitted to the ORACLE instance on the target side of ogg. If it is the table itself or the index itself, then it should be able to reproduce on the ORACLE instance on the target side of ogg.

Next, we execute the relevant query on the ogg target database. After half an hour of testing, we find that the same statement will also report an error on the ogg target database, but unlike the production environment, the error message is ORA-8103.

[oracle@11g dmp] $oerr ora 8103

08103, 00000, "object no longer exists"

/ / * Cause: The object has been deleted by another user since the operation

/ / began, or a prior incomplete recovery restored the database to

/ / a point in time during the deletion of the object.

/ / * Action: Delete the object if this is the result of an incomplete

/ / recovery.

ORA-8103 is familiar with it, usually because the object is dropped by drop or truncate. At the same time, we found that the version of the target database is 11g, and immediately we thought the problem was much simpler. In 11g database, there is a parameter enable_ddl_logging to control whether DDL operations are recorded in the alert log, but the source 10g database does not have this function.

So, after we adjust this parameter in the target database instance, the alert log quickly captures the truncate information.

Tue Jun 01 20:11:18 2018

ALTER SYSTEM SET enable_ddl_logging=TRUE SCOPE=BOTH

Tue Jun 01 20:16:34 2018

Truncate table XXXstatelog

At this point, the truth seems to be clear. It is the application that adds the truncate operation to the code, and the frequent truncate operation leads to an error interrupt in the query of the related table. The next thing the customer DBA needs to do is to communicate with the developer the purpose of its truncate operation, and adjust it accordingly.

After removing the frequent truncate operations and changing them to regular truncate operations, there will be no ORA-1410 or ORA-8103 errors during the business period, and the matter will come to an end.

VII. Research and reproduction

The customer's problem is solved, but new doubts begin to linger in my mind, but now there is no customer pressure, no time urgency, you can calm down and study this problem slowly:

1. Why is it tested in my environment that truncate does not affect select operations while truncate operations in the customer environment affect select operations? where does the difference come from?

two。 What is the difference between ORA-1410 for 10g environment and ORA-8103 for 11g environment?

To answer the first question, let's first take a look at what the truncate operation does to the table / index:

1) for ordinary heap tables, the truncate operation only truncates the table and index, mainly modifying the data dictionary information and segment header information.

2) for the case where we simply access the table data through the index, after the truncate is executed during the query, the truncate modifies the segment header information, and the specific index block and data block we need to access remain unchanged, the error will not occur naturally.

3) when we overwrite the data blocks of truncate, or when our query needs to access the segment header information, the error of ORA-1410 or ORA-8103 will come out.

After understanding the above points, it is very simple for us to reproduce the error; let's build a few reproduction scenarios:

First, two tables are constructed, of which T06182 is larger and T06183 is smaller, and there is an index on the object_id column:

The statement used to construct the query:

Test scenario 1:

One session executes a simple query, another session performs a truncate operation during the query, and after the truncate operation is completed, the query can continue to complete without reporting an error

(note: the completion time of truncate is 11:13:59, and the query is not affected during the query process)

Test scenario 2:

One session executes a simple query, and another session performs a truncate operation during the query. After the truncate operation is completed, it continues to execute the flush buffer cache operation, resulting in an ORA-1410 error.

(note: the truncate completion time is 11: 22. The start time of the flush buffer_cache is 11:22:15. An error was reported during the same period.)

Test scenario 3:

One session executes a simple query, and another session performs a truncate operation during the query. After the truncate operation is completed, the insert operation continues, and an ORA-1410 error occurs.

(note: the completion time of truncate is 11: 00, the execution time of insert is 11:20:16, the query in the same period reports an error, the data is inserted after truncate, and the extent is reused, which leads to an error in accessing the data.)

For the second question, in fact, after fully understanding the errors of ORA-1410 and ORA-8103, we can know what they have in common. It seems that they only reported ORA-1410 in 10g and ORA-8103 in 11g.

As you can see, with the same case, there are ORA-8103 errors in the 11g database environment.

And similar articles can be found on MOS to illustrate the problem:

Small knot

From the point of view of the whole problem, it is actually a series of problems caused by truncate. From the initial stage of suspicion to the final confirmation of the problem, there are some detours in the middle, resulting in a complicated analysis of the problem. In hindsight, if we suspect truncate, we can directly check the last_ddl_time and DATA_OBJECT_ID of the object and locate it directly. Or if you are not so nervous when dealing with problems, if you read MOS 1410.1 carefully, you can quickly determine several directions and confirm them one by one. However, this is the difference between real-time problem processing under high pressure and ex post analysis. Fortunately, precisely because this problem still takes some time, it makes me further try to reproduce the problem and reflect on the whole process. And then there are more discoveries, summed up, are also some of the necessary skills of front-line engineers, write to yourself and share with you:

1. As for the understanding of the principle, it is very important to solve the problem, especially when you want to reproduce the fault, otherwise it is likely to fail to achieve the desired results and lead to wrong conclusions.

two。 For dealing with problems in real time, even if there is a lot of pressure and time is running out, you still need to calm down and read carefully for the details you don't know and the articles you haven't read, otherwise you may miss the knowledge point that allows you to achieve it overnight.

3. Try to recreate the problem you encounter, which will impress you more than solving the problem.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.