
A small change that leads to data loss

2025-02-24 Update | From: SLTechnology News&Howtos > Database


Shulou(Shulou.com)06/01 Report--

Preface

Before I knew it, the "Me and the Data Center" technology life series has reached its tenth installment, and Xiao y meets you again!

Earlier we shared many practical cases of Oracle database failure and optimization, and a friend asked whether Xiao y could also share some practical cases of data recovery without a backup.

The answer is, of course, yes. Xiao y has never been one to hold things back. ^_^

Over the years, Xiao y's Oracle service team has encountered basically every problem that should, and should not, be encountered.

Therefore, there are plenty of cases of data recovery without a backup, sometimes even three or four such cases a week, of different types, such as:

>> A telecom operator's file system filled up, and maintenance staff deleted the online redo log files, leaving the database unable to start.

>> A telecom IDC machine room lost power, and the Oracle database was damaged and unable to start.

>> A fund client mistakenly ran drop user xx cascade, deleting a database user.

...

Xiao y honestly feels that sharing a "data recovery without backup" case is genuinely difficult. After all, such a failure is an embarrassment for the customer; if the sensitive information is not scrubbed cleanly, the customer may recognize themselves in the story and come looking for trouble. So at first Xiao y had no intention of sharing such cases.

But on second thought, if we can extract the common risks, we can not only give a reminder at the technical level but also offer suggestions on improving the database operation and maintenance system. Then this kind of case sharing becomes meaningful!

One aside: some friends may ask curiously, "When you receive this kind of case, the customer is desperate. You can name any price you like. You must make a lot of money."

In fact, in line with the style and philosophy of Zhongyi Technology, we serve corporate customers and build word of mouth. From the perspective of long-term cooperation with customers (or potential customers), most of these cases are handled free of charge (I hope you are not disappointed). When we do charge, it depends on the degree of damage and the manpower invested, and the quote is always a kind-hearted price (so low we are embarrassed to say it). After all, the customer is already suffering; if we took the opportunity to overcharge, it would be a one-shot sale with no long-term cooperation to follow, which is not in line with our service philosophy.

Topic of this issue:

Share a case of data loss caused by an Oracle change, and then invite you to think about a question.

Is your Oracle database operation and maintenance system really sound?

Xiao y will present this real case for you today. At the end of the sharing, in addition to tips on the technical risks, we will give Zhongyi Technology's views on how to build a scientific operation and maintenance system, hoping it is of some help to you.

The significance of this case sharing:

Xiao y has noticed a problem: no matter how many risk warnings others give, many customers still make the same mistakes!

Even when they already know about the problems others have run into!

Why does someone who knows the problem and the risk still make the same mistake? Because he has not felt the pain personally! If you only watch others' misfortunes and take no action to rectify things from the perspective of the operation and maintenance system, similar problems are likely to recur. Xiao y hopes readers can grasp the essence and good intentions of each sharing, and carry out rectification and prevention from the perspective of the operation and maintenance system, so that Xiao y's pains are not wasted.

Think about one question first:

Does your system still have a script with logic like the following?

To avoid the archive logs not being backed up to tape in time, which would fill the file system and hang the database, many customers have a script like this in their systems: when the utilization of the archive file system reaches 60%, start backing the logs up to the tape library; when utilization exceeds 90%, delete the archive logs and send an alarm saying that archive logs have been deleted and need to be dealt with as soon as possible.
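The logic above can be sketched as follows. This is a simplified illustration: the 60%/90% thresholds come from the article, but the function and action names are hypothetical, and a real script would shell out to df and an rman/tape-library client.

```python
# A simplified sketch of the archive-cleanup script's decision logic.
# Thresholds are from the article; names are hypothetical.

BACKUP_THRESHOLD_PCT = 60   # start backing archive logs up to tape
DELETE_THRESHOLD_PCT = 90   # delete archive logs and raise an alarm

def decide_action(fs_usage_pct: float) -> str:
    """Map archive file system utilization to the script's action."""
    if fs_usage_pct > DELETE_THRESHOLD_PCT:
        return "delete_and_alarm"   # the dangerous branch in this story
    if fs_usage_pct >= BACKUP_THRESHOLD_PCT:
        return "backup_to_tape"
    return "noop"
```

Keep the dangerous branch in mind; it is exactly where this case goes wrong.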

There seems to be nothing wrong with doing this. Or is there?

After reading the specific case Xiao y shares below, you will understand. :)

Part 1

Here comes the problem.

Tragedy strikes.

A potential customer found that accessing data in file 256 returned errors; the file could not be accessed.

A further check showed that the file was offline and needed recovery.

But the file could not simply be brought back online: the required archive logs were missing, so recover was impossible.

So he asked Xiao y for help. Xiao y thought there were really only two possibilities:

1) the archive logs have been backed up to tape

2) the archive logs have been deleted

If it is the first case, it is easy: just restore the logs from tape and recover!

If it is the second case, it is bad: we may lose data.

No matter. We don't go looking for trouble, but when trouble comes, we're not afraid of it.

Let's first look at the customer's attempts to bring the data file online:

1.1 File online

When file 256 is brought online, Oracle naturally prompts that the file needs media recovery. This is because while a file is offline (for whatever reason), the dirty blocks belonging to it are no longer flushed to disk, so redo must be applied to catch it up.

1.2 Recover data File

So the customer ran recover datafile 256 and entered AUTO, but the database reported that the log file with sequence number 14389 could not be found.

1.3 View error messages

A check on the operating system shows that the log file does not exist.

1.4 Where are the archived logs?

Were they deleted from the file system after being backed up to tape?

Checking the RMAN backups shows that the archive logs required by node 1 have no backup record at all!

This is sad! The archive logs required to bring file 256 online have been deleted! The data may be lost!

Part 2

How did the accident happen?

How could a small change lead to such a situation?

As it turns out, this is a 10g RAC environment on IBM AIX, with the data files on raw devices.

The customer had recently made a "small" change: adding data files to a tablespace in the RAC!

So, are the offline file and the missing archive logs directly related to this change? Can a change as simple as adding a data file to a tablespace really result in data loss?

You may find it incredible, but Xiao y had basically guessed the whole process. Similar tragedies keep playing out in different places.

At this point, readers are advised to stop for a moment and think about the relationship between the change and these two issues! And think: if it were you, how would you help the customer proceed?

Part 3

The plot replays.

Why did the file go offline, and why were the archive logs lost?

It is actually very simple. Let's look directly at the change process and how the problem unfolded:

3.1 The change "succeeds"

At around 11:50 on January 4th, the customer initiated the change, adding two data files to a tablespace from node 2 of the RAC. The addition succeeded, and the alert log shows Completed. The change "succeeded".

3.2 But did it really succeed?

But did the change really succeed? Was it really that smooth?

At 15:07, when node 1 performed a checkpoint, it needed to update the SCN in each data file header, but because the operating system permissions of the newly added raw devices were wrong on node 1, an I/O error occurred. This is the classic RAC mistake of forgetting to change device permissions on one node. With so many ORA- errors in the alert log, if they had been discovered and handled at this point, there would still have been time! Unfortunately, there is no "if"...

3.3 The data files are forced offline

At 15:07, because node 1 could not write the SCN to the file headers due to the raw device permission problem, the two new data files were forced offline. Again: if the ORA- errors had been discovered and handled at this point, there would still have been time! Unfortunately, there is no "if"...

3.4 The problem is discovered

N hours later, when node 1 accessed data in the two files and began to report errors, the customer realized the seriousness of the problem! The view v$recover_file showed the two files, with file_id 256 and 257, in the offline state.

After discovering that the raw device permissions had never been changed, the customer fixed the permissions on node 1; but executing alter database datafile '/dev/xxx' online prompted that the data file needed recovery.

A further check found that the archive logs generated on node 1 while the files were offline had already been deleted from the file system before RMAN had a chance to back them up. Recovery was no longer possible!

So what caused the archive log to be deleted?

Do you remember this passage from the preface at the beginning of the article?

Does your system still have a script with logic like the following?

To avoid the archive logs not being backed up to tape in time, which would fill the file system and hang the database, many customers have a script like this in their systems: when the utilization of the archive file system reaches 60%, start backing the logs up to the tape library; when utilization exceeds 90%, delete the archive logs and send an alarm saying that archive logs have been deleted and need to be dealt with as soon as possible.

There seems to be nothing wrong with doing this. Or is there?

Yes, there was exactly such a script in the customer's system!

Because backup to tape was failing, the utilization of the archive log file system reached the threshold, which triggered the script to delete the archive logs! Combined with the "coincidence" of forgetting to change the raw device permissions on one node during the change, it led to tragedy!

At this point, do you still think it is safe to delete archive logs to avoid a database hang and sort things out afterwards? The answer is obviously no! Xiao y believes that more than 90% of DBAs do not check whether v$recover_file lists files needing recovery before deleting archive logs!
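A safer cleanup script would consult v$recover_file before deleting anything. Here is a hedged sketch of that guard; the function name is hypothetical, and plain integers stand in for what a real script would query from v$recover_file (column CHANGE#) and v$archived_log (column NEXT_CHANGE#).

```python
# Hypothetical guard for an archive-cleanup script: an archive log is only
# safe to delete if no file awaiting recovery still needs redo from it.

def safe_to_delete(log_next_change: int, recover_change_scns: list) -> bool:
    """True if every file needing recovery is already checkpointed past
    the end of this archive log's redo range."""
    if not recover_change_scns:          # v$recover_file is empty
        return True
    return log_next_change <= min(recover_change_scns)

# A log whose redo ends at SCN 200 is still needed by a file stuck at 150:
assert not safe_to_delete(200, [150])
```

With such a check in place, the script in this case would have refused to delete the very logs that file 256 still needed.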

Part 4

Can it be saved?

How to solve it?

In this case, is there still a way to bring the data file online? (Of course, you could also use extraction software to pull the data out directly.)

Since Xiao y raises the question, naturally there is a way, and the method is very simple (fewer than 5 steps).

Use bbed to change the SCN in the header of the offline file to be consistent with the SCN of the other data files; the whole operation takes only a few minutes, so you can try it yourself. Note that this merely tricks the database's consistency check: with the log files lost, some data loss is inevitable!

When using bbed to modify the data file header SCN, the only thing to be careful about is the byte order of the platform. The Linux platform is little-endian, so the high and low bytes are stored in reverse order.

Here, Xiao y uses his own environment as an example, where file 19 is offline and the archive log needed to bring it online has been deleted, to illustrate the process.

4.1 Check the SCN

Query v$datafile_header: file 19 is in the offline state, and its SCN differs from that of the other files.

Having lost the logs, if we want to bring the file online we can only deceive the database: change the SCN in the header of data file 19 to match that of the other files, such as files 17 and 18.

4.2 Locate the SCN

The SCN lives in the kcvfhckp.kcvcpscn structure of each file header (block 1). In the dump below, blue marks the input command and red the relevant output: the 4 bytes following offset 484 hold the SCN base, in hexadecimal. Converting that to decimal gives the SCN shown by v$datafile_header above.

4.3 Mind the byte order

The SCN shown by the dump command in the figure below is a883d301 (see the underline), which is exactly the byte-reversed form of the value in the figure above.
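The reversal is easy to check with a couple of lines; a minimal illustration, assuming the dump shows the raw little-endian on-disk byte order:

```python
# The header stores the SCN base little-endian: lowest byte first.
raw = bytes.fromhex("a883d301")            # bytes as seen in the bbed dump

scn_base = int.from_bytes(raw, "little")   # reverse the bytes to read it
assert scn_base == 0x01D383A8              # the hex byte pairs, reversed
```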

4.4 Modify the SCN

Use the modify command to change the SCN in the header of file 19 to match that of the other files such as 17 and 18, then recalculate the block checksum; finally, verify confirms that no inconsistency is detected.
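At the byte level, the modify step amounts to overwriting the 4 SCN-base bytes at their offset. A rough sketch on a dummy block, using offset 484 from the dump above (the function name is hypothetical; never do this to a live data file, and remember that bbed must also recompute the block checksum):

```python
import struct

def patch_scn_base(block: bytearray, offset: int, new_scn: int) -> None:
    """Overwrite 4 bytes at `offset` with new_scn in little-endian order,
    mimicking what bbed's modify command writes into the header."""
    block[offset:offset + 4] = struct.pack("<I", new_scn)

header = bytearray(8192)                  # stand-in for header block 1
patch_scn_base(header, 484, 0x01D383A8)   # SCN base at offset 484
assert header[484:488] == bytes.fromhex("a883d301")
```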

Query v$datafile_header again: file 19 now shows the same SCN as the other files.

4.5 Bring the data file online

Run recover datafile, then bring the file online; the repair is complete.

Part 5

That's the point.

Summary of causes of failure:

In this case, the change was adding data files to a tablespace in an Oracle RAC. Because the raw device permissions were forgotten on one node, the new data files were forced offline. The archived logs needed to recover them were then automatically deleted by a script based on file system utilization, resulting in data loss. With bbed the files can still be brought online without the log files, but data loss is inevitable!

Zhongyi Science and Technology's suggestion on the construction of database scientific operation and maintenance system:

I believe there is a consensus that "changes are the hardest-hit area for failures."

There are no trivial matters in operation and maintenance, no matter how big or small the change.

For small changes, people often fail to prepare detailed change steps, out of confidence in their skills and contempt for the task. Working purely from experience, combined with staying up late, fatigue, and relaxed vigilance, it is easy to omit some small detail and cause a catastrophe.

That's true.

Changes are performed by people (it is impossible to automate every change). People inevitably make mistakes; even double-checking cannot eliminate them entirely, and customers who truly sustain two-person review of every change over the long term are definitely a minority.

Then, it is particularly important to build a scientific operation and maintenance system.

The probability of problems can be reduced under the scientific system.

Taking the change link in operation and maintenance as an example, in terms of methodology, Xiao y suggests:

1. Sort out all the changes

2. Sort out the risk points of all changes

3. Work out a corresponding feasible solution for each risk point.

4. In principle, the solutions should not depend on the individual on-site implementer.

With regard to the case shared today, Xiao y believes that there are many areas worthy of improvement:

1. For a RAC environment on raw devices, there was no monitoring of each node's data file permissions at the OS level.

With such a monitoring point, the forgotten permission change on node 1 would have been found quickly; the files would not have gone offline, and there would have been no data loss.

2. Lack of monitoring of v$recover_file

With such a monitoring point, an offline file can be found quickly and brought back online in time. In addition, could the online action itself be automated?

3. Lack of a mechanism for monitoring and promptly handling ORA- errors in the alert log.

Are the severity levels of the monitoring points set accurately? If every ORA- error merely raises a low-level warning, it is easily ignored; serious ones should trigger a text message. For example, some colleagues in the data center are on duty every day to confirm, analyze, and handle the monitoring alarms one by one, ensuring nothing is missed and the business stays continuous.

4. Lack of monitoring and/or a timely handling mechanism for backups

If a backup is found to have failed, for example because too many backup jobs are queuing, measures such as off-peak scheduling or adding tape drives can keep the archive log file system from exceeding its threshold.

5. In any case, there should be no script to delete archive logs in the system

What if we don't delete them? Could the database hang? Which would you rather accept: a database hang, or data loss? The answer is obvious. If there is not enough space for archiving, start with capacity planning; at a minimum, reserve enough space for seven days of archive logs. Data security is worth more than the storage space it costs.
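The seven-day reservation above is simple arithmetic. A hedged sketch with made-up numbers; measure your own daily archive generation (for example from v$archived_log) before planning:

```python
# Capacity planning for the archive area, per the seven-day suggestion
# above.  The daily volume and headroom factor are hypothetical examples.

def archive_space_gb(daily_gb: float, days: int = 7,
                     headroom: float = 1.2) -> float:
    """Space to provision so archive logs never need emergency deletion."""
    return daily_gb * days * headroom

space = archive_space_gb(50)   # e.g. a system archiving 50 GB of redo/day
```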

Operation and maintenance is a science. No one can personally encounter every problem, so you need a scientific operation and maintenance system to reduce the probability of problems! You are welcome to discuss with Xiao y how to build one.
