Technology Life Series: The Story of Me and the Data Center (Issue 11): A Failure Caused by a Start and Stop

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

The spring breeze has gently blown away the winter cold and brought the most beautiful flower season of the year, and with the warm sunshine Lao K is here to see you again! In this return, Lao K will not only keep sharing small cases he has handled, but also improve the technical exchange section. I hope that through the discussion you can both sharpen your technical skills and appreciate the unique craftsman spirit of the Zhongyi DBA team.

Here comes the problem.

Early one morning, a customer called: during a routine inspection they had found that one node of a key system's database could not be connected to. The database was a 4-node RAC, and now one node was missing. They were worried that three nodes could not bear the load of the business peak. The situation needed to be dealt with as soon as possible, so Lao K chose remote support as the fastest way to solve the problem.

Analysis of problems

Environment description

System: Linux

Database: the server hosts multiple database instances, including p1 and x1 (both names are aliases)

Phenomenon description

When we log in to the p1 database locally, everything is normal and there is no obvious problem:

Analysis step 1:

Check the alert log and learn that the database has reported errors since 18:00 the day before.

From the log, the problem looks simple: the group changed while oracle was running. The group at startup was 501 (oinstall), but the current group is 503 (asmadmin). The ownership of the trace files in the directory that holds the alert log confirms this:

Trace files generated or updated at 18:43 are still owned by oracle:oinstall, while those generated or updated from 18:44 onward are owned by oracle:asmadmin.
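The same check can be scripted instead of read off a directory listing. The following is a minimal sketch in Python, assuming a Linux or UNIX host; the function names and the example file names are hypothetical, and only the 18:43 and 18:44 timestamps and the two owner:group pairs come from the case above.

```python
import grp
import os
import pwd
from datetime import datetime


def _name_or_id(lookup, numeric_id, attr):
    """Resolve a uid/gid to a name, falling back to the number if unknown."""
    try:
        return getattr(lookup(numeric_id), attr)
    except KeyError:
        return str(numeric_id)


def scan_trace_dir(trace_dir):
    """Return (mtime, 'owner:group', filename) for each file, oldest first."""
    rows = []
    for entry in os.scandir(trace_dir):
        if not entry.is_file():
            continue
        st = entry.stat()
        owner = _name_or_id(pwd.getpwuid, st.st_uid, "pw_name")
        group = _name_or_id(grp.getgrgid, st.st_gid, "gr_name")
        rows.append((datetime.fromtimestamp(st.st_mtime), f"{owner}:{group}", entry.name))
    return sorted(rows)


def find_ownership_flips(rows):
    """Return (previous_row, current_row) pairs where owner:group changes."""
    return [(prev, cur) for prev, cur in zip(rows, rows[1:]) if prev[1] != cur[1]]


# Hypothetical rows shaped like the listing described above.
example_rows = [
    (datetime(2017, 2, 9, 18, 43), "oracle:oinstall", "x1_example_a.trc"),
    (datetime(2017, 2, 9, 18, 44), "oracle:asmadmin", "x1_example_b.trc"),
]
for prev, cur in find_ownership_flips(example_rows):
    print(f"{prev[1]} at {prev[0]:%H:%M} -> {cur[1]} at {cur[0]:%H:%M}")
```

On a real system, the rows would come from scan_trace_dir pointed at the directory that holds the alert log, and the first flip narrows down the minute in which the group changed.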

Knowledge points:

Q: In a UNIX/Linux environment, an Oracle database runs many background and foreground processes after startup, and these processes routinely generate trace files. What actually determines the owner and group attributes of the Oracle-related processes?

A: Generally speaking, Oracle background processes are spawned from the binary file $ORACLE_HOME/bin/oracle. The owner attributes of a server process assigned to a remote connection are inherited from the listener process, and the attributes of the listener process in turn come from the user that started it (the oracle or grid user) and from $ORACLE_HOME/bin/oracle.
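One reason the binary matters so much is that, on Linux, $ORACLE_HOME/bin/oracle is normally installed with the setuid and setgid bits set, so a process executing it takes the file's owner and group as its effective identity. This detail is general background rather than something stated in the case itself. A minimal sketch in Python for inspecting those bits, with hypothetical function names:

```python
import os
import stat


def describe_mode(mode):
    """Summarise the permission string and special bits of a file mode."""
    return {
        "permissions": stat.filemode(mode),
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
    }


def describe_binary(path):
    """Return numeric owner, numeric group and special bits for a file."""
    st = os.stat(path)
    info = describe_mode(st.st_mode)
    info.update({"uid": st.st_uid, "gid": st.st_gid})
    return info


# A regular file with mode 6751, the layout usually seen on the oracle binary.
print(describe_mode(stat.S_IFREG | 0o6751))
```

Calling describe_binary on the real binary (the path depends on the installation) shows which group newly started processes will run under.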

Analysis step 2:

So the next step is to check the ownership of the $ORACLE_HOME/bin/oracle file:

On February 10, Lao K saw that the oracle file was owned by oracle with group asmadmin, and that its last modification time was December 10, quite a while earlier. However, changing a file's owner or group does not affect its displayed modification time; only modifying the content or replacing the file does. Next we can use the command in the figure to see when the file's attributes changed:
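The difference between a file's modification time and its status-change time can be reproduced on any Linux or UNIX system. The following is a minimal sketch in Python, not the command from the figure; the temporary file name and the 60-day offset are illustrative assumptions, and chmod stands in for the group change because it needs no special privileges.

```python
import os
import tempfile
import time


def attribute_times(path):
    """Return the content-modification time and the status-change time."""
    st = os.stat(path)
    return {"mtime": st.st_mtime, "ctime": st.st_ctime}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "binary_demo")
        with open(path, "wb") as f:
            f.write(b"content")
        # Pretend the content was last written about two months ago.
        old = time.time() - 60 * 24 * 3600
        os.utime(path, (old, old))
        before = attribute_times(path)
        # Metadata-only change: mtime stays put, ctime moves to "now".
        os.chmod(path, 0o750)
        after = attribute_times(path)
        return before, after


before, after = demo()
print("mtime unchanged:", after["mtime"] == before["mtime"])
print("ctime later than mtime:", after["ctime"] > after["mtime"])
```

This is why a long listing of the oracle file still showed December 10 even though its group had been changed much later: the listing shows mtime, while the attribute change is recorded in ctime.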

Clearly, the file's owner/group changed on February 9, so everything lines up. We can reasonably infer that the ownership of the oracle file changed on February 9 and the database then began reporting errors. In other words, this database instance had actually been unavailable since the previous evening, and the problem was not discovered until the next morning. The fix is then straightforward: restart the database. Because it was impossible to log in to the database normally (even as sysdba), we had to kill the key processes of the database instance directly and then start the database normally, after which everything returned to normal.

Analyzed step by step like this, the problem is not a troublesome one. You might think: even without knowing the detailed cause, once we see that the database is unusable we would simply restart it, and the problem would be solved just as easily. Is it really necessary to understand the nature of the problem?

Here, Lao K has something to say! ↓

Since we have chosen the road of technology, the most important thing is to stay curious, and the process of searching for answers benefits us as well. Continuous research means continuous improvement, and endless exploration is the only guide on the way forward. Of course, all of this rests on keeping the production environment safe and complying with operating procedures!

The problem first appeared.

OK, let's get back to the problem. Although we have worked out part of the picture above, two important questions remain unanswered:

Why is there a problem with the x1 instance while the p1 instance is not affected?

The first thing Lao K guessed was: "Was the p1 instance started after the group of the oracle file changed?" So he checked the startup time of the p1 instance:

Is it really a coincidence that the p1 database instance was started right around the time the group of the $ORACLE_HOME/bin/oracle file changed? This leads to the second question.

Who changed the group of the oracle file, and why?

As shown in the figure, the trace files of the p1 instance also changed on December 10, 2016, and remained owned by oracle:oinstall after that change, but went back to oracle:asmadmin after the restart on February 9 (not shown in the figure). From this we can infer that the group of the $ORACLE_HOME/bin/oracle file was originally asmadmin, was changed to oinstall on December 10, and was changed back to asmadmin on February 9.

Lao K's Memory Salon ~

Things seem to be getting complicated again. This reminds Lao K of a case he ran into before: after a database is patched, the group of the $ORACLE_HOME/bin/oracle file often changes, with the result that local non-dba users who connect to the database without going through the listener get an error because they cannot access the ASM disks (ASM disks are usually owned by grid:asmadmin). Restarting the database with srvctl start database is enough to solve this (of course, we can also use chown directly while the database is stopped).

So in this case, was the p1 instance also restarted with the srvctl command the day before?

Indeed: the p1 instance was started with srvctl the day before, which changed the group of the oracle file and caused the errors on the x1 instance. As for why the group of the oracle file gets changed, that comes down to how Oracle works internally. What we need to understand here is the difference between starting and stopping a database with srvctl/crsctl and doing so from sqlplus: srvctl/crsctl calls oraagent in CRS to carry out the operation, whereas sqlplus executes it directly as the oracle user.

Summary

1. Database instances p1 and x1 are running normally.

2. The group of the $ORACLE_HOME/bin/oracle file is oinstall.

3. On February 9, the operation and maintenance staff restarted the p1 instance with srvctl start instance -d p -i p1, which changed the group of $ORACLE_HOME/bin/oracle to asmadmin in the process.

4. Database instance x1 starts to report errors: the group at run time is inconsistent with the group at startup.
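The mismatch described in the last item can be checked on Linux by comparing a running process's effective group with the current group of the binary. The following is a minimal sketch in Python, assuming the standard "Gid:" line format of /proc/<pid>/status (real, effective, saved and filesystem GIDs). The function names and the sample status text are hypothetical; the GID values 501 and 503 are the ones from the alert log above.

```python
def parse_gids(status_text):
    """Extract real/effective/saved/filesystem GIDs from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Gid:"):
            real, effective, saved, fs = (int(x) for x in line.split()[1:5])
            return {"real": real, "effective": effective, "saved": saved, "fs": fs}
    raise ValueError("no Gid line found")


def group_mismatch(status_text, binary_gid):
    """True if the process still runs under a group the binary no longer has."""
    return parse_gids(status_text)["effective"] != binary_gid


# Hypothetical status excerpt for a process started while the group was 501.
sample_status = "Name:\toracle\nGid:\t501\t501\t501\t501\n"
print(group_mismatch(sample_status, binary_gid=503))
```

On a live system the status text would be read from /proc/<pid>/status for a background process of the instance, and binary_gid would come from os.stat on $ORACLE_HOME/bin/oracle.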

The problem is back!

At this point, the first two questions have been answered, but while talking with the customer we couldn't help asking: why was the p1 instance restarted? The answer made us think again. The customer had found the day before that the p1 instance could not be logged in to. The p1 database is no longer used for production, though it may still be used for some simple tests, so the customer's on-site maintenance staff simply restarted the p1 instance to fix it. But why couldn't the p1 instance be logged in to? If it is only a test database, it should not have been under load, so was it the same kind of problem as x1? To answer this, we need to analyze how the p1 instance had been running before. In addition, from the earlier analysis we know that the group of the oracle file had changed from asmadmin to oinstall, which raises another question: if starting and stopping the database with srvctl changes the group of the $ORACLE_HOME/bin/oracle file to asmadmin, then what changed it to oinstall? With these two new questions, Lao K set off again. Let's first look at the alert log of the p1 instance:

Looking at the log, Lao K found that the p1 instance was started at 09:18:09 on December 10, began reporting errors at 09:23:33, and kept doing so until February 9, when it was restarted and the problem finally went away. Why did p1 have this problem? And why had the x1 instance been fine until then? Let's take a look at the earlier alert log of the x1 instance:

The x1 instance also ran into errors during startup, but unlike p1 it hit the error while starting the background process RSMN, so the instance failed to start. It was started again at 09:23 and this time succeeded. After that, because the group of the $ORACLE_HOME/bin/oracle file was not modified again, there were no errors until 18:00 on February 9. Careful readers will notice that the x1 and p1 instances were shut down at about the same time, and first started at about the same time, on December 10, so we speculate that the databases were stopped and started as part of stopping and starting CRS with crsctl stop crs and crsctl start crs. Why, then, did the p1 instance start while the x1 instance failed? In fact, p1 only started the instance and never opened the database, while x1 failed because its key process RSMN happened to start just after the group of $ORACLE_HOME/bin/oracle was modified (09:19:22). On the second attempt, all processes of the x1 instance had a consistent group, so the mismatch between startup and run time no longer occurred.

Now one last challenge remains: what did the maintenance staff at the customer site do on December 10 that caused the group of the $ORACLE_HOME/bin/oracle file to become the incorrect oinstall? We checked the patching information for the database on this node and found that no patch had been applied since installation (the log is too long to list here). Fortunately, we found a clue in the shell command history:

A relink had been run against the oracle file, and its log was written to $ORACLE_HOME/install/relink.log. Looking at relink.log, we can pin down when the command was run:

From Lao K's day-to-day experience, the relink operation changes the group of the oracle file. After opatch applies certain database patches, the group of the oracle file usually changes as well; a closer look at the detailed opatch logs shows that the underlying command it uses is relink. In other words, relink not only modifies the contents of the oracle file but also changes its group (because the user running it is oracle:oinstall). With this conclusion in hand, we went back to the customer. The dialogue went as follows: ↓
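Why rebuilding a file resets its ownership can be shown with a generic example. The following is a minimal sketch in Python, assuming that a rebuild writes a new file and moves it into place rather than editing the old one; this is an illustration of ordinary file replacement on Linux or UNIX, not a trace of what relink does internally, and the function and file names are hypothetical.

```python
import os
import tempfile


def replace_file(path, data):
    """Replace path with a freshly written file, as a rebuild step would."""
    before = os.stat(path)
    tmp_path = path + ".new"
    with open(tmp_path, "wb") as f:
        f.write(data)
    # The new file becomes the target; the old inode and its ownership are gone.
    os.replace(tmp_path, path)
    after = os.stat(path)
    return before, after


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "binary_demo")
    with open(target, "wb") as f:
        f.write(b"old build")
    before, after = replace_file(target, b"new build")
    print("new inode:", before.st_ino != after.st_ino)
    print("owned by the user who rebuilt it:", after.st_uid == os.geteuid())
```

Because the result is a brand-new file, its owner and group come from the user who created it, and any group that had been assigned to the old file afterwards is lost along with the old inode.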

The on-site operator did use the relink command to recompile the oracle file while dealing with another issue, but the database was not started at that time.

Did you restart the operating system at that time?

Uh. I think so!

What on earth happened?

1. On December 10, the customer's on-site operator needed to use the relink command to resolve an issue.

2. The operating system was restarted right before the relink; CRS started automatically with the operating system and tried to start the databases to restore their previous state.

3. While the databases were starting, the operator began the relink without checking whether the databases were running.

4. During the relink, p1 completed instance startup, while x1 failed because the group of the oracle file had already changed when RSMN started, so RSMN failed to start and the x1 instance failed with it. Although the p1 instance started successfully, it kept reporting errors and the database could not be opened.

5. During this period the p1 instance was effectively unavailable. On February 9, the maintenance staff found that p1 was unavailable and, considering the p1 database unimportant, restarted the instance directly without investigating the cause. The p1 instance then returned to normal.

6. Because the p1 instance was started with srvctl start, the group of $ORACLE_HOME/bin/oracle was modified at startup, which made the x1 instance unavailable.

7. Finally, restarting the x1 instance brought everything back to normal.
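The chain of events above can be replayed as a small model in which each instance remembers the group of the oracle file at the moment it started. The following is a minimal sketch in Python and a simplification of the case: the event names and the replay function are hypothetical, the year 2017 for February 9 is inferred from the December 10, 2016 date mentioned earlier, the 18:44 time is approximated from the trace file change, and x1's first failed start is left out.

```python
from datetime import datetime


def replay(events, initial_group):
    """Replay events; after each one, list instances whose startup group is stale."""
    file_group = initial_group
    running = {}  # instance name -> group of the binary when the instance started
    report = []
    for when, kind, arg in events:
        if kind == "set_group":
            file_group = arg
        elif kind == "start":
            running[arg] = file_group
        elif kind == "stop":
            running.pop(arg, None)
        stale = sorted(name for name, g in running.items() if g != file_group)
        report.append((when, kind, arg, stale))
    return report


events = [
    (datetime(2016, 12, 10, 9, 18, 9), "start", "p1"),             # CRS autostart
    (datetime(2016, 12, 10, 9, 19, 22), "set_group", "oinstall"),  # relink
    (datetime(2016, 12, 10, 9, 23), "start", "x1"),                # second attempt
    (datetime(2017, 2, 9, 18, 44), "stop", "p1"),                  # approximate time
    (datetime(2017, 2, 9, 18, 44), "set_group", "asmadmin"),       # srvctl start
    (datetime(2017, 2, 9, 18, 44), "start", "p1"),
]

for when, kind, arg, stale in replay(events, initial_group="asmadmin"):
    print(when, kind, arg, "mismatched:", stale)
```

In this model p1 is the mismatched instance from the relink until its restart, and x1 becomes the mismatched one as soon as the group is switched back to asmadmin, which matches the order in which the two instances failed.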

In conclusion:

Through detailed analysis, we found that the immediate fix for the instance was quick, but the root of the problem was a series of operational mistakes. So Lao K wants to restate his point of view: when solving a problem, we must not brush aside any small doubt, because several small problems can compound into one that finally affects normal business. Only by analyzing the whole picture thoroughly can we solve a problem at its root. Therefore, we must always keep the spirit of exploration and never ignore any detail; this is also the spirit that we, as Chinese people, have always upheld!

All right, that's all for this issue. This year, our DBA team will go all out to share our experience in a witty and rigorous way. I hope you will support us. Your retweets and shares are the greatest encouragement to us!
