
Oracle RAC High Availability Failure: A Risk Alert

2025-01-20 Update. From: SLTechnology News & Howtos.


Preface

Before we begin: this is already the second installment of "Technical Life: Me and the Data Center." Some readers have started asking who Xiao y is. That is not important; what matters is the technical content and the practical risk warnings for customers. In future installments we will also share the stories of Xiao Yi on the operating system side and Xiao W on the middleware side. The name "Xiao y" has no special meaning, so let it stand for all of us operators who dedicate our youth, without regret, to the data center.

Sharing topics in this issue

What Xiao y wants to share with you today is the following serious topic:

Is your Oracle RAC really highly available, or is it only pseudo-highly-available?

In other words:

When the partition or server hosting one node of an Oracle RAC cluster goes down,

can you pat your chest and tell your boss:

"It's fine, this is Oracle RAC, there's another node! As long as that node can carry the load, service continues as normal!"

If you read the above paragraph again, do you feel hesitant?

Now let Xiao y ask the same question a different way:

A RAC high-availability test was indeed done before the system went live. But once a node has been running for a long time, its load, including CPU, memory, and process count, keeps changing, and after that series of changes no further high-availability test is performed. In this situation, if the partition or server hosting one RAC node goes down, can you still pat your chest and say firmly, "the other node of my Oracle RAC will certainly keep providing service"?

Same question, but with more groundwork laid. Hearing it this time, is your answer more hesitant?

Today, Xiao y will present to you a real case of "RAC high availability loss" and its complete and real analysis process.

What can you get from the case?

Learn about some specific factors that lead to Oracle RAC high availability failures.

Xiao y estimates that many friends still have similar problems in their systems.

It is suggested to refer to this case for careful inspection to eliminate hidden dangers.


A preview: the case is intricate, and instructive.

This case was difficult. The customer invested substantial manpower and time trying to find the root cause, at one point without result. After Xiao y took over, the analysis was also deadlocked for a time for lack of information. But by combing through every clue over and over, Xiao y finally found a breakthrough in a small detail that seemed to have nothing to do with the database, and successfully located the cause of the problem. The method itself is worth learning from.

Part 1

Fault description

Symptom: loss of Oracle RAC high availability, which manifested as follows:

1) At about 16:01 in the afternoon, a hardware failure occurred on the P595 hosting node 2 of the XX system's database RAC cluster, making the partition hosting the node 2 database unavailable.

2) From 16:01 onward, however, the application could not connect to surviving node 1 of the RAC cluster either.

The Oracle RAC cluster failed to deliver the high availability its architecture promises!

The customer instructs that the root cause of the problem must be found in order to improve the highly available architecture of the system.

Xiao y understood that for a data center running hundreds of RAC clusters, an event like this is a huge risk. Do other systems have the same problem? When will it happen again? Without the root cause, how do you go from this one incident to a comprehensive sweep, check, and prevention across the estate?

Environment description:

AIX 5.3

Oracle 10.2 2 Node RAC

HACMP + raw devices

Therefore, when Xiao y received the case, there was still a lot of pressure. Before starting the analysis, Xiao y got the following information:

1) During the failure, the operations DBA's connection to the database on surviving node 1 via sqlplus "/ as sysdba" hung.

2) During the failure, connecting on surviving node 1 with sqlplus -prelim "/ as sysdba" also hung. A hang even with the -prelim option is a very rare case.

3) On surviving node 1, crsctl stop crs -f could not stop CRS; the command hung and could not be terminated.

4) Restarting the operating system on surviving node 1 with shutdown -Fr also hung; the partition was finally restarted through the HMC, after which the business returned to normal.

Part 2

Analysis process

2.1 ideas for fault analysis

One node in the cluster is down, yet the other node cannot provide service.

Usually this happens because the cluster software has not finished reorganizing the cluster's state and data;

with the data left inconsistent, the remaining nodes cannot provide service.

This environment is deployed on IBM minicomputers, with three layers of cluster software:

ORACLE RAC/ORACLE CRS/IBM HACMP

Therefore, each of the three cluster layers must be checked to confirm it completed its reorganization and reconfiguration.

2.2 confirm whether ORACLE RAC has completed the reorganization

The database alert log of the surviving node shows that the RAC cluster completed its reorganization at 16:01:32.

RAC reorganization can therefore be ruled out.

2.3 confirm whether ORACLE CRS has completed the reorganization

The CRS logs show the network heartbeat timing out from 16:01; eviction of node 2 began, and node 2 finally left the cluster at 16:01:27. CRS reorganization can therefore be ruled out.

2.4 confirm whether IBM hacmp has completed the reorganization

According to the AIX experts' analysis, no abnormality was found in HACMP.

After this full round of checks, no abnormality had been found anywhere.

Since the LPAR hosting the node 2 database was down, node 2's VIP should have failed over to node 1. Yet a netstat -in check showed that node 2's floating VIP was not present on node 1!

Taking over the VIP is ultimately done by the CRSD process, so the next step is to check crsd.log for anything unusual during the takeover.

2.5 check node 1's crs log to confirm the takeover of node 2's vip

2.6 Node 1 crs Log Summary

You can see:

1) When node 2 failed, node 1's CRS did try to take over node 2's VIP, DB, and other resources and start them on node 1, but the calls to the racgwrap script for check/start/stop timed out, and the child processes were terminated.

2) Node 1 also saw timeouts in the checks of its own VIP, for example:

/oracle/app/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for ora.node1.vip! (timeout=60)

3) CRS resource management was therefore also abnormal, the common symptom being timeouts when calling the racgwrap script.

A timeout in the racgwrap script usually means one of two things:

the operating system is performing badly, for example heavy memory paging; or

a command hangs during script execution.
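The timeout behavior seen in crsd.log can be pictured with a short sketch. This is a hypothetical simplification in Python, not Oracle's actual racgwrap handling: the parent runs the action script, waits up to a limit (60 seconds, matching the timeout=60 in the log line above), and treats anything slower as a timeout and kills the child.

```python
import subprocess

def run_check(cmd, timeout=60):
    """Run a resource action script the way CRS does: wait up to
    `timeout` seconds, then terminate the child and report a timeout.
    Hypothetical sketch; not Oracle's actual racgwrap implementation."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return proc.returncode        # script finished in time
    except subprocess.TimeoutExpired:
        return None                   # CRS would log "timed out" here
```

The catch, as this case shows, is that killing the child does not help when the child is blocked in an uninterruptible call: every subsequent check simply times out again.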

2.7 check nmon data for operating system performance

The nmon data shows that nmon output actually stopped at 16:01, the fault time, on May 20, which suggests that nmon itself hit an exception, such as a hang, while executing a command.

In addition, from the perspective of the monitoring software, there is no alarm of memory and CPU in the operating system.

2.8 determine the direction of analysis

So, next, does our analysis focus on the database or the operating system?

Through the above clues, we have reason to believe:

There was something wrong with the operating system at that time! Therefore, the focus of the follow-up analysis will be placed on the operating system level!

2.9 summarize and sort out all clues

1) On the surviving node, sqlplus -prelim could not attach to shared memory.

2) kill -9 could not kill certain processes.

A process inside an atomic call, such as an I/O system call, can only act on a termination signal between calls; a process stuck inside an atomic call that never returns cannot be terminated, not even by kill -9.

3) CRS could not take over the failed node's VIP via its script; the attempt timed out.

4) CRS could not check the surviving node's own resources, such as its VIP and listener, via scripts; all the checks timed out.

5) The surviving node's nmon produced no output after the failure point.

2.10 problem analysis once reached an impasse

Although the direction pointed at the operating system, the AIX experts detected no anomaly.

Their conclusion: the operating system was normal at the time,

because some crontab scripts they checked still produced output, indicating the OS was still doing work.

As shown in the following figure

2.11 how to find a breakthrough

Xiao y had set the direction at the operating system, but the operating system experts checked and denied any OS problem.

The two sides disagreed over why the surviving node's nmon stopped writing.

With the analysis at a stalemate, Xiao y began to think: how to continue? How to prove that the operating system was abnormal?

1) Without a concrete starting point, it is hard for the OS team to find what is abnormal.

2) With the analysis deadlocked, finding a breakthrough became the key.

3) Stay firm in the belief that "operating system abnormality" is the right direction.

Go back to the beginning and re-examine and re-verify every clue: was any important clue missed?

Re-examining the clues produced an important discovery!

You can see:

From the 16:02 sample onward, the inspection script's output stops right after the line "The file system result is as follows", and the SYSTEM.SH_RUN_COMPLETE keyword that marks successful completion is never written. So the operating system hit an exception even while executing non-database commands!

2.12 confirm the specific commands that are called when an exception occurs in the shell script

Checking the shell script shows that it hangs at "The file system result is as follows:", where it is in fact calling the df command to list the file systems.

So under what circumstances does this operation hang?

The answer: when an NFS file system is involved.

The XX system database cluster uses NFS: node 2's /arch3 file system is mounted as /arch3 on node 1. When node 2 suffered its hardware failure, node 1 could no longer reach node 2's NFS server, and the df command on node 1 hung while enumerating file systems.

This explains why node 1's nmon data stopped: the df command it calls hung.
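One defensive pattern monitoring scripts can borrow from this incident is to probe a possibly dead mount under a timeout rather than calling df inline. A minimal sketch, assuming a POSIX df on the PATH; the helper name mount_responds and the 5-second limit are illustrative, not from the original environment:

```python
import subprocess

def mount_responds(path, timeout_s=5.0):
    """Return True if `df` answers for `path` within the timeout.
    A hard/background NFS mount whose server is down makes df block
    indefinitely, which this probe turns into a clean False."""
    try:
        subprocess.run(["df", "-k", path], capture_output=True,
                       timeout=timeout_s, check=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False
```

On a healthy mount this returns almost instantly; on a dead hard-mounted NFS path the child is abandoned after the timeout instead of freezing the whole monitoring script, the way nmon froze here.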

But what does this have to do with node 1 being unable to connect to the database?

When Xiao y saw the df command hang, he could have wept with joy!

Every phenomenon could now be explained, and only when every phenomenon is explained can you feel at ease: it means the real root cause has been found, and the preventive measures built on it will actually hold.

2.13 relationship between loss of Nfs mount points and inability to connect to the database

When a database connection is established, the process needs to obtain the current working directory (as pwd does).

However, due to a defect in the pwd (get_cwd) implementation in some versions of the AIX operating system, the call recursively checks the permissions and types of directories and files all the way up to the root directory.

When that check reaches the NFS directory, because NFS is mounted in hard/background mode, the unreachable NFS server inevitably makes the check hang; the pwd result can never be obtained, and the database connection cannot proceed!

2.14 solve all the mysteries

1) The surviving node's sqlplus -prelim cannot attach to shared memory:

get_cwd (pwd) hangs on the dead NFS mount point while obtaining the current working directory.

2) kill -9 cannot kill some processes:

the process is stuck in I/O against the NFS mount point and never reaches a point where it can act on the termination signal.

3) CRS cannot take over the failed node's VIP via script, and times out:

the racgwrap script calls the pwd command.

4) CRS checks of the surviving node's own resources, such as its VIP and listener, all time out:

the racgwrap script calls the pwd command.

5) The surviving node's nmon produces no output after the failure point:

the dead NFS mount point makes nmon hang when it calls the df command.

2.15 One step further in the analysis

1. Under this mechanism, if the root directory contains many files or directories, pwd (get_cwd) performance will be very poor.

2. Test environment reproduction process

First attempt: mount node 2's /testfs on node 1's /testfs and stop node 2's NFS service. The hang did not reproduce. Reason: pwd's output was /oracle, and because the first letter o sorts before t, the scan found /oracle and returned without ever touching /testfs.

Second attempt: mount node 2's /aa on node 1's /aa and stop node 2's NFS service. Still no reproduction. Comparing truss output showed that the root-directory scan behaved differently in production and in test; checking the OS levels revealed that the test environment ran a newer OS version.

3. The newer OS version does not reproduce the hang while the older one does, indicating the operating system behavior was later modified and hardened. Searching ibm.com for "nfs hang" turns up the APAR IBM released to fix this problem.

Part 3

Summary of reasons and suggestions

3.1 reason summary

1. A hardware failure on the P595 hosting RAC cluster node 2 made the node 2 LPAR unavailable,

which in turn made the NFS mount point unreachable.

2. Logging in to the database requires obtaining the current working directory (pwd).

3. However, due to a defect in the pwd implementation in some versions of the AIX operating system, the call recursively checks the permissions and types of the directories and files under the root directory.

4. When that check reaches the NFS mount point directory /arch3, because NFS is mounted in hard/background mode, the unreachable NFS server inevitably makes the check hang; the pwd output can never be obtained, and connections to the database fail!

An NFS server mounted in hard/background mode becoming unreachable is the root cause of the RAC turning into a pseudo-cluster!

All failure phenomena can be explained as follows:

1. The surviving node's sqlplus -prelim cannot attach to shared memory:

get_cwd (pwd) hangs on the dead NFS mount point while obtaining the current working directory.

2. kill -9 cannot kill some processes:

the process is stuck in I/O against the NFS mount point and never reaches a point where it can act on the termination signal.

3. CRS cannot take over the failed node's VIP via script, and times out:

the racgwrap script calls the pwd command.

4. CRS checks of the surviving node's own resources, such as its VIP and listener, all time out:

the racgwrap script calls the pwd command.

5. The surviving node's nmon produces no output after the failure point:

the dead NFS mount point makes nmon hang when it calls the df command.

3.2 problem solutions and recommendations

1) If NFS is genuinely needed, mount it under a second-level directory, for example under /home rather than directly under the root directory as /arch3, so the root-directory scan never touches it.

2) Use GPFS instead of NFS.

3) When providing fault clues, do not filter out information based on your personal judgment of what is unimportant.

4) Install the AIX APAR that changes the operating system's internal implementation of the get_cwd (pwd) call.

5) Had the df command hang been reported to Xiao y at the very start of the incident, the case would have been cracked immediately, with no further investigation needed!

6) A high-availability test before go-live is not enough, because the system undergoes a series of changes after go-live, and some of those changes can quietly strip the RAC of its redundancy. Do high-availability tests regularly; for example, restart instances and servers one at a time during a change window to verify failover.
