How should one analyze the problem of frequent RAC node restarts? Many people without much experience are at a loss when they run into it, so this article summarizes the cause of the problem and how it was resolved, in the hope that it helps you solve the same issue.
Environment: a 2-node RAC built on two Lenovo R680 physical servers, running Oracle Database 11.2.0.4.
First, the failure symptoms:
Node 2 restarted frequently. The problem occurred many times from January to February, sometimes as often as 3 times a day, which was a real headache.
Second, the analysis and troubleshooting process:
1. Time synchronization problem
The first suspicion was that the time was out of sync.
It was observed that the server's NTP time synchronization offset was too large,
and an abnormal return value appeared in the database's CTSS log.
One problem was found here: the time source still pointed to the old time source server, even though the server was now in the new data center. The configuration was therefore changed to the time source server of the new data center, and the BIOS clock was adjusted so that the system clock and the hardware clock were consistent. At this point, the time synchronization problem was eliminated.
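For reference, the checks and fixes in this step might look like the following shell sketch. It is only illustrative: it assumes a RHEL-style system with Oracle Grid Infrastructure 11.2, and the time source configured in /etc/ntp.conf is whatever the new data center provides.

```bash
# Show the configured NTP peers and the current offset (in ms).
ntpq -p

# On RAC, the Cluster Time Synchronization Service (CTSS) should be in
# observer mode when NTP is active; check its state as the grid user.
crsctl check ctss

# After pointing /etc/ntp.conf at the new data-center time source,
# restart ntpd and sync the hardware (BIOS) clock to the system clock.
service ntpd restart
hwclock --systohc
```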
2. What the database logs showed
Checking the ALERT log showed that a node had been evicted.
The CSSD log was then checked, and it showed
that the node still had the disk heartbeat but had lost the network heartbeat.
At this point it was judged that, since node 2 was the one restarting frequently, a problem on the private network was the more likely cause, so the investigation turned to the network. Moreover, after each restart node 2 rejoined the RAC cluster smoothly, which made a time synchronization problem even less likely.
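As a hedged illustration of where these symptoms show up, the log checks might look like the sketch below. The /u01 paths assume default 11.2 locations, and the grep patterns are only approximate; the exact message text depends on version and patch level.

```bash
# Database alert log: look for instance eviction / reconfiguration messages.
grep -i "evict" /u01/app/oracle/diag/rdbms/*/*/trace/alert_*.log

# Clusterware CSSD log: network heartbeat misses while the disk heartbeat
# keeps working are the classic signature of a private-network problem.
grep -i "heartbeat" /u01/app/11.2.0/grid/log/$(hostname -s)/cssd/ocssd.log | tail -50
```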
Additional background:
If a node in the cluster continuously loses its disk heartbeat or network heartbeat, the node is evicted from the cluster, that is, the node restarts. A node restart triggered by group management is called node kill escalation (applicable only in 11gR1 and above), and the restart must complete within a specified time (reboottime, usually 3 seconds).
Network heartbeat: the ocssd.bin process sends a network heartbeat to every node in the cluster over the private network once per second to confirm that each node is alive. If a node keeps missing the network heartbeat until the threshold misscount is reached (30 seconds by default, or 600 seconds when third-party cluster management software is present), the cluster votes through the voting disk and the master node evicts the node that lost the network heartbeat, that is, the node restarts. If the cluster contains only 2 nodes, a split-brain occurs, and the outcome is that the node with the smaller node number survives, even if the network problem is on that node.
Disk heartbeat: the ocssd.bin process writes this node's status information to all voting disks (Voting Files) once per second; this process is the disk heartbeat. If a node keeps missing the disk heartbeat until the threshold disktimeout is reached (usually 200 seconds), the node restarts automatically to keep the cluster consistent. In addition, CRS only requires that N/2 + 1 voting disks be available, where N is the number of voting disks, which is generally odd.
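These thresholds can be read directly from the clusterware. A minimal sketch, run as the Grid Infrastructure owner; the values in the comments are the usual 11.2 defaults:

```bash
# Network heartbeat threshold in seconds (default 30 on Linux).
crsctl get css misscount

# Disk heartbeat threshold in seconds (default 200).
crsctl get css disktimeout

# Time allowed for the forced restart, in seconds (default 3).
crsctl get css reboottime

# List the voting disks; a node must still see a majority of them
# (at least N/2 + 1) to remain in the cluster.
crsctl query css votedisk
```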
3. Checking the network
The heartbeat network of this RAC consists of the ETH13 and ETH15 network cards, connected to corresponding ports on two switches.
Re-activating the two switch ports and the NIC ports did not solve the problem, so in the end the cables were replaced and re-seated one by one. The optical attenuation of the line was found to be somewhat high, but the restart problem was still not resolved.
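The per-interface checks in this step could be reproduced with something like the sketch below; eth13, eth15, and the peer address 192.168.10.2 are placeholders standing in for this environment's private interconnect.

```bash
# Link state, negotiated speed, and duplex of the private NICs.
ethtool eth13
ethtool eth15

# Error/drop counters; counters that keep rising point at cabling,
# optics, or switch-port problems.
ifconfig eth13; ifconfig eth15

# Sustained large-packet ping across the private interconnect to the
# other node's private address.
ping -s 1400 -c 100 192.168.10.2

# Confirm which interface the clusterware uses for the interconnect.
oifcfg getif
```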
4. Is it a hardware problem?
At this point the investigation was at an impasse, so the line of thinking had to change: if the network and the database were probably not the problem, could the hardware really be ruled out?
The answer was no; it was indeed a hardware problem.
When the node restarted, the database log simply broke off, so could it be a CPU or memory problem? It was time to check the MCELOG.
MCELOG, a log that should not be ignored
mcelog is a tool for checking hardware errors, especially memory and CPU errors, on x86 Linux systems; its log is the MCELOG.
Generally speaking, servers with large amounts of memory are prone to memory problems, and memory controllers are now integrated into the CPU, so memory check (ECC) errors and CPU problems can easily lead to a server restart.
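Checking for machine-check errors might look like the sketch below on a RHEL-style system; the log path assumes the common default and may differ by distribution and mcelog version.

```bash
# Machine check exceptions recorded by the mcelog daemon (common default path).
cat /var/log/mcelog

# Correctable and uncorrectable hardware errors also surface in the kernel log.
dmesg | grep -iE "mce|hardware error"
```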
At this point the root cause surfaced. The hardware vendor was contacted, the motherboard firmware was re-flashed, and a memory module was replaced, after which the problem was finally solved.
Third, summary and reflections:
1. The role of monitoring should not be overlooked. In this case, the memory hardware problem never showed up on the server hardware monitoring platform; the vendor needs to be engaged to keep improving the granularity and sensitivity of server hardware monitoring.
2. With a comprehensive investigation covering logs, network, database, operating system, and hardware, the problem will eventually be found.
3. Solving such a problem depends on patience and care; push a little further and it will eventually be solved.
After reading the above, have you mastered the method for analyzing frequent RAC node restarts? Thank you for reading!