Suspected cause of a CRS startup failure: a multicast problem on the private network NIC

2026-02-14 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/01 Report--

In a RAC cluster, CRS could be started on only one node, and a multicast problem is the suspected cause.

A few days ago, a PSU upgrade was being tested on the history database. After the software upgrade on the first node was complete, the GI upgrade on the second node began. CRS shut down cleanly, but the upgrade then reported Error: The opatch Applicable check failed. I tried to restart CRS, but it would not start properly.

The log showed CRS-5818: Aborted command 'start' for resource 'ora.cssd'. The CSSD resource could not be started, and the processes running at the time also confirmed that the problem was with CSS.

The CSSD log from that time shows that, during startup, CSSD failed to create the local interface it uses to communicate with the remote node. The log analysis is as follows:

1. CSSD gets the private network information of the cluster from the gpnp profile.

2. It then prepares to communicate with the remote node and creates the local interface for node 'nghis-db2', but fails to bind the endpoint (localAddr 'mcast://224.0.0.251:42424/192.169.1.40'), which is a multicast address.

When I saw No buffer space available (74), I suspected that udp_sendspace and udp_recvspace were too small. A query showed them set to 65536 and 655360 respectively, which is enough in practice. Unsurprisingly, increasing both parameters and restarting CRS did not resolve the problem. Most hits for this error on MOS point to a bug: 11gR2 Grid Infrastructure Node May not Join the Cluster After Evicted With Error sgipcnUdpSend "No buffer space available (74)" (Doc ID 1352887.1).

However, the symptom here does not match that document: the failing operation here is sgipcnMctBind, whereas the document describes sgipcnUdpSend.

3. The interface status is updated, but the local interface still cannot be created, which means the node cannot communicate with the remote node. CSSD therefore disables the interface and cleans up the disabled interface.

4. It then tries to add the interface again, which still fails.

5. After that, "has a disk HB, but no network HB" was reported once a minute, indicating a connectivity failure on the private network at that point.
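
The once-a-minute pattern in step 5 can be confirmed by measuring the gaps between matching log lines. The following is a minimal sketch, not a parser for the real CSSD log: the function name heartbeat_gaps is hypothetical, the timestamp layout at the start of each line is an assumption, and the sample lines are synthetic rather than copied from a real log.

```python
from datetime import datetime

# Substring of the message quoted in step 5.
MARKER = "no network HB"


def heartbeat_gaps(lines, ts_format="%Y-%m-%d %H:%M:%S"):
    """Return the seconds between consecutive lines that contain MARKER.

    Assumption: each line starts with a 19-character timestamp in ts_format.
    Adjust the slice and the format to match the real log layout.
    """
    stamps = [
        datetime.strptime(line[:19], ts_format)
        for line in lines
        if MARKER in line
    ]
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(stamps, stamps[1:])
    ]


# Synthetic lines for illustration only, not taken from a real CSSD log.
sample = [
    "2000-01-01 00:00:00 node has a disk HB, but no network HB",
    "2000-01-01 00:00:30 unrelated message",
    "2000-01-01 00:01:00 node has a disk HB, but no network HB",
    "2000-01-01 00:02:00 node has a disk HB, but no network HB",
]
print(heartbeat_gaps(sample))  # [60.0, 60.0]
```

Steady gaps of about 60 seconds would match the behaviour described in step 5. Irregular gaps would suggest an intermittent fault rather than a permanent one.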

So we tested connectivity to the private network address, including with traceroute, and found no connectivity problem.

That left the question of why no network heartbeat could be detected when the heartbeat NIC itself was fine. The problem most likely still lay in the gipcmodNetworkProcessBind step above, which failed with No buffer space available (74). For comparison, when gipchaWorkerCreateInterface runs normally on node 1, it adds four addresses in total:

1. udp://192.169.1.39:13034 - private network address

2. mcast://224.0.0.251:42424/192.169.1.39 - multicast address

3. mcast://230.0.1.0:42424/192.169.1.39 - multicast address

4. udp://192.169.1.127:42424 - broadcast address
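
To make the four endpoint strings easier to compare, here is a minimal sketch that splits each one into its parts and checks which addresses fall in the multicast range. It makes two assumptions that the log does not confirm: that the trailing path component of an mcast endpoint is the local interface address, and that the private network uses a /25 netmask. The /25 is inferred only because it would make 192.169.1.127 the broadcast address. The helper name describe is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

# The four endpoints listed above for node 1.
ENDPOINTS = [
    "udp://192.169.1.39:13034",
    "mcast://224.0.0.251:42424/192.169.1.39",
    "mcast://230.0.1.0:42424/192.169.1.39",
    "udp://192.169.1.127:42424",
]


def describe(endpoint):
    """Split an endpoint into (scheme, address, port, local_if, is_multicast)."""
    parts = urlsplit(endpoint)
    address = ipaddress.ip_address(parts.hostname)
    # Assumption: the path component names the local interface address.
    local_if = parts.path.lstrip("/") or None
    return parts.scheme, str(address), parts.port, local_if, address.is_multicast


for endpoint in ENDPOINTS:
    print(describe(endpoint))

# Assumption: a /25 netmask, which is not stated in the article.
private = ipaddress.ip_interface("192.169.1.39/25")
print(private.network.broadcast_address)  # 192.169.1.127
```

Only the second and third endpoints use addresses in the multicast range (224.0.0.0/4), which is consistent with the labels in the list.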

Node 2 evidently fails when adding the second of these addresses, the multicast address mcast://224.0.0.251:42424/192.169.1.40, in the process described above.

A multicast test tool was then used to check multicast connectivity on the private NIC. The test failed on node 2 but succeeded on node 1, so the problem appeared to lie with the multicast address on node 2.
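
The article does not name the multicast test tool. As an illustration of the local half of such a check, the sketch below binds a UDP socket and tries to join the multicast group through the private interface. It is not the tool that was used: the function names membership_request and can_join are hypothetical, and the group, port, and local address are simply the values from the failing endpoint in the log.

```python
import socket


def membership_request(group, local_ip):
    """Build the 8-byte ip_mreq value: group address, then local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def can_join(group, port, local_ip):
    """Try to bind a UDP socket and join `group` on the interface owning `local_ip`.

    Returns (True, None) on success or (False, error) if any socket call fails.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind(("", port))
            sock.setsockopt(
                socket.IPPROTO_IP,
                socket.IP_ADD_MEMBERSHIP,
                membership_request(group, local_ip),
            )
        return True, None
    except OSError as exc:
        return False, exc


if __name__ == "__main__":
    # Values taken from the failing endpoint on node 2.
    ok, err = can_join("224.0.0.251", 42424, "192.169.1.40")
    print("join ok" if ok else f"join failed: {err}")
```

A successful join shows only that the local network stack accepts the membership request. It does not prove that multicast packets actually travel between the two nodes, which a two-node test would also need to check.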

Suspecting a HAIP problem, I tried disabling HAIP and removing the 169.254.x.x (HAIP) address from the private NIC, but the problem remained.

The command to disable HAIP:

Oracle/app/11.2.0.4/grid/bin/crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init

Finally, a colleague suggested rebooting the host. Since this is a history database with no real-time business, we confirmed that there would be no impact and rebooted. After the reboot, CRS started normally and CSS passed the gipchaWorkerCreateInterface step without error.

We checked multicast connectivity on the private NIC once more, and this time the test succeeded.

At this point the problem is solved, but because the fix was a host reboot, the root cause was never really identified. Does the failed multicast test mean that there really was a problem with the network? I dare not draw a conclusion on this point.
