Misscount/disktimeout/reboottime Analysis of heartbeat and its parameters in Oracle Cluster


Updated 2025-07-06. Source: SLTechnology News&Howtos (Shulou.com), 05/31 report.

This article explains the Oracle cluster heartbeat and its parameters misscount, disktimeout, and reboottime.

1. OCSSD and CSS

OCSSD is a Linux or Unix process that provides Cluster Synchronization Services (CSS). The process runs as the oracle user and manages node membership; if it fails, the node restarts. The CSS service provides two heartbeat mechanisms: a network heartbeat and a disk heartbeat. Each heartbeat has a maximum tolerated delay. For the network heartbeat it is called MC (misscount), and for the disk heartbeat it is called IOT (I/O timeout, the disktimeout parameter).

Both parameters are measured in seconds. By default, Misscount < Disktimeout.

The two heartbeat mechanisms are described below.

2. Network heartbeat

As the name implies, the network heartbeat checks the status of the nodes over the private network. Suppose a hardware or software fault on the private network prevents the cluster nodes from communicating normally for a certain period. The result is a split-brain. Because a cluster uses shared storage, the failed node must be isolated from the cluster to avoid data corruption. The network heartbeat works as follows:

Every second, a sending thread in ocssd.bin sends a network TCP heartbeat to itself and to all other nodes. A receiving thread in ocssd.bin receives the heartbeats.

If a packet is dropped or corrupted on the network, the error-correction mechanism of TCP retransmits it.

Oracle itself does not retransmit. In ocssd.log you will see a WARNING message about a missing heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount). Another warning is reported if the heartbeat from the same node is still missing at 22 seconds (75% of misscount), and another at 27 seconds (90% of misscount). When the heartbeat has been missing for 100% of misscount, that is 30 seconds, the node is evicted.
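The warning schedule above can be sketched in a few lines of Python. This is only an illustration, assuming the thresholds are the whole-second floors of 50%, 75%, and 90% of misscount (which reproduces the 15, 22, and 27 seconds quoted for the default of 30). The function names are hypothetical and are not part of any Oracle tool.

```python
def heartbeat_thresholds(misscount=30):
    """Return (seconds, percent) pairs at which a missing heartbeat is reported.

    Assumption: each threshold is floor(misscount * percent / 100).
    """
    return [(misscount * pct // 100, pct) for pct in (50, 75, 90, 100)]


def heartbeat_status(seconds_missing, misscount=30):
    """Classify how long a node's network heartbeat has been missing."""
    status = "ok"
    for seconds, pct in heartbeat_thresholds(misscount):
        if seconds_missing >= seconds:
            # Keep the highest threshold that has been reached so far.
            status = "evict" if pct == 100 else f"warning ({pct}% of misscount)"
    return status
```

For example, `heartbeat_status(22)` classifies a 22-second gap as the 75% warning, and `heartbeat_status(30)` as an eviction.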

The maximum tolerated delay of the network heartbeat is called misscount. It can be queried and changed with the crsctl tool.

[grid@Linux-01 ~]$ crsctl get css misscount

CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
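When the value is needed in a script, the message line can be parsed. The sketch below assumes the output always has the shape of the single CRS-4678 line shown above; other releases or parameters may word the message differently. The helper name `parse_css_get` is hypothetical.

```python
import re

# Pattern modeled on the one CRS-4678 line shown in this article (assumption:
# other parameters such as reboottime and disktimeout use the same wording).
_CSS_GET = re.compile(
    r"CRS-4678: Successful get (\w+) (\d+) for Cluster Synchronization Services\."
)


def parse_css_get(output):
    """Return (parameter, seconds) parsed from `crsctl get css <parameter>` output."""
    match = _CSS_GET.search(output)
    if match is None:
        raise ValueError(f"unrecognized crsctl output: {output!r}")
    return match.group(1), int(match.group(2))


sample = "CRS-4678: Successful get misscount 30 for Cluster Synchronization Services."
print(parse_css_get(sample))  # ('misscount', 30)
```

Raising on unrecognized output is a deliberate choice here: a script that silently falls back to a default timeout could hide a failed query.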

In the query result above, misscount is 30: if the private interconnect delay between cluster nodes exceeds 30 seconds, Oracle concludes that a split-brain has occurred and that the failed node must be evicted from the cluster.

How is the failed node identified? Oracle decides with a voting algorithm. The following illustrative description of the algorithm is adapted from a reference on Oracle RAC.

Each node in the cluster uses the heartbeat mechanism to report its "health status" to the others. Assume that each "notification" a node receives counts as one vote. In a three-node cluster, each node has 3 votes during normal operation. When the heartbeat of node A fails but node A is still running, the cluster splits into two small partitions: node A is one, and the other two nodes form the other.

One partition must be removed to keep the cluster healthy. In the three-node cluster, after A's heartbeat fails, B and C form a partition with two votes, and A has only one vote. According to the voting algorithm, the partition formed by B and C gains control and A is evicted.

With only two nodes, the voting algorithm breaks down, because each node has only one vote. A third device must then be introduced: the quorum device. The quorum device is usually a shared disk, also called the quorum disk, and it also represents one vote. When the heartbeat between the two nodes fails, both nodes race for the quorum disk's vote, and the earliest request is served first. The node that gets the quorum disk first therefore has 2 votes, and the other node is evicted.
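The vote counting described above can be modeled with a small Python sketch. It is a simplified illustration of the textbook description only, not Oracle's actual implementation; `surviving_partition` and its parameters are hypothetical names.

```python
def surviving_partition(partitions, quorum_winner=None):
    """Pick the partition that keeps control after a heartbeat failure.

    partitions: list of sets of node names; each node carries one vote.
    quorum_winner: the node that claimed the quorum disk first (worth one
        extra vote), or None when no quorum disk is involved.
    Returns the winning set, or None if the vote is tied.
    """
    votes = []
    for nodes in partitions:
        count = len(nodes)
        if quorum_winner in nodes:
            count += 1  # the quorum disk's vote goes to this partition
        votes.append(count)
    best = max(votes)
    if votes.count(best) > 1:
        return None  # tie: node votes alone cannot decide
    return partitions[votes.index(best)]
```

With three nodes, `surviving_partition([{"A"}, {"B", "C"}])` returns the B and C partition. With two nodes, `surviving_partition([{"A"}, {"B"}])` returns `None` (a tie), while passing `quorum_winner="B"` breaks the tie in favor of B.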

Before 11gR2, a node that had been isolated was usually restarted right away. In 11gR2, Clusterware first tries to shut down all resources on that node and clean up the failed components in the cluster, that is, restart only the failed components. If that cleanup does not succeed, the node is restarted to force the cleanup.

3. Disk heartbeat

A thread in ocssd.bin updates the voting disk every second.

If a node does not update the voting disks for 200 seconds, it's evicted.

However, the ocssd.bin on the local node has logic to bring the node down if it gets I/O errors on a majority of the voting disks. Also, while a CRS reconfiguration is in progress, a shorter 27-second limit tied to misscount applies, and the local node is rebooted when it is exceeded. As a result, you rarely see an eviction due to voting disk failure on 10.2.0.4 (it is more common on 10.2.0.1), because ocssd.bin aborts the node before another node can evict it if writing to the voting disk is the problem.

As mentioned above, each node updates the voting disk every second. The shared voting disk is used to check the heartbeat of the disk.

Suppose the ocssd process has been unable to update a voting disk for more than 200 seconds, the value set by disktimeout. Oracle then considers that voting disk offline and writes a voting disk offline record to the Clusterware alert log. If the number of offline voting disks on the current node is smaller than the number of online voting disks, the node survives. If the number of offline voting disks is greater than or equal to the number of online voting disks, Clusterware considers the disk heartbeat faulty and evicts the node from the cluster; the node then goes through its own recovery process.

For example, suppose there are three voting disks and node A has two of them offline. Then offline disks (2) > online disks (1), so node A is evicted from the cluster.
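The offline/online comparison can be written down as a short Python sketch. It assumes each voting disk is represented only by the number of seconds since the node last updated it, and that a disk counts as offline once that age exceeds disktimeout. The function names are hypothetical.

```python
def offline_voting_disks(seconds_since_update, disktimeout=200):
    """Return the indexes of voting disks not updated within disktimeout seconds."""
    return [i for i, age in enumerate(seconds_since_update) if age > disktimeout]


def node_survives(seconds_since_update, disktimeout=200):
    """Apply the rule from the text: survive only if offline disks < online disks."""
    offline = len(offline_voting_disks(seconds_since_update, disktimeout))
    online = len(seconds_since_update) - offline
    return offline < online
```

With three voting disks, `node_survives([1, 1, 250])` is true because one offline disk is outnumbered by two online disks, while `node_survives([1, 250, 300])` is false and the node would be evicted.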

4. RebootTime parameter

The reboottime parameter is also important. Its default is 3 seconds: the amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted. It can be queried with:

crsctl get css reboottime

5. Adjusting the heartbeat parameters

1) Procedure for versions 10.2.0.2 to 11.1.0.7

A) Shut down CRS on all but one node. For exact steps use note 309542.1

B) Execute crsctl as root to modify the misscount:

$CRS_HOME/bin/crsctl set css misscount <n> # where <n> is the maximum private network latency in seconds

$CRS_HOME/bin/crsctl set css reboottime <r> [-force] # (<r> is seconds)

$CRS_HOME/bin/crsctl set css disktimeout <d> [-force] # (<d> is seconds)

C) Reboot the node where adjustment was made

D) Start all the other nodes that were shut down in step A

E) Execute crsctl as root to confirm the change:

$CRS_HOME/bin/crsctl get css misscount

$CRS_HOME/bin/crsctl get css reboottime

$CRS_HOME/bin/crsctl get css disktimeout

2) Procedure for 11gR2

With 11gR2, these settings can be changed online without taking any node down:

A) Execute crsctl as root to modify the misscount:

$CRS_HOME/bin/crsctl set css misscount <n> # where <n> is the maximum private network latency in seconds

$CRS_HOME/bin/crsctl set css reboottime <r> [-force] # (<r> is seconds)

$CRS_HOME/bin/crsctl set css disktimeout <d> [-force] # (<d> is seconds)

B) Execute crsctl as root to confirm the change:

$CRS_HOME/bin/crsctl get css misscount

$CRS_HOME/bin/crsctl get css reboottime

$CRS_HOME/bin/crsctl get css disktimeout
