Exchange failover cluster network threshold tuning

2025-07-19 Update From: SLTechnology News&Howtos

Overview

A Windows Server failover cluster is a high-availability platform that constantly monitors the network connections and the health of the nodes in the cluster. If a node cannot be reached over the network, recovery action is taken to bring its applications and services online on another node in the cluster.

By default, failover clusters are configured to provide the highest level of availability with minimal downtime. The out-of-the-box settings are optimized for failures in which a server is lost completely, which we will call hard failures in this blog. These are unrecoverable failure scenarios, such as the failure of non-redundant hardware or power. In these cases the server is lost, and the goal of the failover cluster is to detect the loss very quickly and recover quickly on another server in the cluster. To achieve this rapid recovery from hard failures, the default settings for cluster health monitoring are quite aggressive. However, they are fully configurable, which provides flexibility for a variety of scenarios.

These default settings provide the best behavior for most customers, but as cluster nodes are stretched from being inches apart to possibly miles apart, the cluster may be exposed to additional and possibly unreliable network components between nodes. Another factor is that the quality of commodity servers keeps increasing; coupled with the added resiliency of redundant components such as dual power supplies, NIC teaming, and multipath I/O, the number of non-redundant hardware failures may be quite small. Because hard failures may be less frequent, some customers may want to tune the cluster for transient failures, making it more resilient to brief network failures between nodes. Increasing the default failure thresholds reduces the sensitivity to short-lived network problems.

Tradeoffs

It is important to understand that there is no single right answer to the tradeoffs described below. The optimal settings vary with your specific business requirements and service level agreements.

Aggressive monitoring: provides the fastest failure detection and the fastest recovery from hard failures, which gives the highest level of availability. The cluster is less forgiving of transient failures, and in some situations resources may fail over prematurely when there is a transient network outage.

Relaxed monitoring: provides more forgiving failure detection and greater tolerance of transient network problems. The longer timeouts mean that the cluster takes more time to recover from hard failures, which increases downtime.

Think of it like your cell phone. When the other end goes silent, how long are you willing to sit there saying "Hello? Are you there? Are you there?" before you hang up and call the person back? When the other end goes silent, you don't know when, or even if, they will come back.

The key question you need to ask yourself is: what is more important to you? Do you want to recover quickly when someone pulls the power cord, or do you want to be tolerant of transient network failures?

Settings

There are four main settings that affect cluster heartbeat and inter-node health detection.

Delay: this defines the frequency at which cluster heartbeats are sent between nodes. The delay is the number of seconds before the next heartbeat is sent. Within the same cluster there can be different delays between nodes on the same subnet, between nodes on different subnets, and between nodes in different sites.

Threshold: this defines the number of heartbeats that can be missed before the cluster takes recovery action. The threshold is a number of heartbeats. Within the same cluster there can be different thresholds between nodes on the same subnet, between nodes on different subnets, and between nodes in different sites.

It is important to understand that the delay and the threshold have a cumulative effect on overall health detection. For example, setting CrossSubnetDelay to send a heartbeat every 2 seconds and CrossSubnetThreshold to 10 missed heartbeats before recovery means that the cluster can tolerate a total of 20 seconds of network loss before it takes recovery action. In general, the preferred approach is to keep sending frequent heartbeats but to use a larger threshold. The main scenario for increasing the delay is when there are ingress/egress charges for the data sent between nodes. The following table lists the properties used to tune the cluster heartbeat as well as the default and maximum values.
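The cumulative effect described above is simply the delay multiplied by the threshold. A minimal sketch of that arithmetic follows; `Get-HeartbeatTolerance` is a hypothetical helper written for illustration, not a cluster cmdlet, and it takes the delay in seconds as the text describes it.

```powershell
# Hypothetical helper (not part of the FailoverClusters module):
# returns how many seconds of lost heartbeats are tolerated before recovery.
function Get-HeartbeatTolerance {
    param(
        [double]$DelaySeconds,   # seconds between heartbeats
        [int]$Threshold          # heartbeats that may be missed before recovery
    )
    return $DelaySeconds * $Threshold
}

# Values from the example in the text: a heartbeat every 2 seconds,
# 10 missed heartbeats before recovery.
Get-HeartbeatTolerance -DelaySeconds 2 -Threshold 10   # 20
```

When reading the delay from a live cluster, check the unit first: to my knowledge the delay properties are reported in milliseconds (for example 1000 for one second), so the value would need to be divided by 1000 before being passed to a helper like this.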

To be more tolerant of transient failures, it is recommended on Win2008 / Win2008 R2 / Win2012 / Win2012 R2 to increase the SameSubnetThreshold and CrossSubnetThreshold values to the higher Win2016 values. Note: if the Hyper-V role is installed on a Windows Server 2012 R2 failover cluster, the default value of SameSubnetThreshold automatically increases to 10 and the default value of CrossSubnetThreshold automatically increases to 20. After the following hotfix is installed, the default heartbeat values on Windows Server 2012 R2 increase to the same values as on Windows Server 2016.

https://support.microsoft.com/en-us/kb/3153887

Configuration

Cluster heartbeat configuration settings are considered advanced settings and are exposed only through PowerShell. They can be changed while the cluster is up and running, with no downtime, and they take effect immediately without a reboot or a restart of the cluster.

To view the current heartbeat configuration values:

PS C:\> Get-Cluster | fl *subnet*

You can modify settings using the following syntax:

PS C:\> (Get-Cluster).SameSubnetThreshold = 20
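As a sketch of how the view and modify steps might be combined, the following script records the current values, raises both thresholds, and then confirms the change. The values 10 and 20 are the Windows Server 2016 defaults mentioned earlier in this article; treating them as the right targets for your cluster is an assumption, so choose values that match your own requirements.

```powershell
# Assumes the FailoverClusters module is available and that this runs
# on a cluster node with administrative rights.
Import-Module FailoverClusters

$cluster = Get-Cluster

# Record the current heartbeat settings before changing anything.
$cluster | Format-List *subnet*

# Example targets taken from the text (same subnet 10, cross subnet 20).
$cluster.SameSubnetThreshold  = 10
$cluster.CrossSubnetThreshold = 20

# Read the settings back to confirm the change.
Get-Cluster | Format-List *subnet*
```

Keeping a copy of the first listing makes it straightforward to return to the previous values if the relaxed settings do not behave as expected.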

Additional considerations for logging

In Windows Server 2012, there is additional logging in Cluster.log that records the heartbeat traffic when heartbeats are dropped. By default, RouteHistoryLength is set to 10, which is twice the default threshold. If you increase the SameSubnetThreshold or CrossSubnetThreshold value, it is recommended that you increase RouteHistoryLength to twice that value, so that there is adequate logging when you need to troubleshoot dropped heartbeats. This can be done with the following syntax:

PS C:\> (Get-Cluster).RouteHistoryLength = 20
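Rather than hard-coding the number, the "twice the threshold" rule could be derived from the thresholds currently configured. This is a minimal sketch; using the larger of the two thresholds as the basis is an assumption made here, since the text only says to use twice the value you increased.

```powershell
# Assumes the FailoverClusters module is available and that this runs
# on a cluster node with administrative rights.
Import-Module FailoverClusters

$cluster = Get-Cluster

# Assumption: size the history for whichever threshold is larger.
$largest = [Math]::Max($cluster.SameSubnetThreshold, $cluster.CrossSubnetThreshold)

# Keep twice as many route history entries as the largest threshold.
$cluster.RouteHistoryLength = 2 * $largest

# Read the value back to confirm it.
Get-Cluster | Format-List RouteHistoryLength
```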

For more information about troubleshooting nodes that are removed from cluster membership because of network communication problems, see the following blog:

http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx