
WSFC 2016 VM resiliency and storage fault tolerance


Today I am happy to introduce the VM resiliency and storage fault tolerance technologies of the Microsoft 2016 cluster. In Lao Wang's view, VM resiliency is a very important change to cluster operation in WSFC 2016; like rolling upgrade, it represents a disruptive way of thinking.

Simply put, in most people's perception a cluster should fail over quickly after detecting that a node is unavailable and move the applications to other nodes, right? I believe everyone agrees with this point.

In 2012 R2, by default, heartbeats are sent once per second both on the same subnet and across subnets. After five consecutive missed heartbeats the node is determined to be unavailable, and RCM begins failing over roles according to the contents of the cluster database. The detection interval and failure threshold can be changed: if your environment is unstable, strict monitoring will lead to frequent node failover, so you can loosen it to, for example, one probe per second with failover only after 20 missed heartbeats.

However, this threshold should not be stretched too far. One reason is that the value applies at the whole-cluster level: if there are many applications on the cluster, all of them are affected by it. The other is that if the detection window is too long, a real failure will cause a long outage before it is even discovered. Therefore, up to 2012 R2, Microsoft recommended a maximum failure count before failover and advised against exceeding it.

In the final analysis, we modify the monitoring threshold to cope with network instability or to meet specific user needs. For example, if the customer's network is unstable, transient interruptions keep showing up in detection and cannot be fixed, so you can loosen the monitoring threshold; if the customer's network is very stable and strict detection is needed to guarantee the SLA, you can tighten the threshold instead, as sketched below.
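As a quick reference, this is a minimal sketch of how those heartbeat settings are usually viewed and loosened with PowerShell on a 2012 R2 or later cluster; the properties are the standard cluster heartbeat properties, and 20 is just the example threshold mentioned above.

# View the current heartbeat interval and failure thresholds
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

# Loosen detection: still probe every second, but only give up after 20 missed heartbeats
(Get-Cluster).SameSubnetThreshold = 20
(Get-Cluster).CrossSubnetThreshold = 20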

That was the solution of the 2012 R2 era. By 2016, Microsoft's view is that genuine hard failures are rare while transient failures are far more common, such as a node temporarily losing network communication or temporarily losing its connection to storage and then recovering immediately. Therefore, Microsoft redesigned the VM failover strategy in the cluster so that a node can experience a transient failure for a certain period of time without triggering failover of its virtual machines.

In WSFC 2016, VM resiliency is enabled by default. In 2016 TP1 the feature was disabled by default; subsequent versions enable it by default. Run Get-Cluster | fl * and you can see the configuration related to VM resiliency.
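For example, a wildcard filter keeps the output to just the relevant properties; a minimal sketch, assuming it is run on a cluster node:

# Show only the VM resiliency and quarantine related cluster properties
Get-Cluster | Format-List *Resiliency*, *Quarantine*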

Parameter descriptions

ResiliencyLevel: IsolateOnSpecialHeartbeat (1) or AlwaysIsolate (2). The default is AlwaysIsolate, meaning that after a transient node interruption the node's virtual machines are allowed to stay online or remain suspended for a period of time. With IsolateOnSpecialHeartbeat, when an ordinary transient interruption is detected the node is immediately put into a failed state and failover is performed.
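A small sketch of switching between the two levels (run against the cluster; 2 is the default):

# 2 = AlwaysIsolate (default): tolerate transient interruptions and isolate the node first
(Get-Cluster).ResiliencyLevel = 2

# 1 = IsolateOnSpecialHeartbeat: treat an ordinary transient interruption as a failure right away
# (Get-Cluster).ResiliencyLevel = 1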

As we said earlier, 2016 refactored VM failover policies

How exactly is it reconstructed?

In WSFC 2016, suppose a transient outage occurs, such as:

The network is temporarily unstable, and nodes cannot communicate with other nodes.

Cluster service crashes, unable to connect to other nodes

Administrator error

When such a transient outage occurs, the cluster now has three new states:

1. Isolated: after a transient outage, the cluster node is marked as Isolated after the specified period of time. It is no longer an active cluster member, but the virtual machines hosted on it can continue to function for a certain period of time.

2. Unmonitored: this describes how the VM status appears in Failover Cluster Manager. After a node's transient outage, the VMs on it are shown in the Unmonitored state inside the cluster.

If a virtual machine's files are stored on an SMB3/SOFS path, the virtual machine can remain in the Online state while the node is isolated, because SMB access works independently of cluster membership. If the virtual machine is stored on a CSV backed by block storage such as FC/FCoE/iSCSI/Shared SAS, the virtual machine is put into a Paused state, because after isolation the node is no longer a qualified cluster member and loses access to the CSV. If the node returns to normal, the virtual machine resumes from the Paused state; if the node's transient interruption does not recover within the allowed period, the virtual machine is failed over to another node.

3. Quarantine: we set a time window; if the node recovers from its transient interruption within that window, the virtual machines are not migrated and simply continue to run. However, if the node is isolated a certain number of times within an hour, i.e. transient interruptions keep occurring, the cluster determines that the node is abnormal and may destabilize the applications, so it places the node in the Quarantine state. The node stays quarantined for a period of time and all virtual machines on it are live migrated away, until we analyze the problem, determine that the node is back to normal, and let it rejoin the cluster.

ResiliencyDefaultPeriod (cluster level) / ResiliencyPeriod (per VM group): configures how long a node may run in the Isolated state. The default is 240 seconds: a transient outage shorter than 240 seconds is tolerated without failover, and if the node has not recovered after 240 seconds, failover is performed based on cluster detection.

# Configure the isolation period at the cluster level, in seconds

(Get-Cluster).ResiliencyDefaultPeriod = 60

# Disable the isolation feature (0 means fail over immediately)

(Get-Cluster).ResiliencyDefaultPeriod = 0

# Configure the isolation period (i.e. the Unmonitored time) for an individual VM group

(Get-ClusterGroup "stat").ResiliencyPeriod = 60

QuarantineThreshold: the number of isolations before a node enters the Quarantine state. The default is 3: after the node has been isolated 3 times within one hour, it enters the Quarantine state and all of its virtual machines are live migrated away.

# Configure how many isolations within an hour are allowed before quarantine

(Get-Cluster).QuarantineThreshold = 3    # e.g. 3, the default

QuarantineDuration: how long a node stays in the Quarantine state. The default is 7200 seconds. During this time the node hosts no applications and all of its virtual machines are live migrated away, giving the administrator time to investigate the frequent transient failures. Once the problem is fixed, the node can be restored manually ahead of time, or it recovers automatically after 7200 seconds.

# Configure the quarantine duration, in seconds

(Get-Cluster).QuarantineDuration = 7200    # e.g. 7200, the default

Too much theory gets boring, so let's look at an actual case.

There are four VMs in the current environment: RODC runs on the SOFS path, and the other three VMs run on a CSV path provided by iSCSI. I set the node isolation period to 60 seconds and the quarantine duration to 600 seconds; Lao Wang is only testing here and wants to see results quickly. In a real environment you should evaluate the actual timings: how long an interruption can still be counted as transient, and how long you need to troubleshoot a problem node when transient interruptions occur frequently.

Functional requirements for VM resiliency in WSFC 2016

Cluster Functional Level 9

Virtual machine configuration version upgraded to at least 6.x
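Both prerequisites can be checked quickly; a minimal sketch, assuming it is run on one of the Hyper-V cluster nodes:

# The cluster functional level should report 9 for a Windows Server 2016 cluster
Get-Cluster | Select-Object Name, ClusterFunctionalLevel

# The VM configuration version should be at least 6.x
Get-VM | Select-Object Name, Version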

In Lao Wang's experiment, I will simulate a short cluster service crash by forcibly stopping the cluster service, and then observe the cluster's reaction.

All four virtual machines in the current cluster are hosted on HV01

We simulate a cluster service crash by forcibly stopping the clussvc process on the node

Stop-Process -Name clussvc -Force

As you can see, the node is placed in the Isolated state after the cluster service crashes.
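If you prefer PowerShell to Failover Cluster Manager, the node state can also be checked from one of the surviving nodes; a minimal sketch:

# After the clussvc crash, the affected node should show State = Isolated
Get-ClusterNode | Select-Object Name, State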

All virtual machines in the cluster appear as Unmonitored; this is only a temporary state shown in Failover Cluster Manager.

But in Hyper-V you can see that RODC, stored on SOFS, keeps running during the 60 seconds. The other virtual machines on CSV show Running-Critical, but this Chinese-language display is wrong; the English UI shows Paused-Critical. In essence they are paused, because the isolated node has lost its access to the CSV.

If the node returns to normal within 60 seconds, i.e. the transient interruption is resolved, the network recovers, the service stops crashing, or the administrator's mistake is corrected, the node rejoins the cluster and the paused virtual machines resume running. In this example, after we forcibly terminate the clussvc process, it automatically restarts shortly afterwards.

If the transient outage, whether a cluster service crash, a temporary network interruption, or an administrator error, is not repaired within the 60 seconds, the node is set to the Down state and all virtual machines in the Unmonitored state are migrated to the surviving nodes. The migration is a quick migration: paused virtual machines are turned off and then moved away.

Below we simulate three isolations within one hour, i.e. three transient outages on the same node within a single hour.

Looking at the node, you can see it has changed from red isolation to green isolation. These two states do not mean the same thing: one is Isolated, the other is Quarantined. The green state is actually what we call quarantine. According to the algorithm defined above, the cluster has already judged that this node is abnormal; it is "sick" and should not continue hosting virtual machines, so all of its virtual machines are live migrated to other nodes. During the QuarantineDuration the node stays in the Quarantine state, and the administrator can diagnose it and confirm whether the frequent transient outages point to a real problem that needs handling.

When the 600 seconds are up, the cluster assumes you have solved the problem, automatically releases the quarantine, and lets the node join the cluster normally. If you do not want to wait, or have already fixed the cause of the frequent transient outages, you can run Start-ClusterNode with the -ClearQuarantine (-CQ) switch to restore the node manually.
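A minimal sketch of clearing the quarantine manually, using the node name HV01 from this demo:

# Clear the quarantine and let the node rejoin the cluster immediately
Start-ClusterNode -Name HV01 -ClearQuarantine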

I believe that through the experiment above, everyone now has a basic understanding of VM resiliency.

With this technology we can either keep the cluster behaving as before, completing a fast failover based on heartbeat detection, or accept a short interruption and let the node recover within the transient-interruption window defined by VM resiliency, without failing over immediately. The host with the transient interruption can be isolated, or even further quarantined. Compared with tuning the heartbeat detection thresholds, this approach is more formal and works better, so if transient interruptions occur in your environment, it is worth knowing how to use this feature.

If you do not want VM resiliency and want to revert to the previous behavior of failing over directly based on heartbeat detection,

set the VM resiliency value as follows and the old behavior returns. Note that in versions after 2016 TP2 this feature is enabled by default, so if a network outage or service crash occurs and you find that the application does not fail over quickly, do not panic: it is because the cluster has VM resiliency turned on automatically. If you do not want it, just disable it as shown below.
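Based on the ResiliencyDefaultPeriod property shown earlier, disabling the feature again looks like this (a sketch; setting the period to 0 removes the isolation window so failover happens straight away):

# Turn off VM resiliency: no isolation window, fail over immediately on detection
(Get-Cluster).ResiliencyDefaultPeriod = 0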

After turning off VM resiliency, forcibly stop the clussvc process again, and you will see that the node goes directly into the failed state and failover is performed.

The above is VM resiliency, which helps us ride out node-level transient failures of the network, the system, or administrator misoperation. Besides this resiliency at the compute level, 2016 also adds similar technology for virtual machine storage, mainly for the virtual machine's access to its VHDX. In 2012 R2, if a virtual machine suddenly loses access to its VHDX, it will certainly crash and become unusable, and when the storage becomes available again we may also have to restart the virtual machine.

In WSFC 2016, cluster virtual machines get much better fault tolerance with storage; it is quite magical. We can set an allowable outage time: within this time, if a transient failure occurs between a cluster virtual machine and its storage and the VHDX cannot be accessed, the virtual machine is placed in the Paused-Critical state. The virtual machine is frozen, its state is preserved, and all of its IO is frozen as well.

When access to the VHDX resumes, the virtual machine returns to normal operation from the paused state, the freeze is released, and all IO runs normally again. This built-in storage fault tolerance gives us a good answer to temporary storage connection failures: when the storage is available again, the virtual machine automatically resumes its IO, and the downtime felt by users is greatly reduced.

Currently, apart from the RODC virtual machine, the other virtual machines sit directly on CSV. We disable the 2016 cluster's data disks on the iSCSI target, so the CSV fails, the node loses its connection to storage, and the VHDX files are no longer readable.
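A quick way to confirm what the screenshots show is to query the CSV state after the iSCSI disks go offline; a minimal sketch, assuming it is run from a surviving node:

# The CSVs backed by the disabled iSCSI disks should no longer report Online
Get-ClusterSharedVolume | Select-Object Name, State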

You can see the VMs again show Running-Critical, but as before this state should really be Paused-Critical.

The time a VM's IO can stay frozen is limited; the default is 30 minutes, so it is neither too long nor indefinite. If the VM still cannot reach its VHDX after that period, the VM is powered off, and the next boot will be a cold boot.

If the storage connection is restored within the 30 minutes, the virtual machine continues to operate: it is set back to Running, keeps its previous working state, and all operations and IO run normally. If the storage failure is very short, users will not even notice it, because the virtual machine resumes very quickly once the storage connection is re-established. Virtual machines on SOFS recover fastest, followed by those on CSV; Shared VHDX virtual machines are instead directly live migrated.

# Virtual machine storage fault tolerance is configured per VM at the Hyper-V level

Turn virtual machine storage fault tolerance on or off:

Set-VM -AutomaticCriticalErrorAction

The default is Pause, i.e. pause the VM when it cannot reach its storage; change it to None to get the old behavior back. Shut down the virtual machine before modifying this setting!

Configure how long the VM waits while storage is unreachable; the default is 30 minutes:

Set-VM -AutomaticCriticalErrorActionTimeout
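Putting the two parameters together, here is a minimal sketch of configuring storage fault tolerance for the RODC virtual machine used in this demo (the VM must be shut down first, and the timeout value is given in minutes):

# Pause (freeze) the VM instead of crashing it when its VHDX becomes unreachable
Set-VM -Name "RODC" -AutomaticCriticalErrorAction Pause

# Wait up to 10 minutes for storage to come back before powering the VM off
Set-VM -Name "RODC" -AutomaticCriticalErrorActionTimeout 10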

For Shared VHDX virtual machines, Hyper-V 2016 polls every ten minutes to see whether the storage is available; if it is unavailable, the virtual machine is automatically live migrated to another node.

VM storage fault tolerance only supports VHD/VHDX on CSV, Shared VHDX, or SOFS paths.

Native (non-clustered) VHD/VHDX is not supported.

Virtual machines using pass-through disks or USB storage are not supported

Above, Lao Wang has introduced the VM resiliency and storage fault tolerance features. I have been following this technology for a long time and have always wanted to introduce it to friends in China; I finally got it written this time and hope it brings you some gains. If your environment suffers transient node, network, or storage failures, you can now keep them under control with the 2016 VM resiliency technology, especially the magical storage fault tolerance feature.
