
WSFC status Operation Guide

Updated 2026-02-14. From: SLTechnology News&Howtos


In WSFC you will come across a variety of operations: pausing a node, stopping the cluster service on a node, evicting a node, resuming a node, shutting down the cluster, destroying the cluster, and closing the connection. What exactly does each operation mean, and in which scenarios should it be used? Let's find out. Lao Wang is starting now!

First, let's start with the node operation.

Take WSFC 2012 R2 as an example. When we open the Nodes view in Failover Cluster Manager and right-click any node, we see the following actions:

Pause

Resume

Remote Desktop

Information Details

Show Critical Events

Let's look at Pause first. Since Windows Server 2012, pausing a node not only marks it as being in maintenance but can also move the workloads on it to other nodes automatically, according to the placement policy. The operation is meant for node maintenance. For example, if a node's OS is unstable and needs debugging, you can move the applications away first so that they are not affected, and then troubleshoot. Likewise, if a node has to be shut down for a hardware change, you can pause the node, drain its applications, shut it down, make the change, power it back on and resume it, and then repeat the procedure on each node in turn.

In short, Pause is for planned maintenance: we know in advance that the node will become unavailable, so we pause it to move the resources away with minimal downtime and then carry out the maintenance.

Since 2012 there are two pause options: Drain Roles and Do Not Drain Roles. Drain Roles is the new 2012 behavior: the applications on the node are moved to suitable nodes according to the placement policy. Do Not Drain Roles is the old 2008 behavior: the node is only marked as paused, so no resources will be moved onto it, but the resources already on it stay where they are.
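The two pause options can also be driven from PowerShell. The following is a minimal sketch, assuming the FailoverClusters module is installed and that a node called hv01 exists (the node name is hypothetical, not taken from the article).

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace it with a real cluster node.
$node = "hv01"

# Drain Roles: pause the node and move its roles away per the placement policy.
# -Wait blocks until the drain has finished.
Suspend-ClusterNode -Name $node -Drain -Wait

# Do Not Drain Roles (the 2008-style pause) simply omits -Drain:
# Suspend-ClusterNode -Name $node

# Check the result: the node should now report the Paused state.
Get-ClusterNode -Name $node | Format-Table Name, State, DrainStatus
```

Using -Wait keeps the script from moving on to a shutdown step while roles are still being moved.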

Since 2012, draining a paused node by default live-migrates virtual machines and moves the other cluster roles. As Lao Wang mentioned in an earlier article, drain behavior can be combined with role priority. For example, when a node is paused for maintenance, high- and medium-priority virtual machines are live-migrated while low-priority virtual machines are quick-migrated, which matters most when resources are tight. With this configuration the more important virtual machines always get the highest availability and the best migration experience. See Lao Wang's blog post "WSFC maintenance mode operation granularity control".
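Role priority is a property of each cluster group. As a rough illustration, assuming a clustered virtual machine role named VM-SQL01 (a hypothetical name), the priority could be inspected and raised like this:

```powershell
Import-Module FailoverClusters

# Hypothetical role name; replace it with a real clustered VM role.
$role = Get-ClusterGroup -Name "VM-SQL01"

# Priority values: 3000 = High, 2000 = Medium, 1000 = Low, 0 = No Auto Start.
$role.Priority = 3000

# List all roles with their priority, highest first.
Get-ClusterGroup | Sort-Object Priority -Descending |
    Format-Table Name, Priority, OwnerNode
```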

After Pause, let's look at Resume. Resume is the counterpart of Pause: once a paused node has been maintained, it has to start providing services again. In the 2008 era, resuming only cleared the paused state so that the node could accept resources again. Since 2012 this old behavior appears as Do Not Fail Roles Back, and a new option, Fail Roles Back, has been added, which ties Resume closely to Pause. When we pause a node, its resources are moved to other nodes; when maintenance is complete, we can choose Fail Roles Back so that the drained resources return to the original node. This failback also honors the placement policy, taking preferred owners, anti-affinity, possible owners and similar settings into account. A role will not fail back if the preferred owner setting rules it out, if anti-affine resources are already running on the node, or if the original node was removed from the resource's possible owners during maintenance.

In some scenarios developers or business owners have requirements for a cluster role or virtual machine: for example, a resource should only run on a particular node, or an important resource is considered more stable if it keeps running on its original node. Fail Roles Back fits these scenarios. As a cluster administrator on 2012 or later, you just need to keep the routine in mind: for planned maintenance, pause the node and drain the roles; after maintenance, resume the node and fail the roles back.
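The resume step has a PowerShell counterpart as well. This is a sketch only, again assuming the hypothetical node name hv01:

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace it with the node that was paused.
$node = "hv01"

# Fail Roles Back: clear the paused state and bring the drained roles back.
Resume-ClusterNode -Name $node -Failback Immediate

# Do Not Fail Roles Back (the 2008-style resume):
# Resume-ClusterNode -Name $node -Failback NoFailback

# Show which roles are now owned by the resumed node.
Get-ClusterGroup | Where-Object { $_.OwnerNode.Name -eq $node } |
    Format-Table Name, State
```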

Remote Desktop is also a handy feature. A cluster may have many nodes, perhaps 16 or 32, and the cluster administrator opens Failover Cluster Manager every day to configure and inspect them. If something is wrong with a node, you can start a Remote Desktop session to it directly from Failover Cluster Manager, as long as the node's Remote Desktop port is reachable.

Information Details shows which operation failed for a resource or node, and why it failed.

Show Critical Events aggregates the critical events of the current node or resource. Since 2008, the cluster sets up a resource-specific event filter for most resources, so when we select a node or a resource and click Show Critical Events, only the events of that node or resource are shown.

Besides the actions above, there are three more node operations: Start Cluster Service, Stop Cluster Service, and Evict.

Start Cluster Service is typically used after the cluster service was stopped manually, for example for debugging. When debugging is finished, you can start the service from the GUI or from the command line:

net start clussvc
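The same can be done remotely with the FailoverClusters cmdlets. A minimal sketch, assuming a hypothetical node named hv01:

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace it with a real cluster node.
$node = "hv01"

# Start the cluster service on the node (counterpart of "net start clussvc").
Start-ClusterNode -Name $node

# The stop operation has a matching cmdlet:
# Stop-ClusterNode -Name $node

# Confirm the state of every node (Up, Down, Paused, Joining).
Get-ClusterNode | Format-Table Name, State
```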

Stop Cluster Service is likewise for special scenarios, and normally we do not need it. For example, if cluster applications keep being moved to a node that is not working properly, we can keep them off that node either by removing it from the possible owners or by stopping its cluster service here. Before WSFC 2016, stopping the cluster service meant failover: when the cluster service on a node stopped, the next health check reported the node as unavailable, and all applications and virtual machines on it went through an unplanned failover. WSFC 2016 introduced VM resiliency, which prevents virtual machines from being failed over because of transient conditions. If, for example, the network drops briefly or the cluster service crashes and stops, no failover is triggered as long as the node recovers within a certain period, since a failover means downtime for the virtual machines. If you do not need VM resiliency, just turn it off:

(Get-Cluster).ResiliencyDefaultPeriod = 0
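Before changing the setting it may be worth recording the current values so they can be restored later. A small sketch, assuming it is run on a WSFC 2016 or later cluster node:

```powershell
Import-Module FailoverClusters

# Show the cluster-wide resiliency settings.
# ResiliencyDefaultPeriod is expressed in seconds.
Get-Cluster | Format-List Name, ResiliencyLevel, ResiliencyDefaultPeriod
```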

Evicting a node removes it completely from the cluster's membership, and it is not recommended to bring an evicted node back into the cluster as it was. Eviction is typically used in the following scenarios:

Rename the cluster node

Replace nodes with different hardware

Reinstall the operating system on a node

Permanently delete a node in a cluster

Generally speaking, evicting a node is a simple, blunt way to make a problem go away, but it is not a troubleshooting method. If you have confirmed that the cause is the node's unstable OS, you can evict it and add a new node, or reinstall the system after eviction and add it back under a new node name.
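For reference, eviction and re-adding look roughly like this in PowerShell. The node names hv02 and hv03 are hypothetical, and the sketch assumes the cause of the problem has already been confirmed:

```powershell
Import-Module FailoverClusters

# Evict the faulty node (hypothetical name). -Force suppresses the confirmation prompt.
Remove-ClusterNode -Name "hv02" -Force

# Add the replacement, or the reinstalled machine under its new name (hypothetical).
Add-ClusterNode -Name "hv03"

# Verify the resulting membership.
Get-ClusterNode | Format-Table Name, State
```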

Although this approach works, it does not always solve the problem. Sometimes we believe one node is at fault when the real problem lies in a cluster resource, and then the same problem comes back after the node is evicted and a new one is added. So do not evict a node lightly: identify and analyze the problem first, and evict only once the cause is confirmed.

Common misunderstandings of eviction

The cluster service would not start, so node 2 was evicted, but the cluster service still would not start.

A resource would not move to node 2: on every failover the disk failed to come online, and it could not be moved back to node 1 either. One node was evicted and another added, but the problem remained.

When you run into this kind of problem, examine cluster.log and the dump files to find the real cause. The root cause may be an RHS deadlock or a compatibility problem with third-party software. Do not evict a node before the problem is really understood; otherwise it may no longer be fully reproducible during troubleshooting.
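The cluster log can be collected from every node with Get-ClusterLog. A minimal sketch; the destination folder and the 60-minute window are assumptions chosen for illustration:

```powershell
Import-Module FailoverClusters

# Hypothetical output folder; create it if it does not exist yet.
$dest = "C:\Temp\ClusterLogs"
New-Item -ItemType Directory -Path $dest -Force | Out-Null

# Generate cluster.log for every node, limited to the last 60 minutes,
# with timestamps in local time instead of UTC.
Get-ClusterLog -Destination $dest -TimeSpan 60 -UseLocalTime

# List the collected log files.
Get-ChildItem -Path $dest
```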

That covers all node operations in the GUI. Next are a few scenario walkthroughs to help you become familiar with the procedures.

Node shutdown and startup

1. Pause the node
2. Shut down the operating system
3. Boot the operating system
4. Resume the node

The above is the standard procedure for shutting down and starting a cluster node. There are exceptions: for example, some special roles may need a program to be run before they can work properly, in which case that step can be done before step 4.
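The four steps could be scripted from a management machine roughly as follows. This is a sketch under stated assumptions: the node name hv01 is hypothetical, a restart stands in for the separate shutdown and boot steps, and the script runs on a machine other than the node being maintained.

```powershell
Import-Module FailoverClusters

# Hypothetical node name; run this from a different machine.
$node = "hv01"

# Step 1: pause the node and wait until its roles have been drained.
Suspend-ClusterNode -Name $node -Drain -Wait

# Steps 2 and 3: restart the OS and wait until PowerShell remoting answers again.
# (A restart is an assumption; the article shuts down and boots separately.)
Restart-Computer -ComputerName $node -Wait -For PowerShell -Force

# Any role-specific preparation would go here, before step 4.

# Step 4: resume the node and fail the drained roles back.
Resume-ClusterNode -Name $node -Failback Immediate
```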

Starting with WSFC 2012 R2, the cluster has a new property, DrainOnShutdown, aimed at clustered virtual machines.

If we forget to pause the node and shut it down directly, WSFC 2012 R2 automatically follows the maintenance-mode policy and live-migrates or quick-migrates the virtual machines to other nodes, while the other cluster roles are moved. Once all resources have been moved away, the operating system completes its shutdown normally. This feature is also known as the "lazy helper": if we forget to pause the node, there is a helper behind us that completes the maintenance step for us.
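Whether this behavior is active can be checked on the cluster object. A short sketch, assuming a WSFC 2012 R2 or later cluster:

```powershell
Import-Module FailoverClusters

# 1 = drain on shutdown is enabled, 0 = disabled.
(Get-Cluster).DrainOnShutdown

# To switch the behavior off (not generally advisable), it could be set to 0:
# (Get-Cluster).DrainOnShutdown = 0
```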

Node failover

1. The node goes down
2. Other nodes detect the failure, load the registry data, mount the shared storage, and bring the roles online
3. The node recovers
4. Cluster roles fail back

For cluster failover we want to talk about failback in particular. It is a veteran feature that has existed since the 2003 era, and it still works as it did then: failback must be combined with the preferred owner setting. If an application is running on node 1, node 1 goes down, and the application fails over to node 2, then for the application to return to node 1 after node 1 recovers, its preferred owner must be set to node 1. Failback can happen immediately or within a specified time window. If an application has requirements for its host and needs to stay on a particular node, you can configure failback so that it returns after an unplanned failover. In the 2008 era virtual machines failed back using quick migration; since 2012 live migration is used.
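Preferred owner and failback are both set per cluster group. The sketch below assumes a role named FileServer01 and a node named hv01 (both hypothetical) and restricts failback to a window between 01:00 and 03:00, which is an illustrative choice:

```powershell
Import-Module FailoverClusters

# Hypothetical role and node names.
$group = Get-ClusterGroup -Name "FileServer01"

# Make hv01 the preferred owner of the role.
Set-ClusterOwnerNode -Group $group.Name -Owners "hv01"

# Allow failback (0 = prevent, 1 = allow).
$group.AutoFailbackType = 1

# Optional window, in hours of the day; leave these properties at their
# defaults if failback should happen immediately.
$group.FailbackWindowStart = 1
$group.FailbackWindowEnd = 3

# Review the result.
$group | Format-List Name, AutoFailbackType, FailbackWindowStart, FailbackWindowEnd
Get-ClusterOwnerNode -Group $group.Name
```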

Having covered node-level operations, let's look at cluster-level operations. Lao Wang will mainly introduce the following.

Close Connection has no effect on the cluster itself. It only removes the connected cluster from the current Failover Cluster Manager console. Suppose someone who does not understand clustering is about to use a cluster node and might touch your cluster; to avoid a mistake, you can close the cluster connection before they start.

To connect again after closing the connection, click Connect to Cluster.

Shut Down Cluster stops all cluster roles and stops the cluster service on every node. If the cluster has many nodes, this shuts them all down for us in one step. After the shutdown the cluster is unavailable to clients. Use it when you want the nodes to stop acting as a cluster temporarily.

For virtual machine cluster resources, since 2008 you can set the action taken on the virtual machine when the cluster is shut down. The default is to save the virtual machine.

Value         Effect
0             Turn off the VM (power off)
1 (default)   Save the VM
2             Shut down the guest OS
3             Force shutdown of the guest OS

Get-ClusterResource "Virtual Machine Resource Cluster name" | Set-ClusterParameter OfflineAction 2

To bring the cluster back, click Start Cluster. By default the clustered virtual machines are restored from the saved state and the cluster roles go from offline to online.
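Shutting down and starting the whole cluster can also be done with PowerShell. A minimal sketch, assuming a cluster named cluster01 (a hypothetical name):

```powershell
Import-Module FailoverClusters

# Hypothetical cluster name.
$cluster = "cluster01"

# Shut down the cluster: stops all roles and the cluster service on every node.
# -Force suppresses the confirmation prompt.
Stop-Cluster -Cluster $cluster -Force

# Later, when the cluster should work again:
Start-Cluster -Name $cluster

# Check that the nodes are back up.
Get-ClusterNode -Cluster $cluster | Format-Table Name, State
```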

Destroy Cluster dismantles the entire cluster and deletes all of its roles and metadata. It is usually used in test environments or when a cluster is to be redeployed. Like shutting down the cluster, it should not be done lightly: once the cluster has been destroyed, rebuilding a cluster on the same nodes sometimes requires reinstalling the Failover Clustering feature.

Before destroying the cluster, make sure all cluster roles have been deleted and the virtual machines have been exported to another location. Note that if the virtual machines are stored on shared disks or CSV, they are shut down completely when the cluster is destroyed, but their data is not lost: it remains on the CSV, and the virtual machines can be attached again once the cluster has been rebuilt. Between the destruction and the rebuild, however, the virtual machines are unavailable.

If the cluster still contains virtual machines or roles when you destroy it, an error is reported.

Considerations for destroying a cluster

All nodes must be online when the cluster is destroyed. If a node is offline at that time and you later try to join it to another cluster, you will be told that the node already belongs to a cluster. In that case, run the following command on the node:

cluster node hv01 /forcecleanup

This cleans the old cluster information out of the node's registry so that the node can join the new cluster.
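On systems where cluster.exe is not installed, the FailoverClusters module offers Clear-ClusterNode for the same cleanup. A sketch, reusing the node name hv01 from the command above:

```powershell
Import-Module FailoverClusters

# Remove leftover cluster configuration from the node.
# -Force suppresses the confirmation prompt.
Clear-ClusterNode -Name "hv01" -Force
```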

Behind the scenes, destroying a cluster evicts every node from membership and deletes the cluster configuration from each node's registry. If you try to rebuild the cluster afterwards and it fails, check the registry hive for leftover information about the old cluster; if there is any, clean it up and try again.

By default the cluster CNO is disabled in AD after the cluster is destroyed. To delete the CNO from AD as part of destroying the cluster, use PowerShell:

Remove-Cluster -CleanupAD

Moving the cluster core resources

Cluster resources fall into two types: resources the cluster itself needs in order to run, and the application resources hosted on the cluster. The core resources are the former. To operate, a cluster needs a cluster name, a cluster IP address, and usually a witness resource; before WSFC 2016 that was generally all. These resources are placed in a resource group of their own, the core resource group. While the cluster is running, the core resource group is hosted on one of the nodes, and we can move it to another node. Before 2012 this could only be done from the command line; since 2012 the GUI supports it as well. Since 2008 the core resource group runs in a separate RHS monitoring process, so that the cluster as a whole is not affected when the RHS process of some other resource crashes.

Usually we do not need to touch the core resources, unless they have to be moved during troubleshooting or for load balancing: if a node hosts many applications, the core resources on it can be moved to another node to lighten its load.

In WSFC 2016 the core group additionally contains the Storage QoS resource and the Virtual Machine Cluster WMI resource.

Before 2012, the core resources are moved with this command:

cluster group "Cluster Group" /move:NodeName

Moving the cluster Available Storage group

cluster group "Available Storage" /move
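On 2012 and later, the same moves can be made with Move-ClusterGroup. A minimal sketch; the target node hv02 is a hypothetical name:

```powershell
Import-Module FailoverClusters

# List the resources that make up the core resource group.
Get-ClusterGroup -Name "Cluster Group" | Get-ClusterResource |
    Format-Table Name, ResourceType, State

# Move the core resource group to a specific node (hypothetical name).
Move-ClusterGroup -Name "Cluster Group" -Node "hv02"

# Move Available Storage and let the cluster pick the target node.
Move-ClusterGroup -Name "Available Storage"
```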
