2025-04-05 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >
Shulou(Shulou.com)06/02 Report--
Today we will look in detail at WSFC quorum models: the main models, their advantages and disadvantages, and how to choose the most appropriate one.
WSFC introduces quorum for two main purposes:
Track whether the votes currently running in the cluster satisfy the quorum model, and shut the cluster down if they fall below the minimum allowed (before 2012).
When a partition occurs, ensure that the majority side takes over the services the cluster provides, while the minority side shuts down.
Looking back at history, before the 2003 era the cluster had only one quorum model: disk-only quorum. Under it, only the disk witness stored the cluster database, and before starting, every node had to be able to reach the disk witness to obtain that database. When a partition occurred, whichever side could contact the disk witness won. As long as all nodes could reach the disk witness normally, the cluster could survive down to the last node, but in this mode the disk witness was a single point of failure: once it was lost, the cluster shut down, because only the disk witness was qualified to decide whether the cluster survived. At that time there was no concept of voting; as long as the disk witness was there, the cluster could live.
Later, starting in the 2003 era, MSCS introduced the Majority Node Set (MNS) quorum model in the Enterprise and Datacenter editions. Its advantage is decentralization: the local disk of each cluster node also stores the cluster database, so the cluster no longer has to contact a witness disk on every start. Under the MNS model the cluster survives as long as a majority of its nodes survive, and every node is qualified to help decide whether the cluster lives. This is the predecessor of the later node-majority quorum.
In the 2003 SP1 era, the cluster introduced the file share witness mechanism, to solve the problem that under a two-node MNS model the failure of either node shut the cluster down. The file share witness, then as now, does not hold the cluster database; it only contributes a vote. With two nodes plus one file share witness under the MNS model, if one node goes down, the other node can contact the file share witness and survive, because it thereby obtains a majority. The witness also prevents split-brain: when the two nodes are partitioned and both try to claim the resources, whichever side can reach the file share witness keeps running.
Since the 2003 SP1 feature shipped, people have deployed all kinds of clustered applications on the MNS model plus a file share witness (FSW); 2003 SP1 + Exchange 2007 CCR was the most common at the time. With use, a problem became apparent: the FSW itself is still a single point of failure. Could some mechanism make the file share highly available? Ideally a third server, not a cluster node, hosts the file share; it is just a shared directory and takes up almost no system resources, but once that server goes down, the cluster's operation is no longer protected. So people looked for ways to keep the FSW server highly available. After some practice, the consensus was that the only workable solution is a file server cluster (in 2012 terms, a traditional clustered file server rather than SOFS), which can keep the FSW highly available. Some tried DFS, but its drawback soon showed: the point of DFS is to hide the physical layer behind a logical one. For example, you present a DFS namespace path to MSCS, but behind it sit the DFSR servers of two sites, each site also holding its own cluster nodes. When a partition occurs, each site can still reach "the" file share, so the split-brain problem remains, because the voting qualification is duplicated. Since all DFS nodes are active-active and site-aware by design, DFS is unsuitable for a cluster FSW; an FSW must be served by a single share at any one time, so that it can determine the partition winner in a disaster.
That said, it is still rare for an enterprise to build a file cluster specifically for the cluster file share witness, although it is worth considering: if an enterprise runs dozens of clusters, deploying one file cluster to provide a highly available file share witness for all of them may well be justified. In practice, a standalone file share witness is usually placed on a domain controller, on a stable server such as a DHCP server, or on a dedicated server.
In the 2008 era the cluster changed from MSCS to WSFC, and the quorum model changed with it. First, the concept of voting was introduced into the cluster quorum manager: each node and each witness gained a vote attribute, and cluster survival and partition handling began to be decided by vote counts. The mechanism was similar to 2003 in that a majority had to be maintained, but it became clearer and surfaced things not seen before. 2008 divides the quorum models into four types: Node Majority, Node and Disk Majority, Node and File Share Majority, and Disk Only. Forced quorum also changed in the 2008 era: in 2003 a forced quorum had to be performed with the cluster shut down and required a list of nodes to force-start, whereas 2008 can force quorum while the cluster is running. One more point: since 2008, every node and the witness disk can store the cluster database, so the witness disk is no longer a single point of failure and each node's copy of the database is kept up to date. Notably, if the witness disk's copy of the cluster database is not up to date, it can be synchronized from the other nodes.
Although four quorum models were introduced in the 2008 era, quorum in 2008 was still relatively rigid: it mainly insisted that the surviving cluster nodes conform to the minimum-node requirement of the chosen quorum model.
For example, with an odd number of nodes under Node Majority, the cluster must keep floor(votes / 2) + 1 votes alive; with 3 nodes, two must survive. If an odd-node cluster chooses a disk witness or file share witness, the extra witness vote means it still cannot survive to the last node: 3 nodes plus a disk witness is 4 votes, and the same majority rule requires 3 of them to survive, so after one node goes down, the witness vote plus two node votes is already at the limit.
With an even number of nodes, choosing Node and Disk Majority or Node and File Share Majority lets the cluster survive with half its nodes, as long as the witness is online. If the witness is offline, or if plain Node Majority is used, the cluster must keep floor(votes / 2) + 1 votes alive; with four nodes under Node Majority, at most one node may be down.
So in the 2008 era the choice of quorum model was essentially fixed: if you wanted the cluster to keep providing services as long as possible, an odd-node cluster should choose Node Majority, and an even-node cluster should choose node majority plus a disk witness or file share witness. An even-node cluster should not use plain Node Majority, and an odd-node cluster should not add a witness device; otherwise one node's worth of resilience is wasted.
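The 2008-era static majority rule above can be reduced to one formula. Here is a minimal Python sketch (the function name is mine, for illustration only) showing why 3 nodes plus a disk witness tolerates no more failures than 3 nodes alone:

```python
def static_quorum_survivors(total_votes: int) -> int:
    """Minimum votes that must stay online under the 2008-era
    static majority rule: floor(total / 2) + 1."""
    return total_votes // 2 + 1

# 3 nodes, Node Majority: 2 of 3 votes must survive (1 failure tolerated)
print(static_quorum_survivors(3))  # 2

# 3 nodes + disk witness = 4 votes: 3 must survive,
# so again only 1 node may fail -- the witness vote buys nothing here
print(static_quorum_survivors(4))  # 3

# 4 nodes + witness = 5 votes: 3 must survive, so 2 nodes may fail
print(static_quorum_survivors(5))  # 3
```

This is why the era's rule of thumb was odd nodes without a witness, even nodes with one.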
Since the 2012 era this rigid quorum thinking has been broken: the cluster no longer has to obey the minimum-node requirement of the quorum model and can dynamically adjust node votes all the way down to the last node. Microsoft introduced dynamic quorum in WSFC 2012, which adjusts each node's vote on the fly. For example, with five nodes, if one node goes down the cluster removes another node's vote to keep 3 votes. If one more node goes down, leaving exactly three, nothing is done. When two nodes remain, one node's vote is removed at random; after a clean shutdown, or if the non-voting node is the one that fails, the last node can survive, but if the voting node fails before the votes can be exchanged, the cluster shuts down. So with 2012 dynamic quorum the chance of surviving to the last node is 66%. Similarly, with four nodes the cluster dynamically removes one node's vote, and after one more node goes down and a vote is removed, the chance of surviving to the last node is again 66%.
Dynamic quorum always keeps the cluster at an odd number of votes. From 2012 on, the cluster no longer maintains a majority but an odd count, and the purpose of quorum shifts toward helping the cluster survive to the last node while avoiding split-brain partitions.
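The keep-it-odd behavior described above can be sketched in a few lines of Python. This is a toy model, not the real WSFC algorithm: the cluster chooses which node loses its vote by its own rules, while this sketch simply drops the last one.

```python
def adjust_dynamic_quorum(voters: list[str]) -> list[str]:
    """Sketch of 2012 dynamic quorum: when the number of voting
    nodes is even, remove one node's vote so the count stays odd."""
    if voters and len(voters) % 2 == 0:
        return voters[:-1]  # which node loses its vote is the cluster's choice
    return voters

# 5 nodes, one fails -> 4 voters left -> cluster trims to 3 votes
print(len(adjust_dynamic_quorum(["n1", "n2", "n3", "n4"])))  # 3

# an odd voter count is left alone
print(len(adjust_dynamic_quorum(["n1", "n2", "n3"])))  # 3
```

With two voters left, one vote is removed; whether the last surviving node is the voter or the non-voter is what produces the 66% figure the article cites.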
If in the 2012 era we configure an even number of nodes plus a witness device, then while the witness is online the cluster can survive to the last node; if the witness goes offline, it can survive down to (node votes / 2) + 1.
If in the 2012 era we configure an odd number of nodes plus a witness device, then with one node down and the witness offline the cluster shuts down. For example, a cluster with three nodes loses one node and the witness device: the cluster closes, because two votes remain and a winner cannot be decided. So in the 2012 era odd-node clusters should still use the Node Majority model; in 2012, odd-node clusters gain no survival advantage from a witness device.
In the 2012 R2 era, WSFC dynamic quorum evolved into dynamic witness, and the recommendation became to always configure a disk witness or file share witness, because the witness's vote can now also be adjusted dynamically. For example, with 3 nodes plus a witness disk, the cluster automatically removes the witness's vote, leaving three votes. If one node fails, leaving two votes, the cluster automatically adds the witness vote back, restoring an odd total of three. If one more node fails, the last node plus the witness can still survive. In other words, as long as the cluster has a witness device, it can survive to the last node whether the node count is odd or even, so configuring a witness device for the cluster is always the right move.
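The dynamic-witness rule above boils down to: the witness vote counts only when it makes the total odd. A minimal sketch under that assumption (function name is mine):

```python
def witness_vote(node_votes: int) -> int:
    """Sketch of 2012 R2 dynamic witness: the witness's vote is
    counted only when the node vote count is even, keeping the
    cluster's total vote count odd."""
    return 1 if node_votes % 2 == 0 else 0

# 3 healthy nodes: witness vote removed -> total stays 3 (odd)
print(3 + witness_vote(3))  # 3

# one node fails, 2 node votes left: witness vote added back -> total 3 (odd)
print(2 + witness_vote(2))  # 3

# last node standing: 1 node vote, witness vote parked -> total 1 (odd)
print(1 + witness_vote(1))  # 1
```

Tracing the failures shows why a witnessed cluster can ride down from any node count to a single node, as long as the witness itself stays reachable.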
As mentioned earlier, 2012 introduced dynamic quorum, which automatically removes one node's vote when the node count is even, always keeping the cluster's vote count odd. Starting with 2012 R2, the LowerQuorumPriorityNodeID property lets you specify which node's vote is removed first.
For example, with an even count of four nodes across two sites, I can specify that the cluster automatically removes the vote of one node at the backup site, leaving 1 vote at the standby site and 2 at the primary site. If the two sites are partitioned, the primary site wins outright, and if the whole primary site goes down, there is a 66% chance the standby site can take over directly. Before 2012 R2 we usually achieved the same effect by manually removing a vote from a backup-site node; on 2012 the standby site has only a chance of taking over automatically, and before 2012 we had to manually force the standby site to take over. That said, some enterprises deliberately design for manual failover, because the application above the cluster takes too long to fail over, or operational steps must be performed after failover before service can resume; such cases suit manual failover.
Although 2012 R2 promises that the cluster can survive to the last node, there is a premise: the witness device must be online. Once the witness goes offline, survival to the last node drops to 50%. Lao Wang has run this experiment: with a cluster of 3 nodes plus a witness device, if the witness and one node go down, the cluster does not automatically adjust the votes; it still counts 2 node votes plus 1 witness vote. In fact it should automatically fall back from dynamic witness to dynamic quorum, going from 3 votes to 1, but the cluster does not make the change. Had it changed, there would be a 66% chance of surviving to the last node; since it does not, if either of the two remaining nodes goes down, the cluster shuts down.
The key problem: when the cluster goes from 3 to 2 because one node and the witness device are offline, it cannot switch from dynamic witness back to dynamic quorum, so the quorum accounting becomes inaccurate. At that point the cluster should first drop to 2 votes and then let dynamic quorum remove 1 more, but it neither removes the failed witness's vote nor adjusts the node votes, so the failure of either of the two remaining nodes brings the cluster down.
In 2012, with odd nodes plus a witness device, once the witness and a node go offline and the cluster becomes a 2-node even vote, the cluster shuts down directly.
In 2012 R2, with odd nodes plus a witness device, once the witness and a node go offline and the cluster becomes a 2-node even vote, the cluster shuts down if either remaining node fails.
In the final analysis, the cause is that after the witness device goes offline, the cluster cannot switch back to dynamic quorum.
So in the 2012 R2 era the witness device is especially important: only while the witness is present (reachable by every cluster node) can the cluster survive to the last node.
OK, we have finally followed the long river of WSFC quorum history into modern times. In that river there is one torrent that still shapes WSFC: the cluster database.
Since the 2008 era, the WSFC cluster database has used a Paxos-style mechanism. The database is synchronized on every node; any node can update the cluster, and the other nodes synchronize from the node that made the change. The process mainly compares paxos tags: when a node finds that another node's tag is newer than its own, it synchronizes from that node. Besides keeping cluster information consistent across nodes for failover, the cluster database is used in the cluster service start-up check: every time a node's cluster service starts, it checks whether its own copy of the database is up to date and consistent with the other nodes; if not, it must synchronize with them before it can come online.
Note that if the cluster uses a witness disk, each node also synchronizes the cluster database to a copy on the witness disk, and that copy is mounted on whichever node currently owns the disk. Only a disk witness holds the cluster database; a file share witness, and the 2016 cloud witness, record only the cluster's latest paxos tag.
A timeline scenario makes it clear which quorum model is better.
Time 1: node 1, node 2, and the file share witness are online
Time 2: node 1 goes down
Time 3: node 2 modifies cluster data
Time 4: node 2 goes down
Time 5: node 1 starts
With a file share witness, node 1 cannot start, because it does not hold the latest cluster database: at start-up its paxos tag is compared with the one recorded on the file share, found to be old, and the cluster membership manager prevents the node from starting. Node 1 can only start after waiting for node 2 to boot and synchronizing the cluster database from it. If instead node 1 is force-started, its cluster database is promoted to a golden copy, and when node 2 comes back it is overwritten by node 1's golden copy, losing the previously modified cluster data. The same applies to a cloud witness.
With a disk witness, when node 1 starts at time 5 it contacts the witness disk after start-up. Because the cluster database is also synchronized to the witness disk, the change made at time 3 was replicated there, so node 1 can obtain the database with the latest paxos tag from the witness disk and start normally.
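The start-up check in the two scenarios above can be sketched as a small decision function. This is an illustration under the article's assumptions, not the real cluster membership manager; the tag values are made up:

```python
def can_start(node_paxos: int, witness_paxos: int, witness_has_db: bool) -> str:
    """Sketch of the start-up check: a starting node compares its
    paxos tag with the witness's. A disk witness carries a full copy
    of the cluster database, so a stale node can sync from it and
    start; a file share or cloud witness stores only the tag, so a
    stale node must wait for an up-to-date node (or be force-started,
    which risks losing the newer changes)."""
    if node_paxos >= witness_paxos:
        return "start"  # node already has the latest database
    if witness_has_db:
        return "sync-from-witness-then-start"  # disk witness case
    return "wait-for-up-to-date-node"          # file share / cloud witness case

# Timeline above: node 1 boots with a stale tag (3) vs the witness's tag (4)
print(can_start(3, 4, witness_has_db=True))   # sync-from-witness-then-start
print(can_start(3, 4, witness_has_db=False))  # wait-for-up-to-date-node
print(can_start(4, 4, witness_has_db=False))  # start
```

The third branch is exactly why the article recommends a disk witness: it is the only witness type that lets a stale node come back without its peer.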
Based on this, Lao Wang's suggestion is to configure a witness disk for 2012 R2 clusters, whether they have odd or even nodes.
You can also choose Node Majority, but the drawback of node-majority dynamic quorum is that it only reaches the last node 66% of the time.
Node majority plus a witness disk, on the other hand, requires maintenance to ensure the witness disk stays online.
There is a tradeoff between the two.
Going further, Lao Wang believes that within a single data center, a witness disk plus node majority should without question be the first choice: as long as the witness disk is online, the cluster can survive to the last node. As for the reliability of the witness disk itself, you can configure RAID on the array and multipath from each node to the array to keep the witness disk continuously available, or build the underlying layer directly on hyper-converged software such as an S2D/VSAN cross-machine architecture and carve the cluster witness disk out of a virtual disk.
For stretched data centers, Lao Wang still recommends a witness disk where conditions permit; witness disk plus node majority is always the best solution on 2012 R2. For a multi-site cluster, architects usually recommend one of two designs. One is to place the witness device in a third data center, with both data centers connected to it; the links from the two data centers to the third must also be considered, which adds cost. The other, now widely used, is storage replication: one storage device in each of the two data centers, replicating synchronously, usually implemented directly in hardware or software; after one site goes down, storage and compute start at the other site. Note, however, that replicating the witness disk itself is not currently possible with 2016 Storage Replica, which can only replicate CSV and role disks, not witness disks.
In the final analysis it is a matter of cost: if funds permit, place a witness disk at a third site and present it to both data centers, or deploy synchronously replicated storage arrays directly across the two sites.
If funds do not permit, you can find a file server at a third site to act as a file share witness for both data centers. The timeline problem above can also be avoided operationally: for example, once a node is already down, do not modify cluster data on the surviving node.
Or, if you do not even have a third site, you can use the 2016 cloud witness: open a blob on Azure for cluster quorum, though the local data center needs outbound access to Azure on port 443.
Although the file share witness and cloud witness do not hold a cluster database, both quorum models still support dynamic witness, helping the cluster survive to the last node and avoid split-brain partitions.
Whether you use a file share witness, cloud witness, or disk witness, the main concern for stretched data centers is the link: the link from each node to the witness device need not be fast, but its quality must be guaranteed.