A Discussion of the Impact of Forced Quorum in WSFC

2025-03-30 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Forced quorum is a common operation in WSFC cluster management. Under what circumstances should it be performed, and does it affect the cluster afterwards? Lao Wang will discuss these questions in detail in this article. To make the discussion easier to follow, I will first briefly review two concepts that forced quorum depends on, the cluster database and quorum, so that you can better understand and reason about forced quorum.

First, let's look at the cluster database. Many readers may not know this concept; Lao Wang covered it in a dedicated earlier article, so only a summary is given here. Put simply, the cluster database is the configuration database maintained by Microsoft WSFC; a copy exists in the registry of every node and on the disk witness.

Main uses of the cluster database

When each cluster node starts, it checks whether its cluster database registry hive is complete. If it is complete, the node is allowed to start the cluster service normally; if it is incomplete, the node must synchronize with the other nodes before it can provide service.

While the cluster is running, cluster metadata is synchronized in real time to keep each node's view of the cluster consistent and to serve as the reference for failover. During normal WSFC operation, configuration data such as node status, cluster status, and cluster roles is recorded in each node's registry. Whenever cluster information is modified on any node, the change is synchronized over the cluster communication network to every other node and to the disk witness. Among the data synchronized is which cluster roles each node currently hosts, so every node and the witness know what services the others are carrying. In the event of a failover, the surviving nodes consult this registry hive and bring the roles hosted by the downed node online.

Since WSFC 2008, a paxos tagging mechanism has been part of the cluster database, so that every node can keep an up-to-date copy. When a node modifies cluster data, its paxos tag increases; the other nodes sense the newer paxos tag and automatically synchronize their copies of the cluster database with it, ensuring everyone stays consistent with the latest tag. When a downed node recovers, it compares its own paxos tag with the disk witness's tag: if the witness's tag is newer, the node synchronizes with it before coming online; conversely, if the witness sees that a cluster node has a newer tag, the witness's copy synchronizes with that node.
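The tag-comparison behavior described above can be sketched as a toy Python model (the `Replica` class and single-integer tag are simplifications invented for this sketch; the real cluster database lives in each node's registry and on the witness):

```python
# Toy model of WSFC paxos-tag synchronization (illustrative only).

class Replica:
    """One copy of the cluster database with a monotonically rising paxos tag."""
    def __init__(self, name, tag=0):
        self.name = name
        self.tag = tag
        self.data = {}

    def update(self, key, value):
        # Any local modification bumps this replica's paxos tag.
        self.tag += 1
        self.data[key] = value

def synchronize(replicas):
    """Every replica copies from whichever holds the highest paxos tag."""
    latest = max(replicas, key=lambda r: r.tag)
    for r in replicas:
        if r.tag < latest.tag:
            r.data = dict(latest.data)
            r.tag = latest.tag
    return latest

node1, node2, witness = Replica("Node1"), Replica("Node2"), Replica("Witness")
node1.update("role.SQL", "online")        # Node1's tag is now ahead
owner = synchronize([node1, node2, witness])
print(owner.name, node2.tag, node2.data)  # Node1 1 {'role.SQL': 'online'}
```

After synchronization, every copy carries the same tag and data, which is the normal multi-master steady state the article describes next.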

So what does the cluster database have to do with forced quorum?

Under normal operation, replication of the cluster database is multi-master: a modification made on any node is synchronized to the others. In other words, I can change cluster information on any node and trust that my change will take effect. Under forced quorum, however, this changes. Suppose a 50/50 split-brain occurs and I need to force one side to provide service: I want the cluster to be served by the site I force-start. In earlier versions, even after one side was force-started, if nothing prevented the other side from forming quorum, it would also try to start the cluster, and if it then modified cluster data, my force-started configuration could be overwritten. The cluster database therefore introduces a golden copy mechanism: when a node undergoes an authoritative restore or a ForceQuorum operation, that node's copy of the cluster database is promoted to the golden copy and takes the highest paxos priority. All other nodes must synchronize with the golden-copy node before they can provide service normally.
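A minimal sketch of the golden-copy promotion, continuing the toy model above (the `DbCopy` class and `golden` flag are inventions for illustration; real WSFC persists this state in the registry and witness):

```python
# Sketch: after ForceQuorum, the golden copy outranks even a newer paxos tag.

class DbCopy:
    def __init__(self, name, tag, data):
        self.name, self.tag, self.data = name, tag, dict(data)
        self.golden = False

def force_quorum(node):
    # Promotion: this node's copy now outranks any paxos tag in the cluster.
    node.golden = True

def rejoin(node, cluster):
    golden = next((n for n in cluster if n.golden), None)
    if golden is not None and node is not golden:
        # Even a copy with a HIGHER paxos tag must adopt the golden copy.
        node.data = dict(golden.data)
        node.tag = golden.tag
    return node

dr = DbCopy("DR-Node", tag=5, data={"owner": "DR"})
primary = DbCopy("Primary", tag=9, data={"owner": "Primary"})  # newer tag
force_quorum(dr)
rejoin(primary, [dr, primary])
print(primary.data)  # {'owner': 'DR'} -- golden copy wins despite lower tag
```

This inversion of the normal "highest tag wins" rule is exactly what makes forced quorum both useful (disaster recovery) and dangerous (overwriting a newer configuration), as discussed later in the article.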

Lao Wang will return to the concept of forced quorum shortly; here we are only concerned with the relationship between the cluster database and forced quorum. As you can see, the golden copy mechanism has existed since WSFC 2008. Through it, in a disaster-recovery forced-quorum scenario we can tell the cluster unambiguously which site's data should prevail, and other nodes cannot, or at least should not, provide service before synchronizing with the golden copy.

Now let's look at cluster quorum. Put simply, quorum is an agreement for maintaining cluster availability: according to the quorum model we choose, it specifies the minimum number of working nodes the cluster will accept. The quorum model uses a voting mechanism; normally each node has one vote, and the cluster witness has one vote. Quorum decides from the vote count whether the quorum model's agreement is satisfied. If the vote count falls below the number of working nodes the quorum model accepts, the cluster is deemed unable to maintain availability and shuts down.

Quorum serves two purposes in a cluster:

1. Track whether the cluster's current vote count satisfies the quorum model's agreement, and shut down the cluster if it falls below the minimum number of working nodes.

2. When a partition occurs, ensure that the side with the majority of nodes wins; this is why we must always keep the cluster's vote count odd. When a partition occurs, the majority side always takes over the services provided by the cluster, and the minority side shuts down.
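The majority test both purposes rely on can be sketched in a few lines of Python (illustrative only; in real WSFC the totals come from the configured quorum model, with node and witness votes each counting as 1):

```python
# Minimal strict-majority check behind the node-majority quorum model.

def has_quorum(live_votes: int, total_votes: int) -> bool:
    """A partition may run the cluster only if it holds a strict majority."""
    return live_votes > total_votes // 2

# 5 voters (e.g. 4 nodes + 1 witness): 3 live votes keep the cluster up,
# while a 2-vote minority partition must shut down.
print(has_quorum(3, 5), has_quorum(2, 5))  # True False

# With an EVEN total, a 50/50 split leaves neither side with a majority,
# which is the split-brain risk discussed below.
print(has_quorum(2, 4))  # False
```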

So how do we keep the cluster's vote count odd? On one hand we can use the cluster's built-in technology; on the other, the architect's design must be sound: with an even number of nodes, you must configure a disk witness, file share witness, or cloud witness, otherwise split-brain can occur.

So-called split-brain means that in a 50/50 partition scenario the cluster cannot decide which side should win the services, so both sides believe they have won and contend for resources, leaving the cluster running abnormally and unable to provide service. The cluster therefore introduces the witness mechanism: a disk witness, file share witness, or cloud witness can solve this problem. With a witness, the vote count is still dominated by the node votes, but a witness vote is added. When a 50/50 partition occurs, whichever partition can reach the witness device gains the witness vote and ultimately takes over the cluster services, preserving the majority-wins principle.

As the quorum model evolved, by WSFC 2012 the cluster no longer primarily insists that operation strictly follow the quorum model's agreement, and instead emphasizes continuity of the clustered applications. 2012 introduces dynamic quorum, which adjusts node votes dynamically: under the node-majority model the cluster has roughly a 66% chance of surviving down to the last node; with an even number of nodes plus a disk witness, the cluster can survive to the last node while the witness is online; with an odd number of nodes plus a disk witness, it can survive down to two nodes. 2012 R2 adds dynamic witness, which adjusts the witness vote dynamically. So from 2012 R2 onward, whether the node count is odd or even, the recommendation is always to configure a witness for the cluster; with the witness device online, 2012 R2 can genuinely keep the cluster alive down to the last node.
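The keep-it-odd idea behind dynamic quorum can be sketched as follows (a deliberate simplification; the real vote-adjustment heuristics in WSFC 2012/2012 R2 are considerably more involved):

```python
# Rough sketch of dynamic quorum: when an even number of voters remains,
# the cluster withdraws one node's vote so a later 50/50 split still has
# a deciding majority.

def dynamic_votes(nodes_up: int, witness_up: bool = False) -> int:
    votes = nodes_up + (1 if witness_up else 0)
    if votes % 2 == 0 and votes > 0:
        votes -= 1   # drop one node vote to keep the total odd
    return votes

# 4 nodes, no witness: one vote is removed, leaving 3 votes, so a 2/2
# node split still produces a clear winner.
print(dynamic_votes(4))         # 3
# 3 nodes + witness: 4 votes becomes 3.
print(dynamic_votes(3, True))   # 3
# An odd total is left alone.
print(dynamic_votes(5))         # 5
```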

So what is forced quorum?

Put simply, forced quorum means forcing one side to start the cluster and provide services in a split-brain scenario, or in a scenario that does not satisfy the cluster's quorum agreement.

Main application scenarios of forced quorum

Disaster recovery: for example, with two nodes at the primary site and one node at the standby site, the entire primary site crashes. Although the standby site's votes do not satisfy the cluster quorum agreement, we still force-start the standby site to provide services.

Split-brain: a 50/50 partition occurs in the cluster and the cluster shuts down; we force one side to provide services.

In Microsoft videos you may often hear about force-starting a site, and many readers wonder how that is done: do you have to run a force-start command on every node at the site?

In fact, no. The forced quorum command only needs to be run on one node of the standby site (for example `net start clussvc /forcequorum`, or the PowerShell cmdlet `Start-ClusterNode -FixQuorum`). Once it executes, the cluster starts, and the other nodes, whether at the same site or different sites, will detect that forced quorum has taken place here.

As the technology has evolved, there are now few scenarios where forced quorum is needed.

For example, 2012 introduces dynamic quorum. If the cluster currently has four nodes, dynamic quorum automatically removes one node's vote; when a partition occurs, the two-node side holding the votes wins. From 2012 R2 you can even specify which node's vote should be removed first, via the LowerQuorumPriorityNodeID cluster property.
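A toy tiebreak in the spirit of the LowerQuorumPriorityNodeID property: in an exact 50/50 split, the side that does not contain the deprioritized node wins. The property name is real; this decision function and the node names are simplified inventions for illustration.

```python
# Sketch of a 50/50 tiebreak using a deprioritized node (2012 R2 idea).

def winning_partition(part_a: set, part_b: set, low_priority_node: str) -> set:
    if len(part_a) != len(part_b):
        # Unequal split: ordinary majority rule applies.
        return part_a if len(part_a) > len(part_b) else part_b
    # Even split: the side holding the low-priority node gives up a vote.
    return part_b if low_priority_node in part_a else part_a

win = winning_partition({"N1", "N2"}, {"N3", "N4"}, low_priority_node="N4")
print(sorted(win))  # ['N1', 'N2']
```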

Only if dynamic quorum is misconfigured or stopped, so that the four-node cluster does not automatically remove a node vote, would a split-brain partition and cluster shutdown result, requiring a forced start.

There is also a scenario few people mention: a 2012 R2 witness-failure scenario. With three nodes remaining plus a dynamic witness, if the witness device fails, the cluster will shut down after one more node is lost; in this case the cluster needs to be force-started.

Beyond that, since 2012 R2 the main purpose of forced quorum is to start a minority site's nodes in disaster-recovery scenarios.

So what impact does forced quorum have on the cluster afterwards?

In fact, the operation itself is very simple: if the cluster has shut down and you need to force it to provide services, just run the command on the side you want to win. After forced quorum is performed, two things happen behind the scenes:

1. The node's cluster service is force-started, which in turn starts the cluster.

2. The node's copy of the cluster database is promoted, via its paxos tag, to the golden copy.

After startup, the nodes that were not force-started must synchronize their cluster database with the golden-copy node before they can join the cluster.

That synchronization gate is called prevent quorum. Before 2012 R2, if forced quorum was performed on a minority of nodes, then when the failed primary site recovered you had to manually run the prevent-quorum command (`net start clussvc /pq`) at the primary site as soon as possible. This tells the primary site that the current cluster environment contains a force-started node and that it must synchronize its cluster database with that node before coming online; otherwise the primary site would also try to form a cluster, and cluster-database overwrites could easily occur. At the time, Microsoft also recommended bringing recovered primary-site nodes back one at a time to synchronize.

Starting with 2012 R2, this became automatic: the cluster has built-in logic to track a force-started partition, and when other partitions detect one they automatically perform the prevent-quorum operation, synchronizing their cluster database with it before coming online.
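The automatic gate described above can be summarized as a tiny state function (the function and its arguments are invented for this sketch, not a real cluster-service API):

```python
# Toy state machine for the automatic prevent-quorum gate (2012 R2+):
# a recovering node that detects a force-started partition may not come
# online until its database matches the golden copy.

def rejoin_action(sees_forced_partition: bool, synced_with_golden: bool) -> str:
    if sees_forced_partition and not synced_with_golden:
        return "prevent-quorum: hold offline and sync cluster database"
    return "come online and join the cluster"

print(rejoin_action(True, False))   # held offline until synced
print(rejoin_action(True, True))    # joins the forced partition
```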

Forced startup itself knows nothing about the applications layered on the cluster, so as long as an application requires no extra steps, there is no additional downtime after a forced start. For example, take a three-node cluster with two nodes in Beijing and one in Tianjin. The Beijing site goes down, the Tianjin site is force-started, and after startup the application comes online at the Tianjin site. When the Beijing site recovers, it joins the Tianjin partition after prevent quorum. At this point the cluster actually works normally: all node paxos tags have synchronized to the latest, the golden-copy effect has in theory been eliminated, and multi-master updates are possible again. Lao Wang considers the cluster back to normal operation at this point. If you still worry that the golden-copy effect has not gone away, you can move the application online from the Tianjin site back to the Beijing site, then restart the cluster service on the Tianjin node in the normal way. So in theory, as long as the upper-layer application requires no post-forced-quorum operations, forced quorum causes no additional downtime.

As far as Lao Wang knows, for SQL Server Availability Groups (AG), database follow-up operations do need to be performed after forced quorum is executed.

SQL AG post-forced-startup operations:

https://technet.microsoft.com/en-us/library/hh313151(v=sql.110).aspx

https://blogs.msdn.microsoft.com/alwaysonpro/2014/03/04/manual-failover-of-availability-group-to-disaster-recovery-site-in-multi-site-cluster/

In Lao Wang's experience, two additional points deserve attention when using forced quorum.

1. In 2012 R2 scenarios, forced quorum starts the standby site, but after the primary site recovers its cluster service cannot start and join the cluster; that is, the prevent-quorum process does not complete. The cause may be network jitter between the primary nodes and the force-started standby node, which makes cluster-database synchronization fail, or the primary site's update patch level differing from the standby site's. After disaster recovery in a real scenario, make sure the network is stable and that all site nodes have the same system update configuration before they rejoin the cluster. If that still does not work, try manually running prevent quorum on a primary node, then examine the cluster log.

2. Understand and use forced quorum correctly. During 2008 R2 cluster operation, quorum keeps the cluster to a majority-node-survival model as far as possible. That is, with 3 nodes at the primary site and 2 at the secondary site under the node-majority quorum model, when the primary site goes down, quorum takes the secondary site's two nodes offline and stops the cluster even though those two nodes are perfectly capable of supporting the clustered application, and it prevents the secondary-site nodes from starting in the normal way. In this situation we use forced quorum to force the secondary site's two nodes to provide service: although the secondary site's node count does not meet the cluster's minimum vote requirement, two serving nodes are better than everything being down. This is what forced quorum is mainly for: scenarios where the cluster does not meet the minimum votes its quorum agreement allows yet must still start and serve, or split-brain cases where one side must be designated to serve.

However, during WSFC operation a node's cluster service may also fail to start because of system configuration, drivers, or third-party software. Forced quorum does not apply in those cases; it is meant for situations where quorum itself prevents the cluster from starting. In other scenarios it has side effects. For example, suppose the cluster has five nodes, four at the primary site and one at a secondary site, with automatic failover, and the primary site is currently serving the application, but the secondary site's cluster service suddenly cannot start. If you force-start the secondary site at this point, and suppose the start actually succeeds, the secondary site's cluster database will completely overwrite the primary site's, assuming the secondary site has not synchronized the latest cluster database. That is, even if the secondary site lags the primary site's configuration by, say, 10 paxos-tag versions, the stale secondary-site copy will still override the primary site's copy after forced quorum, because forced quorum promotes it to the golden copy; in the worst case the primary site's cluster configuration is lost. Therefore, when a cluster service cannot start, always confirm the problem via the event log, cluster log, and dump log before performing any repair.

To sum up, forced quorum itself does not cause cluster downtime. It is simply an operation that forces cluster nodes into the UP state when quorum prevents the cluster from starting normally, mainly for split-brain and disaster scenarios.

Possible impacts

1. Applications running on the cluster may need extra operations after a forced start; this depends on the application's own mechanisms.

2. After forced quorum, the other nodes must go through the prevent-quorum process before they can start and join the cluster. If other nodes are to join the force-started partition, make sure their system configuration is consistent and the network is stable when they rejoin.

3. Do not use forced quorum blindly. Use it only when the cluster cannot start because the quorum agreement is not met; blind use can cause the cluster database to be wrongly overwritten.

Finally, let's walk through a forced-quorum operation in a disaster-recovery scenario.

Take SQL AlwaysOn with an FCI as an example. Per Microsoft's official recommendation, under normal circumstances the nodes hosting the primary replica keep their votes, while the nodes hosting the disaster-recovery secondary replica have their votes removed.

Voting eligibility is each node's qualification to participate in cluster quorum. A node's vote can be removed while the node is healthy, and a node stripped of its vote can still host clustered applications. But once the primary site goes down, unless the secondary site is manually force-started, the secondary site's non-voting nodes will not come online and the cluster goes offline.

Starting with WSFC 2012, the cluster supports adjusting each node's voting eligibility through the GUI.

The mainstream reason to adjust voting eligibility manually is to avoid the extra downtime that automatic failover would cause during disaster recovery: SQL failover takes a long time, and even longer across sites. We want every failover to be controllable, so we turn the cluster into a manual-failover model.

Concretely, set the voting eligibility (the NodeWeight property) of all standby-site nodes to 0. When a disaster strikes the primary site, the application will not automatically fail over to the standby site, because with zero votes the standby site is not qualified to form a cluster. The procedure at that point is: manually force-start the standby node, bring the clustered application online, grant the standby site voting eligibility, and set the primary site's votes to 0. When the primary site recovers, set its voting eligibility back to 1 and then manually move the cluster resources back.
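The weight-flipping pattern above can be sketched as follows (node names and the helper function are inventions for this sketch; in a real cluster the weights would be the nodes' NodeWeight values and the force-start would be the ForceQuorum operation):

```python
# Sketch of the manual-failover pattern: standby-site nodes get weight 0,
# so a primary-site outage cannot auto-fail-over; the administrator
# force-starts the standby and flips the weights.

weights = {"PrimaryA": 1, "PrimaryB": 1, "StandbyA": 0, "StandbyB": 0}

def can_form_cluster(live_nodes, weights):
    live_votes = sum(weights[n] for n in live_nodes)
    total_votes = sum(weights.values())
    return live_votes > total_votes // 2

# Primary site down: the standby alone holds 0 of 2 votes -> no automatic start.
print(can_form_cluster({"StandbyA", "StandbyB"}, weights))  # False

# Disaster recovery: force-start the standby, then flip the votes.
weights.update({"StandbyA": 1, "StandbyB": 1, "PrimaryA": 0, "PrimaryB": 0})
print(can_form_cluster({"StandbyA", "StandbyB"}, weights))  # True
```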

Reference links

https://blogs.msdn.microsoft.com/alwaysonpro/2014/03/04/manual-failover-of-availability-group-to-disaster-recovery-site-in-multi-site-cluster/

https://technet.microsoft.com/en-us/library/mt607084(v=office.16).aspx

https://msdn.microsoft.com/en-us/library/jj191711.aspx

With this manual control, although an administrator must act during failover, split-brain is avoided: lacking votes, the secondary site is not qualified to form a cluster in a 50/50 partition.

It also avoids downtime from spurious failovers triggered by heartbeat loss over an unstable network.

