2025-03-28 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report
This article looks at how to choose an optimization scheme for the WSFC quorum model. The content is detailed and analyzed from a practical point of view; I hope you find it useful.
Optimization Schemes for the WSFC Quorum Model
Production environment: a system database in a high-availability setup
The production environment runs a three-node SQL Server 2014 AlwaysOn AG (primary, standby, and disaster-recovery nodes) on Windows Server 2012 Standard, built on an Active Directory domain and a Windows Server Failover Cluster (WSFC). There is no witness disk; the cluster uses node-majority quorum. The primary and standby nodes have votes, while the disaster-recovery node does not. Dynamic quorum is enabled by default and, by default, removes the vote from an arbitrarily selected node; in the current production environment the remaining vote sits on the standby node.
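The current vote distribution can be checked from any cluster node with PowerShell. A minimal sketch, assuming the FailoverClusters module is installed (output depends on the environment):

```powershell
# Show each node's configured vote (NodeWeight) and the vote
# currently assigned by dynamic quorum (DynamicWeight).
Import-Module FailoverClusters
Get-ClusterNode | Format-Table Name, Id, State, NodeWeight, DynamicWeight
```

A node with NodeWeight 1 but DynamicWeight 0 is one whose vote dynamic quorum has removed.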
Part 1: test environment resource description
Windows failover clusters, AlwaysOn AG resources, and role descriptions:
Description:
The original environment has no quorum witness. The newly added shared disk is attached only to TEST-GS-ZHXT1 and TEST-GS-ZHXT2 and is used as the disk witness in Solution 4.
Concerns:
1. Whether WSFC and AG can provide services normally.
2. The current number of votes and how they move.
3. When the vote cannot be transferred in time (all current votes drop to 0): the status of WSFC and AG after forced quorum, and whether AG can successfully fail over and its replicas can resume.
Part 2: testing the current configuration, simulating the production environment
Scenario 1: standby node downtime
Objective: to simulate forced quorum when the standby node's vote has not been transferred in time.
After the standby node goes down before its vote can be transferred, WSFC crashes and AG enters the Resolving state.
Maintenance operations:
1) Perform forced quorum on the primary node
Open cmd as administrator and run:
net stop clussvc
net start clussvc /fq
WSFC can then provide services again.
On the primary node, Get-ClusterNode shows the primary node Up with a vote of 1, the disaster-recovery node Up with a vote of 0, and the standby node Down with a vote of 0.
2) Check whether AG is running normally or is in the Resolving state.
AG is in the Resolving state.
3) If Resolving, check whether failover can be forced.
On the primary node, run ALTER AVAILABILITY GROUP testag FORCE_FAILOVER_ALLOW_DATA_LOSS on the instance.
After the forced failover, AG can provide services again.
Remaining issues:
After the operations above, the standby node recovers from its failure.
On the standby node, Get-ClusterNode shows the standby node Up with a vote of 1, and the primary and disaster-recovery nodes Down with votes of 0.
The standby node did not rejoin the primary node's cluster as expected.
On the standby node, run:
net stop clussvc
net start clussvc
to restart the cluster service on the standby node.
WSFC recovers, and the AG secondary replica can then be manually resumed.
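The manual resume mentioned above is done per database on the suspended secondary replica. A T-SQL sketch (the database name testdb is illustrative):

```sql
-- Run on the secondary replica whose data movement is suspended.
ALTER DATABASE [testdb] SET HADR RESUME;
```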
Part 3: testing optimization schemes for the current quorum model
Solution 1: set the cluster property LowerQuorumPriorityNodeID to the standby node's ID
Purpose: to move the current vote to the primary node.
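A sketch of the change, run in an elevated PowerShell session on any cluster node (the node ID of 2 is illustrative; read the real ID from Get-ClusterNode first):

```powershell
# Tell dynamic quorum to remove the standby node's vote first.
Import-Module FailoverClusters
(Get-Cluster).LowerQuorumPriorityNodeId = 2
```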
Scenario 1: standby node downtime
After the standby node goes down:
WSFC is normal, the AG primary replica is normal, and the disaster-recovery node's secondary replica is normal.
The standby node's secondary replica is Down; AG can still provide services normally.
After the standby node is restored:
WSFC is normal, the AG primary replica is normal, and the disaster-recovery node's secondary replica is normal.
If the standby node recovers quickly, its database recovers automatically.
If the standby node takes longer to recover, its secondary replica falls out of sync and its database cannot be accessed until the database is manually resumed.
If the standby node is down for a very long time and the needed log on the primary has been truncated by log backups, the log is insufficient for resynchronizing the standby, and the standby's AG replica must be rebuilt.
Scenario 2: disaster recovery node downtime
Same behavior as Scenario 1.
Scenario 3: primary node downtime
After the primary node goes down:
The vote cannot be transferred in time, WSFC crashes, and AG enters the Resolving state.
Forced quorum is performed on the standby node and WSFC recovers.
After AG is forcibly failed over to the standby node, the AG primary replica is normal. The disaster-recovery node's database can be accessed after it is manually resumed.
After the primary node is restored:
Tested 4 times, with 2 different results:
a) The primary node does not join the standby node's cluster; it becomes normal after restarting the clussvc service on the primary node.
b) The primary node automatically joins the standby node's cluster.
Solution 2: give all three nodes a vote, and set the cluster property LowerQuorumPriorityNodeID to the disaster-recovery node's ID
Objective: so that the failure of any single node leaves enough votes to keep WSFC normal.
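A sketch of the configuration in PowerShell (the disaster-recovery node's ID of 3 is illustrative):

```powershell
Import-Module FailoverClusters
# Give every node a vote, including the disaster-recovery node.
Get-ClusterNode | ForEach-Object { $_.NodeWeight = 1 }
# Prefer removing the disaster-recovery node's vote first.
(Get-Cluster).LowerQuorumPriorityNodeId = 3
```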
Scenario 1: standby node downtime
After the standby node goes down:
WSFC and AG are normal.
After the standby node is restored:
WSFC voting returns to normal and AG returns to normal.
Scenario 2: disaster recovery node downtime
Same behavior as Scenario 1.
Scenario 3: primary node downtime
After the primary node goes down:
WSFC is normal, and AG is in the Resolving state. After forced failover on the standby node, AG can provide services externally; the disaster-recovery node's database is manually resumed.
After the primary node is restored:
WSFC voting returns to normal. After a manual resume, the database synchronizes with the new primary node's data and can be accessed.
Scenario 4: the primary node and the standby node are down at the same time
After the primary node and standby node are down at the same time:
WSFC crashes; forced quorum is performed on the disaster-recovery node, and WSFC can provide services again.
AG is in the Resolving state; after forced failover, AG can provide services externally.
After the primary and standby nodes are restored:
WSFC voting returns to normal. After a manual resume, the database synchronizes with the new primary node's data and can be accessed.
Scenario 5: standby node and disaster recovery node are down at the same time
After the standby node and disaster-recovery node are down at the same time:
WSFC crashes; forced quorum is performed on the primary node, and WSFC can provide services again.
AG is in the Resolving state; after forced failover, AG can provide services externally.
After the standby node and disaster-recovery node are restored:
WSFC voting returns to normal. After a manual resume, the databases synchronize with the new primary node's data and can be accessed.
Solution 3: add an arbitration node and give it a vote
Purpose: to bring the total number of votes to 3, based on node-majority quorum.
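A sketch of adding the arbitration node, assuming a hypothetical node name TEST-GS-ARB1:

```powershell
Import-Module FailoverClusters
# Join the extra arbitration node and make sure it has a vote.
Add-ClusterNode -Name "TEST-GS-ARB1"
(Get-ClusterNode -Name "TEST-GS-ARB1").NodeWeight = 1
```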
Scenario 1: standby node downtime
After the standby node goes down:
WSFC and AG are normal.
After the standby node is restored:
WSFC voting returns to normal, AG is normal, and the standby node's database can be accessed after a manual resume.
Scenario 2: standby node and disaster recovery node are down at the same time
After the standby node and disaster-recovery node are down at the same time:
WSFC and AG are normal.
After the standby node and disaster-recovery node are restored:
WSFC voting returns to normal and AG returns to normal.
Scenario 3: primary node downtime
After the primary node goes down:
WSFC is normal; AG is in the Resolving state and cannot provide services.
After AG is forcibly failed over to the standby node, it can provide services. After a manual resume, the disaster-recovery node's database synchronizes with the new primary node's data.
After the primary node is restored:
WSFC voting returns to normal. After a manual resume, the database synchronizes with the new primary node's data and can be accessed.
Scenario 4: the primary node and the standby node are down at the same time
After the primary node and standby node are down at the same time:
WSFC crashes; forced quorum is performed on the disaster-recovery node, and WSFC can provide services again.
AG is in the Resolving state; after forced failover, AG can provide services externally.
After the primary and standby nodes are restored:
WSFC voting returns to normal. After a manual resume, the database synchronizes with the new primary node's data and can be accessed.
Solution 4: add a disk witness attached to the primary and standby nodes
Objective: based on dynamic quorum, use a disk witness with Windows Server 2012 R2 dynamic witness behavior.
Description:
The shared disk can be attached only to the primary and standby nodes, and the quorum witness can be configured as a disk witness.
Because the primary and standby nodes each have 1 vote, the disk witness also carries 1 vote, keeping the cluster's vote count odd.
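A sketch of switching the quorum configuration to use the disk witness (the disk resource name "Cluster Disk 1" is illustrative):

```powershell
Import-Module FailoverClusters
# Node majority plus a disk witness on the shared disk.
Set-ClusterQuorum -NodeAndDiskMajority "Cluster Disk 1"
```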
Scenario 1: standby node downtime
After the standby node goes down:
WSFC and AG are normal.
After the standby node is restored:
WSFC voting returns to normal and AG returns to normal.
Scenario 2: standby node and disaster recovery node are down at the same time
After the standby node and disaster-recovery node are down at the same time:
WSFC and AG are normal.
After the standby node and disaster-recovery node are restored:
WSFC voting returns to normal and AG returns to normal.
Scenario 3: primary node downtime
After the primary node goes down:
WSFC is normal; AG is in the Resolving state and cannot provide services.
After AG is forcibly failed over to the standby node, it can provide services. After a manual resume, the disaster-recovery node's database synchronizes with the new primary node's data.
After the primary node is restored:
WSFC voting returns to normal. After a manual resume, the database synchronizes with the new primary node's data and can be accessed.
Scenario 4: the primary node and the standby node are down at the same time
After the primary node and standby node are down at the same time:
WSFC crashes; forced quorum is performed on the disaster-recovery node, and WSFC can provide services again.
AG is in the Resolving state; after forced failover, AG can provide services externally.
After the primary and standby nodes are restored:
They rejoin the cluster; after a manual resume on both nodes, their databases synchronize with the new primary node's data and can be accessed.
Part 4: evaluation of optimization schemes
The following table summarizes the test results for each of the four solutions:
Explanation: the scenarios marked in green were not the focus of this test; their results are theoretical conclusions.
When WSFC crashes, forced quorum restores service.
After WSFC is serving again, AG is in the Resolving state; a forced failover restores AG service.
Solutions 3 and 4 keep both WSFC and AG normal when either the standby node or the disaster-recovery node goes down, or both go down at the same time.
Solution 3 is simple to configure: a virtual machine is used as an additional cluster node, given a vote, and participates in quorum.
In Solution 3, if the arbitration node goes down, the behavior matches the current online configuration; if the standby node then goes down, WSFC crashes and AG cannot provide services.
To optimize further, combine the advantages of Solutions 1 and 3: on top of Solution 3, set the cluster property LowerQuorumPriorityNodeID to the standby node's ID.
This yields Solution 5: add an arbitration node, give it a vote, and set LowerQuorumPriorityNodeID to the standby node's ID.
The results of further tests are as follows:
Testing shows that the first five scenarios behave the same as in Solution 3.
In the scenario where the arbitration node goes down first and the standby node goes down afterward, behavior is stable, which is better than Solution 3.
Reference:
"When using dynamic quorum, consider the following two issues, which are easy to run into but easy to overlook:
1. With pure node majority, dynamic quorum adjusts the vote count as nodes fail; when 2 nodes remain, there is about a 66% chance the cluster survives down to the last node. If the node holding the vote goes down suddenly, the cluster shuts down and must be force-started manually.
2. With a witness plus node votes (dynamic quorum + dynamic witness), when 2 nodes remain and the witness suddenly loses contact, the witness does not remove its own vote and dynamic quorum does not adjust down to 1 vote; if one more node goes down, the cluster shuts down. Once the other two nodes are restored, you can manually switch back to the node-majority quorum model; then, when 2 of the 3 nodes remain, the vote is automatically adjusted to 1 and the chance of surviving to the last-node scenario is again about 66%. Also, because the cluster was force-started, when the witness later recovers, the force-started cluster database overwrites the database on the witness disk."
Takeaways:
① The order in which nodes are added when creating the WSFC matters.
When a vote has to move between voting nodes, WSFC prefers the node with the higher ID.
We want the vote to stay on the primary node in every scenario, so the primary node should have the highest ID.
Node IDs are assigned incrementally as the WSFC is built.
In order of increasing importance: first create the cluster with the non-voting node, then add the voting arbitration node, then the voting standby node, and finally the primary node.
(If nodes are added in one batch when the WSFC is created, their IDs are not controllable.)
② Optimize which node is preferred when the vote moves.
Because LowerQuorumPriorityNodeID is a cluster-wide property, it can be set to only one node ID; weights or priorities cannot be set for multiple nodes.
This property should be adjusted as needed on top of ①.
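The join order described in ① can be sketched as follows; the cluster name and node names other than TEST-GS-ZHXT1/TEST-GS-ZHXT2 are illustrative:

```powershell
Import-Module FailoverClusters
# Node IDs are assigned incrementally, so add nodes in order of
# increasing importance, leaving the primary with the highest ID.
New-Cluster -Name "TEST-GS-CLU" -Node "TEST-GS-DR"   # disaster-recovery node
Add-ClusterNode -Name "TEST-GS-ARB"                  # arbitration node
Add-ClusterNode -Name "TEST-GS-ZHXT2"                # standby node
Add-ClusterNode -Name "TEST-GS-ZHXT1"                # primary node
```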
Part 5: selection of the optimization scheme
Weighing cost against benefit, we simulated different voting-node failure scenarios for Solutions 1 and 5 and ran further tests. The results are as follows:
From the results, in Solution 5 WSFC stays normal and AG needs a forced failover in scenarios ②④⑤, which upgrades recovery from minute-level to second-level. Considering that a primary database outage will only ever be switched over manually, after discussion Solution 1 was selected.
The above covers how to choose an optimization scheme for the WSFC quorum model. If you happen to have similar questions, the analysis above may help you work through them.
© 2024 shulou.com SLNews company. All rights reserved.