WSFC 2016 Site Awareness and Health Service


2025-07-06 Update. From: SLTechnology News&Howtos


Lao Wang once wrote an article that briefly covered the fault domain and site awareness features of WSFC 2016. With further hands-on use, however, Lao Wang found that the concept of site awareness runs through many features of WSFC 2016, so he decided to write another article, mainly to discuss how to think about site awareness, fault domains, and the WSFC 2016 health service in the real world.

# 1. A first look at WSFC 2016 fault domains

Generally, we only hear about fault domains when we deliver or consume an SLA. For example, when we purchase a cloud service, the vendor guarantees a certain number of nines of availability, but only if we place the application's virtual machines in different fault domains. To users, the cloud vendor usually just says: put the virtual machines in different fault domains, and they will be placed in different racks, will never be maintained at the same time, and are very unlikely to fail together.

As the party delivering the service, we have to set the fault domain policy behind the scenes. A fault domain is not a fixed technology; it is better understood as a specification. Once the specification is in place, administrators know that a user's machines in different fault domains must not be maintained at the same time, and at the technical level the cluster system or VIM system ensures that resources in different fault domains are always placed in different racks or chassis. At this stage the implementation can be regarded as logical definition plus physical implementation. Whether the physical implementation is actually achieved depends on whether the infrastructure is aware of the fault domains. WSFC 2016 supports the logical definition of three fault domain levels: chassis, rack, and site.

At present, only S2D (Storage Spaces Direct) truly implements fault domain awareness. Once S2D detects that WSFC is configured with chassis or rack fault domains, it always ensures that the multiple copies of each extent are spread across different chassis or racks.
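As a minimal sketch of the logical definition described above, the fault domain cmdlets in the FailoverClusters module can declare the chassis, rack, and site levels and nest them. The names `Site-A`, `Rack-01`, and `Chassis-01` are hypothetical placeholders, `hv01` is borrowed from the example later in this article, and the session is assumed to run on a WSFC 2016 cluster node.

```powershell
# Assumption: run on a WSFC 2016 cluster node; all fault domain names are hypothetical.
Import-Module FailoverClusters

# Logical definition: create one fault domain per level.
New-ClusterFaultDomain -Type Site    -Name "Site-A"
New-ClusterFaultDomain -Type Rack    -Name "Rack-01"
New-ClusterFaultDomain -Type Chassis -Name "Chassis-01"

# Nest the levels (chassis inside rack, rack inside site),
# then place a cluster node inside the chassis.
Set-ClusterFaultDomain -Name "Rack-01"    -Parent "Site-A"
Set-ClusterFaultDomain -Name "Chassis-01" -Parent "Rack-01"
Set-ClusterFaultDomain -Name "hv01"       -Parent "Chassis-01"

# Review the resulting hierarchy.
Get-ClusterFaultDomain
```

These commands only record the hierarchy; whether anything acts on it still depends on a component, such as S2D, that is aware of fault domains.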

# 2. Further discussion on site awareness and fault domain

Lao Wang believes that site awareness serves seven main functions.

Failover rules: when site awareness is configured, an application first attempts to fail over to a node in the same site. Anti-affinity and possible owner settings override site awareness.

Node drain rules: when a node is drained for maintenance, applications first try to move to nodes in the same site. Anti-affinity and possible owner settings override site awareness.

Cross-site heartbeat: the heartbeat detection frequency between sites can be configured only when site awareness is configured for the cluster.

Site vote pruning: once site awareness is configured, we can configure a preferred site. In a 50/50 split, the nodes of the preferred site win, and the non-preferred site automatically loses one vote.

Hierarchical preferred site: you can configure a preferred site at the cluster level to prune votes from the non-preferred site, or at the cluster group level to achieve a multi-master layout in which different groups prefer different sites.

Storage site affinity: once site awareness is configured, a virtual machine by default follows the site where its CSV is located, because site awareness logic assumes that keeping the virtual machine and its CSV at the same site is more efficient. Configuring the storage preferred site ensures that the virtual machine and its CSV are always located at the same site. If a virtual machine finds that its current site differs from that of its CSV, it is moved to the CSV's site.

Stretch cluster configuration: a stretch cluster actually stretches two things. One is the application running on the cluster; the other is the storage, which is replicated and fails over automatically. Although the storage can fail over automatically across sites, stretch cluster storage replication has no notion of multiple sites: it only knows how to copy disk contents to the specified nodes and interact with the cluster. For the application, however, we do need to consider multi-site failover. By default, an application can fail over to any available node, so a virtual machine may move to the remote site while its storage is still served from the primary site, which reduces access efficiency. The best practice for a stretch cluster is therefore to combine it with site awareness, so that storage failover and application placement work together: by default the application fails over within its local site, and its storage stays local as well.
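Several of the seven functions above map to cluster and group properties. The following is a minimal sketch, assuming that two site fault domains named `Site-A` and `Site-B` already exist and that a cluster group named `VM01` is present; all three names are hypothetical, and the heartbeat values are only illustrative.

```powershell
# Assumption: site fault domains "Site-A" and "Site-B" and the group "VM01"
# already exist; the names are hypothetical.

# Cluster-level preferred site: the non-preferred site loses a vote on a 50/50 split.
(Get-Cluster).PreferredSite = "Site-A"

# Group-level preferred site: lets individual groups prefer a different site.
(Get-ClusterGroup -Name "VM01").PreferredSite = "Site-B"

# Cross-site heartbeat: interval in milliseconds between heartbeats,
# and the number of missed heartbeats tolerated before a node is considered down.
(Get-Cluster).CrossSiteDelay     = 1000
(Get-Cluster).CrossSiteThreshold = 20

# Read the settings back to confirm them.
Get-Cluster | Format-List PreferredSite, CrossSiteDelay, CrossSiteThreshold
```

Failover affinity, node drain affinity, and storage affinity need no separate switch in this sketch; they follow from the site definitions themselves.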

When we design a cross-site cluster architecture, in addition to network, storage, and quorum, we also need to consider the cluster's placement policy. In many cases, ignoring the placement policy leads to additional downtime, whereas making good use of it solves many complex problems.

To put it bluntly, Lao Wang believes that site awareness and S2D fault domain awareness are two different things. Site awareness defines the site architecture inside the cluster, so that failover, node drain, heartbeat, and quorum have an additional reference when they execute. The multi-site architecture in our heads is expressed through software definition, and the cluster components consult it as they work.

Site awareness is an approach introduced in WSFC 2016, although some of its effects could be achieved in earlier versions. For example, to make an application fail over within the local site first, we used to define preferred owners; for site vote pruning, we used to define LowerQuorumPriorityNodeID. What is new in WSFC 2016 is that a single site awareness feature ties these separate cluster functions together, and that is where its power lies. Site awareness can also be configured in batches through PowerShell, which makes it easier to manage than the pre-2016 solutions. In short, we need to gradually accept this concept and try to apply it to make multi-site cluster architectures more complete.
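For comparison, the pre-2016 approach mentioned above might look roughly like the following sketch. The group name `VM01` and the node names `hv01`, `hv02`, and `hv03` are hypothetical, with `hv03` standing in for a node at the secondary site.

```powershell
# Pre-2016 style, configured per group and per node. All names are hypothetical.

# Local-site-first failover: list the local-site nodes as preferred owners of the group.
Set-ClusterOwnerNode -Group "VM01" -Owners "hv01", "hv02"

# Vote pruning: mark a secondary-site node as the one whose vote is dropped on a 50/50 split.
(Get-Cluster).LowerQuorumPriorityNodeID = (Get-ClusterNode -Name "hv03").Id
```

Each setting here has to be maintained separately for every group and node, which is the management overhead that a single site definition is meant to remove.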

# 3. Finally, fault domains and the health service

To summarize, Lao Wang believes that fault domain definitions in WSFC 2016 have three main uses.

1. Work with applications such as S2D to achieve fault domain awareness (hopefully more and more applications will become fault domain aware like S2D in the future).

2. Work with WSFC to implement site awareness, which controls cross-site failover, node drain, heartbeat detection, and quorum behavior.

3. Work with the health service to pinpoint fault locations during troubleshooting.

When we create fault domains in PowerShell, they are really just logical definitions in text form. Without components that can perceive them, such as S2D and WSFC, they are ordinary text and have no effect. Only when a component is aware of them is the defined fault domain hierarchy physically realized.
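One way to see that the definitions are just text is to export them. The sketch below assumes a WSFC 2016 cluster node; the file path `C:\Temp\FaultDomains.xml` is a hypothetical location chosen for illustration.

```powershell
# Assumption: run on a WSFC 2016 cluster node; the file path is hypothetical.

# List the defined fault domains with their type and parent.
Get-ClusterFaultDomain

# Export the whole hierarchy as XML text.
Get-ClusterFaultDomainXML | Out-File "C:\Temp\FaultDomains.xml"

# After editing the file, load the definition back into the cluster.
$xml = Get-Content "C:\Temp\FaultDomains.xml" | Out-String
Set-ClusterFaultDomainXML -XML $xml
```

The XML round trip is also a convenient way to define many fault domains in one batch instead of issuing one command per domain.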

With this concept clear, let's look at the health service. Lao Wang left it out of his WSFC 2016 series, so he is covering it here to fill the gap.

Basically, we can think of it as WSFC's built-in monitoring function. The health service helps us focus on a cluster application: it collects the performance of cluster components, tracks their working status, and reports events for the different levels of their running state.

At present, the health service supports only S2D. When we enable S2D in the cluster, the health service is enabled by default. It monitors the operation of S2D and collects performance reports. Unlike ordinary event logs, Lao Wang finds the faults reported by the health service friendly and clear to administrators at a glance.

To use the health service to monitor S2D, enter the following command:

Get-StorageSubSystem *Cluster* | Debug-StorageSubSystem

Each reported fault contains the following fields:

Severity

A practical description of the problem

Recommended next steps to resolve the problem

The physical location of the faulting object: if fault domains are defined, the fault is shown nested under its site, rack, chassis, and server

A description of the faulting resource: if fault domains are defined, this is also shown using the nested relationship

Why does Lao Wang call it friendly compared with ordinary monitoring software? Because its fault display is very clear. It can tell you directly that a network cable has come loose, which network adapter or which server has lost its connection, or which disk has dropped out.

In the example, there is a critical-level entry indicating that hv01 is missing, and the location and description fields below it automatically show where the server sits according to the nested fault domain relationship.

The command Get-StorageSubSystem *Cluster* | Debug-StorageSubSystem can only be run when S2D is enabled in the cluster.

By default, this command shows faults that affect the overall operation of the S2D cluster, most of which relate to hardware or configuration.

You can also run:

Get-Volume -FileSystemLabel <Label> | Debug-Volume

Get-FileShare -Name <Name> | Debug-FileShare

These return faults that affect only the specified volume or file share, which usually relate to capacity planning or resiliency feature configuration.

In addition to fault reporting, the health service also collects performance data: useful metrics gathered while S2D is running, such as CPU utilization, IOPS, and capacity.

Run Get-StorageSubSystem *Cluster* | Get-StorageHealthReport to display the overall performance report of S2D.

To sample the S2D performance report a given number of times at one-second intervals:

Get-StorageSubSystem Cluster* | Get-StorageHealthReport -Count <Count>

To display a performance report for a volume or node in S2D:

Get-Volume -FileSystemLabel <Label> | Get-StorageHealthReport -Count <Count>

Get-StorageNode -Name <Name> | Get-StorageHealthReport -Count <Count>

The health service has one more function: it can be used to monitor important jobs that are running during the operation of S2D.

Get-StorageHealthAction

If S2D is currently performing any of the following operations, the command lists them:

Handling a physical disk that is about to fail, has lost connectivity, or is unresponsive

Replacing physical disks in the storage pool

Restoring full data resiliency

Rebalancing the storage pool

Basically, these are the three functions the health service currently provides. Microsoft hopes that the health service will help cluster administrators monitor and operate more efficiently, using practical fault logs and performance metrics. The fault logs integrate with fault domains: when an error occurs, the fault domain hierarchy is nested automatically to help administrators locate the problem. At present, this capability is used only for S2D. Lao Wang hopes that more and more cluster functions will support the health service in the future.

In this article, Lao Wang mainly discussed the concepts above with you. For implementation steps, please refer to Lao Wang's other blog post, "WSFC2016 fault domain site awareness".

