
Daily management and operation of WSFC


In this article, Lao Wang will walk you through some day-to-day management functions of WSFC; the material is relatively easy. Some of it you may have seen before without knowing what it means, or without ever having used it. Lao Wang hopes that through this article more people will learn that WSFC has these management functions and how to operate and use them.

Lao Wang will focus on two areas: one is the run-time operation and placement of WSFC, the other is the maintenance and updating of WSFC.

Speaking of the WSFC placement strategy, the first concept to discuss is the owner. In WSFC, whether we perform planned maintenance or an unplanned failover, the cluster always migrates the resources off the node being maintained or failing. Where do they migrate to? The first concept to consider is the owner. By default, if you install a cluster and configure nothing extra, then when a node fails, the resources on it are placed essentially at random on the other live nodes in the cluster, because to those cluster resources every surviving node looks the same: any of them will do.

Starting with 2008 R2, WSFC clusters implement smart placement. That is, if no placement configuration has been made for a cluster application, then by default, when a node undergoes planned maintenance or an unplanned failover, the cluster evaluates the surviving nodes, determines which carries the fewest cluster resources, and tries to transfer the failed node's resources to the least-loaded node to keep them running.

Take 2008 R2 as an example. We currently have three nodes and two cluster applications, devtestdtc and devtestdtc1; devtestdtc1 is currently running on Node3, and devtestdtc is currently running on Node1

If you power off Node1 directly, you can see that devtestdtc did not go to Node3 but went to Node2, which carried no load.

The first owner concept is the preferred owner. As noted above, if nothing is configured, then for a cluster application every node is an equally valid transfer target when a failure occurs. But once we set a preferred owner, it is tantamount to telling the application: when planned maintenance or an unplanned migration occurs, go to this preferred node first; you will run better on it.

Open the cluster application's properties to see the preferred owner setting. Nothing is checked by default, meaning all nodes are equal in the eyes of the cluster application

In this example, we manually set the preferred owner of devtestdtc to Node3
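
For reference, the same setting can also be made from PowerShell. A minimal sketch using the FailoverClusters module, with the group and node names from this example:

# set Node3 as the preferred owner of the devtestdtc group
Set-ClusterOwnerNode -Group devtestdtc -Owners Node3

# confirm the preferred owner list
Get-ClusterOwnerNode -Group devtestdtc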

Currently the preferred owner of devtestdtc is set to Node3, which already has the application devtestdtc1 running on it.

Here we choose to move devtestdtc to another node. Choosing the best possible node means letting the cluster evaluate according to the placement policy and pick the most suitable node for us.

By default, smart placement should select the unloaded Node2, but since we manually set the preferred owner of devtestdtc to Node3, devtestdtc is placed on Node3

It can be seen that the preferred owner setting outranks the cluster's default smart placement: the cluster perceives that a manual assignment exists, so the preferred owner prevails.

Another important concept is the possible owner, which has existed since 2003. For a cluster application, during planned maintenance or an unplanned failover, which nodes may it be transferred to? By default a cluster application can go to any node, but we can edit the possible-owner list manually so that the application can only come online on the specified nodes; if none of the specified nodes is available, the application will not be brought online at all.

In this example we use four clustered servers, with devtestdtc hosted on Node1 and devtestdtc1 hosted on Node3

The current preferred owner of devtestdtc is Node3

If you power off Node1 directly, you can see that devtestdtc went to Node3 as we expected, so the preferred owner setting takes effect for planned maintenance and unplanned failover alike.

Open the devtestdtc properties; under the advanced policies you can see that all cluster nodes are currently possible owners, so devtestdtc can still go to other nodes even when the preferred owner Node3 is unavailable.
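
The possible-owner list is edited per resource, and it too has a PowerShell equivalent; a small sketch, assuming a resource named devtestdtc inside the group of the same name:

# restrict the possible owners of the devtestdtc resource to Node1 and Node2
Set-ClusterOwnerNode -Resource devtestdtc -Owners Node1,Node2

# confirm the possible-owner list
Get-ClusterOwnerNode -Resource devtestdtc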

Let's look at another example. Currently both devtestdtc and devtestdtc1 run on Node1. The preferred owners of devtestdtc are set to Node1, Node2, and Node3, but its possible owners are only Node1 and Node2; devtestdtc1 has no settings at all.

Preferred owners of devtestdtc

Possible owners of devtestdtc

Devtestdtc1 has no settings configured

At this point HV01 and HV02 are powered off directly. As you can see, because devtestdtc1 has no settings, it is placed on Node3 according to smart placement; devtestdtc is transferred to the node but cannot come online, because no qualified possible owner remains.

Although we set the preferred owner of devtestdtc to Node3, the possible owners of devtestdtc are only Node1 and Node2, so devtestdtc will not come online on the preferred owner Node3. As you can see, under both default smart placement and the preferred owner setting, the application will not come online unless a qualified possible owner exists: the possible-owner setting overrides the preferred owner and smart placement.

So far we have covered the preferred owner and the possible owner. Lao Wang thinks these two concepts look mundane, but each has its uses. For example, if you know your cluster application works well on certain nodes, you can set the preferred owner to ensure that during planned maintenance or an unplanned failure, as long as a preferred node is alive, the application runs there first. If you know some cluster nodes have very old, inefficient hardware, you can set the possible owners of key cluster applications so they run only on the high-performing nodes, and critical applications never execute on the old ones.

In addition to the preferred owner and the possible owner, the 2008 R2 era also added a group attribute, persistent mode, which Lao Wang also calls the default owner.

So what is the default owner? Simply put, once you check this attribute for a cluster resource, the cluster remembers which node the resource was running on before the next cold start, and when the cluster cold-starts again after a failover it lets the application run back on the node it occupied before. With this attribute you can pin applications that are sticky to a node: for example, if you want some key VM to always run on a given node, you can enable persistent mode.

In 2012 this feature is hidden from the GUI and is controlled through PowerShell; it is enabled by default for VMs in the 2012 era.

The current devtestdtc1 has no preferred owner or possible owner set; we simply check enable persistent mode
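
Since the checkbox disappeared from the GUI in 2012, the property can be managed from PowerShell. A sketch, on the assumption that persistent mode maps to the group's PersistentState common property, with DefaultOwner recording the remembered node:

# enable persistent mode for the group (assumed mapping: 1 = enabled, 0 = disabled)
(Get-ClusterGroup devtestdtc1).PersistentState = 1

# inspect which node the cluster currently remembers for the group
Get-ClusterGroup devtestdtc1 | Format-List Name, OwnerNode, PersistentState, DefaultOwner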

When Node3 fails, devtestdtc1 is transferred to Node1

When Node3 recovers, restart the cluster service on Node1 to simulate a cold start of the node

You can see that the application runs back to Node3; the default owner setting has taken effect

In actual use, Lao Wang found the following points about the default owner worth noting.

1. If you manually move an application with persistent mode enabled to another node, the cluster remembers the new node. For example, devtestdtc1 currently runs on Node3; if you manually move it to Node1, then when the cluster node cold-starts, devtestdtc1 will not return to Node3, because on a manual move the cluster records Node1 as the new default owner of devtestdtc1.

2. The default owner setting does not take effect immediately after the node recovers. After the transfer you need to restart the cluster service, or reboot, before the application returns to the default node.

3. If the resource has a preferred owner set, the preferred owner setting outranks the default owner setting. For example, if the preferred owner of devtestdtc1 is set to Node1, then when Node3 fails, devtestdtc1 will keep running on Node1 and will not return to Node3.

4. The default owner can be regarded as the best node among the possible owners. If the cluster application has a preferred owner specified, the preferred owner is used first; the default owner is considered only when the preferred owner is unavailable.

5. Both the default owner setting and the preferred owner setting must respect the possible-owner list. If the possible-owner list changes, for example the application's default owner is removed from it, the application will not go back to the default owner node but will choose another available node from the possible-owner list.

In 2008 R2, WSFC also added a new attribute for resource placement, ClusterGroupWaitDelay. If we have set a preferred owner or enabled persistent mode, then on each cold start the cluster waits for the application's preferred owner or persistent node to come online and brings the application online there first, which ensures the application always goes back to the node we want it on.

In the 2008 R2 era the ClusterGroupWaitDelay attribute defaults to 30 seconds, and in 2012 R2 to 120 seconds; it can be set through PowerShell. After a cold start, if the preferred owner or default owner does not come online within this window, the cluster chooses another possible owner to bring the application online.

You can also enable failback for the cluster application, so that even if the preferred owner and default owner are not online within the ClusterGroupWaitDelay window and the application comes up on another possible owner, it can still return to the original node through failback once the preferred owner or default owner recovers.

# manually modify the ClusterGroupWaitDelay time (in seconds)

(Get-Cluster devcluster).ClusterGroupWaitDelay = 300
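
To check the current value before changing it:

# query the current cluster-wide wait delay (in seconds)
(Get-Cluster devcluster).ClusterGroupWaitDelay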

To sum up, the owner is the first concept to look at in cluster placement. Both the preferred owner and the default owner can be understood as advisory (soft) constraints: once set, whether during planned maintenance or unplanned failover, the cluster tries to place resources on the preferred owner, or on the default owner if the preferred owner is unavailable; if no preferred owner is set, the default owner setting takes effect directly.

If neither the preferred owner nor the default owner node can come online after the wait, the cluster application still tries to come online on other possible owner nodes; the application is not kept offline just because the preferred and default owners are absent, since by default the cluster wants every application continuously available online. However, if we want an application never to run on certain nodes, we can achieve that by manually editing the possible-owner list. The possible owner is a mandatory (hard) constraint that overrides the preferred owner, the default owner, and smart placement.

Having covered the owner concepts, let's look at priority, which is also important for cluster placement. By default, if no priority is set, then when nodes start up and shut down, or fail over, the applications scramble for resources at random. Everyone is equal, everyone wants to grab resources and come online quickly, and at that point a start-up storm can occur.

For example, if your node servers' resources are limited and a single node cannot carry all applications, then when a failover occurs and only that node is left, it can easily happen that many important cluster applications stay offline while less important ones come online first and seize the resources, leaving the important applications without sufficient compute.

Setting priorities on cluster resources avoids this problem. The priority setting takes effect in the following scenarios:

When cluster nodes shut down and start up, high-priority applications are brought online first.

When a node is put into maintenance mode, high-priority applications are migrated first

When a node fails over, high-priority applications are transferred first

The priority function was introduced in the 2008 R2 era, where we could only set whether resources auto-start: we could mark unimportant resources so that they do not start automatically after a cold start or failover and instead wait for the administrator to bring them online manually, which gave an initial guarantee that important applications would not be preempted by unimportant ones.

In the 2012 era the priority function matured. We can not only set whether resources auto-start but also set resource priority to high, medium, or low, which helps us control start-up storms at a finer granularity.

For example, when a node reboots, undergoes planned maintenance, or fails over, the cluster handles high-priority resources first, ensuring they are transferred and brought online first, then the medium-priority ones, and finally the low-priority ones. If the high- and medium-priority workloads fill a node's CPU and memory, low-priority applications do not preempt resources from the high- and medium-priority applications. Resources set not to auto-start must be brought online manually by the administrator after a failover. In a 2012 virtualization scenario, if the node hosting a high-priority virtual machine fails and no other server has memory available, the cluster preempts the resources of a low-priority virtual machine, takes it offline to release them, and brings the high-priority virtual machine online. Priorities can also be assigned from PowerShell, as shown below.
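
Priority is exposed as a group common property; a minimal sketch using the application names from the experiment that follows (the documented values are 3000 = High, 2000 = Medium, 1000 = Low, 0 = No Auto Start):

# assign the four priority levels used in the experiment below
(Get-ClusterGroup Test1).Priority = 3000
(Get-ClusterGroup devtestdtc).Priority = 2000
(Get-ClusterGroup Test2).Priority = 1000
(Get-ClusterGroup devtestdtc2).Priority = 0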

Besides handling start-up storms and transfer storms, priority can also help with dependency scenarios. For example, suppose the cluster runs a SharePoint environment consisting of an AD domain controller, sharepointdb, and sharepointweb. With sufficient resources, we can set the AD domain controller's priority to high, the database's to medium, and the web tier's to low, so that when nodes start and stop, AD comes online first, then the database, and finally the web tier; planned maintenance or failover likewise transfers AD first, then the DB, and finally the web tier.

The experiment verifies this. The current cluster has two nodes and four cluster applications, with priorities of high, medium, low, and do-not-auto-start. All applications are currently hosted on the HV01 node.

When HV01 suddenly goes down, you can see the cluster applications brought online in priority order. devtestdtc2, which is set not to auto-start, stays offline; after the administrator confirms there are sufficient resources, it is brought online manually.

Handling the high-priority Test1 first

Handling the medium-priority devtestdtc

Handling the low-priority Test2

devtestdtc2, set not to auto-start, is not processed

Any technology has its meaning; the key lies in whether people dig deep into its uses. Lao Wang believes the significance of the priority setting is to help deal with start-up storms, and also with dependent start-up.

If your cluster resources are well planned and node resources are abundant, so that even a single remaining node can support all applications, then you may not need priority settings, unless your environment involves dependent start-up, in which case priority settings are still worth using.

If your server resources are limited, or your cluster hosts important applications, Lao Wang suggests using the priority feature and gradually planning resource priorities to ensure the important applications come online first after a failure or cold start. Priority settings take effect for cluster roles and virtual machines alike. The priority setting sometimes sacrifices the availability of low-priority applications to guarantee the availability of high-priority resources, but at least it ensures the key applications come online first when resources are insufficient; when resources are sufficient again, you can re-plan so that all applications can stay online.

Another factor to consider in the placement strategy is anti-affinity. What is anti-affinity? Simply put, even using preferred owner, default owner, possible owner, and smart placement, there may be no way to prevent two resources from running on the same node at the same time; when that node goes down, the whole application must fail over, and the application suffers downtime.

For example, we deploy two AD domain controllers, DC1 and DC2, in a WSFC cluster. Suppose the two virtual machines are running on the same node and that node suddenly loses power: both virtual machines must fail over to other nodes, and during the failover users have no way to log on to the domain.

Anti-affinity solves this problem. We can give two resources the same anti-affinity attribute, so that whether a resource is manually moved to the best possible node, drained by maintenance mode, or failed over, as long as it sees a resource with the same anti-affinity attribute on a target node it is not moved there. This ensures that two resources with the same anti-affinity attribute never run on the same node, which helps reduce application downtime.

In WSFC clusters, anti-affinity could be achieved through custom classes in the 2008 R2 era. Since 2012 a new PowerShell command encapsulates the custom-class procedure and makes it easier, although it still cannot be configured in the GUI. In SCVMM and Azure the same idea is exposed in the GUI as availability sets, likewise used to increase application availability.

Experimental verification

Currently there are three nodes in the cluster and two cooperating DTC applications, running on HV01 and HV03 respectively.

Currently no preferred owner is set. HV01 goes down directly, and devtestdtc is placed on HV02 according to memory smart placement.

Move devtestdtc back to HV01 and set HV03 as the preferred owner

When HV01 goes down again, you can see that the resource is not placed on HV02 per the memory smart placement policy but on HV03 per the preferred owner policy; the preferred owner overrides default memory smart placement.

Move devtestdtc back to HV01 again, and the preferred owner is still set to HV03

# set the same anti-affinity attribute on devtestdtc and devtestdtc2

(Get-ClusterGroup devtestdtc).AntiAffinityClassNames = "DTC"

(Get-ClusterGroup devtestdtc2).AntiAffinityClassNames = "DTC"

Anti-affinity attributes can take effect for cluster roles or clustered virtual machines, and the same cluster role or virtual machine can carry multiple anti-affinity class names
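
Because AntiAffinityClassNames is a string collection, the documented pattern for adding a second class is to read the collection, modify it, and write it back; a sketch (the extra class name SQL is purely illustrative):

# inspect the current anti-affinity classes on all groups
Get-ClusterGroup | Select-Object Name, AntiAffinityClassNames

# add a second, hypothetical class name to devtestdtc
$group = Get-ClusterGroup devtestdtc
$classes = $group.AntiAffinityClassNames
$classes.Add("SQL")
$group.AntiAffinityClassNames = $classes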

When HV01 goes down again, you can see that devtestdtc did not choose its preferred owner HV03 but went to HV02: the anti-affinity policy took effect.

Looking at the cluster log, you can see that when the cluster evaluates the placement policy, it sees that Node 2, i.e. HV03, carries the custom class (the AntiAffinityClassNames value DTC), so RCM declines to place the resource there and finally decides to place it on Node 4, i.e. HV02.

So anti-affinity is useful in a number of scenarios. For example, in a virtualized cluster running two AD VMs, two DHCP VMs, two DNS VMs, two SQL VMs, or any other application delivered by a cooperating pair of nodes, if you want one virtual machine always providing service, you can give the two virtual machines the same anti-affinity class to ensure that under normal circumstances they are always spread across different nodes, preventing a single node failure from forcing the whole application to fail over.

Through practice, Lao Wang sums up some rules about anti-affinity.

The anti-affinity setting takes precedence over the preferred owner, the preferred owner over the default owner, and the default owner over memory smart placement.

Anti-affinity, preferred owner, default owner, and smart placement must all be backed by the possible-owner list.

The anti-affinity setting is also an advisory setting that works only while the cluster has more than one node. If only the last node is left in the cluster, two applications with the same anti-affinity attribute will both come online on it; with just one node remaining, anti-affinity does not keep the two applications from coming online.

There is still a lot to say about anti-affinity; we will come back to it after covering maintenance mode and failback.

In a cluster we often encounter the concept of the best possible node, and many friends may wonder what it is and how "best" is judged. The best possible node is simply the most suitable node the cluster selects for us: when we click best possible node, the cluster weighs anti-affinity, preferred owner, possible owner, and the smart placement strategy together and settles on the most suitable node.

Initially, with no extra configuration, clicking move-to-best in the 2008 R2 era makes the cluster pick, among the other live nodes, the one currently carrying the smallest cluster application load, per the smart placement strategy.

In the 2012 era, clicking move-to-best considers not only each node's application load but also its remaining memory. 2012-era smart placement, also called memory smart placement, weighs node cluster application load and memory load together, and picks as best the node with the most remaining memory and the least load.

If the cluster application has extra settings, the cluster re-evaluates the best node

If the application sets a preferred owner, the preferred owner is used first as the best node

If the application sets both a preferred owner and anti-affinity, anti-affinity takes precedence and another node is selected as best per memory smart placement.

If the application sets a preferred owner, anti-affinity, and possible owners, and the node chosen by anti-affinity is not a qualified possible owner, another node is again selected as best per memory smart placement.

In a WSFC cluster, besides the best-node action invoking memory smart placement, during planned maintenance or unplanned failover the cluster also, by default, places cluster applications on suitable nodes based on memory smart placement; if preferred owner, anti-affinity, or possible owner settings are detected, it filters layer by layer. By default the cluster's instinct is always to bring applications online as soon as possible, planned or unplanned, so cluster load balancing in 2012-era WSFC failover considers, by default, only the memory load and cluster application load just described.

If you have SCVMM in your environment and manage the cluster through it, SCVMM's dynamic optimization can cooperate with the cluster's memory smart placement. By default, when a cluster node fails, virtual machines are quickly transferred and brought online per the memory smart placement strategy; afterwards, SCVMM detects the change in each node's load and dynamically rebalances the nodes according to comprehensive metrics such as CPU, memory, disk IO, and network, achieving deeper and more accurate load-balancing control.

If SCVMM perceives the cluster's preferred owner, anti-affinity, possible owner, quorum, or other settings while performing dynamic optimization, it complies with them as well.

The WSFC cluster focuses on bringing applications online after failover with simple application and memory load scheduling, while SCVMM focuses on load balancing across the whole virtualized cluster environment, so combining the WSFC cluster with SCVMM achieves real load balancing of virtualized resources.

When using the best-node function there is one thing to note. When we select a single role and move it to the best possible node, memory smart placement picks the node with the least memory use and load. But if we select multiple roles at once and move them to the best possible node together, they may not all land on the same node, because memory smart placement handles only one role or one virtual machine at a time. The best node for this virtual machine may be HV01 because HV01 has no load, yet when the next virtual machine is about to move, the evaluation is redone: now HV01 and HV02 each carry one cluster resource, so which node is best? At that point the choice falls to memory usage, and if memory usage is also tied, the best node is selected from the possible-owner list.

Above we looked at the concepts involved in cluster run-time placement: memory smart placement, preferred owner, persistent mode, possible owner, priority, and anti-affinity. This covers most of the points to consider in placement. Now let's take a comprehensive look at how these concepts apply in different placement scenarios.

Manually move to the specified node

Priority does not apply, memory smart placement does not apply, preferred owner does not apply, anti-affinity does not apply. When manually moving to a specified node, the cluster only evaluates whether the target node is a possible owner; if the node is in the possible-owner list and has sufficient resources, the move can proceed.

Manually move to the best node

Priority takes effect: the cluster processes applications per the priority setting, moving high-priority applications first to ensure they come online first. After priority fixes the processing order, the placement strategy evaluates each application's placement in that order. By default the cluster evaluates memory smart placement against the possible-owner list. If the application is detected to have a preferred owner set, it moves to the preferred owner node as the best node, overriding the memory smart placement decision. If anti-affinity is set, anti-affinity outranks the preferred owner, the preferred owner decision is ignored, and another node is selected as best per memory smart placement. Both kinds of manual move have PowerShell equivalents, shown below.
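
A small sketch with the group name used earlier:

# move to an explicitly specified node (only the possible-owner list is checked)
Move-ClusterGroup devtestdtc -Node HV02

# omit -Node and the cluster itself picks the target per the evaluation above
Move-ClusterGroup devtestdtc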

Cold start of the whole cluster

Priority takes effect: cluster nodes bring high-priority applications online first in priority order. If only the last node is left, high priority comes online first, then medium, then low; if the low-priority applications end up with no compute resources available, they stay offline. As nodes gradually come online, an application detected to have a preferred owner set and failback enabled fails back to run on the preferred owner, but if the target node is detected to already host an anti-affinity resource, another node is selected instead.

When only the last node is left to start, then as long as that node is a qualified possible owner, the application comes online on it, even if it is not the application's preferred owner and even if two applications with the same anti-affinity class are put on it together.

Cluster node failover

Priority takes effect: the cluster handles high-priority applications first, transferring them before the rest, which queue and wait. After the processing order is confirmed, the cluster evaluates each application against the placement strategy. It first considers memory smart placement against the possible-owner list, preferring to put high-priority applications on the nodes with the least memory and application load. If the application is perceived to have a preferred owner set and the preferred owner is alive, the application is placed directly on the preferred owner node. If the application is detected to have anti-affinity set, anti-affinity outranks the preferred owner: on failover, resources sharing an anti-affinity class are not put together as long as more than one candidate node exists, and the node is instead chosen from the possible-owner list with the memory smart placement strategy, so that anti-affinity is honored.

From these scenarios we can see that the priority setting is applied before the cluster runs the placement policy: priority determines the processing order, after which the placement policy uses memory smart placement, preferred owner, and anti-affinity to pick a suitable node for each application in turn. But memory smart placement, preferred owner, and anti-affinity are only advisory attributes: if only one node is left in the cluster, the application is transferred there anyway, ignoring all three; and if the node they select is not in the possible-owner list, the selection is redone, because the final placement node must come from the possible-owner list.

Besides manual moves, best-node moves, cold start, and failover, the cluster has one more placement behavior: maintenance mode, i.e. planned maintenance. What is planned maintenance? It means we know maintenance is coming: some node is going down for servicing, perhaps to replace hardware or to address performance problems. In a scenario where we know the disruption will happen, we can drain the applications off the node to be maintained, and shut it down and reconfigure it only after the migration completes.

An unplanned failure means a node suddenly goes down when we do not expect it, due to network or storage problems, and the applications on it are failed over.

The difference between planned maintenance and unplanned failover is that with planned maintenance we know downtime is coming, so we can migrate the applications away as smoothly as possible and minimize downtime; unplanned failover involves taking cluster groups offline and remounting cluster disks, so the downtime of planned maintenance is usually much shorter.

In the past, when the cluster itself had no planned-maintenance technology, we had to plan it ourselves. For example, decide that on Thursday night we will do planned maintenance and reconfiguration on the nodes; then on Thursday night, manually move the resources off each node, make sure all resources have moved away, shut down, reconfigure, power back on, and repeat for each node in turn.

In the 2008 era the cluster gained a pause mode: we can pause a node by clicking it, but the pause function only tells the other nodes, "I am currently paused; your resources should not be migrated to me." The administrator still has to move the paused node's resources away manually. Pause is especially useful in cluster update scenarios. Without pause mode we would have to worry that, while a node is being patched, resources might drift onto it; when the patched node reboots, those resources would fail over and incur downtime. With pause mode there is no such worry: put the node into pause, manually drain the resources off it, and then patch the node at ease, since no other resources will consider a paused node during placement.

In the 2012 era the cluster is more intelligent. 2012 WSFC realized that when we pause a node it can not only keep other resources off the current node, but also evaluate memory smart placement, preferred owner, anti-affinity, and possible owner per the placement strategy, and automatically drain the resources off the paused node while minimizing downtime. Virtual machines are drained by zero-downtime live migration; cluster roles are moved by switching the cluster group, and switching a cluster group's owner may involve briefly taking the role offline and reconnecting, which is the only downtime incurred during planned maintenance.
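
The GUI actions described below correspond to two 2012-era cmdlets; a minimal sketch for the HV02 node used in the following experiment:

# pause HV02 and drain its roles per the placement policy
Suspend-ClusterNode -Name HV02 -Drain

# ...patch or reconfigure the node...

# resume HV02 and immediately fail the drained roles back to it
Resume-ClusterNode -Name HV02 -Failback Immediate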

Experimental verification

In this example, Lao Wang shows a comprehensive experiment. The cluster currently has three working nodes, HV01, HV02, and HV03, with five applications running on them. The placement strategies of these applications are listed below; afterwards, the HV02 node will be put into maintenance mode.

Test1: preferred owner HV01, so maintenance mode moves it directly to the HV01 node

Test2: preferred owners HV02 and HV03, but possible owners only HV01 and HV02, so maintenance mode first tries the HV03 node, finds it is not a qualified possible owner, and moves the role to HV01 instead

Devtestdtc: preferred owner HV01, but because it shares the anti-affinity class DTC with devtestdtc2 it is moved to HV03

Devtestdtc3: no preferred owner set, so it is placed on HV03 per the memory smart placement policy, but being set not to auto-start it stays offline once placed.

In the Nodes view, click HV02, click Pause, and choose to drain the roles; the node's roles are then placed according to the placement policy as described. If you choose not to drain the roles, then, as in the 2008 era, the node merely declares that it accepts no resource placement, and the administrator must drain it manually.

After we choose to drain the roles, you can see the node is first set to the draining state.

A successful drain per the placement policy shows the node as paused; if some roles were not drained successfully, a prompt is shown.

As you can see, the cluster application has been placed as we expected.

Handling the high-priority Test1 virtual machine first

Handling the medium-priority virtual machine Test2

Handling the medium-priority role devtestdtc

Test1 virtual machine placement strategy

Opening the cluster log, you can see that for cluster resource placement the concepts we already know, memory smart placement, preferred owner, possible owner, and anti-affinity, are all implemented as filters. When we drain the node, HV02 first waits for the RCM-plcmt component to evaluate each node against the placement filters and produce a placement manager result; RCM then returns the result to the HV02 maintenance-mode node, which places the roles according to RCM-plcmt's conclusion.

On HV01, according to the RCM placement component, Test1 should go directly to the preferred owner node HV01

HV02 waits for RCM to return the placement result; after receiving it, HV02 moves accordingly and places the Test1 virtual machine on the preferred owner HV01.

Test2 placement strategy

The preferred owners of Test2 are set to HV02 and HV03, so after HV02 enters maintenance the cluster first tries to live-migrate Test2 to HV03. But as you can see on HV03, since the possible owners of the Test2 virtual machine are only Node1, i.e. HV01, and Node4, i.e. HV02, it cannot be placed on Node2, i.e. HV03, and the RCM placement component on HV03 re-decides that Test2 should move to the possible owner HV01.

HV02 receives the RCM placement result from HV03 and re-decides to move the Test2 virtual machine to the HV01 node rather than to the preferred owner HV03

Devtestdtc3 placement strategy

Looking at the cluster log, you can see that when HV02 needs to handle devtestdtc3 it first asks RCM where to place it, and after filter evaluation it is finally decided that devtestdtc3 should be placed on Node2, i.e. HV03, per the memory smart placement policy.

HV02 receives the RCM placement result and starts to move the devtestdtc3 role to the HV03 node.

Devtestdtc3, being set not to auto-start, is first taken offline, but will attempt to come online later if resources are sufficient.

Devtestdtc placement strategy

Because the preferred owner of devtestdtc is HV01, it would normally be transferred to HV01 first, but on HV01 the RCM-plcmt filter evaluation finds the custom class, i.e. the anti-affinity resource devtestdtc2, so the decision to place it on HV01 is canceled, and the placement manager finally decides it should be placed on Node2, i.e. HV03.

HV02 receives the result returned by RCM and moves devtestdtc to HV03. On HV03 you can see the move request received from HV02 being accepted and the devtestdtc role migration completing; devtestdtc finally runs on HV03.

Once the drain completes, the maintenance node carries no load and is set to pause mode; anyone trying to move other nodes' resources onto the maintenance node will find that it cannot be done.

At this point we can patch and reconfigure the maintenance node without affecting any cluster applications.

In the Microsoft product family, the concept of maintenance mode runs through many products. If the cluster is managed through SCVMM, the node can be put into maintenance mode from SCVMM: if VMM detects that the node belongs to a cluster, the virtual machines are live-migrated per the cluster placement policy; if the node does not belong to a cluster, the virtual machines are moved to other hosts by quick migration when maintenance mode is set. VMM can also integrate with SCOM: once integrated, when VMM puts a node into maintenance mode, SCOM puts the node into maintenance mode as well to avoid alerts during servicing. SCOM's maintenance mode mainly suppresses alerts and takes no operational action. In SCCM we can likewise set maintenance windows on a collection, so that a deployment required to install at a specified time is deferred for collections inside a maintenance window. If SCOM, SCVMM, and SCSM are combined, then SCVMM sets the node to maintenance mode, SCOM follows along, and SCSM, if configured to generate incidents from SCOM alerts, generates none during the maintenance window. Only SCVMM and WSFC clusters actually migrate resources off a node in maintenance mode

When node maintenance finishes, there are usually two choices. If only this node was maintained and its applications drifted elsewhere during maintenance, you can choose to fail back or not: failing back moves the drifted applications back, while not failing back leaves them where they are, perhaps so you can maintain other nodes first. If all cluster nodes are to be maintained, Lao Wang suggests choosing to fail the roles back, so you can maintain the nodes one by one, maintaining, failing roles back, then maintaining the next, and also ensure each application returns to its original node.

In 2012 WSFC there are two kinds of failback: the failback performed by maintenance mode, and the failback built into the application. The difference is that maintenance-mode failback returns applications to the original node regardless of whether a preferred owner is set, and moves them unless an application is detected to already be on its preferred owner; the application's own failback strictly follows the preferred owner, and if no preferred owner is set there is no failback.

What the two have in common is that in the 2012 era, whether it is failover failback or maintenance-mode failback, virtual machines are returned by live migration, and cluster roles are returned by taking the cluster group offline, moving it, and bringing it back online.

Click the paused node, then Resume, and you can see the option to fail roles back or not. If you choose to fail back, the applications that node previously ran are migrated back, unless an application is detected to already be running on its preferred owner, in which case it is not moved back.

In actual testing, Lao Wang found that maintenance-mode failback is a separate recovery mechanism. It is not the same as the application's own failback and does not automatically tick the application's failback option; even if your cluster application does not have failback checked, maintenance mode will still return it to the original node.

Speaking of failback, some friends here may never have used it, and not just in 2012: in earlier clusters too, clicking a virtual machine or a cluster role showed a failback option in its properties. What exactly is failback? Whether to fail back was already a topic worth pondering in the 2008 era. Back then the cluster had only one kind of failback: when a node fails, its applications migrate to other nodes, and when the node returns to normal, should the applications come back?

In the 2008 era, failback involved application downtime. When a failure occurred, virtual machines moved by quick migration and roles moved by taking the whole cluster group offline and online again, so the failover downtime was already a bit long; after the failover the application serves normally, and then failing it back meant another quick migration or cluster group offline, and downtime all over again. So in the 2008 era we generally did not choose failback unless the application was sticky and only ran well on its original node, and even when we did choose failback we usually made one more setting, the failback window: pick a period, say from 1:00 to 3:00 in the morning when no users are connected, and fail back then rather than immediately.
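
The failback window maps to group common properties; a sketch, assuming the documented values (AutoFailbackType 1 allows failback; the window hours run 0 through 23):

# allow failback for the group only between 1:00 and 3:00 in the morning
$g = Get-ClusterGroup devtestdtc
$g.AutoFailbackType = 1
$g.FailbackWindowStart = 1
$g.FailbackWindowEnd = 3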

From 2012 onwards, the downtime cost of failback matters much less, because on failback virtual machines are returned to the original node by live migration, and the failback time of traditional cluster groups has also been optimized.

Another attribute was added in 2012 R2: DrainOnShutdown. In 2012 R2, if a node shuts down normally, the virtual machines on it are live-migrated away first and the node shuts down afterwards. In the past, when maintaining a node, we sometimes forgot pause mode, ran the update directly on it, and shut it down; the virtual machines were then quick-migrated, causing downtime. 2012 R2 gives us this umbrella: even if we forget pause mode, the virtual machines are live-migrated away, unless power is lost suddenly and there is no time for live migration, in which case quick migration still occurs. It is still recommended to use pause mode during maintenance, but this is a genuinely good feature.
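
DrainOnShutdown is a cluster common property, so it can be confirmed from PowerShell:

# 1 = live-migrate VMs off a node on normal shutdown (the 2012 R2 default), 0 = disabled
(Get-Cluster).DrainOnShutdown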

Experiment to verify failback

Currently all cluster roles run on the HV02 node, with no anti-affinity settings and no preferred owner settings, but every application has immediate failback allowed

If the HV02 node is powered off directly, you can see the applications are distributed to other nodes.

When HV02 returns to normal, you can see that the applications did not all fail back to the HV02 node as one might expect: because no preferred owner was set, the failback does not take effect!

Once again, migrate all the applications back to HV02, and then set each application's preferred owner to HV02

HV02 is powered off again, and applications are migrated to other nodes.

HV02 recovers, and all the cluster applications fail back to the HV02 node, because the preferred owner is set and the failback therefore takes effect.

Above, Lao Wang introduced the cluster's placement strategy and maintenance failback. Placement and maintenance in the cluster also depend on quorum to get going. Consider that the main purpose of quorum is to ensure cluster availability: when nodes change, are added, go down, or partition, does it affect the cluster quorum model, and does the number of downed nodes keep the cluster working normally? If only the last node is left and quorum fails, we must force quorum to make the cluster available. Our placement and maintenance take effect only when quorum determines the cluster is available, so quorum and placement complement each other: without quorum the cluster cannot start and placement is meaningless; with quorum alone and no placement strategy, cluster management is equally meaningless.

Finally, a few tidbits to share: points Lao Wang gathered through practice that are easily overlooked about the placement strategy and maintenance.

Does "do not auto-start" really mean never starting automatically?

Lao Wang's verification found that "do not auto-start" takes effect in only one scenario: after a failover, the application is not started automatically, and the administrator must start it manually.

In the manual move, maintenance mode, maintenance-mode failback, and failback scenarios, a "do not auto-start" application waits for the high-, medium-, and low-priority applications to start, and if the node still has resources left, it automatically attempts to come online!

Does failback happen or not?

For the maintenance-mode drain and resume, roles are returned to the original node even if no preferred owner is set; a role detected to already be on its preferred owner node is not moved.

The application's own failback after a failure follows the preferred owner; if no preferred owner is set, it does not return to the original node.

Maintenance mode does not automatically set the application's failback attribute

Does anti-affinity actually keep resources apart?

Cases where it does not:

Maintenance mode

If the two anti-affinity applications are both on HV01 when it enters maintenance, and the preferred owner is set to HV02, anti-affinity fails and both applications go to HV02.

If the two anti-affinity applications are both on HV01 when it enters maintenance, and no preferred owner is set, the two anti-affinity applications may still be placed on the same node at random per the memory placement policy.

The key is that in maintenance mode anti-affinity needs a reference: if the target node already hosts an application with the same anti-affinity class, the role is not migrated there; but if all target nodes are empty there is nothing to reference, and anti-affinity fails.

Maintenance-mode failback

If the two anti-affinity resources start on HV01 with the preferred owner set to HV01, there is still a chance they are placed together by memory smart placement after HV01 enters maintenance. If the resume-with-failback option is chosen after maintenance, both applications revert to HV01 and anti-affinity fails, because HV01 is currently empty and maintenance-mode failback finds no reference.

If the two anti-affinity resources start on HV01 with no preferred owner set, they may be placed together at random by memory smart placement after maintenance mode; if failback is chosen after maintenance completes, both roles revert to HV01 and anti-affinity again fails. The reason is the same as above: maintenance-mode failback finds no reference, and even without a preferred owner set, the two return to the original node together.

Failover

If the two anti-affinity resources start on HV01, the preferred owner is set to HV02, and HV02 hosts no reference, then on failure both applications are transferred to HV02 and anti-affinity fails.

If the two anti-affinity resources start on HV01, no preferred owner is set, and no other node hosts a referenceable anti-affinity resource, the cluster evaluates per the default memory smart placement policy, and there is still a good chance the two anti-affinity resources are placed on the same node.

Failback after failover

If two resources share the same anti-affinity class and the same preferred owner, then when failback runs after a failover, if the preferred owner node is empty the reference is lost, anti-affinity is not considered, and the anti-affinity resources go back to the preferred owner together

When only the last node is left, anti-affinity fails.

Cases where it does:

The anti-affinity reference takes effect

Whether during a manual move to the best possible node, maintenance mode, maintenance-mode failback, failover, or failback, as long as a resource with the same anti-affinity class is detected on the target node, the role is not moved to that node.

A curious bounce

Some curious things happen in certain scenarios. For example, give two resources the same anti-affinity class and the same preferred owner HV02, with both currently running on HV01 and HV02 empty of resources, so there is no anti-affinity reference anywhere. Whether we run maintenance mode or fail over, both will go to the preferred owner HV02. Normally, for both maintenance-mode failback and the application's built-in failback, an application detected to be running on its preferred owner node is not moved back to the original node. But resources sharing an anti-affinity class behave differently: when HV01 returns to normal and rejoins the cluster, or another node joins, the anti-affinity resources try to spread onto the new node, even though by normal logic a resource sitting on its preferred owner should make no such attempt. That is why Lao Wang calls it a curious bounce.

At this point the article is drawing to a close, and I wonder how many friends will have read all the way to the end. In this article Lao Wang not only introduced the cluster's run-time placement and maintenance management functions, but also explored the underlying placement process through the cluster logs. Lao Wang believes that friends who love cluster technology will find their own gains here. When we learn a technology, we should not only learn to use it; more often we should ask why. When we move to the best possible node, fail over, or fail back, why does it produce this effect, why is the application placed on this node, is this placement what I expect, and through which controls can I influence it? After understanding the techniques introduced here, I believe you will have your own answers. Finally, I hope this article lets more friends know that the cluster has these management functions and these points worth thinking about, and leaves friends who already knew the placement functions with a deeper impression!

Easter egg

Ta-da, the Easter egg arrives! To reward the friends who read to the end, Lao Wang has prepared a little extra. Lao Wang said at the beginning that this article would focus on the operation, placement, maintenance, and updating of the cluster. In fact, the placement part might more appropriately be called the placement strategy, but Lao Wang calls it run-time placement because placement only makes sense once quorum determines the cluster is available: only when the whole cluster is up and providing service, whether through normal quorum or forced quorum, can we even reach the placement step. Lao Wang chose the name run-time placement in the hope that everyone grasps this idea.

This article covered the placement strategy, maintenance mode, and failback, but said little about updating. Since it was promised at the beginning, how could it be left out? So Lao Wang decided to discuss cluster updating with you in the Easter egg.

If your environment is currently virtualized without a cluster, or the cluster does not use pause mode, the ideal update process looks like this. First, establish a patching policy through WSUS. As for patches, Lao Wang suggests approving only critical updates, security updates, and definition updates at the system level; if the node is a Hyper-V server or a SQL node, be careful when approving application patches.

We should establish a test environment close to production. When WSUS detects a patch, apply it in the test environment first; only if the test environment shows no problems does it go to production. In summary: WSUS downloads only important updates rather than everything, and ideally a test environment exists so that new patches pass testing first and are applied to production afterwards.

In a 2012 environment the process should be: patches pass the WSUS test environment and are approved for production; nodes receive the patches but should not install them automatically, instead notifying for installation so it can be deferred. Without a clustered virtual environment, you can live-migrate the virtual machines manually, then patch and restart the host; if necessary, record which virtual machines the node hosted, and migrate them back after maintenance completes, all by hand. In an ideal environment, though, all virtualization nodes are managed by a VIM system such as VMM and treated as one overall resource pool; it does not matter which node a virtual machine lands on, and the VIM system rebalances the load.

If pause mode is not used in a cluster environment, the process is the same as above: in the 2012 era, virtual machine resources can be drained by live migration, while traditional cluster roles go offline and move. But patching a cluster this way carries a hidden danger caused by the placement strategy. Suppose we are about to patch this node and have just moved its resources away manually; if placement then occurs on another node, the cluster may still consider moving resources onto the node being patched. That node may need to reboot after patching, and the newly moved cluster applications then fail over, causing downtime. Therefore, in the 2008 era, to patch or reconfigure a cluster node we would put it into pause mode after manually moving the resources away. The other nodes then exclude the paused node from their placement decisions, and we can update and reconfigure with peace of mind. If necessary, record the roles and virtual machines the node hosted before the maintenance migration, and migrate them back manually after the update completes.

As you can see, whether in a 2008 R2 cluster or in an environment without a cluster, performing updates inevitably means a lot of manual work: moving resources away by hand, setting nodes to paused, even writing down beforehand which roles each node hosted and migrating them back after maintenance. Every update cycle is a big hassle for administrators.

All of this changed in the 2012 era. First, maintenance mode in 2012 became much smarter: when a node is set to maintenance mode, the cluster automatically drains its resources according to the placement strategy, and after maintenance you can choose to fail back, drifting all the roles back to the node. You no longer have to move resources manually or write down which applications the maintenance node hosted; just click pause before maintenance and click resume afterwards, and that is all. Once paused, the node is drained and announces, in effect, "I am a paused node, do not migrate anything to me." The administrator can then manually install the patches approved after WSUS testing, restart once the update is complete, and click resume; the roles that were there before migrate back, and the next node is handled the same way. What we still do by hand here is select and install the desired updates on each node.
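
The 2012-era equivalent collapses to two commands; a minimal sketch, assuming the same node name as before:

# Pause with drain: roles are moved off automatically per the placement strategy
Suspend-ClusterNode -Name Node1 -Drain

# ... install the WSUS-approved patches and reboot Node1 ...

# Resume with failback: the roles the node hosted before drift back automatically
Resume-ClusterNode -Name Node1 -Failback Immediate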

In its final form, the 2012 era also added a function dedicated to cluster node updates: CAU, Cluster-Aware Updating. Summed up in one sentence, Lao Wang would put it this way: a tool that manages patch updates for the cluster while keeping the cluster applications continuously available, automating the repetitive manual work that cluster updates required in the 2003-2008 era.

To put it simply, CAU is a cluster update coordination tool that works together with maintenance mode to complete updates. Think of it this way: CAU itself does not download patches; its patches can come from synchronizing directly with Microsoft, from WSUS, or from SCCM. CAU only does coordination and standardization, ensuring that the cluster update is carried out according to its orchestration and outputting a standard report when it is finished.

Coordination means that when we trigger a CAU maintenance run, CAU starts updating according to its draining logic as the nodes receive the WSUS patches: virtual machine resources are live-migrated away in accordance with the placement policy; once all the resources have been moved off, the node installs the patches, restarts, and checks whether any dependent patches are still missing; and once the installation is confirmed complete, CAU automatically performs the failback operation, migrating the node's previous load back.

As you can see, CAU does three things for us automatically: set the node to maintenance mode, install the patches, and fail back. Where before we had to click pause, click install, and click failback for every node, we no longer have to click three times; CAU does it all, and all we have to do is click once to start the CAU run.

CAU has two working modes. The first is the one Lao Wang just described: we click, that is, we choose an appropriate time to trigger a CAU update run against the cluster. Each time the update is triggered manually, the computer from which the run is launched temporarily acts as the CAU Update Coordinator, and the cluster nodes then complete maintenance mode, updating, and automatic failback one by one under CAU's direction. The key point of the manually triggered mode is that the administrator can choose a suitable time window and carefully review the patches before confirming the run.
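
A manually triggered run can also be started from PowerShell with the ClusterAwareUpdating module; a minimal sketch, assuming a cluster named CLUSTER1 and patches coming through the Windows Update / WSUS plugin:

Import-Module ClusterAwareUpdating

Invoke-CauRun -ClusterName CLUSTER1 `
              -CauPluginName Microsoft.WindowsUpdatePlugin `
              -MaxFailedNodes 0 -MaxRetriesPerNode 3 `
              -RequireAllNodesOnline -Force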

The other mode is fully automated self-updating. CAU adds a clustered role, a VCO, to the cluster, and that VCO acts as the CAU Update Coordinator. In fully automated mode you only need to specify a time window and nothing else has to be managed: whenever that window comes around, CAU automatically runs the update following the same draining logic.
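
Enabling self-updating mode from PowerShell is a matter of adding the CAU clustered role and giving it a schedule; a sketch, where the cluster name and the "second Sunday of the month" window are assumptions:

Add-CauClusterRole -ClusterName CLUSTER1 `
                   -CauPluginName Microsoft.WindowsUpdatePlugin `
                   -DaysOfWeek Sunday -WeeksOfMonth 2 `
                   -MaxFailedNodes 0 -RequireAllNodesOnline `
                   -EnableFirewallRules -Force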

Whether the update is triggered manually or runs fully automated, a report is generated after the whole cluster has finished updating, detailing whether every node was updated according to CAU's orchestration, whether all the expected patches were installed, and what exceptions occurred during installation. In Lao Wang's view, this is the point of CAU: it coordinates with maintenance mode to achieve continuously available cluster updates, and it outputs a standardized report when done, which both frees the administrator's hands and standardizes the work. Lao Wang will only introduce the main theory behind CAU here; his good friend ZJUNSEN, Zhang Junsen, has written a very good blog on the hands-on operation of CAU, which is worth reading if you are interested.
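
The report itself can be pulled back from PowerShell; a one-line sketch with the same assumed cluster name:

# Show the most recent CAU run with per-node, per-update detail
Get-CauReport -ClusterName CLUSTER1 -Last -Detailed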

Within Microsoft's update ecosystem, the tools that can detect a cluster in the current architecture and achieve zero-downtime updates through draining logic are, at present, only SCVMM and CAU. VMM likewise coordinates the update process with compliance baselines, draining nodes one by one to keep the cluster available, while the other update methods, WSUS, SCCM, VMST, and MBSA, are not cluster-aware and will by default cause some update downtime.
