
Advanced WSFC Log Analysis


In the basics installment on cluster log analysis, Lao Wang introduced the location and use of several cluster logs. The System log in Event Viewer can tell us the general cause when the cluster fails and point us in a direction. The FailoverClustering-Manager Diagnostic log under the application logs helps us trace the administrative operations performed around the time an event occurred. The FailoverClustering Operational log helps us confirm whether basic changes to cluster resources, network detection, and security are normal. Finally, the summary events in Failover Cluster Manager usually give us a clear direction, tell us where to look, show basic resource change information and operation records, and help us trace back some problems.

In some cases, however, the problem is not clearly explained in these logs, or we suspect it is not what the event log describes, and we still need more detailed information. At this point we can look at the cluster diagnostic log. What is the cluster diagnostic log? Simply put, it records all of the internal execution processes during cluster operation: resources going online and offline, migration, health detection, and so on. Almost everything related to how the cluster runs is recorded in this diagnostic log. By default, in 2012 it can be seen in Event Viewer, and while the cluster is running the log keeps updating and growing.

The biggest difference between the diagnostic log and the other logs is this: the other cluster-related logs also record cluster health and resource changes, but they are presented in a relatively friendly way in Event Viewer, with little low-level language, so we can basically understand them at a glance. The diagnostic log is different. It also exposes the cluster's core components: the activity of components such as RHS, RCM, and NetFT appears there. The diagnostic log is therefore the deepest and most detailed. Whether you want to do in-depth cluster troubleshooting or understand the internal operation of the cluster, learning to read the diagnostic log is the best choice.

When you look at the diagnostic log, you will see many abbreviations for the cluster's core components, as in the following figure. If beginners do not understand what an abbreviation means, they can copy it, search for it online, and note it down. Here Lao Wang picks three major core components to explain.

RHS: the Resource Hosting Subsystem, presented as the rhs.exe process at run time. What does this component do? It monitors the health of the various cluster objects in real time, such as cluster disks, cluster IP addresses, cluster network names, and applications, according to the detection rules defined in each Resource.dll and the detection policies we configure. RHS bases its detection mainly on two entry points in the Resource.dll loaded for each cluster resource object, which determine whether the resource is alive: the LooksAlive check and the IsAlive check.

As the name implies, LooksAlive checks whether the resource appears to be alive, so during a LooksAlive check RHS usually performs a relatively lightweight operation, such as checking every 3 seconds whether the cluster disk still accepts reservation requests. If LooksAlive cannot effectively determine whether a resource is alive, RHS also tries the IsAlive check for deeper verification. Compared with LooksAlive, IsAlive inspects the resource's health more thoroughly: for example, the IsAlive check for a cluster disk actually executes a Dir command, and the IsAlive check for SQL actually executes a query to confirm that the SQL cluster resource is alive. LooksAlive detection is therefore usually basic, and IsAlive detection is thorough. If the IsAlive check also fails, RHS reports the resource's status as failed and hands it back to the RCM component for further processing.

No matter what the cluster resource is, RHS tries to perform the LooksAlive and IsAlive checks on it, but the detection method differs per cluster resource object. For the cluster's default resources, such as the cluster IP address, cluster disk, and cluster network name, RHS performs detection by loading the built-in ClusRes.DLL. Different cluster objects may use different Resource.dlls: developers can integrate their programs with WSFC by writing a Resource.dll for their own application and registering it with the cluster.
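If you want to see which Resource.dll each registered resource type loads, you can query the resource types directly; a minimal sketch ("Physical Disk" is just the built-in type used as an example):

# list every registered resource type and the Resource.dll RHS loads for it

Get-ClusterResourceType | Format-Table Name,DllName

# for example, the built-in Physical Disk type points at clusres.dll

Get-ClusterResourceType "Physical Disk" | Format-List Name,DllName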

Normally a Resource.dll plays two main roles. First, it acts as an agent between the application and the cluster. When we perform online, offline, enable, disable, and similar operations on a cluster object in Failover Cluster Manager, the request is actually sent to the Resource.dll that corresponds to that resource, and the Resource.dll then tells the resource to change state. So if you want to write your own Resource.dll, first make sure the dll can respond to administrative operations on the corresponding resource objects.

Second, the Resource.dll must clearly define the LooksAlive and IsAlive detection methods for its specific resource type, and return FALSE when IsAlive detection fails; when RHS receives FALSE, it marks the resource as ClusterResourceFailed.

RHS detects the liveness of cluster resources according to all of the LooksAlive and IsAlive rules defined in the cluster's Resource.dlls. In general, if you want to cluster a custom-developed application, it is recommended to write a proper Resource.dll so the cluster can perform deeper detection; otherwise you can only use the default ClusRes.DLL to check the basic status of the application's process.

For example, after Microsoft's SQL Server and Hyper-V are clustered, RHS detects them using their own separate Resource.dll rules. Once SQL Server is clustered, it creates SQL-specific service resource objects in the cluster along with its own specific detection methods: the IsAlive check can actually issue a real query to verify that SQL is alive. Hyper-V likewise calls its own specific Resource.dll after clustering, which, through the advanced policies we define, can check whether the guest OS has blue-screened and whether services inside the guest OS are alive. When a resource is judged not alive according to our defined detection policy and the IsAlive check, RHS marks the resource as failed and reports it to RCM, and RCM then sees the RHS flag and attempts restart, start, and failover actions according to the resource's failover policy.

In the old 2003 era, all resources in the cluster were managed under a single RHS process, so if one resource object failed its detection, it could cause the other cluster resources to crash or restart along with it. Starting in 2008, Microsoft therefore placed some of the cluster's own resource objects, such as the cluster IP address, cluster name, and cluster disk, resources shared by the whole cluster, into a dedicated RHS process, while each clustered application we create can work in its own separate RHS process. That way, if RHS detection fails for a single clustered application or its process has problems, it does not affect the other resources in the cluster. You can view all of a cluster resource's properties, including the RHS-related ones, with the res /prop command, as shown after the parameter list below.

Several key parameters related to RHS:

MonitorProcessID: the ID of the RHS process that hosts the cluster resource. By viewing this parameter you can determine which cluster resources currently live in the same process.

SeparateMonitor: indicates whether the resource has been placed in a separate monitor (0: no, 1: yes)

IsAlivePollInterval: the default value is shown in the figure, indicating that it uses the default setting for that particular resource type.

LooksAlivePollInterval: the default value is shown in the figure, indicating that it is using the default settings for that particular resource type.

DeadlockTimeout: how long to wait before treating the resource as deadlocked. The default is five minutes.
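A minimal way to inspect these values, in both the legacy and PowerShell styles (the resource name "Cluster Disk 1" is an example):

# legacy cluster.exe syntax: dump all properties of a resource

cluster.exe res "Cluster Disk 1" /prop

# PowerShell equivalent: the RHS-related values are common properties on the resource object

Get-ClusterResource "Cluster Disk 1" | Format-List Name,MonitorProcessId,SeparateMonitor,LooksAlivePollInterval,IsAlivePollInterval,DeadlockTimeout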

As of 2008 R2, not all cluster resources run under the same RHS process: critical cluster core resources and clustered applications are already separated into different processes, to prevent the crash of a single clustered application from bringing down other cluster resources. By default, though, most cluster resources still operate in a shared monitor; when RHS detects that a cluster resource has failed or its dll has crashed, the resource is then placed into a separate monitor for detection, completely preventing it from affecting other cluster resources. When debugging a resource in the cluster, you also need to set its SeparateMonitor value to 1; otherwise debugging it could cause the other resources in the shared monitor to fail.

# set cluster resources to work in a separate monitor

(Get-ClusterResource "Resource Name").SeparateMonitor = 1

For example, in a 2008 R2 virtualization cluster, by default all virtual machines run under the same shared monitor, and a resource deadlock on a single virtual machine can make all of the virtual machines on that monitor unavailable; the problematic virtual machine can therefore be run separately in an isolated monitor. In practice, use isolated monitors cautiously: each separate monitor results in a separate RHS process, and each process consumes CPU and memory, so weigh your server resources before enabling this advanced feature.

RCM: the Resource Control Manager. As the name implies, this component manages cluster resources for us; everything from a cluster disk to a whole cluster group is managed through RCM. You could say that RHS's main job is to detect problems and report them, while RCM actually handles them, performing operations such as bringing resources online or offline, pending retries, and failover, based either on our administrative instructions to the cluster or on the results RHS detects.

RCM considers two things when performing an operation. The first is dependency. Suppose we bring a cluster group online: the group depends on the cluster network name, and the network name depends on the cluster IP address. When handling our online request, RCM first builds the dependency tree and then brings the resources online step by step according to the dependency logic: first it tries to bring the IP address online, once the IP address is online it tries the network name, and only when all depended-on resources are online does it bring the whole cluster group online.
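To see the dependency tree RCM walks, you can query it directly; a small sketch (the "Cluster Name" resource is used as an example):

# show what a single resource depends on

Get-ClusterResource "Cluster Name" | Get-ClusterResourceDependency

# show the dependency expression for every resource in the cluster

Get-ClusterResource | Get-ClusterResourceDependency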

Second, when RCM performs an operation, by default it evaluates the failure policies and advanced policies defined on the resource before acting. For example, when RHS reports that a cluster resource has failed, RCM retries bringing the resource online at intervals according to the failover policy; after a period of time it puts the resource into a failed state while still retrying restarts for a while. If the restarts keep failing, it also tries moving the resource to another node, and if that is still unsuccessful it leaves the resource in a failed state.
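The policies RCM consults can be inspected on the resource and group objects; a sketch with example names:

# resource-level restart policy RCM evaluates after RHS reports a failure

Get-ClusterResource "Resource Name" | Format-List RestartAction,RestartThreshold,RestartPeriod,RetryPeriodOnFailure

# group-level failover policy: how many failovers are allowed within what window

Get-ClusterGroup "Group Name" | Format-List FailoverThreshold,FailoverPeriod,AutoFailbackType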

From this we can see that RCM's main job is to carry out management operations on cluster resources, and, when cluster resources or groups fail, to evaluate dependencies, failure policies, and advanced policies in order to retry, fail over, and confirm state.

NetFT: NetFT usually refers to the Failover Cluster Virtual Adapter. After we install the cluster, if we show hidden devices in Device Manager we can see this virtual network adapter; it is also visible with ipconfig /all, though no IP address is configured on it. The main function of this adapter is to build a highly available topology for network communication inside the cluster, such as the heartbeat detection between cluster nodes. Every so often NetFT rebuilds this topology. Suppose Node 1 and Node 2 each have two network cards, one for the cluster network and one for cluster plus management traffic: when NetFT detects that heartbeats cannot get through on the dedicated cluster network, it dynamically switches to the other network card to carry the heartbeat. In short, NetFT automatically builds and maintains the cluster's existing communication topology to ensure, as far as possible, that the cluster network can communicate normally.
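If you want to see the hidden adapter and the networks NetFT can choose from, a small sketch (Get-NetAdapter assumes 2012 or later):

# the hidden NetFT adapter, normally invisible in Get-NetAdapter output

Get-NetAdapter -IncludeHidden | Where-Object InterfaceDescription -Like "*Failover Cluster Virtual Adapter*"

# cluster networks and their roles: 0 = excluded from cluster use, 1 = cluster only, 3 = cluster and client

Get-ClusterNetwork | Format-Table Name,Role,Address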

Having introduced several important pieces of theory, let's look at the cluster diagnostic log in practice. What does it look like? By default, in the 2012 era the diagnostic log can be seen in Event Viewer, but because it grows in real time and is not very intuitive, it is hard to find the information we want quickly. So besides Event Viewer, we can also generate the cluster diagnostic log with the PowerShell cmdlet Get-ClusterLog. Unlike the diagnostic log in Event Viewer, when we use Get-ClusterLog to obtain the log, whether on 2008 or 2012, it merges the diagnostic events from Event Viewer or the ETL files, filters out useless information, and leaves only the genuinely useful diagnostic information in a single log file that is easy for administrators to analyze.

By default, executing Get-ClusterLog in PowerShell outputs the cluster's diagnostic log, which you can then open and read. Note that by default the diagnostic log has a maximum size of 300 MB in 2012 R2; once this size is exceeded, older entries are overwritten as new ones arrive. In the 2008 era the limit was 100 MB.

If you execute Get-ClusterLog with no parameters, it outputs all of the log from the creation of the cluster to the present, assuming your cluster log has not reached 300 MB and been overwritten. Generating such a large log can take some time, and once Get-ClusterLog is executed it not only generates the diagnostic log on the current node but also asks the other nodes to generate cluster.log into their Report directories. There are therefore some parameters worth using together with Get-ClusterLog:

# if there are not many logs, you can run Get-ClusterLog directly; it outputs everything from the creation of the cluster to the present

Get-ClusterLog

# output only the last five minutes of the log

Get-ClusterLog -TimeSpan 5

# have only a specified node generate its log

Get-ClusterLog -Node NodeName

# if you do not specify a path, each run overwrites the existing log in each node's Report directory; to collect every node's log into a single directory, use the Destination parameter, where each file in the directory carries its node's name

Get-ClusterLog -Destination path

By default the cluster log level is 3, and level 3 is usually detailed enough. If you are doing deeper diagnosis, you can raise it to 5 with Set-ClusterLog -Level 5 for advanced diagnostics. Be aware that level 5 makes the log grow very quickly in a short time, so it is recommended to restore level 3 promptly once the diagnosis is complete.
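A minimal sketch of that workflow:

# raise verbosity for advanced diagnosis (the default level is 3)

Set-ClusterLog -Level 5

# reproduce the problem, then regenerate the log for the relevant window

Get-ClusterLog -TimeSpan 10

# restore the default promptly; level 5 grows the log very quickly

Set-ClusterLog -Level 3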

Let's actually generate it and see.

The generated cluster.log is stored by default under each node's Reports path, i.e. %SystemRoot%\Cluster\Reports.

After opening it, the log looks as follows.

At this point you may want a log analysis tool that suits your own habits.

We can see that cluster.log looks different from the logs we see elsewhere. How should we read it? Let's take an example.

Process ID: the hexadecimal ID of the RHS process in which the resource is hosted

Thread ID: the hexadecimal ID of the RHS thread serving the resource

GMT time: the GMT time at which the event occurred, accurate to the millisecond. GMT was chosen originally because cluster nodes may be distributed across time zones; for China Standard Time (UTC+8) you need to add 8 hours. The 2016 cluster adds the UseLocalTime parameter: when generating the cluster log, if we are sure all nodes are in the same time zone, we can use UseLocalTime to produce local timestamps (see the example after this list).

Log level: usually INFO, ERR, WARN, DBG, and similar; the ERR keyword is worth tracking during log analysis.

Resource category: the resource type or cluster component that generated the entry

Resource name: specific resource name

Description: a detailed description of the log
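A sketch of the 2016-and-later local-time option mentioned above:

# 2016+ only: write local timestamps instead of GMT (assumes every node is in the same time zone)

Get-ClusterLog -UseLocalTime -TimeSpan 10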

There are some key state keywords in cluster.log that you can focus on when reading it, and set as tracked terms in your analysis tool.

OnlinePending: the resource is in the process of coming online

OfflinePending: the resource is in the process of going offline

Offline: the resource is offline

ProcessingFailure: a resource failure is being processed

Failed: the cluster group has failed

By directly searching for the corresponding keywords in the log analysis tool, you can see the contextual processes that take place nearby.

Let's look at a few practical cases.

The NetFT component tries to build the cluster's network communication topology for us, and here we can see the detailed process. Notice that port 3343 connections (the cluster service port) are only attempted between the 18, 10, and 20 network segments: because we designated the 30 and 40 segments as storage networks that do not participate in cluster communication, NetFT does not consider those two networks when building the topology.

In this 08 R2 cluster, because only one node is running, the cluster storage has been in a failed state. At the 17:10 time point I bring the cluster storage back online; at the 17:11 time point I disable the iSCSI target.

Generating the cluster log, we can see that at 17:10 RHS keeps probing Cluster Disk 1, finds that the resource is mounted normally and the checks pass, and therefore reports the status of Cluster Disk 1 to RCM, and RCM changes the cluster disk's status to Online.

At 17:11, RHS's IsAlive check of Cluster Disk 1 fails, so it judges the resource failed and reports the failed status to RCM, and RCM changes the cluster disk's status to Failed. RCM then pends and retries the cluster disk according to its failure policy: it marks the disk as failed for a period, then after a while tries to bring it online again or move it to another node.

In the third example, we look at what happens inside the cluster when we set LowerQuorumPriorityNodeID.

Time point 1: the cluster has four nodes and the witness disk fails, so the cluster's dynamic quorum adjusts and removes the vote of one node of its own choosing.

Time point 2: use the LowerQuorumPriorityNodeID setting to remove the vote of the HV04 node, as in the sketch below.
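A sketch of how that setting is applied (HV04 having node ID 4 is an assumption for illustration):

# tell dynamic quorum to remove this node's vote first when a vote must go

(Get-Cluster).LowerQuorumPriorityNodeID = 4

# check which node currently has lowered quorum priority

(Get-Cluster).LowerQuorumPriorityNodeID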

# use this command to output the last five minutes of logs from all cluster nodes to a shared folder

Get-ClusterLog -Destination \\iscsi\clusterlog -TimeSpan 5

Opening HV03's log, we can see that at time point 1, when the cluster disk is detected offline, quorum first chooses to remove the vote of node 1; at time point 2 we manually set LowerQuorumPriorityNodeID to HV04, and the cluster re-adjusts dynamic quorum and removes HV04's vote.

In the fourth example, we can see that when NetFT builds the intra-cluster communication topology, it takes into account whether each link is within one subnet or crosses subnets; the health-detection frequency for the different network environments can be tuned to suit the situation, and NetFT builds the detection topology for each environment according to what we define, as the properties below show.
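A sketch of the heartbeat settings involved (delays in milliseconds; the value set below is an example, not a recommendation):

# the same-subnet and cross-subnet heartbeat settings NetFT honors

Get-Cluster | Format-List SameSubnetDelay,SameSubnetThreshold,CrossSubnetDelay,CrossSubnetThreshold

# example: probe cross-subnet links every 2 seconds instead of the default 1 second

(Get-Cluster).CrossSubnetDelay = 2000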

In the last example, let's simulate this situation: the cluster has four nodes, three of them go down, the cluster is forcibly started on the last surviving node, and then a previously powered-off voting node comes back and has its quorum blocked.

Right now only the HV01 node of the cluster has been force-started; at this point we start the HV04 node, roughly as sketched below.
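For reference, this is how a forced start and a normal start are typically issued (a sketch; the FixQuorum switch corresponds to the classic net start clussvc /fq):

# force quorum on the surviving node

Start-ClusterNode -Name HV01 -FixQuorum

# start the returning node normally; it should join with prevent quorum and sync from HV01

Start-ClusterNode -Name HV04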

Now let's get the cluster diagnostic log for roughly the last 20 minutes to see what happened from cluster shutdown, through forced quorum, to prevented quorum.

Get-ClusterLog -Node HV01,HV04 -Destination \\iscsi\clusterlog -TimeSpan 20

Let's take a look at HV01's log first.

As you can see, at a certain time point HV01 force-starts the cluster service.

Immediately after each cluster component initializes, the quorum component determines that HV01 has been initialized and promotes HV01's paxos tag to authoritative. At this time the node is marked as FixQuorum: you can see that although the UI does not display it, the background log still indicates that the current node is running in forced quorum mode.

At this time HV04 tries to come online but is blocked. HV01 receives the new join request in the cluster environment and says, in effect: I am the authoritative node, my paxos tag is the highest, you should join my partition and synchronize your cluster database from mine.

Now let's look at HV04's log. Because HV04's node ID is 3, it appears as Node 3 in the log.

Time point 1: Node 3 attempts to start the cluster service, but the start is terminated.

Time point 2: Node 3 notes that its quorum state is blocked; it is trying to join and should first synchronize with the authoritative node.

Time point 3: Node 3 receives HV01's response and, taking HV01 as the cluster's authoritative party, updates its database according to HV01's paxos tag.

Time point 4: Node 3 has aligned its cluster database with HV01, rejoins the cluster normally, turns off blocked quorum mode, and removes the NeedsPrevent flag!

Through these few simple examples, Lao Wang has explained how to read the cluster log, handing you the fishing rod and its basic technique, so to speak. If you want to master the analysis skill, you still need to keep practicing on real logs. The cluster log truly involves the detailed working processes of many low-level cluster components, and if you can figure out how those components work, Lao Wang believes that whether you are troubleshooting or studying, you will reach a different level.

In real cluster troubleshooting, Lao Wang believes the key is to consolidate your knowledge and experience through continuous learning, experiments, and hands-on work; tooling accounts for part of solving a problem, but your own accumulated knowledge and experience account for a large part. For example, when Lao Wang hits a cluster problem, my approach is to first get the votes of the cluster nodes and the witness, and see what the witness status currently looks like; then I filter the cluster section of the system events to see whether the network or storage has reported errors, and then analyze and judge the problem based on my own experience.

In fact, Lao Wang feels that for ordinary troubleshooting, the cluster events in the Event Viewer System and application logs already provide clear enough information. Microsoft had a vision when it reworked cluster events in the 2008 era: the hope that more administrators could understand and solve problems just by looking at Event Viewer. By the 2012 era the logs in Event Viewer had indeed reached that vision; for a great many problems, the log states things very clearly.

But once we encounter problems that the hints in Event Viewer cannot solve, we still need to look directly at the cluster log to understand the root cause of the failure from the bottom of the cluster.

If you need comprehensive troubleshooting, for example of cluster errors and virtualization errors across three nodes, you can view the aggregated cluster events in Failover Cluster Manager.

If you need a cluster health check to see which best practices have not been implemented, you can run a cluster validation report, which also provides a lot of useful low-level information.
