The fundamentals of WSFC log analysis

2026-02-09 Update From: SLTechnology News&Howtos

In the previous post, Lao Wang introduced quorum in WSFC, which keeps the cluster continuously available, along with some ideas for handling downtime. In this post, Lao Wang turns to log analysis in WSFC. When something goes wrong or performance needs tuning, you usually have to read the logs to analyze and judge, so mastering WSFC log analysis is essential. Over several articles, Lao Wang hopes to teach readers how to fish with WSFC's log analysis features and introduce them to more friends.

In fact, since Windows Server 2012, Lao Wang feels that the cluster event logs have been improved a great deal. They are now basically very clear, and IT pros can usually find the problem directly from the event log.

First, let's look at the System log. By default, WSFC writes cluster status information to the System log, covering node, storage, network, cluster, and quorum status. Critical events, errors, warnings, resource failures, and so on all appear there. Administrators can filter the System log directly down to cluster events.

Once the filter is applied, you see only the cluster-related entries. In most cases, WSFC tells you in the System log what the failure is: whether the cluster is down, storage is offline, the network is partitioned, quorum cannot be reached, and so on. So the first step is always to read the System log and understand what the cluster is saying. In some cases you can repair the problem directly from the hints given there, or at least you get a clear range of directions to investigate.
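As a rough illustration of that filtering step, the sketch below applies the same idea to events that have already been exported from the System log. The record layout (dicts with `provider`, `level`, and `message` keys) and the sample events are hypothetical and made up for this example; `Microsoft-Windows-FailoverClustering` is the provider name assumed for cluster events.

```python
CLUSTER_PROVIDER = "Microsoft-Windows-FailoverClustering"
ALERT_LEVELS = {"Critical", "Error", "Warning"}


def cluster_events(events):
    """Keep only cluster events at Critical, Error, or Warning level."""
    return [
        e for e in events
        if e["provider"] == CLUSTER_PROVIDER and e["level"] in ALERT_LEVELS
    ]


# Made-up sample records standing in for an exported System log.
sample = [
    {"provider": CLUSTER_PROVIDER, "level": "Error", "message": "cluster resource failed"},
    {"provider": "Service Control Manager", "level": "Information", "message": "service started"},
    {"provider": CLUSTER_PROVIDER, "level": "Information", "message": "node joined"},
]

for event in cluster_events(sample):
    print(event["level"], event["message"])
```

Only the first sample record passes both conditions, which mirrors what the filtered view in the event log shows: cluster entries that actually need attention.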

In addition to the System log, there are two other important cluster logs under Applications and Services Logs, which are also useful in some troubleshooting scenarios:

FailoverClustering-Operational

FailoverClustering-Manager-Diagnostic

The FailoverClustering-Operational log mainly records resource changes and related information while the cluster is running. Security manager activity, NetFT cluster network topology generation, health checks, and state changes of clustered applications or cluster disks (online, offline, or moved) are all recorded here in detail. If you want to reproduce a problem in the cluster or confirm whether a resource change took effect, check the Operational log.

The FailoverClustering-Manager-Diagnostic log records every action and modification a cluster administrator performs while Failover Cluster Manager is open. This is useful in some troubleshooting scenarios, because it helps administrators find problems that may have been caused by a specific change.

The other logs serve the following purposes:

FailoverClustering-Diagnostic: the cluster diagnostic log, recorded at detail level 3 in 2012 R2. It shows every step that happens in the background while the cluster runs, and is intended for advanced troubleshooting and for learning how the cluster works.

FailoverClustering-Performance-CSV: performance analysis log for CSV.

FailoverClustering-Client: detailed log written when creating a cluster or adding nodes.

FailoverClustering-CSVFT-Diagnostic: added in 2012. It helps administrators analyze CSV mount reads, metadata reads and writes, I/O redirection, and so on.

FailoverClustering-CSVFS-Operational: used to track CSV mounts and direct I/O.

FailoverClustering-Manager-Operational: mainly records administrative operations performed on the cluster, such as whether a PowerShell script was issued and executed normally, and which nodes currently cannot accept management operations.

FailoverClustering-WMIProvider-Admin: used for troubleshooting when the cluster is accessed through a generic WMI program or another cluster program that calls the WMI provider.

In addition to the logs of the cluster itself, starting with 2012 there are also separate logs for Cluster-Aware Updating (CAU), where you can see the status of each cluster node during a CAU run, along with the details.

In Lao Wang's view, for ordinary enterprise administrators maintaining a cluster, it is enough to look at three logs in Event Viewer: the cluster events in the System log, FailoverClustering-Operational, and FailoverClustering-Manager-Diagnostic. Together they let you reproduce and analyze most problems. That may not satisfy enthusiasts who want to dig into the internals of the cluster, and it may not be enough for some advanced troubleshooting scenarios. If you want to see the most detailed execution process of the entire cluster, look at the Diagnostic log. Almost every step the cluster performs is recorded in the FailoverClustering-Diagnostic log, and you will see this log grow continuously. Lao Wang will cover this diagnostic log specifically in the advanced part later.

Lao Wang used 2012 R2 as the example above, but cluster logs have existed for a long time. In Windows Server 2003, Event Viewer was not as polished as it is now, so all cluster logging went into a single log file. While the cluster was running, that log kept growing. When something went wrong, administrators went straight to cluster.log under C:\Windows\Cluster to troubleshoot.

Things changed in 2008, when part of the cluster logging moved to event tracing sessions.

Logs collected by this kind of data collector cannot be viewed directly in Event Viewer.

Since 2008, the diagnostic log has been split into several ETL files. These files cannot be opened directly. They can only be viewed after being converted to another format, such as CSV, with the tracerpt command.
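To make the conversion step concrete, here is a minimal sketch that only builds the tracerpt command line for one ETL file. The helper name `tracerpt_command` and the file paths are hypothetical. The flags used are tracerpt's `-o` (output file), `-of` (output format), and `-y` (answer yes to prompts).

```python
from pathlib import PureWindowsPath


def tracerpt_command(etl_path, out_dir):
    """Build the argument list that converts one ETL file to CSV with tracerpt."""
    etl = PureWindowsPath(etl_path)
    csv_path = PureWindowsPath(out_dir) / (etl.stem + ".csv")
    # -o names the output file, -of selects the format, -y suppresses prompts.
    return ["tracerpt", str(etl), "-o", str(csv_path), "-of", "CSV", "-y"]


# Hypothetical paths, for illustration only.
cmd = tracerpt_command(r"C:\Temp\ClusterLog.etl.001", r"C:\Temp\out")
print(" ".join(cmd))
```

On a cluster node, a list like this could be passed to `subprocess.run(cmd, check=True)`. The sketch stops at building the command so that it can be inspected on any machine.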

Therefore, in the 2008 era, detailed cluster diagnostic logs could not be seen in Event Viewer. They could only be viewed through the cluster log /gen command or the Get-ClusterLog cmdlet. When the command runs, it merges all the diagnostic ETL files, strips the useless metadata, and saves the result as a cluster.log file for everyone to read. For this reason, Lao Wang thinks working with cluster logs in the 2008 era was still a little worse than in the 2012 era.

By 2012, the diagnostic log had been separated from the data collector and given its own event channel, which can be viewed directly in Event Viewer.

So far, this article has mainly covered viewing and analyzing cluster logs in Event Viewer. Lao Wang thinks that Event Viewer is the place to start learning cluster log analysis: first learn to read the cluster events in the System log, FailoverClustering-Operational, and FailoverClustering-Manager-Diagnostic, and look at the other logs when you need them. Lao Wang only touches on the diagnostic log here, because it will be covered in more detail later. He also suggests learning these three basic logs before moving on to the diagnostic log, because the diagnostic log involves more of the cluster's underlying concepts. Without a deep understanding of the cluster, it can be hard to follow. Event Viewer is now clear and easy to read, which makes it a good starting point.

In addition to Event Viewer, the cluster also provides some intuitive reports. In the C:\Windows\Cluster\Reports directory, you can find validation reports, add-node reports, cluster creation reports, quorum configuration reports, and so on. The cluster generates these MHTML documents for us. When opened, they present a very friendly interface, whether the reader is an administrator or a manager.
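When the reports directory fills up, it helps to see the newest reports first. The sketch below lists report files in a directory sorted by modification time. The helper name `list_reports`, the file extensions, and the demo file names are assumptions for illustration. On a real node the directory would be the reports directory mentioned above.

```python
import tempfile
from pathlib import Path

# Assumption: reports are saved with one of these extensions.
REPORT_SUFFIXES = {".mht", ".mhtml"}


def list_reports(report_dir):
    """Return report files in report_dir, newest first."""
    files = [
        p for p in Path(report_dir).iterdir()
        if p.is_file() and p.suffix.lower() in REPORT_SUFFIXES
    ]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


# Demo against a temporary directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("validation.mht", "create-cluster.mht", "notes.txt"):
        (Path(tmp) / name).write_text("placeholder")
    demo_names = sorted(p.name for p in list_reports(tmp))
    print(demo_names)
```

The demo sorts the names alphabetically before printing only to keep its output stable. `list_reports` itself returns the files in newest-first order.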

Among these, the cluster validation report can be thought of as the cluster's private doctor. Running cluster validation is strongly recommended when creating a cluster. It produces a detailed diagnosis from the angles of system configuration, network, storage, and so on, tells you whether the current environment is suitable for a cluster, and reports an error for anything unsuitable. It also uses built-in best practices to suggest what should be improved.

Besides running validation when the cluster is created, it is also recommended to run it after changing the cluster's network or storage environment. This helps us analyze whether the changed environment will affect the normal operation of the cluster.

If applications are already running in the cluster, cluster validation can still help by running simulated checks against the clustered application. Be careful with the storage category when running validation. If storage is selected, the validation process will try to take the cluster disks offline and bring them back online, which may cause application downtime. Either run validation at a suitable time or clear the storage check box.

The MHTML reports in the reports directory provide an intuitive display whenever the cluster changes or we trigger a report. For detailed troubleshooting, though, administrators sometimes still need to look at the ValidateStorage log in the same folder, which is more detailed than the MHTML report.

For those who want to learn cluster log analysis, the second step is to master the cluster validation report and the other reports in this directory. At the very least, learn to read and understand the validation report. It will help you quickly understand the steps that occur when a cluster is created and the requirements the cluster should meet at run time.

The third step is to master event queries in Failover Cluster Manager.

When we open Failover Cluster Manager, the home page tells us that the recent cluster events include 2 critical events, 30 errors, and 3 warnings. So where do these events come from? They actually come from the event logs: the cluster queries the event logs and displays the results in its own GUI.
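The summary on the home page is essentially a count of recent events per level. A minimal sketch of that tally follows, using made-up records that reproduce the numbers quoted above. The record layout and the helper name `summarize_levels` are assumptions.

```python
from collections import Counter

SUMMARY_LEVELS = ("Critical", "Error", "Warning")


def summarize_levels(events):
    """Count events per level, reporting only the levels shown in the summary."""
    counts = Counter(e["level"] for e in events)
    return {level: counts.get(level, 0) for level in SUMMARY_LEVELS}


# Made-up events matching the counts quoted in the text, plus some noise.
events = (
    [{"level": "Critical"}] * 2
    + [{"level": "Error"}] * 30
    + [{"level": "Warning"}] * 3
    + [{"level": "Information"}] * 5
)

summary = summarize_levels(events)
print(summary)
```

Informational events are counted internally but left out of the result, since the summary only reports the three alert levels.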

Clicking the event link takes you to the Cluster Events view, which looks similar to what we see in Event Viewer.

In fact, the events in Failover Cluster Manager differ a little from those in the ordinary Event Viewer. Once we have built a cluster, we naturally want to look at its state as a whole, but by default Event Viewer only shows a single server.

Failover Cluster Manager is optimized for this. The log we see under Cluster Events is actually collected by the cluster from all of its nodes and then presented together.

If you open the query in the Cluster Events view, you can see that the current log source collects the critical, error, and warning cluster events from all nodes of the cluster, and that the default query range is the last 24 hours. This is a very good design: it lets administrators see the logs of all nodes in a single Cluster Events view.
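The behavior of that default query can be sketched in a few lines: take the events of every node, keep only critical, error, and warning entries from the last 24 hours, and merge them into one timeline. The node names, timestamps, record layout, and the helper name `recent_cluster_events` are all hypothetical, and this is only an approximation of what the GUI does.

```python
from datetime import datetime, timedelta

ALERT_LEVELS = {"Critical", "Error", "Warning"}


def recent_cluster_events(events_by_node, now, window=timedelta(hours=24)):
    """Merge per-node event lists into one timeline, newest first."""
    merged = []
    for node, events in events_by_node.items():
        for e in events:
            in_window = now - window <= e["time"] <= now
            if e["level"] in ALERT_LEVELS and in_window:
                # Tag each event with the node it came from.
                merged.append({"node": node, **e})
    return sorted(merged, key=lambda e: e["time"], reverse=True)


# Made-up sample data for two hypothetical nodes.
now = datetime(2024, 1, 2, 12, 0)
events_by_node = {
    "NODE1": [
        {"time": datetime(2024, 1, 2, 11, 0), "level": "Error"},
        {"time": datetime(2023, 12, 31, 9, 0), "level": "Critical"},  # too old
    ],
    "NODE2": [
        {"time": datetime(2024, 1, 1, 18, 0), "level": "Warning"},
        {"time": datetime(2024, 1, 2, 10, 0), "level": "Information"},  # wrong level
    ],
}

timeline = recent_cluster_events(events_by_node, now)
for e in timeline:
    print(e["time"], e["node"], e["level"])
```

Tagging each record with its node is the important part: once events from several servers share one list, the node name is what lets you trace a fault back to its origin.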

In addition to collecting the cluster events in the System log from all nodes by default, we can also manually choose which nodes and which logs the cluster should collect.

For example, in a Hyper-V cluster we can also select the Hyper-V logs. When debugging a virtualization cluster, we then see not only the cluster logs under Cluster Events but also the Hyper-V error logs.

One thing to note: because this query runs against all nodes, it is best not to select too many logs beyond those of the cluster itself. Pick one or two specific items, such as SQL or Hyper-V. The point of troubleshooting is to locate the fault accurately while seeing the whole picture. If too many sources are collected, the result becomes meaningless.

So far we have been looking at the overall log of the cluster under Cluster Events. If a single clustered application has a problem, we can also select that application and choose Show Critical Events in the actions next to it. This shows the critical and error information for that application, aggregated from all cluster nodes.

For a cluster disk, the same action shows the critical and error information collected for that disk on each node.

So WSFC has event aggregation across cluster nodes built in. We can see the logs of all cluster nodes in the overall Cluster Events view, and the same function is built into individual clustered applications and cluster disks. When analyzing a single application or disk, we can use this simple method to get the relevant logs from all nodes.

This concludes the basic part of WSFC log analysis. In this article, Lao Wang introduced three relatively basic entry points for WSFC log analysis: Event Viewer, the cluster reports directory, and the events in Failover Cluster Manager. Friends who have no idea where to begin with cluster log analysis can start from these three places, read the contents carefully, and learn to use them. Lao Wang believes this will help improve your log analysis ability.