In the advanced section of WSFC log analysis we touched on some of WSFC's underlying components, such as Resource.dll, RHS, and RCM. Understanding these components is a great help for later cluster troubleshooting. In this article we use an actual resource-deadlock case to make those concepts concrete.
First of all, what does Resource.dll do? Resource.dll is the component through which each cluster resource lives; only through its Resource.dll can a cluster resource operate at all. Resource.dll defines the LooksAlive and IsAlive health-check functions used to probe the resource, as well as the operation calls for the resource (bringing it online, taking it offline, and so on).
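These definitions surface as ordinary cluster properties, so they are easy to inspect. A minimal PowerShell sketch ("Cluster disk 1" is a placeholder resource name; the intervals are in milliseconds):
# Which resource DLL implements a resource type, e.g. clusres.dll for Physical Disk
Get-ClusterResourceType "Physical Disk" | Format-List Name, DllName
# The LooksAlive / IsAlive polling intervals of one resource
Get-ClusterResource "Cluster disk 1" | Format-List Name, ResourceType, LooksAlivePollInterval, IsAlivePollInterval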
RHS, the Resource Hosting Subsystem, runs as a process and is responsible for monitoring whether cluster resources are running normally. RHS monitors each cluster resource using the LooksAlive and IsAlive methods defined in the resource's Resource.dll.
When an IsAlive probe confirms that a resource has failed, RHS reports the failure to the RCM Resource Control Manager component, and RCM restarts and retries the resource according to the resource failure policy, failing it over if necessary.
In short, Resource.dll defines how the patient is examined and what the examination procedure is; RHS follows Resource.dll's methods and procedures to carry out the examination; and once RHS has judged the condition, the case is handed to nurse RCM, who decides by policy whether to give an injection or admit the patient.
Today we mainly focus on the stage that RHS sees. Normally there are two outcomes: the resource is good, RHS reports it healthy and keeps polling; or the resource is bad, and RCM starts handling it according to policy.
But besides these two situations, there is a third common one: resource deadlock.
What is a resource deadlock? RHS sends IsAlive probes to a cluster resource, but the resource does not respond. If it does not respond, how can the cluster know whether it is alive? The cluster will not wait forever for the resource to answer; it must be able to confirm with certainty whether each resource is available. After a period of time the cluster declares that the resource has entered a deadlock. RHS puts the unresponsive cluster resource into a separate, isolated RHS process, then tries to restart the resource there and creates a WER report for the deadlock.
As for the deadlock timeout, WSFC 2008 R2 defaults to waiting 5 minutes. If the resource still has not responded after 5 minutes, it is declared deadlocked and the cluster tries to restart it in a separate RHS process.
In the 2008 R2 era, the Deadlock timeout can be changed with the following commands.
At the single-resource level:
(Get-ClusterResource "Resource name").DeadlockTimeout = 300000
At the resource-type level:
(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout = 300000
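The value is in milliseconds (300000 ms = 5 minutes, the default), and can be read back the same way to confirm a change:
(Get-ClusterResource "Resource name").DeadlockTimeout
(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout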
The next layer of protection: when the cluster issues a request to terminate a deadlocked RHS process, it waits up to four times the deadlock timeout for the process to exit (by default 4 x 5 minutes = 20 minutes). If RHS has not terminated within those 20 minutes, the cluster assumes the server has a serious health problem and bugchecks it to force failover and recovery; during this time cluster resources may stop being accessible. The bugcheck code will be Stop 0x0000009E (Parameter1, Parameter2, 0x0000000000000005, Parameter4). Note: when an RHS process fails to terminate, Parameter3 is always the value 0x5.
From the WSFC 2008 R2 era onward:
The cluster IP address, cluster network name, and quorum resource work in one separate RHS monitoring process
Cluster Available Storage disks and CSV work in another separate RHS monitoring process
Other cluster resources work in dedicated RHS monitoring processes
This avoids the pre-2008 R2 drawback of having every resource managed by a single RHS process: before 2008 R2, all cluster resources lived in the same RHS process, so as soon as one resource crashed, the entire RHS process could fail and every resource it hosted would fail with it.
As for quarantine: if a single unresponsive cluster resource causes RHS to crash, the cluster service regards that specific resource as suspect and in need of isolation. It automatically sets the resource's public SeparateMonitor property, marking the resource to run in its own private RHS process, so that if the resource becomes unresponsive again it does not affect the processes of other cluster resources.
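The same property can be inspected, or set by hand for a suspect resource; a minimal sketch ("Cluster disk 2" is a placeholder name, 1 means the resource gets its own RHS process):
# See which resources already run in a separate monitor
Get-ClusterResource | Format-Table Name, ResourceType, SeparateMonitor
# Force a suspect resource into its own private RHS process
(Get-ClusterResource "Cluster disk 2").SeparateMonitor = 1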
In 2008 R2, once the resource deadlock described above occurs, that is, the application does not respond to IsAlive requests, a WER report is generated, which can be seen in Control Panel, and a dump of the deadlocked RHS process is collected. The WSFC 2016 era goes a step further, displaying more detailed information about the resource deadlock and helping the debugger with Zero Downtime Debugging.
OK, with the basics explained, let's look at this resource deadlock case.
In the field you will sometimes meet inexplicable, strange problems, especially when a lot of changes were made together and something suddenly breaks, and you do not know which change caused the problem. That is the biggest headache. At that point you need to sit down and look at the problem in depth.
The problem this time came from an update change. A patch update was carried out somewhere on a batch of cluster nodes, and two of the nodes were rebooted after the cluster update completed. Those machines ran slowly and the cluster service could not start. An attempt to force quorum on the two cluster nodes was fruitless: the cluster failed again immediately after the forced quorum, and the witness disk and cluster disks could never come online.
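For reference, the forced-quorum start attempted here looks roughly like this (a sketch; ZQ1 is the node name that appears in the event log below, and force quorum should only be used for recovery):
# Start the cluster service on one node in force-quorum mode
Start-ClusterNode -Name ZQ1 -FixQuorum
# Legacy equivalent: net start clussvc /forcequorum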
The initial suspicion fell on the system update patch, but after uninstalling the patch the problem remained, so cluster-level troubleshooting began. First, check whether the cluster's path to storage is normal; to confirm the links, start with the cluster event log.
View the cluster event log
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1573
Task Category: Quorum Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
Node 'ZQ1' failed to form a cluster. This was because the witness was not accessible. Please ensure that the witness resource is online and available.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
Cluster resource 'cluster disk 1' in clustered role 'wlc' failed.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'Cluster disk 1' (resource type '', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
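These entries can be pulled in one pass with Get-WinEvent (a sketch; the event IDs are the ones shown above):
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1573, 1069, 1230
} | Format-List TimeCreated, Id, Message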
Most of the cluster event log entries indicate that the cluster disk cannot come online, that a deadlock occurred, and that the RHS process timed out.
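The cluster debug log itself is generated with Get-ClusterLog; a minimal sketch, assuming C:\Temp exists on the node:
# Write Cluster.log for every node, covering the last 60 minutes, into C:\Temp
Get-ClusterLog -TimeSpan 60 -Destination C:\Temp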
Looking at the cluster.log, a deadlock appears on Cluster disk 2:
00001148.00001158::2017/09/15-20:24:36.365 ERR   [RHS] RhsCall::DeadlockMonitor: Call ONLINERESOURCE timed out for resource 'Cluster disk 2'.
00001148.00001158::2017/09/15-20:24:36.365 ERR   [RHS] Resource Cluster disk 2 handling deadlock. Cleaning current operation.
00001148.00001158::2017/09/15-20:24:36.365 ERR   [RHS] About to send WER report.
0000084c.0000159c::2017/09/15-20:24:36.365 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Cluster disk 2', gen(0) result 5018.
0000084c.0000159c::2017/09/15-20:24:36.365 INFO  [RCM] TransitionToState(Cluster disk 2) OnlinePending --> ProcessingFailure.
0000084c.0000159c::2017/09/15-20:24:36.365 ERR   [RCM] rcm::RcmResource::HandleFailure: (Cluster disk 2)
The logs so far indicate that the cluster problem is tied to the cluster disk failing to come online on both nodes: the cluster disk is inaccessible and an RHS deadlock appears, meaning the cluster disk's I/O did not get a timely response.
The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e (0xfffffab030cce600, 0x00000000000004b0, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP
But at this point we still cannot determine what caused the problem. Following the bugcheck, the next step is to look at the dump file to find the cause of the resource deadlock. The dump can be the process dump from WER, but a complete memory dump is preferable; in this case we take a complete MEMORY.DMP as the example.
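A minimal kd sequence over such a dump might look like the following (a sketch; the process and thread addresses are the ones quoted in this case):
$$ Summarize the 0x9E bugcheck; parameter 1 is the hung RHS process
!analyze -v
$$ List the threads of the deadlocked RHS process, with stacks
!process fffffab0`30cce600 7
$$ Show the waiting thread discussed below
!thread fffffab0`30f0a6d0
$$ Identify the module that the stuck stacks point at
lmvm TmXPFlt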
From the dump, the RHS process in which the deadlock occurred is fffffab0`30cce600. There are three threads in the process, two of which are waiting. The first, system thread fffffab0`30f0a6d0, has the call stack below; it is waiting on the TmXPFlt.sys driver:
00 fffff880`08d92420 fffff800`01ad3142 nt!KiSwapContext+0x7a
01 fffff880`08d92560 fffff800`01ad596f nt!KiCommitThreadWait+0x1d2
02 fffff880`08d925f0 fffff880`056880e4 nt!KeWaitForSingleObject+0x19f
03 fffff880`08d92690 fffff880`05680838 TmXPFlt+0xb0e4
04 fffff880`08d926f0 fffff880`05670be2 TmXPFlt+0x3838
05 fffff880`08d927e0 fffff880`0148c0f7 TmPreFlt!TmpQueryFullName+0xb66
06 fffff880`08d928a0 fffff880`0148ea0a fltmgr!FltpPerformPreCallbacks+0x50b
07 fffff880`08d929b0 fffff880`014aa2a3 fltmgr!FltpPassThroughInternal+0x4a
08 fffff880`08d929e0 fffff800`01dd22bb fltmgr!FltpCreate+0x293
09 fffff880`08d92a90 fffff800`01dcddde nt!IopParseDevice+0x14e2
0a fffff880`08d92bf0 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
0b fffff880`08d92cf0 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
0c fffff880`08d92dc0 fffff800`01d7316b nt!IopCreateFile+0x2bc
0d fffff880`08d92e60 fffff880`014b1f60 nt!IoCreateFileEx+0xfb
0e fffff880`08d92f00 fffff880`014bdc61 fltmgr!FltpCreateFile+0x194
0f fffff880`08d92ff0 fffff880`014e3506 fltmgr!FltCreateFileEx+0x91
10 fffff880`08d93080 fffff880`014de40e dfsrro!DfsrRopLoadPrefixEntriesFromFile+0x416
11 fffff880`08d93250 fffff880`014b00c6 dfsrro!DfsrRoNewInstanceCallback+0x2e2
12 fffff880`08d932b0 fffff880`014af0cb fltmgr!FltpDoInstanceSetupNotification+0x86
13 fffff880`08d93310 fffff880`014afe81 fltmgr!FltpInitInstance+0x27b
14 fffff880`08d93380 fffff880`014b0d5b fltmgr!FltpCreateInstanceFromName+0x1d1
15 fffff880`08d93450 fffff880`014aed6c fltmgr!FltpEnumerateRegistryInstances+0x15b
16 fffff880`08d934f0 fffff880`014aa3f0 fltmgr!FltpDoFilterNotificationForNewVolume+0xec
17 fffff880`08d93560 fffff800`01dd22bb fltmgr!FltpCreate+0x3e0
18 fffff880`08d93610 fffff800`01dcddde nt!IopParseDevice+0x14e2
19 fffff880`08d93770 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
1a fffff880`08d93870 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
1b fffff880`08d93940 fffff800`01ddbd34 nt!IopCreateFile+0x2bc
1c fffff880`08d939e0 fffff800`01acd0d3 nt!NtCreateFile+0x78
1d fffff880`08d93a70 00000000`76fac28a nt!KiSystemServiceCopyEnd+0x13
1e 00000000`0219f748 00000000`00000000 0x76fac28a
Another thread in the RHS process, fffffab0`30ef2b50, is also waiting on the TmXPFlt.sys driver, with an identical call stack.
start             end                 module name
fffff880`0567d000 fffff880`056d3000   TmXPFlt    (no symbols)
    Loaded symbol image file: TmXPFlt.sys
    Image path: C:\Program Files (x86)\Trend Micro\OfficeScan Client\TmXPFlt.sys
    Image name: TmXPFlt.sys
    Browse all global symbols  functions  data
    Timestamp:        Wed Jun 10 18:54:43 2009 (4A2F90F4)
    CheckSum:         00040739
    ImageSize:        00056000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
This dump analysis shows that the RHS process deadlock is related to the Trend Micro driver, and per the analysis above, that is very likely why the disks failed to come online.
After uninstalling Trend Micro on the cluster nodes and running the patch update again, the problem was resolved.
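To check whether the Trend Micro filter drivers are present and attached on a node, before or after the uninstall, something like the following works (a sketch; the path is the one reported by lmvm above):
# File version of the filter driver, if still installed
Get-Item 'C:\Program Files (x86)\Trend Micro\OfficeScan Client\TmXPFlt.sys' | Format-List FullName, VersionInfo
# List loaded minifilters and look for TmXPFlt / TmPreFlt
fltmc filters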
In fact, Trend Micro's support site had already published the cause of this problem and its solutions:
https://success.trendmicro.com/solution/1060123-core-protection-module-cpm-endpoint-component-prevents-microsoft-cluster-failover
You can either apply the registry modification described on the Trend Micro page,
or upgrade the Trend Micro VSAPI scan engine to 9.5 or later; all VSAPI scan engines prior to 9.5 have this resource deadlock problem with the cluster.
That is the analysis process for this resource deadlock case; I hope it proves helpful to interested readers!