In the advanced section of WSFC log analysis we touched on some of WSFC's underlying components, such as Resource.dll, RHS, and RCM. Understanding these components is a great help for later cluster troubleshooting. In this article we use an actual resource-deadlock case to make those concepts concrete.
First of all, what does Resource.dll do? Resource.dll is the component through which each cluster resource lives; only through its Resource.dll can a cluster resource operate at all. Resource.dll defines the LooksAlive and IsAlive health-check functions used to probe the resource, as well as the operation calls for the resource (bringing it online, taking it offline, and so on).
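These definitions surface as ordinary cluster properties, so they are easy to inspect. A minimal PowerShell sketch ("Cluster disk 1" is a placeholder resource name; the intervals are in milliseconds):
# Which resource DLL implements a resource type, e.g. clusres.dll for Physical Disk
Get-ClusterResourceType "Physical Disk" | Format-List Name, DllName
# The LooksAlive / IsAlive polling intervals of one resource
Get-ClusterResource "Cluster disk 1" | Format-List Name, ResourceType, LooksAlivePollInterval, IsAlivePollInterval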
RHS, the Resource Hosting Subsystem, runs as a process and is responsible for monitoring whether cluster resources are running normally. RHS monitors each cluster resource using the LooksAlive and IsAlive methods defined in the resource's Resource.dll.
When an IsAlive probe confirms that a resource has failed, RHS reports the failure to the RCM Resource Control Manager component, and RCM restarts and retries the resource according to the resource failure policy, failing it over if necessary.
In short, Resource.dll defines how the patient is examined and what the examination procedure is; RHS follows Resource.dll's methods and procedures to carry out the examination; and once RHS has judged the condition, the case is handed to nurse RCM, who decides by policy whether to give an injection or admit the patient.
Today we mainly focus on the stage that RHS sees. Normally there are two outcomes: the resource is good, RHS reports it healthy and keeps polling; or the resource is bad, and RCM starts handling it according to policy.
But besides these two situations, there is a third common one: resource deadlock.
What is a resource deadlock? RHS sends IsAlive probes to a cluster resource, but the resource does not respond. If it does not respond, how can the cluster know whether it is alive? The cluster will not wait forever for the resource to answer; it must be able to confirm with certainty whether each resource is available. After a period of time the cluster declares that the resource has entered a deadlock. RHS puts the unresponsive cluster resource into a separate, isolated RHS process, then tries to restart the resource there and creates a WER report for the deadlock.
As for the deadlock timeout, WSFC 2008 R2 defaults to waiting 5 minutes. If the resource still has not responded after 5 minutes, it is declared deadlocked and the cluster tries to restart it in a separate RHS process.
In the 2008 R2 era, the Deadlock timeout can be changed with the following commands.
At the single-resource level:
(Get-ClusterResource "Resource name").DeadlockTimeout = 300000
At the resource-type level:
(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout = 300000
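The value is in milliseconds (300000 ms = 5 minutes, the default), and can be read back the same way to confirm a change:
(Get-ClusterResource "Resource name").DeadlockTimeout
(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout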
The next layer of protection: when the cluster issues a request to terminate a deadlocked RHS process, it waits up to four times the deadlock timeout for the process to exit (by default 4 x 5 minutes = 20 minutes). If RHS has not terminated within those 20 minutes, the cluster assumes the server has a serious health problem and bugchecks it to force failover and recovery; during this time cluster resources may stop being accessible. The bugcheck code will be Stop 0x0000009E (Parameter1, Parameter2, 0x0000000000000005, Parameter4). Note: when an RHS process fails to terminate, Parameter3 is always the value 0x5.
From the WSFC 2008 R2 era onward:
The cluster IP address, cluster network name, and quorum resource work in one separate RHS monitoring process
Cluster Available Storage disks and CSV work in another separate RHS monitoring process
Other cluster resources work in dedicated RHS monitoring processes
This avoids the pre-2008 R2 drawback of having every resource managed by a single RHS process: before 2008 R2, all cluster resources lived in the same RHS process, so as soon as one resource crashed, the entire RHS process could fail and every resource it hosted would fail with it.
As for quarantine: if a single unresponsive cluster resource causes RHS to crash, the cluster service regards that specific resource as suspect and in need of isolation. It automatically sets the resource's public SeparateMonitor property, marking the resource to run in its own private RHS process, so that if the resource becomes unresponsive again it does not affect the processes of other cluster resources.
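The same property can be inspected, or set by hand for a suspect resource; a minimal sketch ("Cluster disk 2" is a placeholder name, 1 means the resource gets its own RHS process):
# See which resources already run in a separate monitor
Get-ClusterResource | Format-Table Name, ResourceType, SeparateMonitor
# Force a suspect resource into its own private RHS process
(Get-ClusterResource "Cluster disk 2").SeparateMonitor = 1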
In 2008 R2, once the resource deadlock described above occurs, that is, the application does not respond to IsAlive requests, a WER report is generated, which can be seen in Control Panel, and a dump of the deadlocked RHS process is collected. The WSFC 2016 era goes a step further, displaying more detailed information about the resource deadlock and helping the debugger with Zero Downtime Debugging.
OK, with the basics explained, let's look at this resource deadlock case.
In the field you will sometimes meet inexplicable, strange problems, especially when a lot of changes were made together and something suddenly breaks, and you do not know which change caused the problem. That is the biggest headache. At that point you need to sit down and look at the problem in depth.
The problem this time came from an update change. A patch update was carried out somewhere on a batch of cluster nodes, and two of the nodes were rebooted after the cluster update completed. Those machines ran slowly and the cluster service could not start. An attempt to force quorum on the two cluster nodes was fruitless: the cluster failed again immediately after the forced quorum, and the witness disk and cluster disks could never come online.
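For reference, the forced-quorum start attempted here looks roughly like this (a sketch; ZQ1 is the node name that appears in the event log below, and force quorum should only be used for recovery):
# Start the cluster service on one node in force-quorum mode
Start-ClusterNode -Name ZQ1 -FixQuorum
# Legacy equivalent: net start clussvc /forcequorum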
The initial suspicion fell on the system update patch, but after uninstalling the patch the problem remained, so cluster-level troubleshooting began. First, check whether the cluster's path to storage is normal; to confirm the links, start with the cluster event log.
View the cluster event log
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1573
Task Category: Quorum Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
Node 'ZQ1' failed to form a cluster. This was because the witness was not accessible. Please ensure that the witness resource is online and available.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
Cluster resource 'cluster disk 1' in clustered role 'wlc' failed.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'Cluster disk 1' (resource type '', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
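These entries can be pulled in one pass with Get-WinEvent (a sketch; the event IDs are the ones shown above):
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1573, 1069, 1230
} | Format-List TimeCreated, Id, Message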
Most of the cluster event log entries indicate that the cluster disk cannot come online, that a deadlock occurred, and that the RHS process timed out.
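The cluster debug log itself is generated with Get-ClusterLog; a minimal sketch, assuming C:\Temp exists on the node:
# Write Cluster.log for every node, covering the last 60 minutes, into C:\Temp
Get-ClusterLog -TimeSpan 60 -Destination C:\Temp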
Looking at the cluster.log, a deadlock appears on Cluster disk 2:
00001148.00001158::2017/09/15-20:24:36.365 ERR   [RHS] RhsCall::DeadlockMonitor: Call ONLINERESOURCE timed out for resource 'Cluster disk 2'.
00001148.00001158::2017/09/15-20:24:36.365 ERR   [RHS] Resource Cluster disk 2 handling deadlock. Cleaning current operation.
00001148.00001158::2017/09/15-20:24:36.365 ERR   [RHS] About to send WER report.
0000084c.0000159c::2017/09/15-20:24:36.365 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Cluster disk 2', gen(0) result 5018.
0000084c.0000159c::2017/09/15-20:24:36.365 INFO  [RCM] TransitionToState(Cluster disk 2) OnlinePending --> ProcessingFailure.
0000084c.0000159c::2017/09/15-20:24:36.365 ERR   [RCM] rcm::RcmResource::HandleFailure: (Cluster disk 2)
The logs so far indicate that the cluster problem is tied to the cluster disk failing to come online on both nodes: the cluster disk is inaccessible and an RHS deadlock appears, meaning the cluster disk's I/O did not get a timely response.
The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e (0xfffffab030cce600, 0x00000000000004b0, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP
But at this point we still cannot determine what caused the problem. Following the bugcheck, the next step is to look at the dump file to find the cause of the resource deadlock. The dump can be the process dump from WER, but a complete memory dump is preferable; in this case we take a complete MEMORY.DMP as the example.
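A minimal kd sequence over such a dump might look like the following (a sketch; the process and thread addresses are the ones quoted in this case):
$$ Summarize the 0x9E bugcheck; parameter 1 is the hung RHS process
!analyze -v
$$ List the threads of the deadlocked RHS process, with stacks
!process fffffab0`30cce600 7
$$ Show the waiting thread discussed below
!thread fffffab0`30f0a6d0
$$ Identify the module that the stuck stacks point at
lmvm TmXPFlt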
From the dump, the RHS process in which the deadlock occurred is fffffab0`30cce600. There are three threads in the process, two of which are waiting. The first, system thread fffffab0`30f0a6d0, has the call stack below; it is waiting on the TmXPFlt.sys driver:
00 fffff880`08d92420 fffff800`01ad3142 nt!KiSwapContext+0x7a
01 fffff880`08d92560 fffff800`01ad596f nt!KiCommitThreadWait+0x1d2
02 fffff880`08d925f0 fffff880`056880e4 nt!KeWaitForSingleObject+0x19f
03 fffff880`08d92690 fffff880`05680838 TmXPFlt+0xb0e4
04 fffff880`08d926f0 fffff880`05670be2 TmXPFlt+0x3838
05 fffff880`08d927e0 fffff880`0148c0f7 TmPreFlt!TmpQueryFullName+0xb66
06 fffff880`08d928a0 fffff880`0148ea0a fltmgr!FltpPerformPreCallbacks+0x50b
07 fffff880`08d929b0 fffff880`014aa2a3 fltmgr!FltpPassThroughInternal+0x4a
08 fffff880`08d929e0 fffff800`01dd22bb fltmgr!FltpCreate+0x293
09 fffff880`08d92a90 fffff800`01dcddde nt!IopParseDevice+0x14e2
0a fffff880`08d92bf0 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
0b fffff880`08d92cf0 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
0c fffff880`08d92dc0 fffff800`01d7316b nt!IopCreateFile+0x2bc
0d fffff880`08d92e60 fffff880`014b1f60 nt!IoCreateFileEx+0xfb
0e fffff880`08d92f00 fffff880`014bdc61 fltmgr!FltpCreateFile+0x194
0f fffff880`08d92ff0 fffff880`014e3506 fltmgr!FltCreateFileEx+0x91
10 fffff880`08d93080 fffff880`014de40e dfsrro!DfsrRopLoadPrefixEntriesFromFile+0x416
11 fffff880`08d93250 fffff880`014b00c6 dfsrro!DfsrRoNewInstanceCallback+0x2e2
12 fffff880`08d932b0 fffff880`014af0cb fltmgr!FltpDoInstanceSetupNotification+0x86
13 fffff880`08d93310 fffff880`014afe81 fltmgr!FltpInitInstance+0x27b
14 fffff880`08d93380 fffff880`014b0d5b fltmgr!FltpCreateInstanceFromName+0x1d1
15 fffff880`08d93450 fffff880`014aed6c fltmgr!FltpEnumerateRegistryInstances+0x15b
16 fffff880`08d934f0 fffff880`014aa3f0 fltmgr!FltpDoFilterNotificationForNewVolume+0xec
17 fffff880`08d93560 fffff800`01dd22bb fltmgr!FltpCreate+0x3e0
18 fffff880`08d93610 fffff800`01dcddde nt!IopParseDevice+0x14e2
19 fffff880`08d93770 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
1a fffff880`08d93870 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
1b fffff880`08d93940 fffff800`01ddbd34 nt!IopCreateFile+0x2bc
1c fffff880`08d939e0 fffff800`01acd0d3 nt!NtCreateFile+0x78
1d fffff880`08d93a70 00000000`76fac28a nt!KiSystemServiceCopyEnd+0x13
1e 00000000`0219f748 00000000`00000000 0x76fac28a
Another thread in the RHS process, fffffab0`30ef2b50, is also waiting on the TmXPFlt.sys driver, with an identical call stack.
start             end                 module name
fffff880`0567d000 fffff880`056d3000   TmXPFlt    (no symbols)
    Loaded symbol image file: TmXPFlt.sys
    Image path: C:\Program Files (x86)\Trend Micro\OfficeScan Client\TmXPFlt.sys
    Image name: TmXPFlt.sys
    Browse all global symbols  functions  data
    Timestamp:        Wed Jun 10 18:54:43 2009 (4A2F90F4)
    CheckSum:         00040739
    ImageSize:        00056000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
This dump analysis shows that the RHS process deadlock is related to the Trend Micro driver, and per the analysis above, that is very likely why the disks failed to come online.
After uninstalling Trend Micro on the cluster nodes and running the patch update again, the problem was resolved.
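To check whether the Trend Micro filter drivers are present and attached on a node, before or after the uninstall, something like the following works (a sketch; the path is the one reported by lmvm above):
# File version of the filter driver, if still installed
Get-Item 'C:\Program Files (x86)\Trend Micro\OfficeScan Client\TmXPFlt.sys' | Format-List FullName, VersionInfo
# List loaded minifilters and look for TmXPFlt / TmPreFlt
fltmc filters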
In fact, Trend Micro's support site had already published the cause of this problem and its solutions:
https://success.trendmicro.com/solution/1060123-core-protection-module-cpm-endpoint-component-prevents-microsoft-cluster-failover
You can either apply the registry modification described on the Trend Micro page,
or upgrade the Trend Micro VSAPI scan engine to 9.5 or later; all VSAPI scan engines prior to 9.5 have this resource deadlock problem with the cluster.
That is the analysis process for this resource deadlock case; I hope it proves helpful to interested readers!