Example Analysis of APD and PDL in vmware

2025-01-18 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report --

This article walks through an example analysis of APD and PDL in VMware. I hope you find something useful in it.

APD and PDL conditions are among the thornier problems in virtualization operations and maintenance, and need to be handled carefully.

All Paths Down (APD):

The data store is displayed as unavailable in the Storage view.

The storage adapter indicates that the operational state of the device is Dead or Error.

Permanent Device Loss (PDL):

The data store is displayed as unavailable in the Storage view.

The storage adapter indicates that the operational state of the device is Lost Communication.

APD explained:

In vSphere 4.x, if all paths to a device fail, an All Paths Down (APD) condition occurs. Since there is no indication of whether the device loss is permanent or temporary, the ESXi host keeps retrying to establish a connection. An APD condition typically occurs when a LUN is unpresented from an ESXi/ESX host incorrectly. The ESXi/ESX host still believes the device is available and retries all SCSI commands indefinitely. This affects the management agents, which stop responding to commands until the device becomes accessible again, and can cause the ESXi/ESX host to appear inaccessible/unresponsive in vCenter Server.

vSphere 5.x/6.x draws a clear distinction between devices that are permanently lost (PDL) and those suffering a temporary All Paths Down (APD) problem of unknown cause.

For example, if a storage device logs the SCSI sense code H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 (Logical Unit Not Supported) to an ESXi 5.x/6.x host in the VMkernel log, the device is considered permanently inaccessible to the host, that is, in a permanent device loss (PDL) state. The ESXi host then no longer attempts to reestablish the connection or issue commands to the device.
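As a rough illustration of the distinction above, the following sketch classifies a log line as PDL when it carries the sense data quoted in the text. This is a hypothetical helper, not a VMware tool; the regex is an assumption about the line layout for illustration only.

```python
import re

# SCSI sense data for "Logical Unit Not Supported", as quoted above:
# sense key 0x5 (ILLEGAL REQUEST), ASC 0x25, ASCQ 0x0
PDL_SENSE = ("0x5", "0x25", "0x0")

def is_pdl_sense(log_line: str) -> bool:
    """Return True if the line carries the PDL sense data 0x5 0x25 0x0."""
    m = re.search(
        r"Valid sense data:\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)",
        log_line,
    )
    return bool(m) and tuple(g.lower() for g in m.groups()) == PDL_SENSE
```

A line without this sense data (or with different sense values) would be treated as a candidate APD condition instead, per the distinction the article draws next.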

Devices that encounter unrecoverable hardware errors are also recognized as being in a permanent device loss (PDL) state.

If no PDL SCSI sense code is returned from the device (because the storage array cannot be contacted, or the storage array does not return a supported PDL SCSI code), the device is in an All Paths Down (APD) state, and the ESXi host continues to retry I/O requests until it receives a response.

Because the ESXi host cannot determine whether the device loss is permanent (PDL) or temporary (APD), it retries SCSI I/O indefinitely, including:

User world I/O (hostd management agent)

Virtual machine guest I/O

Note: for I/O issued from within the guest, the guest operating system will time out and abort the I/O.

Because of the nature of the APD condition, there is no easy way to recover.

The APD condition needs to be resolved at the storage array / fabric layer to restore connectivity to the host.

All affected ESXi hosts may need to be rebooted to remove any residual references to affected devices in the APD state.

Note:

vMotion migration of the unaffected virtual machines may not be possible, because the management agents can themselves be affected by the APD condition and the ESXi host may become unmanageable. As a result, rebooting the affected ESXi host forcibly disrupts all unaffected virtual machines on that host.

With vSphere HA, vSphere 6.0 and later introduce a powerful new feature called Virtual Machine Component Protection (VMCP). VMCP protects virtual machines from storage-related events, specifically permanent device loss (PDL) and All Paths Down (APD) events.

Note: after an APD event, a LUN connected to the ESXi host may remain inaccessible even after the LUN's paths are restored, and the 140s APD timeout may still expire after the storage path recovers. In the /var/log/vmkernel.log file you will encounter the following events in sequence:

The device enters the APD state.

The device exits the APD state.

The heartbeat on the device is restored, yet file system operations fail with timeout, not found, or busy errors.

The APD timeout expires, even though the device previously exited the APD state.

This condition is associated with one or more of the following behaviors:

The virtual machine is inaccessible.

The host is not responding.

The storage remains offline even though the paths are restored and available.

vSphere Client does not display the data store even though virtual machines still reside on it.

One or more of the following events may trigger an APD event:

An upstream Fibre Channel or Ethernet switch link failure affecting all paths to the storage array

A storage array failure or reboot

A storage array firmware update (some vendors)

Of course, this behavior does not occur in all APD events. In most cases, the LUN and datastore will exit the APD timeout as expected.
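The problematic sequence above (timeout expiring after the device already exited APD) can be sketched as a simple ordered check over log events. This is a hypothetical illustration; the event strings are simplified stand-ins, not exact vmkernel.log text.

```python
def stuck_apd(events: list[str]) -> bool:
    """True if an APD timeout expired *after* the device had already exited APD,
    i.e. the stuck-APD symptom described above."""
    exited = False
    for ev in events:
        if "exited the APD state" in ev:
            exited = True
        elif "APD timeout expired" in ev and exited:
            return True
    return False
```

In the normal case the timeout either never fires or fires while the device is still down; only the out-of-order combination signals the bug described here.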

Reason:

The cause is a failure during APD handling. When the problem occurs, the LUN paths are available and online during the APD event, but the APD timer keeps counting until the LUN enters the APD timeout state. After the initial APD event, the data store remains inaccessible for as long as active workloads are associated with it.

When this problem is encountered, the virtual machines must be terminated to restore the data store. HA, if enabled, should restart these virtual machines on other hosts. If the management agents must be restarted, the hosts will temporarily be unmanageable through vCenter Server.

Planned versus unplanned PDL:

A planned PDL occurs when you intend to remove a device presented to an ESXi host. You must first unmount the data store and then detach the device before making the storage device unavailable on the storage array. For more information about how to properly unpresent a LUN in ESXi 5.x, see how to unmount a LUN or detach a data store/storage device from an ESXi host (2072353).

An unplanned PDL occurs if the storage device is unpresented from the storage array unexpectedly, without first unmounting and detaching it on the ESXi host.

In ESXi 5.5, VMware introduced a feature called PDL AutoRemove to automatically remove devices during an unplanned PDL. For more information, see PDL AutoRemove feature in vSphere 5.5 (2059622).

To clear an unplanned PDL:

All running virtual machines on the data store must be powered off and unregistered from vCenter Server.

From vSphere Client, go to the Configuration tab of the ESXi host and click Storage.

Right-click the data store you want to remove, and then click Unmount.

The Confirm Datastore Unmount window appears. If the prerequisites are met, the OK button is enabled.

If you see the following error when unmounting the LUN:

Failed to call data store refresh for object on vCenter Server

You may have presented a snapshot LUN. To resolve this issue, remove the snapshot LUN on the array side.

Perform a rescan on all ESXi hosts that have visibility to the LUN.

Note: if there are active references to the device or pending I/O, the ESXi host will still list the device after the rescan. Check for virtual machines, templates, ISO images, floppy images, and raw device mappings that may still hold active references to the device or data store.

If the LUN is still in use and becomes available again, go to each host, right-click the LUN, and click Mount.

Note: one possible cause of an unplanned PDL is a LUN running out of space, which makes it inaccessible.

vCenter 6.0 solutions:

If virtual machine component protection (VMCP) is enabled, vSphere HA can detect data store accessibility failures and provide automatic recovery for affected virtual machines.

VMCP protects against data store accessibility failures that affect virtual machines running on hosts in a vSphere HA cluster. When a data store accessibility failure occurs, the affected host can no longer access the storage path for a specific data store. You can determine how vSphere HA responds to such failures, from creating event alarms to restarting the virtual machines on another host.

Note:

When using the virtual machine component protection feature, the version of the ESXi host must be version 6.0 or later.

Fault type

There are two types of data store accessibility failures:

PDL

PDL (permanent device loss) is an unrecoverable loss of accessibility that occurs when a storage device reports that the host can no longer access the data store. This condition cannot be reverted without powering off the virtual machines.

APD

APD (All Paths Down) indicates a temporary or unknown loss of accessibility, or any other unidentified delay in I/O processing. This type of accessibility problem is recoverable.

Configure VMCP

Configure Virtual Machine Component Protection in vSphere Web Client. Go to the Configure tab, click vSphere Availability, and then Edit. Under Failures and Responses you can select Datastore with PDL or Datastore with APD. The level of storage protection you can choose and the virtual machine remediation actions available vary with the type of data store accessibility failure.

PDL failure

Under Datastore with PDL, you can choose to issue events, or to power off and restart the virtual machines.

APD failure

Responding to APD events is more complex and correspondingly finer-grained to configure. You can choose to issue events, to power off and restart the virtual machines with a conservative restart policy, or to power off and restart the virtual machines with an aggressive restart policy.

APD and PDL handling follows several timed stages, which are:

APD description:

0s - the APD timer starts counting.

140s - the ESXi host hits the APD Timeout and then executes Fast Fail for non-VM I/O against the failed device. This timeout period can be modified.

140s-320s - the APD Timeout has expired, but the VMCP timeout has not yet arrived. If the failed storage device returns to normal during this window, the Response for APD recovery after APD timeout option can be configured to ensure the VMs are not forcibly reset.

320s - the VMCP timeout expires, activating the Response for Datastore with All Paths Down (APD).

PDL description:

0s - with PDL, VMs are immediately restarted on a healthy ESXi host.

The VMCP timeout is 320 seconds, which includes the default 140-second APD timeout. The VMCP component is enabled by checking the Protect against Storage Connectivity Loss option in the vSphere HA settings.
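The timed stages above can be sketched as a small lookup using the default timeouts (140s APD Timeout, 320s VMCP timeout). The phase names are ours, not VMware terminology, and the constants assume the defaults have not been modified.

```python
APD_TIMEOUT_S = 140   # default APD Timeout (modifiable, per the text)
VMCP_TIMEOUT_S = 320  # VMCP timeout, which includes the 140s APD timeout

def apd_phase(seconds_since_apd: float) -> str:
    """Return which stage of APD handling a given elapsed time falls in."""
    if seconds_since_apd < APD_TIMEOUT_S:
        return "retrying"        # 0-140s: host keeps retrying I/O
    if seconds_since_apd < VMCP_TIMEOUT_S:
        return "fast-fail"       # 140-320s: non-VM I/O is fast-failed
    return "vmcp-response"       # >=320s: the configured VMCP response activates
```

The 140-320s window is exactly where the Response for APD recovery after APD timeout option applies, should the device come back during that interval.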

The configuration options for VMCP are as follows:

VM restart priority - virtual machine restart priority setting

Response for Host Isolation - the host's response when it becomes isolated

Response for Datastore with Permanent Device Loss (PDL) - three configuration options: Disabled; Issue events (take no remediation action, only send notifications); and Power off and restart VMs (attempt to restart the affected VMs)

Response for Datastore with All Paths Down (APD) - four configuration options: Disabled; Issue events (take no remediation action, only send notifications); Power off and restart VMs (conservative) (affected VMs are killed and then restarted on an ESXi host with healthy connectivity; this is not activated if the failed host cannot communicate with the Master host); and Power off and restart VMs (aggressive) (affected VMs are killed regardless of whether any host is available to restart them, and regardless of whether the Master host exists, can communicate with other hosts, or has sufficient resources)

Response for APD recovery after APD timeout - this option controls what happens when the storage device returns to normal after the APD Timeout (140s) but before the VMCP timeout (320s). It has two configuration options: Disabled, and Reset VMs (the VMs are forcibly reset on the host they were on before the APD occurred)
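The option list above can be captured as a lookup table, for instance if you wanted to sanity-check a cluster configuration script. This is a hypothetical sketch; the strings mirror the UI labels quoted in the text, not any VMware API.

```python
# VMCP response settings and their recognized values, per the text above.
VMCP_OPTIONS = {
    "Response for Datastore with PDL": {
        "Disabled", "Issue events", "Power off and restart VMs",
    },
    "Response for Datastore with APD": {
        "Disabled", "Issue events",
        "Power off and restart VMs (conservative)",
        "Power off and restart VMs (aggressive)",
    },
    "Response for APD recovery after APD timeout": {
        "Disabled", "Reset VMs",
    },
}

def valid_vmcp_setting(setting: str, value: str) -> bool:
    """True if `value` is a recognized choice for the given VMCP setting."""
    return value in VMCP_OPTIONS.get(setting, set())
```

Note that Reset VMs is valid only for the APD-recovery setting, not for the PDL or APD responses themselves.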

Note:

If you disable the Host Monitoring or VM restart priority settings, VMCP cannot perform virtual machine restarts. However, storage health can still be monitored and events can still be issued.

A supplement to the APD solution:

This issue has been resolved in ESXi 6.0 Update 1 (available from VMware Downloads). For more information, see VMware ESXi 6.0 Update 1 Release Notes.

If you cannot upgrade, no other measure guarantees that you will not hit this problem during an APD event. However, when the problem occurs, there are two workarounds to resume production.

To resolve this issue temporarily, use one of the following options:

1. Terminate all outstanding workloads on the affected LUNs. For information about unplanned PDL, see Cannot remount a datastore after an unplanned permanent device loss (PDL) (2014155). Note: you may also need to restart the ESXi management agents. For more information, see Restarting the Management agents on an ESXi or ESX host (1003490).

2. Reboot all hosts whose volumes are in the "APD timeout" state.

Other additions:

Split-brain

When a split-brain occurs in a cluster, each node wrongly assumes the other has failed because they cannot communicate with each other, so both the primary server and the standby server bring up the floating IP and the associated services. If both servers' external connections remain up, some users will inevitably reach the primary server while others reach the standby. Furthermore, if the two servers share a storage device, during a split-brain both servers will mount the storage device and access the same files simultaneously. If the shared storage lacks a good locking mechanism, concurrent reads and writes are likely to corrupt the files on the device, producing inconsistent data on disk, later data errors, or even damage to the entire database, with severe consequences.

The countermeasures I currently know for dealing with split-brain in an HA system are:

1) Add redundant heartbeat links, such as dual lines, to minimize the chance of split-brain.

2) Enable disk locks. The serving side locks the shared disk, so when a split-brain occurs the other side cannot take over the shared disk resources at all. But locked disks also have a big problem: if the side holding the shared disk does not actively unlock it, the other side can never obtain the disk. In practice, if the serving node suddenly crashes, it cannot execute the unlock command, and the standby node cannot take over the shared resources and application services. So some HA designs use a "smart" lock: the serving side enables the disk lock only when it finds all heartbeat lines disconnected (it can no longer see the peer); in normal operation the disk is not locked.

3) Set up an arbitration mechanism. For example, configure a reference IP (such as the gateway IP). When the heartbeat lines are completely disconnected, both nodes ping the reference IP. If a node cannot reach it, the break is on its own side: not only the heartbeat but also its local network link for external service is down, so starting (or continuing) the application service would be useless. That node then voluntarily gives up the contest, and the node that can ping the reference IP starts the service. To be safer, the node that cannot ping the reference IP simply reboots itself, to completely release any shared resources it may still hold.
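The arbitration rule above can be sketched as a pure decision function. This is a hypothetical illustration, assuming reachability has already been tested (a real implementation would ping the peer and the reference IP); the action names are ours.

```python
def arbitration_decision(heartbeat_ok: bool, can_ping_reference_ip: bool) -> str:
    """Decide a node's action under the reference-IP quorum rule above."""
    if heartbeat_ok:
        return "normal"        # peer still visible: no split-brain handling needed
    if can_ping_reference_ip:
        return "take-service"  # break is on the peer's side: safe to serve
    return "self-reboot"       # local link is down: release any shared resources
```

The asymmetry is the point: after a total heartbeat loss, at most one node can still reach the reference IP over a healthy local link, so at most one node ends up serving.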

That concludes this example analysis of APD and PDL in VMware. If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
