
VMware vSphere 5.1 Cluster in-depth Analysis (28)

VMware vSphere 5.1 Clustering Deepdive
HA, DRS, Storage DRS, Stretched Clusters

Duncan Epping & Frank Denneman
Translated by Tim2009

Catalogue

Copyright
About the Authors
Knowledge Points
Preface

Part I: vSphere High Availability
Chapter 1: Introduction to vSphere High Availability
Chapter 2: High Availability Components
Chapter 3: Fundamental Concepts
Chapter 4: Restarting Virtual Machines
Chapter 5: Adding Resiliency to High Availability (Network Redundancy)
Chapter 6: Admission Control
Chapter 7: Virtual Machine and Application Monitoring
Chapter 8: Integration
Chapter 9: Summary

Part II: vSphere DRS (Distributed Resource Scheduler)
Chapter 1: Introduction to vSphere DRS
Chapter 2: vMotion and EVC
Chapter 3: DRS Dynamic Entitlement
Chapter 4: Resource Pools and Controls
Chapter 5: Calculating DRS Recommendations
Chapter 6: DRS Recommendation Wizard
Chapter 7: Introduction to DPM
Chapter 8: Calculating DPM Recommendations
Chapter 9: DPM Recommendation Wizard
Chapter 10: Summary

Part III: vSphere Storage DRS
Chapter 1: Introduction to vSphere Storage DRS
Chapter 2: Storage DRS Algorithms
Chapter 3: Storage I/O Control (SIOC)
Chapter 4: Datastore Configuration
Chapter 5: Datastore Architecture and Design
Chapter 6: Impact on Storage vMotion
Chapter 7: Affinity
Chapter 8: Datastore Maintenance Mode
Chapter 9: Summary

Part IV: Stretched Clusters
Chapter 1: Stretched Cluster Architecture
Chapter 2: vSphere Configuration
Chapter 3: Failure Scenarios
Chapter 4: Summary
Chapter 5: Appendix

Part IV: Stretched Clusters

Chapter 2: vSphere Configuration

In this chapter, our focus is on the interaction between vSphere HA, vSphere DRS and Storage DRS in a stretched cluster environment, and on design and operational considerations around these vSphere components that are often overlooked or underestimated. Historically, much of the emphasis has been placed on the storage layer, with little attention paid to how the workloads themselves are configured and managed.

As mentioned earlier, the key drivers for stretched clusters are workload balancing and disaster avoidance. How can we be sure that our environment stays properly balanced without impacting availability or substantially increasing operational overhead? How do we establish configuration requirements and ongoing management processes, and how do we periodically verify that we still meet those requirements? Failing to define and follow these requirements makes the environment chaotic and less predictable than you would want it to be; in fact, neglecting these processes can lead to additional downtime during a failure event.

Each of these three vSphere features has specific configuration requirements that can enhance the resilience of your environment and the availability of your workloads. Throughout this section, architectural recommendations are made based on issues encountered in various scenarios during testing. Each failure scenario test is documented in the following chapters. Keep in mind that those failure scenarios apply directly to the example configuration described here; your environment may fail in additional ways depending on your implementation and configuration choices.

vSphere HA Characteristics

Our example environment has four hosts and a uniform stretched storage solution. A full site failure is a scenario that a resilient architecture must take into account, so we recommend enabling Admission Control. Because disaster avoidance is the main driver for many stretched cluster environments, it is recommended to provide enough capacity to tolerate a complete site failure. Since the hosts are divided equally across the two sites, and to guarantee that all workloads can be restarted by HA, we recommend configuring an admission control policy of 50%.
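As an illustration only (the book itself configures this through the vSphere Client), the following pyVmomi sketch shows how a 50% CPU / 50% memory admission control policy could be applied to a cluster. The connection object and the find_cluster() helper are hypothetical.

```python
# Sketch: percentage-based HA admission control with 50% CPU and 50% memory
# failover capacity. Assumes an existing pyVmomi connection; `find_cluster` is
# a hypothetical helper returning a vim.ClusterComputeResource.
from pyVmomi import vim

def set_admission_control_50(cluster):
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.enabled = True                  # vSphere HA enabled
    spec.dasConfig.admissionControlEnabled = True  # Admission Control enabled
    spec.dasConfig.admissionControlPolicy = \
        vim.cluster.FailoverResourcesAdmissionControlPolicy(
            cpuFailoverResourcesPercent=50,
            memoryFailoverResourcesPercent=50)
    # modify=True applies this partial spec on top of the existing configuration
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# Usage (hypothetical):
# cluster = find_cluster(si, "Stretched-Cluster-01")
# set_admission_control_50(cluster)
```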

We recommend the percentage-based policy because it offers the most architectural flexibility and the lowest operational overhead: when new hosts are added to the environment there is no need to change the percentage, and there is no risk of a skewed consolidation ratio caused by virtual machine-level reservations. See Chapter 6 for more details on admission control.

HA uses heartbeat mechanisms to validate the state of a host. As explained in Chapter 3, there are two heartbeat mechanisms, namely network heartbeating and datastore heartbeating. Network heartbeating is the primary mechanism HA uses to validate host state; datastore heartbeating is the secondary mechanism, used by HA to determine the state of a host once network heartbeating has failed.

If a host stops receiving heartbeats, it checks whether it is merely isolated from the other hosts or completely isolated from the network. It does this by pinging the default gateway or, instead of the gateway, one or more manually specified isolation addresses, which makes isolation detection more reliable. We recommend specifying at least two additional isolation addresses, each of them local to one of the sites. This enables HA to validate complete network isolation even when the connection between sites has failed, and it provides redundancy in case one of the addresses becomes unreachable.

If a host is isolated, vSphere HA triggers a response which, as explained earlier, is called the isolation response. The isolation response is triggered when the connection between the host and the management network is lost, to ensure the affected virtual machines remain properly managed. The isolation response is discussed in depth in Chapter 3; which response to choose depends on the storage and physical network implementation in use. We refer to the decision table in Chapter 4 (Table 3).

In our test environment, one of these addresses resides in the Frimley datacenter and the other in the Bluefin datacenter. Multiple isolation addresses are specified with the vSphere HA advanced setting das.isolationaddress; more details on how to configure this can be found in KB article 1002117.
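The book refers to the vSphere Client advanced options dialog and KB 1002117 for this; purely as an illustrative sketch, the same das.isolationaddress settings could be applied with pyVmomi roughly as follows. The two addresses are placeholders for one gateway per site.

```python
# Sketch: one isolation address per site via vSphere HA advanced options.
# The addresses are placeholders; `cluster` is a vim.ClusterComputeResource.
# Note: the option list supplied here may need to include any other advanced
# options you want to keep.
from pyVmomi import vim

def set_isolation_addresses(cluster, frimley_addr, bluefin_addr):
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.option = [
        vim.OptionValue(key="das.isolationaddress0", value=frimley_addr),
        vim.OptionValue(key="das.isolationaddress1", value=bluefin_addr),
        # Optional: skip the default gateway check when custom addresses are used.
        vim.OptionValue(key="das.usedefaultisolationaddress", value="false"),
    ]
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# Usage (placeholder addresses, one per site):
# set_isolation_addresses(cluster, "172.16.103.1", "172.16.203.1")
```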

For vSphere HA datastore heartbeating to function correctly in any failure scenario, we recommend increasing the number of heartbeat datastores from the default of two to four in a stretched cluster environment (the minimum is two and the maximum is five). This provides full redundancy at each location. Defining four specific datastores as preferred heartbeat datastores is also recommended, selecting two from one site and two from the other. This allows vSphere HA to heartbeat through a datastore even if the connection between sites fails. These datastores are particularly useful when, after such a failure, only part of the network at a site is still available.

The number of heartbeat datastores can be increased with the vSphere HA advanced setting das.heartbeatDsPerHost.

We recommend selecting "Select any of the cluster datastores taking into account my preferences". This allows vSphere HA to select any other datastore if the four designated heartbeat datastores we manually selected become unavailable; for example, if the connection between sites fails, vCenter ends up on one site and the hosts at the other site would have no way for HA to change their heartbeat datastores. A screenshot of this setting is shown below.

Figure 163: Datastore heartbeating
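As a sketch of the heartbeat datastore recommendations above (again assuming pyVmomi rather than the vSphere Client shown in the book), the preferred datastores, the candidate policy and das.heartbeatDsPerHost could be set roughly like this; the datastore names are placeholders.

```python
# Sketch: four preferred heartbeat datastores (two per site), the
# "take my preferences into account" candidate policy, and a raised
# das.heartbeatDsPerHost value. Datastore names are placeholders.
from pyVmomi import vim

def configure_heartbeat_datastores(cluster, preferred_names):
    # Resolve names to the datastore objects visible to this cluster.
    preferred = [ds for ds in cluster.datastore if ds.name in preferred_names]

    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    # "Select any of the cluster datastores taking into account my preferences"
    spec.dasConfig.hBDatastoreCandidatePolicy = "allFeasibleDsWithUserPreference"
    spec.dasConfig.heartbeatDatastore = preferred
    spec.dasConfig.option = [
        vim.OptionValue(key="das.heartbeatDsPerHost", value="4"),
    ]
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# Usage with two placeholder datastores per site:
# configure_heartbeat_datastores(
#     cluster, ["Frimley-01", "Frimley-02", "Bluefin-01", "Bluefin-02"])
```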

vSphere 5.0 U1 Permanent Device Loss (PDL) Enhancements

In vSphere 5.0 U1, handling of the Permanent Device Loss (PDL) condition was introduced, which allows virtual machines residing on an affected datastore to be failed over automatically; we will show a PDL scenario in one of the failure scenarios. A PDL condition is communicated by the array controller to ESXi through a specific SCSI sense code and indicates that a device (LUN) has become unavailable and is likely permanently unavailable. An example scenario in which the array communicates this condition is when a storage administrator sets a LUN offline. The condition is used during failures in non-uniform configurations so that ESXi can take the appropriate action when access to a LUN has been revoked. Note that when all storage fails, a PDL condition cannot be raised, because no communication between the array and the ESXi host is possible; that state is identified by the ESXi host as All Paths Down (APD).

It is important to realize that the settings described next apply only to a PDL condition, not to an APD condition. In our failure scenarios we will demonstrate the difference in behavior between the two conditions.

To allow vSphere HA to respond to a PDL condition, two advanced settings were introduced in vSphere 5.0 U1. The first is a host-level setting, disk.terminateVMOnPDLDefault, which is configured in /etc/vmware/settings and should be set to "True". Note that this is a per-host setting and that the host must be rebooted for it to take effect. The setting ensures that a virtual machine is killed when the datastore it resides on enters a PDL state. The virtual machine is killed only once it initiates I/O against a datastore in a PDL condition; also, if the virtual machine's files do not all reside on the same datastore and a PDL condition exists on only one of the datastores, the virtual machine may not be restarted by HA. This issue has been fixed in vSphere 5.1. To ensure that virtual machines can be failed over by HA in a PDL condition, we recommend setting disk.terminateVMOnPDLDefault to "True" and placing all of a virtual machine's files on a single datastore. Remember that a virtual machine is killed, and can only then be restarted, when it issues I/O; a virtual machine running a memory-intensive workload that does not issue I/O to the datastore may remain active in this situation.

The second setting is a vSphere HA advanced setting called das.maskCleanShutdownEnabled. This setting was introduced in vSphere 5.0 U1, is disabled by default, and needs to be set to "True" on your HA cluster. It allows HA to trigger a restart for virtual machines that were automatically killed due to a PDL condition. HA cannot tell whether a virtual machine was killed because of a PDL condition or shut down by an administrator; setting this flag to "True" makes HA assume the former. Note that a shutdown initiated by a user during an APD condition will still be flagged as such.

We recommend setting das.maskCleanShutdownEnabled to "True" in order to limit the downtime of virtual machines residing on a datastore in a PDL condition. If das.maskCleanShutdownEnabled is not set to "True" while a PDL condition exists and disk.terminateVMOnPDLDefault is set to "True", virtual machines will not be restarted after they are killed, because HA will assume they were powered off (or shut down) by the administrator.
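To tie the two PDL-related settings together, here is a sketch: the host-level value is a line in /etc/vmware/settings on each ESXi host (followed by a reboot), while das.maskCleanShutdownEnabled is a cluster-level HA advanced option. The pyVmomi call below is an illustrative assumption, not the authors' procedure.

```python
# Sketch of the two PDL-related settings discussed above.
#
# 1) Host level (per the text: edited in /etc/vmware/settings on every ESXi
#    5.0 U1 host, followed by a host reboot). The expected line is:
#
#        disk.terminateVMOnPDLDefault = "TRUE"
#
# 2) Cluster level: the vSphere HA advanced option das.maskCleanShutdownEnabled.
from pyVmomi import vim

def enable_mask_clean_shutdown(cluster):
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo(
        option=[vim.OptionValue(key="das.maskCleanShutdownEnabled", value="true")])
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```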

vSphere DRS

vSphere DRS is used in many environments to distribute load within a cluster, and it offers many other features that are helpful in a stretched environment. We recommend enabling vSphere DRS to allow load balancing across the hosts in the cluster. The vSphere DRS load balancing calculation is based on CPU and memory usage; therefore, care must also be taken with storage and network resource utilization and traffic flow to avoid unnecessary storage and network traffic overhead in a stretched cluster environment. We recommend implementing vSphere DRS affinity rules to allow a logical and predictable separation of virtual machines. This helps improve availability, and for virtual machines responsible for infrastructure services such as Active Directory and DNS, it helps ensure that these services are separated across sites.

vSphere DRS affinity rules also help prevent unnecessary downtime and storage and network traffic overhead. We recommend aligning vSphere VM-Host affinity rules with the storage configuration; that is, setting VM-Host affinity rules so that a virtual machine prefers to run on a host at the same site as the array configured as the primary read/write node for its datastore. For example, in our test configuration, virtual machines stored on the Frimley-01 datastore have a VM-Host affinity rule that prefers hosts in the Frimley datacenter. This ensures that the virtual machine does not lose its connection to the storage system when the network connection between sites fails. VM-Host affinity rules configured according to these recommendations ensure that virtual machines stay local to their primary datastore; as a result, all read I/O from the virtual machines stays local to their site. Note: different storage vendors use different terminology to describe the relationship between a LUN and an array or controller. In this chapter we use the generic term "storage site affinity", which refers to the preferred location for read and write access to a LUN.

We recommend implementing "should" rules, because these can be violated by HA in the event of a failure, and availability of services should always outweigh performance. With "must" rules, HA will never violate the rule set, which can lead to service outages in the event of a site or host failure. In a datacenter failure scenario, "must" rules would make it impossible for vSphere HA to restart the virtual machines, because they would not have the affinity rules required to allow them to be powered on at the hosts in the other datacenter. vSphere DRS communicates these rules to HA, which stores them in a compatibility list governing where virtual machines are allowed to start. Note that in some circumstances, such as a large host imbalance combined with an aggressive migration threshold, vSphere DRS may violate "should" rules. Although this is very rare, we recommend monitoring for rule violations, as they can affect the availability and performance of your workload.

We recommend manually defining "sites" by creating a group of hosts per site and adding virtual machines to those sites based on the affinity of the datastores they reside on. In our scenario, only a limited number of virtual machines were provisioned. We recommend automating the definition of site affinity rules using vCenter Orchestrator or PowerCLI. If automation is not an option, we recommend using a generic naming convention to simplify the creation of these groups, and validating the groups regularly to ensure that the virtual machines in each group have the correct site affinity.
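The authors recommend vCenter Orchestrator or PowerCLI for this automation; as a rough pyVmomi equivalent (an assumption on our part, not the authors' script), creating a per-site VM group, host group and "should run on hosts in group" rule might look like the sketch below. All group names, VM lists and host lists are placeholders.

```python
# Sketch: create a VM group, a host group and a non-mandatory ("should run on
# hosts in group") VM-Host affinity rule for one site. All names are placeholders.
from pyVmomi import vim

def create_site_affinity_rule(cluster, site, vms, hosts):
    vm_group_name = "%s-VMs" % site
    host_group_name = "%s-Hosts" % site

    spec = vim.cluster.ConfigSpecEx()
    spec.groupSpec = [
        vim.cluster.GroupSpec(operation="add",
                              info=vim.cluster.VmGroup(name=vm_group_name, vm=vms)),
        vim.cluster.GroupSpec(operation="add",
                              info=vim.cluster.HostGroup(name=host_group_name,
                                                         host=hosts)),
    ]
    spec.rulesSpec = [
        vim.cluster.RuleSpec(operation="add",
                             info=vim.cluster.VmHostRuleInfo(
                                 name="%s VMs should run on %s hosts" % (site, site),
                                 enabled=True,
                                 mandatory=False,  # "should" rule, HA may violate it
                                 vmGroupName=vm_group_name,
                                 affineHostGroupName=host_group_name)),
    ]
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# Usage with placeholder VM/host lists for one site; repeat for the other site:
# create_site_affinity_rule(cluster, "Bluefin", bluefin_vms, bluefin_hosts)
```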

The following screenshots show the configuration used for this scenario. In the first screenshot, all virtual machines that should remain local to the Bluefin site are placed in the Bluefin VM group.

Figure 164: DRS Group - Virtual Machines

Next, a Bluefin host group is created that contains all hosts local to Bluefin.

Figure 165: DRS Group - Hosts

Finally, a new rule is created for Bluefin, defining a "should run on" rule that links the Bluefin VM group to the Bluefin host group.

Figure 166: VM-Host rules

This should be done for both sites, which results in four groups and two rules.

Figure 167: Resulting rules

Correcting Affinity Rule Violations

DRS assigns a high priority to correcting affinity rule violations. During an invocation, the primary goal of DRS is to correct any violations and generate migration recommendations that move virtual machines onto the hosts listed in the cluster host group. These migrations have a higher priority than load balancing and are started before load balancing moves.

DRS is invoked every 5 minutes by default, but it is also triggered when the cluster detects changes. When a host reconnects to the cluster, for instance, DRS is invoked and generates recommendations to correct any identified violations. Our testing showed that DRS generates recommendations to correct affinity rule violations within 30 seconds after a host reconnects to the cluster. Note that DRS is limited by the overall throughput of the vMotion network, which means that multiple invocations may be required before all affinity rule violations are corrected.

vSphere Storage DRS

Storage DRS allows datastores to be aggregated into a single unit of consumption from an administrative perspective, and it balances virtual machine disks when defined performance or capacity thresholds are exceeded. Storage DRS ensures that sufficient disk resources are available to your workload. We recommend enabling Storage DRS.

Storage DRS uses Storage vMotion to migrate virtual machine disks between datastores within a datastore cluster. Because the underlying stretched storage system uses synchronous replication, a migration or a series of migrations will have an impact on replication traffic and may cause virtual machines to become temporarily unavailable due to contention for network resources while the disks are moved. Migrating disks to arbitrary datastores in a uniform access configuration may also lead to additional I/O latency if the virtual machines are not migrated along with their disks. For example, if a virtual machine running on a host in Frimley has its disks migrated to a datastore in Bluefin, it will continue to operate, but possibly with degraded performance: its disk reads now incur the latency of reading from the virtual iSCSI IP in Bluefin and are subject to the inter-site latency.

To control when migrations occur, we recommend configuring Storage DRS in manual mode. This allows each recommendation to be validated by hand and applied during off-peak hours, while still gaining the operational benefit and efficiency of the initial placement functionality.
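As an illustration of that recommendation (again a pyVmomi sketch under our own assumptions; the datastore cluster object is a placeholder), switching a datastore cluster to Storage DRS manual mode could look like this.

```python
# Sketch: enable Storage DRS in manual mode on a datastore cluster (storage pod),
# so every migration recommendation has to be approved by an administrator.
# `si` is a connected pyVmomi ServiceInstance; `pod` is a vim.StoragePod.
from pyVmomi import vim

def set_storage_drs_manual(si, pod):
    pod_spec = vim.storageDrs.PodConfigSpec(
        enabled=True,
        defaultVmBehavior="manual")  # recommendations are not applied automatically
    sdrs_spec = vim.storageDrs.ConfigSpec(podConfigSpec=pod_spec)
    srm = si.content.storageResourceManager
    return srm.ConfigureStorageDrsForPod_Task(pod=pod, spec=sdrs_spec, modify=True)
```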

We recommend creating datastore clusters based on the storage configuration and its storage site affinity: datastores with affinity for site A should not be mixed in a datastore cluster with datastores with affinity for site B. This keeps operations consistent and eases the creation and ongoing maintenance of DRS VM-Host affinity rules. Accordingly, whenever virtual machines are migrated between datastore clusters and across the defined storage site affinity boundary, make sure all relevant vSphere DRS VM-Host affinity rules are updated. We recommend aligning the naming conventions for datastore clusters and VM-Host affinity rules to simplify the configuration and management process.

The naming convention used in our tests gives datastores and datastore clusters a site-specific name, which simplifies aligning DRS host affinity with the virtual machines provisioned at that site. The site-specific storage for our sites "Bluefin" and "Frimley" is shown in the figure below. Note that the vCenter Maps feature cannot be used to view the site affinity of the storage, as it does not display datastore cluster objects.

Figure 168: Datastore cluster architecture
