This document supports the version of each product listed and
supports all subsequent versions until the document is
replaced by a new edition. To check for more recent editions
of this document, see http://www.vmware.com/support/pubs.
EN-001810-02
vSphere Availability
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
Fault Tolerance Requirements, Limits, and Licensing 46
Fault Tolerance Interoperability 47
Preparing Your Cluster and Hosts for Fault Tolerance 49
Using Fault Tolerance 51
Best Practices for Fault Tolerance 55
Legacy Fault Tolerance 57
Index 61
VMware, Inc. 3
About vSphere Availability
vSphere Availability describes solutions that provide business continuity, including how to establish
vSphere® High Availability (HA) and vSphere Fault Tolerance.
Intended Audience
This information is for anyone who wants to provide business continuity through the vSphere HA and Fault
Tolerance solutions. The information in this book is for experienced Windows or Linux system
administrators who are familiar with virtual machine technology and data center operations.
Updated Information
This vSphere Availability documentation is updated with each release of the product or when necessary.
This table provides the update history of vSphere Availability.

Revision      Description
EN-001810-02  Change to wording about dedicated FT network under Fault Tolerance Requirements. See "Fault Tolerance Requirements, Limits, and Licensing," on page 46.
EN-001810-01  New note about ESXi host version needed for VM Component Protection feature. See "VM Component Protection," on page 19.
EN-001810-00  Initial release.
Chapter 1 Business Continuity and Minimizing Downtime
Downtime, whether planned or unplanned, brings with it considerable costs. However, solutions to ensure
higher levels of availability have traditionally been costly, hard to implement, and difficult to manage.
VMware software makes it simpler and less expensive to provide higher levels of availability for important
applications. With vSphere, organizations can easily increase the baseline level of availability provided for
all applications as well as provide higher levels of availability more easily and cost effectively. With
vSphere, you can:
- Provide higher availability independent of hardware, operating system, and applications.
- Reduce planned downtime for common maintenance operations.
- Provide automatic recovery in cases of failure.
vSphere makes it possible to reduce planned downtime, prevent unplanned downtime, and recover rapidly
from outages.
This chapter includes the following topics:
- “Reducing Planned Downtime,” on page 9
- “Preventing Unplanned Downtime,” on page 10
- “vSphere HA Provides Rapid Recovery from Outages,” on page 10
- “vSphere Fault Tolerance Provides Continuous Availability,” on page 11
Reducing Planned Downtime
Planned downtime typically accounts for over 80% of data center downtime. Hardware maintenance, server
migration, and firmware updates all require downtime for physical servers. To minimize the impact of this
downtime, organizations are forced to delay maintenance until inconvenient and difficult-to-schedule
downtime windows.
vSphere makes it possible for organizations to dramatically reduce planned downtime. Because workloads
in a vSphere environment can be dynamically moved to different physical servers without downtime or
service interruption, server maintenance can be performed without requiring application and service
downtime. With vSphere, organizations can:
- Eliminate downtime for common maintenance operations.
- Eliminate planned maintenance windows.
- Perform maintenance at any time without disrupting users and services.
The vSphere vMotion® and Storage vMotion functionality in vSphere makes it possible for organizations to
reduce planned downtime because workloads in a VMware environment can be dynamically moved to
different physical servers or to different underlying storage without service interruption. Administrators
can perform faster and completely transparent maintenance operations, without being forced to schedule
inconvenient maintenance windows.
Preventing Unplanned Downtime
While an ESXi host provides a robust platform for running applications, an organization must also protect
itself from unplanned downtime caused by hardware or application failures. vSphere builds important
capabilities into data center infrastructure that can help you prevent unplanned downtime.
These vSphere capabilities are part of virtual infrastructure and are transparent to the operating system and
applications running in virtual machines. These features can be configured and utilized by all the virtual
machines on a physical system, reducing the cost and complexity of providing higher availability. Key
availability capabilities are built into vSphere:
- Shared storage. Eliminate single points of failure by storing virtual machine files on shared storage, such as Fibre Channel or iSCSI SAN, or NAS. SAN mirroring and replication features can be used to keep updated copies of virtual disks at disaster recovery sites.
- Network interface teaming. Provide tolerance of individual network card failures.
In addition to these capabilities, the vSphere HA and Fault Tolerance features can minimize or eliminate
unplanned downtime by providing rapid recovery from outages and continuous availability, respectively.
vSphere HA Provides Rapid Recovery from Outages
vSphere HA leverages multiple ESXi hosts configured as a cluster to provide rapid recovery from outages
and cost-effective high availability for applications running in virtual machines.
vSphere HA protects application availability in the following ways:
- It protects against a server failure by restarting the virtual machines on other hosts within the cluster.
- It protects against application failure by continuously monitoring a virtual machine and resetting it in the event that a failure is detected.
- It protects against datastore accessibility failures by restarting affected virtual machines on other hosts which still have access to their datastores.
- It protects virtual machines against network isolation by restarting them if their host becomes isolated on the management or Virtual SAN network. This protection is provided even if the network has become partitioned.
Unlike other clustering solutions, vSphere HA provides the infrastructure to protect all workloads:
- You do not need to install special software within the application or virtual machine. All workloads are protected by vSphere HA. After vSphere HA is configured, no actions are required to protect new virtual machines. They are automatically protected.
- You can combine vSphere HA with vSphere Distributed Resource Scheduler (DRS) to protect against failures and to provide load balancing across the hosts within a cluster.
vSphere HA has several advantages over traditional failover solutions:
Minimal setup. After a vSphere HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.

Reduced hardware cost and setup. The virtual machine acts as a portable container for the applications and it can be moved among hosts. Administrators avoid duplicate configurations on multiple machines. When you use vSphere HA, you must have sufficient resources to fail over the number of hosts you want to protect with vSphere HA. However, the vCenter Server system automatically manages resources and configures clusters.

Increased application availability. Any application running inside a virtual machine has access to increased availability. Because the virtual machine can recover from hardware failure, all applications that start at boot have increased availability without increased computing needs, even if the application is not itself a clustered application. By monitoring and responding to VMware Tools heartbeats and restarting nonresponsive virtual machines, vSphere HA protects against guest operating system crashes.

DRS and vMotion integration. If a host fails and virtual machines are restarted on other hosts, DRS can provide migration recommendations or migrate virtual machines for balanced resource allocation. If one or both of the source and destination hosts of a migration fail, vSphere HA can help recover from that failure.
vSphere HA provides a base level of protection for your virtual machines by restarting virtual machines in
the event of a host failure. vSphere Fault Tolerance provides a higher level of availability, allowing users to
protect any virtual machine from a host failure with no loss of data, transactions, or connections.
vSphere Fault Tolerance Provides Continuous Availability

Fault Tolerance provides continuous availability by ensuring that the states of the Primary and Secondary
VMs are identical at any point in the instruction execution of the virtual machine.
If either the host running the Primary VM or the host running the Secondary VM fails, an immediate and
transparent failover occurs. The functioning ESXi host seamlessly becomes the Primary VM host without
losing network connections or in-progress transactions. With transparent failover, there is no data loss and
network connections are maintained. After a transparent failover occurs, a new Secondary VM is respawned
and redundancy is re-established. The entire process is transparent and fully automated and occurs even if
vCenter Server is unavailable.
Chapter 2 Creating and Using vSphere HA Clusters
vSphere HA clusters enable a collection of ESXi hosts to work together so that, as a group, they provide
higher levels of availability for virtual machines than each ESXi host can provide individually. When you
plan the creation and usage of a new vSphere HA cluster, the options you select affect the way that cluster
responds to failures of hosts or virtual machines.
Before you create a vSphere HA cluster, you should know how vSphere HA identifies host failures and
isolation and how it responds to these situations. You also should know how admission control works so
that you can choose the policy that fits your failover needs. After you establish a cluster, you can customize
its behavior with advanced options and optimize its performance by following recommended best practices.
NOTE You might get an error message when you try to use vSphere HA. For information about error
messages related to vSphere HA, see the VMware knowledge base article at
http://kb.vmware.com/kb/1033634.
This chapter includes the following topics:
- “How vSphere HA Works,” on page 13
- “vSphere HA Admission Control,” on page 23
- “vSphere HA Interoperability,” on page 29
- “Creating and Configuring a vSphere HA Cluster,” on page 32
- “Best Practices for vSphere HA Clusters,” on page 40
How vSphere HA Works
vSphere HA provides high availability for virtual machines by pooling the virtual machines and the hosts
they reside on into a cluster. Hosts in the cluster are monitored and in the event of a failure, the virtual
machines on a failed host are restarted on alternate hosts.
When you create a vSphere HA cluster, a single host is automatically elected as the master host. The master
host communicates with vCenter Server and monitors the state of all protected virtual machines and of the
slave hosts. Different types of host failures are possible, and the master host must detect and appropriately
deal with the failure. The master host must distinguish between a failed host and one that is in a network
partition or that has become network isolated. The master host uses network and datastore heartbeating to
determine the type of failure.
You can watch a video about vSphere HA clusters: vSphere HA Clusters (http://link.brightcove.com/services/player/bcpid2296383276001?bctid=ref:vSphereHAClusters)
Master and Slave Hosts
When you add a host to a vSphere HA cluster, an agent is uploaded to the host and configured to
communicate with other agents in the cluster. Each host in the cluster functions as a master host or a slave
host.
When vSphere HA is enabled for a cluster, all active hosts (those not in standby or maintenance mode, and
not disconnected) participate in an election to choose the cluster's master host. The host that mounts the
greatest number of datastores has an advantage in the election. Only one master host typically exists per
cluster and all other hosts are slave hosts. If the master host fails, is shut down or put in standby mode, or is
removed from the cluster, a new election is held.
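As an illustration of the election rule just described, the following Python sketch picks a master from a set of active hosts by datastore count. The host names, the dictionary shape, and the tie-break by name are hypothetical simplifications, not the actual agent election protocol.

```python
def elect_master(hosts):
    """Pick the master host: among active hosts, most mounted datastores wins.

    hosts maps a host name to {"active": bool, "datastores": int}.
    Returns the winning host name, or None if no host is active.
    """
    candidates = {name: h for name, h in hosts.items() if h["active"]}
    if not candidates:
        return None
    # Break ties deterministically by name so repeated elections agree
    # (a simplification; the real agents use their own tie-break).
    return max(candidates, key=lambda name: (candidates[name]["datastores"], name))

# Hypothetical cluster: esx-03 is in maintenance mode, so it cannot win.
cluster = {
    "esx-01": {"active": True, "datastores": 4},
    "esx-02": {"active": True, "datastores": 6},
    "esx-03": {"active": False, "datastores": 9},
}
```

A new call to `elect_master` after the current master is removed models the re-election described above.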
The master host in a cluster has a number of responsibilities:
- Monitoring the state of slave hosts. If a slave host fails or becomes unreachable, the master host identifies which virtual machines need to be restarted.
- Monitoring the power state of all protected virtual machines. If one virtual machine fails, the master host ensures that it is restarted. Using a local placement engine, the master host also determines where the restart should be done.
- Managing the lists of cluster hosts and protected virtual machines.
- Acting as the vCenter Server management interface to the cluster and reporting the cluster health state.
The slave hosts primarily contribute to the cluster by running virtual machines locally, monitoring their
runtime states, and reporting state updates to the master host. A master host can also run and monitor
virtual machines. Both slave hosts and master hosts implement the VM and Application Monitoring
features.
One of the functions performed by the master host is to orchestrate restarts of protected virtual machines. A
virtual machine is protected by a master host after vCenter Server observes that the virtual machine's power
state has changed from powered off to powered on in response to a user action. The master host persists the
list of protected virtual machines in the cluster's datastores. A newly elected master host uses this
information to determine which virtual machines to protect.
NOTE If you disconnect a host from a cluster, all of the virtual machines registered to that host are
unprotected by vSphere HA.
Host Failure Types and Detection
The master host of a vSphere HA cluster is responsible for detecting the failure of slave hosts. Depending on
the type of failure detected, the virtual machines running on the hosts might need to be failed over.
In a vSphere HA cluster, three types of host failure are detected:
- Failure. A host stops functioning.
- Isolation. A host becomes network isolated.
- Partition. A host loses network connectivity with the master host.
The master host monitors the liveness of the slave hosts in the cluster. This communication is done through
the exchange of network heartbeats every second. When the master host stops receiving these heartbeats
from a slave host, it checks for host liveness before declaring the host to have failed. The liveness check that
the master host performs is to determine whether the slave host is exchanging heartbeats with one of the
datastores. See “Datastore Heartbeating,” on page 21. Also, the master host checks whether the host
responds to ICMP pings sent to its management IP addresses.
If a master host is unable to communicate directly with the agent on a slave host, the slave host does not
respond to ICMP pings, and the agent is not issuing heartbeats, the slave host is considered to have failed. The host's
virtual machines are restarted on alternate hosts. If such a slave host is exchanging heartbeats with a
datastore, the master host assumes that it is in a network partition or network isolated and so continues to
monitor the host and its virtual machines. See “Network Partitions,” on page 21.
Host network isolation occurs when a host is still running, but it can no longer observe traffic from vSphere
HA agents on the management network. If a host stops observing this traffic, it attempts to ping the cluster
isolation addresses. If this also fails, the host declares itself as isolated from the network.
The master host monitors the virtual machines that are running on an isolated host. If it observes that they
power off, and the master host is responsible for those virtual machines, it restarts them.
NOTE If you ensure that the network infrastructure is sufficiently redundant and that at least one network
path is available at all times, host network isolation should be a rare occurrence.
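A minimal sketch of this detection logic, assuming the state of a slave host has already been reduced to booleans (the real master agent works from live heartbeat, datastore, and ping state):

```python
def classify_host(network_heartbeat, datastore_heartbeat, responds_to_ping,
                  declared_itself_isolated=False):
    """Classify a slave host the way the checks above describe."""
    if network_heartbeat:
        return "live"
    if not datastore_heartbeat and not responds_to_ping:
        # No network heartbeats, no datastore heartbeats, no ping response:
        # the host has failed and its VMs are restarted on alternate hosts.
        return "failed"
    if declared_itself_isolated:
        # The host could not ping its isolation addresses and declared itself
        # isolated from the network.
        return "isolated"
    # Datastore heartbeats still arrive, so the host is alive but cut off
    # from the master: a network partition. It continues to be monitored.
    return "partitioned"
```

The ordering matters: network heartbeats short-circuit everything else, and only the combination of all three negative checks leads to a failed verdict.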
Determining Responses to Host Issues
If a host fails and its virtual machines must be restarted, you can control the order in which the virtual
machines are restarted with the VM restart priority setting. You can also configure how vSphere HA
responds if hosts lose management network connectivity with other hosts by using the host isolation
response setting. Other factors are also considered when vSphere HA restarts a virtual machine after a
failure.
The following settings apply to all virtual machines in the cluster in the case of a host failure or isolation.
You can also configure exceptions for specific virtual machines. See “Customize an Individual Virtual
Machine,” on page 40.
VM Restart Priority
VM restart priority determines the relative order in which virtual machines are allocated resources after a
host failure. Such virtual machines are assigned to hosts with unreserved capacity, with the highest-priority
virtual machines placed first. Placement continues through the lower priorities until all virtual machines have
been placed or no more cluster capacity is available to meet the reservations or memory overhead of the
virtual machines. A host then restarts the virtual machines assigned to it in priority order. If there are
insufficient resources, vSphere HA waits for more unreserved capacity to become available, for example,
due to a host coming back online, and then retries the placement of these virtual machines. To reduce the
chance of this situation occurring, configure vSphere HA admission control to reserve more resources for
failures. Admission control allows you to control the amount of cluster capacity that is reserved by virtual
machines, which is unavailable to meet the reservations and memory overhead of other virtual machines if
there is a failure.
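The placement order described above can be sketched as a greedy loop. Host names, the single-number capacity model, and the tuple shape of each VM are illustrative assumptions; real placement also weighs memory overhead, compatibility, and feature constraints.

```python
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}

def place_vms(vms, hosts):
    """vms: list of (name, priority, needed_capacity) tuples.
    hosts: dict of host name -> unreserved capacity (single number).
    Returns (placements, waiting): where each VM landed, and which VMs
    must wait for more unreserved capacity before a retry.
    """
    placements, waiting = {}, []
    # Highest-priority VMs are allocated resources first.
    for name, prio, need in sorted(vms, key=lambda v: PRIORITY_ORDER[v[1]]):
        host = next((h for h, free in hosts.items() if free >= need), None)
        if host is None:
            waiting.append(name)   # retried when capacity becomes available
        else:
            hosts[host] -= need
            placements[name] = host
    return placements, waiting
```

In the example below, the High-priority database VM consumes the larger host first, and the Low-priority web VM is left waiting, mirroring the behavior described above.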
The values for this setting are Disabled, Low, Medium (the default), and High. The Disabled setting is
ignored by the vSphere HA VM/Application monitoring feature because this feature protects virtual
machines against operating system-level failures and not virtual machine failures. When an operating
system-level failure occurs, the operating system is rebooted by vSphere HA, and the virtual machine is left
running on the same host. You can change this setting for individual virtual machines.
NOTE A virtual machine reset causes a hard reboot of the guest operating system, but does not power cycle
the virtual machine.
The restart priority settings for virtual machines vary depending on user needs. Assign higher restart
priority to the virtual machines that provide the most important services.
For example, in the case of a multitier application, you might rank assignments according to functions
hosted on the virtual machines.
- High. Database servers that provide data for applications.
- Medium. Application servers that consume data in the database and provide results on web pages.
- Low. Web servers that receive user requests, pass queries to application servers, and return results to users.

If a host fails, vSphere HA attempts to register to an active host the affected virtual machines that were
powered on and have a restart priority setting of Disabled, or that were powered off.

Host Isolation Response

Host isolation response determines what happens when a host in a vSphere HA cluster loses its
management network connections, but continues to run. You can use the isolation response to have vSphere
HA power off virtual machines that are running on an isolated host and restart them on a nonisolated host.

Host isolation responses require that Host Monitoring Status is enabled. If Host Monitoring Status is
disabled, host isolation responses are also suspended. A host determines that it is isolated when it is unable
to communicate with the agents running on the other hosts, and it is unable to ping its isolation addresses.
The host then executes its isolation response. The responses are Power off and restart VMs or Shutdown and
restart VMs. You can customize this property for individual virtual machines.

NOTE If a virtual machine has a restart priority setting of Disabled, no host isolation response is made.

To use the Shutdown and restart VMs setting, you must install VMware Tools in the guest operating system
of the virtual machine. Shutting down the virtual machine provides the advantage of preserving its state.
Shutting down is better than powering off the virtual machine, which does not flush the most recent changes
to disk or commit transactions. Virtual machines that are in the process of shutting down take longer to fail
over while the shutdown completes. Virtual machines that have not shut down in 300 seconds, or the time
specified in the advanced option das.isolationshutdowntimeout, are powered off.

After you create a vSphere HA cluster, you can override the default cluster settings for Restart Priority and
Isolation Response for specific virtual machines. Such overrides are useful for virtual machines that are used
for special tasks. For example, virtual machines that provide infrastructure services like DNS or DHCP
might need to be powered on before other virtual machines in the cluster.
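A sketch of the isolation-response decision for a single VM, following the rules above. The function signature and string labels are hypothetical; the 300-second default corresponds to the advanced option das.isolationshutdowntimeout.

```python
def isolation_response(host_monitoring, restart_priority, response,
                       shutdown_seconds=0, timeout=300):
    """Decide what happens to one VM when its host declares itself isolated.

    host_monitoring: whether Host Monitoring Status is enabled.
    restart_priority: the VM's restart priority ("Disabled" suppresses it).
    response: "Power off and restart VMs" or "Shutdown and restart VMs".
    shutdown_seconds: how long the guest shutdown takes (hypothetical input).
    """
    if not host_monitoring or restart_priority == "Disabled":
        return "none"              # no isolation response is made
    if response == "Shutdown and restart VMs":
        # Guest shutdown requires VMware Tools; a shutdown that exceeds the
        # timeout is turned into a power-off.
        return "shutdown" if shutdown_seconds <= timeout else "power off"
    return "power off"
```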
A virtual machine "split-brain" condition can occur when a host becomes isolated or partitioned from a
master host and the master host cannot communicate with it using heartbeat datastores. In this situation, the
master host cannot determine that the host is alive and so declares it dead. The master host then attempts to
restart the virtual machines that are running on the isolated or partitioned host. This attempt succeeds if the
virtual machines remain running on the isolated/partitioned host and that host lost access to the virtual
machines' datastores when it became isolated or partitioned. A split-brain condition then exists because
there are two instances of the virtual machine. However, only one instance is able to read or write the
virtual machine's virtual disks. VM Component Protection can be used to prevent this split-brain condition.
When you enable VMCP with the aggressive setting, it monitors the datastore accessibility of powered-on
virtual machines, and shuts down those that lose access to their datastores.
To recover from this situation, ESXi generates a question on the virtual machine that has lost the disk locks
for when the host comes out of isolation and cannot reacquire the disk locks. vSphere HA automatically
answers this question, allowing the virtual machine instance that has lost the disk locks to power off,
leaving just the instance that has the disk locks.
Factors Considered for Virtual Machine Restarts
After a failure, the cluster's master host attempts to restart affected virtual machines by identifying a host
that can power them on. When choosing such a host, the master host considers a number of factors.
File accessibility. Before a virtual machine can be started, its files must be accessible from one of the active cluster hosts that the master can communicate with over the network.

Virtual machine and host compatibility. If there are accessible hosts, the virtual machine must be compatible with at least one of them. The compatibility set for a virtual machine includes the effect of any required VM-Host affinity rules. For example, if a rule only permits a virtual machine to run on two hosts, it is considered for placement on those two hosts.

Resource reservations. Of the hosts that the virtual machine can run on, at least one must have sufficient unreserved capacity to meet the memory overhead of the virtual machine and any resource reservations. Four types of reservations are considered: CPU, memory, vNIC, and virtual flash. Also, sufficient network ports must be available to power on the virtual machine.

Host limits. In addition to resource reservations, a virtual machine can only be placed on a host if doing so does not violate the maximum number of allowed virtual machines or the number of in-use vCPUs.

Feature constraints. If the advanced option has been set that requires vSphere HA to enforce VM to VM anti-affinity rules, vSphere HA does not violate this rule. Also, vSphere HA does not violate any configured per-host limits for fault tolerant virtual machines.

If no hosts satisfy the preceding considerations, the master host issues an event stating that there are not enough resources for vSphere HA to start the VM and tries again when the cluster conditions have changed. For example, if the virtual machine is not accessible, the master host tries again after a change in file accessibility.
Limits for Virtual Machine Restart Attempts
If the vSphere HA master agent's attempt to restart a VM, which involves registering it and powering it on,
fails, this restart is retried after a delay. vSphere HA attempts these restarts for a maximum number of
attempts (6 by default), but not all restart failures count against this maximum.
For example, the most likely reason for a restart attempt to fail is because either the VM is still running on
another host, or because vSphere HA tried to restart the VM too soon after it failed. In this situation, the
master agent delays the retry attempt by twice the delay imposed after the last attempt, with a 1 minute
minimum delay and a 30 minute maximum delay. Thus if the delay is set to 1 minute, there is an initial
attempt at T=0, then additional attempts made at T=1 (1 minute), T=3 (3 minutes), T=7 (7 minutes), T=15 (15
minutes), and T=30 (30 minutes). Each such attempt is counted against the limit and only six attempts are
made by default.
Other restart failures result in countable retries but with a different delay interval. An example scenario is
when the host chosen to restart the virtual machine loses access to one of the VM's datastores after the choice
was made by the master agent. In this case, a retry is attempted after a default delay of 2 minutes. This
attempt also counts against the limit.
Finally, some retries are not counted. For example, if the host on which the virtual machine was to be
restarted fails before the master agent issues the restart request, the attempt is retried after 2 minutes but
this failure does not count against the maximum number of attempts.
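The retry timeline above can be reproduced with a short calculation, assuming the delay doubles after each failed attempt, starts at 1 minute, and the total wait is capped at 30 minutes:

```python
def restart_schedule(max_attempts=6, first_delay=1, cap=30):
    """Return the restart attempt times, in minutes after the failure.

    The first attempt is at T=0; each retry waits twice the previous delay,
    and no attempt is scheduled later than `cap` minutes after the failure.
    """
    times, delay = [0], first_delay
    while len(times) < max_attempts:
        times.append(min(times[-1] + delay, cap))
        delay *= 2
    return times
```

With the defaults this yields attempts at T=0, 1, 3, 7, 15, and 30 minutes, matching the schedule described above.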
Virtual Machine Restart Notifications
vSphere HA generates a cluster event when a failover operation is in progress for virtual machines in the
cluster. The event also displays a configuration issue in the Cluster Summary tab which reports the number
of virtual machines that are being restarted. There are four different categories of such VMs:
- VMs being placed: vSphere HA is in the process of trying to restart these VMs.
- VMs awaiting a retry: a previous restart attempt failed, and vSphere HA is waiting for a timeout to expire before trying again.
- VMs requiring additional resources: insufficient resources are available to restart these VMs. vSphere HA retries when more resources become available, for example, a host comes back online.
- Inaccessible Virtual SAN VMs: vSphere HA cannot restart these Virtual SAN VMs because they are not accessible. It retries when there is a change in accessibility.

These virtual machine counts are dynamically updated whenever a change is observed in the number of
VMs for which a restart operation is underway. The configuration issue is cleared when vSphere HA has
restarted all VMs or has given up trying.

In vSphere 5.5 or earlier, a per-VM event is triggered for an unsuccessful attempt to restart the virtual
machine. This event is disabled by default in vSphere 6.x and can be enabled by setting the vSphere HA
advanced option das.config.fdm.reportfailoverfailevent to 1.
VM and Application Monitoring
VM Monitoring restarts individual virtual machines if their VMware Tools heartbeats are not received
within a set time. Similarly, Application Monitoring can restart a virtual machine if the heartbeats for an
application it is running are not received. You can enable these features and configure the sensitivity with
which vSphere HA monitors non-responsiveness.
When you enable VM Monitoring, the VM Monitoring service (using VMware Tools) evaluates whether
each virtual machine in the cluster is running by checking for regular heartbeats and I/O activity from the
VMware Tools process running inside the guest. If no heartbeats or I/O activity are received, this is most
likely because the guest operating system has failed or VMware Tools is not being allocated any time to
complete tasks. In such a case, the VM Monitoring service determines that the virtual machine has failed
and the virtual machine is rebooted to restore service.
Occasionally, virtual machines or applications that are still functioning properly stop sending heartbeats. To
avoid unnecessary resets, the VM Monitoring service also monitors a virtual machine's I/O activity. If no
heartbeats are received within the failure interval, the I/O stats interval (a cluster-level attribute) is checked.
The I/O stats interval determines if any disk or network activity has occurred for the virtual machine during
the previous two minutes (120 seconds). If not, the virtual machine is reset. This default value (120 seconds)
can be changed using the advanced option das.iostatsinterval.
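The two-stage check can be sketched as follows. Both thresholds are the documented defaults (a 30-second failure interval from the High preset and the 120-second I/O stats interval), and the numeric inputs are simplifications of the monitoring state.

```python
def should_reset(seconds_since_heartbeat, seconds_since_io,
                 failure_interval=30, iostats_interval=120):
    """Decide whether VM Monitoring resets a virtual machine.

    A VM is reset only if no VMware Tools heartbeat arrived within the
    failure interval AND no disk or network I/O was observed during the
    I/O stats interval (das.iostatsinterval, 120 seconds by default).
    """
    if seconds_since_heartbeat < failure_interval:
        return False               # heartbeats are still arriving
    # Heartbeats stopped; recent I/O still counts as a sign of life,
    # which avoids unnecessary resets of healthy but quiet guests.
    return seconds_since_io >= iostats_interval
```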
To enable Application Monitoring, you must first obtain the appropriate SDK (or be using an application
that supports VMware Application Monitoring) and use it to set up customized heartbeats for the
applications you want to monitor. After you have done this, Application Monitoring works much the same
way that VM Monitoring does. If the heartbeats for an application are not received for a specified time, its
virtual machine is restarted.
You can configure the level of monitoring sensitivity. Highly sensitive monitoring results in a more rapid
conclusion that a failure has occurred. While unlikely, highly sensitive monitoring might lead to falsely
identifying failures when the virtual machine or application in question is actually still working, but
heartbeats have not been received due to factors such as resource constraints. Low sensitivity monitoring
results in longer interruptions in service between actual failures and virtual machines being reset. Select an
option that is an effective compromise for your needs.
The default settings for monitoring sensitivity are described in Table 2-1. You can also specify custom values
for both monitoring sensitivity and the I/O stats interval by selecting the Custom checkbox.
Table 2-1. VM Monitoring Settings

Setting   Failure Interval (seconds)   Reset Period
High      30                           1 hour
Medium    60                           24 hours
Low       120                          7 days
After failures are detected, vSphere HA resets virtual machines. The reset ensures that services remain
available. To avoid repeatedly resetting virtual machines for nontransient errors, by default a virtual
machine is reset only three times during a configurable time interval. After a virtual machine has been
reset three times, vSphere HA makes no further attempts to reset it after subsequent failures until the
specified time has elapsed. You can configure the number of resets using the Maximum per-VM resets
custom setting.
NOTE The reset statistics are cleared when a virtual machine is powered off and then back on, or when it is
migrated to another host using vMotion. A reset causes the guest operating system to reboot but does not
change the power state of the virtual machine, unlike a 'restart'.
If a virtual machine has a datastore accessibility failure (either All Paths Down or Permanent Device Loss),
the VM Monitoring service suspends resetting it until the failure has been addressed.
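The reset-throttling behavior described above can be sketched as follows. This is an illustrative model of the documented policy (at most three resets within the reset period), not the internal vSphere HA implementation:

```python
# Sketch of per-VM reset throttling: vSphere HA resets a VM at most
# `max_resets` times within the reset period (Maximum per-VM resets),
# then makes no further reset attempts until the window has elapsed.
def should_reset(reset_times, now, max_resets=3, window_s=24 * 3600):
    """Return True if another automatic reset is allowed at time `now`.

    reset_times: timestamps (in seconds) of previous automatic resets.
    """
    recent = [t for t in reset_times if now - t < window_s]
    return len(recent) < max_resets
```

Once the oldest reset falls outside the window, resets become available again, which matches the "until after the specified time has elapsed" behavior in the text.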
VM Component Protection
If VM Component Protection (VMCP) is enabled, vSphere HA can detect datastore accessibility failures and
provide automated recovery for affected virtual machines.
VMCP provides protection against datastore accessibility failures that can affect a virtual machine running
on a host in a vSphere HA cluster. When a datastore accessibility failure occurs, the affected host can no
longer access the storage path for a specific datastore. You can determine the response that vSphere HA will
make to such a failure, ranging from the creation of event alarms to virtual machine restarts on other hosts.
NOTE When you use the VM Component Protection feature, your ESXi hosts must be version 6.0 or higher.
Types of Failure
There are two types of datastore accessibility failure:

PDL (Permanent Device Loss) is an unrecoverable loss of accessibility that occurs when a storage device
reports the datastore is no longer accessible by the host. This condition cannot be reverted without
powering off virtual machines.

APD (All Paths Down) represents a transient or unknown accessibility loss or any other unidentified delay
in I/O processing. This type of accessibility issue is recoverable.
[Figure: APD recovery timeline. The storage failure (APD_START) occurs at t=0; the host declares
APD_TIMEOUT at t=140s; if the APD clears before t=140s + 3m, the VM is reset; otherwise, at
t=140s + 3m, vSphere HA terminates and fails over the VM.]
Configuring VMCP
VM Component Protection is enabled and configured in the vSphere Web Client. To enable this feature, you
must select the Protect against Storage Connectivity Loss checkbox in the edit cluster settings wizard. The
storage protection levels you can choose and the virtual machine remediation actions available differ
depending on the type of datastore accessibility failure.
PDL failures
A virtual machine is automatically failed over to a new host unless you have
configured VMCP only to Issue events.
APD events
The response to APD events is more complex and accordingly the
configuration is more fine-grained.
After the user-configured Delay for VM failover for APD period has
elapsed, the action taken depends on the policy you selected. An event is
issued and the virtual machine is restarted conservatively or aggressively.
The conservative approach does not terminate the virtual machine if the
success of the failover is unknown, for example in a network partition. The
aggressive approach does terminate the virtual machine under these
conditions. Neither approach terminates the virtual machine if there are
insufficient resources in the cluster for the failover to succeed.
If APD recovers before the user-configured Delay for VM failover for APD
period has elapsed, you can choose to reset the affected virtual machines,
which recovers the guest applications that were impacted by the I/O failures.
NOTE If either the Host Monitoring or VM Restart Priority settings are disabled, VMCP cannot perform
virtual machine restarts. Storage health can still be monitored and events can be issued, however.
For more information on configuring VMCP, see “Configure Virtual Machine Responses,” on page 35.
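As a rough sketch, the per-VM choices described above map onto a settings object like the following. The names are modeled on the vSphere API's VmComponentProtectionSettings; treat the exact field names and enumeration values as assumptions for illustration rather than a verified API contract:

```python
# Illustrative per-VM VMCP settings, loosely modeled on the vSphere
# API's VmComponentProtectionSettings object. The values reflect the
# documented choices: issue events only, or restart conservatively or
# aggressively. Field names are assumptions for illustration.
vmcp_settings = {
    "vmStorageProtectionForPDL": "restartAggressive",    # or "warning" (events only)
    "vmStorageProtectionForAPD": "restartConservative",  # or "restartAggressive" / "warning"
    "vmTerminateDelayForAPDSec": 180,                    # Delay for VM failover for APD (3 min)
    "vmReactionOnAPDCleared": "reset",                   # reset VMs if APD clears in time
}
```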
VMCP Recovery Timeline
The following timeline graphically demonstrates how VMCP recovers from a storage failure.
- T=0s: A storage failure is detected and vSphere HA starts the recovery process. For a PDL event, the
  workflow starts immediately and VMs are restarted on healthy hosts in the cluster. If the storage loss is
  due to an APD event, the APD Timeout timer starts (the default is 140 seconds).
- T=140s: The host declares an APD Timeout and begins to fail non-VM I/O to the unresponsive storage
  device.
- Between T=140s and T=320s: This is the time period defined by the Delay for VM failover for APD
  setting, which is 3 minutes by default. The guest applications might become unstable after losing access
  to storage for an extended period of time. If the APD is cleared in this time period, the option to reset
  the VMs is available.
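The milestones above can be computed from the two defaults stated in the text (140-second APD timeout, 3-minute failover delay). The helper below is an illustrative sketch, not a vSphere API:

```python
# Default VMCP recovery timeline for an APD event, per the text:
# APD timeout of 140 s, then a "Delay for VM failover for APD" of
# 3 minutes before HA terminates and fails over the VM.
APD_TIMEOUT_S = 140
VM_FAILOVER_DELAY_S = 3 * 60

def apd_milestones(failure_time_s=0):
    """Return the key timestamps (in seconds) in the APD recovery flow."""
    apd_timeout = failure_time_s + APD_TIMEOUT_S
    termination = apd_timeout + VM_FAILOVER_DELAY_S
    return {
        "apd_start": failure_time_s,   # storage failure detected
        "apd_timeout": apd_timeout,    # host starts failing non-VM I/O
        "vm_termination": termination, # HA terminates and fails over the VM
    }
```

With the defaults, the termination point lands at T=320s, matching the "Between T=140s and T=320s" window above.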