VMware ESXi - 6.0.1 User Manual

vSphere Availability

Update 1

ESXi 6.0

vCenter Server 6.0

This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document, see http://www.vmware.com/support/pubs.

EN-001810-02

You can find the most up-to-date technical documentation on the VMware Web site at:

http://www.vmware.com/support/

The VMware Web site also provides the latest product updates.

If you have comments about this documentation, submit your feedback to:

docfeedback@vmware.com

VMware, Inc.

3401 Hillview Ave. Palo Alto, CA 94304 www.vmware.com

About vSphere Availability

Updated Information

Business Continuity and Minimizing Downtime

Reducing Planned Downtime

Preventing Unplanned Downtime

vSphere HA Provides Rapid Recovery from Outages

vSphere Fault Tolerance Provides Continuous Availability

Creating and Using vSphere HA Clusters

How vSphere HA Works

vSphere HA Admission Control

vSphere HA Interoperability

Creating and Configuring a vSphere HA Cluster

Best Practices for vSphere HA Clusters

Providing Fault Tolerance for Virtual Machines

How Fault Tolerance Works

Fault Tolerance Use Cases

Fault Tolerance Requirements, Limits, and Licensing

Fault Tolerance Interoperability

Preparing Your Cluster and Hosts for Fault Tolerance

Using Fault Tolerance

Best Practices for Fault Tolerance

Legacy Fault Tolerance

Index

About vSphere Availability

vSphere Availability describes solutions that provide business continuity, including how to establish vSphere® High Availability (HA) and vSphere Fault Tolerance.

Intended Audience

This information is for anyone who wants to provide business continuity through the vSphere HA and Fault Tolerance solutions. The information in this book is for experienced Windows or Linux system administrators who are familiar with virtual machine technology and data center operations.

Updated Information

This vSphere Availability documentation is updated with each release of the product or when necessary.

This table provides the update history of the vSphere Availability documentation.

Revision        Description

EN-001810-02    Change to wording about dedicated FT network under Fault Tolerance Requirements. See “Fault Tolerance Requirements, Limits, and Licensing,” on page 46.

EN-001810-01    New note about ESXi host version needed for VM Component Protection feature. See “VM Component Protection,” on page 19.

EN-001810-00    Initial release.

Chapter 1 Business Continuity and Minimizing Downtime

Downtime, whether planned or unplanned, brings with it considerable costs. However, solutions to ensure higher levels of availability have traditionally been costly, hard to implement, and difficult to manage.

VMware software makes it simpler and less expensive to provide higher levels of availability for important applications. With vSphere, organizations can easily increase the baseline level of availability provided for all applications as well as provide higher levels of availability more easily and cost effectively. With vSphere, you can:

Provide higher availability independent of hardware, operating system, and applications.

Reduce planned downtime for common maintenance operations.

Provide automatic recovery in cases of failure.

vSphere makes it possible to reduce planned downtime, prevent unplanned downtime, and recover rapidly from outages.

This chapter includes the following topics:

Reducing Planned Downtime

Preventing Unplanned Downtime

vSphere HA Provides Rapid Recovery from Outages

vSphere Fault Tolerance Provides Continuous Availability

Reducing Planned Downtime

Planned downtime typically accounts for over 80% of data center downtime. Hardware maintenance, server migration, and firmware updates all require downtime for physical servers. To minimize the impact of this downtime, organizations are forced to delay maintenance until inconvenient and difficult-to-schedule downtime windows.

vSphere makes it possible for organizations to dramatically reduce planned downtime. Because workloads in a vSphere environment can be dynamically moved to different physical servers without downtime or service interruption, server maintenance can be performed without requiring application and service downtime. With vSphere, organizations can:

Eliminate downtime for common maintenance operations.

Eliminate planned maintenance windows.

Perform maintenance at any time without disrupting users and services.

The vSphere vMotion® and Storage vMotion functionality in vSphere makes it possible for organizations to reduce planned downtime because workloads in a VMware environment can be dynamically moved to different physical servers or to different underlying storage without service interruption. Administrators can perform faster and completely transparent maintenance operations, without being forced to schedule inconvenient maintenance windows.

Preventing Unplanned Downtime

While an ESXi host provides a robust platform for running applications, an organization must also protect itself from unplanned downtime caused by hardware or application failures. vSphere builds important capabilities into data center infrastructure that can help you prevent unplanned downtime.

These vSphere capabilities are part of virtual infrastructure and are transparent to the operating system and applications running in virtual machines. These features can be configured and utilized by all the virtual machines on a physical system, reducing the cost and complexity of providing higher availability. Key availability capabilities are built into vSphere:

Shared storage. Eliminate single points of failure by storing virtual machine files on shared storage, such as Fibre Channel or iSCSI SAN, or NAS. SAN mirroring and replication features can be used to keep updated copies of virtual disks at disaster recovery sites.

Network interface teaming. Provide tolerance of individual network card failures.

Storage multipathing. Tolerate storage path failures.

In addition to these capabilities, the vSphere HA and Fault Tolerance features can minimize or eliminate unplanned downtime by providing rapid recovery from outages and continuous availability, respectively.

vSphere HA Provides Rapid Recovery from Outages

vSphere HA leverages multiple ESXi hosts configured as a cluster to provide rapid recovery from outages and cost-effective high availability for applications running in virtual machines.

vSphere HA protects application availability in the following ways:

It protects against a server failure by restarting the virtual machines on other hosts within the cluster.

It protects against application failure by continuously monitoring a virtual machine and resetting it in the event that a failure is detected.

It protects against datastore accessibility failures by restarting affected virtual machines on other hosts that still have access to their datastores.

It protects virtual machines against network isolation by restarting them if their host becomes isolated on the management or Virtual SAN network. This protection is provided even if the network has become partitioned.

Unlike other clustering solutions, vSphere HA provides the infrastructure to protect all workloads:

You do not need to install special software within the application or virtual machine. All workloads are protected by vSphere HA. After vSphere HA is configured, no actions are required to protect new virtual machines. They are automatically protected.

You can combine vSphere HA with vSphere Distributed Resource Scheduler (DRS) to protect against failures and to provide load balancing across the hosts within a cluster.

vSphere HA has several advantages over traditional failover solutions:

Minimal setup. After a vSphere HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.

Reduced hardware cost and setup. The virtual machine acts as a portable container for the applications and it can be moved among hosts. Administrators avoid duplicate configurations on multiple machines. When you use vSphere HA, you must have sufficient resources to fail over the number of hosts you want to protect with vSphere HA. However, the vCenter Server system automatically manages resources and configures clusters.

Increased application availability. Any application running inside a virtual machine has access to increased availability. Because the virtual machine can recover from hardware failure, all applications that start at boot have increased availability without increased computing needs, even if the application is not itself a clustered application. By monitoring and responding to VMware Tools heartbeats and restarting nonresponsive virtual machines, vSphere HA protects against guest operating system crashes.

DRS and vMotion integration. If a host fails and virtual machines are restarted on other hosts, DRS can provide migration recommendations or migrate virtual machines for balanced resource allocation. If one or both of the source and destination hosts of a migration fail, vSphere HA can help recover from that failure.

vSphere Fault Tolerance Provides Continuous Availability

vSphere HA provides a base level of protection for your virtual machines by restarting virtual machines in the event of a host failure. vSphere Fault Tolerance provides a higher level of availability, allowing users to protect any virtual machine from a host failure with no loss of data, transactions, or connections.

Fault Tolerance provides continuous availability by ensuring that the states of the Primary and Secondary VMs are identical at any point in the instruction execution of the virtual machine.

If either the host running the Primary VM or the host running the Secondary VM fails, an immediate and transparent failover occurs. The functioning ESXi host seamlessly becomes the Primary VM host without losing network connections or in-progress transactions. With transparent failover, there is no data loss and network connections are maintained. After a transparent failover occurs, a new Secondary VM is respawned and redundancy is re-established. The entire process is transparent and fully automated and occurs even if vCenter Server is unavailable.

Chapter 2 Creating and Using vSphere HA Clusters

vSphere HA clusters enable a collection of ESXi hosts to work together so that, as a group, they provide higher levels of availability for virtual machines than each ESXi host can provide individually. When you plan the creation and usage of a new vSphere HA cluster, the options you select affect the way that cluster responds to failures of hosts or virtual machines.

Before you create a vSphere HA cluster, you should know how vSphere HA identifies host failures and isolation and how it responds to these situations. You also should know how admission control works so that you can choose the policy that fits your failover needs. After you establish a cluster, you can customize its behavior with advanced options and optimize its performance by following recommended best practices.

NOTE You might get an error message when you try to use vSphere HA. For information about error messages related to vSphere HA, see the VMware knowledge base article at http://kb.vmware.com/kb/1033634.

This chapter includes the following topics:

How vSphere HA Works

vSphere HA Admission Control

vSphere HA Interoperability

Creating and Configuring a vSphere HA Cluster

Best Practices for vSphere HA Clusters

How vSphere HA Works

vSphere HA provides high availability for virtual machines by pooling the virtual machines and the hosts they reside on into a cluster. Hosts in the cluster are monitored and in the event of a failure, the virtual machines on a failed host are restarted on alternate hosts.

When you create a vSphere HA cluster, a single host is automatically elected as the master host. The master host communicates with vCenter Server and monitors the state of all protected virtual machines and of the slave hosts. Different types of host failures are possible, and the master host must detect and appropriately deal with the failure. The master host must distinguish between a failed host and one that is in a network partition or that has become network isolated. The master host uses network and datastore heartbeating to determine the type of failure.

vSphere HA Clusters (http://link.brightcove.com/services/player/bcpid2296383276001?bctid=ref:vSphereHAClusters)

Master and Slave Hosts

When you add a host to a vSphere HA cluster, an agent is uploaded to the host and configured to communicate with other agents in the cluster. Each host in the cluster functions as a master host or a slave host.

When vSphere HA is enabled for a cluster, all active hosts (those not in standby or maintenance mode, or not disconnected) participate in an election to choose the cluster's master host. The host that mounts the greatest number of datastores has an advantage in the election. Only one master host typically exists per cluster and all other hosts are slave hosts. If the master host fails, is shut down or put in standby mode, or is removed from the cluster, a new election is held.
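The election preference described above can be pictured with a short sketch. It is an illustration only: the ElectionHost class, its field names, and the tie-break on host name are assumptions made for the example. The text states only that active hosts take part and that the host mounting the greatest number of datastores has an advantage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElectionHost:
    name: str                 # hypothetical identifier for the example
    datastores_mounted: int   # number of datastores the host mounts
    active: bool = True       # False if in standby, in maintenance mode, or disconnected


def elect_master(hosts):
    """Return the active host that mounts the most datastores, or None."""
    candidates = [h for h in hosts if h.active]
    if not candidates:
        return None
    # Assumption: ties are broken by name so that the result is deterministic.
    return max(candidates, key=lambda h: (h.datastores_mounted, h.name))


hosts = [
    ElectionHost("esx-a", 4),
    ElectionHost("esx-b", 6),
    ElectionHost("esx-c", 9, active=False),  # in maintenance mode, so not a candidate
]
print(elect_master(hosts).name)  # esx-b
```

Here esx-c mounts the most datastores but is excluded because it is not active, so esx-b is chosen.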

The master host in a cluster has a number of responsibilities:

Monitoring the state of slave hosts. If a slave host fails or becomes unreachable, the master host identifies which virtual machines need to be restarted.

Monitoring the power state of all protected virtual machines. If one virtual machine fails, the master host ensures that it is restarted. Using a local placement engine, the master host also determines where the restart should be done.

Managing the lists of cluster hosts and protected virtual machines.

Acting as the vCenter Server management interface to the cluster and reporting the cluster health state.

The slave hosts primarily contribute to the cluster by running virtual machines locally, monitoring their runtime states, and reporting state updates to the master host. A master host can also run and monitor virtual machines. Both slave hosts and master hosts implement the VM and Application Monitoring features.

One of the functions performed by the master host is to orchestrate restarts of protected virtual machines. A virtual machine is protected by a master host after vCenter Server observes that the virtual machine's power state has changed from powered off to powered on in response to a user action. The master host persists the list of protected virtual machines in the cluster's datastores. A newly elected master host uses this information to determine which virtual machines to protect.

NOTE If you disconnect a host from a cluster, all of the virtual machines registered to that host are unprotected by vSphere HA.

Host Failure Types and Detection

The master host of a vSphere HA cluster is responsible for detecting the failure of slave hosts. Depending on the type of failure detected, the virtual machines running on the hosts might need to be failed over.

In a vSphere HA cluster, three types of host failure are detected:

Failure. A host stops functioning.

Isolation. A host becomes network isolated.

Partition. A host loses network connectivity with the master host.

The master host monitors the liveness of the slave hosts in the cluster. This communication is done through the exchange of network heartbeats every second. When the master host stops receiving these heartbeats from a slave host, it checks for host liveness before declaring the host to have failed. The liveness check that the master host performs is to determine whether the slave host is exchanging heartbeats with one of the datastores. See “Datastore Heartbeating,” on page 21. Also, the master host checks whether the host responds to ICMP pings sent to its management IP addresses.

If a master host is unable to communicate directly with the agent on a slave host, the slave host does not respond to ICMP pings, and the agent is not issuing heartbeats, the slave host is considered to have failed. The host's virtual machines are restarted on alternate hosts. If such a slave host is exchanging heartbeats with a datastore, the master host assumes that it is in a network partition or is network isolated, and so continues to monitor the host and its virtual machines. See “Network Partitions,” on page 21.

Host network isolation occurs when a host is still running, but it can no longer observe traffic from vSphere HA agents on the management network. If a host stops observing this traffic, it attempts to ping the cluster isolation addresses. If this also fails, the host declares itself as isolated from the network.

The master host monitors the virtual machines that are running on an isolated host. If it observes that they power off and it is responsible for those virtual machines, it restarts them.

NOTE If you ensure that the network infrastructure is sufficiently redundant and that at least one network path is available at all times, host network isolation should be a rare occurrence.
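The detection rules in this section can be summarized as two small decision functions, one for the master host's view of a slave host and one for a host's view of itself. This is a minimal sketch of the logic as described here, not the agent's implementation. The function names, parameter names, and returned labels are assumptions, and the "undetermined" case is one that this section does not address.

```python
def classify_slave_host(network_heartbeats, datastore_heartbeats, responds_to_ping):
    """Master-side view of a slave host, following the rules described above."""
    if network_heartbeats:
        return "live"
    if datastore_heartbeats:
        # Still heartbeating to a datastore: treated as partitioned or isolated,
        # so the master keeps monitoring the host and its virtual machines.
        return "partitioned-or-isolated"
    if not responds_to_ping:
        # No network heartbeats, no datastore heartbeats, no ICMP reply.
        return "failed"
    # Assumption: the text does not say what happens when only pings succeed.
    return "undetermined"


def host_declares_itself_isolated(observes_agent_traffic, isolation_address_reachable):
    """Host-side check: isolated only if agent traffic and isolation pings both fail."""
    return not observes_agent_traffic and not isolation_address_reachable


print(classify_slave_host(False, False, False))     # failed
print(classify_slave_host(False, True, False))      # partitioned-or-isolated
print(host_declares_itself_isolated(False, False))  # True
```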

Determining Responses to Host Issues

If a host fails and its virtual machines must be restarted, you can control the order in which the virtual machines are restarted with the VM restart priority setting. You can also configure how vSphere HA responds if hosts lose management network connectivity with other hosts by using the host isolation response setting. Other factors are also considered when vSphere HA restarts a virtual machine after a failure.

The following settings apply to all virtual machines in the cluster in the case of a host failure or isolation. You can also configure exceptions for specific virtual machines. See “Customize an Individual Virtual Machine,” on page 40.

VM Restart Priority

VM restart priority determines the relative order in which virtual machines are allocated resources after a host failure. Such virtual machines are assigned to hosts with unreserved capacity, with the highest priority virtual machines placed first and continuing to those with lower priority until all virtual machines have been placed or no more cluster capacity is available to meet the reservations or memory overhead of the virtual machines. A host then restarts the virtual machines assigned to it in priority order. If there are insufficient resources, vSphere HA waits for more unreserved capacity to become available, for example, due to a host coming back online, and then retries the placement of these virtual machines. To reduce the chance of this situation occurring, configure vSphere HA admission control to reserve more resources for failures. Admission control allows you to control the amount of cluster capacity that is reserved by virtual machines, which is unavailable to meet the reservations and memory overhead of other virtual machines if there is a failure.
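The ordering described above can be sketched as a simple greedy placement. The sketch is illustrative only: it reduces reservations and memory overhead to a single number per virtual machine, and it picks the first host with enough unreserved capacity, which is an assumption because the placement engine's actual host selection is not described here. All names are hypothetical.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def place_vms(vms, host_capacity):
    """Assign VMs to hosts in restart-priority order.

    vms: list of (name, priority, reservation) tuples.
    host_capacity: dict mapping host name to unreserved capacity.
    Returns (placements, waiting), where waiting lists VMs that must be retried.
    """
    remaining = dict(host_capacity)
    placements, waiting = {}, []
    for name, priority, reservation in sorted(vms, key=lambda vm: PRIORITY_ORDER[vm[1]]):
        # Assumption: take the first host that still has room for the reservation.
        target = next((h for h, free in remaining.items() if free >= reservation), None)
        if target is None:
            waiting.append(name)   # retried when more unreserved capacity appears
        else:
            remaining[target] -= reservation
            placements[name] = target
    return placements, waiting


vms = [("web01", "low", 4), ("db01", "high", 8), ("app01", "medium", 6)]
placements, waiting = place_vms(vms, {"esx-a": 10, "esx-b": 5})
print(placements)  # {'db01': 'esx-a', 'web01': 'esx-b'}
print(waiting)     # ['app01']
```

In this run the high-priority database server is placed first. The medium-priority server does not fit on either host and waits for a retry, while the smaller low-priority server still fits on esx-b.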

The values for this setting are Disabled, Low, Medium (the default), and High. The Disabled setting is ignored by the vSphere HA VM/Application monitoring feature because this feature protects virtual machines against operating system-level failures and not virtual machine failures. When an operating system-level failure occurs, the operating system is rebooted by vSphere HA, and the virtual machine is left running on the same host. You can change this setting for individual virtual machines.

NOTE A virtual machine reset causes a hard reboot of the guest operating system, but does not power cycle the virtual machine.

The restart priority settings for virtual machines vary depending on user needs. Assign higher restart priority to the virtual machines that provide the most important services.

For example, in the case of a multitier application, you might rank assignments according to functions hosted on the virtual machines.

High. Database servers that provide data for applications.

Medium. Application servers that consume data in the database and provide results on web pages.

Low. Web servers that receive user requests, pass queries to application servers, and return results to users.

If a host fails, vSphere HA attempts to register to an active host the affected virtual machines that were powered on and have a restart priority setting of Disabled, or that were powered off.

Host Isolation Response

Host isolation response determines what happens when a host in a vSphere HA cluster loses its management network connections, but continues to run. You can use the isolation response to have vSphere HA power off virtual machines that are running on an isolated host and restart them on a nonisolated host. Host isolation responses require that Host Monitoring Status is enabled. If Host Monitoring Status is disabled, host isolation responses are also suspended. A host determines that it is isolated when it is unable to communicate with the agents running on the other hosts, and it is unable to ping its isolation addresses. The host then executes its isolation response. The responses are Power off and restart VMs or Shutdown and restart VMs. You can customize this property for individual virtual machines.

NOTE If a virtual machine has a restart priority setting of Disabled, no host isolation response is made.

To use the Shutdown and restart VMs setting, you must install VMware Tools in the guest operating system of the virtual machine. Shutting down the virtual machine provides the advantage of preserving its state. Shutting down is better than powering off the virtual machine, which does not flush most recent changes to disk or commit transactions. Virtual machines that are in the process of shutting down take longer to fail over while the shutdown completes. Virtual Machines that have not shut down in 300 seconds, or the time specified in the advanced option das.isolationshutdowntimeout, are powered off.

After you create a vSphere HA cluster, you can override the default cluster settings for Restart Priority and Isolation Response for specific virtual machines. Such overrides are useful for virtual machines that are used for special tasks. For example, virtual machines that provide infrastructure services like DNS or DHCP might need to be powered on before other virtual machines in the cluster.

A virtual machine "split-brain" condition can occur when a host becomes isolated or partitioned from a master host and the master host cannot communicate with it using heartbeat datastores. In this situation, the master host cannot determine that the host is alive and so declares it dead. The master host then attempts to restart the virtual machines that are running on the isolated or partitioned host. This attempt succeeds if the virtual machines remain running on the isolated/partitioned host and that host lost access to the virtual machines' datastores when it became isolated or partitioned. A split-brain condition then exists because there are two instances of the virtual machine. However, only one instance is able to read or write the virtual machine's virtual disks. VM Component Protection can be used to prevent this split-brain condition. When you enable VMCP with the aggressive setting, it monitors the datastore accessibility of powered-on virtual machines, and shuts down those that lose access to their datastores.

To recover from this situation, ESXi generates a question on the virtual machine that has lost its disk locks when the host comes out of isolation and cannot reacquire them. vSphere HA automatically answers this question, allowing the virtual machine instance that has lost the disk locks to power off, leaving just the instance that has the disk locks.

Factors Considered for Virtual Machine Restarts

After a failure, the cluster's master host attempts to restart affected virtual machines by identifying a host that can power them on. When choosing such a host, the master host considers a number of factors.

File accessibility. Before a virtual machine can be started, its files must be accessible from one of the active cluster hosts that the master can communicate with over the network.

Virtual machine and host compatibility. If there are accessible hosts, the virtual machine must be compatible with at least one of them. The compatibility set for a virtual machine includes the effect of any required VM-Host affinity rules. For example, if a rule only permits a virtual machine to run on two hosts, it is considered for placement on those two hosts.

Resource reservations. Of the hosts that the virtual machine can run on, at least one must have sufficient unreserved capacity to meet the memory overhead of the virtual machine and any resource reservations. Four types of reservations are considered: CPU, Memory, vNIC, and Virtual flash. Also, sufficient network ports must be available to power on the virtual machine.

Host limits. In addition to resource reservations, a virtual machine can only be placed on a host if doing so does not violate the maximum number of allowed virtual machines or the number of in-use vCPUs.

Feature constraints. If the advanced option has been set that requires vSphere HA to enforce VM to VM anti-affinity rules, vSphere HA does not violate this rule. Also, vSphere HA does not violate any configured per host limits for fault tolerant virtual machines.

If no hosts satisfy the preceding considerations, the master host issues an event stating that there are not enough resources for vSphere HA to start the VM and tries again when the cluster conditions have changed. For example, if the virtual machine is not accessible, the master host tries again after a change in file accessibility.
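The factors above act as successive filters on the set of candidate hosts. The following sketch shows that filtering under simplifying assumptions: resource reservations are collapsed into one number, host limits into a virtual machine count, and feature constraints are omitted. The class and field names are hypothetical and are not part of any vSphere API.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateHost:
    name: str
    accessible_datastores: set
    unreserved_capacity: int   # stands in for CPU, memory, vNIC, and flash capacity
    running_vms: int
    max_vms: int


@dataclass
class RestartVM:
    name: str
    datastores: set
    reservation: int           # reservations plus memory overhead, as one figure
    allowed_hosts: set = field(default_factory=set)  # empty means no required VM-Host rule


def eligible_hosts(vm, hosts):
    """Return the names of hosts that pass every check, in input order."""
    result = []
    for host in hosts:
        if not vm.datastores <= host.accessible_datastores:
            continue  # file accessibility
        if vm.allowed_hosts and host.name not in vm.allowed_hosts:
            continue  # compatibility, including required VM-Host affinity rules
        if host.unreserved_capacity < vm.reservation:
            continue  # resource reservations
        if host.running_vms >= host.max_vms:
            continue  # host limits
        result.append(host.name)
    return result


candidates = [
    CandidateHost("esx-a", {"ds1", "ds2"}, 16, 10, 100),
    CandidateHost("esx-b", {"ds1"}, 32, 10, 100),        # cannot reach ds2
    CandidateHost("esx-c", {"ds1", "ds2"}, 4, 10, 100),  # too little capacity
]
print(eligible_hosts(RestartVM("db01", {"ds2"}, 8), candidates))  # ['esx-a']
```

An empty result corresponds to the case in which the master host reports insufficient resources and tries again after cluster conditions change.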

Limits for Virtual Machine Restart Attempts

If the vSphere HA master agent's attempt to restart a VM, which involves registering it and powering it on, fails, this restart is retried after a delay. vSphere HA attempts these restarts for a maximum number of attempts (6 by default), but not all restart failures count against this maximum.

For example, the most likely reason for a restart attempt to fail is that the VM is still running on another host or that vSphere HA tried to restart the VM too soon after it failed. In this situation, the master agent delays the retry attempt by twice the delay imposed after the last attempt, with a 1 minute minimum delay and a 30 minute maximum delay. Thus if the delay is set to 1 minute, there is an initial attempt at T=0, then additional attempts made at T=1 (1 minute), T=3 (3 minutes), T=7 (7 minutes), T=15 (15 minutes), and T=30 (30 minutes). Each such attempt is counted against the limit and only six attempts are made by default.
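The doubling rule can be worked through in a few lines. This sketch applies the rule literally: each delay is twice the previous one, with a 1 minute minimum and a 30 minute maximum. The function and parameter names are assumptions. Strict doubling puts the sixth attempt at T=31 rather than the T=30 listed above, and the sketch does not try to reconcile that difference.

```python
def retry_schedule(initial_delay=1, max_delay=30, attempts=6):
    """Return the minute marks of each restart attempt, starting at T=0."""
    times = [0]
    delay = max(1, initial_delay)           # 1 minute minimum delay
    while len(times) < attempts:
        times.append(times[-1] + delay)
        delay = min(delay * 2, max_delay)   # double the delay, capped at 30 minutes
    return times


print(retry_schedule())  # [0, 1, 3, 7, 15, 31]
```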

Other restart failures result in countable retries but with a different delay interval. An example scenario is when the host chosen to restart the virtual machine loses access to one of the VM's datastores after the choice was made by the master agent. In this case, a retry is attempted after a default delay of 2 minutes. This attempt also counts against the limit.

Finally, some retries are not counted. For example, if the host on which the virtual machine was to be restarted fails before the master agent issues the restart request, the attempt is retried after 2 minutes but this failure does not count against the maximum number of attempts.

Virtual Machine Restart Notifications

vSphere HA generates a cluster event when a failover operation is in progress for virtual machines in the cluster. The event also displays a configuration issue in the Cluster Summary tab which reports the number of virtual machines that are being restarted. There are four different categories of such VMs.

VMs being placed: vSphere HA is in the process of trying to restart these VMs.

VMs awaiting a retry: a previous restart attempt failed, and vSphere HA is waiting for a timeout to expire before trying again.

VMs requiring additional resources: insufficient resources are available to restart these VMs. vSphere HA retries when more resources become available, for example a host comes back online.

Inaccessible Virtual SAN VMs: vSphere HA cannot restart these Virtual SAN VMs because they are not accessible. It retries when there is a change in accessibility.

These virtual machine counts are dynamically updated whenever a change is observed in the number of VMs for which a restart operation is underway. The configuration issue is cleared when vSphere HA has restarted all VMs or has given up trying.

In vSphere 5.5 or earlier, a per-VM event is triggered for an unsuccessful attempt to restart the virtual machine. This event is disabled by default in vSphere 6.x and can be enabled by setting the vSphere HA advanced option das.config.fdm.reportfailoverfailevent to 1.

VM and Application Monitoring

VM Monitoring restarts individual virtual machines if their VMware Tools heartbeats are not received within a set time. Similarly, Application Monitoring can restart a virtual machine if the heartbeats for an application it is running are not received. You can enable these features and configure the sensitivity with which vSphere HA monitors non-responsiveness.

When you enable VM Monitoring, the VM Monitoring service (using VMware Tools) evaluates whether each virtual machine in the cluster is running by checking for regular heartbeats and I/O activity from the VMware Tools process running inside the guest. If no heartbeats or I/O activity are received, this is most likely because the guest operating system has failed or VMware Tools is not being allocated any time to complete tasks. In such a case, the VM Monitoring service determines that the virtual machine has failed and the virtual machine is rebooted to restore service.

Occasionally, virtual machines or applications that are still functioning properly stop sending heartbeats. To avoid unnecessary resets, the VM Monitoring service also monitors a virtual machine's I/O activity. If no heartbeats are received within the failure interval, the I/O stats interval (a cluster-level attribute) is checked. The I/O stats interval determines if any disk or network activity has occurred for the virtual machine during the previous two minutes (120 seconds). If not, the virtual machine is reset. This default value (120 seconds) can be changed using the advanced option das.iostatsinterval.
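The two-stage check described above, heartbeats first and then I/O activity, can be expressed as a single decision function. This is a sketch of the described logic only. The function and parameter names are assumptions, and the 120-second default mirrors the das.iostatsinterval default stated above.

```python
def should_reset_vm(seconds_since_heartbeat, seconds_since_io,
                    failure_interval, io_stats_interval=120):
    """Decide whether VM Monitoring would reset a virtual machine.

    seconds_since_heartbeat: time since the last VMware Tools heartbeat.
    seconds_since_io: time since the last observed disk or network activity.
    failure_interval: heartbeat timeout for the chosen sensitivity, in seconds.
    """
    if seconds_since_heartbeat < failure_interval:
        return False   # heartbeats are still arriving in time
    if seconds_since_io < io_stats_interval:
        return False   # recent I/O suggests the guest is still working
    return True


print(should_reset_vm(45, 300, failure_interval=30))  # True
print(should_reset_vm(45, 60, failure_interval=30))   # False
```

The second call shows the case the I/O check exists for: heartbeats have stopped, but recent disk or network activity prevents an unnecessary reset.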

To enable Application Monitoring, you must first obtain the appropriate SDK (or be using an application that supports VMware Application Monitoring) and use it to set up customized heartbeats for the applications you want to monitor. After you have done this, Application Monitoring works much the same way that VM Monitoring does. If the heartbeats for an application are not received for a specified time, its virtual machine is restarted.

You can configure the level of monitoring sensitivity. Highly sensitive monitoring results in a more rapid conclusion that a failure has occurred. While unlikely, highly sensitive monitoring might lead to falsely identifying failures when the virtual machine or application in question is actually still working, but heartbeats have not been received due to factors such as resource constraints. Low sensitivity monitoring results in longer interruptions in service between actual failures and virtual machines being reset. Select an option that is an effective compromise for your needs.


The default settings for monitoring sensitivity are described in Table 2-1. You can also specify custom values for both monitoring sensitivity and the I/O stats interval by selecting the Custom checkbox.

Table 2‑1. VM Monitoring Settings

Setting    Failure Interval (seconds)    Reset Period

High       30                            1 hour

Medium     60                            24 hours

Low        120                           7 days

After failures are detected, vSphere HA resets virtual machines so that services remain available. To avoid resetting virtual machines repeatedly for nontransient errors, by default, a virtual machine is reset at most three times during a configurable time interval (the reset period). After a virtual machine has been reset three times, vSphere HA makes no further attempts to reset it after subsequent failures until the specified time has elapsed. You can configure the number of resets using the Maximum per-VM resets custom setting.
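The reset limit can be illustrated with a small sketch. The preset values are taken from Table 2-1 and the default of three resets from the paragraph above. Everything else is an assumption for illustration: the `ResetLimiter` class, its method names, and the use of a sliding window to count resets are hypothetical and do not describe how vSphere HA is implemented.

```python
from collections import deque

HOUR_S = 3600

# (failure interval in seconds, reset period in seconds), per Table 2-1.
PRESETS = {
    "High": (30, 1 * HOUR_S),
    "Medium": (60, 24 * HOUR_S),
    "Low": (120, 7 * 24 * HOUR_S),
}


class ResetLimiter:
    """Hypothetical per-VM counter that caps resets within a reset period."""

    def __init__(self, reset_period_s: int, max_resets: int = 3):
        self.reset_period_s = reset_period_s
        self.max_resets = max_resets
        self._reset_times = deque()  # timestamps of resets still inside the window

    def try_reset(self, now_s: float) -> bool:
        """Record a reset at now_s if allowed; return whether it was allowed."""
        # Forget resets that are older than the reset period (assumed sliding window).
        while self._reset_times and now_s - self._reset_times[0] >= self.reset_period_s:
            self._reset_times.popleft()
        if len(self._reset_times) >= self.max_resets:
            return False
        self._reset_times.append(now_s)
        return True

    def clear(self) -> None:
        """Clear the reset statistics, for example after a power cycle."""
        self._reset_times.clear()
```

With the High preset, resets requested at 0, 600, 1200, and 1800 seconds yield three allowed resets followed by one refusal in this model.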

NOTE The reset statistics are cleared when a virtual machine is powered off then back on, or when it is migrated using vMotion to another host. A reset causes the guest operating system to reboot, but is not the same as a 'restart' in which the power state of the virtual machine is changed.

If a virtual machine has a datastore accessibility failure (either All Paths Down or Permanent Device Loss), the VM Monitoring service suspends resetting it until the failure has been addressed.

VM Component Protection

If VM Component Protection (VMCP) is enabled, vSphere HA can detect datastore accessibility failures and provide automated recovery for affected virtual machines.

VMCP provides protection against datastore accessibility failures that can affect a virtual machine running on a host in a vSphere HA cluster. When a datastore accessibility failure occurs, the affected host can no longer access the storage path for a specific datastore. You can determine the response that vSphere HA will make to such a failure, ranging from the creation of event alarms to virtual machine restarts on other hosts.

NOTE When you use the VM Component Protection feature, your ESXi hosts must be version 6.0 or higher.

Types of Failure

There are two types of datastore accessibility failure, PDL and APD:

PDL (Permanent Device Loss) is an unrecoverable loss of accessibility that occurs when a storage device reports the datastore is no longer accessible by the host. This condition cannot be reverted without powering off virtual machines.

APD (All Paths Down) represents a transient or unknown accessibility loss or any other unidentified delay in I/O processing. This type of accessibility issue is recoverable.


[Figure: VMCP recovery timeline. Labels: t=0 (APD_START); t=140s (APD_TIMEOUT); t=140s + 3m (terminate and failover VM); if APD clears, reset VM.]


Configuring VMCP

VM Component Protection is enabled and configured in the vSphere Web Client. To enable this feature, you must select the Protect against Storage Connectivity Loss checkbox in the edit cluster settings wizard. The storage protection levels you can choose and the virtual machine remediation actions available differ depending on the type of datastore accessibility failure.

PDL failures

A virtual machine is automatically failed over to a new host unless you have configured VMCP only to Issue events.

APD events

The response to APD events is more complex and accordingly the configuration is more fine-grained.

After the user-configured Delay for VM failover for APD period has elapsed, the action taken depends on the policy you selected. An event is issued and the virtual machine is restarted either conservatively or aggressively. The conservative approach does not terminate the virtual machine if the success of the failover is unknown, for example in a network partition. The aggressive approach does terminate the virtual machine under these conditions. Neither approach terminates the virtual machine if there are insufficient resources in the cluster for the failover to succeed.

If APD recovers before the user-configured Delay for VM failover for APD period has elapsed, you can choose to reset the affected virtual machines, which recovers the guest applications that were impacted by the I/O failures.

NOTE If either the Host Monitoring or VM Restart Priority settings are disabled, VMCP cannot perform virtual machine restarts. Storage health can still be monitored and events can be issued, however.
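The difference between the conservative and aggressive APD responses can be summarized as a small decision function. This is a sketch of the rules stated above, not a VMware API: `ApdPolicy`, `FailoverOutlook`, and `terminates_vm` are hypothetical names, and the `restarts_enabled` flag stands in for the Host Monitoring and VM Restart Priority settings mentioned in the note.

```python
from enum import Enum


class ApdPolicy(Enum):
    CONSERVATIVE = "conservative"
    AGGRESSIVE = "aggressive"


class FailoverOutlook(Enum):
    LIKELY_TO_SUCCEED = "likely to succeed"
    UNKNOWN = "unknown"  # for example, in a network partition
    INSUFFICIENT_RESOURCES = "insufficient resources"


def terminates_vm(policy: ApdPolicy,
                  outlook: FailoverOutlook,
                  restarts_enabled: bool = True) -> bool:
    """Return True if the VM would be terminated for failover after the APD delay."""
    # With Host Monitoring or VM Restart Priority disabled, no restarts are performed.
    if not restarts_enabled:
        return False
    # Neither approach terminates the VM when the cluster lacks resources.
    if outlook is FailoverOutlook.INSUFFICIENT_RESOURCES:
        return False
    # Only the aggressive approach terminates the VM when success is unknown.
    if outlook is FailoverOutlook.UNKNOWN:
        return policy is ApdPolicy.AGGRESSIVE
    return True
```

In this model the two policies differ only in the `UNKNOWN` case, which matches the description of a network partition above.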

For more information on configuring VMCP, see “Configure Virtual Machine Responses,” on page 35.

VMCP Recovery Timeline

The following timeline graphically demonstrates how VMCP recovers from a storage failure.

T=0s: A storage failure is detected. vSphere HA starts the recovery process. For a PDL event, the workflow immediately starts and VMs are restarted on healthy hosts in the cluster. If the storage loss is due to an APD event, the APD Timeout timer starts (the default is 140 seconds).

T=140s: The host declares an APD Timeout and begins to fail non-VM I/O to the unresponsive storage device.

Between T=140s and T=320s: This is the time period defined by the Delay for VM failover for APD, which is 3 minutes by default. The guest applications might become unstable after losing access to storage for an extended period of time. If an APD is cleared in this time period, the option to reset the VMs is available.
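The arithmetic behind the timeline for an APD event can be checked with a short sketch. The two default values come from the text. The function name `apd_phase` and the phase labels are hypothetical and exist only to show where the 320-second mark comes from.

```python
APD_TIMEOUT_S = 140               # default APD Timeout
DELAY_FOR_VM_FAILOVER_S = 3 * 60  # default Delay for VM failover for APD


def apd_phase(elapsed_s: float,
              apd_timeout_s: float = APD_TIMEOUT_S,
              failover_delay_s: float = DELAY_FOR_VM_FAILOVER_S) -> str:
    """Map seconds since an APD condition started to a phase of the timeline."""
    if elapsed_s < apd_timeout_s:
        return "APD timeout timer running"
    if elapsed_s < apd_timeout_s + failover_delay_s:
        return "APD timeout declared; failover delay running"
    return "failover delay elapsed"


# 140 s + 180 s gives the 320-second mark used in the timeline.
FAILOVER_DEADLINE_S = APD_TIMEOUT_S + DELAY_FOR_VM_FAILOVER_S
```

If either default is changed, the end of the failover delay moves accordingly. For example, a 5-minute delay would place it at 440 seconds.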
