This guide describes hitless failover and hitless upgrade, including:
• Causes and Behaviors of MSM Failover
• Summary of Supported Features
• Overview of Hitless Failover
• Configuring Hitless Failover
• Configuring ESRP for Hitless Failover
• Overview of Hitless Upgrade
• Performing a Hitless Upgrade
T-sync is a term used to describe the hitless failover and hitless upgrade features available on the
BlackDiamond MSM-3. Hitless failover transfers switch
management control from the master MSM-3 to the slave MSM-3 without causing traffic to be dropped.
Hitless upgrade allows an ExtremeWare software upgrade on the switch
without taking it out of service or losing traffic.
To configure hitless failover or hitless upgrade, you must install MSM-3 modules in your BlackDiamond
chassis; MSM64i modules do not support hitless failover or hitless upgrade.
If you enable T-sync and normally use scripts to configure your switch, Extreme Networks recommends
using the
NOTE
To use the T-sync features available on the MSM-3 modules, you must install and run ExtremeWare
Part Number: 121071-00 Rev 011
Causes and Behaviors of MSM Failover
This section describes the events that cause an MSM failover and the behavior of the system after
failover occurs.
The following events cause an MSM failover:
• Operator command
• Software exception
• Watchdog timeout
• Keepalive failure
• Diagnostic failure
• Hot-removal of the master MSM
• Hard-reset of the master MSM
NOTE
Operator command and software exception support hitless failover.
Operator Command and Software Exception. Of the listed events, only operator command and
software exception result in a hitless failover. The remaining sections of this guide describe T-sync,
including:
• Supported features
• How to configure the T-sync features
• The behavior surrounding hitless failover and hitless upgrade
Watchdog Timeout and Keepalive Failure. Both the watchdog timeout and the keepalive failure are
long-duration events, so they are not hitless. If one of these events occurs:
• All saved operational state information is discarded
• The failed master is hard reset
• The slave uses its own flash configuration file
Diagnostic Failure, Hot-removal, or Hard-reset of the Master MSM. If the master MSM-3
experiences a diagnostic failure or you hot-remove it, a “partial” hitless failover function is performed
and some traffic flows will not experience traffic hits. The switch cannot perform a completely hitless
failover because it lost hardware that it uses during normal operation.
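The classification of failover causes described in this section can be summarized as a small lookup table. The following is a minimal sketch in Python, not switch software: the event strings and the `Outcome` and `failover_outcome` names are illustrative labels chosen here, not ExtremeWare identifiers.

```python
from enum import Enum


class Outcome(Enum):
    HITLESS = "hitless"
    PARTIAL = "partial hitless"
    NOT_HITLESS = "not hitless"


# Event names are illustrative labels taken from the list of failover
# causes in this guide; they are not ExtremeWare keywords.
FAILOVER_OUTCOME = {
    "operator command": Outcome.HITLESS,
    "software exception": Outcome.HITLESS,
    "watchdog timeout": Outcome.NOT_HITLESS,
    "keepalive failure": Outcome.NOT_HITLESS,
    "diagnostic failure": Outcome.PARTIAL,
    "hot-removal": Outcome.PARTIAL,
    "hard-reset": Outcome.PARTIAL,
}


def failover_outcome(event: str) -> Outcome:
    """Return the failover behavior this guide describes for an event."""
    return FAILOVER_OUTCOME[event.lower()]
```

For example, `failover_outcome("Software exception")` returns `Outcome.HITLESS`, while `failover_outcome("Watchdog timeout")` returns `Outcome.NOT_HITLESS`.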
To understand how traffic is affected when MSM-3 hardware is lost, a brief explanation of the switch
fabric is given. Each MSM-3 has switching logic that provides bandwidth to each I/O module. When
two MSM-3s are present, both provide bandwidth so that twice the amount of bandwidth is available.
For each traffic flow that requires inter-module data movement, the I/O module chooses an MSM-3 to
switch the data for that flow. When an MSM-3 is lost, the remaining MSM-3 eventually instructs the I/O
module that all inter-module traffic is to use the switching logic of the remaining MSM-3. In the time
between the loss of an MSM-3 and the reprogramming of the I/O module, traffic destined for the lost
MSM-3 switching logic is dropped.
The I/O module also switches some traffic flows directly between its own ports without MSM-3
involvement.
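To make the traffic loss window concrete, the following toy model sketches how inter-module flows might be spread across two MSM-3s and which flows are dropped between the loss of one MSM-3 and the reprogramming of the I/O module. It is an illustration only: the class name, the MSM labels "A" and "B", and the alternating selection rule are assumptions, since this guide does not describe how an I/O module actually chooses an MSM-3. Flows switched locally within an I/O module are not modeled because they do not use MSM-3 switching logic.

```python
from dataclasses import dataclass, field


@dataclass
class IOModule:
    """Toy model of an I/O module choosing an MSM-3 per inter-module flow."""

    active_msms: list = field(default_factory=lambda: ["A", "B"])
    flow_to_msm: dict = field(default_factory=dict)

    def assign(self, flow: str) -> str:
        # Hypothetical selection rule: alternate across the active MSM-3s.
        msm = self.active_msms[len(self.flow_to_msm) % len(self.active_msms)]
        self.flow_to_msm[flow] = msm
        return msm

    def dropped_flows(self, lost_msm: str) -> list:
        # Flows still directed at the lost MSM-3 are dropped until the
        # I/O module is reprogrammed.
        return [f for f, m in self.flow_to_msm.items() if m == lost_msm]

    def reprogram(self, lost_msm: str) -> None:
        # The remaining MSM-3 instructs the I/O module to send all
        # inter-module traffic through its own switching logic.
        self.active_msms = [m for m in self.active_msms if m != lost_msm]
        survivor = self.active_msms[0]
        for flow in self.flow_to_msm:
            self.flow_to_msm[flow] = survivor


io = IOModule()
for name in ["f1", "f2", "f3", "f4"]:
    io.assign(name)

before = io.dropped_flows("A")  # ['f1', 'f3'] are dropped while "A" is lost
io.reprogram("A")
after = io.dropped_flows("A")   # [] once all flows use MSM "B"
```

In this model only the flows assigned to the lost MSM-3 are affected, which matches the "partial" hitless behavior described for hardware loss.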
If you hot-remove the master MSM-3, only half of the switch fabric remains operational. The slave
becomes the master and reprograms each I/O module to send all traffic through its own switch fabric
logic. In the time between the failure and the reprogramming of the I/O module, traffic destined for the
removed MSM-3’s switching logic is lost. After the new master recovers, it reprograms the I/O module
so that all traffic uses the available MSM-3 switching logic.
If you hard-reset the master MSM-3 (using the recessed reset button on the MSM-3), all of the master’s
switch programming is lost. As a result, traffic that the I/O module forwards to the master is also lost.
Hitless Failover and Hitless Upgrade User Guide
After a failover occurs, the new master reprograms the “reset” MSM-3’s switch fabric and the switching
logic of both MSM-3s is available again. In this case, the “Cause of last MSM failover” displayed by the
show msm-failover command indicates “removal,” and a “partial” hitless failover has occurred.
A “partial” hitless failover preserves:
• Data flows in the hardware and software, layer 2 protocol states, configurations, etc.
• All of the software states and the hardware states that are not interrupted by the diagnostic failure,
hot-removal, or hard-reset.
After a failover caused by hot-removal or diagnostic failure, the I/O modules are reprogrammed to use
only the switching logic of the remaining MSM-3. After a failover caused by a hard-reset of the master
MSM-3, the reset MSM-3’s switch fabric is reprogrammed and placed into full operation. Thus, a data
hit of several seconds occurs for flows that were directed to the failed MSM-3. For flows that were
directed to the currently active MSM-3, or for intra-module flows, there is no hit.
NOTE
Hitless upgrade of configuration is not supported on MSM-3.
Summary of Supported Features
This section describes the features supported by T-sync. If the information in the release notes differs
from the information in this guide, follow the release notes.
• Preserves unsaved configurations across a failover
• Load sharing
• Learned MAC address
• ARP
• STP
• EAPSv1
• IP FDB entries
• Access lists
• ESRP
• SNMP trap failover
• Configuration via the web, CLI, and SNMP
NOTE
T-sync does not support EAPSv2.
Overview of Hitless Failover
When you install two MSM-3 modules in a BlackDiamond chassis, one MSM-3 assumes the role of
master and the other assumes the role of slave. The master executes the switch’s management function,
and the slave acts in a standby role. Hitless failover is a mechanism to transfer switch management
control from the master to the slave.
When there is a software exception in the master, the slave may be configured to take over as the
master. Without T-sync, a software exception results in a traffic “hit” because the hardware is
reinitialized and all FDB information is lost. The modules require seconds to complete the initialization,
but it may take minutes to relearn the forwarding information from the network. With T-sync, it is
possible for this transition to occur without interrupting existing unicast traffic flows.
During failover, the master passes control of all system management functions to the slave. In addition,
hitless failover preserves layer 2 data and layer 3 unicast flows for recently routed packets. When a
hitless failover event occurs, the failover timer begins and all previously established traffic flows
continue to function without packet loss. Hitless failover also preserves the:
• Master’s active configuration (both saved and unsaved)
• Forwarding and resolution database entries (layer 2, layer 3, and ARP)
• Loop redundancy and protocol states (STP, EAPS, ESRP, and others)
• Load shared ports
• Access control lists
NOTE
Hitless failover does not preserve the full route table, routing protocol databases for OSPF, BGP, RIP,
etc., or ICMP traffic.
Hitless Failover Concepts
T-sync preserves the current active configuration across a hitless failover. When you first boot up your
BlackDiamond switch, it uses the master MSM-3 configuration. During the initialization of the slave, the
master’s active configuration is relayed to the slave. As you make configuration changes to the master,
the master relays those individual changes to the slave. When a failover occurs, the slave continues to
use the master’s configuration. Regardless of the number of failovers, the active configuration remains
in effect provided the slave can process it.
NOTE
It is important to save any switch configuration changes that you make. Configuration changes made in
real-time must be saved on the master MSM-3 to guarantee hitless failover and hitless upgrade
operation. Failure to save the configuration may result in an unstable environment after the hitless
failover or upgrade operation is complete.
If a hitless failover occurs before you can save the changes, the changes are still in effect on the new
master MSM-3. An asterisk appears in front of the command-line prompt if unsaved configuration changes are
present after a hitless failover. To save your changes after a hitless failover, use the
save command.
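The configuration relay described in this section can be pictured with a small model. The sketch below is a simplified illustration in Python, not the switch implementation: the class names, the MSM names, and the prompt format are hypothetical, and the configuration is reduced to a dictionary.

```python
class Msm:
    """Toy MSM-3 with a saved (flash) and an active (running) configuration."""

    def __init__(self, name, saved_config):
        self.name = name
        self.saved = dict(saved_config)
        self.active = dict(saved_config)

    @property
    def unsaved(self):
        return self.active != self.saved

    def prompt(self):
        # Hypothetical prompt format; the asterisk marks unsaved changes.
        return ("* " if self.unsaved else "") + self.name + " # "

    def save(self):
        self.saved = dict(self.active)


class Chassis:
    def __init__(self, saved_config):
        self.master = Msm("MSM-A", saved_config)
        self.slave = Msm("MSM-B", saved_config)
        # During slave initialization, the master's active configuration
        # is relayed to the slave.
        self.slave.active = dict(self.master.active)

    def configure(self, key, value):
        self.master.active[key] = value
        # Each individual change is relayed to the slave.
        self.slave.active[key] = value

    def hitless_failover(self):
        # The slave takes over and keeps using the relayed configuration.
        self.master, self.slave = self.slave, self.master


chassis = Chassis({"vlan": "default"})
chassis.configure("stp", "enabled")   # unsaved change on the master
chassis.hitless_failover()
prompt_after_failover = chassis.master.prompt()  # starts with an asterisk
chassis.master.save()
prompt_after_save = chassis.master.prompt()      # asterisk is gone
```

The unsaved change survives the failover because it was relayed before the failover occurred; saving on the new master then clears the unsaved state.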
NOTE
If you have a BlackDiamond 6816 switch populated with four MSM-3 modules, the MSMs in slots C and
D provide extra switch bandwidth; they do not participate in switch management functions.
Configuring Hitless Failover
You can configure failover so that one of the following occurs:
• All links are forced to be in a down state (nothing is preserved)
• Only the configuration is preserved
• Only the link up/down state is preserved
• The configuration and link up/down states are preserved
• The configuration, link up/down states, and layer 2 FDB and states (STP, EAPS, and ESRP) are
preserved
• The configuration, link up/down states, layer 2 FDB and states, and the layer 3 FDB and ARP table
are preserved
Hitless failover operation utilizes the last two options. To enable hitless failover, see the following
section, “Enabling Hitless Failover.”
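The six failover options listed above differ only in what state they preserve, so they can be sketched as sets. In the following Python illustration, the labels "config", "links", "l2", and "l3" and the `is_hitless` helper are names chosen for this example, not CLI keywords.

```python
# Each entry is the state preserved by one failover option, in the
# order the options are listed in this section.
OPTIONS = [
    frozenset(),                                   # all links forced down
    frozenset({"config"}),                         # configuration only
    frozenset({"links"}),                          # link up/down state only
    frozenset({"config", "links"}),
    frozenset({"config", "links", "l2"}),          # plus layer 2 FDB and states
    frozenset({"config", "links", "l2", "l3"}),    # plus layer 3 FDB and ARP
]


def is_hitless(preserved) -> bool:
    """Hitless failover needs at least configuration, links, and layer 2."""
    return {"config", "links", "l2"} <= set(preserved)


hitless_flags = [is_hitless(option) for option in OPTIONS]
# Only the last two options qualify as hitless failover.
```

Evaluating `hitless_flags` gives `False` for the first four options and `True` for the last two, matching the statement that hitless failover uses the last two options.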
You can also configure ESRP hitless failover behavior. See “Configuring ESRP for Hitless Failover” on
page 8 for more information.
To use the hitless failover feature, you must have a BlackDiamond 6800 series chassis installed with
MSM-3 modules running ExtremeWare 7.1.1 or later and BootROM 8.1 or later.
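The version prerequisites above can be checked mechanically. The sketch below assumes version strings are plain dotted numbers such as "7.1.1"; real version strings may carry build suffixes that this helper does not handle, and the function names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '7.1.1' into (7, 1, 1)."""
    return tuple(int(part) for part in text.split("."))


def supports_hitless_failover(module: str, extremeware: str, bootrom: str) -> bool:
    """Check the module type and minimum versions stated in this guide."""
    return (
        module == "MSM-3"                              # MSM64i is not supported
        and parse_version(extremeware) >= (7, 1, 1)    # ExtremeWare 7.1.1 or later
        and parse_version(bootrom) >= (8, 1)           # BootROM 8.1 or later
    )
```

For example, `supports_hitless_failover("MSM-3", "7.1.1", "8.1")` returns `True`, while an MSM64i module or ExtremeWare "7.1" returns `False`.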
Enabling Hitless Failover
To enable hitless failover, you need to:
• Configure the system recovery level to automatically reboot after a software exception
• Enable the slave MSM-3 to “inherit” its configuration from the master MSM-3
• Configure the external ports to remain active when a failover occurs
• Enable the preservation of layer 2 and/or layer 3 state in the slave MSM-3
NOTE
If you have an active Telnet session and initiate a hitless failover on that switch, the session disconnects
when failover occurs.
Configuring the System Recovery Level
You must configure the slave MSM-3 to take over control of the switch if there is a software exception
on the master. To configure the slave to assume the role of master, use the following command: