Table of contents
PCI / PCI EXPRESS Error Recovery White Paper
Executive Summary
Problem Statement
Historical Evolution of PCI/PCI-Express Error Recovery (ER)
Why PCI/PCI-Express Error Recovery
Types of Error Recovery
How PCI / PCI-Express Error Recovery Works
Support of PCI/PCI-Express Error Handling and Error Recovery on HP-UX
Summary
References
Glossary
For more information
Executive Summary
In the context of software applications, Reliability, Availability, and Serviceability (RAS) means that failures in
the underlying processes and hardware components must not interrupt overall system
operation. A system failure disrupts service transactions, and the resulting
service downtime degrades performance. Recovering from the failure early,
managing the remaining working components of the system, or both, with minimal impact to
business services, is the key to an optimized system that meets the RAS criteria.
In computer systems, PCI™ failures constitute a significant percentage of errors. The PCI Error
Recovery (PCI ER) feature enables the detection of PCI bus parity errors, isolation of the failed I/O path,
and recovery of cards from errors. Enabling the PCI ER feature avoids a system crash, decreases system
downtime, and supports single-system high availability.
On legacy systems running HP-UX 11i v2, user intervention is required to attempt recovery from PCI errors. The
olrad(1M) command and the Attention Button can be used to recover and restore the slot, card, and
driver to a usable state without taking the system down. In contrast, HP-UX 11i v3 can
automatically recover from PCI errors and restore the slot and driver without user intervention.
Problem Statement
Without PCI/PCI EXPRESS® (PCIe®) error recovery, I/O paths operate in Hardfail mode. In
this mode, a PCI I/O error, a rope error, or a PCIe error causes an MCA on a PIO read and
brings the system down. With the PCI ER feature, PCI I/O paths can be set to SoftFail mode if the platform
and all the adapter drivers support this feature.
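The condition for SoftFail mode can be read as a simple predicate. The following is a minimal sketch, not HP-UX code: the function name `io_path_mode`, its parameters, and the string labels are hypothetical and exist only to illustrate that a single adapter driver without PCI ER support keeps the path in Hardfail mode.

```python
from typing import Iterable

HARDFAIL = "Hardfail"
SOFTFAIL = "SoftFail"


def io_path_mode(platform_supports_er: bool,
                 drivers_support_er: Iterable[bool]) -> str:
    """Return the mode an I/O path could operate in (illustrative only).

    Assumption: a path qualifies for SoftFail only when the platform and
    every adapter driver on the path support PCI Error Recovery.
    """
    if platform_supports_er and all(drivers_support_er):
        return SOFTFAIL
    # Any missing support leaves the path in Hardfail mode.
    return HARDFAIL
```

For example, `io_path_mode(True, [True, False])` returns `"Hardfail"`, because one of the two drivers lacks support.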
Historical Evolution of PCI/PCI-Express Error Recovery (ER)
On systems running HP-UX 11i v1, PCI bus errors are handled and the cards are recovered
manually. This feature was shipped as a site-specific patch.
On systems running HP-UX 11i v2, PCI bus errors are handled and the cards are recovered
manually. The feature was shipped as an optional product bundle (PCIErrorHandling) on the
SupportPack media starting with the AR0806 release.
On systems running HP-UX 11i v3, PCI bus errors are handled and the cards are recovered
automatically. This feature is part of the Base Operating Environment.
Why PCI/PCI-Express Error Recovery
Interruptions to online transaction processing systems and enterprise resource planning services, caused by
a failed application, system, or hardware component, can be costly and disruptive. The impact of service
downtime continues to grow as companies move toward a real-time business model. Moreover, as
companies become more connected and response times shorten, the cost of service downtime continues to
increase. For these reasons, businesses invest large amounts of money in maintaining the RAS of
servers in the IT infrastructure. The PCI ER feature can drastically reduce the system downtime
caused by PCI errors.
Figure 1 and Figure 2 compare the system behavior when a PCI error occurs, without and with the error
recovery feature.
Figure 1: Without PCI Error Recovery
Before the error: HP-UX server up and running. I/O Card 1: available. I/O Card 2: available. I/O Card 3: available and has encountered a PCI error, which causes an HPMC/MCA.
After the error: HP-UX server down. I/O Card 1, I/O Card 2, and I/O Card 3: not available.
Figure 2: With PCI Error Recovery
When the error occurs: HP-UX server up and running. I/O Card 1: available. I/O Card 2: available. I/O Card 3: available and has encountered a PCI error.
Recovering I/O Card 3 using PCI Error Recovery: HP-UX server up and running. I/O Card 1: available. I/O Card 2: available. I/O Card 3: driver suspended; card not available due to PCI error.
After successful card recovery: HP-UX server up and running. I/O Card 1: available. I/O Card 2: available. I/O Card 3: driver resumed; card available.
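The card states shown in Figure 2 can be modeled as a small state machine. This is a minimal sketch under stated assumptions, not the HP-UX implementation: the event names `pci_error` and `recovery_succeeded`, the `CardState` type, and the `apply_event` helper are hypothetical. The figure names only the states, and it does not show what happens when recovery fails, so that case is left out.

```python
from enum import Enum


class CardState(Enum):
    AVAILABLE = "available"
    SUSPENDED = "driver suspended, card not available"


# Hypothetical event names; Figure 2 names only the card states.
TRANSITIONS = {
    (CardState.AVAILABLE, "pci_error"): CardState.SUSPENDED,
    (CardState.SUSPENDED, "recovery_succeeded"): CardState.AVAILABLE,
}


def apply_event(cards, card, event):
    """Return a new card -> state mapping after one event on one card."""
    updated = dict(cards)
    # Assumption: an event with no listed transition leaves the state unchanged.
    updated[card] = TRANSITIONS.get((cards[card], event), cards[card])
    return updated


# Three cards share the server; only card3 encounters an error.
cards = {name: CardState.AVAILABLE for name in ("card1", "card2", "card3")}
during = apply_event(cards, "card3", "pci_error")
after = apply_event(during, "card3", "recovery_succeeded")
```

Because each event touches only the affected card, `card1` and `card2` stay available in `during`. That matches the point of the figure: the error is isolated to one I/O path while the server keeps running.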