HP PCI Error Handling and Recovery White Paper

Table of contents
PCI / PCI EXPRESS Error Recovery White Paper
Executive Summary
Problem Statement
Historical Evolution of PCI/PCI-Express Error Recovery (ER)
Why PCI/PCI-Express Error Recovery
Types of Error Recovery
How PCI / PCI-Express Error Recovery Works
Support of PCI/PCI-Express Error Handling and Error Recovery on HP-UX
References
Glossary
For more information

Executive Summary
In the context of software applications, Reliability, Availability, and Serviceability (RAS) means that failures in the underlying processes and hardware components must not interrupt overall system operation. A system failure disrupts service transactions, and the resulting service downtime degrades performance. Recovering early from the failure, managing the remaining working components of the system, or both, with minimal impact on business services, is the key to an optimized system that meets the RAS criteria.
In computer systems, PCI™ failures constitute a significant percentage of errors. The PCI Error Recovery (PCI ER) feature enables the detection of PCI bus parity errors, isolation of the failed I/O path, and recovery of cards from errors. Enabling the PCI ER feature avoids system crashes, decreases system downtime, and supports single-system high availability.
On legacy systems running the HP-UX 11i v2 OS, user intervention is required to attempt recovery from PCI errors. The olrad(1M) command and the Attention Button can be used to recover and restore the slot, card, and driver to a usable state without taking the system down. HP-UX 11i v3, in contrast, can recover from PCI errors automatically and restore the slot and driver without user intervention.
Problem Statement
Without PCI/PCI EXPRESS® (PCIe®) error recovery, I/O paths operate in HardFail mode. In this mode, a PCI I/O error, a rope error, or a PCIe error causes an MCA on a PIO read and brings the system down. With the PCI ER feature, PCI I/O paths can be set to SoftFail mode, provided that the platform and all the adapter drivers support the feature.
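The condition described above can be sketched as a small model. This is an illustrative sketch only, not HP-UX code: the names `PathMode`, `select_path_mode`, and `system_stays_up` are hypothetical, and the model assumes nothing beyond the rule stated in the text, namely that SoftFail mode requires support from the platform and from every adapter driver.

```python
from enum import Enum
from typing import Mapping


class PathMode(Enum):
    """Hypothetical labels for the two I/O path modes named in the text."""
    HARDFAIL = "HardFail"
    SOFTFAIL = "SoftFail"


def select_path_mode(platform_supports_er: bool,
                     driver_supports_er: Mapping[str, bool]) -> PathMode:
    """Return SoftFail only if the platform and all adapter drivers support PCI ER.

    Assumption of this sketch: driver support is given as a mapping from a
    driver name to a boolean. A single unsupported driver forces HardFail.
    """
    if platform_supports_er and all(driver_supports_er.values()):
        return PathMode.SOFTFAIL
    return PathMode.HARDFAIL


def system_stays_up(mode: PathMode) -> bool:
    """Model the outcome of a PCI error on a PIO read.

    In HardFail mode the error causes an MCA and the system goes down;
    in SoftFail mode the system keeps running.
    """
    return mode is PathMode.SOFTFAIL
```

For example, `select_path_mode(True, {"driver_a": True, "driver_b": False})` yields `PathMode.HARDFAIL`, because one driver (a made-up name) lacks support.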
Historical Evolution of PCI/PCI-Express Error Recovery (ER)
On systems running HP-UX 11i v1, PCI bus errors are handled and the cards are recovered manually. This feature was shipped as a site-specific patch. On systems running HP-UX 11i v2, PCI bus errors are likewise handled and the cards are recovered manually. The feature was shipped as an optional product bundle (PCIErrorHandling) on the SupportPack media, starting with the AR0806 release. On systems running HP-UX 11i v3, PCI bus errors are handled and the cards are recovered automatically. This feature is part of the Base Operating Environment.
Why PCI/PCI-Express Error Recovery
Interruptions to online transaction processing systems and enterprise resource planning services, whether caused by a failed application, system, or hardware component, can be costly and disruptive. The impact of service downtime continues to grow as companies move toward a real-time business model. Moreover, as companies become more connected and response times shorten, the cost of service downtime continues to increase. For these reasons, businesses invest large amounts of money in maintaining the RAS of servers in the IT infrastructure. The PCI ER feature can drastically reduce the system downtime caused by PCI errors.
Figure 1 and Figure 2 compare the system behavior when a PCI error occurs, without and with the error recovery feature.
Figure 1: Without PCI Error Recovery
Before the error: HP-UX server up and running. I/O Card 1 status: Available. I/O Card 2 status: Available. I/O Card 3 status: Available and has encountered PCI error, which causes an HPMC/MCA.
After the error: HP-UX server down. PCI bus is disconnected. I/O Card 1 status: Not available. I/O Card 2 status: Not available. I/O Card 3 status: Not available.
Figure 2: With PCI Error Recovery
When the error occurs: HP-UX server up and running. I/O Card 1 status: Available. I/O Card 2 status: Available. I/O Card 3 status: Available and has encountered PCI error.
Recovering I/O Card 3 using PCI Error Recovery: HP-UX server up and running. PCI bus isolated. I/O Card 1 status: Available. I/O Card 2 status: Available. I/O Card 3 status: Driver suspended. Card not available due to PCI error.
After successful card recovery: HP-UX server up and running. I/O Card 1 status: Available. I/O Card 2 status: Available. I/O Card 3 status: Driver resumed. Card available.
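The card states shown in Figure 1 and Figure 2 can be summarized in a short model. This is a sketch for illustration, not an HP-UX interface: the function `card_statuses` and its parameters are hypothetical, and the status strings are taken from the figure labels.

```python
from typing import Dict


def card_statuses(num_cards: int, failed_card: int,
                  error_recovery: bool, recovered: bool = False) -> Dict[int, str]:
    """Return the status of each I/O card after `failed_card` hits a PCI error.

    Hypothetical model of the two figures:
    - without error recovery, the error causes an HPMC/MCA, the server goes
      down, and every card becomes unavailable (Figure 1);
    - with error recovery, only the failed card is affected while its driver
      is suspended, and it becomes available again once recovery succeeds
      (Figure 2).
    """
    if not error_recovery:
        return {card: "Not available" for card in range(1, num_cards + 1)}

    statuses = {card: "Available" for card in range(1, num_cards + 1)}
    if recovered:
        statuses[failed_card] = "Driver resumed. Card available"
    else:
        statuses[failed_card] = "Driver suspended. Card not available due to PCI error"
    return statuses
```

Calling `card_statuses(3, 3, error_recovery=True)` leaves cards 1 and 2 available while card 3 is suspended, which is the middle stage of Figure 2. The same call with `error_recovery=False` marks all three cards as not available, as in Figure 1.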