not limited to, the implied warranties of merchantability and fitness for a particular purpose.
Hewlett-Packard shall not be liable for errors contained herein or for incidental or consequential
damages in connection with the furnishing, performance, or use of this material.
This document contains proprietary information, which is protected by copyright. No part of this
document may be photocopied, reproduced, or translated into another language without the prior
written consent of Hewlett-Packard. The information contained in this document is subject to change
without notice.
Microsoft, MS-DOS, Windows, and Windows NT are trademarks of Microsoft Corporation in the U.S.
and/or other countries.
All other product names mentioned herein may be trademarks of their respective companies.
Hewlett-Packard Company shall not be liable for technical or editorial errors or omissions contained
herein. The information is provided “as is” without warranty of any kind and is subject to change
without notice. The warranties for Hewlett-Packard Company products are set forth in the express
limited warranty statements accompanying such products. Nothing herein should be construed as
constituting an additional warranty.
Printed in the U.S.A.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Second Edition (August 2002)
Part Number: EK–G80TR–SA. B01
3–6 Template 12: Backup Battery Failure Event Sense Data Response Format
3–7 Template 13: Subsystem Built-In Self Test Failure Event Sense Data Response Format
3–8 Template 14: Memory System Failure Event Sense Data Response Format
3–9 Template 41: Device Services Non-Transfer Error Event Sense Data Response Format
NOTE: Text set off in this manner presents commentary, sidelights, or interesting points of
information.
Symbols on Equipment
Any enclosed surface or area of the equipment marked with these
symbols indicates the presence of electrical shock hazards. Enclosed
area contains no operator serviceable parts.
WARNING: To reduce the risk of injury from electrical shock hazards, do
not open this enclosure.
Any RJ-45 receptacle marked with these symbols indicates a network
interface connection.
WARNING: To reduce the risk of electrical shock, fire, or damage to the
equipment, do not plug telephone or telecommunications connectors into
this receptacle.
Any surface or area of the equipment marked with these symbols
indicates the presence of a hot surface or hot component. Contact with
this surface could result in injury.
WARNING: To reduce the risk of injury from a hot component, allow the
surface to cool before touching.
Power supplies or systems marked with these symbols indicate the
presence of multiple sources of power.
WARNING: To reduce the risk of injury from electrical shock,
remove all power cords to completely disconnect power from the
power supplies and systems.
WARNING: To reduce the risk of personal injury or damage to the equipment, be
sure that:
• The leveling jacks are extended to the floor.
• The full weight of the rack rests on the leveling jacks.
• In single rack installations, the stabilizing feet are attached to the rack.
• In multiple rack installations, the racks are coupled.
• Only one rack component is extended at any time. A rack may become
unstable if more than one rack component is extended for any reason.
Any product or assembly marked with these symbols indicates that the
component exceeds the recommended weight for one individual to
handle safely.
WARNING: To reduce the risk of personal injury or damage to the
equipment, observe local occupational health and safety requirements
and guidelines for manually handling material.
Getting Help
If you still have a question after reading this guide, contact service representatives or
visit our website.
StorageWorks Technical Support
In North America, call StorageWorks technical support at 1-800-OK-COMPAQ,
available 24 hours a day, 7 days a week.
NOTE: For continuous quality improvement, calls may be recorded or monitored.
Outside North America, call StorageWorks technical support at the nearest location.
Telephone numbers for worldwide technical support are listed on the StorageWorks
website: http://www.compaq.com
Be sure to have the following information available before calling:
• Technical support registration number (if applicable)
The StorageWorks website has the latest information on this product, as well as the
latest drivers. Access the StorageWorks website at: http://www.compaq.com/storage
From this website, select the appropriate product or solution.
StorageWorks Authorized Reseller
For the name of your nearest StorageWorks Authorized Reseller:
• In the United States, call 1-800-345-1518.
• In Canada, call 1-800-263-5868.
• Elsewhere, see the StorageWorks website for locations and telephone numbers.
This chapter provides guidelines for troubleshooting the controller, cache module, and
external cache battery (ECB). See enclosure documentation for information on
troubleshooting enclosure hardware, such as the power supplies, cooling fans, and
environmental monitoring unit (EMU).
Typical Installation Troubleshooting Checklist
The following checklist identifies many of the problems that occur in a typical
installation. After identifying a problem, use Table 1–1 to confirm the diagnosis and
fix the problem.
If an initial diagnosis points to several possible causes, use the tools described in this
chapter and then those in Chapter 2 to further refine the diagnosis. If a problem cannot
be diagnosed using the checklist and tools, contact a StorageWorks authorized service
provider for additional support.
To troubleshoot the controller and supporting modules, complete the following:
1. Check the power to the enclosure and enclosure components.
• Are power cords connected properly?
• Is power within specifications?
2. Check the component cables.
• Are bus cables to the controllers connected properly?
• For BA370 enclosures, are ECB cables connected properly?
3. Check each program card to make sure the card is fully seated.
4. Check the operator control panel (OCP) and devices for LED codes.
See “Flashing OCP Pattern Display Reporting” on page 1–13 and “Solid OCP
Pattern Display Reporting” on page 1–15, to interpret the LED codes.
5. Connect a local terminal to the controller and check the controller configuration
with the following command:
SHOW THIS_CONTROLLER FULL
Make sure that the ACS version loaded is correct and that pertinent patches are
installed. Also, check the status of the cache module and the supporting ECB.
In a dual redundant configuration, check the “other controller” with the following
command:
SHOW OTHER_CONTROLLER FULL
6. Use the fault management utility (FMU) to check for Last Failure or “memory
system failure” entries.
Display these entries and translate the Last Failure Codes they contain. See the
“Displaying Failure Entries” and “Translating Event Codes” sections in Chapter 2.
If the controller failed to the extent that the controller cannot support a local
terminal for FMU, check the host error log for the Instance or Last Failure Codes.
See Chapter 5 and Chapter 6 to interpret the event codes.
7. Check device status with the following command:
SHOW DEVICES FULL
Look for errors such as “misconfigured device” or “No device at this PTL.” If a
device reports misconfigured or missing, check the device status with the
following command:
SHOW device-name
8. Check storageset status with the following command:
SHOW STORAGESETS FULL
Make sure that all storagesets are normal (or normalizing if the storageset is a
RAIDset or mirrorset). Check again for misconfigured or missing devices using
step 7.
9. Check unit status with the following command:
SHOW UNITS FULL
Make sure that all units are available or online. If the controller reports a unit as
unavailable or offline, recheck the storageset the unit belongs to with the
following command:
SHOW storageset-name
If the controller reports that a unit has lost data or is unwriteable, recheck the
status of the devices that make up the storageset. If the devices are operating
normally, recheck the status of the cache module. If the unit reports a media
format error, recheck the status of the storageset and storageset devices.
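The SHOW commands in steps 5, 7, 8, and 9 can be collected into an ordered list, together with the follow-up checks that step 9 names for each unit condition. The sketch below is only an illustration of that ordering: the function names and the dictionary are hypothetical, and nothing here talks to a controller.

```python
from typing import Dict, List


def checklist_commands(dual_redundant: bool) -> List[str]:
    """Return the checklist's SHOW commands in the order the steps list them."""
    commands = ["SHOW THIS_CONTROLLER FULL"]            # step 5
    if dual_redundant:
        commands.append("SHOW OTHER_CONTROLLER FULL")   # step 5, dual-redundant only
    commands += [
        "SHOW DEVICES FULL",       # step 7
        "SHOW STORAGESETS FULL",   # step 8
        "SHOW UNITS FULL",         # step 9
    ]
    return commands


# Follow-up checks from step 9, keyed by the condition the controller reports.
UNIT_FOLLOW_UP: Dict[str, List[str]] = {
    "unavailable": ["storageset the unit belongs to"],
    "offline": ["storageset the unit belongs to"],
    "lost data": ["devices in the storageset", "cache module"],
    "unwriteable": ["devices in the storageset", "cache module"],
    "media format error": ["storageset", "storageset devices"],
}


def follow_up(condition: str) -> List[str]:
    """Return what to recheck, in order, for a reported unit condition."""
    return UNIT_FOLLOW_UP.get(condition.strip().lower(), [])


if __name__ == "__main__":
    for command in checklist_commands(dual_redundant=True):
        print(command)
    print(follow_up("lost data"))
```

Step 6 (the FMU check) is left out because it is not a single SHOW command at the controller prompt.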
Troubleshooting Table
After diagnosing a problem, use Table 1–1 to resolve the problem.
Table 1–1: Troubleshooting Guidelines (Sheet 1 of 10)

Symptom: Reset button not lit.
  Possible Cause: No power to subsystem.
    Investigation: Check power to subsystem and power supplies on controller enclosure.
    Remedy: Replace cord or (BA370 enclosure only) AC input box.
    Investigation: BA370 enclosure only: Make sure that all cooling fans are installed. If one or more fans are missing or all are inoperative for more than 8 minutes, the EMU shuts down the subsystem.
    Remedy: Turn off power switch on AC input box. Replace cooling fan. Restore power to subsystem.
    Investigation: BA370 enclosure only: Determine if the standby power switch on the PVA was pressed for more than 5 seconds.
  Possible Cause: Failed controller.
    Investigation: If the previous remedies fail to resolve the problem, check OCP LED codes.

Symptom: Reset button lit steadily; other LEDs also lit.
  Possible Cause: Various.
    Investigation: See OCP LED Codes.
    Remedy: Follow repair action
Table 1–1: Troubleshooting Guidelines (Sheet 3 of 10)

  Possible Cause: Node ID is all zeros.
    Investigation: SHOW_THIS to see if node ID is all zeros.
    Remedy: Set node ID using the node ID (bar code) that is located on the frame in which the controller sits. See SET THIS_CONTROLLER NODE_ID in the controller CLI reference guide. Also, be sure to copy in the right direction. If cabled to the new controller, use SET FAILOVER COPY=OTHER_CONTROLLER. If cabled to the old controller, use SET FAILOVER COPY=THIS_CONTROLLER.

Symptom: Nonmirrored cache: controller reports failed DIMM in Cache A or B.
  Possible Cause: Improperly installed DIMM.
    Investigation: Remove cache module and make sure that the DIMM is fully seated in the slot.
    Remedy: Reseat DIMM.
  Possible Cause: Failed DIMM.
    Investigation: If the previous remedy fails to resolve the problem, check for OCP LED codes.
    Remedy: Replace DIMM.

Symptom: Mirrored cache: “this controller” reports DIMM 1 or 2 failed in Cache A or B.
  Possible Cause: Improperly installed DIMM in “this controller” cache module.
    Investigation: Remove cache module and make sure that DIMMs are installed properly.
  Possible Cause: Failed DIMM in “this controller” cache module.
    Investigation: If the previous remedy fails to resolve the problem, check for OCP LED codes.
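The copy-direction rule in the node ID remedy (cabled to the new controller: copy from OTHER_CONTROLLER; cabled to the old controller: copy from THIS_CONTROLLER) can be written as a small lookup. This is a minimal sketch; the function name and the "new"/"old" labels are hypothetical, and only the two SET FAILOVER command strings come from the table.

```python
def failover_copy_command(terminal_cabled_to: str) -> str:
    """Return the SET FAILOVER command for the controller the terminal is cabled to.

    terminal_cabled_to is "new" (the replacement controller) or "old"
    (the controller that already holds the configuration).
    """
    copy_source = {
        "new": "OTHER_CONTROLLER",  # on the new controller, copy from the other one
        "old": "THIS_CONTROLLER",   # on the old controller, copy from this one
    }
    key = terminal_cabled_to.strip().lower()
    if key not in copy_source:
        raise ValueError("terminal_cabled_to must be 'new' or 'old'")
    return f"SET FAILOVER COPY={copy_source[key]}"


if __name__ == "__main__":
    print(failover_copy_command("new"))
    print(failover_copy_command("old"))
```

In both cases the configuration is copied from the old controller; only the name used for it changes with where the terminal is connected.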
Table 1–1: Troubleshooting Guidelines (Sheet 4 of 10)

Symptom: Mirrored cache: “this controller” reports DIMM 3 or 4 failed in Cache A or B.
  Possible Cause: Improperly installed DIMM in “other controller” cache module.
    Investigation: Remove cache module and make sure that the DIMMs are installed properly.
    Remedy: Reseat DIMM.
  Possible Cause: Failed DIMM in “other controller” cache module.
    Investigation: If the previous remedy fails to resolve the problem, check for OCP LED codes.
    Remedy: Replace DIMM in “other controller” cache module.

Symptom: Mirrored cache: controller reports battery not present.
  Possible Cause: Memory module was installed before the cache module was connected to an ECB.
    Investigation: BA370 enclosure: ECB cable not connected to cache module.
    Remedy: BA370 enclosure: Connect ECB cable to cache module, then restart both controllers by pushing their reset buttons simultaneously.
    Investigation: Model 2200 enclosure: ECB not installed or seated properly in backplane.
    Remedy: Model 2200 enclosure: install or reseat ECB.

Symptom: Mirrored cache: controller reports cache or mirrored cache has failed.
  Possible Cause: Primary data and the mirrored copy data are not identical.
    Investigation: SHOW THIS_CONTROLLER indicates that the cache or mirrored cache has failed. Spontaneous FMU message displays: “Primary cache declared failed - data inconsistent with mirror,” or “Mirrored cache declared failed - data inconsistent
    Remedy: Enter the SHUTDOWN command on controllers that report the problem. (This command flushes the cache contents to synchronize the primary and mirrored data.) Restart the controllers that were shut down.
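For mirrored cache, the two DIMM rows of Table 1–1 tie the DIMM number that "this controller" reports to the cache module that holds the DIMM. The sketch below restates that mapping; the function name is hypothetical and the numbering is taken only from those two rows.

```python
def failed_dimm_cache_module(dimm_number: int) -> str:
    """Return which controller's cache module holds the reported DIMM.

    Mirrored-cache case only: "this controller" reports DIMMs 1 and 2 from
    its own cache module and DIMMs 3 and 4 from the other controller's.
    """
    if dimm_number in (1, 2):
        return "this controller"
    if dimm_number in (3, 4):
        return "other controller"
    raise ValueError("mirrored-cache DIMM numbers run from 1 to 4")


if __name__ == "__main__":
    for number in (1, 2, 3, 4):
        print(number, "->", failed_dimm_cache_module(number), "cache module")
```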
Table 1–1: Troubleshooting Guidelines (Sheet 5 of 10)

Symptom: Invalid cache.
  Possible Cause: Mirrored-cache mode discrepancy. This discrepancy might occur after installing a new controller. The existing cache module is set for mirrored caching, but the new controller is set for unmirrored caching. This discrepancy might also occur if the new controller is set for mirrored
    Investigation: SHOW THIS_CONTROLLER indicates “invalid cache.” Spontaneous FMU message displays: “Cache modules inconsistent with mirror mode.”
Table 1–1: Troubleshooting Guidelines (Sheet 6 of 10)

  Possible Cause: Cache module might erroneously contain unflushed write-back data. This might occur after installing a new controller. The existing cache module might indicate that the cache module contains unflushed write-back data, but the new controller expects to find no data in the existing cache module. This error might also occur if installing a new cache module for a controller that expects write-back data in the cache.
    Investigation: SHOW THIS_CONTROLLER indicates “invalid cache.” No spontaneous FMU message.
    Remedy: Connect a terminal to the maintenance port on the controller reporting the error, and clear the error with the following command, all on one line: CLEAR_ERRORS THIS_CONTROLLER INVALID_CACHE DESTROY_UNFLUSHED_DATA. See the controller CLI reference guide for more information.
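The two "invalid cache" causes in Table 1–1 are told apart by whether the spontaneous FMU message about mirror mode appears. The following sketch expresses that distinction; the function name and the idea of passing captured terminal messages as a list are assumptions made for illustration, while the message text and the CLEAR_ERRORS command are quoted from the table.

```python
from typing import Iterable

MIRROR_MODE_MESSAGE = "Cache modules inconsistent with mirror mode."

# Remedy command for the unflushed write-back data case, entered on one line.
# It discards the unflushed data, so this sketch only returns the string.
CLEAR_INVALID_CACHE = (
    "CLEAR_ERRORS THIS_CONTROLLER INVALID_CACHE DESTROY_UNFLUSHED_DATA"
)


def classify_invalid_cache(fmu_messages: Iterable[str]) -> str:
    """Name the possible cause of an "invalid cache" report.

    fmu_messages holds any spontaneous FMU messages seen on the terminal.
    """
    if any(MIRROR_MODE_MESSAGE in message for message in fmu_messages):
        return "mirrored-cache mode discrepancy"
    return "unflushed write-back data"


if __name__ == "__main__":
    print(classify_invalid_cache([MIRROR_MODE_MESSAGE]))
    print(classify_invalid_cache([]))
    print(CLEAR_INVALID_CACHE)
```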
Table 1–1: Troubleshooting Guidelines (Sheet 10 of 10)

Symptom: Host log file or maintenance terminal indicates that a forced error occurred when the controller was reconstructing a RAIDset or mirrorset.
  Possible Cause: Unrecoverable read errors might have occurred when the controller was reconstructing the storageset. Errors occur if another member fails while the controller is reconstructing the storageset.
    Investigation: Conduct a read scan of the storageset using the appropriate utility from the host operating system, such as the “dd” utility for a TRU64 UNIX host.
    Remedy: Rebuild the storageset, then restore storageset data from a backup source. While the controller is reconstructing the storageset, monitor the host error log activity or spontaneous event reports on the maintenance terminal for any unrecoverable errors. If unrecoverable errors persist, note the device on which they occurred, and replace the device before proceeding.
  Possible Cause: Host requested data from a normalizing storageset that did not contain the data.
    Investigation: Use the SHOW storageset-name command to see if all storageset members are “normal.”
    Remedy: Wait for normalizing members to become normal, then resume I/O to them.
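A read scan simply reads every block of the unit from the host and notes where reads fail. The sketch below is a Python stand-in for a dd-style scan on a POSIX host, not the utility the guide refers to; the function name and the default block size are assumptions, and the path passed in would be whatever device file the host assigns to the unit.

```python
import os
from typing import List


def read_scan(path: str, block_size: int = 64 * 1024) -> List[int]:
    """Read `path` sequentially and return the byte offsets of failed reads."""
    bad_offsets: List[int] = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # total bytes to scan
        offset = 0
        while offset < size:
            length = min(block_size, size - offset)
            try:
                os.pread(fd, length, offset)  # data is discarded; only errors matter
            except OSError:
                bad_offsets.append(offset)    # note the block and keep scanning
            offset += block_size
    finally:
        os.close(fd)
    return bad_offsets


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        print(read_scan(sys.argv[1]))
```

An empty result means every block was readable; short reads are not treated as errors in this sketch.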
Significant Event Reporting
Controller fault management software reports information about significant events
that occur. These events are reported by:
Some events cause controller operation to halt; others allow the controller to remain
operable. Both types of events are detailed in the following sections.
Reporting Events That Cause Controller Operation to Halt
Events that cause the controller to halt operations are reported in three possible ways:
• a FLASHING OCP pattern display
• a SOLID OCP pattern display
• Last Failure reporting
Use Table 1–2 to interpret FLASHING OCP patterns and Table 1–3 to interpret SOLID (ON)
OCP patterns. In the Error column of the solid OCP patterns, there are two separate
descriptions. The first denotes the actual error message that appears on the terminal,
and the second provides a more detailed explanation of the designated error.
Use the following legend to interpret both tables as indicated:
■ = reset button FLASHING (in Table 1–2) or ON (in Table 1–3)
□ = reset button OFF
● = LED FLASHING (in Table 1–2) or ON (in Table 1–3)
○ = LED OFF
NOTE: If the reset button is FLASHING and an LED is ON, either the devices on the bus that
corresponds to the LED do not match the controller configuration, or an error occurred in one of
the devices on that bus.
Also, a single LED that is turned ON indicates a failure of the drive on that bus.
Flashing OCP Pattern Display Reporting
Certain events can cause a FLASHING display of the OCP LEDs. Each event and the
resulting pattern are described in Table 1–2.
IMPORTANT: Remember that a solid black pattern represents a FLASHING display. A white
pattern indicates OFF.
All LEDs FLASH at the same time and at the same rate.
Table 1–2: FLASHING OCP Pattern Displays and Repair Actions (Sheet 1 of 3)

OCP Pattern: ■○○○○○●
Code: 1
Error: Program card EDC error.
Repair Action: Replace program card.
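In the pattern rows reproduced in this chapter, the six LEDs that follow the reset button read left to right as a binary number equal to the value in the Code column. The guide does not state this as a rule; it is an observation from the listed rows, so treat the sketch below as a cross-check rather than a definition. The function name is hypothetical, and the letters n, o, l, and m are accepted as stand-ins for the legend symbols because plain-text copies of the tables show the patterns that way.

```python
from typing import Tuple

RESET_LIT = {"■", "n"}    # reset button FLASHING or ON
RESET_UNLIT = {"□", "o"}  # reset button OFF
LED_LIT = {"●", "l"}      # LED FLASHING or ON
LED_UNLIT = {"○", "m"}    # LED OFF


def decode_ocp_pattern(pattern: str) -> Tuple[bool, int]:
    """Split a 7-symbol OCP pattern into (reset_button_lit, code)."""
    if len(pattern) != 7:
        raise ValueError("expected one reset-button symbol and six LED symbols")
    reset, leds = pattern[0], pattern[1:]
    if reset in RESET_LIT:
        reset_lit = True
    elif reset in RESET_UNLIT:
        reset_lit = False
    else:
        raise ValueError(f"unknown reset-button symbol: {reset!r}")
    code = 0
    for symbol in leds:
        if symbol in LED_LIT:
            bit = 1
        elif symbol in LED_UNLIT:
            bit = 0
        else:
            raise ValueError(f"unknown LED symbol: {symbol!r}")
        code = (code << 1) | bit  # leftmost LED is the most significant bit
    return reset_lit, code


if __name__ == "__main__":
    lit, code = decode_ocp_pattern("■○○○○○●")
    print(lit, format(code, "X"))
```

Formatting the result in hexadecimal gives the form used in the Code column.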
Information related to the solid OCP patterns is automatically displayed on the
maintenance terminal (unless disabled with the FMU) using %FLL formatting, as
detailed in the following examples:
%FLL--HSG> --13-MAY-2001 04:32:26 (time not set)-- OCP Code: 26
Memory module is missing.
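When terminal output is captured to a log, a report like the one above can be split into its fields. The sketch below is based only on this single example, so the regular expression and field names are assumptions rather than a definition of %FLL formatting; it also tolerates the report being wrapped across lines.

```python
import re
from typing import Dict, Optional

# Pattern derived from the one example report; other reports may differ.
FLL_HEADER = re.compile(
    r"%FLL--(?P<prompt>\S+)\s+"
    r"--(?P<timestamp>\d{1,2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2})"
    r"(?:\s+\((?P<note>[^)]*)\))?--\s+"
    r"OCP\s+Code:\s*(?P<code>[0-9A-Fa-f]+)"
)


def parse_fll(report: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the fields of a %FLL report, or None if it does not match."""
    text = " ".join(report.split())  # undo line wrapping
    match = FLL_HEADER.match(text)
    if match is None:
        return None
    fields: Dict[str, Optional[str]] = dict(match.groupdict())
    fields["description"] = text[match.end():].strip()
    return fields


if __name__ == "__main__":
    sample = (
        "%FLL--HSG> --13-MAY-2001 04:32:26 (time not set)-- OCP\n"
        "Code: 26\n"
        "Memory module is missing.\n"
    )
    print(parse_fll(sample))
```

The code is kept as a string so that it can be compared directly with the Code column of Table 1–3.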
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 1 of 6)

OCP Pattern: □○○○○○○
Code: 0
Error: Catastrophic controller or power failure.
Repair Action: Check power. If good, reset controller. If problem persists, reseat controller module and reset controller. If problem is still evident, replace controller module.

OCP Pattern: ■○○○○○○
Code: 0
Error: No program card detected or kill asserted by other controller. Controller unable to read program card.
Repair Action: Make sure that the program card is properly seated while resetting the controller. If the error persists, try the card with another controller; or replace the card. Otherwise, replace the controller that reported the error.

OCP Pattern: ■●○○●○●
Code: 25
Error: Recursive Bugcheck detected. The same bugcheck has occurred three times within 10 minutes, and controller operation has halted.
Repair Action: Reset the controller. If this fault pattern is displayed repeatedly, follow the repair actions associated with the Last Failure code that is repeatedly terminating controller execution.
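The condition behind code 25 (the same bugcheck three times within 10 minutes) is a sliding-window count. The sketch below models the rule as the table states it; it is not the controller's firmware logic, and the class name, the use of seconds for timestamps, and the sample codes are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class BugcheckWindow:
    """Track repeats of the same code inside a fixed time window."""

    def __init__(self, limit: int = 3, window_seconds: float = 600.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, code: str, timestamp: float) -> bool:
        """Record one occurrence; return True once `limit` fall in the window.

        Timestamps are seconds and are assumed to arrive in increasing order.
        """
        times = self._seen[code]
        times.append(timestamp)
        # Drop occurrences that are older than the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) >= self.limit


if __name__ == "__main__":
    window = BugcheckWindow()
    for moment in (0, 300, 590):
        print(moment, window.record("example-code", moment))
```

Each code is tracked separately, so different bugchecks occurring close together do not trip the limit.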
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 3 of 6)

OCP Pattern: ■●○●●○○
Code: 2C
Error: Enclosure I/O termination power out of range. Faulty or missing I/O module causes enclosure I/O termination power to be out of range.
Repair Action: Make sure that all of the enclosure device SCSI buses have an I/O module. If problem persists, replace the failed I/O module.

OCP Pattern: ■●○●●○●
Code: 2D
Error: Master enclosure SCSI buses are not all set to ID 0.
Repair Action: Set the PVA ID to 0 for the enclosure with the controllers. If the problem persists, try the following repair actions:
1. Replace the PVA module.
2. Replace the EMU.
3. Remove all devices.
4. Replace the enclosure.

OCP Pattern: ■●○●●●○
Code: 2E
Error: Multiple enclosures have the same SCSI ID. More than one enclosure has the same SCSI ID.
Repair Action: Reconfigure the PVA ID to uniquely identify each enclosure in the subsystem. The enclosure with the controllers must be set to PVA ID 0; additional enclosures must use PVA IDs 2 and 3. If the error continues after PVA settings are unique, replace each PVA module one at a time. Check the enclosure if the problem remains.
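The PVA ID rules in the repair actions for codes 2D and 2E (ID 0 for the enclosure with the controllers, IDs 2 and 3 for additional enclosures, no duplicates) can be checked before powering up. This is a minimal sketch of those three rules; the function name and the way the IDs are passed in are assumptions.

```python
from typing import List, Sequence


def check_pva_ids(controller_enclosure_id: int,
                  additional_enclosure_ids: Sequence[int]) -> List[str]:
    """Return a list of PVA ID problems; an empty list means none were found."""
    problems: List[str] = []
    if controller_enclosure_id != 0:
        problems.append("enclosure with the controllers must use PVA ID 0")
    for pva_id in additional_enclosure_ids:
        if pva_id not in (2, 3):
            problems.append(
                f"additional enclosure uses PVA ID {pva_id}; expected 2 or 3"
            )
    all_ids = [controller_enclosure_id, *additional_enclosure_ids]
    for pva_id in sorted({i for i in all_ids if all_ids.count(i) > 1}):
        problems.append(f"PVA ID {pva_id} is used by more than one enclosure")
    return problems


if __name__ == "__main__":
    print(check_pva_ids(0, [2, 3]))
    print(check_pva_ids(0, [2, 2]))
```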