not limited to, the implied warranties of merchantability and fitness for a particular purpose.
Hewlett-Packard shall not be liable for errors contained herein or for incidental or consequential
damages in connection with the furnishing, performance, or use of this material.
This document contains proprietary information, which is protected by copyright. No part of this
document may be photocopied, reproduced, or translated into another language without the prior
written consent of Hewlett-Packard. The information contained in this document is subject to change
without notice.
Microsoft, MS-DOS, Windows, and Windows NT are trademarks of Microsoft Corporation in the U.S.
and/or other countries.
All other product names mentioned herein may be trademarks of their respective companies.
Hewlett-Packard Company shall not be liable for technical or editorial errors or omissions contained
herein. The information is provided “as is” without warranty of any kind and is subject to change
without notice. The warranties for Hewlett-Packard Company products are set forth in the express
limited warranty statements accompanying such products. Nothing herein should be construed as
constituting an additional warranty.
Printed in the U.S.A.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Second Edition (August 2002)
Part Number: EK–G80TR–SA. B01
NOTE: Text set off in this manner presents commentary, sidelights, or interesting points of
information.
Symbols on Equipment
Any enclosed surface or area of the equipment marked with these
symbols indicates the presence of electrical shock hazards. Enclosed
area contains no operator serviceable parts.
WARNING: To reduce the risk of injury from electrical shock hazards, do
not open this enclosure.
Any RJ-45 receptacle marked with these symbols indicates a network
interface connection.
WARNING: To reduce the risk of electrical shock, fire, or damage to the
equipment, do not plug telephone or telecommunications connectors into
this receptacle.
Any surface or area of the equipment marked with these symbols
indicates the presence of a hot surface or hot component. Contact with
this surface could result in injury.
WARNING: To reduce the risk of injury from a hot component, allow the
surface to cool before touching.
Power supplies or systems marked with these symbols indicate the
presence of multiple sources of power.
WARNING: To reduce the risk of injury from electrical shock,
remove all power cords to completely disconnect power from the
power supplies and systems.
WARNING: To reduce the risk of personal injury or damage to the equipment, be
sure that:
• The leveling jacks are extended to the floor.
• The full weight of the rack rests on the leveling jacks.
• In single rack installations, the stabilizing feet are attached to the rack.
• In multiple rack installations, the racks are coupled.
• Only one rack component is extended at any time. A rack may become
unstable if more than one rack component is extended for any reason.
About this Guide
Any product or assembly marked with these symbols indicates that the
component exceeds the recommended weight for one individual to
handle safely.
WARNING: To reduce the risk of personal injury or damage to the
equipment, observe local occupational health and safety requirements
and guidelines for manually handling material.
Getting Help
If you still have a question after reading this guide, contact service representatives or
visit our website.
StorageWorks Technical Support
In North America, call StorageWorks technical support at 1-800-OK-COMPAQ,
available 24 hours a day, 7 days a week.
NOTE: For continuous quality improvement, calls may be recorded or monitored.
Outside North America, call StorageWorks technical support at the nearest location.
Telephone numbers for worldwide technical support are listed on the StorageWorks
website: http://www.compaq.com
Be sure to have the following information available before calling:
•Technical support registration number (if applicable)
The StorageWorks website has the latest information on this product, as well as the
latest drivers. Access the StorageWorks website at: http://www.compaq.com/storage
From this website, select the appropriate product or solution.
StorageWorks Authorized Reseller
For the name of your nearest StorageWorks Authorized Reseller:
• In the United States, call 1-800-345-1518.
• In Canada, call 1-800-263-5868.
• Elsewhere, see the StorageWorks website for locations and telephone numbers.
This chapter provides guidelines for troubleshooting the controller, cache module, and
external cache battery (ECB). See enclosure documentation for information on
troubleshooting enclosure hardware, such as the power supplies, cooling fans, and
environmental monitoring unit (EMU).
Typical Installation Troubleshooting Checklist
The following checklist identifies many of the problems that occur in a typical
installation. After identifying a problem, use Table 1–1 to confirm the diagnosis and
fix the problem.
If an initial diagnosis points to several possible causes, use the tools described in this
chapter and then those in Chapter 2 to further refine the diagnosis. If a problem cannot
be diagnosed using the checklist and tools, contact a StorageWorks authorized service
provider for additional support.
To troubleshoot the controller and supporting modules, complete the following:
1. Check the power to the enclosure and enclosure components.
• Are power cords connected properly?
• Is power within specifications?
2. Check the component cables.
• Are bus cables to the controllers connected properly?
• For BA370 enclosures, are ECB cables connected properly?
3. Check each program card to make sure the card is fully seated.
4. Check the operator control panel (OCP) and devices for LED codes.
See “Flashing OCP Pattern Display Reporting” on page 1–13 and “Solid OCP
Pattern Display Reporting” on page 1–15, to interpret the LED codes.
5. Connect a local terminal to the controller and check the controller configuration
with the following command:
SHOW THIS_CONTROLLER FULL
Make sure that the ACS version loaded is correct and that pertinent patches are
installed. Also, check the status of the cache module and the supporting ECB.
In a dual redundant configuration, check the “other controller” with the following
command:
SHOW OTHER_CONTROLLER FULL
6. Use the fault management utility (FMU) to check for Last Failure or “memory
system failure” entries.
Show these codes and translate the Last Failure Codes they contain. See
Chapter 2, “Displaying Failure Entries” and “Translating Event Codes” sections.
If the controller failed to the extent that the controller cannot support a local
terminal for FMU, check the host error log for the Instance or Last Failure Codes.
See Chapter 5 and Chapter 6 to interpret the event codes.
7. Check device status with the following command:
SHOW DEVICES FULL
Look for errors such as “misconfigured device” or “No device at this PTL.” If a
device reports misconfigured or missing, check the device status with the
following command:
SHOW device-name
8. Check storageset status with the following command:
SHOW STORAGESETS FULL
Make sure that all storagesets are normal (or normalizing if the storageset is a
RAIDset or mirrorset). Check again for misconfigured or missing devices using
step 7.
9. Check unit status with the following command:
SHOW UNITS FULL
Make sure that all units are available or online. If the controller reports a unit as
unavailable or offline, recheck the storageset the unit belongs to with the
following command:
SHOW storageset-name
If the controller reports that a unit has lost data or is unwriteable, recheck the
status of the devices that make up the storageset. If the devices are operating
normally, recheck the status of the cache module. If the unit reports a media
format error, recheck the status of the storageset and storageset devices.
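The SHOW commands in steps 5 through 9 are always issued in the same order, so they can be collected into a simple run list. The following Python sketch only builds that list; it does not talk to a controller. The function name and the device name used in the usage line are hypothetical.

```python
def checklist_commands(dual_redundant=False, suspect_devices=()):
    """Return the SHOW commands from checklist steps 5, 7, 8, and 9, in order.

    suspect_devices holds names of devices that SHOW DEVICES FULL reported
    as misconfigured or missing (step 7).
    """
    commands = ["SHOW THIS_CONTROLLER FULL"]            # step 5
    if dual_redundant:
        commands.append("SHOW OTHER_CONTROLLER FULL")   # step 5, dual-redundant only
    commands.append("SHOW DEVICES FULL")                # step 7
    commands.extend(f"SHOW {name}" for name in suspect_devices)
    commands.append("SHOW STORAGESETS FULL")            # step 8
    commands.append("SHOW UNITS FULL")                  # step 9
    return commands


# "DISK10000" is a made-up device name used only for illustration.
for command in checklist_commands(dual_redundant=True, suspect_devices=["DISK10000"]):
    print(command)
```

Step 6 (the FMU check) is left out because it is interactive and depends on the entries that FMU displays.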
Troubleshooting Table
After diagnosing a problem, use Table 1–1 to resolve the problem.
Table 1–1: Troubleshooting Guidelines (Sheet 1 of 10)
Symptom: Reset button not lit.
  Possible cause: No power to subsystem.
    Investigation: Check power to subsystem and power supplies on controller enclosure.
    Remedy: Replace cord or (BA370 enclosure only) AC input box.
    Investigation: BA370 enclosure only: Make sure that all cooling fans are installed. If one or more fans are missing or all are inoperative for more than 8 minutes, the EMU shuts down the subsystem.
    Remedy: Turn off power switch on AC input box. Replace cooling fan. Restore power to subsystem.
    Investigation: BA370 enclosure only: Determine if the standby power switch on the PVA was pressed for more than 5 seconds.
  Possible cause: Failed controller.
    Investigation: If the previous remedies fail to resolve the problem, check OCP LED codes.
Symptom: Reset button lit steadily; other LEDs also lit.
  Possible cause: Various.
    Investigation: See OCP LED Codes.
    Remedy: Follow repair action
Table 1–1: Troubleshooting Guidelines (Sheet 3 of 10)
  Possible cause: Node ID is all zeros.
    Investigation: SHOW_THIS to see if node ID is all zeros.
    Remedy: Set node ID using the node ID (bar code) that is located on the frame in which the controller sits. See SET THIS_CONTROLLER NODE_ID in the controller CLI reference guide. Also, be sure to copy in the right direction. If cabled to the new controller, use SET FAILOVER COPY=OTHER_CONTROLLER. If cabled to the old controller, use SET FAILOVER COPY=THIS_CONTROLLER.
Symptom: Nonmirrored cache: controller reports failed DIMM in Cache A or B.
  Possible cause: Improperly installed DIMM.
    Investigation: Remove cache module and make sure that the DIMM is fully seated in the slot.
    Remedy: Reseat DIMM.
  Possible cause: Failed DIMM.
    Investigation: If the previous remedy fails to resolve the problem, check for OCP LED codes.
    Remedy: Replace DIMM.
Symptom: Mirrored cache: “this controller” reports DIMM 1 or 2 failed in Cache A or B.
  Possible cause: Improperly installed DIMM in “this controller” cache module.
    Investigation: Remove cache module and make sure that DIMMs are installed properly.
  Possible cause: Failed DIMM in “this controller” cache module.
    Investigation: If the previous remedy fails to resolve the problem, check for OCP LED codes.
Table 1–1: Troubleshooting Guidelines (Sheet 4 of 10)
Symptom: Mirrored cache: “this controller” reports DIMM 3 or 4 failed in Cache A or B.
  Possible cause: Improperly installed DIMM in “other controller” cache module.
    Investigation: Remove cache module and make sure that the DIMMs are installed properly.
    Remedy: Reseat DIMM.
  Possible cause: Failed DIMM in “other controller” cache module.
    Investigation: If the previous remedy fails to resolve the problem, check for OCP LED codes.
    Remedy: Replace DIMM in “other controller” cache module.
Symptom: Mirrored cache: controller reports battery not present.
  Possible cause: Memory module was installed before the cache module was connected to an ECB.
    Investigation: BA370 enclosure: ECB cable not connected to cache module. Model 2200 enclosure: ECB not installed or seated properly in backplane.
    Remedy: BA370 enclosure: Connect ECB cable to cache module, then restart both controllers by pushing their reset buttons simultaneously. Model 2200 enclosure: install or reseat ECB.
Symptom: Mirrored cache: controller reports cache or mirrored cache has failed.
  Possible cause: Primary data and the mirrored copy data are not identical.
    Investigation: SHOW THIS_CONTROLLER indicates that the cache or mirrored cache has failed. Spontaneous FMU message displays: “Primary cache declared failed - data inconsistent with mirror,” or “Mirrored cache declared failed - data inconsistent [...]
    Remedy: Enter the SHUTDOWN command on controllers that report the problem. (This command flushes the cache contents to synchronize the primary and mirrored data.) Restart the controllers that were shut down.
Table 1–1: Troubleshooting Guidelines (Sheet 5 of 10)
Symptom: Invalid cache.
  Possible cause: Mirrored-cache mode discrepancy. This discrepancy might occur after installing a new controller. The existing cache module is set for mirrored caching, but the new controller is set for unmirrored caching. This discrepancy might also occur if the new controller is set for mirrored [...]
    Investigation: SHOW THIS_CONTROLLER indicates “invalid cache.” Spontaneous FMU message displays: “Cache modules inconsistent with mirror mode.”
Table 1–1: Troubleshooting Guidelines (Sheet 6 of 10)
  Possible cause: Cache module might erroneously contain unflushed write-back data. This might occur after installing a new controller. The existing cache module might indicate that the cache module contains unflushed write-back data, but the new controller expects to find no data in the existing cache module. This error might also occur if installing a new cache module for a controller that expects write-back data in the cache.
    Investigation: SHOW THIS_CONTROLLER indicates “invalid cache.” No spontaneous FMU message.
    Remedy: Connect a terminal to the maintenance port on the controller reporting the error, and clear the error with the following command, all on one line:
    CLEAR_ERRORS THIS_CONTROLLER INVALID_CACHE DESTROY_UNFLUSHED_DATA
    See the controller CLI reference guide for more information.
Table 1–1: Troubleshooting Guidelines (Sheet 10 of 10)
Symptom: Host log file or maintenance terminal indicates that a forced error occurred when the controller was reconstructing a RAIDset or mirrorset.
  Possible cause: Unrecoverable read errors might have occurred when the controller was reconstructing the storageset. Errors occur if another member fails while the controller is reconstructing the storageset.
    Investigation: Conduct a read scan of the storageset using the appropriate utility from the host operating system, such as the “dd” utility for a TRU64 UNIX host.
    Remedy: Rebuild the storageset, then restore storageset data from a backup source. While the controller is reconstructing the storageset, monitor the host error log activity or spontaneous event reports on the maintenance terminal for any unrecoverable errors. If unrecoverable errors persist, note the device on which they occurred, and replace the device before proceeding.
  Possible cause: Host requested data from a normalizing storageset that did not contain the data.
    Investigation: Use the SHOW storageset-name command to see if all storageset members are “normal.”
    Remedy: Wait for normalizing members to become normal, then resume I/O to them.
Significant Event Reporting
Controller fault management software reports information about significant events
that occur. These events are reported by:
Some events cause controller operation to halt; others allow the controller to remain
operable. Both types of events are detailed in the following sections.
Reporting Events That Cause Controller Operation to Halt
Events that cause the controller to halt operations are reported in three possible ways:
• a FLASHING OCP pattern display
• a SOLID OCP pattern display
• Last Failure reporting
Use Table 1–2 to interpret FLASHING OCP patterns and Table 1–3 to interpret SOLID (ON)
OCP patterns. In the Error column of the solid OCP patterns, there are two separate
descriptions. The first denotes the actual error message that appears on the terminal,
and the second provides a more detailed explanation of the designated error.
Use the following legend to interpret both tables as indicated:
■ = reset button FLASHING (in Table 1–2) or ON (in Table 1–3)
❏ = reset button OFF
● = LED FLASHING (in Table 1–2) or ON (in Table 1–3)
❍ = LED OFF
NOTE: If the reset button is FLASHING and an LED is ON, either the devices on the bus that
corresponds to the LED do not match the controller configuration, or an error occurred in one of
the devices on that bus.
Also, a single LED that is turned ON indicates a failure of the drive on that bus.
Flashing OCP Pattern Display Reporting
Certain events can cause a FLASHING display of the OCP LEDs. Each event and the
resulting pattern are described in Table 1–2.
IMPORTANT: Remember that a solid black pattern represents a FLASHING display. A white
pattern indicates OFF.
All LEDs FLASH at the same time and at the same rate.
Table 1–2: FLASHING OCP Pattern Displays and Repair Actions (Sheet 1 of 3)
OCP Pattern: ■❍❍❍❍❍●
  Code: 1
  Error: Program card EDC error.
  Repair Action: Replace program card.
Information related to the solid OCP patterns is automatically displayed on the
maintenance terminal (unless disabled with the FMU) using %FLL formatting, as
detailed in the following examples:
%FLL--HSG> --13-MAY-2001 04:32:26 (time not set)-- OCP Code: 26
Memory module is missing.
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 1 of 6)
OCP Pattern: ❏❍❍❍❍❍❍
  Code: 0
  Error: Catastrophic controller or power failure.
  Repair Action: Check power. If good, reset controller. If problem persists, reseat controller module and reset controller. If problem is still evident, replace controller module.
OCP Pattern: ■❍❍❍❍❍❍
  Code: 0
  Error: No program card detected or kill asserted by other controller. Controller unable to read program card.
  Repair Action: Make sure that the program card is properly seated while resetting the controller. If the error persists, try the card with another controller; or replace the card. Otherwise, replace the controller that reported the error.
OCP Pattern: ■●❍❍●❍●
  Code: 25
  Error: Recursive Bugcheck detected. The same bugcheck has occurred three times within 10 minutes, and controller operation has halted.
  Repair Action: Reset the controller. If this fault pattern is displayed repeatedly, follow the repair actions associated with the Last Failure code that is repeatedly terminating controller execution.
Legend:
■ = reset button ON   ❏ = reset button OFF   ● = LED ON   ❍ = LED OFF
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 3 of 6)
OCP Pattern: ■●❍●●❍❍
  Code: 2C
  Error: Enclosure I/O termination power out of range. Faulty or missing I/O module causes enclosure I/O termination power to be out of range.
  Repair Action: Make sure that all of the enclosure device SCSI buses have an I/O module. If problem persists, replace the failed I/O module.
OCP Pattern: ■●❍●●❍●
  Code: 2D
  Error: Master enclosure SCSI buses are not all set to ID 0.
  Repair Action: Set the PVA ID to 0 for the enclosure with the controllers. If the problem persists, try the following repair actions:
    1. Replace the PVA module.
    2. Replace the EMU.
    3. Remove all devices.
    4. Replace the enclosure.
OCP Pattern: ■●❍●●●❍
  Code: 2E
  Error: Multiple enclosures have the same SCSI ID. More than one enclosure has the same SCSI ID.
  Repair Action: Reconfigure the PVA ID to uniquely identify each enclosure in the subsystem. The enclosure with the controllers must be set to PVA ID 0; additional enclosures must use PVA IDs 2 and 3. If the error continues after PVA settings are unique, replace each PVA module one at a time. Check the enclosure if the problem remains.
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 6 of 6)
OCP Pattern: ■●●●●●●
  Code: 3F
  Error: DAEMON diagnostic failed hard in non-fault tolerant mode. DAEMON diagnostic detected critical hardware component failure; controller can no longer operate.
  Repair Action: Verify that cache module is present. If the error persists, replace controller.
Legend:
■ = reset button ON   ❏ = reset button OFF   ● = LED ON   ❍ = LED OFF
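In every Table 1–2 and Table 1–3 row shown in this chapter, the six LEDs read left to right as the binary form of the hexadecimal code (for example, ON OFF OFF ON OFF ON is 100101, or 25). The following Python sketch decodes a pattern on that basis. It is an observation drawn from the rows shown here, not a documented rule, so treat it as an assumption for any row not listed. The function name and the ASCII aliases n, o, l, and m are hypothetical conveniences for typing patterns without the legend symbols.

```python
# Legend symbols, plus hypothetical ASCII aliases for plain-text entry.
RESET_BUTTON_LIT = {"■": True, "❏": False, "n": True, "o": False}
LED_LIT = {"●": 1, "❍": 0, "l": 1, "m": 0}


def decode_ocp_pattern(pattern):
    """Return (reset_button_lit, code) for a 7-symbol OCP pattern.

    The first symbol is the reset button; the next six are the LEDs,
    read left to right as a 6-bit binary number (an assumption based
    on the table rows shown in this chapter).
    """
    if len(pattern) != 7:
        raise ValueError("expected 1 reset-button symbol followed by 6 LED symbols")
    reset_lit = RESET_BUTTON_LIT[pattern[0]]
    code = 0
    for symbol in pattern[1:]:
        code = (code << 1) | LED_LIT[symbol]
    return reset_lit, code


reset_lit, code = decode_ocp_pattern("■●❍●●❍●")
print(reset_lit, format(code, "X"))
```

Whether a lit symbol means FLASHING or ON depends on which table the pattern comes from, so the sketch reports only lit or not lit.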
Last Failure Reporting
Last failures are automatically displayed on the maintenance terminal (unless disabled
via the FMU) using %LFL formatting. The example below shows a Last Failure
report:
%LFL--HSG> --13-MAY-2001 04:39:45 (time not set)-- Last Failure Code:
20090010
Power On Time: 0. Years, 14. Days, 19. Hours, 58. Minutes, 42. Seconds
Controller Model: HSG80
Serial Number: AA12345678 Hardware Version: 0000(00)
Software Version: V087P(FF)
Informational Report
Instance Code: 0102030A
Last Failure Code: 20090010 (No Last Failure Parameters)
Additional information is available in Last Failure Entry: 1.
In addition, Last Failures are reported to the host error log using Template 01,
following a restart of the controller. See Chapter 4 for a more detailed explanation of
this template.
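When Last Failure reports are captured from the maintenance terminal into a text file, the codes can be pulled out with a few regular expressions. The Python sketch below assumes the field labels and the 8-hex-digit code width seen in the sample report above; other firmware versions or report types might differ. The function and variable names are hypothetical.

```python
import re

# Field labels copied from the sample report; codes are 8 hex digits there.
FIELD_PATTERNS = {
    "last_failure_code": re.compile(r"Last Failure Code:\s*([0-9A-Fa-f]{8})"),
    "instance_code": re.compile(r"Instance Code:\s*([0-9A-Fa-f]{8})"),
    "last_failure_entry": re.compile(r"Last Failure Entry:\s*(\d+)"),
}


def extract_codes(report_text):
    """Return the first match for each known field, or None when absent."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(report_text)
        found[name] = match.group(1).upper() if match else None
    return found


SAMPLE_REPORT = """%LFL--HSG> --13-MAY-2001 04:39:45 (time not set)-- Last Failure Code:
20090010
Informational Report
Instance Code: 0102030A
Last Failure Code: 20090010 (No Last Failure Parameters)
Additional information is available in Last Failure Entry: 1.
"""

print(extract_codes(SAMPLE_REPORT))
```

The `\s*` after each label allows the value to wrap onto the next line, as it does in the first line of the sample.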
Reporting Events That Allow Controller Operation to Continue
Events that do not cause controller operation to halt are displayed in one of two ways:
Spontaneous event logs are automatically displayed on the maintenance terminal
(unless disabled with the FMU) using %EVL formatting, as illustrated in the
following examples:
%EVL--HSG> --13-OCT-2000 04:32:47 (time not set)-- Instance Code: 0102030A (not yet
reported to host)
Template: 1.(01)
Power On Time: 0. Years, 14. Days, 19. Hours, 58. Minutes, 43. Seconds
Controller Model: HSG80
Serial Number: AA12345678 Hardware Version: 0000(00)
Software Version: V087P(FF)
Informational Report
Instance Code: 0102030A
Last Failure Code: 011C0011
Last Failure Parameter[0.] 0000003F
%EVL--HSG> --13-OCT-2000 04:32:47 (time not set)-- Instance Code: 82042002 (not yet
reported to host)
Template: 13.(13)
Power On Time: 0. Years, 14. Days, 19. Hours, 58. Minutes, 43. Seconds
Controller Model: HSG80
Serial Number: AA12345678 Hardware Version: 0000(00)
Software Version: V087P(FF)
Header type: 00 Header flags: 00
Test entity number: 0F Test number Demand/Failure: F8 Command: 01
Error Code: 0008 Return Code: 0005 Address of Error: A0000000
Expected Error Data: 44FCFCFC Actual Error Data: FFFF01BB
Extra Status(1): 00000000 Extra Status(2): 00000000 Extra Status(3): 00000000
Instance Code: 82042002
HSG>
Spontaneous event logs are reported to the host error log using SCSI Sense Data
Templates 01, 04, 05, 11, 12, 13, 14, 41, 51, and 90. See Chapter 3 for a more detailed
explanation of templates.
CLI Event Reporting
CLI event reports are automatically displayed on the maintenance terminal (unless
disabled with the FMU) using %CER formatting, as shown in the following example:
%CER--HSG> --13-OCT-2000 04:32:20 (time not set)-- Previous controller operation stopped with display of solid fault code, OCP Code: 3F
HSG>
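The %FLL, %LFL, %EVL, and %CER reports shown in this chapter share the same header layout: a three-letter report type, the CLI prompt, and a timestamp. The Python sketch below classifies a captured line on that basis. The layout is inferred only from the samples in this chapter, and the function name and the wording of the descriptions are hypothetical.

```python
import re
from datetime import datetime

REPORT_KINDS = {
    "FLL": "solid OCP fault code",
    "LFL": "Last Failure report",
    "EVL": "spontaneous event log",
    "CER": "CLI event report",
}
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
# Header layout inferred from the samples: %KIND--prompt --DD-MON-YYYY HH:MM:SS
HEADER = re.compile(
    r"^%(?P<kind>[A-Z]{3})--(?P<prompt>\S+) --"
    r"(?P<day>\d{1,2})-(?P<month>[A-Z]{3})-(?P<year>\d{4}) "
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})"
)


def parse_report_header(line):
    """Return (kind, description, timestamp), or None if the line is not a header."""
    match = HEADER.match(line)
    if match is None or match["month"] not in MONTHS:
        return None
    stamp = datetime(
        int(match["year"]), MONTHS.index(match["month"]) + 1, int(match["day"]),
        int(match["hour"]), int(match["minute"]), int(match["second"]),
    )
    kind = match["kind"]
    return kind, REPORT_KINDS.get(kind, "unknown"), stamp


print(parse_report_header(
    "%CER--HSG> --13-OCT-2000 04:32:20 (time not set)-- Previous controller "
    "operation stopped with display of solid fault code, OCP Code: 3F"
))
```

The month names are matched against a fixed table rather than the system locale, so the result does not depend on the host language settings. A "(time not set)" marker in the header suggests the timestamp might not reflect wall-clock time.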
During startup, the controller automatically tests the device ports, host ports, cache
module, and value-added functions. If intermittent problems occur with one of these
components, run the controller diagnostic test in a continuous loop rather than
restarting the controller repeatedly.
Use the following steps to run the controller diagnostic test:
1. Connect a terminal to the controller maintenance port.
2. Start the self-test with one of the following commands:
SELFTEST THIS_CONTROLLER
SELFTEST OTHER_CONTROLLER
NOTE: The self-test runs until an error is detected or until the controller reset button is pressed.
If the self-test detects an error, the self-test saves information about the error and
produces an OCP LED code for a “daemon hard error.” Restart the controller to write
the error information to the host error log, then check the host error log for a “built-in
self-test failure” event report. This report will contain an instance code, located at
offset 32 through 35, that can be used to determine the cause of the error. See
Chapter 2, “Translating Event Codes” for help translating instance codes.
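As a rough illustration of the offset arithmetic, the Python sketch below extracts the 4-byte instance code from a raw event report buffer. The offsets 32 through 35 come from the text above; the byte order is not stated in this chapter, so big-endian is an assumption exposed as a parameter, and the function name is hypothetical.

```python
def instance_code_from_report(report_bytes, byteorder="big"):
    """Return the instance code at offsets 32 through 35 as 8 hex digits.

    byteorder is an assumption: this chapter does not state how the
    four bytes are ordered, so verify it against a known report.
    """
    if len(report_bytes) < 36:
        raise ValueError("report is too short to contain offsets 32 through 35")
    value = int.from_bytes(report_bytes[32:36], byteorder)
    return f"{value:08X}"


# Synthetic buffer: 32 zero bytes followed by a sample instance code.
sample = bytes(32) + bytes.fromhex("82042002")
print(instance_code_from_report(sample))
```

The resulting string can then be translated as described in Chapter 2.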
Troubleshooting Information
ECB Charging Diagnostics
Whenever restarting the controller, the diagnostic routines automatically check the
charge of each ECB battery. If the battery is fully charged, the controller reports the
battery as good and rechecks the battery every 24 hours. If the battery is charging, the
controller rechecks the battery every 4 minutes. A battery is reported as being either
above or below 50 percent capacity. A battery below 50 percent capacity is referred to
as low.
The 4-minute polling continues for the maximum allowable time to recharge the
battery—up to 10 hours for a BA370 enclosure, or 3.5 hours for a Model 2200
enclosure. If the battery does not charge sufficiently after the allotted time, the
controller declares the battery as failed.
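The polling limits above translate into a fixed number of 4-minute rechecks per enclosure type. The Python sketch below works through that arithmetic; the names are hypothetical, and rounding down a partial final interval is an assumption.

```python
POLL_INTERVAL_MINUTES = 4
# Maximum recharge time in hours, from the text above.
MAX_CHARGE_HOURS = {"BA370": 10.0, "Model 2200": 3.5}


def max_charging_polls(enclosure):
    """Return how many complete 4-minute polls fit in the recharge window."""
    total_minutes = MAX_CHARGE_HOURS[enclosure] * 60
    return int(total_minutes // POLL_INTERVAL_MINUTES)


for enclosure in MAX_CHARGE_HOURS:
    print(enclosure, max_charging_polls(enclosure))
```

For a BA370 enclosure this gives 600 / 4 = 150 polls; for a Model 2200 enclosure, 210 / 4 = 52 complete polls.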
Battery Hysteresis
When charging an ECB battery, write-back caching is allowed as long as a previous
downtime did not drain more than 50 percent battery capacity. When an ECB battery
is operating below 50 percent capacity, the battery is considered to be low and
write-back caching is disabled.
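The 50 percent rule can be expressed as a pair of small checks. This Python sketch is illustrative only: the controller reports a battery as above or below 50 percent rather than as an exact figure, and treating exactly 50 percent as good is an assumption. The function names are hypothetical.

```python
LOW_THRESHOLD_PERCENT = 50


def battery_state(capacity_percent):
    """Classify an ECB battery as 'low' (below 50 percent) or 'good'."""
    return "low" if capacity_percent < LOW_THRESHOLD_PERCENT else "good"


def write_back_allowed(capacity_percent):
    """Write-back caching is disabled while the battery is low."""
    return battery_state(capacity_percent) == "good"


print(battery_state(35), write_back_allowed(35))
print(battery_state(80), write_back_allowed(80))
```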
ECB battery capacity depends on the size of the cache module memory configuration
as shown in Table 1–4. For example, when the batteries are fully charged, an ECB can
preserve 512 MB of cache memory for 24 hours (1 day).
CAUTION: StorageWorks recommends replacing the ECB every 2 years to prevent
battery failure.
NOTE: If a UPS is used for backup power and set to DATACENTER_WIDE, the controller does
not check the battery. See the controller configuration planning guide, controller installation and
configuration guide and controller CLI reference guide for information about the UPS switches.
DIMM Combinations    Capacity in Hours (Days)
Caching Techniques
The cache module supports the following caching techniques to increase subsystem
read and write performance:
• Read caching
• Read-ahead caching
• Write-through caching
• Write-back caching
Read Caching
When the controller receives a read request from the host, the controller reads the data
from the disk drives, delivers the data to the host, and stores the data in the supporting
cache module. Subsequent reads for the same data will take this data from the
supporting cache module rather than access the data from the disk drives. This process
is called read caching.
Read caching can decrease the subsystem response time to many host read requests. If
the host requests some or all of the cached data, the controller satisfies the request
from the supporting cache module rather than from the disk drives. Read caching is
enabled by default for all storage units.
For more details, refer to the following CLI commands in the controller CLI reference
guide:
SET unit-number MAXIMUM_CACHED_TRANSFER=nn
SET unit-number MAX_READ_CACHED_TRANSFER_SIZE=nn
SET unit-number READ_CACHE
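The read-caching flow described in this section can be modeled in a few lines. The Python sketch below is a toy model, not the controller implementation: a dictionary stands in for the disk drives, the cache is unbounded, and all names are hypothetical.

```python
class ReadCache:
    """Toy model of read caching: serve repeat reads from cache memory."""

    def __init__(self, disk):
        self.disk = disk          # block number -> data, stands in for the disk drives
        self.cache = {}           # stands in for the cache module
        self.disk_reads = 0

    def read(self, block):
        if block in self.cache:           # repeat read: satisfied from the cache
            return self.cache[block]
        self.disk_reads += 1              # first read: go to the disk drives
        data = self.disk[block]
        self.cache[block] = data          # keep a copy for later requests
        return data


disk = {0: "block-0", 1: "block-1"}
cache = ReadCache(disk)
cache.read(0)
cache.read(0)
print(cache.disk_reads)
```

A real cache has finite capacity and limits on cached transfer size, which this model ignores.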
Read-Ahead Caching
Read-ahead caching begins when the controller has already processed a read request
and the controller receives a subsequent read request from the host. If the controller
does not find the data in the cache memory, the controller reads the data from the disk
drives and sends this data to the cache memory.
During read-ahead caching, the controller anticipates subsequent read requests and
begins to prefetch the next blocks of data from the disk drives as the controller sends
the requested read data to the host. These are parallel actions. The controller notifies
the host of the read completion, and subsequent sequential read requests are satisfied
from the cache memory. Read-ahead caching is enabled by default for all disk units.
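Read-ahead can be sketched the same way. In the Python model below, each read also prefetches the next few blocks so that a following sequential read is a cache hit. The prefetch depth is a hypothetical parameter, and the prefetch runs sequentially here, whereas the controller performs it in parallel with the host transfer.

```python
class ReadAheadCache:
    """Toy model of read-ahead caching with a fixed prefetch depth."""

    def __init__(self, disk, prefetch_depth=2):
        self.disk = disk                      # block number -> data
        self.cache = {}
        self.prefetch_depth = prefetch_depth  # hypothetical tuning value
        self.disk_reads = 0
        self.hits = 0

    def _fetch(self, block):
        self.disk_reads += 1
        self.cache[block] = self.disk[block]

    def read(self, block):
        if block in self.cache:
            self.hits += 1                    # satisfied from cache memory
        else:
            self._fetch(block)
        data = self.cache[block]
        # Anticipate sequential reads: prefetch the following blocks.
        for nxt in range(block + 1, block + 1 + self.prefetch_depth):
            if nxt in self.disk and nxt not in self.cache:
                self._fetch(nxt)
        return data


disk = {n: f"block-{n}" for n in range(6)}
cache = ReadAheadCache(disk)
cache.read(0)   # miss: reads block 0, prefetches blocks 1 and 2
cache.read(1)   # hit: block 1 was prefetched
print(cache.hits, cache.disk_reads)
```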
Write-Through Caching
When the controller receives a write request from the host, the controller places the
data in the supporting cache module, writes the data to the disk drives, then notifies
the host when the write operation is complete. This process is called write-through
caching because the data actually passes through—and is stored in—the cache
memory along the way to the disk drives.
If read-caching is enabled for a storage unit, write-through caching is automatically
enabled.
Write-Back Caching
Write-back caching improves the subsystem response time to write requests by
allowing the controller to declare the write operation “complete” as soon as the data
reaches the supporting cache memory. The controller performs the slower operation of
writing the data to the disk drives at a later time. For more details, refer to the
following CLI commands in the controller CLI reference guide:
SET unit-number MAXIMUM_CACHED_TRANSFER=nn
SET unit-number MAX_WRITE_CACHED_TRANSFER_SIZE=nn
SET unit-number WRITEBACK_CACHE
Write-back caching is enabled by default for all units. The controller will only provide
write-back caching to a unit if the cache memory is nonvolatile, as described in the
next section.
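The difference between the two write policies is the point at which the write is reported complete. The Python sketch below is a toy model with hypothetical names: in write-through mode the data reaches the disk before the write returns, while in write-back mode the block is only marked as unwritten and reaches the disk on a later flush.

```python
class WriteCache:
    """Toy model contrasting write-through and write-back caching."""

    def __init__(self, write_back=False):
        self.write_back = write_back
        self.cache = {}        # stands in for the cache module
        self.disk = {}         # stands in for the disk drives
        self.unwritten = set() # blocks held only in cache (write-back data)

    def write(self, block, data):
        self.cache[block] = data
        if self.write_back:
            self.unwritten.add(block)   # complete now; disk write is deferred
        else:
            self.disk[block] = data     # write-through: disk write comes first
        return "complete"

    def flush(self):
        """Write all unwritten blocks to disk."""
        for block in sorted(self.unwritten):
            self.disk[block] = self.cache[block]
        self.unwritten.clear()


wb = WriteCache(write_back=True)
wb.write(7, "data")
print(7 in wb.disk)   # the block is still only in cache
wb.flush()
print(7 in wb.disk)
```

The unwritten set is exactly the data at risk in a power failure, which is why write-back caching depends on nonvolatile cache memory.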
By default, the controller expects to use an ECB as the backup power source for the
cache module. However, if the subsystem is protected by a UPS, use one of the
following CLI commands to instruct the controller to use the UPS:
SET controller UPS=NODE_ONLY
or
SET controller UPS=DATACENTER_WIDE
Fault-Tolerance for Write-Back Caching
The cache module supports nonvolatile memory and dynamic cache policies to protect
the availability of cache module unwritten (write-back) data.
Nonvolatile Memory
The controller provides write-back caching for storage units as long as the controller
cache memory is connected to a nonvolatile backup power source, such as an ECB.
The cache module must be nonvolatile to preserve unwritten cache data during a
power failure. If the cache memory is not connected to a backup power supply, this
unwritten data will be lost during a power failure.
NOTE: Disaster-tolerant mirrorsets are not subject to this requirement.
By default, the controller expects to use an ECB as the backup power source for the
supporting cache module. However, if the subsystem is backed up using a UPS, two
options are available that tell the controller to use the UPS:
• For BA370 enclosures only: use both the ECB and the UPS together with the
following command:
SET controller UPS=NODE_ONLY
• Use only the UPS as the backup power source with the following command:
SET controller UPS=DATACENTER_WIDE
NOTE: See the controller CLI reference guide for detailed descriptions of these commands.
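The choice of UPS switch can be summarized as a small decision function. The Python sketch below uses only the three settings named in this chapter (NOUPS, UPS=NODE_ONLY for a BA370 enclosure with both an ECB and a UPS, and UPS=DATACENTER_WIDE for a UPS alone). The function name is hypothetical, and raising an error for combinations this chapter does not cover is a design choice of the sketch, not controller behavior.

```python
def ups_switch_command(has_ecb, has_ups, enclosure="BA370"):
    """Return the SET command matching the backup power sources in use."""
    if not has_ups:
        if not has_ecb:
            raise ValueError("no backup power source: cache memory would be volatile")
        return "SET controller NOUPS"                 # ECB only
    if has_ecb:
        if enclosure != "BA370":
            # This chapter lists ECB plus UPS for BA370 enclosures only.
            raise ValueError("ECB plus UPS is described for BA370 enclosures only")
        return "SET controller UPS=NODE_ONLY"         # ECB and UPS together
    return "SET controller UPS=DATACENTER_WIDE"       # UPS only


print(ups_switch_command(has_ecb=True, has_ups=False))
print(ups_switch_command(has_ecb=True, has_ups=True))
print(ups_switch_command(has_ecb=False, has_ups=True))
```

As in the command listings, "controller" is a placeholder for the controller being configured.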
Cache Policies Resulting from Cache Module Failures
If the controller detects a full or partial failure of the supporting cache module or ECB,
the controller automatically reacts to preserve the unwritten data in the supporting
cache module. Depending upon the severity of the failure, the controller chooses an
interim caching technique—also called the cache policy—until the cache module or
ECB is repaired or replaced.
Table 1–5 shows the cache policies resulting from a full or partial failure of cache
module A (Cache A) in a dual-redundant controller configuration. The consequences
shown in Table 1–5 are the same for Cache B failures.
Table 1–6 on page 1–29 shows the cache policies resulting from a full or partial failure
of the ECB connected to Cache A in a dual-redundant controller configuration. The
consequences shown in Table 1–6 are the opposite for an ECB failure connected to
Cache B.
• If the ECB is at least 50% charged, the ECB is still good and is charging.
• If the ECB is less than 50% charged, the ECB is low but still charging.
Table 1–5: Cache Policies—Cache Module Status (Sheet 1 of 3)
Cache A: Good. Cache B: Good.
  Unmirrored cache:
    Data loss: None
    Cache policy: Both controllers support write-back caching.
  Mirrored cache:
    Data loss: None
    Cache policy: Both controllers support write-back caching.
    Failover: None
Cache A: [...] Cache B: [...]
  Unmirrored cache:
    Data loss: [...] loss of write-back data for which the multibit error occurred. Controller A detects and reports the lost blocks.
    Cache policy: Both controllers support write-back caching.
    Failover: None
  Mirrored cache:
    Data loss: None. Controller A recovers lost write-back data from the mirrored copy on Cache B.
    Cache policy: Both controllers support write-back caching.
    Failover: None
Table 1–5: Cache Policies—Cache Module Status (Sheet 2 of 3)
Cache A: DIMM or cache memory controller chip failure. Cache B: Good.
  Unmirrored cache:
    Data loss: Write-back data that was not written to media when failure occurred was not recovered.
    Cache policy: Controller A supports write-through caching only; Controller B supports write-back caching.
    Failover: In transparent failover, all units fail over to Controller B. In multiple-bus failover with host-assist, only those units that use write-back caching, such as RAIDsets and mirrorsets, fail over to Controller B. All units with lost data become inoperative until they are cleared using the CLEAR unit-number LOST_DATA command. Units that did not lose data operate normally on Controller B.
    In single-controller configurations, RAIDsets, mirrorsets, and all units with lost data become inoperative. Although lost data errors can be cleared on some units, RAIDsets and mirrorsets remain inoperative until the memory on Cache A is repaired or replaced.
  Mirrored cache:
    Data loss: Controller A recovers all of write-back data from the mirrored copy on Cache B.
    Cache policy: Controller A supports write-through caching only; Controller B supports write-back caching.
    Failover: In transparent failover, all units fail over to Controller B and operate normally. In multiple-bus failover with host-assist, only those units that use write-back caching, such as RAIDsets and mirrorsets, fail over to Controller B.
Table 1–5: Cache Policies—Cache Module Status (Sheet 3 of 3)
Cache Module Status: Cache A, Cache B
Cache Policy: Unmirrored Cache, Mirrored Cache
Cache A: Cache Board Failure. Cache B: Good.
Unmirrored Cache
    Same as for DIMM failure.
Mirrored Cache
    Data loss: Controller A recovers all of write-back data from the
    mirrored copy on Cache B.
    Cache policy: Both controllers support write-through caching only.
    Controller B cannot execute mirrored writes because Cache A cannot
    mirror Controller B unwritten data.
    Failover: None
Table 1–6: Resulting Cache Policies—ECB Status (Sheet 1 of 4)
•Both cache modules have an ECB connected and the UPS switch is set by the
following command:
SET controller NOUPS (no UPS is connected)
•Both cache modules either:
— Have an ECB connected, and the UPS switch is set by one of the following
commands:
SET controller NOUPS (no UPS is connected)
BA370 enclosure only: SET controller UPS=NODE_ONLY (a UPS is
connected)
— Do not have an ECB connected, and the UPS switch is set by the following
command:
SET controller UPS=DATACENTER_WIDE
NOTE: No unit errors are outstanding (for example, lost data or data that cannot be written to
devices).
•Both controllers are started and configured in failover mode.
For important considerations when configuring a subsystem for mirrored caching, see
the controller installation and configuration guide. To add or replace DIMMs in a
mirrored cache configuration, see the controller maintenance and service guide.
This chapter describes the utilities and exercisers available to help troubleshoot and
maintain the controllers, cache modules, and ECBs. These utilities and exercisers
include:
•Fault Management Utility (FMU)
•Video Terminal Display (VTDPY) Utility
•Disk Inline Exerciser (DILX)
•Format and Device Code Load Utility (HSUTIL)
•Configuration (CONFIG) Utility
•Code Load and Code Patch (CLCP) Utility
•Clone (CLONE) Utility
•Field Replacement Utility (FRUTIL)
•Change Volume Serial Number (CHVSN) Utility
Fault Management Utility (FMU)
The FMU provides a limited interface to the controller fault management software.
Use FMU to:
•Display the last failure and memory-system failure entries that the fault
management software stores in the controller nonvolatile memory.
•Translate many of the code values contained in event messages. For example,
entries might contain code values that indicate the cause of the event, the software
component that reported the event, or the repair action.
•Display the Instance Codes that identify and accompany significant events that do
•Display the Last Failure Codes that identify and accompany failure events that
cause the controller to halt operations. Last Failure Codes are sent to the host only
after the affected controller is restarted.
•Control the display characteristics of significant events and failures that the fault
management system displays on the maintenance terminal. See “Controlling the
Display of Significant Events and Failures” on page 2–5 for specific details on this
feature.
Displaying Failure Entries
The controller stores the 16 most recent last failure reports as entries in its nonvolatile
memory. The occurrence of any failure event halts operation of the controller on
which it occurred.
NOTE: Memory system failures are reported through the last failure mechanism but can be
displayed separately.
Use the following steps to display the last failure entries:
1. Connect a PC or a local terminal to the controller maintenance port.
2. Start FMU with the following command:
RUN FMU
3. Show one or more of the entries with the following command:
SHOW event_type entry# FULL
where:
•event_type is LAST_FAILURE or MEMORY_SYSTEM_FAILURE
•entry# is ALL, MOST_RECENT, or 1 through 16
•FULL displays additional information, such as the Intel i960 stack and
hardware component register sets (for example, the memory controller, FX,
host port, device ports, and so forth).
4. Exit FMU with the following command:
EXIT
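The SHOW syntax in step 3 can be assembled and checked before it is typed at the FMU prompt. The following Python sketch is not part of FMU and does not communicate with a controller; the function name build_show_command is hypothetical, and it encodes only the argument values listed above.

```python
# Hypothetical helper: builds an FMU SHOW command string from the
# argument values documented in the steps above.
VALID_EVENT_TYPES = {"LAST_FAILURE", "MEMORY_SYSTEM_FAILURE"}


def build_show_command(event_type, entry="MOST_RECENT", full=False):
    """Return a SHOW command string, or raise ValueError on bad input."""
    event_type = event_type.upper()
    if event_type not in VALID_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    entry = str(entry).upper()
    if entry not in {"ALL", "MOST_RECENT"}:
        # The controller keeps the 16 most recent entries, numbered 1-16.
        if not entry.isdecimal() or not 1 <= int(entry) <= 16:
            raise ValueError("entry# must be ALL, MOST_RECENT, or 1 through 16")
    parts = ["SHOW", event_type, entry]
    if full:
        parts.append("FULL")
    return " ".join(parts)


print(build_show_command("last_failure", 3, full=True))
```

The resulting string is what would be entered at the FMU prompt after RUN FMU.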
The following example shows a last failure entry. The Informational Report—the
lower half of the entry—contains the last failure code, reporting component, and so
forth, that can be translated with FMU to learn more about the event.
inconsistencies, uninterpreted device errors, etc.) or an intentional
restart or shutdown of controller operation is indicated.
Last Failure Code: 20090010 (No Last Failure Parameters)
Last Failure Code: 20090010 Description:
This controller requested this controller to shutdown.
Reporting Component: 32.(20) Description:
Command Line interface
Reporting component's event number: 9.(09)
Restart Type: 1.(01) Description: No restart
Translating Event Codes
To translate the event codes in the fault management reports for spontaneous events
and failures, complete the following:
1. Connect a PC or a local terminal to the controller maintenance port.
2. Start FMU with the following command:
RUN FMU
3. Translate a code with the following command:
DESCRIBE code_type code#
where:
•code_type is one of those listed in Table 2–1
•code# is the alphanumeric value displayed in the entry
•code types marked with an asterisk (*) require multiple code numbers
(see Chapter 3 for the type codes used in the various templates, Chapter 4 for
ASC, ASCQ, Repair Action, and Component ID codes, Chapter 5 for Instance
Codes, and Chapter 6 for Last Failure Codes)
The following examples show the FMU translation of a last failure code and an
instance code.
FMU>DESCRIBE LAST_FAILURE_CODE 206C0020
Last Failure Code: 206C0020
Description: Controller was forced to restart in order for new controller
code image to take effect.
Reporting Component: 32.(20)
Description: Command Line interface
Reporting component's event number: 108.(6C)
Restart Type: 2.(02)
Description: Automatic hardware restart
FMU>DESCRIBE INSTANCE 026e0001
Instance Code: 026E0001
Description: The device specified in the Device Locator field has been
reduced from the Mirrorset associated with the logical unit. The nominal
number of members in the mirrorset has been decreased by one. The reduced
device is now available for use.
Reporting Component: 2.(02)
Description: Value Added Services
Reporting component's event number: 110.(6E)
Event Threshold: 1.(01) Classification:
IMMEDIATE. Failure or potential failure of a component critical to proper
controller operation is indicated; immediate attention is required.
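In the sample translations above, the first two hexadecimal digits of each code match the reporting component and the next two match the reporting component's event number. The Python sketch below splits a code along those lines. The positions of the restart type (upper four bits of the last byte) and the event threshold (the last byte) are assumptions inferred only from these samples, and the function names are hypothetical; Chapter 5 and Chapter 6 remain the authority on the code formats.

```python
def split_code(code_hex):
    """Split an 8-digit hexadecimal code into four byte values."""
    if len(code_hex) != 8:
        raise ValueError("expected 8 hexadecimal digits")
    value = int(code_hex, 16)
    return [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def decode_last_failure(code_hex):
    component, event, _, low = split_code(code_hex)
    # Assumption: restart type sits in the upper nibble of the last byte.
    return {"component": component, "event": event, "restart_type": low >> 4}


def decode_instance(code_hex):
    component, event, _, low = split_code(code_hex)
    # Assumption: event threshold is carried in the last byte.
    return {"component": component, "event": event, "threshold": low}


print(decode_last_failure("206C0020"))
print(decode_instance("026e0001"))
```

For 206C0020 this yields component 32, event 108, and restart type 2, which agrees with the FMU output shown above.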
Controlling the Display of Significant Events and Failures
Use the SET command to control how the fault management software displays
significant events and failures.
Table 2–2 describes various SET commands that can be entered while running FMU.
These commands remain in effect only as long as the current FMU session remains
active, unless the PERMANENT qualifier is entered (the last entry in the table).
Table 2–2: FMU SET Commands
Command
    Result
SET EVENT_LOGGING
SET NOEVENT_LOGGING
    Enable and disable the spontaneous display of significant events to
    the local terminal; preceded by “%EVL” (see example in Chapter 1). By
    default, logging is enabled (SET EVENT_LOGGING).
    When logging is enabled, the controller spontaneously displays
    information about the events on the local terminal. Spontaneous event
    logging is suspended during the execution of CLI commands and
    operation of utilities on a local terminal. Because these events are
    spontaneous, logs are not stored by the controller.
SET LAST_FAILURE LOGGING
SET NOLAST_FAILURE LOGGING
    Enable and disable the spontaneous display of last failure events;
    preceded by “%LFL” (see example in Chapter 1). By default, logging is
    enabled (SET LAST_FAILURE LOGGING).
    The controller spontaneously displays information relevant to the
    sudden termination of controller operation.
    In cases of automatic hardware reset (for example, power failure or
    pressing the controller reset button), the fault LED log display is
    inhibited because automatic resets do not allow sufficient time to
    complete the log display.
SET log_type REPAIR_ACTION
SET log_type NOREPAIR_ACTION
    Enable and disable the inclusion of repair action information for
    event logging or last failure logging. By default, repair actions are
    not displayed for these log types (SET log_type NOREPAIR_ACTION). If
    the display of repair actions is enabled, the controller displays any
    of the recommended repair actions associated with the event.
SET log_type VERBOSE
SET log_type NOVERBOSE
    Enable and disable the automatic translation of event codes that are
    contained in event logs or last failure logs. By default, this
    descriptive text is not displayed (SET log_type NOVERBOSE). See
    “Translating Event Codes” on page 2–3 for instructions to translate
    these codes manually.
SET PROMPT
SET NOPROMPT
    Enable and disable the display of the CLI prompt string following the
    log identifier “%EVL,” or “%LFL,” or “%FLL.” This command is useful if
    the CLI prompt string is used to identify the controllers in a
    dual-redundant configuration (see the controller CLI reference guide
    for instructions to set the CLI command string for a controller). If
    enabled, the CLI prompt will be able to identify which controller sent
    the log to the local terminal. By default, the prompt is set (SET
    PROMPT).
SET TIMESTAMP
SET NOTIMESTAMP
    Enable and disable the display of the current date and time in the
    first line of an event or last failure log. By default, the timestamp
    is set (SET TIMESTAMP).
SET FMU_REPAIR_ACTION
SET FMU_NOREPAIR_ACTION
    Enable and disable the inclusion of repair actions with SHOW
    LAST_FAILURE and SHOW MEMORY_SYSTEM_FAILURE commands. By default, the
    repair actions are not shown (SET FMU_NOREPAIR_ACTION). If repair
    actions are enabled, the command outputs display all of the
    recommended repair actions associated with the instance or last
    failure codes used to describe an event.
SET FMU_VERBOSE
SET FMU_NOVERBOSE
    Enable and disable the inclusion of instance and last failure code
    descriptive text with SHOW LAST_FAILURE and SHOW
    MEMORY_SYSTEM_FAILURE commands. By default, this descriptive text is
    not displayed (SET FMU_NOVERBOSE). If the descriptive text is enabled,
    it identifies the fields and their numeric content that comprise an
    event or last failure entry.
SET CLI_EVENT_REPORTING
SET NOCLI_EVENT_REPORTING
    Enable and disable the asynchronous errors reported at the CLI prompt
    (for example, “swap signals disabled” or “shelf (enclosure) has a bad
    power supply”); preceded by “%CER” (see example in Chapter 1). By
    default, these errors are reported (SET CLI_EVENT_REPORTING). These
    errors are cleared with the CLEAR ERRORS_CLI command.
SET FAULT_LED_LOGGING
SET NOFAULT_LED_LOGGING
    Enable and disable the solid fault LED event log display on the local
    terminal. Preceded by “%FLL.” By default, logging is enabled (SET
    FAULT_LED_LOGGING).
    When enabled, and a solid fault pattern is displayed in the OCP LEDs,
    the fault pattern and its meaning are displayed on the maintenance
    terminal. For many of the patterns, additional information is also
    displayed to aid in problem diagnosis.
    In cases of automatic hardware reset (for example, power failure or
    pressing the controller reset button), the fault LED log display is
    inhibited because automatic resets do not allow sufficient time to
    complete the log display.
SHOW PARAMETERS
    Displays the current settings associated with the SET command.
SET command PERMANENT
    Preserves the SET command across controller resets.
Video Terminal Display (VTDPY) Utility
The VTDPY utility, through various screens, displays configuration and performance
information for the HSG80 storage subsystem and is used to check the subsystem for
communication problems. Information displayed includes:
•Processor utilization
•Virtual storage unit activity and configuration
•Cache performance
•Device activity and configuration
•Host port activity and configuration
•Local and remote controller activity in a Data Replication Manager configuration
NOTE: All VTDPY screen displays are 132 characters wide. However, for readability purposes,
the sample screens in this section are not complete screens as viewed on the terminal.
Restrictions with VTDPY
The following restrictions apply when using VTDPY:
•The VTDPY utility requires a serial maintenance terminal that supports ANSI
control sequences or a graphics display that emulates an ANSI-compatible
terminal.
•Only one VTDPY session can be run on a controller at a time.
•VTDPY does not display information for passthrough devices.
Running VTDPY
Use the following steps to run VTDPY:
1. Connect a serial maintenance terminal to the controller maintenance port.
IMPORTANT: The terminal must support ANSI control sequences.
2. Set the terminal to NOWRAP mode to prevent the top line of the display from
scrolling off of the screen.
3. Press Enter/Return to display the CLI prompt (CLI>).
4. Start VTDPY with the following command:
RUN VTDPY
Use the key sequences and commands listed in Table 2–3 to control VTDPY.
Table 2–3: VTDPY Key Sequences and Commands
Command    Action
Ctrl/C     Enables command mode; after entering Ctrl/C, enter one of the
           following commands and press Enter/Return:
           CLEAR
           DISPLAY CACHE
           DISPLAY DEFAULT
           DISPLAY DEVICE
           DISPLAY HOST
           DISPLAY REMOTE (ACS version 8.7P only)
           DISPLAY RESOURCE
           DISPLAY STATUS
           EXIT or QUIT
           HELP
           INTERVAL seconds (to change update interval)
           REFRESH or UPDATE
Ctrl/G     Updates screen
Ctrl/O     Pauses (and resumes) screen updates
Ctrl/R     Refreshes the current screen display
Ctrl/W     Refreshes the current screen display
Ctrl/Y     Exits VTDPY
Commands can be abbreviated to the minimum number of characters necessary to
identify the command. Enter a question mark (?) after a partial command to see the
values that can follow the supplied command.
For example: if DISP ? (DISP<space>?) is entered, the utility will list CACHE,
DEFAULT, and other possibilities.
Upon successfully executing a command—other than HELP—VTDPY exits
command mode. Pressing Enter/Return without a command also causes VTDPY to
exit command mode.
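The abbreviation rule above amounts to unique-prefix matching. The Python sketch below illustrates the idea using the command and screen names from Table 2–3; the helper names are hypothetical, and whether VTDPY resolves abbreviations in exactly this way is an assumption.

```python
# Command and screen names taken from Table 2-3.
COMMANDS = ["CLEAR", "DISPLAY", "EXIT", "HELP", "INTERVAL",
            "QUIT", "REFRESH", "UPDATE"]
SCREENS = ["CACHE", "DEFAULT", "DEVICE", "HOST", "REMOTE",
           "RESOURCE", "STATUS"]


def candidates(partial, words):
    """Return every word that starts with the partial input."""
    partial = partial.upper()
    return [word for word in words if word.startswith(partial)]


def shortest_abbreviation(word, words):
    """Return the shortest prefix of word that matches no other word."""
    for length in range(1, len(word) + 1):
        if candidates(word[:length], words) == [word]:
            return word[:length]
    return word


print(candidates("disp", COMMANDS))
print(candidates("de", SCREENS))
print(shortest_abbreviation("DEVICE", SCREENS))
```

Under this rule DE is ambiguous between DEFAULT and DEVICE, so DEV would be the shortest usable form of DEVICE, while an empty partial input lists every possibility in the same way that entering a question mark does.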
VTDPY Help
Utilities and Exercisers
Entering HELP at the VTDPY prompt (VTDPY>) displays information about
VTDPY commands and keyboard shortcuts. See Figure 2–1 below:
NOTE: The ^ symbol denotes the Ctrl key on the keyboard.
VTDPY> HELP
Available VTDPY commands:
^C - Prompt for commands
^G or ^Z - Update screen
^O - Pause/Resume screen updates
^Y - Terminate program
^R or ^W - refresh screen
DISPLAY CACHE - Use 132 column unit caching statistics display
DISPLAY DEFAULT - Use 132 column system performance display
DISPLAY DEVICE - Use 132 column device performance display
DISPLAY HOST - Use 132 column Host Ports statistics display
DISPLAY REMOTE - Use 132 column controller status display
DISPLAY RESOURCE - Use 132 column controller status display
DISPLAY STATUS - Use 132 column controller status display
CLEAR - Clears the host port event counters
EXIT - Terminate program (same as QUIT)
INTERVAL <seconds> - Change update interval
HELP - Display this help message
REFRESH - Refresh the current display
QUIT - Terminate program (same as EXIT)
UPDATE - Update Screen Display
Figure 2–1: VTDPY commands and shortcuts generated from the Help
command
VTDPY Display Screens
VTDPY displays storage subsystem information using the following display screens:
•Default Screen
•Controller Status Screen
•Cache Performance Screen
•Device Performance Screen
•Host Ports Statistics Screen
•Resource Statistics Screen
•Remote Status Screen
Choose any of the screens by entering DISPLAY at the VTDPY prompt, followed by
the screen name. For example: enter the following command at the VTDPY prompt:
DISPLAY CACHE
Each display screen is shown in the following sections. Screen interpretations are
presented following the various screens.
Figure 2–8: Sample of the VTDPY remote status screen (ACS version 8.7P only)
Interpreting VTDPY Screen Information
Refer to the sample VTDPY screens in the previous section as needed while the
various sections of these screens are interpreted in this section. The VTDPY screens
display information in the following screen subsections:
•Controller/Processor Utilization
Each screen subsection is described in the following sections.
Screen Header
The screen header is the first line of data on every display screen. The header shows
information about the overall performance of the HSG80 storage subsystem and is
further divided into the following four subsections:
•Controller ID data
•Subsystem performance data
•Controller uptime data
•Current date and time
The controller ID data appears as follows:
HSG80 S/N: xxxxxxxxxxxx SW: xxxxxxx HW: xx-xx
where:
— HSG80: string represents the controller model name and number.
— S/N: depicts an alphanumeric serial number.
— SW: depicts a software version number.
— HW: depicts a hardware revision number.
The subsystem performance data appears as follows:
xxx.x% Idle xxxxxx KB/S xxxxx RQ/S
where:
— xxx.x% Idle displays the percentage of time that the controller policy processor is idle.
— KB/S displays cumulative data transfer rate in kilobytes per second.
— RQ/S displays cumulative unit request rate in requests per second.
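When header lines are captured from a terminal log, the subsystem performance fields can be pulled out with a regular expression keyed on the fixed labels Idle, KB/S, and RQ/S. The Python sketch below shows one way to do this; the function name and the sample values are hypothetical, and the exact spacing of a real header line is an assumption.

```python
import re

# Matches the subsystem performance part of the VTDPY screen header.
HEADER_RE = re.compile(r"(\d{1,3}\.\d)%\s+Idle\s+(\d+)\s+KB/S\s+(\d+)\s+RQ/S")


def parse_performance(header_line):
    """Extract idle percentage, KB/S, and RQ/S from a header line."""
    match = HEADER_RE.search(header_line)
    if match is None:
        raise ValueError("no performance data found in header line")
    idle = float(match.group(1))
    return {
        "idle_percent": idle,
        "busy_percent": round(100.0 - idle, 1),
        "kb_per_sec": int(match.group(2)),
        "requests_per_sec": int(match.group(3)),
    }


sample = " 87.5% Idle    1024 KB/S    256 RQ/S"  # hypothetical values
print(parse_performance(sample))
```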
The controller uptime data shows the uptime of the HSG80 controller in days, hours
and minutes in the following format:
Some VTDPY displays contain common data fields, such as the DEFAULT, STATUS,
and DEVICE screens. Table 2–4 provides a description of common data fields on
DEFAULT and STATUS screens.
Table 2–4: VTDPY—Common Data Fields Column Definitions: Part 1
Column     Contents
Pr         Thread priority
Name       Thread name or NULL (idle)
Stk/Max    Allocated stack size in 512 byte pages and maximum number of
           stack pages actually used
Typ        Thread type:
           FNC=functional thread
           DUP=device utility/exerciser (DUP) local program threads
Sta        Status:
           Bl=waiting for completion of a process currently running
           Io=waiting for input or output
           Rn=actively running
CPU%       Percentage of central processing unit resource consumption
Other common VTDPY data fields in the DEFAULT and DEVICE screens are
described in Table 2–5.
Table 2–5: VTDPY—Common Data Fields Column Definitions: Part 2
Column     Contents
Port       SCSI ports 1 through 6.
Target     SCSI targets 0 through 15. Single controllers occupy 7;
           dual-redundant controllers occupy 6 and 7.
           D=disk drive or CD-ROM drive
           F=foreign device
           H=this controller
           h=other controller in dual-redundant configurations
           P=passthrough device
           ?=unknown device type
           space=no device at this port/target location
Unit Performance Data Fields
VTDPY displays virtual storage unit performance information in a block of tabular
data in the DEFAULT, STATUS, CACHE, and RESOURCE screens only. Each of
these screens displays the unit performance data in a different format, as follows:
•DEFAULT screen uses the full format (see Figure 2–2).
•STATUS screen uses a brief format (see Figure 2–3).
•CACHE screen uses the maximum format (see Figure 2–4).
•RESOURCE screen also uses a brief format (see Figure 2–7).
Although these displays show unit performance in three different formats, the displays
share common data fields, with the brief format displaying the least information, the
full format supplying more information, and the maximum format displaying the
maximum amount of available information. See Table 2–6 for a description of each
field on these screens.
Table 2–6: VTDPY—Unit Performance Data Fields Column Definitions
Column     Contents
Unit       Kind of unit and unit number. Unit types include:
           D=disk drive or CD-ROM drive
           I=invisible device
           P=passthrough device
           ?=unknown device type
A          Availability of the unit:
           a=available to “other controller”
           d=offline, unit disabled for servicing
           e=online, unit mounted for exclusive access by a user
           f=offline, media format error
           i=offline, unit inoperative
           m=offline, maintenance mode for diagnostic purposes
           o=online, Host can access this unit through “this controller”
           r=offline, rundown set with the SET NORUN command
           v=offline, no volume mounted due to lack of media
           x=online, Host can access this unit through “other controller”
           z=currently not accessible to host due to a remote copy
           condition (ACS version 8.7P only)
           space=unknown availability
S          State of a virtual storage unit:
           ^=disk device spinning at correct speed
           >=disk device spinning up
           <=disk device spinning down
           v=disk device stopped spinning
           space=unknown spindle state or device is not a disk unit
W          Write-protection state of the virtual storage device
           W=for disk drives, indicating the device is hardware
C          Caching state of the device:
           a=read, write-back, and read-ahead caching enabled
           b=read and write-back caching enabled
           c=read and read-ahead caching enabled
           p=read-ahead caching enabled
           r=read caching only
           w=write-back caching is enabled
           space=caching disabled
KB/S       Average amount of data transferred to and from the unit during
           the last update interval in kilobyte increments per second.
Rd%        Percentage of data transferred between the host and the unit
           that was read from the unit.
Wr%        Percentage of data transferred between the host and the unit
           that was written to the unit.
Cm%        Percentage of data transferred between the host and the unit
           that was compared. A compare operation can accompany a read or
           a write operation, so this column is not the sum of columns
           Rd% and Wr%.
Ht%        Cache-hit percentage for data transferred between the host and
           the unit.
Ph%        Partial cache hit percentage of data transferred between the
           host and the unit.
MS%        Cache miss percentage of data transferred between the host and
           the unit.
Purge      Number of blocks purged from the write-back cache during the
           last update interval.
BlChd      Number of blocks added to the cache during the last update
           interval.
BlHit      Number of cached data blocks hit during the last update
           interval.
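The single-character codes in the A, S, and C columns of Table 2–6 lend themselves to a lookup when captured screens are reviewed offline. The Python sketch below copies the code meanings from the table; the function name describe_unit and the idea of treating any availability text that begins with "online" as host accessible are illustrative assumptions.

```python
# Code meanings copied from Table 2-6 (A, S, and C columns).
AVAILABILITY = {
    "a": "available to other controller",
    "d": "offline, unit disabled for servicing",
    "e": "online, unit mounted for exclusive access by a user",
    "f": "offline, media format error",
    "i": "offline, unit inoperative",
    "m": "offline, maintenance mode for diagnostic purposes",
    "o": "online, host can access this unit through this controller",
    "r": "offline, rundown set with the SET NORUN command",
    "v": "offline, no volume mounted due to lack of media",
    "x": "online, host can access this unit through other controller",
    "z": "not accessible to host due to a remote copy condition",
    " ": "unknown availability",
}
SPINDLE = {
    "^": "spinning at correct speed",
    ">": "spinning up",
    "<": "spinning down",
    "v": "stopped spinning",
    " ": "unknown spindle state or not a disk unit",
}
CACHING = {
    "a": "read, write-back, and read-ahead caching enabled",
    "b": "read and write-back caching enabled",
    "c": "read and read-ahead caching enabled",
    "p": "read-ahead caching enabled",
    "r": "read caching only",
    "w": "write-back caching is enabled",
    " ": "caching disabled",
}


def describe_unit(a_code, s_code, c_code):
    """Translate one row's A, S, and C codes into readable text."""
    availability = AVAILABILITY.get(a_code, "unrecognized code")
    return {
        "availability": availability,
        "spindle": SPINDLE.get(s_code, "unrecognized code"),
        "caching": CACHING.get(c_code, "unrecognized code"),
        "online": availability.startswith("online"),
    }


print(describe_unit("o", "^", "a"))
```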
Device Performance Data Fields
VTDPY displays up to 42 devices in the device performance region (see Figure 2–5,
upper right) of the DEVICE screen only. See Table 2–7 for a description of each field.
Table 2–7: VTDPY—Device Performance Data Fields Column Definitions (Sheet
2 of 2)
Column     Contents
WrKB/S     Average write data transfer rate to the device in KB/s during
           the previous update interval.
Que        Maximum number of transfer requests waiting to be transferred
           to the device during the last screen update interval.
Tg         Maximum number of requests queued to the device during the
           last screen update interval. If the device does not support
           tagged queuing, the maximum value is 1.
BR         Number of SCSI bus resets that occurred since VTDPY was
           started.
ER         Number of SCSI errors received. If the device is swapped or
           deleted, then the value clears and resets to 0.
Device Port Performance Data Fields
VTDPY displays a device port performance region (see Figure 2–5, lower left) on the
DEVICE screen only. See Table 2–8 for a description of each field.
Table 2–8: VTDPY—Device Port Performance Data Fields Column Definitions
Column     Contents
Port       SCSI device ports 1 through 6.
Rq/S       Average I/O request rate for the device during the last update
           interval. Requests can be up to 32 KB and generated by host
           requests or cache flush activity.
RdKB/S     Average read data transfer rate to the device in KB/s during
           the previous update interval.
WrKB/S     Average write data transfer rate to the device in KB/s during
           the previous update interval.
CR         Number of SCSI command resets that occurred since VTDPY was
           started.
BR         Number of SCSI bus resets that occurred since VTDPY was
           started.
TR         Number of SCSI target resets that occurred since VTDPY was
           started.
VTDPY displays host port configuration information in a block of tabular data in the
HOST screen only. The data is displayed for both host Port 1 and host Port 2
independently, although the format is the same for both.
Use the VTDPY>CLEAR command to clear the host display link error counters.
Table 2–9 outlines the “Known Hosts” portion of the Fibre Channel Host Status
Display. For a more detailed explanation of certain field labels and their definitions,
consult The Fibre Channel Physical and Signaling Interface Standard (also known as
the FC-PH specification).
Table 2–9: Fibre Channel Host Status Display—Known Host Connections
Field Label    Description
##             Internal ID
NAME           Refer to the SHOW CONNECTIONS command in the controller CLI
               reference guide.
BB             Buffer-to-buffer credit
FrSz           Frame size
ID/ALPA        Host ID
P              Port number (1 or 2)
S              Status:
               N=online
               F=offline
The following tables detail the remaining portions of the Fibre Channel Host Status
Display. Table 2–10 includes the labels that report the status of ports one and two, and
Table 2–11 describes the Link Error Counters.
Table 2–10: Fibre Channel Host Status Display—Port Status
Field Label       Description
TACHYON Status    This denotes the current state of the TACHYON or Fibre
                  Channel control chip. See “TACHYON Chip Status” on
                  page 2–28 for more detail.
Queue Depth       Queue depth shows the instantaneous number of commands
                  at the controller port.
Busy/QFull Rsp    This field represents the total number of QFull/Busy
                  responses sent by the port.
Table 2–11: Fibre Channel Host Status Display—Link Error Counters
Field Label       Description
Link Downs        This field refers to the total number of link down/up
                  transitions.
Soft Inits        Soft initializations are the number of loop
                  initializations caused by this port.
Hard Inits        Hard initializations indicate the number of TACHYON chip
                  resets.
Loss of Signals   Loss of signals show the number of times the Frame
                  Manager detected a low-to-high transition on the
                  lnk_unuse signal.
Bad Rx Chars      This field represents the number of times the 8B/10B
                  decode detected an invalid 10-bit code. FC-PH denotes
                  this value as “Invalid Transmission Word during frame
                  reception.” This field may be non-zero after
                  initialization. After initialization, the host should
                  read this value to determine the correct starting value
                  for this error count.
Loss of Syncs     Loss of Sync denotes the number of times the loss of
                  sync is greater than RT_TOV.
Link Fails        This field indicates the number of times the Frame
                  Manager detected a NOS or other initialization protocol
                  failure that caused a transition to the Link Failure
                  state.
Received EOFa     Received EOFa refers to the number of frames containing
                  an EOFa delimiter that the TACHYON chip has received.
Generated EOFa    This field reveals the number of problem frames that the
                  TACHYON chip has received that caused the Frame Manager
                  to attach an EOFa delimiter. Frames that the TACHYON
                  chip discarded due to internal FIFO overflow are not
                  included in this or any other statistic.
Bad CRCs          Bad CRCs denotes the number of bad CRC frames that the
                  TACHYON chip has received.
Protocol Errors   This field indicates the number of protocol errors that
                  the Frame Manager has detected.
Elastic Errors    Elastic errors reveal the timing difference between the
                  receive and transmit clocks and usually indicate cable
                  pulls.
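Because a counter such as Bad Rx Chars can be non-zero immediately after initialization, error counts are easier to judge as growth from a recorded starting value than as absolute numbers. The Python sketch below shows that bookkeeping; the function name and the sample readings are hypothetical, and treating a value lower than the baseline as a sign that the counters were cleared is an assumption.

```python
def counter_growth(baseline, current):
    """Return how much each link error counter grew since the baseline."""
    growth = {}
    for name, value in current.items():
        start = baseline.get(name, 0)
        # A value below the baseline suggests the counters were cleared
        # (for example with the VTDPY CLEAR command), so count from zero.
        growth[name] = value - start if value >= start else value
    return growth


baseline = {"Bad Rx Chars": 14, "Link Downs": 0}  # hypothetical readings
later = {"Bad Rx Chars": 14, "Link Downs": 2}
print(counter_growth(baseline, later))
```

In this example Bad Rx Chars shows no growth even though its absolute value is 14, while Link Downs has grown by 2.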
TACHYON Chip Status
The number that appears in the TACHYON Status field represents the current state of
the TACHYON or Fibre Channel control chip. It consists of a two-digit hexadecimal
number, the first of which is explained in Table 2–12. The second digit is outlined in
Table 2–13. Refer to the Hewlett-Packard TACHYON user manual for a more detailed
explanation of the TACHYON chip definitions.
Table 2–12: First Digit on the TACHYON Chip
State    Definition          State    Definition
0        MONITORING          8        INITIALIZING
1        ARBITRATING         9        O_I INIT FINISH
2        ARBITRATION WON     a        O_I PROTOCOL
3        OPEN                b        O_I LIP RECEIVED
4        OPENED              c        HOST CONTROL
5        XMITTED CLOSE       d        LOOP FAIL
6        RECEIVED CLOSE      f        OLD PORT
7        TRANSFER
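The first digit of the TACHYON Status value can be translated with a simple lookup built from Table 2–12. The Python sketch below does only that; the function name is hypothetical, the second digit is returned untranslated, and a first digit that the table does not list (such as e) is reported as undefined rather than guessed.

```python
# State names copied from Table 2-12 (first digit of the status value).
FIRST_DIGIT_STATES = {
    "0": "MONITORING",
    "1": "ARBITRATING",
    "2": "ARBITRATION WON",
    "3": "OPEN",
    "4": "OPENED",
    "5": "XMITTED CLOSE",
    "6": "RECEIVED CLOSE",
    "7": "TRANSFER",
    "8": "INITIALIZING",
    "9": "O_I INIT FINISH",
    "a": "O_I PROTOCOL",
    "b": "O_I LIP RECEIVED",
    "c": "HOST CONTROL",
    "d": "LOOP FAIL",
    "f": "OLD PORT",
}


def decode_tachyon_status(status):
    """Split a two-digit hexadecimal status and name the first digit."""
    status = status.strip().lower()
    if len(status) != 2 or any(ch not in "0123456789abcdef" for ch in status):
        raise ValueError("expected a two-digit hexadecimal value")
    first, second = status[0], status[1]
    state = FIRST_DIGIT_STATES.get(first, "undefined in Table 2-12")
    return state, second


print(decode_tachyon_status("4A"))
```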
Use the REMOTE screen to check the runtime status of all remote copy sets.
Table 2–14 provides a description of the REMOTE screen column headings and
possible entries under each column.
NOTE: This feature is only supported in ACS version 8.7P.
Table 2–14: Remote Display Column Definitions— ACS Version 8.7P Only
Column      Contents
COPY SET    Remote copy set name
TARGET      Target connection name and target unit number
C           Connection status:
            U=connection Up (online)
            D=connection Down (offline)
INIT        Initiator unit number
%CPY        Percent of copy process completed
Device Port Configuration
VTDPY displays device port configuration information in a block of tabular data in
the DEFAULT and DEVICE screens only. The information is arranged in a grid with
the port numbers listed along the vertical axis and the targets on each port listed along
the horizontal axis. The word “Port” is spelled out vertically to denote the port
numbers. The screen shows the usage of each port/target combination with a code in
the array as shown below. Field information is explained in Table 2–15.
VTDPY displays information on policy processor threads using a block of tabular data
in the DEFAULT and STATUS screens only. Thread data is located on the left side of
both screens (see Figure 2–2 and Figure 2–3) and contains fields described in
Table 2–16 and Table 2–17.
Pr         Thread priority. The higher the number, the higher the priority.
Name       Thread name. For DUP Local Program threads, use the name in the
           Name field to invoke the program.
Stk/Max    Allocated stack size in 512-byte pages. The Max column lists the
           number of stack pages actually used.
Typ        Thread type:
           FNC=Functional thread. Those threads that are started when the
           controller boots and never exit.
           DUP=DUP local program threads. Those threads that are only
           active when run either from a DUP connection or through the
           command line interface RUN command.
           NULL=a special type of thread that only executes when no other
           thread is executable.
Sta        Current thread state:
           Bl=The thread is blocked waiting for timer expiration,
           resources, or a synchronization event.
           Io=A DUP local program is blocked waiting for terminal I/O
           completion.
           Rn=The thread is currently executable.
CPU%       Shows the percentage of execution time credited to each thread
           since the last screen update. The values might not total 100%
           due to rounding errors and the fact that there might not be
           enough room to display all of the threads. An unexpected amount
           of time can be credited to some threads because the controller
           firmware architecture allows code from one thread to execute in
           the context of another thread without a context switch.
Table 2–17: VTDPY Thread Descriptions
Thread     Description
CLI        A local program that provides an interface to the controller
           command line interface thread.
CLIMAIN    Command line interface (CLI).
CONFIG     A local program that locates and adds devices to a
           configuration.
DILX       A local program that exercises disk devices.
DIRECT     A local program that returns a listing of available local
           programs.
DS_0       A device error recovery management thread.
DS_1       The thread that handles successful completion of physical
           device requests.
DS_HB      The thread that manages the device and controller error
           indicator lights and port reset buttons.
DUART      The console terminal interface thread.
DUP        The DUP protocol thread.
FMTHRD     The thread that performs error log formatting and fault
           reporting for the controller.
FOC        The thread that manages communication between the controllers
           in a dual controller configuration.
HP_MAIN    Host port work queue handler. Handles all work from the host
           port such as new I/O and completion of I/O.
MDATA      The thread that processes metadata for nontransportable disks.
NULL       The process that is scheduled when no other process can be run.
NVFOC      The thread that initiates state change requests for the other
           controller in a dual controller configuration.
REMOTE     The thread that manages state changes initiated by the other
           controller in a dual controller configuration.
RMGR       The thread that manages the data buffer pool.
RECON      The thread that rebuilds the parity blocks on RAID 5
           storagesets when needed and manages mirrorset copy operations
           when necessary.
VA         The thread that provides logical unit services independent of the
VTDPY      A local program that provides a dynamic display of controller
           configuration and performance information.
Resource Performance Statistics
VTDPY displays resource performance statistics as a block of tabular data on the
RESOURCE screen only. The resource names and statistical data are located along the
left side of the screen (see Figure 2–7). Table 2–18 defines the resource name and
statistical fields.
Table 2–18: Resource Performance Statistics Definitions (Sheet 1 of 2)
Column             Contents
Resource Name      Name of the physical resource
Free               Current resources not being used
Need               Number of resources required for the specific transaction
Wait               Number of transactions waiting to be accomplished
Buffers            Number of cache data buffers available for holding data
VAXDs              Number of value-added transfer descriptors that manage the
                   actual device I/O operations within the controller
WARPs              Number of write algorithm request packets that manage data for
                   RAID level 5 writes
RMDs               Number of RAID member data descriptors that manage data for
                   RAID level 5 writes
XBUFs              Number of XOR buffers used by the FX chip for XOR operations
ZBUFs              Number of zeroed XBUFs used by the FX chip for XOR
                   operations
Disk Read DWDs     Number of device work descriptors that process work requests for
                   disk reads
Disk Write DWDs    Number of device work descriptors that process work requests for
                   disk writes
and erase commands to randomly-chosen LBNs. The ratio of these commands can
be manually set, as well as the percentage of read and write data that is compared
throughout this test. This test takes 6 minutes.
• Data-transfer test. Tests throughput by starting at an LBN and transferring data
to the next unwritten LBN. This test takes 2 minutes.
• Seek test. Stimulates head motion on the unit by issuing single-sector erase and
access commands. Each I/O uses a different track on each subsequent transfer.
The ratio of access and erase commands can be manually set. This test takes 2
minutes.
Table 2–20: Data Patterns for Phase 1: Write Test (Sheet 1 of 2)
NOTE: Use the command sequences shown in Table 2–19 to control the test.
DILX Error Codes
Table 2–21 explains the error codes that DILX might display during and after testing.
Table 2–21: DILX Error Codes
Error Code   Message and Explanation
1            Illegal Data Pattern Number found in data pattern header.
             Explanation: DILX read data from the unit and discovered that the
             data did not conform to the pattern that DILX had previously
             written.
2            No write buffers correspond to data pattern.
             Explanation: DILX read a legal data pattern from the unit, but
             because no write buffers correspond to the pattern, the data must
             be considered corrupt.
3            Read data does not match write buffer.
             Explanation: DILX compared the read and write data and
             discovered that they did not correspond.
4            Compare host data should have reported a compare error but did
             not.
             Explanation: A compare host data compare was issued in a way
             that DILX expected to receive a compare error, but no error was
             received.
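When DILX output is captured in a log, a small lookup can attach the Table 2–21 message text to each code. The following is a minimal sketch only; the dictionary and function names are hypothetical, and the code-to-message pairing assumes the messages appear in the table in the same order as the codes.

```python
# Messages from Table 2-21, keyed by DILX error code.
# The pairing assumes the table lists messages in code order.
DILX_ERROR_CODES = {
    1: "Illegal Data Pattern Number found in data pattern header.",
    2: "No write buffers correspond to data pattern.",
    3: "Read data does not match write buffer.",
    4: "Compare host data should have reported a compare error but did not.",
}


def describe_dilx_error(code):
    """Return the Table 2-21 message for a code, or a fallback string."""
    return DILX_ERROR_CODES.get(code, f"Unknown DILX error code {code}")


print(describe_dilx_error(3))  # Read data does not match write buffer.
```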
Format and Device Code Load Utility (HSUTIL)
Use the HSUTIL utility to upgrade the firmware on disk drives in the subsystem and
to format disk drives. While formatting disk drives or installing new firmware,
HSUTIL might produce one or more of the messages shown in Table 2–22 (many of
the self-explanatory messages have been omitted from the table).
Table 2–22: HSUTIL Messages and Inquiries
Message
    Description

Insufficient resources.
    HSUTIL cannot find or perform the operation because internal
Unable to change operation mode to maintenance for unit.
    HSUTIL was unable to put the source single-disk drive unit into
    maintenance mode to enable formatting or code load.
Unit successfully allocated.
    HSUTIL has allocated the single-disk drive unit for code load
    operation. At this point, the unit and the associated device are not
    available for other subsystem operations.
Unable to allocate unit.
    HSUTIL could not allocate the single-disk drive unit. An
    accompanying message explains the reason.
Unit is owned by another sysop.
    Device cannot be allocated because the device is being used by
    another subsystem function or local program.
Unit is in maintenance mode.
    Device cannot be formatted or code loaded because the device is
    being used by another subsystem function or local program.
Exclusive access is declared for unit.
    Another subsystem function has reserved the unit shown.
The other controller has exclusive access declared for unit.
    The companion controller has locked out this controller from
    accessing the unit shown.
The RUNSTOP_SWITCH is set to RUN_DISABLED for unit.
    The RUN\NORUN unit indicator for the unit shown is set to
    NORUN; the disk cannot spin up.
What BUFFER SIZE (in BYTES) does the drive require (2048, 4096, 8192) [8192]?
    HSUTIL detects that an unsupported device has been selected as
    the target device and the firmware image requires multiple SCSI
    Write Buffer commands. Specify the number of bytes to be sent in
    each Write Buffer command. The default buffer size is 8192 bytes.
    A firmware image of 256 K, for example, can be code loaded in 32
    Write Buffer commands, each transferring 8192 bytes.
What is the TOTAL SIZE of the code image in BYTES [device default]?
    HSUTIL detects that an unsupported device has been selected as
    the target device. Enter the total number of bytes of data to be
    sent in the code load operation.
Does the target device support only the download microcode and save?
    HSUTIL detects that an unsupported device has been selected as
    the target device. Specify whether the device supports the SCSI
    Write Buffer command download and save function.
Should the code be downloaded with a single write buffer command?
    HSUTIL detects that an unsupported device has been selected as
    the target device. Indicate whether to download the firmware
    image to the device in one or more contiguous blocks, each
    corresponding to one SCSI Write Buffer command.
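The buffer size example in Table 2–22 (a 256 K image loaded in 32 Write Buffer commands of 8192 bytes each) is a ceiling division. The sketch below reproduces that arithmetic; the function name is hypothetical, and rounding a final partial buffer up to one additional command is an assumption rather than documented HSUTIL behavior.

```python
# Buffer sizes offered by the HSUTIL BUFFER SIZE prompt.
ALLOWED_BUFFER_SIZES = (2048, 4096, 8192)


def write_buffer_commands(image_size_bytes, buffer_size_bytes=8192):
    """Return how many Write Buffer commands are needed for an image.

    A final partial buffer is counted as one extra command (assumption).
    """
    if buffer_size_bytes not in ALLOWED_BUFFER_SIZES:
        raise ValueError("buffer size must be 2048, 4096, or 8192 bytes")
    if image_size_bytes <= 0:
        raise ValueError("image size must be a positive number of bytes")
    # Ceiling division without floating point.
    return -(-image_size_bytes // buffer_size_bytes)


print(write_buffer_commands(256 * 1024))        # 32
print(write_buffer_commands(256 * 1024, 2048))  # 128
```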
Configuration (CONFIG) Utility
Use the CONFIG utility to add one or more storage devices to the subsystem. This
utility checks the device ports for new disk drives, adds them to the controller
configuration, and automatically names them. Refer to the controller installation and
configuration guide for more information about using the CONFIG utility.
Code Load and Code Patch (CLCP) Utility
Use the CLCP utility to upgrade the controller software and the EMU software. Also
use CLCP to patch the controller software. To successfully install a new controller, the
correct (or current) software version and patch numbers must be available. See the
controller maintenance and service guide for more information about using this utility
during a replacement or upgrade process.
NOTE: Only StorageWorks authorized service providers are allowed to upload EMU microcode
updates. Contact the Customer Service Center (CSC) for directions to obtain the appropriate
EMU microcode and installation guide.
Clone (CLONE) Utility
Use the CLONE utility to duplicate the data on any unpartitioned single-disk unit,
stripeset, mirrorset, or striped mirrorset. Back up the cloned data while the actual
storageset remains online. When the cloning operation is done, back up the clones
rather than the storageset or single-disk unit, which can continue to service the I/O
load. When cloning a mirrorset, the CLONE utility does not need to create a
temporary mirrorset. Instead, the CLONE utility adds a temporary member to the
mirrorset and copies the data onto this new member.
The CLONE utility creates a temporary, two-member mirrorset for each member in a
single-disk unit or stripeset. Each temporary mirrorset contains one disk drive from
the unit being cloned and one disk drive onto which the CLONE utility copies the
data. During the copy operation, the unit remains online and active so the clones
contain the most up-to-date data.
After the CLONE utility copies the data from the members to the clones, the CLONE
utility restores the unit to the original configuration and creates a clone unit for backup
purposes.
Field Replacement Utility (FRUTIL)
Use FRUTIL to replace a failed controller, cache module, or ECB, in a dual-redundant
controller configuration, without shutting down the subsystem. See the controller
maintenance and service guide for a more detailed explanation of how FRUTIL is
used during the replacement process.
IMPORTANT: FRUTIL cannot run in remote copy set environments while I/O is in progress to
the target side due to host write and normalization (ACS version 8.7P only).
Change Volume Serial Number (CHVSN) Utility
The CHVSN utility generates a new volume serial number (called VSN) for the
specified device and writes the VSN on the media. The CHVSN utility is used to
eliminate duplicate volume serial numbers and to rename duplicates with different
volume serial numbers.
NOTE: Only StorageWorks authorized service providers can use this utility.
This chapter describes the event codes that the fault management software provides
for spontaneous events and last failure events.
The HSG80 controller uses various codes to report different types of events, and these
codes are presented in template displays.
• Instance codes are unique codes that identify events.
• Additional sense codes (ASC) and additional sense code qualifier (ASCQ) codes
explain the cause of the events.
• Last failure codes describe unrecoverable conditions that might occur with the
controller.
NOTE: The error log messages in this chapter are used for all StorageWorks controller devices;
therefore, some of the events reported in this chapter might not be applicable to the HSG80
controller.
Passthrough Device Reset Event Sense Data Response
Events reported by passthrough devices during host/device operations are conveyed
directly to the host system without intervention or interpretation by the HSG80
controller, with the exception of device sense data that is truncated to 160 bytes when
it exceeds 160 bytes.
Events related to passthrough device recognition, initialization, and SCSI bus
communication result in a reset of the passthrough device by the HSG80
controller. These events are reported using standard SCSI Sense Data (see Table 3–1).
For all other events, refer to the templates contained within this section.
• ASC and ASCQ codes (byte offsets 12 and 13) are detailed in Chapter 4.
• Instance codes (byte offsets 8–11) are detailed in Chapter 5.
Table 3–1: Passthrough Device Reset Event Sense Data Response Format
Byte offset   Fields (bit 7 → bit 0)
0             Valid | Error Code
1             Segment
2             FM | EOM | ILI | Reserved | Sense Key
3–6           Information
7             Additional Sense Length
8–11          Instance Code
12            Additional Sense Code (ASC)
13            Additional Sense Code Qualifier (ASCQ)
14            Field Replaceable Unit Code
15            SKSV | Sense Key Specific
16            Sense Key Specific
17            Sense Key Specific
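The layout in Table 3–1 can be decoded on the host side with a few mask and slice operations. The following is a minimal sketch, not controller or host driver code. The function name and the sample buffer are hypothetical, and the bit positions within bytes 0, 2, and 15 are assumed to follow standard SCSI fixed-format sense data (Valid, FM, and SKSV in bit 7; Sense Key in bits 3–0). Multi-byte fields are returned as raw bytes because this section does not state their byte order.

```python
def parse_passthrough_reset_sense(data):
    """Split an 18-byte (or longer) sense buffer into the Table 3-1 fields."""
    if len(data) < 18:
        raise ValueError("need at least 18 bytes of sense data")
    return {
        "valid": bool(data[0] & 0x80),         # byte 0, bit 7 (assumed)
        "error_code": data[0] & 0x7F,          # byte 0, bits 6-0 (assumed)
        "segment": data[1],
        "fm": bool(data[2] & 0x80),            # byte 2, bit 7 (assumed)
        "eom": bool(data[2] & 0x40),           # byte 2, bit 6 (assumed)
        "ili": bool(data[2] & 0x20),           # byte 2, bit 5 (assumed)
        "sense_key": data[2] & 0x0F,           # byte 2, bits 3-0 (assumed)
        "information": bytes(data[3:7]),
        "additional_sense_length": data[7],
        "instance_code": bytes(data[8:12]),    # raw bytes; see Chapter 5
        "asc": data[12],                       # see Chapter 4
        "ascq": data[13],                      # see Chapter 4
        "fru_code": data[14],
        "sksv": bool(data[15] & 0x80),         # byte 15, bit 7 (assumed)
        "sense_key_specific": bytes([data[15] & 0x7F]) + bytes(data[16:18]),
    }


# Hypothetical sample buffer, for illustration only.
sample = bytes([0xF0, 0x00, 0x06, 0, 0, 0, 0, 0x0A,
                0x01, 0x02, 0x03, 0x04, 0x29, 0x00, 0x00, 0x80, 0x00, 0x00])
fields = parse_passthrough_reset_sense(sample)
print(hex(fields["error_code"]), fields["sense_key"], fields["instance_code"].hex())
```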
Last Failure Event Sense Data Response (Template 01)
Unrecoverable conditions detected by either software or hardware, and certain
operator-initiated conditions, terminate controller operation. In most cases, following
such a termination, the controller attempts to restart with hardware components and
software data structures initialized to the states necessary to perform normal
operations (see Table 3–2). Following a successful restart, the condition that caused
controller operation to terminate is signaled to all host systems on all logical units.
NOTE: For ACS version 8.7P configurations, last failure events generated by the target are not
signaled to any host unless the host has a direct connection to the target that does not pass
through the initiator. In addition, these events might not appear on the initiator.
• ASC and ASCQ codes (byte offsets 12 and 13) are detailed in Chapter 4.
• Instance codes (byte offsets 32–35) are detailed in Chapter 5.
• Last failure codes (byte offsets 104–107) are detailed in Chapter 6.
Multiple-Bus Failover Event Sense Data Response
(Template 04)
The controller SCSI Host Interconnect Services software component reports
Multiple-Bus Failover events via the Multiple-Bus Failover Event Sense Data
Response (see Table 3–3). The error or condition is signaled to all host systems on all
logical units.
• ASC and ASCQ codes (byte offsets 12 and 13) are detailed in Chapter 4.
• Instance codes (byte offsets 32–35) are detailed in Chapter 5.
Table 3–3: Template 04—Multiple-Bus Failover Event Sense Data Response
Format (Sheet 1 of 2)
Byte offset   Fields (bit 7 → bit 0)
0             Unused | Error Code
1             Unused
2             Unused | Sense Key
3–6           Unused
7             Additional Sense Length
8–11          Unused
12            Additional Sense Code (ASC)
13            Additional Sense Code Qualifier (ASCQ)
14            Unused
15–17         Unused
18–26         Reserved
27            Failed Controller Target Number
28–31         Affected LUNs
32–35         Instance Code
36            Template
37            Template Flags
38–53         Other Controller Board Serial Number
54–69         Controller Board Serial Number
70–73         Controller Software Revision Level
74            Reserved or Patch Version (TM2)
75            Reserved
76            LUN Status
The controller Failover Control software component reports errors and other
conditions encountered during redundant controller communications and failover
operation via the Failover Event Sense Data Response (see Table 3–4). The error or
condition is signaled to all host systems on all logical units.
• ASC and ASCQ codes (byte offsets 12 and 13) are detailed in Chapter 4.
• Instance codes (byte offsets 32–35) are detailed in Chapter 5.
• Last failure codes (byte offsets 104–107) are detailed in Chapter 6.
Nonvolatile Parameter Memory Component Event
Sense Data Response (Template 11)
The controller executive software component reports errors detected while accessing a
nonvolatile parameter memory component via the Nonvolatile Parameter Memory
Component Event Sense Data Response (see Table 3–5). Errors are signaled to all host
systems on all logical units.
• ASC and ASCQ codes (byte offsets 12 and 13) are detailed in Chapter 4.
• Instance codes (byte offsets 32–35) are detailed in Chapter 5.
Backup Battery Failure Event Sense Data Response
(Template 12)
The controller Value Added Services software component reports backup battery
failure conditions for the various hardware components that use a battery to maintain
state during power failures via the Backup Battery Failure Event Sense Data Response
(see Table 3–6). The failure condition is signaled to all host systems on all logical
units.
• ASC and ASCQ codes (byte offsets 12 and 13) are detailed in Chapter 4.
• Instance codes (byte offsets 32–35) are detailed in Chapter 5.
Table 3–6: Template 12—Backup Battery Failure Event Sense Data Response
Format
Byte offset   Fields (bit 7 → bit 0)
0             Unused | Error Code
1             Unused
2             Unused | Sense Key
3–6           Unused
7             Additional Sense Length
8–11          Unused
12            Additional Sense Code (ASC)
13            Additional Sense Code Qualifier (ASCQ)
14            Unused
15–17         Unused
18–31         Reserved
32–35         Instance Code
36            Template
37            Template Flags
38–53         Reserved
54–69         Controller Board Serial Number
70–73         Controller Software Revision Level
74            Reserved or Patch Version (TM2)
75            Reserved
76            LUN Status
77–103        Reserved
104–107       Memory Address
108–159       Reserved
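Because the template formats place each field at a fixed byte offset, a host-side tool can extract them with a simple offset map. The sketch below does this for the Table 3–6 layout. The dictionary and function names are hypothetical, the 160-byte length is inferred from the last row of the table (offsets 108–159), and every field is returned as raw bytes because this section does not specify byte order or character encoding.

```python
# Inclusive (first, last) byte offsets taken from Table 3-6.
TEMPLATE_12_FIELDS = {
    "asc": (12, 12),
    "ascq": (13, 13),
    "instance_code": (32, 35),
    "template": (36, 36),
    "template_flags": (37, 37),
    "controller_board_serial_number": (54, 69),
    "controller_software_revision_level": (70, 73),
    "reserved_or_patch_version": (74, 74),
    "lun_status": (76, 76),
    "memory_address": (104, 107),
}


def extract_fields(sense, layout=TEMPLATE_12_FIELDS):
    """Return each named field of a 160-byte sense buffer as raw bytes."""
    if len(sense) < 160:
        raise ValueError("template sense data is expected to be 160 bytes")
    return {name: bytes(sense[first:last + 1])
            for name, (first, last) in layout.items()}


# Hypothetical buffer filled with placeholder values, for illustration only.
buffer = bytearray(160)
buffer[104:108] = b"\xde\xad\xbe\xef"
print(extract_fields(buffer)["memory_address"].hex())  # deadbeef
```

Passing a different offset map as `layout` would let the same function handle the other templates in this chapter.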
Subsystem Built-In Self-Test Failure Event Sense Data
Response (Template 13)
The controller Subsystem Built-In Self-Test software component reports errors
detected during test execution via the Subsystem Built-In Self-Test Failure Event
Sense Data Response (see Table 3–7). Errors are signaled to all host systems on all
logical units.
• ASC and ASCQ codes (byte offsets 12 and 13) are detailed in Chapter 4.
• Instance codes (byte offsets 32–35) are detailed in Chapter 5.
Table 3–7: Template 13—Subsystem Built-In Self Test Failure Event Sense Data
Response Format (Sheet 1 of 2)
Byte offset   Fields (bit 7 → bit 0)
0             Unused | Error Code
1             Unused
2             Unused | Sense Key
3–6           Unused
7             Additional Sense Length
8–11          Unused
12            Additional Sense Code (ASC)
13            Additional Sense Code Qualifier (ASCQ)
14            Unused
15–17         Unused
18–31         Reserved
32–35         Instance Code
36            Template
37            Template Flags
38–53         Reserved