While Digital Equipment Corporation believes the information included in this manual
is correct as of the date of publication, it is subject to change without notice. DIGITAL
makes no representations that the interconnection of its products in the manner
described in this document will not infringe existing or future patent rights, nor do the
descriptions contained in this document imply the granting of licenses to make, use, or
sell equipment or software in accordance with the description. No responsibility is
assumed for the use or reliability of firmware on equipment not supplied by DIGITAL
or its affiliated companies. Possession, use, or copying of the software or firmware
described in this documentation is authorized only pursuant to a valid written license
from DIGITAL, an authorized sublicensor, or the identified licensor.
Commercial Computer Software, Computer Software Documentation and Technical
Data for Commercial Items are licensed to the U.S. Government with DIGITAL’s
standard commercial license and, when applicable, the rights in DFAR 252.227-7015,
“Technical Data—Commercial Items.”
Alpha, CI, DCL, DECconnect, DECserver, DIGITAL, DSSI, HSC, HSJ, HSD, HSZ,
MSCP, OpenVMS, StorageWorks, TMSCP, VAX, VAXcluster, VAX 7000, VAX
10000, VMS, VMScluster, and the DIGITAL logo are trademarks of Digital Equipment
Corporation. All other trademarks and registered trademarks are the property of their
respective holders.
This equipment has been tested and found to comply with the limits for a Class A
digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to
provide reasonable protection against harmful interference when the equipment is
operated in a commercial environment. This equipment generates, uses and can radiate
radio frequency energy and, if not installed and used in accordance with the instruction
manual, may cause harmful interference to radio communications. Operation of this
equipment in a residential area is likely to cause harmful interference in which case the
user will be required to correct the interference at his own expense. Restrictions apply
to the use of the local-connection port on this series of controllers; failure to observe
these restrictions may result in harmful interference. Always disconnect this port as
soon as possible after completing the setup operation. Any changes or modifications
made to this equipment may void the user's authority to operate the equipment.
Warning!
This is a Class A product. In a domestic environment this product may cause radio
interference in which case the user may be required to take adequate measures.
Achtung!
Dieses ist ein Gerät der Funkstörgrenzwertklasse A. In Wohnbereichen können bei
Betrieb dieses Gerätes Rundfunkstörungen auftreten, in welchen Fällen der Benutzer
für entsprechende Gegenmaßnahmen verantwortlich ist.
Avertissement!
Cet appareil est un appareil de Classe A. Dans un environnement résidentiel cet
appareil peut provoquer des brouillages radioélectriques. Dans ce cas, il peut être
demandé à l’utilisateur de prendre les mesures appropriées.
This chapter is designed to help you quickly isolate the source of any
problems you might encounter when you service the StorageWorks
HSZ50 controllers, and take the necessary steps to correct the
problems.
Interpreting controller LED codes
This section provides information on how to interpret controller
LED codes. The operator control panel (OCP) on each HSZ
controller contains a green reset LED and six device bus LEDs.
These LEDs light in patterns to display codes when there is a
problem with a device configuration, a device, or a controller.
• During normal operation, the green reset LED on each
controller flashes once per second, and the device bus LEDs are
not lit.
• The amber LED for a device bus lights continuously when the
installed devices do not match the controller configuration, or
when a device fault occurs.
• The green reset LED lights continuously and the amber LEDs
display a code when a controller problem occurs. Solid LED
codes indicate a fault detected by internal diagnostic and
initialization routines. Flashing LED codes indicate a fault that
occurred during core diagnostics.
Look up the LED code that is showing on your controller in Table
1–1 or Table 1–2 to determine its meaning and find the corrective
action. The symbols used in the tables have the following meanings:
O = LED on
P = LED off
M = LED flashing
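The rules above can be sketched as a small classifier. This is an illustrative sketch only, not part of the controller firmware or any DIGITAL tool. It assumes a pattern is written as seven symbols, the reset LED first and then the six device bus LEDs, using the O, P, and M symbols defined above. The function name is hypothetical.

```python
def classify_led_code(code: str) -> str:
    """Classify a 7-symbol OCP pattern (reset LED first, then six bus LEDs)."""
    if len(code) != 7 or set(code) - set("OPM"):
        raise ValueError("expected 7 symbols drawn from O, P, and M")
    reset, bus = code[0], code[1:]
    # Normal operation: reset LED flashing, no device bus LED lit.
    if reset == "M" and bus == "PPPPPP":
        return "normal operation"
    # Controller problem: reset LED lit continuously, bus LEDs show a code.
    if reset == "O" and "M" in bus:
        return "flashing code: fault occurred during core diagnostics"
    if reset == "O":
        return "solid code: fault detected by internal diagnostic and initialization routines"
    # Anything else is not a controller code (for example, a device bus LED
    # lit continuously for a configuration mismatch or device fault).
    return "other pattern: check device configuration and devices"


print(classify_led_code("OPPPPPM"))
```

A solid result points to Table 1–1 and a flashing result points to Table 1–2.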
Service ManualHSZ50 Array Controller
Troubleshooting1–3
Table 1–1 Solid controller LED codes
Code | Description of Error | Corrective Action
OOOOOOO | DAEMON hard error | Replace controller module.
OOOOOOP | Repeated firmware bugcheck | Replace controller module.
OOOOOPO | NVMEM version mismatch | Replace program card with later version of firmware.
OOOOOPP | NVMEM write error | Replace controller module.
OOOOPOO | NVMEM read error | Replace controller module.
OOOOPOP | NMI error within firmware bugcheck | Reset the controller.
OOOOPPO | Inconsistent NVMEM structures repaired | Reset the controller.
OOOOPPP | Bugcheck with no restart | Reset the controller.
OOOPOOO | Firmware induced restart following bugcheck failed to occur | Replace controller module.
OOOPOOP | Hardware induced restart following bugcheck failed to occur | Replace controller module.
OOOPOPO | Bugcheck within bugcheck controller | Reset controller module.
OOOPPOO | NVMEM version is too low | Verify the card is the latest revision. If the problem still exists, replace the module.
OOOPPOP | Program card write fail | Replace the card.
OOOPPPO | ILF, INIT unable to allocate memory | Reset the controller.
OOOPPPP | Bugcheck before subsystem initialization completed | Reset the controller.
OPPPPPP | No program card seen | Try the card in another module. If the problem follows the card, replace the card. Otherwise, replace the controller.
Table 1–2 Flashing controller LED codes
Code | Description of Error | Corrective Action
OPPPPPM | Program card EDC error | Replace program card.
OPPPMPP | Timer zero in the timer chip will run when disabled | Replace controller module.
OPPPMPM | Timer zero in the timer chip decrements incorrectly | Replace controller module.
OPPPMMP | Timer zero in the timer chip did not interrupt the processor when requested | Replace controller module.
OPPPMMM | Timer one in the timer chip decrements incorrectly | Replace controller module.
OPPMPPP | Timer one in the timer chip did not interrupt the processor when requested | Replace controller module.
OPPMPPM | Timer two in the timer chip decrements incorrectly | Replace controller module.
OPPMPMP | Timer two in the timer chip did not interrupt the processor when requested | Replace controller module.
OPPMPMM | Memory failure in the I/D cache | Replace controller module.
OPPMMPP | No hit or miss to the I/D cache when expected | Replace controller module.
OPPMMPM | One or more bits in the diagnostic registers did not match the expected reset value | Replace controller module.
OPPMMMP | Memory error in the nonvolatile journal SRAM | Replace controller module.
OPPMMMM | Wrong image seen on program card | Replace program card.
OPMPPPP | At least one register in the controller DRAB does not read as written | Replace controller module.
OPMPPPM | Main memory is fragmented into too many sections for the number of entries in the good memory list | Replace controller module.
OPMPPMP | The controller DRAB or DRAC chip does not arbitrate correctly | Replace controller module.
OPMPPMM | The controller DRAB or DRAC chip failed to detect forced parity, or detected parity when not forced | Replace controller module.
OPMPMPP | The controller DRAB or DRAC chip failed to verify the EDC correctly | Replace controller module.
OPMPMPM | The controller DRAB or DRAC chip failed to report forced ECC | Replace controller module.
OPMPMMP | The controller DRAB or DRAC chip failed some operation in the reporting, validating, and testing of the multibit ECC memory error | Replace controller module.
OPMPMMM | The controller DRAB or DRAC chip failed some operation in the reporting, validating, and testing of the multiple single-bit ECC memory error | Replace controller module.
OPMMPPP | The controller main memory did not write correctly in one or more sized memory transfers | Replace controller module.
OPMMPPM | The controller did not cause an I-to-N bus timeout when accessing a “reset” host port chip | Replace controller module.
OPMMPMP | The controller DRAB or DRAC chip did not report an I-to-N bus timeout when accessing a “reset” host port chip | Replace controller module.
OPMMPMM | The controller DRAB or DRAC chip did not interrupt the controller processor when expected | Replace controller module.
OPMMMPP | The controller DRAB or DRAC chip did not report an NXM error when nonexistent memory was accessed | Replace controller module.
OPMMMPM | The controller DRAB or DRAC chip did not report an address parity error when one was forced | Replace controller module.
OPMMMMP | There was an unexpected nonmaskable interrupt from the controller DRAB or DRAC chip during the DRAB memory test | Replace controller module.
OPMMMMM | Diagnostic register indicates there is no cache module, but an interrupt exists from the non-existent cache module | Replace controller module.
OMPPPPP | The required amount of memory available for the code image to be loaded from the program card is insufficient | Replace controller module.
OMPPPPM | The required amount of memory available in the pool area is insufficient for the controller to run | Replace controller module.
OMPPPMM | The required amount of memory available in the buffer area is insufficient for the controller to run | Replace controller module.
OMPPMPP | The code image was not the same as the image on the card after the contents were copied to memory | Replace controller module.
OMPPMPM | Diagnostic register indicates that the cache module does not exist, but access to that cache module caused an error | Replace controller shelf backplane.
OMPPMMP | Diagnostic register indicates that the cache module does not exist, but access to that cache module did not cause an error | Replace controller shelf backplane.
OMMPPPP | The journal SRAM battery is bad | Replace controller module.
OMMMPMP | There was an unexpected interrupt from a read cache or the present and lock bits are not working correctly | Replace controller module.
OMMMPMM | There is an interrupt pending on the controller’s policy processor when there should be none | Replace controller module.
OMMMMPP | There was an unexpected fault during initialization | Replace controller module.
OMMMMPM | There was an unexpected maskable interrupt received during initialization | Replace controller module.
OMMMMMP | There was an unexpected nonmaskable interrupt received during initialization | Replace controller module.
OMMMMMM | An illegal process was activated during initialization | Replace controller module.
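If you record LED codes during service, a lookup keyed on the code string can return the corrective action from Table 1–1 and Table 1–2. The sketch below is illustrative only: it holds four entries taken from the tables, not the full set, and the names LED_CODES and corrective_action are hypothetical.

```python
# Excerpt only: (description of error, corrective action) keyed by LED code.
LED_CODES = {
    "OOOOOOO": ("DAEMON hard error", "Replace controller module."),
    "OPPPPPP": (
        "No program card seen",
        "Try the card in another module. If the problem follows the card, "
        "replace the card. Otherwise, replace the controller.",
    ),
    "OPPPPPM": ("Program card EDC error", "Replace program card."),
    "OMMMMMM": (
        "An illegal process was activated during initialization",
        "Replace controller module.",
    ),
}


def corrective_action(code):
    """Return the corrective action for a code, or None if it is not listed."""
    entry = LED_CODES.get(code)
    return None if entry is None else entry[1]


print(corrective_action("OPPPPPM"))
```

A code that is missing from the excerpt returns None rather than a guess, so an unlisted pattern sends you back to the printed tables.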
Troubleshooting HSZ50 controllers
This section covers the following topics:
• Troubleshooting when you cannot access HSZ units
• Troubleshooting on DIGITAL UNIX
• VMS host troubleshooting
• Troubleshooting application errors
Troubleshooting when you cannot access host units
If the error that occurred prevents you from accessing units for the
host, determine if any HSZ units can be accessed. If no HSZ units
can be accessed, run the VTDPY display and ensure that the host
established communications with all HSZ target IDs. Refer to the
section later in this chapter on “Monitoring system performance with
the VTDPY utility” for more information about running VTDPY. If
the host has not established communications, one of the following
might be true:
• The host adapter is bad.
• The host SCSI bus is bad or misconfigured.
• The HSZ controller is bad.
To find more information about this error, use the following
procedure from the HSZ console. (If this is a dual controller
configuration, the command must be executed on both controllers.)
1. To determine if the unit is on-line to a controller:
HSZ50> SHOW UNITS FULL
2. Check the following:
– Is the unit on-line or available to this or the other
controller?
– From the HSZ controller to which the unit is on-line, does
the SHOW UNITS command also show the size in blocks?
3. If the answer to both of the questions in step 2 is no, there is a
problem with the HSZ controller. Look for any type of errors in
the SHOW UNITS output, such as Lost Data or Media Format.
4. Run the VTDPY display.
5. Look at the unit status in the VTDPY display. Use the
information in a later section in this chapter, “Monitoring
System Performance with the VTDPY Utility” to interpret the
VTDPY display.
6. If the unit is not on-line or if errors are present in the SHOW
UNITS display, take appropriate action to clear the errors or
rebuild the unit.
Be careful with user’s data. If this is a RAIDset, try to save the
user’s data. Do not initialize the storage unit unless there is no
other alternative.
If you determine that units are on-line and everything seems to be in
order on the HSZ side, proceed to check the host side using the file
utility procedure.
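The decision points in the procedure above can be summarized in a short sketch. It is illustrative only: the function name and its three yes/no inputs are hypothetical, and you supply the answers after reading the SHOW UNITS and VTDPY displays. Combinations that the procedure does not address fall through to the host-side check.

```python
def next_step(online: bool, size_shown: bool, errors_present: bool) -> str:
    """Suggest the next action from the answers gathered in steps 2 through 6."""
    # Step 3: both answers in step 2 are "no".
    if not online and not size_shown:
        return "suspect the HSZ controller; check SHOW UNITS for errors and run VTDPY"
    # Step 6: unit not on-line, or errors shown in the SHOW UNITS display.
    if not online or errors_present:
        return "clear the errors or rebuild the unit"
    # Units on-line and nothing wrong on the HSZ side.
    return "check the host side"


print(next_step(online=True, size_shown=True, errors_present=False))
```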
Troubleshooting on a DIGITAL UNIX system
To troubleshoot on a DIGITAL UNIX system, use the file utility to
access the device. The error message from the file utility might
explain where the problem lies.
Using the DIGITAL UNIX file utility
You can use the DIGITAL UNIX file utility to determine if an HSZ
unit can be accessed from the DIGITAL UNIX host system. In the
following procedure, an HSZ controller has a unit named D101,
which will be used by the file utility.
1. Enter the following command from the HSZ CLI:
HSZ50>SHOW D101
2. Disable the writeback_cache and read_cache on this unit, if they
are both enabled, using the following command:
or disable just the read_cache if it is enabled on the unit with the
following command:
HSZ50>SET D101 noread_cache
Disabling the read_cache causes information to be accessed
from the unit rather than from the cache, if the information is in
cache. This gives a visual indication that the unit is being
accessed.
3. From the DIGITAL UNIX console, issue the file command to
start the file utility. (Assume that the character special file has
been created for rrzb17a.)
/usr/bin/file /dev/rrzb17a
The device activity indicator on the device, the green light,
should light up. If the unit is a multidevice storage unit only one
of the devices that is part of that storage unit lights.
The host system should display the following output after the
file command is issued (the output displays on one line):
/dev/rrzb17a character special (8/mmmm) SCSI #n HSZ50 disk #xxx (SCSI ID #t)
The output values have the following meanings:
– 8 - major number
– mmmm - minor number
– n - SCSI host side bus number
– t - target ID. The “T” in the HSZ50 unit designator DTZL
matches the “t” from the file command.
– xxx - the disk number
4. If an error occurs, use the information in the following table to
evaluate errors or output:
Error or Output | Meaning and action
file: Cannot get file status on /dev/mmmm, followed by /dev/mmmm: Cannot open for reading | Indicates the special file in the /dev directory that matches mmmm does not exist.
Only the major and minor number is returned from the file command | The device is not answering or the device special file does not have the correct minor number. Check the minor number to be sure that it matches the correct SCSI host side bus number and the correct HSZ50 Target ID and LUN from the HSZ50 unit designator.
5. If the unit had write-back cache enabled, remember to enable
the cache again using the following HSZ CLI command that
enables both the write-back and read cache:
HSZ50> SET D101 WRITEBACK_CACHE
6. If the unit had only the read cache enabled, enable the read
cache with this HSZ50 CLI command:
HSZ50> SET D101 READ_CACHE
7. Run VTDPY to ensure the host established communication with
all HSZ target IDs.
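The one-line output described in step 3 can be picked apart with a regular expression when you check many units. The sketch below is illustrative only. It assumes the output has exactly the layout shown in step 3. The minor and disk numbers in the sample line are invented, and both function names are hypothetical. The unit_target helper also assumes a three-digit DTZL designator such as D101.

```python
import re

# Layout taken from the sample output in step 3; the SCSI part is optional
# because a failing device may return only the major and minor numbers.
FILE_OUTPUT = re.compile(
    r"^(?P<path>[^\s:]+):?\s+character special \((?P<major>\d+)/(?P<minor>\d+)\)"
    r"(?:\s+SCSI #\s*(?P<bus>\d+)\s+HSZ50 disk #(?P<disk>\d+)"
    r"\s+\(SCSI ID #(?P<target>\d+)\))?\s*$"
)


def parse_file_output(line):
    """Return the fields of one file-utility output line, or None if no match."""
    m = FILE_OUTPUT.match(line.strip())
    if m is None:
        return None
    return {
        key: (int(val) if val is not None and key != "path" else val)
        for key, val in m.groupdict().items()
    }


def unit_target(unit):
    """Return the T digit of a three-digit DTZL unit name (assumed form)."""
    m = re.fullmatch(r"D(\d)(\d)(\d)", unit.upper())
    if m is None:
        raise ValueError("expected a unit name such as D101")
    return int(m.group(1))


sample = "/dev/rrzb17a character special (8/17408) SCSI #1 HSZ50 disk #136 (SCSI ID #1)"
fields = parse_file_output(sample)
print(fields["target"] == unit_target("D101"))
```

If the bus, disk, and target fields come back as None, only the major and minor numbers were returned, which matches the second entry in the error table in step 4.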
OpenVMS host troubleshooting
If you cannot access HSZ units from an OpenVMS host system, use the
following procedure to troubleshoot:
1. On the VMS system, enter the following command.
$ SHOW DEVICE DK*
Device names will display in the following format:
DKA101
The A in the device name represents a SCSI controller
designation and the 101 represents a unit number on an HSZ or
other SCSI controller. If there was an HSZ unit named D101 on
the HSZ whose letter designation was A, that would be the VMS
device DKA101.
If there are multiple SCSI controllers, there would be a different
controller letter designation, for example DKA, DKB, and so
forth.
The SHOW DEVICE FULL command also would give the
controller type. If the device was configured on an HSZ
controller, HSZ would appear in the device information.
2. The SHOW DEVICE DK* command should display the HSZ
unit. If the unit is not displayed, follow the procedures in the
previous section to determine if the unit is on-line.
3. If the unit is on-line to the HSZ, run the SYSMAN utility on the
VMS system to ensure the device is configured.
$ MC SYSMAN
SYSMAN> IO AUTOCONFIGURE
SYSMAN> EXIT
4. If you still cannot see the unit, check the error logs for SCSI
errors. The problem could be due to a bad host adapter, SCSI
bus problem, or the HSZ.
5. Use the VTDPY display to ensure the host adapter established
connectivity to all HSZ target IDs. The host port portion of the
VTDPY display should show all HSZ target IDs, and the rate
should be 10 MHz.
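The device naming described in step 1 can be sketched as a small helper that splits a DK device name into its controller letter and unit number, and gives the matching HSZ unit name. This is illustrative only. It assumes a bare name of the form DKA101, with an optional trailing colon and no node or allocation-class prefix. The function name is hypothetical.

```python
import re


def split_dk_device(name):
    """Split a name such as DKA101 into (controller letter, unit number, HSZ unit)."""
    m = re.fullmatch(r"DK([A-Z])(\d+):?", name.strip().upper())
    if m is None:
        raise ValueError(f"not a DK device name: {name!r}")
    letter, unit = m.group(1), int(m.group(2))
    # HSZ unit D101 on controller letter A appears to VMS as DKA101.
    return letter, unit, f"D{unit}"


print(split_dk_device("DKA101"))
```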
Troubleshooting application errors
Application errors can be categorized into three different types:
device errors, controller errors, and host adapter errors. For each of
these error types, you should check the log entries for key pieces of
information. The important information for each error example is
described in the following sections.
Locating a device error
This section contains an example of a DECevent error log for a
device event or error. You should be able to locate the following
important details in the DECevent error log when a device event
occurs. Note that if the controller ASC and ASCQ are zero, the
device generated the error. Also note the Generic String message,
BBR disabled bad block number: 230262. This message is always
generated and is a generic message for a device software error.
Check the device ASC and ASCQ.
The following important information is highlighted in the example:
• Unit Information, Port-Target-LUN
• Generic String message. This message is always generated and
is a generic message for a device software error. You should
check the ASC and ASCQ.
• CAM Status
• SCSI Status
• Command Information
• Most Recent ASC and ASCQ
• Device Information, Port-Target-LUN
• Controller ASC and ASCQ
• LBN
• Device ASC and ASCQ
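The rule stated above for the controller ASC and ASCQ can be written as a one-line check. The sketch is illustrative only. The function name is hypothetical, and you read the two values from the log entry yourself. The source gives no rule for nonzero values, so the sketch reports them as undetermined.

```python
def error_origin(controller_asc: int, controller_ascq: int) -> str:
    """Apply the rule: controller ASC and ASCQ both zero means a device error."""
    if controller_asc == 0 and controller_ascq == 0:
        return "device"  # next, check the device ASC and ASCQ
    return "undetermined"  # no rule is given here for nonzero controller values


print(error_origin(0x00, 0x00))
```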
The “-i ios” qualifier used in the following command indicates that
I/O subsystem log entries should be included: these entries include
CAM events. The complete command syntax is: