Digital Equipment StorageWorks HSZ50 Service Manual

DIGITAL StorageWorks
HSZ50 Array Controller HSOF Version 5.1
Service Manual
Part Number: EK-HSZ50-SV.C01
March 1997
Software Version: HSOF Version 5.1
Digital Equipment Corporation
Maynard, Massachusetts
March, 1997
While Digital Equipment Corporation believes the information included in this manual is correct as of the date of publication, it is subject to change without notice. DIGITAL makes no representations that the interconnection of its products in the manner described in this document will not infringe existing or future patent rights, nor do the descriptions contained in this document imply the granting of licenses to make, use, or sell equipment or software in accordance with the description. No responsibility is assumed for the use or reliability of firmware on equipment not supplied by DIGITAL or its affiliated companies. Possession, use, or copying of the software or firmware described in this documentation is authorized only pursuant to a valid written license from DIGITAL, an authorized sublicensor, or the identified licensor.
Commercial Computer Software, Computer Software Documentation and Technical Data for Commercial Items are licensed to the U.S. Government with DIGITAL’s standard commercial license and, when applicable, the rights in DFAR 252.227-7015, “Technical Data—Commercial Items.”
© Digital Equipment Corporation 1997.
Printed in U.S.A. All rights reserved.
Alpha, CI, DCL, DECconnect, DECserver, DIGITAL, DSSI, HSC, HSJ, HSD, HSZ, MSCP, OpenVMS, StorageWorks, TMSCP, VAX, VAXcluster, VAX 7000, VAX 10000, VMS, VMScluster, and the DIGITAL logo are trademarks of Digital Equipment Corporation. All other trademarks and registered trademarks are the property of their respective holders.
This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference to radio communications. Operation of this equipment in a residential area is likely to cause harmful interference in which case the user will be required to correct the interference at his own expense. Restrictions apply to the use of the local-connection port on this series of controllers; failure to observe these restrictions may result in harmful interference. Always disconnect this port as soon as possible after completing the setup operation. Any changes or modifications made to this equipment may void the user's authority to operate the equipment.
Warning!
This is a Class A product. In a domestic environment this product may cause radio interference in which case the user may be required to take adequate measures.
Achtung!
Dieses ist ein Gerät der Funkstörgrenzwertklasse A. In Wohnbereichen können bei Betrieb dieses Gerätes Rundfunkstörungen auftreten, in welchen Fällen der Benutzer für entsprechende Gegenmaßnahmen verantwortlich ist.
Avertissement!
Cet appareil est un appareil de Classe A. Dans un environnement résidentiel cet appareil peut provoquer des brouillages radioélectriques. Dans ce cas, il peut être demandé à l’utilisateur de prendre les mesures appropriées.
1 Troubleshooting
Introduction............................................................................................................1–2
Interpreting controller LED codes...........................................................................1–2
Troubleshooting HSZ50 controllers ........................................................................ 1–7
Troubleshooting when you cannot access host units.........................................1–7
Troubleshooting on a DIGITAL UNIX system................................................. 1–8
Using the DIGITAL UNIX file utility.....................................................1–9
OpenVMS host troubleshooting...................................................................... 1–10
Troubleshooting application errors................................................................. 1–11
Locating a device error .........................................................................1–11
Controller generated event....................................................................1–19
Locating a host bus error....................................................................... 1–26
Command Timeout (Host system timeout) .............................................. 1–26
Select timeout (SCSI protocol timeout)..................................................1–30
Identifying unit attention errors............................................................. 1–33
OpenVMS unit attention.......................................................................... 1–33
DIGITAL UNIX unit attention................................................................ 1–37
Using FMU to describe event log codes................................................................1–42
FMU Command Example ..................................................................... 1–44
Using FMU to Describe Recent Last Fail or Memory System Failure Codes ........................ 1–44
FMU Output Example........................................................................... 1–45
Testing disks (DILX)............................................................................................ 1–46
Running a quick disk test ............................................................................... 1–46
Running an initial test on all disks..................................................................1–47
Running a disk basic function test.................................................................. 1–49
Running an advanced disk test ....................................................................... 1–52
DILX error codes........................................................................................... 1–55
DILX data patterns......................................................................................... 1–56
Monitoring system performance with the VTDPY utility...................................... 1–57
How to Run VTDPY...................................................................................... 1–57
Using the VTDPY Control Keys.................................................................... 1–58
Using the VTDPY Command Line................................................................. 1–58
How to Interpret the VTDPY Display Fields.................................................. 1–60
SCSI Host port Characteristics.............................................................. 1–60
Device SCSI Status............................................................................... 1–61
Unit Status (abbreviated) ...................................................................... 1–62
Unit Status (full)................................................................................... 1–65
Device Status........................................................................................ 1–68
Device SCSI Port Performance............................................................. 1–71
Help Example................................................................................................ 1–71
2 Replacing field-replaceable units
Introduction and precautions................................................................................... 2–2
Electrostatic Discharge .................................................................................... 2–2
Handling controllers or cache modules............................................................. 2–2
Handling the program card............................................................................... 2–2
Handling controller host-port cables: ............................................................... 2–3
Required tools.................................................................................................. 2–3
Replacing dual-redundant controllers and cache modules using C_SWAP ........................ 2–3
Preparing the subsystem.......................................................................... 2–4
Removing the controller and cache modules........................................... 2–7
Reinstalling the controller subsystem components ................................ 2–12
Restarting the subsystem....................................................................... 2–16
Replacing a controller and cache module in a single controller configuration ........................ 2–18
Removing the controller and cache modules......................................... 2–18
Reinstalling controller subsystem components...................................... 2–21
Replacing dual-redundant controllers and cache modules using the off-line method ........................ 2–24
Removing the controller and cache....................................................... 2–24
Reinstalling subsystem components...................................................... 2–25
Replacing external cache batteries (ECBs)............................................................ 2–28
Replacing ECBs using the on-line method ..................................................... 2–28
Preparing the subsystem........................................................................ 2–28
Replacing the failed ECB...................................................................... 2–29
Reinstalling the modules....................................................................... 2–30
Restarting the subsystem....................................................................... 2–32
Preparing to replace the second ECB .................................................... 2–33
Replacing the second ECB.................................................................... 2–33
Reinstalling the modules....................................................................... 2–34
Restarting the subsystem.......................................................................2–36
Replacing ECBs using the off-line method..................................................... 2–37
Replacing power supplies...................................................................................... 2–39
Cold-swap...................................................................................................... 2–39
Removing the power supply..................................................................2–39
Installing the new power supply............................................................ 2–40
Asynchronous swap method........................................................................... 2–41
Replacing storage devices..................................................................................... 2–42
Asynchronous disk drive swap ....................................................................... 2–42
Disk drive replacement procedure (3.5, 5.25-inch drives)............................... 2–42
Replacing tape drives............................................................................................ 2–44
Tape drive replacement procedure..................................................................2–44
Replacing solid-state disk and CD-ROM drives .................................................... 2–45
Solid-state disk and CD-ROM drive replacement procedure ........................ 2–45
Replacing SCSI host cables................................................................................... 2–47
Replacing the SCSI host cables...................................................................... 2–47
Replacing SCSI device port cables........................................................................ 2–49
Replacing the device port cables....................................................................2–49
3 Installing and Upgrading
Introduction............................................................................................................3–2
Upgrading Array Controller software...................................................................... 3–3
Program card upgrade (single controller configuration).................................... 3–3
Program card upgrade (dual-redundant configuration)...................................... 3–4
Upgrading controller software using the CLCP utility.............................................3–5
Invoking the CLCP utility................................................................................ 3–5
Code load methods........................................................................................... 3–5
Single controller upgrade method..................................................................... 3–6
Host port upgrade............................................................................................. 3–7
Host download script requirements ......................................................... 3–8
Preparing the software image.................................................................. 3–8
Setting up the host...................................................................................3–9
Write enable the program card in the controller ...................................... 3–9
Running the CLCP utility...................................................................... 3–10
Maintenance terminal port upgrade................................................................ 3–13
System setup......................................................................................... 3–14
Write enable the program card in the controller .................................... 3–16
Running the CLCP utility ..................................................................... 3–16
The dual-redundant, sequential upgrade method ............................................ 3–19
Special considerations for the sequential code load upgrade method ........................ 3–19
Sequential upgrade procedure ........................ 3–21
The dual-redundant concurrent code load upgrade method ........................ 3–21
Considerations for the concurrent code load upgrade method ........................ 3–22
Concurrent code load upgrade procedure ....................................................... 3–24
Patching controller software ................................................................................. 3–25
Code patch considerations.............................................................................. 3–26
Listing patches............................................................................................... 3–26
Installing a patch............................................................................................ 3–28
Code patch messages ..................................................................................... 3–30
Formatting disk drives.......................................................................................... 3–32
Considerations for formatting disk drives....................................................... 3–33
Installing new firmware on a device ..................................................................... 3–35
Considerations for installing new device firmware......................................... 3–36
HSUTIL abort codes...................................................................................... 3–37
HSUTIL messages ......................................................................................... 3–37
Installing a controller and cache module in a single controller configuration ........................ 3–41
Installing a second controller and cache module ................................................... 3–45
Installing a write-back cache module.................................................................... 3–48
Removing the controller ....................................................................... 3–48
Installing the write-back cache module................................................. 3–49
Adding Cache Memory......................................................................................... 3–50
Installing SIMM Cards................................................................................... 3–50
Installing power supplies ...................................................................................... 3–53
Power supply and shelf LED status indicators................................................ 3–53
Power supply installation procedure............................................................... 3–56
Installing storage building blocks.......................................................................... 3–57
SBB activity and fault indicators.................................................................... 3–58
Installing SBBs (except solid state disk and CD-ROM).................................. 3–60
Installing a solid state disk or CD-ROM......................................................... 3–60
4 Moving storagesets and devices
Precautions for retaining data.................................................................................. 4–2
Moving storagesets ................................................................................................. 4–3
Moving storageset members.................................................................................... 4–6
Moving a single disk-drive unit............................................................................... 4–8
Moving a tape drive, CD-ROM drive, or tape loader............................................... 4–9
5 Removing
Removing a patch...................................................................................................5–2
Removing a controller and cache module................................................................ 5–5
Removing storage devices....................................................................................... 5–6
Removing disk drives....................................................................................... 5–6
Removing solid state disks and CD-ROM drives.............................................. 5–7
Removing tape drives....................................................................................... 5–8
Appendix A
Instance codes and definitions ........................ A-2
Last fail codes...................................................................................................... A-42
Repair action codes.............................................................................................. A-91
Glossary
Index
Figures
Figure 2–1 Connecting a maintenance terminal.....................................................2–4
Figure 2–2 Disconnecting the trilink connector..................................................... 2–6
Figure 2–3 Removing the program card................................................................ 2–8
Figure 2–4 Disconnecting the battery cable and disabling the ECB ........................ 2–9
Figure 2–5 Removing controllers and cache modules.......................................... 2–10
Figure 2–6 Installing controllers and cache modules........................................... 2–13
Figure 2–7 Disabling the ECB ............................................................................ 2–19
Figure 2–8 Installing the program card................................................................ 2–21
Figure 2–9 Removing the power supply............................................................. 2–38
Figure 2–10 Power supply fault indicators.......................................................... 2–39
Figure 2–11 Removing a disk drive .................................................................... 2–41
Figure 2–12 Default indicators for 3.5- and 5.25-inch SBBs ............................... 2–42
Figure 2–13 OCP LED patterns .......................................................................... 2–43
Figure 2–14 Removing the CD-ROM drive......................................................... 2–44
Figure 2–15 Disconnecting the SCSI host cable.................................................. 2–46
Figure 2–16 Removing the volume shield........................................................... 2–48
Figure 2–17 Access to the SCSI cables............................................................... 2–49
Figure 3–1 Single controller code load method..................................................... 3–7
Figure 3–2 Host port code load operation.............................................................. 3–8
Figure 3–3 Write enable the program card.......................................................... 3–10
Figure 3–4 Terminal port code load operation..................................................... 3–13
Figure 3–5 Binary transfer protocol selection...................................................... 3–15
Figure 3–6 The sequential upgrade method......................................................... 3–22
Figure 3–7 The concurrent upgrade method........................................................ 3–25
Figure 3–8 Installing new firmware on a disk or tape drive................................. 3–37
Figure 3–9 Installing an SBB battery module...................................................... 3–44
Figure 3–10 Installing controller power supplies................................................. 3–44
Figure 3–11 Installing a single controller (SW800 cabinet)................................. 3–45
Figure 3–12 Cache configurations for cache Version 3 ....................................... 3–53
Figure 3–13 Installing a power supply ................................................................ 3–58
Figure 3–14 Typical 3.5-inch and 5.25-inch disk drive SBBs.............................. 3–59
Figure 3–15 Typical 5.25-inch CD-ROM SBB.................................................... 3–60
Figure 3–16 Typical 3.5-inch tape drive SBB ..................................................... 3–60
Figure 4–1 Moving a storageset from one subsystem to another ........................ 4–3
Figure 4–2 Moving storageset members................................................................ 4–6
Figure 5–1 Removing a 3.5-inch disk drive........................................................... 5–7
Figure 5–2 OCP LED patterns .............................................................................. 5–8
Tables
Table 1–1 Solid controller LED codes.................................................................. 1–3
Table 1–2 Flashing controller LED codes............................................................. 1–4
Table 1–3 DILX data patterns............................................................................. 1–57
Table 1–4 VTDPY control keys.......................................................................... 1–59
Table 1–5 VTDPY commands............................................................................ 1–60
Table 2–1 Required tools...................................................................................... 2–3
Table 2–2 ECB status indicators ......................................................................... 2–16
Table 2–3 ECB status indicators......................................................................... 2–26
Table 2–4 ECB status indicators ......................................................................... 2–36
Table 3–1 Abort codes........................................................................................ 3–39
Table 3–2 SCSI ID Slots..................................................................................... 3–43
Table 3–3 ECB status indicators ......................................................................... 3–46
Table 3–4 Adding cache memory capacity.......................................................... 3–53
Table 3–5 Power supply status indicators -- SW300 cabinet................................ 3–55
Table 3–6 Shelf and single power supply status indicators -- SW500, SW800 cabinets ........................ 3–56
Table 3–7 Shelf and dual power supply status indicators -- SW500, SW800 cabinets ........................ 3–57
Table 3–8 Storage SBB Status Indicators............................................................ 3–62
Table A–1 Instance codes ........................ A-2
Table A–2 Executive services last failure codes..................................................A-42
Table A–3 Value-added services last failure codes.............................................. A-46
Table A–4 Device services last failure codes ...................................................... A-56
Table A–5 Fault manager last failure codes.........................................................A-64
Table A–6 Common library last failure codes..................................................... A-67
Table A–7 DUART services last failure codes.................................................... A-67
Table A–8 Failover control last failure codes......................................................A-68
Table A–9 Nonvolatile parameter memory failover control last failure codes ........................ A-69
Table A–10 Facility lock manager last failure codes........................................... A-71
Table A–11 Integrated logging facility last failure codes .................................... A-72
Table A–12 CLI last failure codes.......................................................................A-72
Table A–13 Host interconnect services last failure codes.................................... A-74
Table A–14 SCSI host interconnect services last failure codes ........................ A-76
Table A–15 Host interconnect port services last failure codes ........................ A-77
Table A–16 Disk and tape MSCP server last failure codes.................................. A-80
Table A–17 Diagnostics and utilities protocol server last failure codes ........................ A-84
Table A–18 System communication services directory last failure code ........................ A-85
Table A–19 SCSI host value-added services last failure codes ........................ A-85
Table A–20 Disk inline exerciser (DILX) last failure codes................................A-86
Table A–21 Tape inline exerciser (TILX) last failure codes................................ A-87
Table A–22 Device configuration utilities (CONFIG/CFMENU) last failure codes ........................ A-89
Table A–23 Clone unit utility (CLONE) last failure codes..................................A-89
Table A–24 Format and device code load utility (HSUTIL) last failure codes ........................ A-89
Table A–25 Code load/code patch utility (CLCP) last failure codes ........................ A-90
Table A–26 Induce controller crash utility (CRASH) last failure codes ........................ A-90
Table A–27 Repair action codes ......................................................................... A-91
Related documents
The following table lists documents that contain information related to this product.
Document title                                                                       Part number
DECevent Installation Guide                                                          AA–Q73JA–TE
StorageWorks BA350–MA Controller Shelf User's Guide                                  EK–350MA–UG
StorageWorks Configuration Manager for DEC OSF/1 Installation Guide                  AA–QC38A–TE
StorageWorks Configuration Manager for DEC OSF/1 System Manager's Guide for HSZterm  AA–QC39A–TE
StorageWorks Solutions Configuration Guide                                           EK–BA350–CG
StorageWorks Solutions Shelf and SBB User's Guide                                    EK–BA350–UG
StorageWorks Solutions SW300-Series RAID Enclosure Installation and User's Guide     EK–SW300–UG
StorageWorks SW500-Series Cabinet Installation and User's Guide                      EK–SW500–UG
StorageWorks SW800-Series Data Center Cabinet Installation and User's Guide          EK–SW800–UG
The RAIDBOOK—A Source for RAID Technology                                            RAID Advisory Board
Polycenter Console Manager User's Guide                                              Computer Associates
VAXcluster Systems Guidelines for VAXcluster System Configurations                   EK–VAXCS–CG
16-Bit SBB User’s Guide                                                              EK-SBB16-UG
7-Bit SBB Shelf (BA356 Series) User’s Guide                                          EK-BA356-UG
SBB User’s Guide                                                                     EK-SBB35-UG
1 Troubleshooting
Interpreting controller LED codes
Troubleshooting controllers
Using FMU to describe event log codes
Testing disk drives
Monitoring subsystem performance
HSZ50 Array Controller Service Manual
1–2 Troubleshooting
Introduction
This chapter is designed to help you quickly isolate the source of any problems you might encounter when you service the StorageWorks HSZ50 controllers, and take the necessary steps to correct the problems.
Interpreting controller LED codes
This section provides information on how to interpret controller LED codes. The operator control panel (OCP) on each HSZ controller contains a green reset LED and six device bus LEDs. These LEDs light in patterns to display codes when there is a problem with a device configuration, a device, or a controller.
During normal operation, the green reset LED on each controller flashes once per second, and the device bus LEDs are not lit.
The amber LED for a device bus lights continuously when the installed devices do not match the controller configuration, or when a device fault occurs.
The green reset LED lights continuously and the amber LEDs display a code when a controller problem occurs. Solid LED codes indicate a fault detected by internal diagnostic and initialization routines. Flashing LED codes indicate a fault that occurred during core diagnostics.
Look up the LED code that is showing on your controller in Table 1–1 or Table 1–2 to determine its meaning and find the corrective action. The symbols used in the tables have the following meanings:
O  LED on
P  LED off
M  LED flashing
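The rules above can be sketched as a small classifier. This is only an illustration, not part of the controller firmware or any DIGITAL utility: the function names are hypothetical, and it assumes that each seven-symbol pattern lists the green reset LED first, followed by the six device bus LEDs.

```python
from enum import Enum


class Led(Enum):
    ON = "O"        # LED on
    OFF = "P"       # LED off
    FLASHING = "M"  # LED flashing


def parse_pattern(pattern):
    """Split a pattern such as 'O P P M P P P' into seven Led values."""
    leds = [Led(symbol) for symbol in pattern.split()]
    if len(leds) != 7:
        raise ValueError("expected 7 symbols: reset LED plus six device bus LEDs")
    return leds


def classify(pattern):
    """Return a coarse diagnosis for one OCP pattern."""
    reset, *bus = parse_pattern(pattern)  # assumption: reset LED is listed first
    if reset is Led.ON:
        # Reset LED lit continuously: the amber LEDs carry a controller code.
        if Led.FLASHING in bus:
            return "controller fault: flashing code (Table 1-2)"
        return "controller fault: solid code (Table 1-1)"
    if reset is Led.FLASHING:
        if all(led is Led.OFF for led in bus):
            return "normal operation"
        if Led.FLASHING not in bus:
            return "device fault or configuration mismatch"
    return "unrecognized pattern"


if __name__ == "__main__":
    print(classify("M P P P P P P"))
    print(classify("O P P P P P M"))
```

Any pattern the text does not describe, such as a dark reset LED, is reported as unrecognized rather than guessed at.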
Table 1–1 Solid controller LED codes
Code           Description of Error                                          Corrective Action
O O O O O O O  DAEMON hard error                                             Replace controller module.
O O O O O O P  Repeated firmware bugcheck                                    Replace controller module.
O O O O O P O  NVMEM version mismatch                                        Replace program card with later version of firmware.
O O O O O P P  NVMEM write error                                             Replace controller module.
O O O O P O O  NVMEM read error                                              Replace controller module.
O O O O P O P  NMI error within firmware bugcheck                            Reset the controller.
O O O O P P O  Inconsistent NVMEM structures repaired                        Reset the controller.
O O O O P P P  Bugcheck with no restart                                      Reset the controller.
O O O P O O O  Firmware induced restart following bugcheck failed to occur   Replace controller module.
O O O P O O P  Hardware induced restart following bugcheck failed to occur   Replace controller module.
O O O P O P O  Bugcheck within bugcheck controller                           Reset controller module.
O O O P P O O  NVMEM version is too low                                      Verify the card is the latest revision. If the problem still exists, replace the module.
O O O P P O P  Program card write fail                                       Replace the card.
O O O P P P O  ILF, INIT unable to allocate memory                           Reset the controller.
O O O P P P P  Bugcheck before subsystem initialization completed            Reset the controller.
O P P P P P P  No program card seen                                          Try the card in another module. If the problem follows the card, replace the card. Otherwise, replace the controller.
Table 1–2 Flashing controller LED codes
Code | Description of Error | Corrective Action
O P P P P P M | Program card EDC error | Replace program card.
O P P P M P P | Timer zero in the timer chip will run when disabled | Replace controller module.
O P P P M P M | Timer zero in the timer chip decrements incorrectly | Replace controller module.
O P P P M M P | Timer zero in the timer chip did not interrupt the processor when requested | Replace controller module.
O P P P M M M | Timer one in the timer chip decrements incorrectly | Replace controller module.
O P P M P P P | Timer one in the timer chip did not interrupt the processor when requested | Replace controller module.
O P P M P P M | Timer two in the timer chip decrements incorrectly | Replace controller module.
O P P M P M P | Timer two in the timer chip did not interrupt the processor when requested | Replace controller module.
O P P M P M M | Memory failure in the I/D cache | Replace controller module.
O P P M M P P | No hit or miss to the I/D cache when expected | Replace controller module.
O P P M M P M | One or more bits in the diagnostic registers did not match the expected reset value | Replace controller module.
O P P M M M P | Memory error in the nonvolatile journal SRAM | Replace controller module.
O P P M M M M | Wrong image seen on program card | Replace program card.
O P M P P P P | At least one register in the controller DRAB does not read as written | Replace controller module.
O P M P P P M | Main memory is fragmented into too many sections for the number of entries in the good memory list | Replace controller module.
O P M P P M P | The controller DRAB or DRAC chip does not arbitrate correctly | Replace controller module.
O P M P P M M | The controller DRAB or DRAC chip failed to detect forced parity, or detected parity when not forced | Replace controller module.
O P M P M P P | The controller DRAB or DRAC chip failed to verify the EDC correctly | Replace controller module.
O P M P M P M | The controller DRAB or DRAC chip failed to report forced ECC | Replace controller module.
O P M P M M P | The controller DRAB or DRAC chip failed some operation in the reporting, validating, and testing of the multibit ECC memory error | Replace controller module.
O P M P M M M | The controller DRAB or DRAC chip failed some operation in the reporting, validating, and testing of the multiple single-bit ECC memory error | Replace controller module.
O P M M P P P | The controller main memory did not write correctly in one or more sized memory transfers | Replace controller module.
O P M M P P M | The controller did not cause an I-to-N bus timeout when accessing a “reset” host port chip | Replace controller module.
O P M M P M P | The controller DRAB or DRAC chip did not report an I-to-N bus timeout when accessing a “reset” host port chip | Replace controller module.
O P M M P M M | The controller DRAB or DRAC chip did not interrupt the controller processor when expected | Replace controller module.
O P M M M P P | The controller DRAB or DRAC chip did not report an NXM error when nonexistent memory was accessed | Replace controller module.
O P M M M P M | The controller DRAB or DRAC chip did not report an address parity error when one was forced | Replace controller module.
O P M M M M P | There was an unexpected nonmaskable interrupt from the controller DRAB or DRAC chip during the DRAB memory test | Replace controller module.
O P M M M M M | Diagnostic register indicates there is no cache module, but an interrupt exists from the non-existent cache module | Replace controller module.
O M P P P P P | The required amount of memory available for the code image to be loaded from the program card is insufficient | Replace controller module.
O M P P P P M | The required amount of memory available in the pool area is insufficient for the controller to run | Replace controller module.
O M P P P M M | The required amount of memory available in the buffer area is insufficient for the controller to run | Replace controller module.
O M P P M P P | The code image was not the same as the image on the card after the contents were copied to memory | Replace controller module.
O M P P M P M | Diagnostic register indicates that the cache module does not exist, but access to that cache module caused an error | Replace controller shelf backplane.
O M P P M M P | Diagnostic register indicates that the cache module does not exist, but access to that cache module did not cause an error | Replace controller shelf backplane.
O M M P P P P | The journal SRAM battery is bad | Replace controller module.
O M M M P M P | There was an unexpected interrupt from a read cache or the present and lock bits are not working correctly | Replace controller module.
O M M M P M M | There is an interrupt pending on the controller’s policy processor when there should be none | Replace controller module.
O M M M M P P | There was an unexpected fault during initialization | Replace controller module.
O M M M M P M | There was an unexpected maskable interrupt received during initialization | Replace controller module.
O M M M M M P | There was an unexpected nonmaskable interrupt received during initialization | Replace controller module.
O M M M M M M | An illegal process was activated during initialization | Replace controller module.
Troubleshooting HSZ50 controllers
This section covers the following topics:
Troubleshooting when you cannot access HSZ units
Troubleshooting on a DIGITAL UNIX system
OpenVMS host troubleshooting
Troubleshooting application errors
Troubleshooting when you cannot access host units
If the error prevents you from accessing units from the host, determine whether any HSZ units can be accessed. If no HSZ units can be accessed, run the VTDPY display and ensure that the host has established communications with all HSZ target IDs. Refer to the section later in this chapter on “Monitoring system performance with the VTDPY utility” for more information about running VTDPY. If the host has not established communications, one of the following might be true:
The host adapter is bad.
The host SCSI bus is bad or misconfigured.
The HSZ controller is bad.
To find more information about this error, use the following procedure from the HSZ console. (If this is a dual controller configuration, the command must be executed on both controllers.)
1. To determine if the unit is on-line to a controller:
HSZ50> SHOW UNITS FULL
2. Check the following:
– Is the unit on-line or available to this or the other controller?
– From the HSZ controller to which the unit is on-line, does the SHOW UNITS command also show the size in blocks?
3. If the answer to both of the questions in step 2 is no, there is a problem with the HSZ controller. Look for any type of errors in the SHOW UNITS output, such as Lost Data or Media Format.
4. Run the VTDPY display.
5. Look at the unit status in the VTDPY display. Use the information in a later section in this chapter, “Monitoring System Performance with the VTDPY Utility” to interpret the VTDPY display.
6. If the unit is not on-line or if errors are present in the SHOW UNITS display, take appropriate action to clear the errors or rebuild the unit.
Be careful with user’s data. If this is a RAIDset, try to save the user’s data. Do not initialize the storage unit unless there is no other alternative.
If you determine that units are on-line and everything seems to be in order on the HSZ side, proceed to check the host side using the file utility procedure.
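The decision points in the procedure above can be summarized as a short routine. The following Python sketch is a simplification for illustration only: the `UnitCheck` fields and the `next_action` function are hypothetical names, and each field stands for an observation made by hand from the VTDPY display or the SHOW UNITS FULL output.

```python
from dataclasses import dataclass


@dataclass
class UnitCheck:
    host_sees_all_targets: bool  # VTDPY: host communicates with every HSZ target ID
    unit_online: bool            # SHOW UNITS FULL: on-line or available to either controller
    size_shown: bool             # SHOW UNITS also shows the size in blocks
    errors_shown: bool           # for example Lost Data or Media Format


def next_action(check: UnitCheck) -> str:
    """Suggest the next troubleshooting step from the recorded observations."""
    if not check.host_sees_all_targets:
        return "check the host adapter, the host SCSI bus, and the HSZ controller"
    if not check.unit_online and not check.size_shown:
        # Both questions in step 2 answered "no".
        return "suspect the HSZ controller; look for errors in the SHOW UNITS output"
    if not check.unit_online or check.errors_shown:
        return "clear the errors or rebuild the unit, preserving user data"
    return "HSZ side appears in order; check the host side"


print(next_action(UnitCheck(True, True, True, False)))
```

The ordering of the tests mirrors the order of the steps; it is one reasonable reading of the procedure rather than a definitive diagnostic flow.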
Troubleshooting on a DIGITAL UNIX system
To troubleshoot on a DIGITAL UNIX system, use the file utility to access the device. The error message from the file utility might explain where the problem lies.
Using the DIGITAL UNIX file utility
You can use the DIGITAL UNIX file utility to determine if an HSZ unit can be accessed from the DIGITAL UNIX host system. In the following procedure, an HSZ controller has a unit named D101, which will be used by the file utility.
1. Enter the following command from the HSZ CLI:
HSZ50>SHOW D101
2. Disable the writeback_cache and read_cache on this unit, if they are both enabled, using the following commands:
HSZ50>SET D101 nowriteback_cache
HSZ50>SET D101 noread_cache
or disable just the read_cache, if it is enabled on the unit, with the following command:
HSZ50>SET D101 noread_cache
Disabling the read_cache causes information to be accessed from the unit rather than from the cache, if the information is in cache. This gives a visual indication that the unit is being accessed.
3. From the DIGITAL UNIX console, issue the file command to start the file utility. (Assume that the character special file has been created for rrzb17a.)
/usr/bin/file /dev/rrzb17a
The device activity indicator on the device, the green light, should light up. If the unit is a multidevice storage unit, only one of the devices that is part of that storage unit lights.
The host system should display the following output after the file command is issued (the output displays on one line):
/dev/rrzb17a character special (8/mmmm) SCSI # n HSZ50 disk #xxx (SCSI ID #t)
The output values have the following meanings:
8 - major number
mmmm - minor number
n - SCSI host side bus number
t - target ID as used in the HSZ50 unit DTZL, where the “T” in the DTZL HSZ50 unit matches the “t” from the file command.
xxx - the disk number
4. If an error occurs, use the information in the following table to evaluate errors or output:
Error or Output | Meaning and action
file: Cannot get file status on /dev/mmmm and /dev/mmmm: Cannot open for reading | Indicates the special file in the /dev directory that matches mmmm does not exist.
Only the major and minor number is returned from the file command | The device is not answering or the device special file does not have the correct minor number. Check the minor number to be sure that it matches the correct SCSI host side bus number and the correct HSZ50 Target ID and LUN from the HSZ50 unit designator.
5. If the unit had write-back cache enabled, remember to enable the cache again using the following HSZ CLI command that enables both the write-back and read cache:
HSZ50> SET D101 WRITEBACK_CACHE
6. If the unit had only the read cache enabled, enable the read cache with this HSZ50 CLI command:
HSZ50> SET D101 READ_CACHE
7. Run VTDPY to ensure the host established communication with all HSZ target IDs.
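The output format in step 3 and the error table in step 4 can be checked mechanically. The Python sketch below is illustrative only: the regular expression is modeled on the one-line format shown in step 3, the function name and status strings are hypothetical, and the sample values (minor number 1234, bus 2, disk 17, target 1) are made up for the example rather than taken from a real system.

```python
import re

# Pattern modeled on the documented one-line output; the trailing SCSI
# details are optional so that a bare major/minor reply is still recognized.
FILE_LINE = re.compile(
    r"^(?P<dev>[^\s:]+):?\s+character special \((?P<major>\d+)/(?P<minor>\d+)\)"
    r"(?:\s+SCSI\s*#\s*(?P<bus>\d+)\s+(?P<model>\S+)\s+disk\s*#\s*(?P<disk>\d+)"
    r"\s+\(SCSI ID\s*#\s*(?P<target>\d+)\))?\s*$"
)


def interpret_file_output(line: str) -> dict:
    """Map one line of file utility output onto the cases in the table."""
    text = line.strip()
    if text.startswith("file: Cannot get file status") or text.endswith(
        "Cannot open for reading"
    ):
        return {"status": "special file missing"}
    match = FILE_LINE.match(text)
    if match is None:
        return {"status": "unrecognized"}
    result = {
        "device": match.group("dev"),
        "major": int(match.group("major")),
        "minor": int(match.group("minor")),
    }
    if match.group("bus") is None:
        # Only the major and minor numbers came back.
        result["status"] = "device not answering or wrong minor number"
        return result
    result.update(
        status="ok",
        bus=int(match.group("bus")),
        disk=int(match.group("disk")),
        target=int(match.group("target")),
    )
    return result


sample = "/dev/rrzb17a character special (8/1234) SCSI # 2 HSZ50 disk #17 (SCSI ID #1)"
print(interpret_file_output(sample))
```

When the status is "ok", the `target` value is the one to compare against the “T” in the DTZL unit designator.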
OpenVMS host troubleshooting
If you cannot access the host on an OpenVMS system, use the following procedure to troubleshoot:
1. On the VMS system, enter the following command.
$ SHOW DEVICE DK*
Device names will display in the following format: DKA101
The A in the device name represents a SCSI controller designation and the 101 represents a unit number on an HSZ or other SCSI controller. If there was an HSZ unit named D101 on the HSZ whose letter designation was A, that would be the VMS device DKA101.
If there are multiple SCSI controllers, there would be a different controller letter designation, for example DKA, DKB, and so forth.
The SHOW DEVICE FULL command also would give the controller type. If the device was configured on an HSZ controller, HSZ would appear in the device information.
2. The SHOW DEVICE DK* command should display the HSZ unit. If the unit is not displayed, follow the procedures in the previous section to determine if the unit is on-line.
3. If the unit is on-line to the HSZ, run the SYSMAN utility on the VMS system to ensure the device is configured.
$ MC SYSMAN
SYSMAN> IO AUTOCONFIGURE
SYSMAN> EXIT
4. If you still cannot see the unit, check the error logs for SCSI errors. The problem could be due to a bad host adapter, a SCSI bus problem, or the HSZ.
5. Use the VTDPY display to ensure the host adapter established connectivity to all HSZ target IDs. The host port portion of the VTDPY display should show all HSZ target IDs, and the rate should be 10 MHz.
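The device naming rule from step 1 (DK, then the SCSI controller letter, then the unit number) can be sketched as follows. This Python example is illustrative only: the function names are hypothetical, and the handling of a node prefix ending in "$" and of a trailing colon is an assumption about how names may appear in SHOW DEVICE output.

```python
import re

# Assumed shape of a bare SCSI disk name: DK + controller letter + unit number.
DK_NAME = re.compile(r"^DK(?P<ctrl>[A-Z])(?P<unit>\d+)$")


def parse_dk_device(name: str) -> tuple[str, int]:
    """Split a name such as DKA101 into its controller letter and unit number."""
    # Drop a trailing colon and any prefix up to the last "$" (assumed node prefix).
    bare = name.strip().upper().rstrip(":").rsplit("$", 1)[-1]
    match = DK_NAME.match(bare)
    if match is None:
        raise ValueError(f"not a DK device name: {name!r}")
    return match.group("ctrl"), int(match.group("unit"))


def vms_device_for(ctrl_letter: str, hsz_unit: str) -> str:
    """Build the expected VMS device name for an HSZ unit such as D101."""
    unit = hsz_unit.strip().upper()
    if not (unit.startswith("D") and unit[1:].isdigit()):
        raise ValueError(f"not an HSZ disk unit name: {hsz_unit!r}")
    return f"DK{ctrl_letter.upper()}{int(unit[1:])}"


print(parse_dk_device("DKA101"))
print(vms_device_for("A", "D101"))
```

Comparing the name built by `vms_device_for` with the names listed by SHOW DEVICE DK* is a quick way to confirm that the expected unit is present.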
Troubleshooting application errors
Application errors can be categorized into three different types: device errors, controller errors, and host adapter errors. For each of these error types, you should check the log entries for key pieces of information. The important information for each error example is described in the following sections.
Locating a device error
This section contains an example of a DECevent error log for a device event or error. You should be able to locate the following important details in the DECevent error log when a device event occurs. Note that if the controller ASC and ASCQ are zero, the device generated the error. Also note the Generic String message, BBR disabled bad block number: 230262. This message is always generated and is a generic message for a device software error. Check the device ASC and ASCQ.
The following important information is highlighted in the example:
Unit Information, Port-Target-LUN
Generic String message. This message is always generated and is a generic message for a device software error. You should check the ASC and ASCQ.
CAM Status
SCSI Status
Command Information
Most Recent ASC and ASCQ
Device Information, Port-Target-LUN
Controller ASC and ASCQ
LBN
Device ASC and ASCQ
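The rule stated above, that a controller ASC and ASCQ of zero means the device generated the error, can be written as a one-line test. The Python sketch below is illustrative only; the function name is hypothetical, and treating any nonzero controller value as pointing at the controller is an assumption, since the text states only the zero case.

```python
def asc_pair_to_check(controller_asc: int, controller_ascq: int) -> str:
    """Return which ASC/ASCQ pair to examine first."""
    if controller_asc == 0 and controller_ascq == 0:
        # Stated rule: the device generated the error.
        return "device"
    # Assumption: a nonzero controller value points at the controller.
    return "controller"


print(asc_pair_to_check(0x00, 0x00))
```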
The “-i ios” qualifier used in the following command indicates that I/O subsystem log entries should be included; these entries include CAM events. The complete command syntax is:
#dia -i ios -t s:03-oct-1995, 10:47 e:03-oct-1995, 10:48
DECevent Log Example - Locating a Device Error
*************************ENTRY 4**************************
Logging OS                   2. DIGITAL UNIX
System Architecture          2. Alpha
Event sequence number        1.
Timestamp of occurrence      03-OCT-1995 10:47:59
Host name                    testsys
System type register         x00000004 DEC 3000
Number of CPUs (mpnum)       x00000001
CPU logging event (mperr)    x00000000
Event validity               1. O/S claims event is valid
Event severity               5. Low Priority
Entry type                   199. CAM SCSI Event Type
------- Unit Info -------
Bus Number                   2.
Unit Number                  x0090 Target = 2. LUN = 0.
------- CAM Data -------
Class                        x00 Disk
Subsystem                    x00 Disk
Number of Packets            10.
------ Packet Type ------    258. Module Name String
Routine Name                 cdisk_bbr_done
------ Packet Type ------    256. Generic String
                             cdisk_bbr: BBR disabled bad block number: 230262
------ Packet Type ------    261. Soft Error String
Error Type                   Soft Error Detected (recovered)
------ Packet Type ------    257. Device Name String
Device Name                  DEC HSZ4
------ Packet Type ------    256. Generic String
                             Active CCB at time of error
------ Packet Type ------    256. Generic String
                             CCB request completed with an error
------ Packet Type ------    1. SCSI I/O Request CCB(CCB_SCSIIO)
Packet Revision              37.
CCB Address                  xFFFFFC0007F9BB28
CCB Length                   x00C0
XPT Function Code            x01 Execute requested SCSI I/O
Cam Status                   x84 CCB Request Completed WITH Error
                                 Autosense Data Valid for Target
Path ID                      2.
Target ID                    2.
Target LUN                   0.
Cam Flags                    x00000482 SIM Queue Actions are Enabled
                                       Data Direction (10: DATA OUT)
                                       Disable the SIM Queue Frozen State
*pdrv_ptr                    xFFFFFC0007F9B828
*next_ccb                    x0000000000000000
*req_map                     xFFFFFC0007F8C200
void (*cam_cbfcnp)()         xFFFFFC00004AC8A0
*data_ptr                    x000000014000A1A0
Data Transfer Length         8192.
*sense_ptr                   xFFFFFC0007F9B850
Auotsense Byte Length        160.
CDB Length                   6.
Scatter/Gather Entry Cnt     0.
SCSI Status                  x02 Check Condition
Autosense Residue Length     x00
Transfer Residue Length      x00000000
(CDB) Command & Data Buf
       15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order
 0000: 00000000 00000010 7083030A *...p.......*
Timeout Value                x0000003C
*msg_ptr                     x0000000000000000
Message Length               0.
Vendor Unique Flags          x4000
Tag Queue Actions            x20 Tag for Simple Queue
------ Packet Type ------    256. Generic String
                             Error, exception, or abnormal condition
------ Packet Type ------    256. Generic String
                             RECOVERED ERROR - Recovery action performed
------ Packet Type ------    768. SCSI Sense Data
Packet Revision              0.
------- HSZ Data -------
Instance, Code               x0328450A The disk device reported standard SCSI Sense Data.
                             Component ID = Device Services.
                             Event Number = x00000028
                             Repair Action = x00000045
                             NR Threshold = x0000000A
Template Type                x51 Disk Transfer Error.
Template Flags               x01 HCE = 1, Event occurred during Host Command Execution.
Ctrl Serial #                ZG41800293
Ctrl Software Revision       V20Z
RAIDSET State                x00 NORMAL. All members present and reconstructed, IF LUN is configured as a RAIDSET.
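In the HSZ Data portion of this example, the Instance Code x0328450A is listed together with Event Number x00000028, Repair Action x00000045, and NR Threshold x0000000A, which match the three low-order bytes of the code. The Python sketch below splits an instance code on that basis. It is illustrative only: the byte layout is inferred from this one example, the high-order byte is assumed to be the component ID, and the function name is hypothetical.

```python
def split_instance_code(code: int) -> dict:
    """Split a 32-bit instance code into four one-byte fields.

    Layout inferred from the example entry: component ID (assumed) in the
    high byte, then event number, repair action, and NR threshold.
    """
    return {
        "component_id": (code >> 24) & 0xFF,
        "event_number": (code >> 16) & 0xFF,
        "repair_action": (code >> 8) & 0xFF,
        "nr_threshold": code & 0xFF,
    }


fields = split_instance_code(0x0328450A)
for name, value in fields.items():
    print(f"{name:14s} x{value:02X}")
```

For the example entry this yields event number x28, repair action x45, and NR threshold x0A, in agreement with the values DECevent prints.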