Digital Equipment StorageWorks HSZ50 Service Manual

DIGITAL StorageWorks

HSZ50 Array Controller HSOF Version 5.1

Service Manual

Part Number: EK-HSZ50-SV.C01

March 1997

Software Version: HSOF Version 5.1

Digital Equipment Corporation

Maynard, Massachusetts

March, 1997

While Digital Equipment Corporation believes the information included in this manual is correct as of the date of publication, it is subject to change without notice. DIGITAL makes no representations that the interconnection of its products in the manner described in this document will not infringe existing or future patent rights, nor do the descriptions contained in this document imply the granting of licenses to make, use, or sell equipment or software in accordance with the description. No responsibility is assumed for the use or reliability of firmware on equipment not supplied by DIGITAL or its affiliated companies. Possession, use, or copying of the software or firmware described in this documentation is authorized only pursuant to a valid written license from DIGITAL, an authorized sublicensor, or the identified licensor.

Commercial Computer Software, Computer Software Documentation and Technical Data for Commercial Items are licensed to the U.S. Government with DIGITAL’s standard commercial license and, when applicable, the rights in DFAR 252.227-7015, “Technical Data—Commercial Items.”

Alpha, CI, DCL, DECconnect, DECserver, DIGITAL, DSSI, HSC, HSJ, HSD, HSZ, MSCP, OpenVMS, StorageWorks, TMSCP, VAX, VAXcluster, VAX 7000, VAX 10000, VMS, VMScluster, and the DIGITAL logo are trademarks of Digital Equipment Corporation. All other trademarks and registered trademarks are the property of their respective holders.

This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference to radio communications. Operation of this equipment in a residential area is likely to cause harmful interference in which case the user will be required to correct the interference at his own expense. Restrictions apply to the use of the local-connection port on this series of controllers; failure to observe these restrictions may result in harmful interference. Always disconnect this port as soon as possible after completing the setup operation. Any changes or modifications made to this equipment may void the user's authority to operate the equipment.

Warning!

This is a Class A product. In a domestic environment this product may cause radio interference in which case the user may be required to take adequate measures.

Achtung!

Dieses ist ein Gerät der Funkstörgrenzwertklasse A. In Wohnbereichen können bei Betrieb dieses Gerätes Rundfunkstörungen auftreten, in welchen Fällen der Benutzer für entsprechende Gegenmaßnahmen verantwortlich ist.

Avertissement!

Cet appareil est un appareil de Classe A. Dans un environnement résidentiel cet appareil peut provoquer des brouillages radioélectriques. Dans ce cas, il peut être demandé à l’utilisateur de prendre les mesures appropriées.

1 Troubleshooting

Introduction............................................................................................................1–2

Interpreting controller LED codes...........................................................................1–2

Troubleshooting HSZ50 controllers ........................................................................ 1–7

Troubleshooting when you cannot access host units.........................................1–7

Troubleshooting on a DIGITAL UNIX system................................................. 1–8

Using the DIGITAL UNIX file utility.....................................................1–9

OpenVMS host troubleshooting...................................................................... 1–10

Troubleshooting application errors................................................................. 1–11

Locating a device error .........................................................................1–11

Controller generated event....................................................................1–19

Locating a host bus error....................................................................... 1–26

Command Timeout (Host system timeout) .............................................. 1–26

Select timeout (SCSI protocol timeout)..................................................1–30

Identifying unit attention errors............................................................. 1–33

OpenVMS unit attention.......................................................................... 1–33

DIGITAL UNIX unit attention................................................................ 1–37

Using FMU to describe event log codes................................................................1–42

FMU Command Example ..................................................................... 1–44

Using FMU to Describe Recent Last Fail or Memory System Failure Codes .................. 1–44

FMU Output Example........................................................................... 1–45

Testing disks (DILX)............................................................................................ 1–46

Running a quick disk test ............................................................................... 1–46

Running an initial test on all disks..................................................................1–47

Running a disk basic function test.................................................................. 1–49

Running an advanced disk test ....................................................................... 1–52

DILX error codes........................................................................................... 1–55

DILX data patterns......................................................................................... 1–56

Monitoring system performance with the VTDPY utility...................................... 1–57

How to Run VTDPY...................................................................................... 1–57

Using the VTDPY Control Keys.................................................................... 1–58

Using the VTDPY Command Line................................................................. 1–58

How to Interpret the VTDPY Display Fields.................................................. 1–60

SCSI Host port Characteristics.............................................................. 1–60

Device SCSI Status............................................................................... 1–61

Unit Status (abbreviated) ...................................................................... 1–62

Unit Status (full)................................................................................... 1–65

Device Status........................................................................................ 1–68

Device SCSI Port Performance............................................................. 1–71

Help Example................................................................................................ 1–71

2 Replacing field-replaceable units

Introduction and precautions................................................................................... 2–2

Electrostatic Discharge .................................................................................... 2–2

Handling controllers or cache modules............................................................. 2–2

Handling the program card............................................................................... 2–2

Handling controller host-port cables: ............................................................... 2–3

Required tools.................................................................................................. 2–3

Replacing dual-redundant controllers and cache modules using C_SWAP .................. 2–3

Preparing the subsystem.......................................................................... 2–4

Removing the controller and cache modules........................................... 2–7

Reinstalling the controller subsystem components ................................ 2–12

Restarting the subsystem....................................................................... 2–16

Replacing a controller and cache module in a single controller configuration .................. 2–18

Removing the controller and cache modules......................................... 2–18

Reinstalling controller subsystem components...................................... 2–21

Replacing dual-redundant controllers and cache modules using the off-line method .................. 2–24

Removing the controller and cache....................................................... 2–24

Reinstalling subsystem components...................................................... 2–25

Replacing external cache batteries (ECBs)............................................................ 2–28

Replacing ECBs using the on-line method ..................................................... 2–28

Preparing the subsystem........................................................................ 2–28

Replacing the failed ECB...................................................................... 2–29

Reinstalling the modules....................................................................... 2–30

Restarting the subsystem....................................................................... 2–32

Preparing to replace the second ECB .................................................... 2–33

Replacing the second ECB.................................................................... 2–33

Reinstalling the modules....................................................................... 2–34

Restarting the subsystem.......................................................................2–36

Replacing ECBs using the off-line method..................................................... 2–37

Replacing power supplies...................................................................................... 2–39

Cold-swap...................................................................................................... 2–39

Removing the power supply..................................................................2–39

Installing the new power supply............................................................ 2–40

Asynchronous swap method........................................................................... 2–41

Replacing storage devices..................................................................................... 2–42

Asynchronous disk drive swap ....................................................................... 2–42

Disk drive replacement procedure (3.5, 5.25-inch drives)............................... 2–42

Replacing tape drives............................................................................................ 2–44

Tape drive replacement procedure..................................................................2–44

Replacing solid-state disk and CD-ROM drives .................................................... 2–45

Solid-state disk and CD-ROM drive replacement procedure .................. 2–45

Replacing SCSI host cables................................................................................... 2–47

Replacing the SCSI host cables...................................................................... 2–47

Replacing SCSI device port cables........................................................................ 2–49

Replacing the device port cables....................................................................2–49

3 Installing and Upgrading

Introduction............................................................................................................3–2

Upgrading Array Controller software...................................................................... 3–3

Program card upgrade (single controller configuration).................................... 3–3

Program card upgrade (dual-redundant configuration)...................................... 3–4

Upgrading controller software using the CLCP utility.............................................3–5

Invoking the CLCP utility................................................................................ 3–5

Code load methods........................................................................................... 3–5

Single controller upgrade method..................................................................... 3–6

Host port upgrade............................................................................................. 3–7

Host download script requirements ......................................................... 3–8

Preparing the software image.................................................................. 3–8

Setting up the host...................................................................................3–9

Write enable the program card in the controller ...................................... 3–9

Running the CLCP utility...................................................................... 3–10

Maintenance terminal port upgrade................................................................ 3–13

System setup......................................................................................... 3–14

Write enable the program card in the controller .................................... 3–16

Running the CLCP utility ..................................................................... 3–16

The dual-redundant, sequential upgrade method ............................................ 3–19

Special considerations for the sequential code load upgrade method .................. 3–19

Sequential upgrade procedure........................................................................ 3–21

The dual-redundant concurrent code load upgrade method .................. 3–21

Considerations for the concurrent code load upgrade method .................. 3–22

Concurrent code load upgrade procedure ....................................................... 3–24

Patching controller software ................................................................................. 3–25

Code patch considerations.............................................................................. 3–26

Listing patches............................................................................................... 3–26

Installing a patch............................................................................................ 3–28

Code patch messages ..................................................................................... 3–30

Formatting disk drives.......................................................................................... 3–32

Considerations for formatting disk drives....................................................... 3–33

Installing new firmware on a device ..................................................................... 3–35

Considerations for installing new device firmware......................................... 3–36

HSUTIL abort codes...................................................................................... 3–37

HSUTIL messages ......................................................................................... 3–37

Installing a controller and cache module in a single controller configuration .................. 3–41

Installing a second controller and cache module ................................................... 3–45

Installing a write-back cache module.................................................................... 3–48

Removing the controller ....................................................................... 3–48

Installing the write-back cache module................................................. 3–49

Adding Cache Memory......................................................................................... 3–50

Installing SIMM Cards................................................................................... 3–50

Installing power supplies ...................................................................................... 3–53

Power supply and shelf LED status indicators................................................ 3–53

Power supply installation procedure............................................................... 3–56

Installing storage building blocks.......................................................................... 3–57

SBB activity and fault indicators.................................................................... 3–58

Installing SBBs (except solid state disk and CD-ROM).................................. 3–60

Installing a solid state disk or CD-ROM......................................................... 3–60

4 Moving storagesets and devices

Precautions for retaining data.................................................................................. 4–2

Moving storagesets ................................................................................................. 4–3

Moving storageset members.................................................................................... 4–6

Moving a single disk-drive unit............................................................................... 4–8

Moving a tape drive, CD-ROM drive, or tape loader............................................... 4–9

5 Removing

Removing a patch...................................................................................................5–2

Removing a controller and cache module................................................................ 5–5

Removing storage devices....................................................................................... 5–6

Removing disk drives....................................................................................... 5–6

Removing solid state disks and CD-ROM drives.............................................. 5–7

Removing tape drives....................................................................................... 5–8

Appendix A

Instance codes and definitions................................................................................A-2

Last fail codes...................................................................................................... A-42

Repair action codes.............................................................................................. A-91

Glossary

Index

Figures

Figure 2–1 Connecting a maintenance terminal.....................................................2–4

Figure 2–2 Disconnecting the trilink connector..................................................... 2–6

Figure 2–3 Removing the program card................................................................ 2–8

Figure 2–4 Disconnecting the battery cable and disabling the ECB .................. 2–9

Figure 2–5 Removing controllers and cache modules.......................................... 2–10

Figure 2–6 Installing controllers and cache modules........................................... 2–13

Figure 2–7 Disabling the ECB ............................................................................ 2–19

Figure 2–8 Installing the program card................................................................ 2–21

Figure 2–9 Removing the power supply............................................................. 2–38

Figure 2–10 Power supply fault indicators.......................................................... 2–39

Figure 2–11 Removing a disk drive .................................................................... 2–41

Figure 2–12 Default indicators for 3.5- and 5.25-inch SBBs ............................... 2–42

Figure 2–13 OCP LED patterns .......................................................................... 2–43

Figure 2–14 Removing the CD-ROM drive......................................................... 2–44

Figure 2–15 Disconnecting the SCSI host cable.................................................. 2–46

Figure 2–16 Removing the volume shield........................................................... 2–48

Figure 2–17 Access to the SCSI cables............................................................... 2–49

Figure 3–1 Single controller code load method..................................................... 3–7

Figure 3–2 Host port code load operation.............................................................. 3–8

Figure 3–3 Write enable the program card.......................................................... 3–10

Figure 3–4 Terminal port code load operation..................................................... 3–13

Figure 3–5 Binary transfer protocol selection...................................................... 3–15

Figure 3–6 The sequential upgrade method......................................................... 3–22

Figure 3–7 The concurrent upgrade method........................................................ 3–25

Figure 3–8 Installing new firmware on a disk or tape drive................................. 3–37

Figure 3–9 Installing an SBB battery module...................................................... 3–44

Figure 3–10 Installing controller power supplies................................................. 3–44

Figure 3–11 Installing a single controller (SW800 cabinet)................................. 3–45

Figure 3–12 Cache configurations for cache Version 3 ....................................... 3–53

Figure 3–13 Installing a power supply ................................................................ 3–58

Figure 3–14 Typical 3.5-inch and 5.25-inch disk drive SBBs.............................. 3–59

Figure 3–15 Typical 5.25-inch CD-ROM SBB.................................................... 3–60

Figure 3–16 Typical 3.5-inch tape drive SBB ..................................................... 3–60

Figure 4–1 Moving a storageset from one subsystem to another .................. 4–3

Figure 4–2 Moving storageset members................................................................ 4–6

Figure 5–1 Removing a 3.5-inch disk drive........................................................... 5–7

Figure 5–2 OCP LED patterns .............................................................................. 5–8

Tables

Table 1–1 Solid controller LED codes.................................................................. 1–3

Table 1–2 Flashing controller LED codes............................................................. 1–4

Table 1–3 DILX data patterns............................................................................. 1–57

Table 1–4 VTDPY control keys.......................................................................... 1–59

Table 1–5 VTDPY commands............................................................................ 1–60

Table 2–1 Required tools...................................................................................... 2–3

Table 2–2 ECB status indicators ......................................................................... 2–16

Table 2–3 ECB status indicators......................................................................... 2–26

Table 2–4 ECB status indicators ......................................................................... 2–36

Table 3–1 Abort codes........................................................................................ 3–39

Table 3–2 SCSI ID Slots..................................................................................... 3–43

Table 3–3 ECB status indicators ......................................................................... 3–46

Table 3–4 Adding cache memory capacity.......................................................... 3–53

Table 3–5 Power supply status indicators -- SW300 cabinet................................ 3–55

Table 3–6 Shelf and single power supply status indicators -- SW500, SW800 cabinets .................. 3–56

Table 3–7 Shelf and dual power supply status indicators -- SW500, SW800 cabinets .................. 3–57

Table 3–8 Storage SBB Status Indicators............................................................ 3–62

Table A–1 Instance codes ....................................................................................A-2

Table A–2 Executive services last failure codes..................................................A-42

Table A–3 Value-added services last failure codes.............................................. A-46

Table A–4 Device services last failure codes ...................................................... A-56

Table A–5 Fault manager last failure codes.........................................................A-64

Table A–6 Common library last failure codes..................................................... A-67

Table A–7 DUART services last failure codes.................................................... A-67

Table A–8 Failover control last failure codes......................................................A-68

Table A–9 Nonvolatile parameter memory failover control last failure codes .................. A-69

Table A–10 Facility lock manager last failure codes........................................... A-71

Table A–11 Integrated logging facility last failure codes .................................... A-72

Table A–12 CLI last failure codes.......................................................................A-72

Table A–13 Host interconnect services last failure codes.................................... A-74

Table A–14 SCSI host interconnect services last failure codes .................. A-76

Table A–15 Host interconnect port services last failure codes .................. A-77

Table A–16 Disk and tape MSCP server last failure codes.................................. A-80

Table A–17 Diagnostics and utilities protocol server last failure codes .................. A-84

Table A–18 System communication services directory last failure code .................. A-85

Table A–19 SCSI host value-added services last failure codes .................. A-85

Table A–20 Disk inline exerciser (DILX) last failure codes................................A-86

Table A–21 Tape inline exerciser (TILX) last failure codes................................ A-87

Table A–22 Device configuration utilities (CONFIG/CFMENU) last failure codes .................. A-89

Table A–23 Clone unit utility (CLONE) last failure codes..................................A-89

Table A–24 Format and device code load utility (HSUTIL) last failure codes .................. A-89

Table A–25 Code load/code patch utility (CLCP) last failure codes .................. A-90

Table A–26 Induce controller crash utility (CRASH) last failure codes .................. A-90

Table A–27 Repair action codes ......................................................................... A-91

Related documents

The following table lists documents that contain information related to this product.

Document title                                                                          Part number

DECevent Installation Guide                                                             AA–Q73JA–TE
StorageWorks BA350–MA Controller Shelf User's Guide                                     EK–350MA–UG
StorageWorks Configuration Manager for DEC OSF/1 Installation Guide                     AA–QC38A–TE
StorageWorks Configuration Manager for DEC OSF/1 System Manager's Guide for HSZterm     AA–QC39A–TE
StorageWorks Solutions Configuration Guide                                              EK–BA350–CG
StorageWorks Solutions Shelf and SBB User's Guide                                       EK–BA350–UG
StorageWorks Solutions SW300-Series RAID Enclosure Installation and User's Guide        EK–SW300–UG
StorageWorks SW500-Series Cabinet Installation and User's Guide                         EK–SW500–UG
StorageWorks SW800-Series Data Center Cabinet Installation and User's Guide             EK–SW800–UG
The RAIDBOOK—A Source for RAID Technology                                               RAID Advisory Board
Polycenter Console Manager User's Guide                                                 Computer Associates
VAXcluster Systems Guidelines for VAXcluster System Configurations                      EK–VAXCS–CG
16-Bit SBB User’s Guide                                                                 EK-SBB16-UG
7-Bit SBB Shelf (BA356 Series) User’s Guide                                             EK-BA356-UG
SBB User’s Guide                                                                        EK-SBB35-UG

Troubleshooting

Interpreting controller LED codes

Troubleshooting controllers

Using FMU to describe event log codes

Testing disk drives

Monitoring subsystem performance

HSZ50 Array Controller Service Manual

1–2 Troubleshooting

Introduction

This chapter is designed to help you quickly isolate the source of any problems you might encounter when you service the StorageWorks HSZ50 controllers, and take the necessary steps to correct the problems.

Interpreting controller LED codes

This section provides information on how to interpret controller LED codes. The operator control panel (OCP) on each HSZ controller contains a green reset LED and six device bus LEDs. These LEDs light in patterns to display codes when there is a problem with a device configuration, a device, or a controller.

• During normal operation, the green reset LED on each controller flashes once per second, and the device bus LEDs are not lit.

• The amber LED for a device bus lights continuously when the installed devices do not match the controller configuration, or when a device fault occurs.

• The green reset LED lights continuously and the amber LEDs display a code when a controller problem occurs. Solid LED codes indicate a fault detected by internal diagnostic and initialization routines. Flashing LED codes indicate a fault that occurred during core diagnostics.
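The three cases above amount to a small decision procedure. The following Python sketch is illustrative only and is not part of HSOF or any DIGITAL tool; the function name `classify_ocp_state` and the string encoding of LED states (`"on"`, `"off"`, `"flashing"`) are assumptions made for this example.

```python
def classify_ocp_state(reset_led, bus_leds):
    """Classify an OCP reading using the three cases described above.

    reset_led: "on", "off", or "flashing" (hypothetical encoding).
    bus_leds:  list of six strings, one per device bus LED, same encoding.
    """
    if len(bus_leds) != 6:
        raise ValueError("expected six device bus LED states")

    # Normal operation: reset LED flashes, no device bus LED is lit.
    if reset_led == "flashing" and all(s == "off" for s in bus_leds):
        return "normal operation"

    # Controller problem: reset LED lit continuously, amber LEDs show a code.
    if reset_led == "on":
        if any(s == "flashing" for s in bus_leds):
            return "controller fault (flashing code, see Table 1-2)"
        return "controller fault (solid code, see Table 1-1)"

    # Device bus LED lit continuously: configuration mismatch or device fault.
    if any(s == "on" for s in bus_leds):
        return "device or configuration fault on a device bus"

    return "unrecognized pattern"
```

For example, `classify_ocp_state("flashing", ["off"] * 6)` reports normal operation, while a steadily lit reset LED always routes to one of the two controller-fault tables.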

Look up the LED code that is showing on your controller in Table 1–1 or Table 1–2 to determine its meaning and find the corrective action. The symbols used in the tables have the following meanings:

O   LED on

P   LED off

M   LED flashing

Table 1–1 Solid controller LED codes

Code           Description of Error                                         Corrective Action

O O O O O O O  DAEMON hard error                                            Replace controller module.
O O O O O O P  Repeated firmware bugcheck                                   Replace controller module.
O O O O O P O  NVMEM version mismatch                                       Replace program card with later version of firmware.
O O O O O P P  NVMEM write error                                            Replace controller module.
O O O O P O O  NVMEM read error                                             Replace controller module.
O O O O P O P  NMI error within firmware bugcheck                           Reset the controller.
O O O O P P O  Inconsistent NVMEM structures repaired                       Reset the controller.
O O O O P P P  Bugcheck with no restart                                     Reset the controller.
O O O P O O O  Firmware induced restart following bugcheck failed to occur  Replace controller module.
O O O P O O P  Hardware induced restart following bugcheck failed to occur  Replace controller module.
O O O P O P O  Bugcheck within bugcheck controller                          Reset controller module.
O O O P P O O  NVMEM version is too low                                     Verify the card is the latest revision. If the problem still exists, replace the module.
O O O P P O P  Program card write fail                                      Replace the card.
O O O P P P O  ILF, INIT unable to allocate memory                          Reset the controller.
O O O P P P P  Bugcheck before subsystem initialization completed           Reset the controller.
O P P P P P P  No program card seen                                         Try the card in another module. If the problem follows the card, replace the card. Otherwise, replace the controller.


Table 1–2 Flashing controller LED codes

Code            Description of Error                            Corrective Action

O P P P P P M   Program card EDC error                          Replace program card.
O P P P M P P   Timer zero in the timer chip will run when      Replace controller module.
                disabled
O P P P M P M   Timer zero in the timer chip decrements         Replace controller module.
                incorrectly
O P P P M M P   Timer zero in the timer chip did not interrupt  Replace controller module.
                the processor when requested
O P P P M M M   Timer one in the timer chip decrements          Replace controller module.
                incorrectly
O P P M P P P   Timer one in the timer chip did not interrupt   Replace controller module.
                the processor when requested
O P P M P P M   Timer two in the timer chip decrements          Replace controller module.
                incorrectly
O P P M P M P   Timer two in the timer chip did not interrupt   Replace controller module.
                the processor when requested
O P P M P M M   Memory failure in the I/D cache                 Replace controller module.
O P P M M P P   No hit or miss to the I/D cache when expected   Replace controller module.
O P P M M P M   One or more bits in the diagnostic registers    Replace controller module.
                did not match the expected reset value
O P P M M M P   Memory error in the nonvolatile journal SRAM    Replace controller module.
O P P M M M M   Wrong image seen on program card                Replace program card.
O P M P P P P   At least one register in the controller DRAB    Replace controller module.
                does not read as written
O P M P P P M   Main memory is fragmented into too many         Replace controller module.
                sections for the number of entries in the good
                memory list
O P M P P M P   The controller DRAB or DRAC chip does not       Replace controller module.
                arbitrate correctly


Code            Description of Error                            Corrective Action

O P M P P M M   The controller DRAB or DRAC chip failed to      Replace controller module.
                detect forced parity, or detected parity when
                not forced
O P M P M P P   The controller DRAB or DRAC chip failed to      Replace controller module.
                verify the EDC correctly
O P M P M P M   The controller DRAB or DRAC chip failed to      Replace controller module.
                report forced ECC
O P M P M M P   The controller DRAB or DRAC chip failed some    Replace controller module.
                operation in the reporting, validating, and
                testing of the multibit ECC memory error
O P M P M M M   The controller DRAB or DRAC chip failed some    Replace controller module.
                operation in the reporting, validating, and
                testing of the multiple single-bit ECC memory
                error
O P M M P P P   The controller main memory did not write        Replace controller module.
                correctly in one or more sized memory
                transfers
O P M M P P M   The controller did not cause an I-to-N bus      Replace controller module.
                timeout when accessing a “reset” host port
                chip
O P M M P M P   The controller DRAB or DRAC chip did not        Replace controller module.
                report an I-to-N bus timeout when accessing a
                “reset” host port chip
O P M M P M M   The controller DRAB or DRAC chip did not        Replace controller module.
                interrupt the controller processor when
                expected
O P M M M P P   The controller DRAB or DRAC chip did not        Replace controller module.
                report an NXM error when nonexistent memory
                was accessed


Code            Description of Error                            Corrective Action

O P M M M P M   The controller DRAB or DRAC chip did not        Replace controller module.
                report an address parity error when one was
                forced
O P M M M M P   There was an unexpected nonmaskable interrupt   Replace controller module.
                from the controller DRAB or DRAC chip during
                the DRAB memory test
O P M M M M M   Diagnostic register indicates there is no       Replace controller module.
                cache module, but an interrupt exists from the
                non-existent cache module
O M P P P P P   The required amount of memory available for     Replace controller module.
                the code image to be loaded from the program
                card is insufficient
O M P P P P M   The required amount of memory available in      Replace controller module.
                the pool area is insufficient for the
                controller to run
O M P P P M M   The required amount of memory available in      Replace controller module.
                the buffer area is insufficient for the
                controller to run
O M P P M P P   The code image was not the same as the image    Replace controller module.
                on the card after the contents were copied to
                memory
O M P P M P M   Diagnostic register indicates that the cache    Replace controller shelf
                module does not exist, but access to that       backplane.
                cache module caused an error
O M P P M M P   Diagnostic register indicates that the cache    Replace controller shelf
                module does not exist, but access to that       backplane.
                cache module did not cause an error
O M M P P P P   The journal SRAM battery is bad                 Replace controller module.


Code            Description of Error                            Corrective Action

O M M M P M P   There was an unexpected interrupt from a read   Replace controller module.
                cache or the present and lock bits are not
                working correctly
O M M M P M M   There is an interrupt pending on the            Replace controller module.
                controller’s policy processor when there
                should be none
O M M M M P P   There was an unexpected fault during            Replace controller module.
                initialization
O M M M M P M   There was an unexpected maskable interrupt      Replace controller module.
                received during initialization
O M M M M M P   There was an unexpected nonmaskable interrupt   Replace controller module.
                received during initialization
O M M M M M M   An illegal process was activated during         Replace controller module.
                initialization
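The table lookup described above can be sketched in a few lines of Python. This is only an illustration, not part of the controller firmware or any DIGITAL tool: the function name `classify_led_code` is hypothetical, only a handful of rows from Table 1–1 and Table 1–2 are included, and it assumes that the letters O, P, and M in the tables stand for LED on, LED off, and LED flashing, read left to right starting with the reset LED.

```python
# Excerpt of Table 1-1 (solid codes). Keys are the seven LED states,
# reset LED first, with spaces removed. Assumed: O = on, P = off.
SOLID_CODES = {
    "OOOOOOO": ("DAEMON hard error", "Replace controller module."),
    "OOOOOOP": ("Repeated firmware bugcheck", "Replace controller module."),
    "OPPPPPP": ("No program card seen", "Try the card in another module."),
}

# Excerpt of Table 1-2 (flashing codes). Assumed: M = flashing.
FLASHING_CODES = {
    "OPPPPPM": ("Program card EDC error", "Replace program card."),
    "OMMMMMM": ("An illegal process was activated during initialization",
                "Replace controller module."),
}


def classify_led_code(pattern):
    """Return (table, entry) for a pattern such as 'O P P P P P M'.

    table is 'solid' or 'flashing'; entry is a (description, action)
    tuple, or None when the pattern is not in this excerpt.
    """
    leds = "".join(pattern.upper().split())
    if len(leds) != 7 or set(leds) - set("OPM"):
        raise ValueError("expected seven LED states made of O, P, or M")
    if "M" in leds:
        return "flashing", FLASHING_CODES.get(leds)
    return "solid", SOLID_CODES.get(leds)
```

A pattern that contains any flashing LED is looked up in the flashing table; everything else is treated as a solid code. Extending the two dictionaries with the remaining rows would make the lookup complete.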

Troubleshooting HSZ50 controllers

This section covers the following topics:

• Troubleshooting when you cannot access HSZ units

• Troubleshooting on DIGITAL UNIX

• VMS host troubleshooting

• Troubleshooting application errors

Troubleshooting when you cannot access host units

If the error prevents you from accessing units from the host, determine whether any HSZ units can be accessed. If no HSZ units can be accessed, run the VTDPY display and ensure that the host has established communications with all HSZ target IDs. Refer to the section later in this chapter, “Monitoring system performance with the VTDPY utility,” for more information about running VTDPY. If the host has not established communications, one of the following might be true:

• The host adapter is bad.

• The host SCSI bus is bad or misconfigured.

• The HSZ controller is bad.

To find more information about this error, use the following procedure from the HSZ console. (If this is a dual controller configuration, the command must be executed on both controllers.)

1. To determine if the unit is on-line to a controller:

HSZ50> SHOW UNITS FULL

2. Check the following:

– Is the unit on-line or available to this or the other controller?

– From the HSZ controller to which the unit is on-line, does the SHOW UNITS command also show the size in blocks?

3. If the answer to both of the questions in step 2 is no, there is a problem with the HSZ controller. Look for any type of errors in the SHOW UNITS output, such as Lost Data or Media Format.

4. Run the VTDPY display.

5. Look at the unit status in the VTDPY display. Use the information in a later section in this chapter, “Monitoring System Performance with the VTDPY Utility,” to interpret the VTDPY display.

6. If the unit is not on-line or if errors are present in the SHOW UNITS display, take appropriate action to clear the errors or rebuild the unit.

Be careful with the user’s data. If this is a RAIDset, try to save the user’s data. Do not initialize the storage unit unless there is no other alternative.
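The checks in steps 2, 3, and 6 can be summarized as a small decision sketch in Python. It is only a reading aid under stated assumptions: the function `assess_unit` and its arguments are hypothetical, and the values are ones a technician would read off the SHOW UNITS FULL display by hand, not something the controller reports in this form.

```python
def assess_unit(online, size_shown, errors=()):
    """Suggest next actions for one unit.

    online     -- True if the unit is on-line or available to a controller
    size_shown -- True if SHOW UNITS also shows the size in blocks
    errors     -- error states seen in the display, e.g. ["Lost Data"]
    """
    actions = []
    if not online and not size_shown:
        # Step 3: both answers are "no".
        actions.append("Suspect the HSZ controller")
    if errors:
        # Step 6: errors present in the SHOW UNITS display.
        actions.append("Clear reported errors: " + ", ".join(errors))
    if not online:
        # Step 6, with the caution about preserving user data.
        actions.append("Bring the unit on-line or rebuild it; avoid initializing it")
    if not actions:
        actions.append("HSZ side looks in order; check the host side")
    return actions
```

For example, `assess_unit(True, True)` points to the host side, while `assess_unit(False, False)` lists the controller as the first suspect.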

If you determine that units are on-line and everything seems to be in order on the HSZ side, proceed to check the host side using the file utility procedure.

Troubleshooting on a DIGITAL UNIX system

To troubleshoot on a DIGITAL UNIX system, use the file utility to access the device. The error message from the file utility might explain where the problem lies.


Using the DIGITAL UNIX file utility

You can use the DIGITAL UNIX file utility to determine if an HSZ unit can be accessed from the DIGITAL UNIX host system. In the following procedure, an HSZ controller has a unit named D101, which will be used by the file utility.

1. Enter the following command from the HSZ CLI:

HSZ50>SHOW D101

2. Disable the writeback_cache and read_cache on this unit, if they are both enabled, using the following commands:

HSZ50>SET D101 nowriteback_cache
HSZ50>SET D101 noread_cache

or disable just the read_cache, if it is enabled on the unit, with the following command:

HSZ50>SET D101 noread_cache

Disabling the read_cache causes information to be accessed from the unit rather than from the cache, if the information is in cache. This gives a visual indication that the unit is being accessed.

3. From the DIGITAL UNIX console, issue the file command to start the file utility. (Assume that the character special file has been created for rrzb17a.)

/usr/bin/file /dev/rrzb17a

The device activity indicator on the device, the green light, should light up. If the unit is a multidevice storage unit, only one of the devices that is part of that storage unit lights.

The host system should display the following output after the file command is issued (the output displays on one line):

/dev/rrzb17a character special (8/mmmm) SCSI # n HSZ50 disk #xxx (SCSI ID #t)

The output values have the following meanings:

• 8 - major number

• mmmm - minor number

• n - SCSI host side bus number

• t - target ID as used in the HSZ50 unit DTZL, where the “T” in the DTZL HSZ50 unit matches the “t” from the file command

• xxx - the disk number

4. If an error occurs, use the information in the following table to evaluate errors or output:

Error or Output                                   Meaning and action

file: Cannot get file status on /dev/mmmm         Indicates the special file in the /dev directory
/dev/mmmm: Cannot open for reading                that matches mmmm does not exist.

Only the major and minor number is returned       The device is not answering, or the device special
from the file command                             file does not have the correct minor number. Check
                                                  the minor number to be sure that it matches the
                                                  correct SCSI host side bus number and the correct
                                                  HSZ50 Target ID and LUN from the HSZ50 unit
                                                  designator.

5. If the unit had write-back cache enabled, remember to enable the cache again using the following HSZ CLI command that enables both the write-back and read cache:

HSZ50> SET D101 WRITEBACK_CACHE

6. If the unit had only the read cache enabled, enable the read cache with this HSZ50 CLI command:

HSZ50> SET D101 READ_CACHE

7. Run VTDPY to ensure the host established communication with all HSZ target IDs.
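The one-line output described in step 3 lends itself to a quick automated check. The following Python sketch parses that line into its fields. The function name `parse_file_output` is hypothetical, the pattern is written from the single output format shown in step 3, and the numbers in the sample line (minor number, bus, disk, and target) are made up for illustration.

```python
import re

# Pattern for: /dev/rrzb17a character special (8/mmmm) SCSI # n HSZ50 disk #xxx (SCSI ID #t)
FILE_OUTPUT = re.compile(
    r"^(\S+)\s+character special\s+\((\d+)/(\d+)\)"
    r"\s+SCSI\s*#\s*(\d+)\s+HSZ50 disk\s*#\s*(\d+)"
    r"\s+\(SCSI ID\s*#\s*(\d+)\)\s*$"
)


def parse_file_output(line):
    """Return the fields of a complete file-utility line, or None.

    None covers the case where only the major and minor numbers
    come back, which the table in step 4 treats as a problem.
    """
    match = FILE_OUTPUT.match(line.strip())
    if match is None:
        return None
    path, major, minor, bus, disk, target = match.groups()
    return {
        "path": path,
        "major": int(major),
        "minor": int(minor),
        "bus": int(bus),        # n: SCSI host side bus number
        "disk": int(disk),      # xxx: disk number
        "target": int(target),  # t: matches the T in the unit DTZL
    }


# Hypothetical sample line; the numeric values are illustrative only.
sample = "/dev/rrzb17a character special (8/17408) SCSI # 2 HSZ50 disk #17 (SCSI ID #1)"
fields = parse_file_output(sample)
```

Comparing `fields["target"]` against the T digit of the unit name gives a quick way to confirm that the special file points at the intended HSZ50 target.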

OpenVMS host troubleshooting

If you cannot access HSZ units from the host on an OpenVMS system, use the following procedure to troubleshoot:

1. On the VMS system, enter the following command.

$ SHOW DEVICE DK*

Device names will display in the following format: DKA101

The A in the device name represents a SCSI controller designation and the 101 represents a unit number on an HSZ or other SCSI controller. If there was an HSZ unit named D101 on the HSZ whose letter designation was A, that would be the VMS device DKA101.

If there are multiple SCSI controllers, there would be a different controller letter designation, for example DKA, DKB, and so forth.

The SHOW DEVICE FULL command also would give the controller type. If the device was configured on an HSZ controller, HSZ would appear in the device information.

2. The SHOW DEVICE DK* command should display the HSZ unit. If the unit is not displayed, follow the procedures in the previous section to determine if the unit is on-line.

3. If the unit is on-line to the HSZ, run the SYSMAN utility on the VMS system to ensure the device is configured.

$ MC SYSMAN
SYSMAN> IO AUTOCONFIGURE
SYSMAN> EXIT

4. If you still cannot see the unit, check the error logs for SCSI errors. The problem could be due to a bad host adapter, SCSI bus problem, or the HSZ.

5. Use the VTDPY display to ensure the host adapter established connectivity to all HSZ target IDs. The host port portion of the VTDPY display should show all HSZ target IDs, and the rate should be 10 MHz.
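The device naming described in step 1 (DKA101 corresponds to unit D101 behind SCSI controller A) can be expressed as a short Python sketch. The function name `split_dk_name` is hypothetical, and the sketch assumes plain DKxnnn names with an optional trailing colon; names carrying other prefixes are not handled.

```python
import re

DK_NAME = re.compile(r"^DK([A-Z])(\d+)$")


def split_dk_name(device):
    """Split a name such as 'DKA101' into controller letter and unit.

    Returns (controller_letter, unit_number, hsz_unit_name),
    or None when the name is not of the form DKxnnn.
    """
    match = DK_NAME.match(device.strip().upper().rstrip(":"))
    if match is None:
        return None
    letter, number = match.group(1), int(match.group(2))
    # The HSZ unit behind DKA101 is the one named D101.
    return letter, number, "D%d" % number
```

For example, `split_dk_name("DKA101")` returns `("A", 101, "D101")`, which is the unit name to look for in the SHOW UNITS display on the HSZ.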

Troubleshooting application errors

Application errors can be categorized into three different types: device errors, controller errors, and host adapter errors. For each of these error types, you should check the log entries for key pieces of information. The important information for each error example is described in the following sections.

Locating a device error

This section contains an example of a DECevent error log for a device event or error. You should be able to locate the following important details in the DECevent error log when a device event occurs. Note that if the controller ASC and ASCQ are zero, the device generated the error. Also note the Generic String message, BBR disabled bad block number: 230262. This message is always generated and is a generic message for a device software error. Check the device ASC and ASCQ.

The following important information is highlighted in the example:

• Unit Information, Port-Target-LUN

• Generic String message. This message is always generated and is a generic message for a device software error. You should check the ASC and ASCQ.

• CAM Status

• SCSI Status

• Command Information

• Most Recent ASC and ASCQ

• Device Information, Port-Target-LUN

• Controller ASC and ASCQ

• LBN

• Device ASC and ASCQ
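The rule stated earlier in this section, that a controller ASC and ASCQ of zero means the device generated the error, can be written as a small Python check. The function names are hypothetical, the input is assumed to be the hexadecimal notation DECevent prints (for example x00), and treating any nonzero controller value as a controller-side event is an assumption of this sketch rather than a statement from the log format.

```python
def parse_hex_field(text):
    """Convert a DECevent-style value such as 'x3F' or 'x00' to an int."""
    return int(text.strip().lstrip("xX"), 16)


def error_origin(controller_asc, controller_ascq):
    """Return 'device' when both controller values are zero.

    Any nonzero controller ASC or ASCQ is reported as 'controller';
    that second branch is an assumption made for this sketch.
    """
    asc = parse_hex_field(controller_asc)
    ascq = parse_hex_field(controller_ascq)
    if asc == 0 and ascq == 0:
        return "device"
    return "controller"
```

When the result is `"device"`, the device ASC and ASCQ are the values to examine next.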

The “-i ios” qualifier used in the following command indicates that I/O subsystem log entries should be included; these entries include CAM events. The complete command syntax is:

#dia -i ios -t s:03-oct-1995, 10:47 e:03-oct-1995, 10:48

DECevent Log Example - Locating a Device Error

*************************ENTRY 4**************************

Logging OS                       2. DIGITAL UNIX
System Architecture              2. Alpha
Event sequence number            1.
Timestamp of occurrence          03-OCT-1995 10:47:59
Host name                        testsys

System type register             x00000004 DEC 3000
Number of CPUs (mpnum)           x00000001
CPU logging event (mperr)        x00000000

Event validity                   1. O/S claims event is valid
Event severity                   5. Low Priority
Entry type                       199. CAM SCSI Event Type

------- Unit Info -------
Bus Number                       2.
Unit Number                      x0090 Target = 2.
                                 LUN = 0.

------- CAM Data -------
Class                            x00 Disk
Subsystem                        x00 Disk
Number of Packets                10.

------ Packet Type ------        258. Module Name String

Routine Name                     cdisk_bbr_done

------ Packet Type ------        256. Generic String

                                 cdisk_bbr: BBR disabled bad block number: 230262


------ Packet Type ------        261. Soft Error String

Error Type                       Soft Error Detected (recovered)

------ Packet Type ------        257. Device Name String

Device Name                      DEC HSZ4

------ Packet Type ------        256. Generic String

                                 Active CCB at time of error

------ Packet Type ------        256. Generic String

                                 CCB request completed with an error

------ Packet Type ------        1. SCSI I/O Request CCB(CCB_SCSIIO)

Packet Revision                  37.

CCB Address                      xFFFFFC0007F9BB28
CCB Length                       x00C0
XPT Function Code                x01 Execute requested SCSI I/O
Cam Status                       x84 CCB Request Completed WITH Error
                                 Autosense Data Valid for Target
Path ID                          2.
Target ID                        2.
Target LUN                       0.
Cam Flags                        x00000482 SIM Queue Actions are Enabled
                                 Data Direction (10: DATA OUT)
                                 Disable the SIM Queue Frozen State
*pdrv_ptr                        xFFFFFC0007F9B828
*next_ccb                        x0000000000000000
*req_map                         xFFFFFC0007F8C200
void (*cam_cbfcnp)()             xFFFFFC00004AC8A0
*data_ptr                        x000000014000A1A0
Data Transfer Length             8192.
*sense_ptr                       xFFFFFC0007F9B850
Autosense Byte Length            160.
CDB Length                       6.
Scatter/Gather Entry Cnt         0.

SCSI Status                      x02 Check Condition

Autosense Residue Length         x00
Transfer Residue Length          x00000000
(CDB) Command & Data Buf

15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order

0000: 00000000 00000010 7083030A *...p.......*

Timeout Value                    x0000003C
*msg_ptr                         x0000000000000000
Message Length                   0.
Vendor Unique Flags              x4000
Tag Queue Actions                x20 Tag for Simple Queue


------ Packet Type ------        256. Generic String

                                 Error, exception, or abnormal condition

------ Packet Type ------        256. Generic String

                                 RECOVERED ERROR - Recovery action performed

------ Packet Type ------        768. SCSI Sense Data
Packet Revision                  0.

------- HSZ Data -------
Instance, Code                   x0328450A The disk device reported standard SCSI Sense Data.
                                 Component ID = Device Services.
                                 Event Number = x00000028
                                 Repair Action = x00000045
                                 NR Threshold = x0000000A
Template Type                    x51 Disk Transfer Error.
Template Flags                   x01 HCE = 1, Event occurred during Host Command Execution.
Ctrl Serial #                    ZG41800293
Ctrl Software Revision           V20Z
RAIDSET State                    x00 NORMAL. All members present and reconstructed, IF LUN is configured as a RAIDSET.
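Two hexadecimal fields in this log entry can be decoded by hand, and the Python sketch below shows one way to do it. The function names are hypothetical. The CDB decoding follows the standard SCSI WRITE(6) layout (operation code x0A, 21-bit logical block address, one-byte transfer length) and assumes the dump prints the highest-numbered byte on the left, as the Byte Order line suggests. The byte split of the instance code is inferred from this one entry, where x0328450A lines up with Event Number x28, Repair Action x45, and NR Threshold x0A; it is not taken from a format definition.

```python
def cdb_from_dump(words, cdb_length):
    """Rebuild the CDB bytes from the dump words, byte 0 first.

    The dump shows byte 0 at the far right, so the hex string is
    converted and then reversed.
    """
    raw = bytes.fromhex("".join(words))
    return raw[::-1][:cdb_length]


def decode_write6(cdb):
    """Return (starting_lbn, block_count) for a 6-byte WRITE CDB."""
    if len(cdb) != 6 or cdb[0] != 0x0A:
        raise ValueError("not a WRITE(6) CDB")
    lbn = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]
    blocks = cdb[4] or 256  # a length byte of 0 means 256 blocks
    return lbn, blocks


def split_instance_code(code):
    """Split a 32-bit instance code into four bytes (inferred layout)."""
    return {
        "component_id": (code >> 24) & 0xFF,
        "event_number": (code >> 16) & 0xFF,
        "repair_action": (code >> 8) & 0xFF,
        "nr_threshold": code & 0xFF,
    }


cdb = cdb_from_dump(["00000000", "00000010", "7083030A"], 6)
start_lbn, block_count = decode_write6(cdb)
instance = split_instance_code(0x0328450A)
```

Decoded this way, the command is a 16-block write starting at LBN 230256, which agrees with the logged Data Transfer Length of 8192 bytes at 512 bytes per block, and the reported bad block 230262 falls inside that range.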
