HP B2000 User Manual

hp StorageWorks
HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Part Number: EK–G80TR–SA. B01
Second Edition (August 2002)
Product Version: 8.7
This guide provides troubleshooting instructions for the HSG80 array controllers running array controller software (ACS) Versions
© Hewlett-Packard Company, 2002. All rights reserved. Hewlett-Packard Company makes no warranty of any kind with regard to this material, including, but
not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance, or use of this material.
This document contains proprietary information, which is protected by copyright. No part of this document may be photocopied, reproduced, or translated into another language without the prior written consent of Hewlett-Packard. The information contained in this document is subject to change without notice.
Microsoft, MS-DOS, Windows, and Windows NT are trademarks of Microsoft Corporation in the U.S. and/or other countries.
All other product names mentioned herein may be trademarks of their respective companies. Hewlett-Packard Company shall not be liable for technical or editorial errors or omissions contained
herein. The information is provided “as is” without warranty of any kind and is subject to change without notice. The warranties for Hewlett-Packard Company products are set forth in the express limited warranty statements accompanying such products. Nothing herein should be construed as constituting an additional warranty.
Printed in the U.S.A.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide Second Edition (August 2002) Part Number: EK–G80TR–SA. B01

Contents

About this Guide
Document Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Symbols in Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ix
Symbols on Equipment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Rack Stability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Getting Help. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
StorageWorks Technical Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
StorageWorks Website . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
StorageWorks Authorized Reseller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 Troubleshooting Information
Typical Installation Troubleshooting Checklist. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–1
Troubleshooting Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–3
Significant Event Reporting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–12
Reporting Events That Cause Controller Operation to Halt . . . . . . . . . . . . . . . . . 1–13
Flashing OCP Pattern Display Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–13
Solid OCP Pattern Display Reporting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–15
Last Failure Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–21
Reporting Events That Allow Controller Operation to Continue . . . . . . . . . . . . . 1–21
Spontaneous Event Log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–22
CLI Event Reporting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–22
Running the Controller Diagnostic Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–23
ECB Charging Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–23
Battery Hysteresis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–23
Caching Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–24
Read Caching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–24
Read-Ahead Caching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–25
Write-Through Caching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–25
Write-Back Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–26
Fault-Tolerance for Write-Back Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–26
HSG80 Array Controller V8.7 Troubleshooting Reference Guide iii
Contents
Nonvolatile Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–26
Cache Policies Resulting from Cache Module Failures . . . . . . . . . . . . . . . . . 1–27
Enabling Mirrored Write-Back Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–32
2 Utilities and Exercisers
Fault Management Utility (FMU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–1
Displaying Failure Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–2
Translating Event Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–3
Controlling the Display of Significant Events and Failures. . . . . . . . . . . . . . . . . . . 2–5
Video Terminal Display (VTDPY) Utility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–7
Restrictions with VTDPY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–7
Running VTDPY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–8
VTDPY Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–9
VTDPY Display Screens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–10
Default Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
Controller Status Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
Cache Performance Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–12
Device Performance Screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
Host Ports Statistics Screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–15
Resource Statistics Screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–17
Remote Status Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–17
Interpreting VTDPY Screen Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–18
Screen Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–19
Common Data Fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–20
Unit Performance Data Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–21
Device Performance Data Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–23
Device Port Performance Data Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–25
Host Port Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–26
TACHYON Chip Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–28
Runtime Status of Remote Copy Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–29
Device Port Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–31
Controller/Processor Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–32
Resource Performance Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–34
Disk Inline Exerciser (DILX). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–35
Checking for Unit Problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–35
Finding a Unit in the Subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–35
Testing the Read Capability of a Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–36
Testing the Read and Write Capabilities of a Unit . . . . . . . . . . . . . . . . . . . . . 2–37
iv HSG80 Array Controller V8.7 Troubleshooting Reference Guide
DILX Error Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–40
Format and Device Code Load Utility (HSUTIL). . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–40
Configuration (CONFIG) Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–42
Code Load and Code Patch (CLCP) Utility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–42
Clone (CLONE) Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–42
Field Replacement Utility (FRUTIL). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–43
Change Volume Serial Number (CHVSN) Utility. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–43
3 Event Reporting Templates
Passthrough Device Reset Event Sense Data Response . . . . . . . . . . . . . . . . . . . . . . . . 3–1
Last Failure Event Sense Data Response (Template 01) . . . . . . . . . . . . . . . . . . . . . . . . 3–2
Multiple-Bus Failover Event Sense Data Response (Template 04). . . . . . . . . . . . . . . . 3–4
Failover Event Sense Data Response (Template 05) . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–5
Nonvolatile Parameter Memory Component Event Sense Data Response (Template 11) . 3–7
Backup Battery Failure Event Sense Data Response (Template 12). . . . . . . . . . . . . . . 3–9
Subsystem Built-In Self-Test Failure Event Sense Data Response (Template 13) . . . 3–10
Memory System Failure Event Sense Data Response (Template 14) . . . . . . . . . . . . . 3–11
Device Services Nontransfer Error Event Sense Data Response (Template 41) . . . . . 3–13
Disk Transfer Error Event Sense Data Response (Template 51). . . . . . . . . . . . . . . . . 3–15
Data Replication Manager Services Event Sense Response (Template 90) . . . . . . . . 3–17
Contents
4 ASC/ASCQ, Repair Action, and Component Identifier Codes
Vendor Specific SCSI ASC/ASCQ Codes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–1
Recommended Repair Action Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–4
Component ID Codes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–11
5 Instance Codes
Instance Code Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–1
Instance Codes and FMU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–1
Notification/Recovery Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–2
Repair Action. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–2
Event Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–2
Component ID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–3
6 Last Failure Codes
Last Failure Code Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–1
Last Failure Codes and FMU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–1
HSG80 Array Controller V8.7 Troubleshooting Reference Guide v
Contents
Parameter Count. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–2
Restart Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–2
Hardware/Software Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–2
Repair Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–3
Error Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–3
Component ID Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–3
Glossary
Index
Figures
2–1 VTDPY commands and shortcuts generated from the Help command. . . . . . 2–10
2–2 Sample of the VTDPY default screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–11
2–3 Sample of the VTDPY status screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–12
2–4 Sample of the VTDPY cache screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–13
2–5 Sample of regions on the VTDPY device screen . . . . . . . . . . . . . . . . . . . . . . 2–14
2–6 Sample of the VTDPY host screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–16
2–7 Sample of the VTDPY resource screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–17
2–8 Sample of the VTDPY remote status screen (ACS version 8.7P only). . . . . . 2–18
5–1 Structure of an Instance Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–1
6–1 Structure of a Last Failure Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–1
Tables
1 Document Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1–1 Troubleshooting Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–3
1–2 Flashing OCP Pattern Displays and Repair Actions . . . . . . . . . . . . . . . . . . . . 1–13
1–3 Solid OCP Pattern Displays and Repair Actions. . . . . . . . . . . . . . . . . . . . . . . 1–16
1–4 ECB Capacity Based On Memory Size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–24
1–5 Cache Policies—Cache Module Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–27
1–6 Resulting Cache Policies—ECB Status. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1–29
2–1 Event Code Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–4
2–2 FMU SET Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–5
2–3 VTDPY Key Sequences and Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–8
2–4 VTDPY—Common Data Fields Column Definitions: Part 1 . . . . . . . . . . . . . 2–20
2–5 VTDPY—Common Data Fields Column Definitions: Part 2 . . . . . . . . . . . . . 2–21
2–6 VTDPY—Unit Performance Data Fields Column Definitions . . . . . . . . . . . . 2–22
2–7 VTDPY—Device Performance Data Fields Column Definitions. . . . . . . . . . 2–24
vi HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Contents
2–8 VTDPY—Device Port Performance Data Fields Column Definitions. . . . . . 2–25
2–9 Fibre Channel Host Status Display—Known Host Connections . . . . . . . . . . 2–26
2–10 Fibre Channel Host Status Display—Port Status . . . . . . . . . . . . . . . . . . . . . . 2–26
2–11 Fibre Channel Host Status Display—Link Error Counters. . . . . . . . . . . . . . . 2–27
2–12 First Digit on the TACHYON Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–28
2–13 Second Digit on the TACHYON Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–29
2–14 Remote Display Column Definitions— ACS Version 8.7P Only . . . . . . . . . 2–29
2–15 Device Map Column Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–31
2–16 Controller/Processor Utilization Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 2–32
2–17 VTDPY Thread Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–33
2–18 Resource Performance Statistics Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 2–34
2–19 DILX Control Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–37
2–20 Data Patterns for Phase 1: Write Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–37
2–21 DILX Error Codes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–40
2–22 HSUTIL Messages and Inquiries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2–40
3–1 Passthrough Device Reset Event Sense Data Response Format. . . . . . . . . . . . 3–2
3–2 Template 01—Last Failure Event Sense Data Response Format . . . . . . . . . . . 3–3
3–3 Template 04—Multiple-Bus Failover Event Sense Data Response Format. . . 3–4
3–4 Template 05—Failover Event Sense Data Response Format. . . . . . . . . . . . . . 3–6
3–5 Template 11—Nonvolatile Parameter Memory Component Event Sense Data
Response Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–8
3–6 Template 12—Backup Battery Failure Event Sense Data Response Format . . 3–9 3–7 Template 13—Subsystem Built-In Self Test Failure Event Sense Data Response
Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–10
3–8 Template 14—Memory System Failure Event Sense Data Response Format 3–12 3–9 Template 41—Device Services Non-Transfer Error Event Sense Data Response
Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–14
3–10 Template 51—Disk Transfer Error Event Sense Data Response Format. . . . 3–16
3–11 Template 90—Data Replication Manager Services Event Sense Data Response
Format (ACS Version 8.7P Only). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3–18
4–1 ASC and ASCQ Code Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–1
4–2 Recommended Repair Action Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–4
4–3 Component ID Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–11
5–1 Instance Code Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–1
5–2 Event Notification/Recovery (NR) Threshold Classifications . . . . . . . . . . . . . 5–2
5–3 Instance Codes and Repair Action Codes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5–4
6–1 Last Failure Code Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–2
6–2 Controller Restart Codes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–2
HSG80 Array Controller V8.7 Troubleshooting Reference Guide vii
Contents
6–3 Last Failure Codes and Repair Action Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 6–4
viii HSG80 Array Controller V8.7 Troubleshooting Reference Guide

Document Conventions

The conventions included in Table 1 apply in most cases.
Table 1: Document Conventions
Element Convention
Key names, menu items, buttons, and dialog box titles
File names and application names Italics User input, command names, system
responses (output and messages)
Variables Monospace, italic font Website addresses Sans serif font (http://www.compaq.com

Symbols in Text

About this Guide

Bold
Monospace font COMMAND NAMES are uppercase
unless they are case sensitive
)
These symbols may be found in the text of this guide. They have the following meanings.
WARNING: Text set off in this manner indicates that failure to follow directions in the warning could result in bodily harm or loss of life.
CAUTION: Text set off in this manner indicates that failure to follow directions could
result in damage to equipment or data.
IMPORTANT: Text set off in this manner presents clarifying information or specific instructions.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide ix
About this Guide
NOTE: Text set off in this manner presents commentary, sidelights, or interesting points of information.

Symbols on Equipment

Any enclosed surface or area of the equipment marked with these symbols indicates the presence of electrical shock hazards. Enclosed area contains no operator serviceable parts.
WARNING: To reduce the risk of injury from electrical shock hazards, do not open this enclosure.
Any RJ-45 receptacle marked with these symbols indicates a network interface connection.
WARNING: To reduce the risk of electrical shock, fire, or damage to the equipment, do not plug telephone or telecommunications connectors into this receptacle.
Any surface or area of the equipment marked with these symbols indicates the presence of a hot surface or hot component. Contact with this surface could result in injury.
WARNING: To reduce the risk of injury from a hot component, allow the surface to cool before touching.
Power supplies or systems marked with these symbols indicate the presence of multiple sources of power.
WARNING: To reduce the risk of injury from electrical shock, remove all power cords to completely disconnect power from the power supplies and systems.
x HSG80 Array Controller V8.7 Troubleshooting Reference Guide

Rack Stability

WARNING: To reduce the risk of personal injury or damage to the equipment, be sure that:
The leveling jacks are extended to the floor.
The full weight of the rack rests on the leveling jacks.
In single rack installations, the stabilizing feet are attached to the rack.
In multiple rack installations, the racks are coupled.
Only one rack component is extended at any time. A rack may become unstable if more than one rack component is extended for any reason.
About this Guide
Any product or assembly marked with these symbols indicates that the component exceeds the recommended weight for one individual to handle safely.
WARNING: To reduce the risk of personal injury or damage to the equipment, observe local occupational health and safety requirements and guidelines for manually handling material.

Getting Help

If you still have a question after reading this guide, contact service representatives or visit our website.
StorageWorks Technical Support
In North America, call StorageWorks technical support at 1-800-OK-COMPAQ, available 24 hours a day, 7 days a week.
NOTE: For continuous quality improvement, calls may be recorded or monitored.
Outside North America, call StorageWorks technical support at the nearest location. Telephone numbers for worldwide technical support are listed on the StorageWorks website: http://www.compaq.com
Be sure to have the following information available before calling:
Technical support registration number (if applicable)
Product serial numbers
HSG80 Array Controller V8.7 Troubleshooting Reference Guide xi
.
About this Guide
Product model names and numbers
Applicable error messages
Operating system type and revision level
Detailed, specific questions.
StorageWorks Website
The StorageWorks website has the latest information on this product, as well as the latest drivers. Access the StorageWorks website at: http://www.compaq.com/storage From this website, select the appropriate product or solution.

StorageWorks Authorized Reseller

For the name of your nearest StorageWorks Authorized Reseller:
In the United States, call 1-800-345-1518.
In Canada, call 1-800-263-5868.
Elsewhere, see the StorageWorks website for locations and telephone numbers.
.
xii HSG80 Array Controller V8.7 Troubleshooting Reference Guide

Troubleshooting Information

This chapter provides guidelines for troubleshooting the controller, cache module, and external cache battery (ECB). See enclosure documentation for information on troubleshooting enclosure hardware, such as the power supplies, cooling fans, and environmental monitoring unit (EMU).

Typical Installation Troubleshooting Checklist

The following checklist identifies many of the problems that occur in a typical installation. After identifying a problem, use Table 1–1 to confirm the diagnosis and fix the problem.
If an initial diagnosis points to several possible causes, use the tools described in this chapter and then those in Chapter 2 to further refine the diagnosis. If a problem cannot be diagnosed using the checklist and tools, contact a StorageWorks authorized service provider for additional support.
To troubleshoot the controller and supporting modules, complete the following:
1
1. Check the power to the enclosure and enclosure components.
Are power cords connected properly?
Is power within specifications?
2. Check the component cables.
Are bus cables to the controllers connected properly?
For BA370 enclosures, are ECB cables connected properly?
3. Check each program card to make sure the card is fully seated.
4. Check the operator control panel (OCP) and devices for LED codes. See “Flashing OCP Pattern Display Reporting” on page 1–13 and “Solid OCP
Pattern Display Reporting” on page 1–15, to interpret the LED codes.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–1
Troubleshooting Information
5. Connect a local terminal to the controller and check the controller configuration with the following command:
SHOW THIS_CONTROLLER FULL
Make sure that the ACS version loaded is correct and that pertinent patches are installed. Also, check the status of the cache module and the supporting ECB.
In a dual redundant configuration, check the “other controller” with the following command:
SHOW OTHER_CONTROLLER FULL
6. Use the fault management utility (FMU) to check for Last Failure or “memory system failure” entries.
Show these codes and translate the Last Failure Codes they contain. See Chapter 2, “Displaying Failure Entries” and “Translating Event Codes” sections.
If the controller failed to the extent that the controller cannot support a local terminal for FMU, check the host error log for the Instance or Last Failure Codes. See Chapter 5 and Chapter 6 to interpret the event codes.
7. Check device status with the following command:
SHOW DEVICES FULL
Look for errors such as “misconfigured device” or “No device at this PTL.” If a device reports misconfigured or missing, check the device status with the following command:
SHOW device-name
8. Check storageset status with the following command:
SHOW STORAGESETS FULL
Make sure that all storagesets are normal (or normalizing if the storageset is a RAIDset or mirrorset). Check again for misconfigured or missing devices using step 7.
9. Check unit status with the following command:
SHOW UNITS FULL
Make sure that all units are available or online. If the controller reports a unit as unavailable or offline, recheck the storageset the unit belongs to with the following command:
SHOW storageset-name
1–2 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
If the controller reports that a unit has lost data or is unwriteable, recheck the status of the devices that make up the storageset. If the devices are operating normally, recheck the status of the cache module. If the unit reports a media format error, recheck the status of the storageset and storageset devices.

Troubleshooting Table

After diagnosing a problem, use Table 1–1 to resolve the problem.
Table 1–1: Troubleshooting Guidelines (Sheet 1 of 10)
Symptom Possible Cause Investigation Remedy
Reset button not lit. No power to
subsystem.
Failed controller. If the previous
Reset button lit steadily; other LEDs also lit.
Various. See OCP LED Codes. Follow repair action
Check power to subsystem and power supplies on controller enclosure.
BA370 enclosure only: Make sure that all cooling fans are installed. If one or more fans are missing or all are inoperative for more than 8 minutes, the EMU shuts down the subsystem.
BA370 enclosure only: Determine if the standby power switch on the PVA was pressed for more than 5 seconds.
remedies fail to resolve the problem, check OCP LED codes.
Replace cord or (BA370 enclosure only) AC input box.
Turn off power switch on AC input box. Replace cooling fan. Restore power to subsystem.
Press the alarm control switch on the EMU.
Replace controller.
using Table 1–2.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–3
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 2 of 10)
Symptom Possible Cause Investigation Remedy
Reset button
FLASHING; other
LEDs also lit.
Device in error or failedset on corresponding
SHOW device FULL. Follow repair action
using Table 1–3.
device port with other LEDs lit.
Cannot set failover to create dual-redundant configuration.
Incorrect command syntax.
Different software versions on controllers.
See the controller CLI reference guide for the SET FAILOVER command.
Check software versions on both controllers.
Use the correct command syntax.
Update one or both controllers so that both are using the same software version.
Incompatible hardware.
Check hardware versions.
Upgrade controllers so that they are using compatible hardware.
Controller previously set for failover.
Make sure that neither controller is configured for failover.
Use the SET NOFAILOVER command on both controllers, then reset “this controller” for failover.
Failed controller. If the previous
remedies fail to resolve the problem,
Follow repair action using Table 1–2 or
Table 1–3. check for OCP LED codes.
1–4 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 3 of 10)
Symptom Possible Cause Investigation Remedy
Node ID is all zeros. SHOW_THIS to see if
node ID is all zeros.
Set node ID using the node ID (bar code) that is located on the frame in which the controller sits. See SET THIS_CONTROLLE R NODE_ID in the controller CLI reference guide. Also, be sure to copy in the right direction. If cabled to the new controller, use SET FAILOVER COPY= OTHER_CONTROL LER. If cabled to the old controller, use SET FAILOVER COPY=THIS_CONT ROLLER.
Nonmirrored cache: controller reports failed DIMM in Cache A or B.
Improperly installed DIMM.
Remove cache module and make sure that the DIMM is fully seated in the slot.
Failed DIMM. If the previous remedy
Reseat DIMM.
Replace DIMM. fails to resolve the problem, check for OCP LED codes.
Mirrored cache: “this controller” reports DIMM 1 or 2 failed in Cache A or B.
Improperly installed DIMM in “this controller” cache module.
Failed DIMM in “this controller” cache module.
Remove cache module and make sure that DIMMs are installed properly.
If the previous remedy fails to resolve the problem, check for
Reseat DIMM.
Replace DIMM in
“this controller”
cache module. OCP LED codes.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–5
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 4 of 10)
Symptom Possible Cause Investigation Remedy
Mirrored cache: “this controller” reports DIMM 3 or 4 failed in Cache A or B.
Improperly installed DIMM in “other controller” cache module.
Failed DIMM in “other controller” cache module.
Remove cache module and make sure that the DIMMs are installed properly.
If the previous remedy fails to resolve the problem, check for
Reseat DIMM.
Replace DIMM in “other controller” cache module.
OCP LED codes.
Mirrored cache: controller reports battery not present.
Memory module was installed before the cache module was connected to an ECB.
BA370 enclosure: ECB cable not connected to cache module.
Model 2200 enclosure: ECB not installed or seated
BA370 enclosure: Connect ECB cable to cache module, then restart both controllers by pushing their reset buttons simultaneously.
properly in backplane.
Model 2200 enclosure: install or reseat ECB.
Mirrored cache: controller reports cache or mirrored cache has failed.
Primary data and the mirrored copy data are not identical.
SHOW THIS_CONTROLLER indicates that the cache or mirrored cache has failed.
Spontaneous FMU message displays: “Primary cache declared failed - data inconsistent with mirror,” or “Mirrored cache declared failed
- data inconsistent
Enter the SHUTDOWN command on controllers that report the problem. (This command flushes the cache contents to synchronize the primary and mirrored data.) Restart the controllers that were shut down.
with primary.”
1–6 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 5 of 10)
Symptom Possible Cause Investigation Remedy
Invalid cache. Mirrored-cache
mode discrepancy. This discrepancy might occur after installing a new controller. The existing cache module is set for mirrored caching, but the new controller is set for unmirrored caching. This discrepancy might also occur if the new controller is set for mirrored
SHOW THIS_CONTROLLER indicates “invalid cache.”
Spontaneous FMU message displays: “Cache modules inconsistent with mirror mode.”
Connect a terminal
to the maintenance
port on the controller
reporting the error
and clear the error
with the following
command—all on
one line:
CLEAR_ERRORS
THIS_CONTROLLE
R INVALID_CACHE
NODESTROY_UNF
LUSHED_ DATA.
See the controller
CLI reference guide
for more information.
caching, but the existing cache module is not.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–7
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 6 of 10)
Symptom Possible Cause Investigation Remedy
Cache module might erroneously contain unflushed write-back data. This might occur after installing a new controller. The existing cache module might indicate that the cache module contains unflushed write-back data, but the new controller expects to find no data in the existing
SHOW THIS_CONTROLLER indicates “invalid cache.”
No spontaneous FMU message.
Connect a terminal to the maintenance port on the controller reporting the error, and clear the error with the following command—all on one line: CLEAR_ERRORS THIS_CONTROLLE R INVALID_CACHE DESTROY_UNFLUS HED_ DATA. See the controller CLI reference guide for more information.
cache module. This error might
also occur if installing a new cache module for a controller that expects write-back data in the cache.
1–8 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 7 of 10)
Symptom Possible Cause Investigation Remedy
Cannot add device. Illegal device. See product-specific
Replace device. release notes that accompanied the software release for the most recent list of supported devices.
Device not properly installed in
Check that the device is fully seated.
Firmly press the
device into the bay.
enclosure. Failed device. Check for presence of
device LEDs.
Follow repair action
in the documentation
provided with the
enclosure or device.
Failed power supplies.
Check for presence of power supply LEDs.
Follow repair action
in the documentation
provided with the
enclosure or power
supply.
Failed bus to device.
If the previous remedies fail to
Replace enclosure.
resolve the problem, check for OCP LED codes.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–9
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 8 of 10)
Symptom Possible Cause Investigation Remedy
Cannot configure storagesets.
Incorrect command syntax.
Exceeded maximum number of storagesets.
See the controller CLI reference guide for the ADD storageset command.
Use the SHOW command to count the number of
Reconfigure storageset with correct command syntax.
Delete unused storagesets.
storagesets configured on the controller.
Failed battery on ECB. An ECB or uninterruptible power supply (UPS)
Use the SHOW command to check the ECB battery status.
Replace the ECB if required.
is required for RAIDsets and mirrorsets.
Cannot assign unit number to storageset.
Unit is available but not online.
Incorrect command syntax.
This is normal. Units are “available”
See the controller CLI reference guide for correct syntax.
Reassign the unit number with the correct syntax.
None None
until the host accesses them, at which point their status is changed to “online.”
Host cannot see device.
Broken cables. Check for broken
cables.
Replace broken cables.
1–10 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 9 of 10)
Symptom Possible Cause Investigation Remedy
Host cannot access unit.
Host files or device drivers not properly installed or configured.
Check for the required device special files.
Configure device
special files as
described in the
installation and
configuration guide
that accompanied
the software release.
Invalid Cache See the description
for the invalid cache symptom on page
See the description
for the invalid cache
symptom. 1–7.
Units have lost data. Issue the SHOW
UNITS FULL command.
Clear these units
with:
CLEAR_ERRORS
unit-number
LOST_DATA.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–11
Troubleshooting Information
Table 1–1: Troubleshooting Guidelines (Sheet 10 of 10)
Symptom Possible Cause Investigation Remedy
Host log file or maintenance terminal indicates that a forced error occurred when the controller was reconstructing a RAIDset or mirrorset.
Unrecoverable read errors might have occurred when the controller was reconstructing the storageset. Errors occur if another member fails while the controller is reconstructing the storageset.
Host requested data from a normalizing storageset that did not contain the data.
Conduct a read scan of the storageset using the appropriate utility from the host operating system, such as the “dd” utility for a TRU64 UNIX host.
Use the SHOW storageset-name command to see if all storageset members are “normal.”
Rebuild the storageset, then restore storageset data from a backup source. While the controller is reconstructing the storageset, monitor the host error log activity or spontaneous event reports on the maintenance terminal for any unrecoverable errors. If unrecoverable errors persist, note the device on which they occurred, and replace the device before proceeding.
Wait for normalizing members to become normal, then resume I/O to them.

Significant Event Reporting

Controller fault management software reports information about significant events that occur. These events are reported by:
Maintenance terminal displays
Host error logs
OCP LEDs
1–12 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
Some events cause controller operation to halt; others allow the controller to remain operable. Both types of events are detailed in the following sections.
Reporting Events That Cause Controller Operation to Halt
Events that cause the controller to halt operations are reported in three possible ways:
•a
FLASHING OCP pattern display
•a
SOLID OCP pattern display
Last Failure reporting Use Table 1–2 to interpret
FLASHING OCP patterns and Table 1–3 to interpret SOLID (ON)
OCP patterns. In the Error column of the solid OCP patterns, there are two separate descriptions. The first denotes the actual error message that appears on the terminal, and the second provides a more detailed explanation of the designated error.
Use the following legend to interpret both tables as indicated:
= reset button F
n o
= reset button O
l
= LED FLASHING (in Table 1–2) or ON (in TABLE 1–3)
m
= LED O
NOTE: If the reset button is FLASHING and an LED is ON, either the devices on the bus that corresponds to the LED do not match the controller configuration, or an error occurred in one of the devices on that bus.
Also, a single LED that is turned O
FF
LASHING (in Table 1–2) or ON (in TABLE 1–3)
FF
N indicates a failure of the drive on that bus.
Flashing OCP Pattern Display Reporting
Certain events can cause a FLASHING display of the OCP LEDs. Each event and the resulting pattern are described in Table 1–2.
IMPORTANT: Remember that a solid black pattern represents a FLASHING display. A white pattern indicates OFF.
All LEDs F
Table 1–2: FLASHING OCP Pattern Displays and Repair Actions (Sheet 1 of 3)
LASH at the same time and at the same rate.
OCP
Pattern
Code Error Repair Action
nmmmmml 1 Program card EDC error. Replace program card. Legend:
= reset button F
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–13
LASHING ❏ = reset button OFF ● = LED FLASHING ❍ = LED OFF
Troubleshooting Information
Table 1–2: F
LASHING OCP Pattern Displays and Repair Actions (Sheet 2 of 3)
OCP
Pattern
nmmmlmm 4 Timer zero on the processor is
Code Error Repair Action
Replace controller.
bad.
nmmmlml 5 Timer one on the processor is
Replace controller.
bad.
nmmmllm 6 Processor Guarded Memory
Replace controller.
Unit (GMU) is bad.
nmmlmll B Nonvolatile Journal Memory
(JSRAM) structure is bad because of a memory error or an incorrect upgrade procedure.
nmmllml D One or more bits in the
diagnostic registers did not match the expected reset value.
Verify the correct upgrade (see the controller release notes and cover letters, if available). If error continues, replace controller.
Press the reset button to restart the controller. If this does not correct the error,
replace the controller. nmmlllm E Memory error in the JSRAM. Replace controller. nmmllll F Wrong image found on
program card.
nmlmmmm 10 Controller Module memory is
Replace program card or
replace controller if needed.
Replace controller.
bad.
nmlmmlm 12 Controller Module memory
Replace controller.
addressing is malfunctioning.
nmlmmll 13 Controller Module memory
Replace controller.
parity is not working.
nmlmlmm 14 Controller Module memory
Replace controller.
controller timer has failed.
nmllmml 15 The Controller Module memory
Replace controller.
controller interrupt handler has failed.
Legend:
= reset button F
LASHING ❏ = reset button OFF ● = LED FLASHING ❍ = LED OFF
1–14 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
Table 1–2: F
LASHING OCP Pattern Displays and Repair Actions (Sheet 3 of 3)
OCP
Pattern
nmllllm 1E During the diagnostic memory
Code Error Repair Action
Replace controller. test, the Controller Module memory controller caused an unexpected Non-Maskable Interrupt (NMI).
nlmmlmm 24 The card code image changed
Replace controller. when the contents were copied to memory.
nllmmmm 30 The JSRAM battery is bad. Replace controller. nllmmlm 32 First-half diagnostics of the
Replace controller. Time of Year Clock failed.
nllmmll 33 Second-half diagnostics of the
Replace controller. Time of Year Clock failed.
nllmlml 35 The processor bus-to-device
Replace controller. bus bridge chip is bad.
nlllmll 3B An unnecessary interrupt
Replace controller. pending.
nllllmm 3C An unexpected fault during
Replace controller. initialization.
nllllml 3D An unexpected maskable
Replace controller. interrupt during initialization.
nlllllm 3E An unexpected NMI during
Replace controller. initialization.
nllllll 3F An invalid process ran during
Replace controller. initialization.
Legend:
= reset button F
LASHING ❏ = reset button OFF ● = LED FLASHING ❍ = LED OFF
Solid OCP Pattern Display Reporting
Certain events cause the OCP LEDs to display ON or SOLID. Each event and the resulting pattern are described in Table 1–3.
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–15
Troubleshooting Information
Information related to the solid OCP patterns is automatically displayed on the maintenance terminal (unless disabled with the FMU) using %FLL formatting, as detailed in the following examples:
%FLL--HSG> --13-MAY-2001 04:39:45 (time not set)-- OCP Code: 38 Controller operation terminated.
%FLL--HSG> --13-MAY-2001 04:32:26 (time not set)-- OCP Code: 26 Memory module is missing.
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 1 of 6)
OCP
Pattern
ommmmmm 0 Catastrophic controller or
nmmmmmm 0 No program card detected or
nlmmlml 25 Recursive Bugcheck detected.
Legend:
= reset button O
Code Error Repair Action
Check power. If good, reset
power failure.
controller. If problem persists, reseat controller module and reset controller. If problem is still evident, replace controller module.
Make sure that the program
kill asserted by other controller. Controller unable to read
program card.
card is properly seated while resetting the controller. If the error persists, try the card with another controller; or replace the card. Otherwise, replace the controller that reported the error.
Reset the controller. If this fault
The same bugcheck has occurred three times within 10 minutes, and controller operation has halted.
pattern is displayed repeatedly, follow the repair actions associated with the Last Failure code that is repeatedly terminating controller execution.
N ❏ = reset button OFF ● = LED ON ❍ = LED OFF
1–16 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Troubleshooting Information
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 2 of 6)
OCP
Pattern
nlmmllm 26 Indicated memory module is
Code Error Repair Action
Insert memory module (cache missing.
board). Controller is unable to detect a
particular memory module.
nlmmlll 27 Memory module has
insufficient usable memory.
Replace indicated DIMMs.
This indication is only provided
when Fault LED logging is
enabled.
nlmlmmm 28 An unexpected Machine
Reset the controller. Fault/NMI occurred during Last Failure processing.
A machine fault was detected while a Non-Maskable Interrupt was processing.
nlmlmml 29 EMU protocol version
incompatible. The microcode in the EMU and
the software in the controller are not compatible.
nlmlmlm 2A All enclosure I/O modules are
not of the same type. Enclosure I/O modules are a
combination of single-ended
Upgrade either the EMU
microcode or the software
(refer to the release notes that
accompanied the controller
software).
Make sure that the I/O modules
in an extended subsystem are
either all single-ended or all
differential, but not both. and differential.
nlmlmll 2B Jumpers, not terminators,
found on backplane. One or more SCSI bus
terminators are either missing from the backplane or broken.
Make sure that enclosure SCSI
bus terminators are installed
and that no jumpers are
installed. Replace the failed
terminator if the problem
continues.
Legend:
= reset button O
N ❏ = reset button OFF ● = LED ON ❍ = LED OFF
HSG80 Array Controller V8.7 Troubleshooting Reference Guide 1–17
Troubleshooting Information
Table 1–3: Solid OCP Pattern Displays and Repair Actions (Sheet 3 of 6)
OCP
Pattern
nlmllmm 2C Enclosure I/O termination
Code Error Repair Action
Make sure that all of the
power out of range. Faulty or missing I/O module
causes enclosure I/O termination power to be out of
enclosure device SCSI buses have an I/O module. If problem persists, replace the failed I/O module.
range.
nlmllml 2D Master enclosure SCSI buses
are not all set to ID 0.
Set the PVA ID to 0 for the enclosure with the controllers. If the problem persists, try the following repair actions:
1. Replace the PVA module.
2. Replace the EMU.
3. Remove all devices.
4. Replace the enclosure.
nlmlllm 2E Multiple enclosures have the
same SCSI ID. More than one enclosure has
the same SCSI ID.
Reconfigure the PVA ID to uniquely identify each enclosure in the subsystem. The enclosure with the controllers must be set to PVA ID 0; additional enclosures must use PVA IDs 2 and 3. If the error continues after PVA settings are unique, replace each PVA module one at a time. Check the enclosure if the problem remains.
nlmllll 2F Memory module has illegal
DIMM configuration.
Verify that DIMMs are installed correctly.
Legend:
= reset button O
N ❏ = reset button OFF ● = LED ON ❍ = LED OFF
1–18 HSG80 Array Controller V8.7 Troubleshooting Reference Guide
Loading...
+ 228 hidden pages