DEC 4000 AXP Service Manual


DEC 4000 AXP Service Guide

Order Number: EK–KN430–SV. B01

Digital Equipment Corporation Maynard, Massachusetts


First Printing, December 1992
Revised, July 1993

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation.

Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

The software, if any, described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license. No responsibility is assumed for the use or reliability of software or equipment that is not supplied by Digital Equipment Corporation or its afﬁliated companies.

in preparing future documentation.

The following are trademarks of Digital Equipment Corporation: Alpha AXP, AXP, DEC, DECchip, DECconnect, DECdirect, DECnet, DECserver, DEC VET, DESTA, MSCP, RRD40, ThinWire, TMSCP, TU, UETP, ULTRIX, VAX, VAX DOCUMENT, VAXcluster, VMS, the AXP logo, and the DIGITAL logo.

OSF/1 is a registered trademark of Open Software Foundation, Inc.

All other trademarks and registered trademarks are the property of their respective holders.

FCC NOTICE: The equipment described in this manual generates, uses, and may emit radio frequency energy. The equipment has been type tested and found to comply with the limits for a Class A computing device pursuant to Subpart J of Part 15 of FCC Rules, which are designed to provide reasonable protection against such radio frequency interference when operated in a commercial environment. Operation of this equipment in a residential area may cause interference, in which case the user at his own expense may be required to take measures to correct the interference.

This document was prepared using VAX DOCUMENT, Version 2.1.

S2384


Contents

Preface ................................................ xiii

1 System Maintenance Strategy

1.1 Troubleshooting the System . ....................... 1–1

1.2 Service Delivery Methodology ...................... 1–7

1.3 Product Service Tools and Utilities . . . ............... 1–8

1.4 Information Services ............................. 1–11

1.5 Field Feedback . . . ............................... 1–12

2 Power-On Diagnostics and System LEDs

2.1 Interpreting System LEDs . . ....................... 2–1

2.1.1 Power Supply LEDs ........................... 2–2

2.1.2 Operator Control Panel LEDs ................... 2–7

2.1.3 I/O Panel LEDs .............................. 2–9

2.1.4 Futurebus+ Option LEDs ....................... 2–11

2.1.5 Storage Device LEDs . . . ....................... 2–12

2.2 Power-Up Screens ............................... 2–15

2.2.1 Console Event Log ............................ 2–17

2.2.2 Mass Storage Problems Indicated at Power-Up ...... 2–18

2.2.3 Robust Mode Power-Up . ....................... 2–26

2.3 Power-Up Sequence .............................. 2–27

2.3.1 AC Power-Up Sequence . ....................... 2–27

2.3.2 DC Power-Up Sequence . ....................... 2–29

2.3.3 Firmware Power-Up Diagnostics . . ............... 2–32

2.3.3.1 Serial ROM Diagnostics ..................... 2–32

2.3.3.2 Console Firmware-Based Diagnostics........... 2–33

2.4 Boot Sequence . . . ............................... 2–33

2.4.1 Cold Bootstrapping in a Uniprocessor Environment . . 2–34

2.4.2 Loading of System Software ..................... 2–35

2.4.3 Warm Bootstrapping in a Uniprocessor Environment ........ 2–36


2.4.4 Multiprocessor Bootstrapping ................... 2–37

2.4.5 Boot Devices . . ............................... 2–37

3 Running System Diagnostics

3.1 Running ROM-Based Diagnostics ................... 3–1

3.1.1 test . ....................................... 3–3

3.1.2 show fru .................................... 3–5

3.1.3 show_status . . ............................... 3–7

3.1.4 show error . . . ............................... 3–8

3.1.5 memexer ................................... 3–10

3.1.6 memexer_mp . ............................... 3–11

3.1.7 exer_read ................................... 3–12

3.1.8 exer_write . . . ............................... 3–14

3.1.9 fbus_diag ................................... 3–16

3.1.10 show_mop_counter ............................ 3–18

3.1.11 clear_mop_counter ............................ 3–19

3.1.12 Loopback Tests............................... 3–20

3.1.12.1 Testing the Auxiliary Console Port (exer) . . ...... 3–20

3.1.12.2 Testing the Ethernet Ports (netexer) ........... 3–20

3.1.13 kill and kill_diags ............................ 3–21

3.1.14 Summary of Diagnostic and Related Commands ..... 3–21

3.2 DSSI Device Internal Tests . ....................... 3–22

3.3 DECVET...................................... 3–25

3.4 Running UETP . . ............................... 3–26

3.4.1 Summary of UETP Operating Instructions . . . ...... 3–26

3.4.2 System Disk Requirements ..................... 3–28

3.4.3 Preparing Additional Disks ..................... 3–28

3.4.4 Preparing Magnetic Tape Drives . . ............... 3–29

3.4.5 Preparing Tape Cartridge Drives . . ............... 3–29

3.4.5.1 TLZ06 Tape Drives. . ....................... 3–30

3.4.6 Preparing RRD42 Compact Disc Drives ............ 3–30

3.4.7 Preparing Terminals and Line Printers ............ 3–30

3.4.8 Preparing Ethernet Adapters .................... 3–30

3.4.9 DECnet for OpenVMS AXP Phase . ............... 3–31

3.4.10 Termination of UETP . . . ....................... 3–32

3.4.11 Interpreting UETP VMS Failures . ............... 3–32

3.4.12 Interpreting UETP Output ..................... 3–32

3.4.12.1 UETP Log Files ........................... 3–33

3.4.12.2 Possible UETP Errors ...................... 3–33

3.5 Acceptance Testing and Initialization. . ............... 3–34


4 Error Log Analysis

4.1 Fault Detection and Reporting ...................... 4–1

4.1.1 Machine Check/Interrupts ...................... 4–2

4.1.2 System Bus Transaction Cycle ................... 4–4

4.2 Error Logging and Event Log Entry Format ........... 4–4

4.3 Event Record Translation. . . ....................... 4–6

4.3.1 OpenVMS AXP Translation ..................... 4–6

4.3.2 DEC OSF/1 Translation . ....................... 4–7

4.4 Interpreting System Faults Using ERF and UERF ...... 4–7

4.4.1 Note 1: System Bus Address Cycle Failures . . ...... 4–12

4.4.2 Note 2: System Bus Write-Data Cycle Failures ...... 4–13

4.4.3 Note 3: System Bus Read Parity Error ............ 4–14

4.4.4 Note 4: Backup Cache Uncorrectable Error . . . ...... 4–14

4.4.5 Note 5: Data Delivered to I/O Is Known Bad. . ...... 4–15

4.4.6 Note 6: Futurebus+ DMA Parity Error ............ 4–15

4.4.7 Note 7: Futurebus+ Mailbox Access Parity Error .... 4–16

4.4.8 Note 8: Multi-Event Analysis of Command/Address Parity, Write-Data Parity, or Read-Data Parity Errors ........ 4–16

4.4.9 Sample System Error Report (ERF) ............... 4–16

4.4.10 Sample System Error Report (UERF) ............. 4–18

5 Repairing the System

5.1 General Guidelines for FRU Removal and Replacement . . 5–1

5.2 Front FRUs .................................... 5–4

5.2.1 Operator Control Panel . ....................... 5–4

5.2.2 Vterm Module ............................... 5–4

5.2.3 Fixed-Media Storage . . . ....................... 5–4

5.2.3.1 3.5-Inch Fast-SCSI Disk Drives (RZ26, RZ27, RZ35) ........ 5–4

5.2.3.2 3.5-Inch SCSI Disk Drives ................... 5–5

5.2.3.3 5.25-Inch SCSI Disk Drive ................... 5–6

5.2.3.4 SCSI Storageless Tray Assembly .............. 5–6

5.2.3.5 3.5-Inch DSSI Disk Drive .................... 5–7

5.2.3.6 5.25-Inch DSSI Disk Drive ................... 5–7

5.2.3.7 DSSI Storageless Tray Assembly .............. 5–8

5.2.4 Removable-Media Storage (Tape and Compact Disc) ........ 5–8

5.2.4.1 SCSI Bulkhead Connector ................... 5–8

5.2.4.2 SCSI Continuity Card ...................... 5–8

5.2.5 Fans ....................................... 5–9


5.3 Rear FRUs ..................................... 5–16

5.3.1 Modules (CPU, Memory, I/O, Futurebus+) .......... 5–16

5.3.2 Ethernet Fuses .............................. 5–17

5.3.3 Power Supply . ............................... 5–17

5.3.4 Fans ....................................... 5–17

5.4 Backplane ..................................... 5–20

5.5 Repair Data for Returning FRUs .................... 5–22

6 System Conﬁguration and Setup

6.1 Functional Description ............................ 6–1

6.1.1 System Bus . . ............................... 6–7

6.1.1.1 KN430 CPU .............................. 6–7

6.1.1.2 Memory . . ............................... 6–10

6.1.1.3 I/O Module ............................... 6–13

6.1.2 Serial Control Bus ............................ 6–15

6.1.3 Futurebus+ . . ............................... 6–16

6.1.4 Power Subsystem ............................. 6–17

6.1.5 Mass Storage . ............................... 6–19

6.1.5.1 Fixed-Media Compartments . . . ............... 6–19

6.1.5.2 Removable-Media Storage Compartment . . ...... 6–21

6.1.6 System Expansion ............................ 6–23

6.1.6.1 Power Control Bus for Expanded Systems . ...... 6–23

6.2 Examining System Conﬁguration.................... 6–25

6.2.1 show conﬁg . . . ............................... 6–25

6.2.2 show device . . ............................... 6–26

6.2.3 show memory . ............................... 6–29

6.3 Setting and Showing Environment Variables ........... 6–29

6.4 Setting and Examining Parameters for DSSI Devices .... 6–33

6.4.1 show device du pu ............................ 6–33

6.4.2 cdp........................................ 6–34

6.4.3 DSSI Device Parameters: Deﬁnitions and Function. . 6–36

6.4.3.1 How OpenVMS AXP Uses the DSSI Device Parameters ........ 6–38

6.4.3.2 Example: Modifying DSSI Device Parameters .... 6–39

6.5 Console Port Baud Rate ........................... 6–41

6.5.1 Console Serial Port ........................... 6–42

6.5.2 Auxiliary Serial Port . . . ....................... 6–44


A Environment Variables

B Power System Controller Fault Displays

C Worksheet for Recording Customer Environment Variable Settings

Glossary

Index

Examples

3–1 Running DRVTST ............................ 3–24

3–2 Running DRVEXR ............................ 3–25

4–1 ERF-Generated Error Log Entry Indicating CPU Corrected Error ........ 4–17

4–2 UERF-Generated Error Log Entry Indicating CPU Error ........ 4–18

Figures

2–1 Power Supply LEDs ........................... 2–3

2–2 LDC and Fan Unit Locations and Error Codes ...... 2–6

2–3 OCP LEDs . . . ............................... 2–7

2–4 Module Locations Corresponding to OCP LEDs ...... 2–9

2–5 I/O Panel LEDs .............................. 2–10

2–6 Futurebus+ Option LEDs ....................... 2–11

2–7 Fixed-Media Mass Storage LEDs (SCSI) ........... 2–13

2–8 Fixed-Media Mass Storage LEDs (DSSI) ........... 2–14

2–9 Power-Up Self-Test Screen ...................... 2–16

2–10 Sample Power-Up Conﬁguration Screen............ 2–17

2–11 Flowchart for Troubleshooting Fixed-Media Problems ........ 2–19

2–12 Flowchart for Troubleshooting Fixed-Media Problems (Continued) ........ 2–20


2–13 Flowchart for Troubleshooting Removable-Media Problems ........ 2–23

2–14 Flowchart for Troubleshooting Removable-Media Problems (Continued) ........ 2–24

2–15 AC Power-Up Sequence . ....................... 2–28

2–16 DC Power-Up Sequence . ....................... 2–30

2–17 DC Power-Up Sequence (Continued) .............. 2–31

4–1 ERF/UERF Error Log Format ................... 4–5

5–1 SCSI Continuity Card Placement . . ............... 5–9

5–2 Front FRUs . . ............................... 5–10

5–3 Storage Compartment with Four 3.5-inch Fast-SCSI Drives (RZ26, RZ27, RZ35) ........ 5–11

5–4 Storage Compartment with Four 3.5-inch SCSI/DSSI Drives ........ 5–12

5–5 3.5-Inch SCSI Drive Resistor Packs and Power Termination Jumpers ........ 5–13

5–6 Position of Drives in Relation to Bus Node ID Numbers ........ 5–14

5–7 Storage Compartment with One 5.25-inch SCSI/DSSI Drive ........ 5–15

5–8 Rear FRUs . . . ............................... 5–18

5–9 Ethernet Fuses and Ethernet Address ROMs . ...... 5–19

5–10 Removing Shell .............................. 5–21

5–11 Removing Backplane . . . ....................... 5–22

6–1 System Block Diagram . . ....................... 6–3

6–2 System Backplane ............................ 6–4

6–3 BA640 Enclosure (Front) ....................... 6–5

6–4 BA640 Enclosure (Rear) . ....................... 6–6

6–5 CPU Block Diagram ........................... 6–8

6–6 MS430 Memory Block Diagram . . . ............... 6–12

6–7 I/O Module Block Diagram ...................... 6–14

6–8 Serial Control Bus EEPROM Interaction ........... 6–16

6–9 Power Subsystem Block Diagram . . ............... 6–18

6–10 Fixed-Media Storage . . . ....................... 6–20

6–11 Removable-Media Storage ...................... 6–22

6–12 Sample Power Bus Conﬁguration . . ............... 6–24

6–13 Device Name Convention ....................... 6–27


6–14 How OpenVMS Sees Unit Numbers for DSSI Devices ........ 6–39

6–15 Sample DSSI Buses for an Expanded DEC 4000 AXP System ........ 6–41

6–16 Console Baud Rate Select Switch . . ............... 6–43

Tables

1–1 Recommended Troubleshooting Procedures . . . ...... 1–2

1–2 Diagnostic Flow for Power Problems .............. 1–5

1–3 Diagnostic Flow for Problems Getting to Console Mode ........ 1–5

1–4 Diagnostic Flow for Problems Reported by the Console Program ........ 1–6

1–5 Diagnostic Flow for Boot Problems ............... 1–6

1–6 Diagnostic Flow for Errors Reported by the Operating System ........ 1–7

2–1 Interpreting Power Supply LEDs . . ............... 2–4

2–2 Interpreting OCP LEDs . ....................... 2–8

2–3 Interpreting I/O Panel LEDs .................... 2–10

2–4 Interpreting Futurebus+ Option LEDs ............. 2–12

2–5 Interpreting Fixed-Media Mass Storage LEDs . ...... 2–14

2–6 Fixed-Media Mass Storage Problems .............. 2–21

2–7 Removable-Media Mass Storage Problems .......... 2–25

2–8 Supported Boot Devices . ....................... 2–37

3–1 Summary of Diagnostic and Related Commands ..... 3–21

4–1 DEC 4000 AXP Fault Detection and Correction ...... 4–2

4–2 Error Field Bit Definitions for Error Log Interpretation ........ 4–8

6–1 Memory Features ............................. 6–11

6–2 Power Control Bus ............................ 6–24

6–3 Environment Variables Set During System Configuration ........ 6–30

6–4 Console Line Baud Rates ....................... 6–43

A–1 Environment Variables ....................... A–1

B–1 Power System Controller Fault ID Display . . . ...... B–1

C–1 Nonvolatile Environment Variables ............... C–1


Preface

This guide describes the procedures and tests used to service DEC 4000 AXP systems.

Intended Audience

This guide is intended for use by Digital Equipment Corporation service personnel and qualiﬁed self-maintenance customers.

Conventions

The following conventions are used in this guide.

Convention: Meaning

Return: A key name enclosed in a box indicates that you press that key.

Ctrl/x: Ctrl/x indicates that you hold down the Ctrl key while you press another key, indicated here by x. In examples, this key combination is enclosed in a box, for example, Ctrl/C.

bold type: In the online book (Bookreader), bold type in examples indicates commands and other instructions that you enter at the keyboard.

lowercase: Lowercase letters in commands indicate that commands can be entered in uppercase or lowercase.

(system drawing in the left margin): In some illustrations, small drawings of the DEC 4000 AXP system appear in the left margin. Shaded areas help you locate components on the front or back of the system.

Warning: Warnings contain information to prevent personal injury.

Caution: Cautions provide information to prevent damage to equipment or software.

[]: In command format descriptions, brackets indicate optional elements.

console command abbreviations: Console command abbreviations must be entered exactly as shown.

boot: Console and operating system commands are shown in this special typeface.

italic type: Italic type in console command sections indicates a variable.

< >: In console mode online help, angle brackets enclose a placeholder for which you must specify a value.

{ }: In command descriptions, braces containing items separated by commas imply mutually exclusive items.

1 System Maintenance Strategy

Any successful maintenance strategy is based on the proper understanding and use of information services, service tools, service support and escalation procedures, ﬁeld feedback, and troubleshooting procedures. This chapter describes the maintenance strategy for the DEC 4000 AXP system.

• Section 1.1 provides a diagnostic strategy you should use to troubleshoot a DEC 4000 AXP system.

• Section 1.2 explains the service delivery methodology.

• Section 1.3 lists the product tools and utilities.

• Section 1.4 lists available information services.

• Section 1.5 describes ﬁeld feedback procedures.

1.1 Troubleshooting the System

Before troubleshooting any system problem, check the site maintenance log for the system’s service history. Be sure to ask the system manager the following questions:

• Has the system been used before and did it work correctly?

• Have changes to hardware or updates to firmware or software been made to the system recently?

• What is the state of the system—is the operating system up?

If the operating system is down and you are not able to bring it up, use the console environment diagnostic tools, such as RBDs and LEDs.

If the operating system is up, use the operating system environment diagnostic tools, such as error logs, crash dumps, DEC VET and UETP exercisers, and other log ﬁles.


System problems can be classiﬁed into the following ﬁve categories:

1. Power problems

2. Problems getting to the console

3. Failures reported by the console subsystem

4. Boot failures

5. Failures reported by the operating system

Using these categories, you can quickly determine a starting point for diagnosis and eliminate the unlikely sources of the problem. Table 1–1 provides the recommended tools or resources you should use to isolate problems in each category.
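The five categories follow the order in which a system comes up, so a starting point can be chosen by finding the first stage that fails. The following Python sketch illustrates that idea only; the `Category` names, the `classify` function, and its four flags are hypothetical and stand for the technician's own observations, not for anything the system reports.

```python
from enum import Enum


class Category(Enum):
    POWER = 1              # Power problems
    CONSOLE_ACCESS = 2     # Problems getting to the console
    CONSOLE_REPORTED = 3   # Failures reported by the console subsystem
    BOOT = 4               # Boot failures
    OS_REPORTED = 5        # Failures reported by the operating system


# Diagnostic-flow table for each category, as cited in this chapter.
FLOW_TABLE = {
    Category.POWER: "Table 1-2",
    Category.CONSOLE_ACCESS: "Table 1-3",
    Category.CONSOLE_REPORTED: "Table 1-4",
    Category.BOOT: "Table 1-5",
    Category.OS_REPORTED: "Table 1-6",
}


def classify(powers_on, console_displayed, console_clean, boots):
    """Return the category of the first stage that was not passed.

    Assumption: the flags are hand-recorded observations, checked in
    power-up order (power, console display, console tests, boot).
    """
    if not powers_on:
        return Category.POWER
    if not console_displayed:
        return Category.CONSOLE_ACCESS
    if not console_clean:
        return Category.CONSOLE_REPORTED
    if not boots:
        return Category.BOOT
    return Category.OS_REPORTED


if __name__ == "__main__":
    observed = classify(powers_on=True, console_displayed=True,
                        console_clean=True, boots=False)
    print(observed.name, "->", FLOW_TABLE[observed])
```

Checking the stages in this fixed order is a design choice of the sketch: a fault at an earlier stage usually hides everything after it, so later flags are ignored once an earlier one fails.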

Table 1–1 Recommended Troubleshooting Procedures

Each category below gives a description of the problem, followed by the diagnostic tools or resources to use and the reference for each.

1. Power Problems (Table 1–2)

Description: No power at system enclosure or trouble with power supply subsystem, as indicated by LEDs.

• Power supply subsystem LEDs: Refer to Section 2.1.1 for information on interpreting power supply LEDs.

2. Problems Getting to Console Mode (Table 1–3)

Description: System powers up, but does not display power-up screen.

• OCP LEDs: Refer to Section 2.1.2 for information on interpreting OCP LEDs.

• Console terminal troubleshooting flow: Refer to Table 1–3 for information on troubleshooting console terminal problems.

• Power-up sequence description: Refer to Sections 2.3 and 2.3.3 for a description of the power-up and self-test sequence.

• Robust mode power-up: Refer to Section 2.2.3 for a description of robust mode power-up and its functions.

3. Failures Reported by the Console Program (Table 1–4)

Description: Power-up console screens indicate a failure.

• Power-up screens: Refer to Section 2.2 for information on interpreting power-up self-tests.

• Console event log: Refer to Section 2.2 for information on the console event log.

• RBD device tests: Refer to Section 3.1 for information on running RBD device tests.

4. Boot Failures (Table 1–5)

Description: System fails to boot operating system.

• Console commands (to examine environment variables and device parameters): Refer to Chapter 6 for instructions on setting and examining environment variables and device parameters.

• Storage device troubleshooting flowcharts: Refer to Section 2.2.2.

• RBD device tests: Refer to Section 3.1 for information on running RBD device tests.

• Boot sequence description: Refer to Section 2.4 for a description of the boot sequence.

5. Failures Reported by the Operating System (Table 1–6)

Description: Operating system generates error logs; process hangs or operating system crashes.

• Error logs: Refer to Chapter 4 for information on interpreting error logs.

• Crash dump: Refer to OpenVMS AXP Alpha System Dump Analyzer Utility Manual for information on how to interpret OpenVMS crash dump files. Refer to the Guide to Kernel Debugging (AA–PS2TA–TE) for information on using the DEC OSF/1 Krash Utility.

• DEC VET or UETP: Refer to Section 3.3 for a description of DEC VET, and Section 3.4 for information on running UETP software exercisers.

• Other log files: Refer to Chapter 4 for information on using log files such as SETHOST.LOG and OPERATOR.LOG to aid in troubleshooting.

Use the following tables to identify the diagnostic ﬂow for the ﬁve types of system problems:

• Table 1–2 provides the diagnostic ﬂow for power problems.

• Table 1–3 provides the diagnostic ﬂow for problems getting to console mode.

• Table 1–4 provides the diagnostic flow for problems reported by the console program.

• Table 1–5 provides the diagnostic ﬂow for boot problems.

• Table 1–6 provides the diagnostic flow for errors reported by the operating system.


Table 1–2 Diagnostic Flow for Power Problems

Symptom: No AC power at system as indicated by AC present LED.

• Action: Check the power source and power cord.

• Action: Check the system AC circuit breaker setting.

Symptom: AC power is present, but system does not power on.

• Action: Check the DC on/off switch setting.

• Action: Examine power supply subsystem LEDs to determine if a power supply unit or fan has failed, or if the system has shut down due to an overtemperature condition. Reference: Section 2.1.1

Table 1–3 Diagnostic Flow for Problems Getting to Console Mode

Symptom: Power-up screens (or console event log) are not displayed.

• Action: Check OCP LEDs for a failure during self-tests. If two OCP LEDs remain lit, either option could be at fault. Reference: Section 2.1.2

• Action: Check baud rate setting for console terminal and system. The system default baud rate setting is 9600. Reference: Section 6.5

• Action: Try connecting the console terminal to the auxiliary console port. Note: No console output is directed to the auxiliary console port until the power-up self-tests have completed and you press the Enter key or Ctrl/x.

• Action: For certain situations, power up under robust mode to bypass the power-up script and get to a low-level console. From console mode, you can then edit the nvram file, set and examine environment variables, or initialize individual phases of drivers. Reference: Section 2.2.3


Table 1–4 Diagnostic Flow for Problems Reported by the Console Program

Symptom: Power-up screens are displayed, but tests do not complete.

• Action: Use power-up display and/or OCP LEDs to determine error. Reference: Section 2.2 and Section 2.1.2

Symptom: Console program reports error.

• Action: Examine the console event log to check for embedded error messages recorded during power-up. Reference: Section 2.2.1

• Action: If power-up screens indicate problems with mass storage devices, use the troubleshooting flowcharts to determine the problems. Reference: Section 2.2.2

• Action: Run RBD tests to verify problem. Reference: Section 3.1

• Action: Use the show error command to examine error information contained in serial control bus EEPROMs. Reference: Section 3.1.4

Table 1–5 Diagnostic Flow for Boot Problems

Symptom: System cannot find boot device.

• Action: Check system configuration for correct device parameters (node ID, device name, and so on) and environment variables (bootdef_dev, boot_file, boot_osflags). Reference: Section 6.2.1, Section 6.3, and Section 6.4

Symptom: Device does not boot.

• Action: Run device test to check that boot device is operating. Reference: Section 3.2
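As a way of recording the environment-variable check in Table 1–5, the following Python sketch lists which of the three boot-related variables are missing or blank in a hand-transcribed set of values. It is only an illustration: the `unset_boot_variables` helper and the dictionary format are hypothetical, the sketch does not communicate with the console, and a blank value is a prompt for review rather than proof of a fault.

```python
# Environment variables that Table 1-5 says to check when the system
# cannot find its boot device.
BOOT_VARIABLES = ("bootdef_dev", "boot_file", "boot_osflags")


def unset_boot_variables(env):
    """Return the boot-related variables that are missing or blank.

    `env` is a hypothetical dict of name -> value, transcribed by hand
    from the console display.
    """
    flagged = []
    for name in BOOT_VARIABLES:
        value = env.get(name)
        if value is None or not str(value).strip():
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Illustrative values only, not taken from a real system.
    recorded = {"bootdef_dev": "", "boot_osflags": "0"}
    for name in unset_boot_variables(recorded):
        print("review:", name)
```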


Table 1–6 Diagnostic Flow for Errors Reported by the Operating System

Symptom: System is hung or has crashed.

• Action: Examine the crash dump file. Reference: Operating system documentation

• Action: Use the show error command to examine error information contained in serial control bus EEPROMs (console environment error log). Reference: Section 3.1.4

Symptom: Operating system is up.

• Action: Examine the operating system error log files to isolate the problem. Reference: Chapter 4

• Action: If the problem occurs intermittently, run DEC VET or UETP to stress the system. Reference: Section 3.3 and Section 3.4

• Action: Examine other log files, such as SETHOST.LOG, OPCOM.LOG, and OPERATOR.LOG.

1.2 Service Delivery Methodology

Before beginning any maintenance operation, you should be familiar with the following:

• The site agreement

• Your local and area geography support and escalation procedures

• Your Digital Services product delivery plan


Service delivery methods are part of the service support and escalation procedure. When appropriate, remote services should be part of the initial system installation. Methods of service delivery include the following:

• Local support

• Remote call screening

• Remote diagnosis (using modem support)

Recommended System Installation

The recommended system installation includes:

1. Hardware installation and acceptance testing. Acceptance testing includes running ROM-based diagnostics.

2. Software installation and acceptance testing. For example, using OpenVMS Factory Installed Software (FIS), and then acceptance testing with DEC VET or UETP.

3. Installation of the remote service tools and equipment to allow a Digital Service Center to dial in to the system. Refer to your remote service delivery strategy.

If you do not follow your service delivery methodology, you risk incurring excessive service expenses for any product.

1.3 Product Service Tools and Utilities

This section lists the array of service tools and utilities available for acceptance testing, diagnosis, and serviceability and provides recommendations for their use.

Error Handling/Logging

OpenVMS and DEC OSF/1 operating systems provide recovery from errors, fault handling, and event logging. The OpenVMS Error Report Formatter (ERF) provides bit-to-text translation of the event logs for interpretation. DEC OSF/1 uses UERF to capture the same kinds of information.

RECOMMENDED USE: Analysis of error logs is the primary method of diagnosis and fault isolation. If the system is up, or the customer allows the service representative to bring the system up, look at this information ﬁrst. Refer to Chapter 4 for information on using error logs to isolate faults.


ROM-Based Diagnostics (RBDs)

ROM-based diagnostics have signiﬁcant advantages:

• There is no load time.

• The boot path is more reliable.

• Diagnosis is done in console mode.

RECOMMENDED USE: The ROM-based diagnostic facility is the primary means of console environment testing and diagnosis of the CPU, memory, Ethernet, Futurebus+, and SCSI and DSSI subsystems. Use ROM-based diagnostics in the acceptance test procedures when you install a system, add a memory module, or replace the following: CPU module, memory module, backplane, I/O module, Futurebus+ device, or storage device. Refer to Section 3.1 for information on running ROM-based diagnostics.

Loopback Tests

Internal and external loopback tests are used to isolate a failure by testing segments of a particular control or data path. The loopback tests are a subset of the ROM-based diagnostics.

RECOMMENDED USE: Use loopback tests to isolate problems with the auxiliary console port and Ethernet controllers. Refer to Section 3.1.12 for instructions on performing loopback tests.

Firmware Console Commands

Console commands are used to set and examine environment variables and device parameters. For example, the show memory, show configuration, and show device commands are used to examine the configuration; the set (bootdef_dev, auto_action, and boot_osflags) commands are used to set environment variables; and the cdp command is used to configure DSSI parameters.

RECOMMENDED USE: Use console commands to set and examine environment variables and device parameters. Refer to Section 6.2 for information on firmware commands and utilities.


Option LEDs During Power-Up

The power supply LEDs display pass/fail test results for the power supply subsystem; the operator control panel (OCP) LEDs display pass/fail self-test results for CPU, memory, I/O, and Futurebus+ modules. Storage devices and Futurebus+ modules have their own LEDs as well.

RECOMMENDED USE: Monitor LEDs during power-up to see if the devices pass their self-tests. Refer to Chapter 2 for information on LEDs and power-up tests.

Operating System Exercisers (DEC VET or UETP)

The Digital Verifier and Exerciser Tool (DEC VET) is supported by the OpenVMS and DEC OSF/1 operating systems. DEC VET performs exerciser-oriented maintenance testing of both hardware and operating system. UETP is included with OpenVMS and is designed to test whether the OpenVMS operating system is installed correctly.

RECOMMENDED USE: Use DEC VET or UETP as part of acceptance testing to ensure that the CPU, memory, disk, tape, ﬁle system, and network are interacting properly. Also use DEC VET or UETP to stress test the user’s environment and conﬁguration by simulating system operation under heavy loads to diagnose intermittent system failures.

Crash Dumps

For fatal errors, such as fatal bugchecks, OpenVMS and DEC OSF/1 operating systems will save the contents of memory to a crash dump ﬁle.

RECOMMENDED USE: The support representative should analyze crash dump ﬁles. To save a crash dump ﬁle for analysis, you need to know proper system settings. Refer to the OpenVMS AXP Alpha System Dump Analyzer Utility Manual or the Guide to Kernel Debugging (AA–PS2TA–TE) for instructions.

Other Log Files

Several types of log files, such as the operator log, console event log, sethost log, and accounting file (accounting.dat), are useful in troubleshooting.

RECOMMENDED USE: Use the sethost log and other log ﬁles to capture/examine the console output and compare with event logs and crash dumps in order to see what the system was doing at the time of the error.
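One way to compare log files against an event log or crash dump is to gather the entries recorded close to the time of the error. The Python sketch below shows that idea under stated assumptions: the `entries_near` function, the `(timestamp, source, message)` tuple layout, and the five-minute default window are all hypothetical, and the entries are assumed to have been extracted from the log files by other means, since the actual log formats are not described here.

```python
from datetime import datetime, timedelta


def entries_near(entries, error_time, window_minutes=5):
    """Return entries within +/- window_minutes of error_time, oldest first.

    Each entry is a hypothetical (timestamp, source, message) tuple.
    """
    window = timedelta(minutes=window_minutes)
    nearby = [e for e in entries if abs(e[0] - error_time) <= window]
    return sorted(nearby, key=lambda e: e[0])


if __name__ == "__main__":
    # Made-up entries for illustration; real log contents will differ.
    error_time = datetime(1993, 7, 1, 10, 30)
    entries = [
        (datetime(1993, 7, 1, 10, 31), "operator log", "example message B"),
        (datetime(1993, 7, 1, 9, 0), "sethost log", "example message A"),
        (datetime(1993, 7, 1, 10, 28), "sethost log", "example message C"),
    ]
    for stamp, source, message in entries_near(entries, error_time):
        print(stamp.isoformat(), source, message)
```

Widening `window_minutes` is useful when the clocks of the systems that produced the logs may not agree.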


1.4 Information Services

As a Digital service representative, you may access several information resources, including advanced database applications, online training courses, and remote diagnostic tools. A brief description of some of these resources follows.

Technical Information Management Architecture (TIMA)

TIMA is an online database that delivers technical and reference information to service representatives. A key beneﬁt of TIMA is the pooling of worldwide knowledge and expertise.

DEC 4000 AXP Model 600 Series Information Set

The DEC 4000 AXP Model 600 Series Information Set consists of service documentation that contains information on installing and using, servicing and upgrading, and understanding the system. The guide you are reading is part of the set. The hardcopy kit number is EK–KN430–DK. The set is also available on TIMA. Refer to your DEC 4000 Model 600 Information Map (EK–KN430–IN) for detailed information.

Training

Computer Based Training (CBT) and lecture lab courses are available from the Digital training center:

• DEC 4000 System Installation and Troubleshooting (CBT course, EY–I090E–CO)

• Alpha Architecture Concepts (CBT course, EY–K725E–MT on magnetic tape; EY–K725E–TK on TK50 tape)

• Futurebus+ Concepts (EY–F479E–CO)

Digital Services Product Delivery Plan (Hardware or Software)

The Product Delivery Plan documents Digital Services’ delivery commitments. The plan is the communications vehicle used among the various groups responsible for ensuring consistency between Digital Services’ delivery strategies and engineering product strategies.

Blitzes

Technical updates are ‘‘blitzed’’ to the ﬁeld using online mail and TIMA.


Storage and Retrieval System (STARS)

STARS is a worldwide database for storing and retrieving technical information. The STARS databases, which contain more than 150,000 entries, are updated daily.

Using STARS, you can quickly retrieve the most up-to-date technical information via DSNlink or DSIN.

1.5 Field Feedback

Providing the proper feedback to the corporation is essential in closing the loop on any service call. Consider the following when completing a service call:

• Fill out repair tags accurately and with as much symptom information as possible so that repair centers can ﬁx a problem.

• Provide accurate call closeout information for Labor Activity Reporting System (LARS) or Call-Handling and Management Planning (CHAMP).

• Keep an up-to-date site maintenance log, whether hardcopy or electronic, to provide a record of the performed maintenance.


2 Power-On Diagnostics and System LEDs

This chapter provides information on how to interpret system LEDs and the power-up console screens. In addition, a description of the power-up and bootstrap sequence is provided as a resource to aid in troubleshooting.

• Section 2.1 describes how to interpret system LEDs.

• Section 2.2 describes how to interpret the power-up screens.

• Section 2.3 describes the power-up sequence.

• Section 2.3.3 describes power-on self-tests.

• Section 2.4 describes the boot sequence.

2.1 Interpreting System LEDs

DEC 4000 AXP systems have several diagnostic LEDs that indicate whether modules and subsystems have passed self-tests. The power system controller constantly monitors the power supply subsystem and can indicate several types of failures. The system LEDs are used primarily to troubleshoot power problems and problems getting to the console program.

This section describes the function of each of the following types of system LEDs, and what action to take when a failure is indicated.

• Power supply LEDs

• Operator control panel (OCP) LEDs

• I/O panel LEDs

• Futurebus+ option LEDs

• Storage device LEDs


2.1.1 Power Supply LEDs

The power supply LEDs (Figure 2–1) are used to indicate the status of the components that make up the power supply subsystem. The following types of failures will cause the power system controller to shut down the system:

• Power system controller (PSC) failure

• Fan failure

• Overtemperature condition

• Power regulator failures (indicated by the DC3 or DC5 failure LEDs)

• Front end unit (FEU) failure

Note

The AC circuit breaker will also shut down the system. If a power surge occurs, the breaker will trip, causing the switch to return to the off position (0). If the circuit breaker trips, wait 30 seconds before setting the switch to the on position (1).

Refer to Table 2–1 for information on interpreting the LEDs and determining what actions to take when a failure is indicated.

Figure 2–2 shows the local disk converter (LDC) and fan locations as they correspond to the fault ID display.


Figure 2–1 Power Supply LEDs

[Figure: the power supply modules (PSC, DC3, FEU, DC5) with these callouts: AC Circuit Breaker, AC Present, FEU OK, FEU Failure, DC3 OK, DC3 Failure, DC5 OK, DC5 Failure, PSC OK, PSC Failure, Overtemperature Shutdown, Fan Failure, Disk Power Failure, Fault ID Display.]


Table 2–1 Interpreting Power Supply LEDs

Front End Unit (FEU)

AC Present
  Meaning: When lit, indicates AC power is present at the AC input connector (regardless of circuit breaker position).
  Action on Error: If AC power is not present, check the power source and power cord. If the system will not power up and the AC LED is the only lit LED, check if the system AC circuit breaker has tripped. Replace the front end unit (Chapter 5) if the system circuit breaker is broken.

FEU OK
  Meaning: When lit, indicates DC output voltages for the FEU are above the specified minimum.

FEU Failure
  Meaning: When lit, indicates DC output voltages for the FEU are less than the specified minimum.
  Action on Error: Replace front end unit (Chapter 5).

Power System Controller (PSC)

PSC OK
  Meaning: When blinking, indicates the PSC is performing power-up self-tests. When steady, indicates the PSC is functioning normally.

PSC Failure
  Meaning: When lit, indicates the PSC has detected a fault in itself.
  Action on Error: Replace power system controller (Chapter 5).

Disk Power Failure
  Meaning: When lit, indicates a disk power problem for the storage compartment specified in the hexadecimal fault ID display. The most likely failing unit is the local disk converter, but a shorting cable or drive could also be at fault.
  Action on Error: To isolate the local disk converter, disconnect the drives on the specified bus and then power up the system. If the Disk Power Failure LED lights with the drives disconnected, replace the failing local disk converter (Chapter 5). Refer to Figure 2–2 to locate the local disk converter specified by the fault ID display. A is the top compartment, D is the bottom compartment.

Fan Failure
  Meaning: When lit, indicates a fan has failed or a cable guide is not properly secured. The failure is identified by a number displayed in the hexadecimal fault ID display.
  Action on Error: Refer to Figure 2–2 to locate the failure specified by the fault ID display. Replace the failing fan (Chapter 5).

Overtemperature Shutdown
  Meaning: When lit, indicates the PSC has shut down the system due to excessive internal temperature.
  Action on Error: Set the AC circuit breaker to off (0) and wait one minute before turning on the system. Make sure the air intake is unobstructed and that the room temperature does not exceed maximum requirement as described in the DEC 4000 Site Preparation Checklist.

DC–DC Converter (DC3)

DC3 OK
  Meaning: When lit, indicates that all the DC3 output voltages are within specified tolerances.

DC3 Failure
  Meaning: When lit, indicates that one of the output voltages is outside specified tolerances.
  Action on Error: Replace the DC3 converter (Chapter 5).

DC–DC Converter (DC5)

DC5 OK
  Meaning: When lit, indicates the DC5 output voltage is within specified tolerances.

DC5 Failure
  Meaning: When lit, indicates the DC5 output voltage is outside specified tolerances.
  Action on Error: Replace the DC5 converter (Chapter 5).

Figure 2–2 LDC and Fan Unit Locations and Error Codes

[Figure: locations of Local Disk Converters A, B, C, and D and of Fans 1 through 4. Fans are located behind the cable guides.]

Fan Error Codes

1 - Rear left
2 - Rear right
3 - Front left
4 - Front right
9 - A cable guide is not properly secured or two or more fans have failed.
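The fault ID codes above lend themselves to a simple lookup. The following Python sketch is illustrative only: the function name `decode_fault_id` and the `led` argument values are hypothetical, and it covers just the codes listed in Table 2–1 and Figure 2–2. Other fault display values are described in Appendix B.

```python
# Fan error codes as listed in Figure 2-2.
FAN_CODES = {
    "1": "Rear left fan has failed",
    "2": "Rear right fan has failed",
    "3": "Front left fan has failed",
    "4": "Front right fan has failed",
    "9": "A cable guide is not properly secured, or two or more fans have failed",
}

# Storage compartment letters; Table 2-1 states A is the top and D the bottom.
LDC_CODES = {
    "A": "Local disk converter A (top compartment)",
    "B": "Local disk converter B",
    "C": "Local disk converter C",
    "D": "Local disk converter D (bottom compartment)",
}


def decode_fault_id(led: str, display: str) -> str:
    """Translate a fault ID display value for the LED that is lit.

    led is "fan" (Fan Failure LED) or "disk" (Disk Power Failure LED);
    both names are assumptions made for this sketch.
    """
    table = {"fan": FAN_CODES, "disk": LDC_CODES}.get(led)
    if table is None:
        raise ValueError("led must be 'fan' or 'disk'")
    # Fall back to a pointer at Appendix B for anything not in the figure.
    return table.get(display.upper(), "Code not listed in Figure 2-2; see Appendix B")


print(decode_fault_id("fan", "3"))
print(decode_fault_id("disk", "d"))
```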


2.1.2 Operator Control Panel LEDs

The OCP LEDs (Figure 2–3) are used to indicate the progress and result of self-tests for Futurebus+, memory, CPU, and I/O modules. These LEDs are the primary diagnostic tool for troubleshooting problems getting to the console program.

Note

A failure in the CPU, memory module, or I/O module can cause both the I/O and CPU LEDs or I/O and memory LEDs to indicate self-test failures even if only one of the modules is failing. If two LEDs are lit, the I/O module is the more likely source of the failure.

Figure 2–3 OCP LEDs

[Figure: operator control panel. Callouts: DC On/Off Switch, DC Power LED, Reset, Halt, and the Self-Test Status LEDs (Futurebus+ 6–1; MEM 3, 2, 1, 0; CPU 0, 1; I/O).]


Refer to Table 2–2 for information on interpreting the OCP LEDs and determining what actions to take when a failure is indicated.

Figure 2–4 shows the module locations as they correspond to the LEDs.

Table 2–2 Interpreting OCP LEDs

Futurebus+ 6–1
  Meaning: Remains lit if a Futurebus+ option has failed power-on diagnostics.
  Action on Error: Examine LEDs on the Futurebus+ options to determine which option to replace.

MEM 3, 2, 1, 0
  Meaning: Remains lit if a memory module has failed power-on diagnostics. If no good memory is found, all four memory LEDs may remain lit even if there are less than four memory modules present.
  Action on Error: Replace the failed module (Chapter 5).

CPU 0, 1
  Meaning: Remains lit if a CPU module has failed power-on diagnostics.
  Action on Error: Replace the failed module (Chapter 5).

I/O
  Meaning: Remains lit if the I/O module has failed power-on diagnostics.
  Action on Error: Replace the I/O module (Chapter 5).

DC Power
  Meaning: When lit, indicates the proper DC power is present. When unlit, indicates no DC power is present.
  Action on Error: If no DC power is indicated, set the DC on/off switch to on (1) and examine the power supply LEDs.
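As an illustration of how the OCP LED rules combine, the Python sketch below orders the suspect modules for a given set of LEDs that remain lit. The function name `ocp_suspects` and the LED label strings ("I/O", "CPU 0", "MEM 2", "Futurebus+ 4") are assumptions made for this example. Listing the I/O module first reflects the note in this section that the I/O module is the more likely source when two LEDs are lit; it is a reading of that note, not a documented algorithm.

```python
def ocp_suspects(lit):
    """Return suspect descriptions, most likely first, for the OCP LEDs still lit."""
    lit = set(lit)
    mems = sorted(name for name in lit if name.startswith("MEM"))
    cpus = sorted(name for name in lit if name.startswith("CPU"))
    fbus = sorted(name for name in lit if name.startswith("Futurebus+"))

    suspects = []
    # Per the note, a failing I/O module can also light a CPU or MEM LED.
    if "I/O" in lit:
        suspects.append("I/O module")
    # All four MEM LEDs may stay lit when no good memory is found.
    if len(mems) == 4:
        suspects.append("memory (no good memory may have been found)")
    else:
        suspects.extend(f"memory module {name.split()[1]}" for name in mems)
    suspects.extend(f"CPU module {name.split()[1]}" for name in cpus)
    suspects.extend(
        f"Futurebus+ option {name.split()[1]} (examine the option LEDs)" for name in fbus
    )
    return suspects


print(ocp_suspects({"CPU 0", "I/O"}))
```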


Figure 2–4 Module Locations Corresponding to OCP LEDs

[Figure: module locations labeled MEM (3, 2, 1, 0), CPU, and I/O.]

2.1.3 I/O Panel LEDs

The I/O panel LEDs (Figure 2–5) are used to indicate the status of ThinWire and thickwire (standard) Ethernet fuses.

Refer to Table 2–3 for information on interpreting the LEDs and determining what actions to take when a failure is indicated.


Figure 2–5 I/O Panel LEDs

[Figure: I/O panel. Callouts: ThinWire Ethernet Fuse OK, Thickwire Ethernet Fuse OK.]

Table 2–3 Interpreting I/O Panel LEDs

ThinWire Ethernet Fuse OK
  Meaning: When lit, indicates ThinWire fuse is good; unlit indicates fuse has blown.
  Action on Error: Replace fuse (refer to Chapter 5).

Thickwire Ethernet Fuse OK
  Meaning: When lit, indicates thickwire fuse is good; unlit indicates fuse has blown.
  Action on Error: Replace fuse (refer to Chapter 5).


2.1.4 Futurebus+ Option LEDs

The Futurebus+ option LEDs (Figure 2–6) are used to indicate the progress and result of self-tests for a speciﬁc Futurebus+ option.

Refer to Table 2–4 for information on interpreting the LEDs and determining what actions to take when a failure is indicated.

Figure 2–6 Futurebus+ Option LEDs

[Figure: Futurebus+ option module. Callouts: Fault, Run.]


Table 2–4 Interpreting Futurebus+ Option LEDs

Fault
  Meaning: The Fault indicator lights during self-tests. If it remains lit, the module has failed self-tests.
  Action on Error: Replace module.

Run
  Meaning: The Run indicator blinks during self-tests and remains lit if the module passes self-tests.

2.1.5 Storage Device LEDs

Storage device LEDs are used to indicate the status of the device. The LEDs for fixed-media storage devices are shown in Figures 2–7 and 2–8. Refer to the DEC 4000 Model 600 Series Owner’s Guide for information on LEDs for the removable-media devices.

Refer to Table 2–5 for information on interpreting the LEDs and determining what actions to take when a failure is indicated.


Figure 2–7 Fixed-Media Mass Storage LEDs (SCSI)

[Figure: Fast SCSI, 3.5-inch SCSI, and 5.25-inch SCSI storage compartments. Callouts: Fault, Online, Local Disk Converter OK, SCSI Terminator.]


Figure 2–8 Fixed-Media Mass Storage LEDs (DSSI)

[Figure: 3.5-inch DSSI and 5.25-inch DSSI storage compartments. Callouts: Fault, Online, Write Protect, Run/Ready, Local Disk Converter OK, DSSI Terminator with LED.]

Table 2–5 Interpreting Fixed-Media Mass Storage LEDs

Fault
  Meaning: When lit, indicates an error condition in the device. The Fault indicator may light temporarily during self-tests.
  Action on Error: Run device RBD tests and internal device tests to determine the nature of the error, and replace device.

Online
  Meaning (DSSI): When lit, indicates the device is on line and available for use. Under normal operation, flashes as seek operations are performed.
  Meaning (SCSI): Flashes as seek operations are performed; indicates drive activity.

DSSI Terminator
  Meaning: When lit, indicates DSSI termination power is present.
  Action on Error: If the DSSI terminator LED does not light, check the DSSI bus connections for that bus. If bus connections seem secure, the local disk converter module or DC5 converter may need to be replaced (Section 5.2):
    • Local disk converters (located in the fixed-media storage compartments) supply termination power for fixed-media storage devices.
    • The DC5 converter (part of the power supply subsystem) supplies termination power for storageless fixed-media compartments.

Local Disk Converter OK
  Meaning: When lit, indicates local disk converter for the specified storage compartment has power (this LED is located on the local disk power supply module behind the front panel of the storage compartment).
  Action on Error: Confirm that the system power supply is working properly (by checking power supply LEDs). Replace the local disk converter module (Section 5.2).

2.2 Power-Up Screens

During power-up self-tests a screen similar to the one shown in Figure 2–9 is displayed on the console terminal. The screen shows the status and result of the self-tests.


Figure 2–9 Power-Up Self-Test Screen

VMS PALcode Xn.nnX, OSF PALcode Xn.nnX (CPU 1 of 1, DECchip 21064)
17:33:56 Tuesday, January 26, 1993
Digital Equipment Corporation
DEC 4000 AXP
\ Executing Power-Up Diagnostics

[Screen: one status character is shown for each unit under the headings CPU (0, 1), Memory (0 to 3), Storage (A to E), Net (0, 1), and Futurebus+ (1 to 6).]

Legend: * Test in progress   P Pass   F Fail   - Not Present   ? Sizing

A power-on self-test failure indicated under Storage A–E may represent a failure of an embedded storage adapter (A–E) or failure of a drive on the speciﬁed bus. Check the console event log for additional information (Section 2.2.1).

Power-on self-test failures indicated for all six Futurebus+ slots indicate a failure of the Futurebus+ bridge on the I/O module. Replace the I/O module in the event that all six Futurebus+ slots show failures.
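The two interpretation rules above can be expressed as a short check. This Python sketch assumes the status characters have already been read from the screen into dictionaries keyed by storage bus letter and Futurebus+ slot number; the function name `review_power_up` and that input shape are hypothetical, since the console does not provide such an interface.

```python
# Status characters from the legend on the power-up self-test screen.
LEGEND = {"*": "Test in progress", "P": "Pass", "F": "Fail", "-": "Not present", "?": "Sizing"}


def review_power_up(storage, futurebus):
    """Return findings for storage buses (A-E) and Futurebus+ slots (1-6).

    storage maps bus letter -> status character;
    futurebus maps slot number -> status character.
    """
    findings = []
    for bus in sorted(storage):
        if storage[bus] == "F":
            findings.append(
                f"Storage {bus}: embedded adapter or a drive on bus {bus} "
                "may have failed; check the console event log"
            )
    failed_slots = [slot for slot in sorted(futurebus) if futurebus[slot] == "F"]
    if len(failed_slots) == 6:
        # All six slots failing points at the bridge, not the options.
        findings.append("All six Futurebus+ slots failed: replace the I/O module")
    else:
        findings.extend(f"Futurebus+ slot {slot}: option failed self-test" for slot in failed_slots)
    return findings


screen_storage = {"A": "P", "B": "F", "C": "P", "D": "P", "E": "P"}
screen_futurebus = {slot: "-" for slot in range(1, 7)}
for line in review_power_up(screen_storage, screen_futurebus):
    print(line)
```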

When the power-up diagnostics are completed, a second screen similar to the one shown in Figure 2–10 is displayed. This screen provides conﬁguration information for the system.


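Figure 2–10 Sample Power-Up Configuration Screen

Console Vn.n-nnnn    VMS PALcode Xn.nnX, OSF PALcode Xn.nnX

[Screen: one row per unit, with status characters such as P and an identification field. Rows: CPU 0, CPU 1, Memory 0 through Memory 3, Ethernet 0, Ethernet 1, storage buses A SCSI, B DSSI, C DSSI, D DSSI, and E SCSI, and Futurebus+. Sample identification fields: B2001-AA DECchip 21064-2, B2002-DA 128 MB, Address 08-00-2B-2A-D6-97, and Address 08-00-2B-2A-D6-A6. The storage rows are laid out in columns ID 0 through ID 7 and show the devices RZ73, RF73, TZ85, and RRD42, with Host marking the host ID on each bus. FBA0 also appears in the sample. The screen ends with a prompt line that reads "to boot dka0.0.0.0.0".]

System Status Pass

>>>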

2.2.1 Console Event Log

DEC 4000 AXP systems maintain a console event log consisting of status messages received during power-on self-tests. If there are problems during power-up, standard error messages may be embedded in the console event log. To display a console event log, use the cat el command.

Use the set screen_mode off command if you want to display the console event log during power-up, rather than the two power-up screens.

The following example shows an abbreviated console event log that contains two standard error messages: The first (a hard error) indicates a failure with storage bus B. This failure could be caused by a bad LDC, improperly seated storage drawer, or a disconnected power cable within the storage drawer. The second (a soft error) indicates a SCSI continuity card is missing from the removable-media storage compartment.


>>> cat el
Starting console.
halt code = 1
PC=0
initialized idle PCB
initializing semaphores
.
.
.
test Storage Bus B
ncr1, loopback connector attached OR
SCSI bus failure, could not acquire bus; Control Lines:ff Data lines:ff
ncr1 SCSI bus failure
*** Hard Error - Error #800
Diagnostic Name ID Device Pass Test Hard/Soft 7-OCT-1970
powerup 00000004 ncr1 0 0 1 0 10:48:58
Storage Bus B failure
*** End of Error ***
enable ncr2 ACK
test Storage Bus C
port p_c0.7.0.2.0 initialized, scripts are at 1d07e0
SCSI device found on pkc.0.0.2.0
loading SCSI driver for port p_c0.7.0.2.0
.
*** Soft Error - Error #1 - Lower SCSI Continuity Card Missing (connector J7)
Diagnostic Name ID Device Pass Test Hard/Soft 7-OCT-1992
io_test 00000067 scsi_low_con 1 1 0 1 11:25:53
*** End of Error ***
device mud9.5.0.3.0 (TF85) found on pud0.5.0.3.0
>>>
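When a console event log has been captured to a file (for example, from a sethost log), the embedded error messages can be pulled out mechanically. The Python sketch below is one possible approach, assuming only the "*** Hard Error" / "*** Soft Error" ... "*** End of Error ***" framing visible in the example above; the function name `find_errors` and the returned field names are invented for illustration, and other firmware versions may format errors differently.

```python
import re

# Matches one error block, whether or not the log preserves line breaks.
ERROR_BLOCK = re.compile(
    r"\*\*\* (Hard|Soft) Error - Error #(\d+)(.*?)\*\*\* End of Error \*\*\*",
    re.DOTALL,
)


def find_errors(log_text):
    """Return a list of dicts describing each error block in a console event log."""
    errors = []
    for match in ERROR_BLOCK.finditer(log_text):
        # Collapse whitespace and drop the "- " that may precede a title.
        detail = " ".join(match.group(3).split()).lstrip("- ")
        errors.append(
            {
                "severity": match.group(1).lower(),
                "number": int(match.group(2)),
                "detail": detail,
            }
        )
    return errors


sample = """test Storage Bus B
*** Hard Error - Error #800
Diagnostic Name ID Device Pass Test Hard/Soft 7-OCT-1970
powerup 00000004 ncr1 0 0 1 0 10:48:58
Storage Bus B failure
*** End of Error ***
*** Soft Error - Error #1 - Lower SCSI Continuity Card Missing (connector J7)
Diagnostic Name ID Device Pass Test Hard/Soft 7-OCT-1992
io_test 00000067 scsi_low_con 1 1 0 1 11:25:53
*** End of Error ***
"""

for err in find_errors(sample):
    print(err["severity"], err["number"], err["detail"][:40])
```

Hard errors are the ones to act on first; in the example, the soft error only reports a missing continuity card.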

2.2.2 Mass Storage Problems Indicated at Power-Up

Mass storage failures at power-up are usually indicated in one of two ways:

• The power-up screens report a storage adapter port failure (indicated by an "F").

• One or more drives are missing from the configuration screen display (or too many drives are displayed).

Figures 2–11 and 2–12 provide a ﬂowchart for troubleshooting ﬁxed-media mass storage problems indicated at power-up. Use the ﬂowchart to diagnose the likely cause of the problem. Table 2–6 lists the symptoms and corrective action for each of the possible problems.


Figure 2–11 Flowchart for Troubleshooting Fixed-Media Problems

Does the disk drive have power?
  Check the Disk Power Failure LED on the PSC.
    LED on: Likely LDC failure.
    LED off: Continue.
  Check the LDC OK LED on the storage compartment front panel.
    LED off: LDC failure.
    LED on: Continue.

Has the disk drive failed?
  Check the drive’s fault LED.
    LED on (steady): Drive failure.
    LED flashing: Drive is performing extended calibration; wait for tests to complete.
    LED off: Continue.

Are bus node ID plugs improperly set?
  Check that all drives on the bus have unique bus node ID numbers (no duplicates).
    Duplicate bus node IDs: Configuration rule violation.
  Check that no drive is set to bus node ID 7 (reserved for host ID).
    Drive set to host ID 7: Configuration rule violation.
  Otherwise: Continue.

Is the storage drawer properly seated?
  Power down, remove drawer and inspect connectors, reseat drawer and power up.
    Problems solved: Drawer not properly seated.
    Problems persist: Continue.


Figure 2–12 Flowchart for Troubleshooting Fixed-Media Problems (Continued)

Are cables loose or missing?
  Power down, remove drawer and check all cable connections, reseat drawer and power up.
    Problems solved: Cable disconnected.
    Problems persist: Continue.

Is the storage bus terminated?
  Check that a terminator is in place.
    Terminator missing: Terminator missing.
    Terminator present: Continue.
  Check that terminator power is present. For DSSI buses, check that the terminator LED is on. For SCSI buses use a voltmeter on the port connector (termination power is supplied by pin 38, ground on pin 1).
    No termination power: LDC failure (with fixed-media devices) or DC5 failure (for storageless fixed-media compartments).
    Power present: Continue.

Is the I/O module the source of the problem?
  Swap the failing drive drawer to another compartment.
    Problems solved: I/O module failure.
    Problems persist: Likely problem with drive, drawer, or cables. Check again before continuing.

Is the backplane the source of the problem?
  Eliminate all of the preceding problem sources before suspecting the backplane. The backplane is the least likely to fail.
  Disassemble the system as described in Section 5.4. Inspect the two backplane interconnect cables.
    Cable connections are loose or damaged: Backplane interconnect cable failure.
    Cables are OK: Replace backplane assembly as described in Section 5.4.
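The order of checks in the fixed-media flowchart can be summarized as a chain of early returns. The Python sketch below is a reading aid rather than a diagnostic tool: the `FixedMediaObservations` class, its field names, and the `diagnose_fixed_media` function are all hypothetical, and each field stands for something a technician observes by following the corresponding flowchart step.

```python
from dataclasses import dataclass


@dataclass
class FixedMediaObservations:
    disk_power_failure_led_on: bool = False
    ldc_ok_led_on: bool = True
    drive_fault_led: str = "off"  # "off", "steady", or "flashing"
    duplicate_bus_node_ids: bool = False
    drive_set_to_id_7: bool = False
    solved_by_reseating_drawer: bool = False
    solved_by_checking_cables: bool = False
    terminator_present: bool = True
    termination_power_present: bool = True
    compartment_has_drives: bool = True
    solved_by_moving_drawer: bool = False


def diagnose_fixed_media(obs):
    """Walk the flowchart questions in order and return the first conclusion."""
    if obs.disk_power_failure_led_on or not obs.ldc_ok_led_on:
        return "LDC failure"
    if obs.drive_fault_led == "steady":
        return "Drive failure"
    if obs.drive_fault_led == "flashing":
        return "Drive is performing extended calibration; wait for tests to complete"
    if obs.duplicate_bus_node_ids or obs.drive_set_to_id_7:
        return "Configuration rule violation"
    if obs.solved_by_reseating_drawer:
        return "Drawer not properly seated"
    if obs.solved_by_checking_cables:
        return "Cable disconnected"
    if not obs.terminator_present:
        return "Terminator missing"
    if not obs.termination_power_present:
        # LDCs power termination for compartments with drives; DC5 otherwise.
        return "LDC failure" if obs.compartment_has_drives else "DC5 failure"
    if obs.solved_by_moving_drawer:
        return "I/O module failure"
    return "Recheck drive, drawer, and cables; suspect the backplane last"


print(diagnose_fixed_media(FixedMediaObservations(drive_fault_led="steady")))
```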


Table 2–6 Fixed-Media Mass Storage Problems

LDC failure
  Symptom: Disk power failure LED on PSC is on. LDC OK LED on storage compartment front panel is off. Power-up screen reports a failing storage adapter port.
  Corrective Action: Replace LDC.

Drive failure
  Symptom: Fault LED for drive is on (steady).
  Corrective Action: Replace drive.

Duplicate bus node ID plugs (or a missing plug)
  Symptom: Drives with duplicate bus node ID plugs are missing from the configuration screen display. A drive with no bus node ID plug defaults to zero.
  Corrective Action: Correct bus node ID plugs.

Bus node ID set to 7 (reserved for host ID)
  Symptom: Valid drives are missing from the configuration screen display. One drive may appear seven times on the configuration screen display.
  Corrective Action: Correct bus node ID plugs.

Storage drawer not properly seated
  Symptom: Disk power failure LED on PSC is on. LDC OK LED on storage compartment front panel is off. Power-up screen reports a failing storage adapter port.
  Corrective Action: Remove drawer and check its connectors. Reseat drawer.

Missing or loose cables
  Symptom:
    Cable, storage device to ID panel: Bus node ID defaults to zero; online LEDs do not come on.
    Flex circuit, LDC to storage interface module: Disk power failure LED on PSC is on; LDC OK LED on storage compartment front panel is off; and power-up screen reports a failing storage adapter port.
    Cable, LDC to storage interface module: Power-up screen reports a failing storage adapter port; drive LEDs do not come on at power-up.
    Cable, LDC to storage device: Drive does not show up in configuration screen display.
  Corrective Action: Remove storage drawer and inspect cable connections.

Terminator missing
  Symptom: Read/write errors in console event log; storage adapter port may fail.
  Corrective Action: Attach terminator to connector port.

No termination power
  Symptom: DSSI terminator LED is off, or no termination voltage measured at SCSI connector (pin 38, ground pin 1); read/write errors; storage adapter port may fail.
  Corrective Action: Replace LDC (termination power source for fixed-media storage compartments). Replace DC5 converter (termination power source for storageless fixed-media storage compartments).

I/O module failure
  Symptom: The storage drawer exhibits no problems when moved to another compartment.
  Corrective Action: Replace I/O module.

Backplane failure
  Symptom: Replacing the I/O module does not solve problem. The port continues to fail and the problem is not with the storage drawer.
  Corrective Action: Disassemble system and inspect backplane interconnect cables. If the cables and cable connections do not appear to be the problem, replace the backplane.
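The bus node ID rules in the table (unique IDs, ID 7 reserved for the host, a missing plug defaulting to zero) are easy to check before powering up. The following Python sketch assumes the plug settings have been noted by hand; the drive names and the `check_bus_node_ids` function are hypothetical.

```python
HOST_ID = 7  # reserved for the host on each bus


def check_bus_node_ids(plugs):
    """Check ID plug settings for the drives on one bus.

    plugs maps a drive label to its ID plug value, or None if no plug is fitted.
    Returns a list of problem descriptions (empty if the settings look valid).
    """
    problems = []
    by_id = {}
    for drive, plug in plugs.items():
        if plug is None:
            problems.append(f"{drive}: no ID plug, defaults to 0")
        # A drive without a plug takes part in the bus as ID 0.
        by_id.setdefault(0 if plug is None else plug, []).append(drive)
    for node_id in sorted(by_id):
        drives = ", ".join(by_id[node_id])
        if node_id == HOST_ID:
            problems.append(f"{drives}: ID 7 is reserved for the host")
        elif len(by_id[node_id]) > 1:
            problems.append(f"{drives}: duplicate ID {node_id}")
    return problems


for problem in check_bus_node_ids({"disk0": 1, "disk1": 1, "disk2": None, "disk3": 7}):
    print(problem)
```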

Figures 2–13 and 2–14 provide a ﬂowchart for troubleshooting removable-media storage problems indicated at power-up. Use the ﬂowchart to diagnose the likely cause of the problem. Table 2–7 lists the symptoms and corrective action for each of the possible problems.


Figure 2–13 Flowchart for Troubleshooting Removable-Media Problems

Has the drive failed?
  Check the drive’s fault LED.
    LED on (steady): Drive failure.
    LED off: Continue.

Are bus node ID plugs improperly set?
  Check that all drives on the bus have unique bus node ID numbers (no duplicates).
    Duplicate bus node IDs: Configuration rule violation.
  Check that no drive is set to bus node ID 7 (reserved for host ID).
    Drive set to host ID 7: Configuration rule violation.
  Otherwise: Continue.

Is the SCSI continuity card missing?
  Check the console event log for an error message indicating a SCSI continuity card is missing. If the top and/or bottom storage compartments do not have half-height drives, a SCSI continuity card is needed to continue the bus. Refer to Section 6.1.5.2 for more information.
    SCSI continuity card missing: Configuration rule violation.
    Half-height drive or SCSI continuity card present: If console event log reports erroneously that the SCSI continuity card is missing, replace the Vterm module. The Vterm module contains the logic for reporting SCSI continuity card errors. Otherwise, continue.


Figure 2–14 Flowchart for Troubleshooting Removable-Media Problems (Continued)

Are cables loose or missing?
  Power down, remove drive and check all cable connections, replace drive and power up.
    Problems solved: Cable disconnected.
    Problems persist: Continue.

Is the storage bus terminated?
  Check that a terminator is in place.
    Terminator missing: Terminator missing.
    Terminator present: Continue.
  Check that terminator power is present. Use a voltmeter on the port connector (termination power is supplied by pin 38, ground on pin 1).
    No termination power: Vterm module failure.
    Power present: Continue.

Is the I/O module the source of the problem?
  Replace the I/O module.
    Problems solved: I/O module failure.
    Problems persist: Likely problem with drive or cables. Check again before continuing.

Is the backplane the source of the problem?
  Eliminate all of the preceding problem sources before suspecting the backplane. The backplane is the least likely to fail.
  Disassemble the system as described in Section 5.4. Inspect the two backplane interconnect cables.
    Cable connections are loose or damaged: Backplane interconnect cable failure.
    Cables are OK: Replace backplane assembly as described in Section 5.4.


Table 2–7 Removable-Media Mass Storage Problems

Drive failure
  Symptom: Fault LED for drive is on (steady).
  Corrective Action: Replace drive.

Duplicate bus node ID plugs (or a missing plug)
  Symptom: Drives with duplicate bus node ID plugs are missing from the configuration screen display. A drive with no bus node ID plug defaults to zero.
  Corrective Action: Correct bus node ID plugs.

Bus node ID set to 7 (reserved for host ID)
  Symptom: Valid drives are missing from the configuration screen display. One drive may appear seven times on the configuration screen display.
  Corrective Action: Correct bus node ID plugs.

SCSI continuity card missing
  Symptom: Power-up screen reports a failing storage adapter port; console event log contains soft error message reporting a SCSI continuity card is missing; drives on Bus E are not displayed on configuration screen; possible read/write errors.
  Corrective Action: Attach SCSI continuity card (Section 6.1.5.2). If console erroneously reports SCSI continuity card as missing, replace the Vterm module. The Vterm module contains the logic for reporting SCSI continuity card errors.

Missing or loose cables
  Symptom:
    Cable, storage device to ID panel: Bus node ID defaults to zero; online LED does not come on.
    Cable, power: Drive does not show up in configuration screen display.
  Corrective Action: Remove device and inspect cable connections.

Terminator missing
  Symptom: Read/write errors in console event log; storage adapter port may fail.
  Corrective Action: Attach terminator to connector port.

Vterm module failure
  Symptom: No termination voltage measured at Bus E SCSI connector (pin 38, ground pin 1); read/write errors; storage adapter port may fail; or console erroneously reports SCSI continuity card as missing.
  Corrective Action: Replace Vterm module (termination power source for removable-media storage compartment).

I/O module failure
  Symptom: Problems persist after eliminating the above problem sources.
  Corrective Action: Replace I/O module.

Backplane failure
  Symptom: Replacing the I/O module does not solve problem. The port continues to fail and the problem is not with the device or cables.
  Corrective Action: Disassemble system and inspect backplane interconnect cables. If the cables and cable connections do not appear to be the problem, replace the backplane.

2.2.3 Robust Mode Power-Up

Robust mode allows you to power up without initiating drivers or running power-up diagnostics.

Robust mode lets you reach the console program when one of the following prevents a normal power-up from getting there:

• An error in the nonvolatile nvram ﬁle

• An incorrect environment variable setting

• A driver error

Note

The console program has limited functionality in robust mode.

Once in console mode, you can:

• Edit the nvram file (using the edit command)

• Assign a correct value to an environment variable (using the show and set commands)

• Start individual classes or sets of drivers, called phases (using the init -driver # command). The pound sign (#) is the phase number 2, 3, 4, or 5, and each phase is started individually in increasing order.
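Because driver phases are started one at a time in increasing order, it can help to write out the console commands before typing them. The Python sketch below only builds the command strings for phases 2 through 5 using the init -driver # form named in this section; the helper name `driver_phase_commands` is hypothetical, and nothing here talks to the console.

```python
# Driver phases that can be started individually in robust mode.
PHASES = (2, 3, 4, 5)


def driver_phase_commands(up_to=5):
    """Return the console commands, in order, to start phases 2 through up_to."""
    if up_to not in PHASES:
        raise ValueError("phase must be 2, 3, 4, or 5")
    # Each phase gets its own command; phases are never skipped or reordered.
    return [f"init -driver {phase}" for phase in PHASES if phase <= up_to]


for command in driver_phase_commands(4):
    print(">>>", command)
```

Stopping at the phase after which the console misbehaves narrows a driver error to that phase.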

Note

The nonvolatile file, nvram, is shipped from the factory with no contents. The customer can use the edit command to create a customized script or command file that is executed as the last step of every power-up.

To set the system to robust mode, set the baud rate select switch located behind the OCP to 0, as shown in Section 6.5. The robust mode setting uses a 9600 console baud rate.

2.3 Power-Up Sequence

During the DEC 4000 AXP power-up sequence, the power supplies are stabilized and tested and the system is initialized and tested via the ﬁrmware power-on self-tests.

The power-up sequence includes the following:

• Power supply power-up:

– Includes AC power-up and power supply self-test.
– Includes DC power-up and power supply self-tests.

• Two sets of power-on diagnostics:

– Serial ROM diagnostics
– Console firmware-based diagnostics

2.3.1 AC Power-Up Sequence

With no AC power applied, no energy is supplied to any part of the enclosure. AC power is applied to the system with the AC circuit breaker on the front end unit (FEU) of the power supply (see Figure 2–1). With just AC power applied, the AC present LED is the only LED illuminated on the power supply.

Figure 2–15 provides a description of the AC power-up sequence. Failures during AC power-up are indicated by the power supply subsystem LEDs.

Additional error information is displayed on the PSC Fault ID display. Refer to Appendix B for PSC fault display information.


Figure 2–15 AC Power-Up Sequence

1. AC plug is inserted into wall outlet.
2. AC circuit breaker is set to on (1).
3. AC power (country-specific voltage) enters FEU module.
4. FEU creates two +48V outputs:
   1. BUS_DIRECT +48 VDC output (always on) immediately goes to +48 DC inputs on DC5, DC3 and PSC modules.
   2. BUS_SWITCHED (+V-V) +48 VDC output (off) goes to +48 VDC input on LDCs and Futurebus+ modules.
5. +48 VDC enters PSC, energizes microprocessor power system.
6. PSC module verifies microprocessor power.
   OK: Continue.
   FAILED: Micro power system output not valid. FEU failure LED is turned on. PSC microprocessor latches into shutdown.
7. PSC microprocessor performs internal self-test and PSC interface test.
   OK: PSC microprocessor self-test passed, PSC OK LED is turned on.
   FAILED: PSC microprocessor failed self-test. PSC failure LED is turned on. PSC microprocessor latches into shutdown.
8. PSC verifies +48 VDC BUS_DIRECT output is okay, turns on FEU OK LED.
9. PSC verifies input voltage conditions: AC_POWER (AC power), FEU_HVDC (FEU high voltage), DIRECT_48V (+48V BUS_DIRECT).
   All three are okay: PSC waits for power-up command. PSC loops in routine checking status (WAIT).
   If BUS_DIRECT and AC power are not okay, the system is in AC low line condition. NO FEU LEDs are turned on. PSC waits for either output to become okay.
   If +48 VDC BUS_DIRECT is not asserted, but AC power is okay, FEU has failed. FEU failure LED comes on. PSC latches in shutdown.


2.3.2 DC Power-Up Sequence

DC power is applied to the system with the DC on/off switch on the operator control panel.

Figures 2–16 and 2–17 provide a description of the DC power-up sequence. Failures during DC power-up are indicated by the power supply subsystem LEDs.

Additional error information is displayed on the PSC Fault ID display. Refer to Appendix B for PSC fault display information.


Figure 2–16 DC Power-Up Sequence

1. The DC on/off switch is set to on (1).

2. The PSC starts the DC power-up sequence and status check.

3. The PSC checks the temperature sensor.

– FAILED: The PSC fault LED is turned on and the fans operate at full speed.

4. The PSC checks overtemperature status (onboard).

– FAILED: The fans are kept running while an orderly shutdown is initiated. The overtemperature shutdown LED is turned on. The fans are turned off after a 30-second delay.

5. The PSC commands the FEU to start the fans by asserting FAN_POWER_ENABLE H. All fans are started at maximum speed and rotation speed is verified.

– FAILED: One or more fans fail to start. The fans are kept running while an orderly shutdown is initiated. The fan failure LED is turned on and the fan number is displayed. The fans are turned off after a 30-second delay.

6. The PSC negates the ASYNC_RESET signal to the system CPU, commands the FEU to turn on the +48 VDC BUS_SWITCHED output, and waits 100 ms for the FEU to assert the BUS_SWTCHD_OK signal.

– OK: The FEU +48 VDC switched output (+V-V) goes to the local disk converters (LDCs) and Futurebus+ slots.

– FAILED: BUS_SWTCHD_OK did not assert within 100 ms. The fans are turned off, the FEU OK LED is turned off, the FEU failure LED is turned on, and the PSC latches in shutdown mode.

7. The PSC commands the DC3 to turn on the +3.3 VDC output and waits 50 ms for +3.3 VDC to reach regulation.

– OK: The PSC commands the DC5 to turn on the +5.1 VDC output. The sequence continues in Figure 2–17.

– FAILED: The output did not reach regulation in time. The fans and active DC outputs are turned off, the failure LED on the DC3 module is turned on, and the PSC latches in shutdown mode.


Figure 2–17 DC Power-Up Sequence (Continued)

1. The PSC waits 30 ms for +5.1 VDC to reach regulation.

– OK: The DC5 OK LED is turned on.

– FAILED: The output did not reach regulation in time. The fans and active DC outputs are turned off, the failure LED on the DC5 module is turned on, and the PSC latches in shutdown mode.

2. The PSC commands the DC3 to turn on the +2.1 VDC output and waits 20 ms for +2.1 VDC to reach regulation.

– FAILED: The output did not reach regulation in time. The fans and active DC outputs are turned off, the failure LED on the DC3 module is turned on, and the PSC latches in shutdown mode.

3. The PSC commands the DC3 to turn on the +12 VDC output and waits 100 ms for +12 VDC to reach regulation.

– OK: The DC3 OK LED is turned on. All DC outputs except the LDCs are energized.

– FAILED: The output did not reach regulation in time. The fans and active DC outputs are turned off, the failure LED on the DC3 module is turned on, and the PSC latches in shutdown mode.

4. The PSC checks the status of the entire power system and delays for 45 ms.

– FAILED: One of the above outputs has failed; the failure mode is indicated as described above for the appropriate output.

5. The PSC negates ASYNC_RESET_L and asserts POK_H, then begins powering the LDCs. Each LDC has an enable bit that, when asserted, starts a timer. The LDC has 50 ms to respond with its LDC_OK signal asserted.

– OK: LDC_OK is received within 50 ms and a 5-second timeout is initiated for disk spin-up time.

– FAILED: The LDC did not respond in the time allowed. The disk power failure LED is turned on and the corresponding letter (A, B, C, or D) is displayed on the fault ID display. The next LDC is tested.

6. System power-up is complete. The PSC microprocessor begins ongoing status monitoring.
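The regulation timeouts in Figures 2–16 and 2–17 can be summarized in a short sketch. This is only an illustration of the sequencing logic, not PSC firmware: the function name, the table layout, and the idea of passing measured rise times in a dictionary are assumptions made for the example. The rail order, modules, and time limits are taken from the figures.

```python
from typing import Dict, List, Optional, Tuple

# (rail, regulator module, time allowed to reach regulation in ms),
# in the order the figures show the PSC enabling the outputs.
RAIL_SEQUENCE: List[Tuple[str, str, int]] = [
    ("+3.3 VDC", "DC3", 50),
    ("+5.1 VDC", "DC5", 30),
    ("+2.1 VDC", "DC3", 20),
    ("+12 VDC", "DC3", 100),
]


def sequence_rails(rise_time_ms: Dict[str, float]) -> Tuple[List[str], Optional[str]]:
    """Return (rails left energized, module whose failure LED is lit or None).

    rise_time_ms is a hypothetical input: the time each rail took to reach
    regulation. A missing entry is treated as never reaching regulation.
    """
    active: List[str] = []
    for rail, module, limit_ms in RAIL_SEQUENCE:
        measured = rise_time_ms.get(rail)
        if measured is None or measured > limit_ms:
            # Per the figures: active DC outputs are turned off, the module's
            # failure LED is turned on, and the PSC latches in shutdown mode.
            return [], module
        active.append(rail)
    return active, None
```

For example, a +5.1 VDC output that needs 45 ms exceeds its 30 ms limit, so the sketch reports the DC5 module and leaves no rails energized.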


2.3.3 Firmware Power-Up Diagnostics

After successful completion of AC and DC power-up sequences, the processor performs its power-up diagnostics. These tests verify system operation, load the system console, and test the kernel system, including all boot path devices. These tests are performed as two distinct sets of diagnostics:

1. Serial ROM diagnostics—These tests are loaded from the serial ROM located on the CPU module into the CPU’s instruction cache (I-cache). They check the basic functionality of the system and load the console code from the FEPROM on the I/O module into system memory.

Failures during these tests are indicated by LEDs on the operator control panel.

2. Console ﬁrmware-based diagnostics—These tests are executed by the console code. They test the kernel system, including all boot path devices.

Failures during these tests are reported to the console terminal (via the power-up screen or console event log).

2.3.3.1 Serial ROM Diagnostics

The serial ROM diagnostics are loaded into the CPU’s I-cache from the serial ROM on the CPU module. They test the system in the following order:

1. Test the CPU and backup cache located on the CPU module.

2. Test the CPU module’s system bus interface.

3. Check the access to the I/O module.

4. Locate the largest memory module in the system and test the ﬁrst 4 MB of memory on the module. Only the ﬁrst 4 MB of memory are tested. If there is more than one memory module of the same size, the one closest to the CPU is tested ﬁrst.

If the memory test fails, the next largest memory module in the system is tested. Testing continues until a good memory module is found. If a good memory module is not found, the corresponding LEDs on the OCP are illuminated and the power-up diagnostics are terminated.

5. After ﬁnding the ﬁrst memory module with a good ﬁrst 4 MB of memory, the console program is loaded into memory from the FEPROM on the I/O module. At this time control is passed to the console code and the console ﬁrmware-based diagnostics are run.
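The memory selection in steps 4 and 5 can be sketched as follows. The sketch is illustrative only: the class and function names are hypothetical, and it assumes that a lower slot number means a module is closer to the CPU, which the serial ROM code may determine differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MemoryModule:
    slot: int      # backplane slot (assumed: lower number = closer to the CPU)
    size_mb: int   # module capacity in MB


def srom_test_order(modules: List[MemoryModule]) -> List[MemoryModule]:
    """Largest module first; equal sizes are ordered by closeness to the CPU."""
    return sorted(modules, key=lambda m: (-m.size_mb, m.slot))


def find_console_memory(
    modules: List[MemoryModule],
    first_4mb_ok: Callable[[MemoryModule], bool],
) -> Optional[MemoryModule]:
    """Return the first module whose first 4 MB tests good, or None.

    first_4mb_ok stands in for the serial ROM memory test. None corresponds
    to the case where the OCP LEDs are lit and power-up diagnostics stop.
    """
    for module in srom_test_order(modules):
        if first_4mb_ok(module):
            return module
    return None
```

With two 64-MB modules in slots 4 and 6 and a 32-MB module in slot 5, the order tried is slot 4, slot 6, then slot 5.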


2.3.3.2 Console Firmware-Based Diagnostics

Console ﬁrmware-based tests are executed once control is passed to the console code in memory. They check the system in the following order:

1. Perform a complete check of system memory. If a system has more than one memory module, the modules are checked in parallel.

2. Set memory interleave to maximize interleave factor across as many memory modules as possible (one, two, or four-way interleaving). During this time the console ﬁrmware is moved into backup cache on the primary CPU module. After memory interleave is set, the console ﬁrmware is moved back into memory.

Steps 3–7 may be completed in parallel.

3. Start the I/O drivers for mass storage devices and tapes. At this time a complete functional check of the machine is made. After the I/O drivers are started, the console program continuously polls the bus for devices (approximately every 20 or 30 seconds).

4. Size, conﬁgure, and test the Futurebus+ options.

5. Exercise memory.

6. Check that the SCSI continuity card or a storage device is installed in the removable-media storage bus (Bus E, connectors J6 and J7).

7. Run exercisers on the disk drives currently seen by the system.

Note

This step does not currently ensure that all disks in the system will be tested or that any device drivers will be completely tested. To ensure complete testing of disk devices, use the test command.

8. Enter console mode or boot the operating system. This action is determined by the auto_action environment variable.

2.4 Boot Sequence

Bootstrapping is the process of loading a program image into memory and transferring control to the loaded program. The system firmware uses the bootstrap procedure defined by the Alpha AXP architecture and described in the Alpha System Reference Manual. On a DEC 4000 AXP system, bootstrap can be attempted only by the primary processor or boot processor. The firmware uses device and optional filename information specified either on the command line or in appropriate environment variables.

There are only three conditions under which the boot processor attempts to bootstrap the operating system:

1. The boot command is typed on the console terminal.

2. The system is reset or powered up and AUTO_ACTION is set to boot (and the halt switch is not set to halt).

3. An operating system restart is attempted and fails.

The firmware’s function in a bootstrap is to load a program into memory and begin its execution. This program may be a primary bootstrap program, such as Alpha Primary Boot (APB), Ultrixboot, or any other applicable program specified by the user or residing in the boot block, MOP server, or TCP/IP server.

2.4.1 Cold Bootstrapping in a Uniprocessor Environment

This section describes a cold bootstrap in a uniprocessor environment. A system bootstrap will be a cold bootstrap when any of the following occur:

• Power is first applied to the system.

• A console initialize command is issued and the auto_action environment variable is set to ‘‘Boot.’’

• The boot_reset environment variable is set to ‘‘On.’’

• A cold bootstrap is requested by system software.

The console must perform the following steps in the cold bootstrap sequence:

1. Perform a system initialization

2. Size memory

3. Test sufficient memory for bootstrapping

4. Load PALcode

5. Build a valid Hardware Restart Parameter Block (HWRPB)

6. Build a valid Memory Data Descriptor Table in the HWRPB

7. Initialize bootstrap page tables and map initial regions

8. Locate and load the system software primary bootstrap image

9. Initialize processor state on all processors

10. Transfer control to the system software primary bootstrap image


The steps leading to the transfer of control to system software may be performed in any order. The ﬁnal state seen by system software is deﬁned, but the implementation-speciﬁc sequence of these steps is not. Prior to beginning a bootstrap, the console must clear any internally pended restarts to any processor.

2.4.2 Loading of System Software

The console uses the boot_dev environment variable to determine the bootstrap device and the path to that device. These environment variables contain lists of bootstrap devices and paths; each list element speciﬁes the complete path to a given bootstrap device. If multiple elements are speciﬁed, the console attempts to load a bootstrap image from each in turn.

The console uses the bootdef_dev, boot_dev, and booted_dev environment variables as follows:

1. At console initialization, the console sets the bootdef_dev and boot_dev environment variables to be equivalent. The format of these environment variables is determined by the console implementation and is independent of the console presentation layer; the value may be interpreted and modiﬁed by system software.

2. When a bootstrap results from a boot command that specifies a bootstrap device list, the console uses the list specified with the command. The console modifies boot_dev to contain the specified device list. Note that this may require conversion from the presentation layer format to the registered format.

3. When a bootstrap is the result of a boot command that does not specify a bootstrap device list, the console uses the bootstrap device list contained in the bootdef_dev environment variable. The console copies the value of bootdef_dev to boot_dev.

4. When a bootstrap is not the result of a boot command, the console uses the bootstrap device list contained in the boot_dev environment variable. The console does not modify the contents of boot_dev.

5. The console attempts to load a bootstrap image from each element of the bootstrap device list. If the list is exhausted prior to successfully transferring control to system software, the bootstrap attempt fails and the subsequent console action is determined by auto_action.

6. The console indicates the actual bootstrap path and device used in the booted_dev environment variable. The console sets booted_dev after loading the primary bootstrap image and prior to transferring control to system software. The booted_dev format follows that of a boot_dev list element.

7. If the bootstrap device list is empty (bootdef_dev or boot_dev is null), the action is implementation-specific. The console may remain in console I/O mode or attempt to locate a bootstrap device in an implementation-specific manner.
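The way the console chooses and walks the bootstrap device list can be sketched as below. This is a minimal illustration, not console firmware: the environment is modeled as a plain dictionary of lists, and the function names and the try_load callback are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def select_device_list(
    env: Dict[str, object],
    boot_command_issued: bool,
    command_devices: Optional[List[str]] = None,
) -> List[str]:
    """Pick the bootstrap device list and update boot_dev as described above."""
    if boot_command_issued and command_devices:
        # boot command with a device list: boot_dev takes that list.
        env["boot_dev"] = list(command_devices)
    elif boot_command_issued:
        # boot command without a device list: bootdef_dev is copied to boot_dev.
        env["boot_dev"] = list(env.get("bootdef_dev", []))
    # No boot command: boot_dev is used as-is and left unmodified.
    return list(env.get("boot_dev", []))


def bootstrap(
    env: Dict[str, object],
    try_load: Callable[[str], bool],
    boot_command_issued: bool,
    command_devices: Optional[List[str]] = None,
) -> Optional[str]:
    """Try each list element in turn; record the device used in booted_dev.

    Returns None when the list is exhausted, in which case the subsequent
    console action would be determined by auto_action.
    """
    for device in select_device_list(env, boot_command_issued, command_devices):
        if try_load(device):
            env["booted_dev"] = device
            return device
    return None
```

The device names used when calling the sketch (for example dka0 or eza0) are placeholders in the style of Table 2–8, not a statement about any particular configuration.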

The boot_file and boot_osflags environment variables are used as default values for the bootstrap filename and option flags. The console indicates the actual bootstrap image filename (if any) and option flags for the current bootstrap attempt in the booted_file and booted_osflags environment variables. The boot_file default bootstrap image filename is used whenever the bootstrap requires a filename and either none was specified on the boot command or the bootstrap was initiated by the console as the result of a major state transition. The console never interprets the bootstrap option flags, but simply passes them between the console presentation layer and system software.

2.4.3 Warm Bootstrapping in a Uniprocessor Environment

The actions of the console on a warm bootstrap are a subset of those for a cold bootstrap. A system bootstrap will be a warm bootstrap whenever the boot_reset environment variable is set to ‘‘Off’’ (46 4E4F hexadecimal) and console internal state permits.

The console program performs the following steps in the warm bootstrap sequence.

1. Locates and validates the Hardware Restart Parameter Block (HWRPB)

2. Locates and loads the system software primary bootstrap image

3. Initializes processor state on all processors

4. Initializes bootstrap page tables and maps initial regions

5. Transfers control to the system software primary bootstrap image

At warm bootstrap, the console does not load PALcode, does not modify the Memory Data Descriptor Table, and does not reinitialize any environment variables. If the console cannot locate and validate the previously initialized HWRPB, the console must initiate a cold bootstrap. Prior to beginning a bootstrap, the console must clear any internally pended restarts to any processor.
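The choice between a cold and a warm bootstrap described in Sections 2.4.1 and 2.4.3 can be expressed as a small decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and ‘‘console internal state permits’’ is modeled only as whether a previously initialized HWRPB can be located and validated.

```python
def choose_bootstrap(
    *,
    power_first_applied: bool,
    initialize_with_auto_boot: bool,
    boot_reset: str,
    cold_requested_by_software: bool,
    hwrpb_valid: bool,
) -> str:
    """Return "cold" or "warm" for the next bootstrap attempt."""
    if (
        power_first_applied
        or initialize_with_auto_boot       # initialize command with auto_action = Boot
        or boot_reset.lower() == "on"
        or cold_requested_by_software
    ):
        return "cold"
    if not hwrpb_valid:
        # Without a valid HWRPB the console must initiate a cold bootstrap.
        return "cold"
    return "warm"
```

A warm result means the console skips loading PALcode and leaves the Memory Data Descriptor Table and environment variables as they are.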


2.4.4 Multiprocessor Bootstrapping

Multiprocessor bootstrapping differs from uniprocessor bootstrapping primarily in synchronization between processors. In a shared memory system, processors cannot independently load and start system software; bootstrapping is controlled by the primary processor.

DEC 4000 AXP systems always select CPU0 as the primary processor. The secondary processor polls a mailbox for a start address.

2.4.5 Boot Devices

The supported boot devices shown in Table 2–8 are determined by the console’s device drivers.

Table 2–8 Supported Boot Devices

Adapter      Bus         Device   Name

I/O module   Ethernet    TGEC     EZAn
I/O module   DSSI/SCSI   Disk     DUan/DKan
I/O module   DSSI/SCSI   Tape     MUan/MKan


Running System Diagnostics

This chapter provides information on how to run system diagnostics.

• Section 3.1 describes how to run ROM-based diagnostics, including error reporting utilities, and loopback tests.

• Section 3.2 describes how to run DSSI internal device tests.

• Section 3.3 describes the DEC VET veriﬁer and exerciser software.

• Section 3.4 describes how to run UETP environmental test package software.

• Section 3.5 describes acceptance testing and initialization procedures.

3.1 Running ROM-Based Diagnostics

DEC 4000 AXP ROM-based diagnostics (RBDs), which are part of the console ﬁrmware that is loaded from the FEPROM on the I/O module, offer many powerful diagnostic utilities, including the ability to examine error logs from the console environment and run system- or device-speciﬁc exercisers.

Unlike previous systems, DEC 4000 AXP RBDs rely on exerciser modules, rather than functional tests, to isolate errors. The exercisers are designed to run concurrently, providing maximum bus interaction between the console drivers and the target devices.

The multitasking ability of the console ﬁrmware allows you to run diagnostics in the background (using the background operator ‘‘&’’ at the end of the command). You run RBDs by using console commands.

RBDs can be separated into four types of utilities:

1. System or device diagnostic test/exercisers using the test command (Section 3.1.1).

The test command is the primary diagnostic for acceptance testing and console environment diagnosis.

2. Three related commands are used to list system bus FRUs, report the status of RBDs in progress, and report errors:

• The show fru command (Section 3.1.2) reports system bus FRUs, module part numbers, hardware and software revision numbers, and summary error information.

• The show_status command (Section 3.1.3) reports the error count and status of RBD test/exercisers currently in progress.

• The show error command (Section 3.1.4) reports errors captured by test-directed diagnostics (TDD), via the RBDs, and by symptom-directed diagnostics (SDD), via the operating system.

3. Several commands allow you to perform extended testing and exercising of specific system components. These commands are used for troubleshooting and are not needed for routine acceptance testing:

• The memexer command (Section 3.1.5) exercises memory by running a specified number of memory tests. The tests are run in the background.

• The memexer_mp command (Section 3.1.6) tests memory in a multiprocessor system by running a specified number of memory exerciser sets. The tests are run in the background.

• The exer_read command (Section 3.1.7) tests a disk by performing random reads on the device.

• The exer_write command (Section 3.1.8) tests a disk by performing random writes to the specified device.

• The fbus_diag command (Section 3.1.9) tests the Futurebus+ modules.

• The show_mop_counters command (Section 3.1.10) is used to read the MOP counters.

• The clear_mop_counters command (Section 3.1.11) is used to reset the MOP counters.

4. Loopback tests for testing console and Ethernet ports (Section 3.1.12)

In addition to the four utilities listed above, there are two diagnostic-related commands. The kill and kill_diags commands (Section 3.1.13) are used to terminate diagnostics.


3.1.1 test

The test command runs firmware diagnostics for the entire system, specified subsystems, or specific devices. These firmware diagnostics are run in the background. When the tests are successfully completed, the message ‘‘tests done’’ is displayed. If any of the tests fail, a failure message is displayed.

If you do not specify an argument with the test command, all tests except those for tape drives are performed.

Note

By default, no write tests are performed on disks, but both read and write tests are performed on tape drives. You need a scratch tape to test tape drives.

Early systems may not support RBD testing for tape drives.

All tests run concurrently for a minimum of 30 seconds. Tests complete when all component tests have completed at least one pass. Test passes are repeated for any component that completes its test before other components.

The run time of a test is proportional to the amount of memory to be tested and the number of disk and tape drives to be tested. Running test all on a system with fully configured 512-MB memory takes approximately 10 minutes to complete.

Synopsis:

test ([all] [cpu] [disk] [tape] [dssi] [scsi] [fbus] [memory] [ethernet] [device_list])

Arguments:

[all] Firmware diagnostics will test/exercise all the devices present in the system configuration: CPU, disk, tape, DSSI subsystem, SCSI subsystem, Futurebus+ subsystem, memory, Ethernet, and I/O devices.

[cpu] Firmware diagnostics will test backup cache and memory coherency.

[disk] Firmware diagnostics will perform read-only tests of all disk drives present in the system. One pass consists of seeking to a random block on the disk and reading a packet of 2048 bytes and repeating until 512 packets are read.

[tape] Firmware diagnostics will perform read and write tests of all the tape devices present in the system. Testing the tape drives requires that a scratch tape be loaded in the tape drive.

[dssi] Firmware diagnostics will test the DSSI subsystem, including read-only tests of all DSSI disks, and read-write tests for tape drives.

[scsi] Firmware diagnostics will test the SCSI subsystem, including read-only tests of all SCSI disks and read-write tests for SCSI tape drives.

[fbus] Firmware diagnostics will instruct all Futurebus+ modules to perform extended category default self-tests.

[memory] Firmware diagnostics will test memory modules present in the system.

[ethernet] Firmware diagnostics will test the Ethernet logic.

[device_list] Use the device_list argument to specify disk, tape, or Futurebus+ devices to be tested. As with all the RBDs, the test command uses the exer script to perform read-only tests on the specified disk devices, and read-write tests for tape drives. Legal devices are disk, tape, and Futurebus+ device names.

Examples:

>>> test
tests done
>>>

>>> test
*** Soft Error - Error #1 - Lower SCSI Continuity Card Missing

Diagnostic Name   ID        Device        Pass  Test  Hard/Soft  31-JUL-1992
io_test           0000032d  scsi_low_con     1     1     0    1  14:23:18

*** End of Error ***
>>>


3.1.2 show fru

The show fru command reports FRU and error information for the following FRUs based on the serial control bus EEPROM data:

• CPU modules

• Memory modules

• I/O modules

• Futurebus+ modules

For each of the above FRUs, the slot position, option, part, revision, and serial numbers, as well as any reported symptom-directed diagnostics (SDD) and test-directed diagnostics (TDD) event logs are displayed.

Synopsis:

show fru ([target [target . . . ]])

Arguments:

[target] CPU{0,1}, mem{0,1,2,3}, io, fbus, and fban.

Examples:

>>> show fru

(1)   (2)     (3)       (4)        (5)         (6)
                        Rev                    Events Logged
Slot  Option  Part#     Hw   Sw    Serial#     SDD  TDD
1     IO      B2101-AA  D3   2     AY21739158  00   00
2
3     CPU0    B2001-AA  D1   0     AY21328712  00   00
4
5
6
7     MEM3    B2002-BA  B1   0     GA21700025  00   00

Futurebus+ Nodes

      (7)                                      (8)
                        Rev
Slot  Option  Part#     Hw   Fw    Serial#     Description
1
2
3     fbc0    B2102-AA  B02  X1.53 ML22000053  Fbus+ Profile_B Exerciser
4
5
6

>>>

(1) Slot number for FRU (slots 1–7 right to left)

Slot 1: I/O module
Slot 2, 3: CPU modules
Slot 4–7: Memory modules

(2) Option name (I/O, CPU#, or MEM#)

(3) Part number of option

(4) Revision numbers (hardware and firmware)

(5) Serial number

(6) Events logged:

SDD: Number of symptom-directed diagnostic events logged by the operating system, or in the case of memory, by the operating system and firmware diagnostics.

TDD: Number of test-directed diagnostic events logged by the firmware diagnostics.

(7) Futurebus+ option name, fban, where:

fb indicates Futurebus+ option
a indicates corresponding Futurebus+ slot a–f (1–6)
n indicates the Futurebus+ node number, 0 or 1

(8) Description of Futurebus+ module
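The fban naming rule for Futurebus+ options can be checked with a few lines of code. This is an illustrative helper only; the function name is hypothetical and is not a console command.

```python
import re
from typing import Tuple


def parse_fbus_name(name: str) -> Tuple[int, int]:
    """Split a Futurebus+ option name such as "fbc0" into (slot, node).

    The slot letter a-f maps to Futurebus+ slots 1-6 and the trailing
    digit is the node number, 0 or 1.
    """
    match = re.fullmatch(r"fb([a-f])([01])", name.lower())
    if match is None:
        raise ValueError(f"not a Futurebus+ option name: {name!r}")
    slot = ord(match.group(1)) - ord("a") + 1
    node = int(match.group(2))
    return slot, node
```

Applied to the option in the example display, parse_fbus_name("fbc0") gives slot 3, node 0, which agrees with the slot column shown for that module.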


3.1.3 show_status

The show_status command reports one line of information per executing diagnostic. The information includes ID, diagnostic program, device under test, error counts, passes completed, bytes written and read.

Many of the diagnostics run in the background and provide information only if an error occurs. Use the show_status command to display the progress of diagnostics.

The following command string is useful for periodically displaying diagnostic status information for diagnostics running in the background:

>>> while true;show_status;sleep n;done

Where n is the number of seconds between show_status displays.

Synopsis:

show_status

Examples:

>>> show_status

(1)      (2)          (3)           (4)    (5)       (6)           (7)
ID       Program      Device        Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------- ------ --------- ------------- -------------
00000001 idle         system             0    0    0             0             0
000000ea memtest      memory             2    0    0      67108864      67108864
000000f1 exer_kid     dub0.0.0.1.0       1    0    0             0             0
000000f2 exer_kid     duc0.6.0.2.0       1    0    0             0             0
000000f3 exer_kid     dud0.7.0.3.0       1    0    0             0             0
000000f4 exer_kid     dka0.0.0.0.0       1    0    0             0             0
>>>

(1) Process ID

(2) Program module name

(3) Device under test

(4) Diagnostic pass count

(5) Error count (hard and soft): Soft errors are not usually fatal; hard errors halt the system or prevent completion of the diagnostics.

(6) Bytes successfully written by diagnostic

(7) Bytes successfully read by diagnostic
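If a show_status display has been captured from the console terminal (for example, in a terminal log), its rows can be picked apart off-line. The following sketch is not part of the console firmware; the function names are hypothetical, and it assumes the eight whitespace-separated fields shown in the example and that the process ID is hexadecimal.

```python
from typing import Dict, List

SAMPLE = """\
ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
00000001 idle system 0 0 0 0 0
000000ea memtest memory 2 0 0 67108864 67108864
000000f1 exer_kid dub0.0.0.1.0 1 0 0 0 0
"""


def parse_show_status(text: str) -> List[Dict[str, object]]:
    """Turn captured show_status rows into dictionaries, skipping other lines."""
    rows: List[Dict[str, object]] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # header, separator, prompt, or blank line
        try:
            pid = int(fields[0], 16)
            counts = [int(value) for value in fields[3:]]
        except ValueError:
            continue
        rows.append({
            "id": pid,
            "program": fields[1],
            "device": fields[2],
            "passes": counts[0],
            "hard": counts[1],
            "soft": counts[2],
            "bytes_written": counts[3],
            "bytes_read": counts[4],
        })
    return rows


def processes_with_errors(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return the rows whose hard or soft error count is nonzero."""
    return [row for row in rows if row["hard"] or row["soft"]]
```

The process ID recovered this way is the value to give to the kill command when stopping an individual diagnostic.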


3.1.4 show error

The show error command reports error information based on the serial control bus EEPROM data. Both the operating system and the ROM-based diagnostics log errors to the serial control bus EEPROMs. This functionality provides the ability to generate an error log from the console environment.

A closely related command, show fru (Section 3.1.2), reports FRU and error information for FRUs.

Synopsis:

show error ([target [target . . . ]])

Arguments:

[target] CPU{0,1}, mem{0,1,2,3}, and io.

Examples:

>>> show error mem3
Test Directed Errors

No Entries Found

Symptom Directed Entries

MEM3 Module EEROM Event Log

(1)    (2)     (3)    (4)       (5)         (6)
Entry  Offest  RAM #  Bit Mask  Multi-Chip  Event Type

0      383     9      0001      0           10
1      402     10     0001      1           10
2      402     11     0001      1           10
3      402     2      0001      1           10
4      402     3      0001      1           10
5      404     0      0001      1           10
6      404     1      0001      1           10
7      408     12     0001      0           10

Entry  Error Mask  Device #  Event Type

15     f01         71        0

>>>

(1) Event log entry number

(2) Offset address of fault in RAM

(3) RAM number: indicates the RAM location on the board

(4) Four-bit bit field value, indicates bit in DRAM

Using the offset, RAM number, and bit mask, you can determine the location of the specific cell in memory.

(5) Multi-chip (0=no, 1=yes): indicates that a group of entries are the result of a single error.

(6) Event type:

11: DRAM hard-failure
01: Correctable read data (CRD) error
10: Uncorrectable error
00: Other (non-DRAM error)
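The event type codes and the columns of a memory event log entry can be decoded mechanically. The sketch below is only an illustration for working with a captured display: the function name is hypothetical, the offset is kept as text because the display does not state its radix, and reading the bit mask as a binary string is an assumption.

```python
from typing import Dict

EVENT_TYPES = {
    "11": "DRAM hard-failure",
    "01": "Correctable read data (CRD) error",
    "10": "Uncorrectable error",
    "00": "Other (non-DRAM error)",
}


def decode_memory_entry(line: str) -> Dict[str, object]:
    """Decode one row of the memory module event log shown by show error."""
    entry, offset, ram, bit_mask, multi_chip, event_type = line.split()
    return {
        "entry": int(entry),
        "offset": offset,                 # radix not stated; left as text
        "ram": int(ram),
        "bit_mask": int(bit_mask, 2),     # assumed binary four-bit field
        "multi_chip": multi_chip == "1",
        "event": EVENT_TYPES[event_type],
    }
```

For the row "1 402 10 0001 1 10" from the example, the sketch reports RAM number 10, the multi-chip flag set, and an uncorrectable error.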


3.1.5 memexer

The memexer command tests memory by running a specified number of memory exercisers. The exercisers are run in the background and nothing is displayed unless an error occurs. Each exerciser tests all available memory in 2-MB blocks for each pass.

To terminate the memory tests, use the kill command to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when killing an individual diagnostic test.

Synopsis:

memexer [number]

Arguments:

[number] Number of memory exercisers to start. The default is 1.

The number of exercisers, as well as the length of time for testing, depends on the context of the testing. Generally, running 3–5 exercisers for 15 minutes to 1 hour is sufficient for troubleshooting most memory problems.

Examples:

>>> memexer 4
>>> show_status

ID       Program      Device        Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------- ------ --------- ------------- -------------
00000001 idle         system             0    0    0             0             0
000000c7 memtest      memory             3    0    0     635651584      62565154
000000cc memtest      memory             2    0    0     635651584      62565154
000000d0 memtest      memory             2    0    0     635651584      62565154
000000d1 memtest      memory             3    0    0     635651584      62565154
>>> kill_diags
>>>


3.1.6 memexer_mp

The memexer_mp command tests memory cache coherency in a multiprocessor system by running a specified number of memory exerciser sets. A set is a memory test that runs on each processor checking alternate longwords. The exercisers are run in the background and nothing is displayed unless an error occurs.

To terminate the memory tests, use the kill command to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when killing an individual diagnostic test.

Synopsis:

memexer_mp [number]

Arguments:

[number] Number of memory exerciser sets to start. The default is 1.

The number of exercisers, as well as the length of time for testing, depends on the context of the testing. Generally, running 2 or 3 exercisers for 5 minutes is sufficient.

Examples:

>>> memexer_mp 2
>>> kill_diags
>>>


3.1.7 exer_read

The exer_read command tests a disk by performing random reads of 2048 bytes on one or more devices. The exercisers are run in the background and nothing is displayed unless an error occurs.

The tests continue until one of the following conditions occurs:

1. All blocks on the device have been read for a passcount of d_passes (default is 1).

2. The exer_read process has been terminated via the kill or kill_diags commands, or Ctrl/C.

3. The specified time has elapsed.

To terminate the read tests, enter Ctrl/C, or use the kill command to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when killing an individual diagnostic test.

Synopsis:

exer_read [-sec seconds] [device_name device_name . . . ]

Arguments:

[device_name] One or more device names to be tested. The default is du*.* dk*.* to test all DSSI and SCSI disks that are on line.

Options:

[-sec seconds] Number of seconds to run exercisers. If you do not enter the number of seconds, the tests will run until d_passes have completed (d_passes default is 1).

If you want to test the entire disk, run at least one pass across the disk. If you do not need to test the entire disk, run the test for 5 or 10 minutes.

Examples:

>>> exer_read
failed to send command to pkc0.1.0.2.0
failed to send Read to dkc100.1.0.2.0

*** Hard Error - Error #5

Diagnostic Name   ID        Device        Pass  Test  Hard/Soft  31-JUL-1992
exer_kid          00000175  dkc100.1.0.2     0     0     1    0  14:54:18
Error in read of 0 bytes at location 014DD400 from device dkc100.1.0.2.0

*** End of Error ***
>>>


3.1.8 exer_write

The exer_write command tests a disk by performing random writes on one or more devices. The exercisers are run in the background and nothing is displayed unless an error occurs.

The exer_write tests cause the device to seek to a random block and read a 2048-byte packet of data, write that same data back to the same location on the device, read the data again, and compare it to the data originally read.

The tests continue until one of the following conditions occurs:

1. All blocks on the device have been read for a passcount of d_passes (default is 1).

2. The exer_write process has been terminated via the kill or kill_diags commands, or Ctrl/C.

3. The specified time has elapsed.

To terminate the write tests, enter Ctrl/C, or use the kill command to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when killing an individual diagnostic test.

Caution

Running the exer_write diagnostic may destroy data on the specified disk.

Synopsis:

exer_write [-sec seconds] [device_name device_name . . . ]

Arguments:

[device_name] One or more device names to be tested. The default is du*.* dk*.* to test all DSSI and SCSI disks that are on line.

Options:

[-sec seconds] Number of seconds to run exercisers. If you do not enter the number of seconds, the tests will run until d_passes have completed (d_passes default is 1).

If you want to test the entire disk, run at least one pass across the disk. If you do not need to test the entire disk, run the test for 5 or 10 minutes.
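The read, write-back, re-read, and compare cycle that exer_write performs on each random block can be illustrated against an ordinary seekable file object. This is a sketch of the technique only, not the firmware exerciser: the function name is hypothetical, an in-memory buffer stands in for the disk, and treating the disk as an array of 2048-byte blocks is an assumption.

```python
import io
import random
from typing import BinaryIO

PACKET_SIZE = 2048  # bytes read and rewritten per operation


def write_exercise_once(dev: BinaryIO, device_size: int, rng: random.Random) -> bool:
    """Run one read/write-back/re-read/compare cycle on a random block.

    Returns True when the data read back matches the data originally read.
    """
    block = rng.randrange(device_size // PACKET_SIZE)
    offset = block * PACKET_SIZE

    dev.seek(offset)
    original = dev.read(PACKET_SIZE)      # read the packet

    dev.seek(offset)
    dev.write(original)                   # write the same data back

    dev.seek(offset)
    reread = dev.read(PACKET_SIZE)        # read it again

    return reread == original             # compare with the first read
```

Because the data written is the data just read, a healthy device ends the cycle unchanged; the Caution above still applies, since an interrupted or failing write can leave the block damaged.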


Examples:

>>> exer_write dka0
EXECUTING THIS COMMAND WILL DESTROY DISK DATA
OR DATA ON THE SPECIFIED DEVICES
Do you really want to continue? [Y/(N)]:
failed to send command to pkc0.1.0.2.0
failed to send Read to dkc100.1.0.2.0

*** Hard Error - Error #5
Diagnostic Name  ID        Device      Pass  Test  Hard/Soft  31-JUL-1992
exer_kid         0000012e  dka0.0.0.0     0     0     1    0  15:21:22
Error in read of 0 bytes at location 017B3400 from device dka0.0.0.0.0

*** End of Error ***
failed to send command to pka0.0.0.0.0
failed to send Read to dka0.0.0.0.0
>>>


3.1.9 fbus_diag

The fbus_diag command is used to start execution of a diagnostic test script onboard a specific Futurebus+ device. The fbus_diag command uses the Futurebus+ standard test CSR interface to initiate commands on specific Futurebus+ devices, waits for tests to complete, and then reports the results to the console. If an error is reported by the Futurebus+ node, the diagnostic issues a dump buffer command to gain any available extended information that will also be reported to the console.

Refer to documentation for the specific Futurebus+ option for the recommended test procedures and form of the fbus_diag command to initiate module-resident diagnostics. For more information, consult the Futurebus+ Handbook.

Test categories that require a buffer pointer in the argument CSR will have a default buffer provided by this diagnostic if the user does not specify a buffer address.

Process options and command line arguments are used to specify the specific test or test script to be executed as well as the target Futurebus+ node for this command.

Synopsis:

fbus_diag [-rb] [-p pass_count] [-st test_number] [-cat test_group] [-opt test_option] node [test_arg]

Arguments:

node Specifies the device name of the Futurebus+ device to execute the test. Use the show device fb command to display the Futurebus+ device names.

[test_arg] Specifies an argument to be passed to the Futurebus+ node in the test argument CSR. If this parameter is not specified and the category is either extended or system, the routine allocates a buffer and passes the buffer address through the test argument CSR.

Options:

[-rb] Randomly allocates from memzone on each pass with a block size of 4096.

[-p] (pass_count) Specifies the number of times to run the test. If 0, the test runs continuously. This overrides the value of the pass_count environment variable. In the absence of this option, pass_count is used. The default for pass_count is 1.

[-st] (test_number) Specifies the test number to be run. The default is 0, which runs the default tests in the category.

[-cat] (test_group) Specifies the test category to be executed. The possible categories are as follows:

• Init: Initialization tests

• Extended: Extended tests (default category)

• System: System tests

• Manual: Manual tests

• x: Bit mask of the desired test categories

[-opt] (test_option) Specifies the Test Start CSR Option field bits to be set. The possible option bits are as follows:

• Loop_error: Loop on test if an error is detected

• Loop_test: Loop on this test

• Cont_error: Continue if an error is detected

• x: Bit mask of the desired option bits

The default value for this qualifier is based on the current values in the global environment variables as follows:

• Loop_test: 1 if D_PASSES == 0 ; 0 otherwise

• Loop_error: 1 if D_HARDERR == "Loop" ; 0 otherwise

• Cont_error: 1 if D_HARDERR == "Continue" ; 0 otherwise
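The default rules for the option bits can be written out as a small Python sketch. The function name default_test_options is hypothetical, and the result is shown as named flags rather than a bit mask because the bit positions in the Test Start CSR are not given here.

```python
def default_test_options(d_passes: int, d_harderr: str) -> dict:
    """Derive the default -opt flags from D_PASSES and D_HARDERR.

    Mirrors the three rules listed above; bit positions within the
    Test Start CSR Option field are deliberately not modeled.
    """
    return {
        "Loop_test": 1 if d_passes == 0 else 0,
        "Loop_error": 1 if d_harderr == "Loop" else 0,
        "Cont_error": 1 if d_harderr == "Continue" else 0,
    }


if __name__ == "__main__":
    # D_PASSES of 0 means "run continuously", so Loop_test is set.
    print(default_test_options(0, "Continue"))
    print(default_test_options(1, "Loop"))
```

Loop_error and Cont_error can never both default to 1, because each depends on a different value of the same variable.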


3.1.10 show_mop_counter

The show_mop_counter command displays the MOP counters for the specified Ethernet port.

Synopsis:

show_mop_counter [port_name]

Arguments:

[port_name] Specifies the Ethernet port for which to display MOP counters: eza0 for Ethernet port 0; ezb0 for Ethernet port 1.

Examples:

>>> show_mop_counter eza0

eza0 MOP Counters

DEVICE SPECIFIC:
TI: 211 RI: 34834 RU: 1 ME: 0 TW: 0 RW: 0 BO: 0 HF: 0 UF: 0 TN: 0 LE: 0 TO: 0 RWT: 33535 RHF: 33536 TC: 56

PORT INFO:
tx full: 0 tx index in: 2 tx index out: 2 rx index in: 3

MOP BLOCK:
Network list size: 0

MOP COUNTERS:
Time since zeroed (Secs): 4588

TX:
Bytes: 117068 Frames: 210 Deferred: 1 One collision: 32 Multi collisions: 15

TX Failures:
Excessive collisions: 0 Carrier check: 0 Short circuit: 0 Open circuit: 0 Long frame: 0 Remote defer: 0 Collision detect: 0

RX:
Bytes: 116564 Frames: 194 Multicast bytes: 16730668 Multicast frames: 36953

RX Failures:
Block check: 0 Framing error: 0 Long frame: 0 Unknown destination: 36953 Data overrun: 0 No system buffer: 18 No user buffers: 0

>>>
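When counter output is captured to a file for later comparison, the "name: value" pairs on a counter line can be pulled apart with a short script. The following Python sketch is not part of the console; parse_counters is a hypothetical helper, and it assumes counter names consist only of letters and spaces, so a line such as "Time since zeroed (Secs)" is not matched.

```python
import re

# A counter name (letters and spaces) followed by a colon and an integer.
COUNTER = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*(\d+)")


def parse_counters(line: str) -> dict:
    """Return the counters found on one output line as {name: value}."""
    return {name: int(value) for name, value in COUNTER.findall(line)}


if __name__ == "__main__":
    tx_line = ("Bytes: 117068 Frames: 210 Deferred: 1 "
               "One collision: 32 Multi collisions: 15")
    tx = parse_counters(tx_line)
    collided = tx["One collision"] + tx["Multi collisions"]
    print(tx)
    print("frames that saw at least one collision:", collided)
```

Comparing two captures taken after clearing the counters makes it easier to see which failure counters are still increasing.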


3.1.11 clear_mop_counter

The clear_mop_counter command initializes the MOP counters for the specified Ethernet port.

Synopsis:

clear_mop_counter [port_name]

Arguments:

[port_name] Specifies the Ethernet port for which to initialize MOP counters: eza0 for Ethernet port 0; ezb0 for Ethernet port 1.

Examples:

>>> clear_mop_counter eza0
>>>


3.1.12 Loopback Tests

Internal and external loopback tests can be used to isolate a failure by testing segments of a particular control or data path. The loopback tests are a subset of the RBDs.

3.1.12.1 Testing the Auxiliary Console Port (exer)

Using a loopback connector (29–24795–00) and a form of the exer command, you can test the auxiliary serial port. Before running the loopback test, you must set the tt_allow_login environment variable to 1; after the test is completed, you must set tt_allow_login to 0.

Use the following commands to send a fixed data pattern through the auxiliary serial port:

>>> set tt_allow_login 1
>>> exer -bs 1 -a "wRc" -p 0 tta1 &
>>> kill_diags
>>> set tt_allow_login 0
>>>

In the above command, the portion in quotes (the write, read, and compare instruction) is case sensitive. The background operator &, at the end of the command, causes the loopback tests to run in the background. Nothing is displayed unless an error occurs.

To terminate the console loopback test, use the kill command to terminate the individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when killing an individual diagnostic test.

3.1.12.2 Testing the Ethernet Ports (netexer)

The netexer command performs an Ethernet port-to-port MOP loopback test between eza0 and ezb0. The network ports must be connected and terminated.

The loopback tests are run in the background. Nothing is displayed unless an error occurs.

To terminate the loopback test, use the kill command to terminate the individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when killing an individual diagnostic test.


3.1.13 kill and kill_diags

The kill and kill_diags commands terminate diagnostics that are currently executing.

• The kill command terminates a specified process.

• The kill_diags command terminates all diagnostics.

Synopsis:

kill_diags

kill [PID . . . ]

Arguments:

[PID . . . ] The process ID of the diagnostic to terminate. Use the show_status command to determine the process ID.

3.1.14 Summary of Diagnostic and Related Commands

Table 3–1 provides a summary of the diagnostic and related commands.

Table 3–1 Summary of Diagnostic and Related Commands

Command            Function                                             Reference

Acceptance Testing

test               Test the entire system, subsystem, or specific       Section 3.1.1
                   device.

Error Reporting and Diagnostic Status

show fru           Reports system bus and Futurebus+ FRUs, module       Section 3.1.2
                   identification numbers, and summary error
                   information.

show_status        Reports the status of currently executing            Section 3.1.3
                   test/exercisers.

show error         Reports some errors captured by diagnostics and      Section 3.1.4
                   operating system.

Extended Testing/Troubleshooting

memexer            Exercises memory by running a specified number of    Section 3.1.5
                   memory tests. The tests are run in the background.

memexer_mp         Tests memory in a multiprocessor system by running   Section 3.1.6
                   a specified number of memory exerciser sets. The
                   tests are run in the background.

exer_read          Tests a disk by performing random reads on the       Section 3.1.7
                   specified device.

exer_write         Tests a disk by performing random writes to the      Section 3.1.8
                   specified device.

fbus_diag          Initiates onboard tests for a specified Futurebus+   Section 3.1.9
                   device.

show_mop_counter   Displays the MOP counters for the specified          Section 3.1.10
                   Ethernet port.

clear_mop_counter  Initializes the MOP counters for the specified       Section 3.1.11
                   Ethernet port.

Loopback Testing

exer               Conducts loopback tests for the specified console    Section 3.1.12.1
                   port.

netexer            Conducts loopback tests for the Ethernet ports.      Section 3.1.12.2

Diagnostic-Related Commands

kill               Terminates a specified process.                      Section 3.1.13

kill_diags         Terminates all currently executing diagnostics.      Section 3.1.13

3.2 DSSI Device Internal Tests

A DSSI storage device may fail either during initial power-up or during normal operation. In both cases, the failure is indicated by the lighting of the red Fault LED on the drive’s front panel.

If the drive is unable to execute the Power-On Self-Test (POST) successfully, the red Fault LED remains on and the Run/Ready LED does not come on, or both LEDs remain on.


POST is also used to handle two types of error conditions in the drive:

• Controller errors are caused by the hardware associated with the controller function of the drive module. A controller error is fatal to the operation of the drive, since the controller cannot establish a logical connection to the host. The red Fault LED comes on. If this occurs, replace the drive module.

• Drive errors are caused by the hardware associated with the drive control function of the drive module. These errors are not fatal to the drive, since the drive can establish a logical connection and report the error to the host. Both LEDs go out for about 1 second, then the red Fault LED comes on. In this case, run either DRVTST, DRVEXR, or PARAMS via the set host -dup command, as described in the drive’s service documentation, to determine the error code.

Three conﬁguration errors are often the cause of drive errors:

• More than one node with the same bus node ID number

• Identical node names

• Identical MSCP unit numbers

The first error cannot be detected by software. Use the show device command (Section 6.2) to display the second and third types of errors. This command displays each device along with such information as bus node ID, unit number, and node name.

If the device is connected to the front panel of the storage compartment, you must install a bus node ID plug in the corresponding socket on the front panel. If the device is not connected to the front panel, it reads the bus node ID from the three-switch DIP switch on the side of the drive.

DSSI storage devices contain the following local programs:

DIRECT   A directory, in DUP-specified format, of available local programs

DRVTST   A comprehensive drive functionality verification test

DRVEXR   A utility that exercises the device

HISTRY   A utility that saves information retained by the drive, including the internal error log

ERASE    A utility that erases all user data from the disk

VERIFY   A utility that is used to determine the amount of "margin" remaining in on-disk structures

DKUTIL   A utility that displays disk structures and disk data

PARAMS   A utility that allows you to look at or change drive status, history, parameters, and the internal error log

Use the set host -dup command to access the local programs listed above. Example 3–1 provides an abbreviated example of running DRVTST for a device (Bus node 2 on Bus 0).

Caution

When running internal drive tests, always use the default (0 = No) in responding to the ‘‘Write/read anywhere on medium?’’ prompt. Answering Yes could destroy data.

Example 3–1 Running DRVTST

>>> set host -dup -task drvtst dub0

5 minutes to complete.
GAMMA::MSCP$DUP 17-MAY-1992 12:51:20 DRVTST CPU= 0 00:00:09.29 PI=160
GAMMA::MSCP$DUP 17-MAY-1992 12:51:40 DRVTST CPU= 0 00:00:18.75 PI=332
GAMMA::MSCP$DUP 17-MAY-1992 12:52:00 DRVTST CPU= 0 00:00:28.40 PI=503
.
.
.
GAMMA::MSCP$DUP 17-MAY-1992 12:55:42 DRVTST CPU= 0 00:02:13.41 PI=2388
Test passed.

Stopping DUP server...
>>>

Example 3–2 provides an abbreviated example of running DRVEXR for an RF-series disk (Bus node 2 on Bus 0).


Example 3–2 Running DRVEXR

>>> set host -dup -task drvexr dub0
Starting DUP server...
Copyright (C) 1992 Digital Equipment Corporation
Write/read anywhere on medium? [1=Yes/(0=No)]
Test time in minutes? [(10)-100]
Number of sectors to transfer at a time? [0 - 50]
Compare after each transfer? [1=Yes/(0=No)]:
Test the DBN area? [2=DBN only/(1=DBN and LBN)/0=LBN only]:

10 minutes to complete.
GAMMA::MSCP$DUP 17-MAY-1992 13:02:40 DRVEXR CPU= 0 00:00:25.37 PI=1168
GAMMA::MSCP$DUP 17-MAY-1992 13:03:00 DRVEXR CPU= 0 00:00:29.53 PI=2503
GAMMA::MSCP$DUP 17-MAY-1992 13:03:20 DRVEXR CPU= 0 00:00:33.89 PI=3835
.
.
.
GAMMA::MSCP$DUP 17-MAY-1992 13:12:24 DRVEXR CPU= 0 00:02:24.19 PI=40028

13332 operations completed.
33240 LBN blocks (512 bytes) read.
0 LBN blocks (512 bytes) written.
33420 DBN blocks (512 bytes) read.
0 DBN blocks (512 bytes) written.
0 bytes in error (soft).
0 uncorrectable ECC errors.

Complete.
Stopping DUP server...
>>>

Refer to the RF-Series Integrated Storage Element Service Guide for instructions on running these programs.

3.3 DEC VET

Digital’s DEC Veriﬁer and Exerciser Tool (DEC VET) software is a multipurpose system maintenance tool that performs exerciser-oriented maintenance testing. DEC VET runs on both OpenVMS AXP and DEC OSF/1 operating systems. DEC VET consists of a manager and exercisers that test devices. The DEC VET manager controls these exercisers.

DEC VET exercisers test system hardware and the operating system. DEC VET supports various exerciser configurations, ranging from a single device exerciser to full system loading, that is, simultaneous exercising of multiple devices.

Refer to the DEC Veriﬁer and Exerciser Tool User’s Guide (AA–PTTMA–TE) for instructions on running DEC VET.


3.4 Running UETP

The User Environment Test Package (UETP) tool is an OpenVMS AXP software package designed to test whether the OpenVMS AXP operating system is installed correctly. UETP software puts the system through a series of tests that simulate a typical user environment, by making demands on the system that are similar to demands that might occur in everyday use.

Run UETP after system installation, once OpenVMS AXP is running, or when you need to run stress tests to pinpoint intermittent errors.

UETP is not a diagnostic program; it does not attempt to test every feature exhaustively. When UETP runs to completion without encountering unrecoverable errors, the system being tested is ready for use.

UETP exercises devices and functions that are common to all VMS and OpenVMS AXP systems, with the exception of optional features, such as high-level language compilers. The system components tested include the following:

• Most standard peripheral devices

• The system’s multiuser capability

• DECnet for OpenVMS AXP software

3.4.1 Summary of UETP Operating Instructions

This section summarizes the procedure for running all phases of UETP with default values.

1. Log in to the SYSTEST account as follows:

Username: SYSTEST
Password:

Caution

Because the SYSTEST and SYSTEST_CLIG accounts have privileges, unauthorized use of these accounts might compromise the security of your system.

2. Make sure no user programs are running and no user volumes are mounted.

Caution

By design, UETP assumes and requests the exclusive use of system resources. If you ignore this restriction, UETP may interfere with applications that depend on these resources.

3. After you log in, check all devices to be sure that the following conditions exist:

• All devices you want to test are powered up and are on line to the system.

• Scratch disks are mounted and initialized.

• Disks contain a directory named [SYSTEST] with OWNER_UIC=[1,7]. (You can create this directory with the DCL command CREATE/DIRECTORY.)

• Scratch magnetic tape reels are physically mounted on each drive you want tested and are initialized with the label UETP (using the DCL command INITIALIZE). Make sure magnetic tape reels contain at least 600 feet of tape.

• Scratch tape cartridges have been inserted in each drive you want to test and are initialized with the label UETP.

• Line printers and hardcopy terminals have plenty of paper.

• Terminal characteristics and baud rate are set correctly (see the user’s guide for your terminal).

4. To start UETP, enter the following command and press Return:

$ @UETP

UETP responds with the following question:

Run "ALL" UETP phases or a "SUBSET" [ALL]?

Press Return to choose the default response enclosed in brackets. UETP responds with three more questions in the following sequence:

How many passes of UETP do you wish to run [1]?
How many simulated user loads do you want [n]?
Do you want Long or Short report format [Long]?

Use the default values when acceptance testing with UETP. For stress testing, enter your own values.


Press Return after each prompt. After you answer the last question, UETP initiates its entire sequence of tests, which run to completion without further input. The ﬁnal message should look like the following:

*****************************************************
*                                                   *
    END OF UETP PASS 1 AT 20-JUL-1992 16:30:09.38
*                                                   *
*****************************************************

5. After UETP runs, check the log ﬁles for errors. If testing completes successfully, the OpenVMS AXP operating system is working properly.

Note

After a run of UETP, you should run the Error Log Utility to check for hardware problems that can occur during a run of UETP. For information on running the Error Log Utility, refer to the VMS Error Log Utility Manual.

If UETP does not complete successfully, refer to Section 3.4.11.

3.4.2 System Disk Requirements

Before running UETP, be sure that the system disk has at least 1200 blocks available. Systems running more than 20 load test processes may require a minimum of 2000 available blocks. If you run multiple passes of UETP, log ﬁles will accumulate in the default directory and further reduce the amount of disk space available for subsequent passes.

If disk quotas are enabled on the system disk, you should disable them before you run UETP.
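The free-space guidance above reduces to a simple threshold check. The Python sketch below is illustrative only: the function names are hypothetical, and treating 2000 blocks as a firm minimum for more than 20 load test processes is a simplification of "may require".

```python
def minimum_free_blocks(load_processes: int) -> int:
    """Free blocks suggested on the system disk before running UETP."""
    # 1200 blocks in general; 2000 when more than 20 load test processes run.
    return 2000 if load_processes > 20 else 1200


def enough_space(free_blocks: int, load_processes: int) -> bool:
    """True if the system disk meets the suggested minimum."""
    return free_blocks >= minimum_free_blocks(load_processes)


if __name__ == "__main__":
    for free, loads in [(1500, 10), (1500, 25), (2500, 25)]:
        verdict = "ok" if enough_space(free, loads) else "free up space"
        print(f"{free} blocks free, {loads} loads: {verdict}")
```

Because log files accumulate across multiple passes, it is worth repeating the check between long runs.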

3.4.3 Preparing Additional Disks

To prepare each disk drive in the system for UETP testing, use the following procedure:

1. Place a scratch disk in the drive and spin up the drive. If a scratch disk is not available, use any disk with a substantial amount of free space. UETP does not overwrite existing ﬁles on any volume. If your scratch disk contains ﬁles that you want to keep, do not initialize the disk; go to step 3.

2. If the disk does not contain ﬁles you want to save, initialize it. For example:

$ INITIALIZE DUA1: TEST1


This command initializes DUA1, and assigns the volume label TEST1 to the disk. All volumes must have unique labels.

3. Mount the disk. For example:

$ MOUNT/SYSTEM DUA1: TEST1

This command mounts the volume labeled TEST1 on DUA1. The /SYSTEM qualiﬁer indicates that you are making the volume available to all users on the system.

4. UETP uses the [SYSTEST] directory when testing the disk. If the volume does not contain the directory [SYSTEST], you must create it. For example:

$ CREATE/DIRECTORY/OWNER_UIC=[1,7] DUA1:[SYSTEST]

This command creates a [SYSTEST] directory on DUA1 and assigns a user identiﬁcation code (UIC) of [1,7]. The directory must have a UIC of [1,7] to run UETP.

If the disk you have mounted contains a root directory structure, you can create the [SYSTEST] directory in the [SYS0.] tree.

3.4.4 Preparing Magnetic Tape Drives

Set up magnetic tape drives that you want to test by doing the following:

1. Place a scratch magnetic tape with at least 600 feet of magnetic tape in the tape drive. Make sure that the write-enable ring is in place.

2. Position the magnetic tape at the beginning-of-tape (BOT) and put the drive on line.

3. Initialize each scratch magnetic tape with the label UETP. For example, if you have physically mounted a scratch magnetic tape on MTA1, enter the following command and press Return:

$ INITIALIZE MTA1: UETP

Magnetic tapes must be labeled UETP to be tested. As a safety feature, UETP does not test tapes that have been mounted with the MOUNT command.

3.4.5 Preparing Tape Cartridge Drives

Set up tape cartridge drives that you want to test by doing the following:

1. Insert a scratch tape cartridge in the tape cartridge drive.

2. Initialize the tape cartridge. For example:

$ INITIALIZE MKE0: UETP


Tape cartridges must be labeled UETP to be tested. As a safety feature, UETP does not test tape cartridges that have been mounted with the MOUNT command.

3.4.5.1 TLZ06 Tape Drives

During the initialization phase, UETP sets a time limit of 6 minutes for a TLZ06 unit to complete the UETTAPE00 test. If the device does not complete the UETTAPE00 test within the allotted time, UETP displays a message similar to the following:

-UETP-E-TEXT, UETTAPE00.EXE testing controller MKA was stopped ($DELPRC) at 16:23:23.07 because the time out period (UETP$INIT_TIMEOUT) expired or because it seemed hung or because UETINIT01 was aborted.

To increase the timeout value, type a command similar to the following before running UETP:

$ DEFINE/GROUP UETP$INIT_TIMEOUT "0000 00:08:00.00"

This example defines the initialization timeout value as 8 minutes.
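The value given to UETP$INIT_TIMEOUT in the example, "0000 00:08:00.00", has the form days, then hours:minutes:seconds.hundredths. As a convenience, the following Python sketch builds such a string from a number of minutes; delta_time is a hypothetical helper, and the field layout is inferred from the single 8-minute example rather than from a format specification.

```python
def delta_time(minutes: int) -> str:
    """Format whole minutes as 'dddd hh:mm:ss.cc' (layout inferred from the example)."""
    if minutes < 0:
        raise ValueError("timeout must not be negative")
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    return f"{days:04d} {hours:02d}:{mins:02d}:00.00"


if __name__ == "__main__":
    # Reproduces the command shown for an 8-minute initialization timeout.
    print(f'$ DEFINE/GROUP UETP$INIT_TIMEOUT "{delta_time(8)}"')
```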

3.4.6 Preparing RRD42 Compact Disc Drives

To run UETP on an RRD42 compact disc drive, you must ﬁrst load the test disc that you received with your compact disc drive unit.

3.4.7 Preparing Terminals and Line Printers

Terminals and line printers must be turned on to be tested by UETP. They must also be on line. Check that line printers and hardcopy terminals have enough paper. The amount of paper required depends on the number of UETP passes that you plan to execute. Each pass requires two pages for each line printer and hardcopy terminal.

Check that all terminals are set to the correct baud rate and are assigned appropriate characteristics (see the user’s guide for your terminal).

Spooled devices and devices allocated to queues fail the initialization phase of UETP and are not tested.

3.4.8 Preparing Ethernet Adapters

Make sure that no other processes are sharing the Ethernet adapter device when you run UETP.


Note

UETP will not test your Ethernet adapter if DECnet for OpenVMS AXP or another application has the device allocated.

Because either DECnet for OpenVMS AXP or the LAT terminal server might also try to use the Ethernet adapter (a shareable device), you must shut down DECnet for OpenVMS AXP and the LAT terminal server before you run the device test phase, if you want to test the Ethernet adapter.

3.4.9 DECnet for OpenVMS AXP Phase

The DECnet for OpenVMS AXP phase of UETP uses more system resources than other tests. You can, however, minimize disruptions to other users by running the test on the ‘‘least busy’’ node.

By default, the ﬁle UETDNET00.COM speciﬁes the node from which the DECnet for OpenVMS AXP test will be run. To run the DECnet for OpenVMS AXP test on a different node, enter the following command before you invoke UETP:

$ DEFINE/GROUP UETP$NODE_ADDRESS node_address

This command equates the group logical name UETP$NODE_ADDRESS to the node address of the node in your area on which you want to run the DECnet for OpenVMS AXP phase of UETP.

For example:

$ DEFINE/GROUP UETP$NODE_ADDRESS 9.999

Note

When you use the logical name UETP$NODE_ADDRESS, UETP tests only the first active circuit found by NCP. Otherwise, UETP tests all active testable circuits.

When you run UETP, a router node attempts to establish a connection between your node and the node defined by UETP$NODE_ADDRESS. Occasionally, the connection between your node and the router node might be busy or nonexistent. When this happens, the system displays the following error messages:

%NCP-F-CONNEC, Unable to connect to listener
-SYSTEM-F-REMRSRC, resources at the remote node were insufficient

%NCP-F-CONNEC, Unable to connect to listener
-SYSTEM-F-NOSUCHNODE, remote node is unknown

3.4.10 Termination of UETP

At the end of a UETP pass, the master command procedure UETP.COM displays the time at which the pass ended. In addition, UETP.COM determines whether UETP needs to be restarted.

At the end of an entire UETP run, UETP.COM deletes temporary ﬁles and does other cleanup activities.

Pressing Ctrl/Y or Ctrl/C lets you terminate a UETP run before it completes normally. Normal completion of a UETP run, however, includes the deletion of miscellaneous ﬁles that have been created by UETP for the purpose of testing. The use of Ctrl/Y or Ctrl/C might interrupt or prevent these cleanup procedures.

3.4.11 Interpreting UETP VMS Failures

When UETP encounters an error, it reacts like a user program. It either returns an error message and continues, or it reports a fatal error and terminates the image or phase. In either case, UETP assumes the hardware is operating properly and it does not attempt to diagnose the error.

If the cause of an error is not readily apparent, use the following methods to diagnose the error:

• VMS Error Log Utility—Run the Error Log Utility to obtain a detailed report of hardware and system errors. Error log reports provide information about the state of the hardware device and I/O request at the time of each error. For information about running the Error Log Utility, refer to the VMS Error Log Utility Manual and Chapter 4 of this manual.

• Diagnostic facilities—Use the diagnostic facilities to test exhaustively a device or medium to isolate the source of the error.

3.4.12 Interpreting UETP Output

You can monitor the progress of UETP tests at the terminal from which they were started. This terminal always displays status information, such as messages that announce the beginning and end of each phase and messages that signal an error.

The tests send other types of output to various log ﬁles, depending on how you started the tests. The log ﬁles contain output generated by the test procedures. Even if UETP completes successfully, with no errors displayed at the terminal, it is good practice to check these log ﬁles for errors. Furthermore, when errors are displayed at the terminal, check the log ﬁles for more information about their origin and nature.


3.4.12.1 UETP Log Files

UETP stores all information generated by all UETP tests and phases from its current run in one or more UETP.LOG ﬁles, and it stores the information from the previous run in one or more OLDUETP.LOG ﬁles. If a run of UETP involves multiple passes, there will be one UETP.LOG or one OLDUETP.LOG ﬁle for each pass.

At the beginning of a run, UETP deletes all OLDUETP.LOG ﬁles, and renames existing UETP.LOG ﬁles to OLDUETP.LOG. Then UETP creates a new UETP.LOG ﬁle and stores the information from the current pass in the new ﬁle. Subsequent passes of UETP create higher versions of UETP.LOG. Thus, at the end of a run of UETP that involves multiple passes, there is one UETP.LOG ﬁle for each pass. In producing the ﬁles UETP.LOG and OLDUETP.LOG, UETP provides the output from the two most recent runs.

If the run involves multiple passes, UETP.LOG contains information from all the passes. However, only information from the latest run is stored in this ﬁle. Information from the previous run is stored in a ﬁle named OLDUETP.LOG. Using these two ﬁles, UETP provides the output from its tests and phases from the two most recent runs.

The cluster test creates a NETSERVER.LOG ﬁle in SYS$TEST for each pass on each system included in the run. If the test is unable to report errors (for example, if the connection to another node is lost), the NETSERVER.LOG ﬁle on that node contains the result of the test run on that node. UETP does not purge or delete NETSERVER.LOG ﬁles; therefore, you must delete them occasionally to recover disk space.

If a UETP run does not complete normally, SYS$TEST might contain other log ﬁles. Ordinarily these log ﬁles are concatenated and placed within UETP.LOG. You can use any log ﬁles that appear on the system disk for error checking, but you must delete these log ﬁles before you run any new tests. You may delete these log ﬁles yourself or rerun the entire UETP, which checks for old UETP.LOG ﬁles and deletes them.

3.4.12.2 Possible UETP Errors

This section is intended to help you identify problems you might encounter running UETP.

The following are the most common failures encountered while running UETP:

• Wrong quotas, privileges, or account

• UETINIT01 failure

• Ethernet device allocated or in use by another application


• Insufﬁcient disk space

• Incorrect cluster setup

• Problems during the load test

• DECnet for OpenVMS AXP error

• Lack of default access for the FAL object

• Errors logged but not displayed

• No PCB or swap slots

• Hangs

• Bugchecks and machine checks

For more information refer to the VAX 3520, 3540 VMS Installation and Operations (ZKS166) manual.

3.5 Acceptance Testing and Initialization

Perform the acceptance testing procedure listed below after installing a system, or whenever adding or replacing the following:

• CPU modules

• Memory modules

• I/O module

• Backplane

• Storage devices

• Futurebus+ options

1. Run the RBD acceptance tests using the test command.

2. Bring up the operating system.

3. Run DEC VET or UETP to test that the operating system is correctly installed. Refer to Section 3.3 for information on DEC VET. Refer to Section 3.4 for instructions on running UETP.

Error Log Analysis

This chapter provides information on how to interpret error logs reported by the operating system.

• Section 4.1 describes machine check/interrupts and how these errors are detected and reported.

• Section 4.2 describes the entry format used by the ERF/UERF error formatters.

• Section 4.3 describes how to translate the error log information using the OpenVMS AXP and DEC OSF/1 error formatters.

• Section 4.4 describes how to interpret the system error log to isolate the failing FRU.

4.1 Fault Detection and Reporting

Table 4–1 provides a summary of the fault detection and correction components of DEC 4000 AXP systems.

Generally, PALcode handles exceptions as follows:

• The PALcode determines the cause of the exception.

• If possible, it corrects the problem and passes control to the operating system for reporting before returning the system to normal operation.

• If a problem is not correctable, or if error/event logging is required, control is passed through the system control block (SCB) to the appropriate exception handler.


Table 4–1 DEC 4000 AXP Fault Detection and Correction

Component | Fault Detection/Correction Capability

KN430 Processor Module

DECchip 21064 microprocessor | Error Detection and Correction (EDC) logic. For all data entering the 21064 microprocessor, single bits are checked and corrected; for all data exiting the 21064 microprocessor, the appropriate check bits are generated. A single-bit error on any of the four longwords being read can be corrected (per cycle).

Backup cache (B-cache) | EDC check bits on the data store; and parity on the tag store and control store.

MS430 Memory Modules

Memory module | EDC logic protects data by detecting and correcting up to 2 bits per DRAM chip per gate array. The four bits of data per DRAM are spread across two gate arrays (one for even longwords, the other for odd longwords).

KFA40 I/O Module

I/O module | DSSI/SCSI buses: Data parity is checked and generated. Lbus data transfers to Ethernet and SCSI/DSSI controllers: Data parity is checked and generated. Futurebus+ data transfers: Parity is checked and passed on.

System Bus

System bus | Longword parity on command, address, and data.
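Several entries in Table 4–1 rely on parity checking. The Python sketch below illustrates the general idea of generating and checking one parity bit per 32-bit longword. The parity sense (odd or even) used by the hardware is not stated here, so even parity is assumed purely for illustration, and the function names are hypothetical.

```python
def parity_bit(longword: int) -> int:
    """Return the even-parity bit for a 32-bit longword (assumed polarity)."""
    if not 0 <= longword <= 0xFFFFFFFF:
        raise ValueError("value does not fit in a 32-bit longword")
    # The bit is 1 when the longword holds an odd number of 1 bits, so that
    # the longword plus its parity bit always holds an even number of 1 bits.
    return bin(longword).count("1") % 2


def parity_ok(longword: int, stored_parity: int) -> bool:
    """Check a received longword against the parity bit sent with it."""
    return parity_bit(longword) == stored_parity
```

A single flipped bit changes the computed parity and is detected. Two flipped bits cancel out and pass the check, which is why parity can only detect errors, whereas EDC logic can also correct them.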

4.1.1 Machine Check/Interrupts

The exceptions that result from hardware system errors are called machine check/interrupts. They occur when a system error is detected during the processing of a data request. There are three types of machine check/interrupts related to system events:

1. Processor machine check

2. System machine check

3. Processor corrected machine check


The causes for each of the machine check/interrupts are as follows. The system control block (SCB) vector through which PALcode transfers control to the operating system is shown in parentheses.

Processor Machine Check (SCB: 670)

Processor machine check errors are fatal system errors and immediately crash the system.

• The DECchip 21064 microprocessor detected one or more of the following uncorrectable data errors:

– Uncorrectable B-cache data error

– Uncorrectable memory data error (CU_ERR asserted)

– Uncorrectable data from other CPU’s B-cache (CU_ERR asserted)

• A B-cache tag or tag control parity error occurred

• Hard error status was asserted in response to:

– A read data parity error

– System bus timeouts (NOACK error bit asserted): The bus responder detected a write data or command address error and did not acknowledge the bus cycle.

System Machine Check (SCB: 660)

A system machine check is a system-detected error that is external to the DECchip 21064 microprocessor and possibly not related to the activities of the microprocessor. It occurs when C_ERROR is asserted on the system bus.

Fatal errors:

• The I/O module detected a system bus error while serving as system bus commander:

– System bus timeouts (NOACK error bit asserted): The bus responder detected a write data or command address error and did not acknowledge the bus cycle

– Uncorrectable data (CU_ERR asserted) from responder

• Any system bus device detected a command/address parity error

• A bus responder detected a write data parity error

• Memory or I/O system bus gate array detected an internal error (SYNC error)


Nonfatal errors:

• A memory module correctable error occurred

• Correctable B-cache errors were detected while the B-cache was providing data to the system bus (errors from other CPU)

• Duplicate tag store parity errors occurred

Processor Corrected Machine Check (SCB: 630)

Processor corrected machine checks are caused by B-cache errors that are detected and corrected by the DECchip 21064 microprocessor. These errors are nonfatal and result in an error log entry.
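The three machine check/interrupt types and their SCB vectors can be summarized as a small lookup table. The Python sketch below is an illustrative aid for reading error logs, not part of any Digital tool: the names MACHINE_CHECKS and describe_scb_vector are hypothetical, and the vectors are kept as the strings printed in this section rather than converted to numbers.

```python
from typing import Optional, Tuple

# SCB vector (as printed in this section) -> (machine check type, severity).
MACHINE_CHECKS = {
    "670": ("Processor machine check", "fatal"),
    "660": ("System machine check", "fatal or nonfatal"),
    "630": ("Processor corrected machine check", "nonfatal"),
}


def describe_scb_vector(vector: str) -> Optional[Tuple[str, str]]:
    """Return (type, severity) for a machine check SCB vector, or None."""
    return MACHINE_CHECKS.get(vector.strip())
```

For example, describe_scb_vector("660") returns the system machine check entry. Its severity depends on the specific cause, as listed under the fatal and nonfatal errors above.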

4.1.2 System Bus Transaction Cycle

To interpret error logs for system bus errors, you need a basic understanding of the system bus transaction cycle and the function of the commander, responder, and bystanders.

For any particular bus transaction cycle, there is one commander (either CPU or I/O) that initiates bus transactions and one responder (memory, CPU, or I/O) that accepts or supplies data in response to a command/address from the system bus commander. A bystander is a system bus node (CPU, I/O, or memory) that is not addressed by the current system bus commander.

There are four system bus transaction types: read, write, exchange, and nut.

• Read and write transactions consist of a command/address cycle followed by two data cycles.

• Exchange transactions are used to replace the cache block when a cache block resource conflict occurs. They consist of a command/address cycle followed by four data cycles: two writes and two reads.

• Nut transactions consist of a command/address cycle and two dummy data cycles for which no data is transferred.
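The cycle counts for the four transaction types can be tabulated as follows. This Python sketch simply restates the list above as data. The Transaction type and its field names are hypothetical, and the sketch assumes that both data cycles of a read or write carry data.

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    """Cycle counts for one system bus transaction type (illustrative)."""
    command_address_cycles: int
    data_cycles: int
    cycles_carrying_data: int


TRANSACTIONS = {
    "read": Transaction(1, 2, 2),
    "write": Transaction(1, 2, 2),
    "exchange": Transaction(1, 4, 4),  # two writes followed by two reads
    "nut": Transaction(1, 2, 0),       # dummy data cycles, no data moved
}


def total_cycles(kind: str) -> int:
    """Total bus cycles (command/address plus data) for a transaction type."""
    t = TRANSACTIONS[kind]
    return t.command_address_cycles + t.data_cycles
```

Under these assumptions, a read, write, or nut transaction occupies three bus cycles and an exchange occupies five.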

For more information, refer to the DEC 4000 Model 600 Series Technical Manual.

4.2 Error Logging and Event Log Entry Format

The OpenVMS AXP and DEC OSF/1 error handlers can generate several entry types. All error entries, with the exception of correctable memory errors, are logged immediately. Entries can be of variable length based on the number of registers within the entry.
