AlphaServer 1000A Service Guide

Order Number: EK–ALPSV–SV. B01

Digital Equipment Corporation Maynard, Massachusetts

First Printing, March 1996 Second Printing, October 1996

Digital Equipment Corporation makes no representations that the use of its products in the manner described in this publication will not infringe on existing or future patent rights, nor do the descriptions contained in this publication imply the granting of licenses to make, use, or sell equipment or software in accordance with the description.

Possession, use, or copying of the software described in this publication is authorized only pursuant to a valid written license from Digital or an authorized sublicensor.

VET, Digital, OpenVMS, StorageWorks, VAX DOCUMENT, and the DIGITAL logo. Digital UNIX Version 3.0 is an X/Open UNIX 93 branded product. Windows NT is a trademark of Microsoft Corp. All other trademarks and registered trademarks are the property of their respective holders.

FCC NOTICE: The equipment described in this manual generates, uses, and may emit radio frequency energy. The equipment has been type tested and found to comply with the limits for a Class B computing device pursuant to Subpart J of Part 15 of FCC Rules, which are designed to provide reasonable protection against such radio frequency interference when operated in a commercial environment. Operation of this equipment in a residential area may cause interference, in which case the user at his own expense may be required to take measures to correct the interference.

This document was prepared using VAX DOCUMENT Version 2.1.


Contents

Preface ................................................ xi

1 Troubleshooting Strategy

1.1 Troubleshooting the System . ....................... 1–1

1.1.1 Problem Categories ........................... 1–2

1.2 Service Tools and Utilities . . ....................... 1–8

1.3 Information Services ............................. 1–10

2 Power-Up Diagnostics and Display

2.1 Interpreting Error Beep Codes ...................... 2–2

2.2 Model 5/xxx SROM Error Codes..................... 2–10

2.3 Power-Up Screen . ............................... 2–13

2.3.1 Console Event Log ............................ 2–16

2.4 Model 4/xxx SROM Memory Power-Up Tests ........... 2–16

2.5 Mass Storage Problems Indicated at Power-Up . . . ...... 2–21

2.6 Storage Device LEDs ............................. 2–24

2.7 EISA Bus Problems Indicated at Power-Up ............ 2–27

2.7.1 Additional EISA Troubleshooting Tips ............. 2–28

2.8 PCI Bus Problems Indicated at Power-Up ............. 2–29

2.8.1 Additional PCI Troubleshooting Tips .............. 2–29

2.9 Fail-Safe Loader . . ............................... 2–30

2.9.1 Fail-Safe Loader Functions ..................... 2–30

2.9.2 Activating the Fail-Safe Loader . . . ............... 2–31

2.10 Power-Up Sequence .............................. 2–35

2.10.1 AC Power-Up Sequence . ....................... 2–35

2.10.2 DC Power-Up Sequence . ....................... 2–36

2.11 Firmware Power-Up Diagnostics .................... 2–36

2.11.1 Serial ROM Diagnostics . ....................... 2–37

2.11.2 Console Firmware-Based Diagnostics .............. 2–37


3 Running System Diagnostics

3.1 Running ROM-Based Diagnostics ................... 3–1

3.2 Command Summary ............................. 3–2

3.3 Command Reference ............................. 3–3

3.3.1 test . ....................................... 3–4

3.3.2 sys_exer .................................... 3–8

3.3.3 cat el and more el ............................ 3–10

3.3.4 memexer ................................... 3–11

3.3.5 memory .................................... 3–13

3.3.6 netew ...................................... 3–17

3.3.7 network .................................... 3–19

3.3.8 net-s ...................................... 3–21

3.3.9 net-ic...................................... 3–22

3.3.10 kill and kill_diags ............................ 3–23

3.3.11 show_status . . ............................... 3–24

3.4 Acceptance Testing and Initialization. . ............... 3–25

3.5 DECVET...................................... 3–25

4 Error Log Analysis

4.1 Fault Detection and Reporting ...................... 4–1

4.1.1 Machine Check/Interrupts ...................... 4–2

4.2 Error Logging and Event Log Entry Format ........... 4–4

4.3 Event Record Translation. . . ....................... 4–5

4.3.1 OpenVMS Alpha Translation Using DECevent ...... 4–5

4.3.2 Digital UNIX Translation Using DECevent . . . ...... 4–6

5 System Conﬁguration and Setup

5.1 Verifying System Conﬁguration ..................... 5–2

5.1.1 System Firmware ............................. 5–3

5.1.2 Switching Between Interfaces ................... 5–5

5.1.3 Verifying Configuration: ARC Menu Options for Windows NT ............................... 5–6

5.1.3.1 Display Hardware Conﬁguration .............. 5–6

5.1.3.2 Set Default Variables ....................... 5–9

5.1.4 Verifying Configuration: SRM Console Commands for Digital UNIX and OpenVMS .................... 5–10

5.1.4.1 show conﬁg............................... 5–11

5.1.4.2 show device .............................. 5–19

5.1.4.3 show memory ............................. 5–20

5.1.4.4 Setting and Showing Environment Variables ..... 5–20

5.2 System Bus Options .............................. 5–28

5.2.1 CPU Daughter Board . . . ....................... 5–29

5.2.2 Memory Modules ............................. 5–29

5.3 Motherboard ................................... 5–31

5.4 EISA Bus Options ............................... 5–32

5.5 ISA Bus Options . ............................... 5–32

5.5.1 Identifying ISA and EISA options . ............... 5–33

5.6 EISA Conﬁguration Utility . ....................... 5–33

5.6.1 Before You Run the ECU ....................... 5–34

5.6.2 How to Start the ECU . . ....................... 5–35

5.6.3 Conﬁguring EISA Options ...................... 5–37

5.6.4 Conﬁguring ISA Options ....................... 5–38

5.7 PCI Bus Options . ............................... 5–39

5.7.1 PCI-to-PCI Bridge ............................ 5–40

5.8 SCSI Buses .................................... 5–41

5.8.1 Internal StorageWorks Shelf .................... 5–41

5.8.2 External SCSI Expansion ...................... 5–42

5.8.3 SCSI Bus Conﬁgurations ....................... 5–42

5.9 Power Supply Conﬁgurations ....................... 5–46

5.10 Console Port Conﬁgurations. ....................... 5–49

5.10.1 set console . . . ............................... 5–49

5.10.2 set tt_allow_login ............................. 5–50

5.10.3 set tga_sync_green ............................ 5–51

5.10.4 Setting Up a Serial Terminal to Run ECU .......... 5–51

5.10.5 Using a VGA Controller Other than the Standard On-Board VGA ............................... 5–52

6 AlphaServer 1000A FRU Removal and Replacement

6.1 AlphaServer 1000A FRUs . . ....................... 6–1

6.2 Removal and Replacement . . ....................... 6–7

6.2.1 Cables ..................................... 6–9

6.2.2 Power Supply DC Cable Assembly . ............... 6–12

6.2.3 CPU Daughter Board . . . ....................... 6–21

6.2.4 Fans ....................................... 6–22

6.2.5 StorageWorks Drive ........................... 6–23

6.2.6 Internal StorageWorks Backplane . ............... 6–24

6.2.7 Memory Modules ............................. 6–26

6.2.8 Interlock Switch .............................. 6–30

6.2.9 Motherboard . ............................... 6–31

6.2.10 NVRAM Chip (E14) and NVRAM TOY Clock Chip (E78) ...................................... 6–36

6.2.11 OCP Module . . ............................... 6–36

6.2.12 Power Supply . ............................... 6–39

6.2.13 Speaker .................................... 6–40

6.2.14 Removable Media ............................. 6–41

A Default Jumper Settings

A.1 Motherboard Jumpers ............................ A–2

A.2 CPU Daughter Board (J3 and J4) Supported Settings .... A–4

A.3 CPU Daughter Board (J1 or J4 Jumper) .............. A–9

Glossary

Index

Examples

5–1 Sample Hardware Conﬁguration Display ........... 5–8

Figures

2–1 Model 4/xxx Systems: Jumper J1 on the CPU Daughter Board .............................. 2–7

2–2 Model 5/xxx Systems: Jumper J4 on the CPU Daughter Board .............................. 2–8

2–3 Model 5/xxx Systems: Jumper J1 on the CPU Daughter Board .............................. 2–9

2–4 AlphaBIOS Boot Menu. . ....................... 2–15

2–5 Model 4/xxx: Jumper J1 on the CPU Daughter Board ...................................... 2–20

2–6 Model 4/xxx: AlphaServer 1000A Memory Layout .... 2–21

2–7 StorageWorks Disk Drive LEDs (SCSI) ............ 2–25

2–8 Floppy Drive Activity LED ...................... 2–25

2–9 CD–ROM Drive Activity LED ................... 2–26

2–10 Model 4/xxx: Jumper J1 on the CPU Daughter Board ...................................... 2–32

2–11 Model 5/xxx Systems: Jumper J4 on the CPU Daughter Board .............................. 2–33

2–12 Model 5/xxx Systems: Jumper J1 on the CPU Daughter Board .............................. 2–34

3–1 Model 4/xxx: AlphaServer 1000A Memory Layout .... 3–16

3–2 Model 5/xxx: AlphaServer 1000A Memory Layout .... 3–16

5–1 System Architecture: AlphaServer 1000A Model 4/xxx Systems .................................... 5–2

5–2 System Architecture: AlphaServer 1000A Model 5/xxx Systems .................................... 5–3

5–3 Device Name Convention ....................... 5–19

5–4 Card Cages and Bus Locations................... 5–28

5–5 Model 4/xxx Memory Layout on the Motherboard .... 5–30

5–6 Model 5/xxx Memory Layout on the Motherboard .... 5–31

5–7 EISA and ISA Boards . . ....................... 5–33

5–8 PCI Board . . . ............................... 5–40

5–9 Single Controller Conﬁguration . . . ............... 5–43

5–10 Dual Controller Configuration with Split StorageWorks Backplane ............................... 5–44

5–11 Triple Controller Configuration with Split StorageWorks Backplane ...................... 5–45

5–12 Power Supply Conﬁgurations .................... 5–47

5–13 Power Supply Cable Connections . . ............... 5–48

6–1 FRUs, Front Right ............................ 6–5

6–2 FRUs, Rear Left .............................. 6–6

6–3 Opening Front Door ........................... 6–7

6–4 Removing Top Cover and Side Panels ............. 6–8

6–5 Floppy Drive Cable (34-Pin) ..................... 6–9

6–6 OCP Module Cable (10-Pin) ..................... 6–10

6–7 Power Cord . . ............................... 6–10

6–8 Power Supply Current Sharing Cable (3-Pin) . ...... 6–11

6–9 Removing Cable Channel Guide . . . ............... 6–12

6–10 Power Supply DC Cable Assembly . ............... 6–13

6–11 Power Supply Storage Harness (12-Pin)............ 6–14

6–12 Interlock/Server Management Cable (2-pin) . . . ...... 6–15

6–13 Internal StorageWorks Jumper Cable (68-Pin). ...... 6–16

6–14 Wide-SCSI (Controller to StorageWorks Shelf) Cable (68-Pin) .................................... 6–17

6–15 Wide-SCSI (Controller to StorageWorks Shelf) Cable (68-Pin) .................................... 6–18

6–16 Wide-SCSI (J10 to Bulkhead Connector) Cable (68-Pin) .................................... 6–19

6–17 SCSI (Embedded 8-bit) Removable-Media Cable (50-Pin) .................................... 6–20

6–18 Removing CPU Daughter Board . . ............... 6–21

6–19 Removing Fans .............................. 6–22

6–20 Removing StorageWorks Drive ................... 6–23

6–21 Removing Power Supply ....................... 6–24

6–22 Removing Internal StorageWorks Backplane . . ...... 6–25

6–23 Memory Layout on Motherboard . . ............... 6–26

6–24 Removing SIMMs from Motherboard .............. 6–27

6–25 Installing SIMMs on Motherboard . ............... 6–28

6–26 Removing the Interlock Safety Switch ............. 6–30

6–27 Removing EISA and PCI Options . . ............... 6–31

6–28 Removing CPU Daughter Board . . ............... 6–32

6–29 Removing Motherboard . ....................... 6–33

6–30 Motherboard Layout . . . ....................... 6–35

6–31 Removing Front Door . . . ....................... 6–36

6–32 Removing Front Panel . . ....................... 6–37

6–33 Removing the OCP Module ..................... 6–38

6–34 Removing Power Supply ....................... 6–39

6–35 Removing Speaker ............................ 6–40

6–36 Removing a CD–ROM Drive .................... 6–41

6–37 Removing a Tape Drive . ....................... 6–42

6–38 Removing a Floppy Drive ....................... 6–43

A–1 Motherboard Jumpers (Default Settings)........... A–2

A–2 AlphaServer 1000A 5/400 CPU Daughter Board (Jumper J3) ............................... A–4

A–3 AlphaServer 1000A 5/333 CPU Daughter Board (Jumper J3) ............................... A–5

A–4 AlphaServer 1000A 5/300 CPU Daughter Board (Jumper J3) ............................... A–6

A–5 AlphaServer 1000A 4/266 CPU Daughter Board (Jumpers J3 and J4) ....................... A–7

A–6 AlphaServer 1000A 4/233 CPU Daughter Board (Jumpers J3 and J4) ....................... A–8

A–7 Model 5/xxx Systems: Jumper J4 on the CPU Daughter Board .............................. A–9

A–8 Model 5/xxx Systems: Jumper J1 on the CPU Daughter Board .............................. A–10

A–9 Model 4/xxx Systems: Jumper J1 on the CPU Daughter Board .............................. A–11

Tables

1–1 Diagnostic Flow for Power Problems .............. 1–3

1–2 Diagnostic Flow for Problems Getting to Console Mode ...................................... 1–4

1–3 Diagnostic Flow for Problems Reported by the Console Program .................................... 1–5

1–4 Diagnostic Flow for Boot Problems ............... 1–7

1–5 Diagnostic Flow for Errors Reported by the Operating System ..................................... 1–8

2–1 Interpreting Error Beep Codes ................... 2–2

2–2 Model 5/xxx SROM Test/Status Codes ............. 2–10

2–3 Console Power-Up Countdown Description and Field Replaceable Units (FRUs) ...................... 2–14

2–4 SROM Memory Tests, CPU Jumper J1 ............ 2–17

2–5 Mass Storage Problems . ....................... 2–22

2–6 Troubleshooting RAID Problems . . ............... 2–24

2–7 EISA Troubleshooting . . ....................... 2–27

2–8 PCI Troubleshooting........................... 2–29

3–1 Summary of Diagnostic and Related Commands ..... 3–2

4–1 AlphaServer 1000 Fault Detection and Correction .... 4–2

5–1 Listing the ARC Firmware Device Names .......... 5–7

5–2 ARC Firmware Device Names ................... 5–7

5–3 ARC Firmware Environment Variables ............ 5–9

5–4 Environment Variables Set During System Configuration ............................... 5–21

5–5 Operating System Memory Requirements .......... 5–30

5–6 Summary of Procedure for Configuring EISA Bus (EISA Options Only) ....................... 5–37

5–7 Summary of Procedure for Configuring EISA Bus with ISA Options ............................... 5–38

5–8 SCSI Storage Conﬁgurations .................... 5–42

6–1 AlphaServer 1000A FRUs ...................... 6–2

6–2 Power Cord Order Numbers ..................... 6–11

Preface

This guide describes the procedures and tests used to service AlphaServer 1000A systems. AlphaServer 1000A systems use a deskside "wide-tower" enclosure.

Intended Audience

This guide is intended for use by Digital Equipment Corporation service personnel and qualiﬁed self-maintenance customers.

Conventions

The following conventions are used in this guide:

Convention     Meaning

Return         A key name enclosed in a box indicates that you press that key.

Ctrl/x         Ctrl/x indicates that you hold down the Ctrl key while you press another key, indicated here by x. In examples, this key combination is enclosed in a box, for example, Ctrl/C.

Warning        Warnings contain information to prevent personal injury.

Caution        Cautions provide information to prevent damage to equipment or software.

Note           A note calls the reader's attention to any information that may be of special importance.

boot           Console and operating system commands are shown in this special typeface.

[]             In command format descriptions, brackets indicate optional elements.

show config    Console command abbreviations must be entered exactly as shown. Commands shown in lowercase can be entered in either uppercase or lowercase.

italic type    In console command sections, italic type indicates a variable.

< >            In console mode online help, angle brackets enclose a placeholder for which you must specify a value.

{ }            In command descriptions, braces containing items separated by commas imply mutually exclusive items.

Related Documentation

• AlphaServer 1000A Owner’s Guide, EK-ALPSV-OG

• AlphaServer 1000/1000A Model 5/xxx Owner's Guide Supplement, EK-AL530-OG

• DEC Veriﬁer and Exerciser Tool User’s Guide, AA-PTTMD-TE

• Guide to Kernel Debugging, AA-PS2TD-TE

• OpenVMS Alpha System Dump Analyzer Utility Manual, AA-PV6UB-TE

• DECevent Translation and Reporting Utility for OpenVMS Alpha, User and Reference Guide, AA-Q73KC-TE

• DECevent Translation and Reporting Utility for Digital UNIX, User and Reference Guide, AA-QAA3A-TE

• DECevent Analysis and Notiﬁcation Utility for OpenVMS Alpha, User and Reference Guide, AA-Q73LC-TE

• DECevent Analysis and Notification Utility for Digital UNIX, User and Reference Guide, AA-QAA4A-TE

• StorageWorks RAID Array 200 Subsystems Controller Installation and Standalone Conﬁguration Utility User’s Guide, EK-SWRA2-IG


1 Troubleshooting Strategy

This chapter describes the troubleshooting strategy for AlphaServer 1000A systems.

• Section 1.1 provides questions to consider before you begin troubleshooting an AlphaServer 1000A system.

• Tables 1–1 through 1–5 provide a diagnostic ﬂow for each category of system problem.

• Section 1.2 lists the product tools and utilities.

• Section 1.3 lists available information services.

1.1 Troubleshooting the System

Before troubleshooting any system problem, check the site maintenance log for the system’s service history. Be sure to ask the system manager the following questions:

• Has the system been used before and did it work correctly?

• Have changes to hardware or updates to firmware or software been made to the system recently? If so, are the revision numbers compatible for the system? (Refer to the hardware and operating system release notes.)

• What is the state of the system: is the operating system running?

If the operating system is down and you are not able to bring it up, use the console environment diagnostic tools, such as the power-up display and ROM-based diagnostics (RBDs).

If the operating system is running, use the operating system environment diagnostic tools, such as the DECevent event management utility (to translate and interpret error logs), crash dumps, and exercisers (DEC VET).
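The choice between console tools and operating system tools described above can be sketched as a small lookup. This is only an illustrative sketch; the function name and the tool labels are assumptions made for this example and are not part of any Digital utility.

```python
def diagnostic_tools(os_running: bool) -> list[str]:
    """Return the diagnostic tools suggested for the current system state.

    Hypothetical helper: it mirrors the guidance in Section 1.1 only.
    """
    if os_running:
        # Operating system environment tools
        return [
            "DECevent error log translation",
            "crash dumps",
            "DEC VET exercisers",
        ]
    # Console environment tools, used when the operating system cannot be brought up
    return ["power-up display", "ROM-based diagnostics (RBDs)"]


if __name__ == "__main__":
    for state in (True, False):
        print("OS running" if state else "OS down", "->", diagnostic_tools(state))
```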

Troubleshooting Strategy 1–1

1.1.1 Problem Categories

System problems can be classiﬁed into the following ﬁve categories. Using these categories, you can quickly determine a starting point for diagnosis and eliminate the unlikely sources of the problem.

1. Power problems (Table 1–1)

2. No access to console mode (Table 1–2)

3. Console-reported failures (Table 1–3)

4. Boot failures (Table 1–4)

5. Operating system-reported failures (Table 1–5)
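As a rough illustration, the five categories can be treated as an ordered series of checks, each pointing to its diagnostic flow table. The sketch below assumes that the checks are applied in the order listed (power first, operating system last); that ordering, the function name, and the parameter names are assumptions for this example.

```python
from enum import Enum
from typing import Optional


class ProblemCategory(Enum):
    """The five problem categories, each mapped to its diagnostic flow table."""
    POWER = "Table 1-1"
    NO_CONSOLE = "Table 1-2"
    CONSOLE_REPORTED = "Table 1-3"
    BOOT = "Table 1-4"
    OS_REPORTED = "Table 1-5"


def classify(powers_on: bool, reaches_console: bool, console_reports_error: bool,
             boots: bool, os_reports_error: bool) -> Optional[ProblemCategory]:
    """Return the first category whose symptom is observed, or None if none apply."""
    if not powers_on:
        return ProblemCategory.POWER
    if not reaches_console:
        return ProblemCategory.NO_CONSOLE
    if console_reports_error:
        return ProblemCategory.CONSOLE_REPORTED
    if not boots:
        return ProblemCategory.BOOT
    if os_reports_error:
        return ProblemCategory.OS_REPORTED
    return None


if __name__ == "__main__":
    category = classify(True, True, False, False, False)
    print(category.name, "-> see", category.value)
```

Checking in this order eliminates the earlier stages of system start-up before the later ones are considered, which matches the intent of finding a starting point quickly.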


Table 1–1 Diagnostic Flow for Power Problems

Symptom: System does not power on.

Action:

• Check the power source and power cord.

• Check that the system's top cover is properly secured. A safety interlock switch shuts off power to the system if the top cover is removed.

• If there are two power supplies, make sure both power supplies are plugged in.

• Check the On/Off switch setting on the operator control panel.

• Check that the ambient room temperature is within environmental specifications (10–40°C, 50–104°F).

• Check that internal power supply cables are plugged in at both the power supply and system motherboard (Section 5.9).

Symptom: Power supply shuts down after a few seconds (fan failure).

Action: Using a flashlight, look through the front (to the left of the internal StorageWorks shelf) to determine if the fans are spinning at power-up. A failure of either fan causes the system to shut down after a few seconds.


Table 1–2 Diagnostic Flow for Problems Getting to Console Mode

Symptom: Power-up screen is not displayed.

Action:

Interpret the error beep codes at power-up (Section 2.1) for a failure detected during self-tests. In addition to beep codes, model 5/xxx systems display error codes on the OCP (Section 2.2).

Check that the keyboard and monitor are properly connected and turned on.

If the power-up screen is not displayed, yet the system enters console mode when you press Return, check that the console environment variable is set correctly. If you are using a VGA monitor as the console terminal, the console variable should be set to "graphics." If you are using a serial console terminal, the console variable should be set to "serial."

If a VGA controller other than the standard on-board VGA controller is being used, refer to Section 5.10 for more information.

If console is set to serial, the power-up screen is routed to the COM1 serial communication port (Section 5.10) and cannot be viewed from the VGA monitor.

Try connecting a console terminal to the COM1 serial communication port (Section 5.10). If necessary, use an MMJ-to-9-pin adapter (H8571-J). Check the baud rate setting for the console terminal and the system. The system baud rate setting is 9600. When using the COM1 port, you must set the console environment variable to "serial."

For certain situations, power up using the fail-safe loader (Section 2.9) to load new console firmware from a diskette.
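The console variable guidance in Table 1–2 can be summarized as a mapping from the type of console terminal to the value of the console environment variable. The sketch below is illustrative only: the function name and the "vga" and "serial" keys are assumptions for this example, and the returned string simply follows the set console command form documented in Section 5.10.1.

```python
# Values taken from Table 1-2: VGA monitor -> "graphics", serial terminal -> "serial".
CONSOLE_VALUES = {"vga": "graphics", "serial": "serial"}

# System baud rate stated in Table 1-2; the terminal must be set to match.
SYSTEM_BAUD_RATE = 9600


def console_command(terminal: str) -> str:
    """Build the console command line for the given terminal type (hypothetical helper)."""
    try:
        value = CONSOLE_VALUES[terminal]
    except KeyError:
        raise ValueError(f"unknown terminal type: {terminal!r}") from None
    return f"set console {value}"


if __name__ == "__main__":
    print(console_command("vga"))
    print(console_command("serial"), f"(terminal baud rate: {SYSTEM_BAUD_RATE})")
```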


Table 1–3 Diagnostic Flow for Problems Reported by the Console Program

Symptom: Power-up tests do not complete.

Action: Interpret the error beep codes at power-up (Section 2.1) and check the power-up screen (Section 2.3) for a failure detected during self-tests. In addition, model 5/xxx systems display error codes on the OCP (Section 2.2).

Symptom: Console program reports error:

• Error beep codes report an error at power-up.

• Power-up screen includes error messages.

• Model 5/xxx systems display error codes on the OCP display.

Action:

Use the error beep codes (Section 2.1) and/or console terminal (Section 2.3) to determine the error. In addition, model 5/xxx systems display error codes on the OCP (Section 2.2).

Examine the console event log (enter the more el command) (Section 2.3.1) or the power-up screen (Section 2.3) to check for embedded error messages recorded during power-up.

If the power-up screen or console event log indicates problems with mass storage devices, or if storage devices are missing from the show config display, use the troubleshooting tables (Section 2.5) to determine the problem.

Note: The external SCSI terminator must be installed on the SCSI port at the rear of the enclosure. Without the termination, some SCSI drives will not be available; these drives will be missing from the show config display.

If the power-up screen or console event log indicates problems with EISA devices, or if EISA devices are missing from the show config display, use the troubleshooting table (Section 2.7) to determine the problem.

If the power-up screen or console event log indicates problems with PCI devices, or if PCI devices are missing from the show config display, use the troubleshooting table (Section 2.8) to determine the problem.

Run the ROM-based diagnostic (RBD) tests (Section 3.1) to verify the problem.


Table 1–4 Diagnostic Flow for Boot Problems

Symptom: System cannot find boot device.

Action:

Check the system configuration for the correct device parameters (node ID, device name, and so on).

• For Digital UNIX and OpenVMS, use the show config and show device commands (Section 5.1).

• For Windows NT, use the Display Hardware Configuration display and the Set Default Environment Variables display (Section 5.1).

Check the system configuration for the correct environment variable settings.

• For Digital UNIX and OpenVMS, examine the auto_action, bootdef_dev, boot_osflags, and os_type environment variables. Also, make sure that the bus_probe_algorithm environment variable is set to "new" (Section 5.1.4.4).

For problems booting over a network, check the ew*0_protocols or er*0_protocols environment variable settings: Systems booting from a Digital UNIX server should be set to bootp; systems booting from an OpenVMS server should be set to mop (Section 5.1.4.4).

• For Windows NT, examine the FWSEARCHPATH, AUTOLOAD, and COUNTDOWN environment variables (Section 5.1.4.4).

Symptom: Device does not boot.

Action:

For problems booting over a network, check the ew*0_protocols or er*0_protocols environment variable settings: Systems booting from a Digital UNIX server should be set to bootp; systems booting from an OpenVMS server should be set to mop (Section 5.1.4.4).

For systems running Digital UNIX and OpenVMS, make sure that the bus_probe_algorithm environment variable is set to "new" (Section 5.1.4.4).

Run the device tests (Section 3.1) to check that the boot device is operating.
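The environment variable checks in Table 1–4 for Digital UNIX and OpenVMS systems can be sketched as a simple validation pass over a set of variable values. This is an illustrative sketch under stated assumptions: the function name, the dictionary input (for example, values transcribed by hand from the console), and the sample variable name ewa0_protocols are hypothetical, and the sketch compares each value as a single exact setting.

```python
from fnmatch import fnmatch
from typing import Optional

# Expected network boot protocol by boot server type, as stated in Table 1-4.
EXPECTED_PROTOCOL = {"Digital UNIX": "bootp", "OpenVMS": "mop"}


def check_boot_environment(env: dict, boot_server: Optional[str] = None) -> list:
    """Return a list of settings that differ from the Table 1-4 guidance.

    env maps environment variable names to their current values.
    boot_server is "Digital UNIX" or "OpenVMS" for network boots, else None.
    """
    problems = []
    if env.get("bus_probe_algorithm") != "new":
        problems.append("bus_probe_algorithm should be set to 'new'")
    if boot_server is not None:
        expected = EXPECTED_PROTOCOL[boot_server]
        for name, value in env.items():
            # Matches the ew*0_protocols and er*0_protocols variables named in the table.
            if fnmatch(name, "ew*0_protocols") or fnmatch(name, "er*0_protocols"):
                if value != expected:
                    problems.append(f"{name} should be set to '{expected}'")
    return problems


if __name__ == "__main__":
    sample = {"bus_probe_algorithm": "old", "ewa0_protocols": "mop"}
    for problem in check_boot_environment(sample, boot_server="Digital UNIX"):
        print(problem)
```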


Table 1–5 Diagnostic Flow for Errors Reported by the Operating System

Symptom: System is hung or has crashed.

Action:

Examine the crash dump file.

Refer to OpenVMS Alpha System Dump Analyzer Utility Manual (AA-PV6UB-TE) for information on how to interpret OpenVMS crash dump files.

Refer to the Guide to Kernel Debugging (AA–PS2TD–TE) for information on using the Digital UNIX Krash Utility.

Symptom: Errors have been logged and the operating system is up.

Action:

Examine the operating system error log files to isolate the problem (Chapter 4).

If the problem occurs intermittently, run an operating system exerciser, such as DEC VET, to stress the system.

Refer to the DEC Verifier and Exerciser Tool User's Guide (AA–PTTMD–TE) for instructions on running DEC VET.

1.2 Service Tools and Utilities

This section lists the array of service tools and utilities available for acceptance testing, diagnosis, and serviceability and provides recommendations for their use.

Error Handling/Logging Tools

Digital UNIX, OpenVMS, and Microsoft Windows NT operating systems provide recovery from errors, fault handling, and event logging. The DECevent Translation and Reporting Utility provides bit-to-text translation of Digital UNIX and OpenVMS error logs so that they can be interpreted.

RECOMMENDED USE: Analysis of error logs is the primary method of diagnosis and fault isolation. If the system is up, or you are able to bring it up, look at this information ﬁrst.

ROM-Based Diagnostics (RBDs)

Many ROM-based diagnostics and exercisers are embedded in AlphaServer 1000A systems. ROM-based diagnostics execute automatically at power-up and can be invoked in console mode using console commands.


RECOMMENDED USE: ROM-based diagnostics are the primary means of testing the console environment and diagnosing the CPU, memory, Ethernet, I/O buses, and SCSI and DSSI subsystems. Use ROM-based diagnostics in the acceptance test procedures when you install a system, add a memory module, or replace the following components: CPU module, memory module, motherboard, I/O bus device, or storage device. Refer to Chapter 3 for information on running ROM-based diagnostics.

Loopback Tests

Internal and external loopback tests are used to isolate a failure by testing segments of a particular control or data path. The loopback tests are a subset of the ROM-based diagnostics.

RECOMMENDED USE: Use loopback tests to isolate problems with the COM2 serial port, the parallel port, and Ethernet controllers. Refer to Chapter 3 for instructions on performing loopback tests.

Firmware Console Commands

Console commands are used to set and examine environment variables and device parameters, as well as to invoke ROM-based diagnostics and exercisers. For example, the show memory, show configuration, and show device commands are used to examine the configuration; the set (bootdef_dev, auto_action, and boot_osflags) commands are used to set environment variables; and the cdp command is used to configure DSSI parameters.

RECOMMENDED USE: Use console commands to set and examine environment variables and device parameters and to run RBDs. Refer to Section 5.1 for information on conﬁguration-related ﬁrmware commands and Chapter 3 for information on running RBDs.

Operating System Exercisers (DEC VET)

The Digital Veriﬁer and Exerciser Tool (DEC VET) is supported by the Digital UNIX, OpenVMS, and Windows NT operating systems. DEC VET performs exerciser-oriented maintenance testing of both hardware and operating system.

RECOMMENDED USE: Use DEC VET as part of acceptance testing to ensure that the CPU, memory, disk, tape, ﬁle system, and network are interacting properly. Also use DEC VET to stress test the user’s environment and conﬁguration by simulating system operation under heavy loads to diagnose intermittent system failures.


Crash Dumps

For fatal errors, such as fatal bugchecks, Digital UNIX and OpenVMS operating systems will save the contents of memory to a crash dump ﬁle.

RECOMMENDED USE: Crash dump ﬁles can be used to determine why the system crashed. To save a crash dump ﬁle for analysis, you need to know the proper system settings. Refer to the OpenVMS Alpha System Dump Analyzer Utility Manual (AA-PV6UB-TE) or the Guide to Kernel Debugging (AA–PS2TD–TE) for Digital UNIX.

1.3 Information Services

Several information resources are available, including online information for servicers and customers, computer-based training, and maintenance documentation database services. A brief description of some of these resources follows.

Fast Track Service Help File

The information contained in this guide, including the ﬁeld-replaceable unit (FRU) procedures and illustrations, is available in online format. You can download the hypertext ﬁle (A1000A-S.HLP) or a self-extracting .HLP ﬁle from TIMA, or order the diskette (AK-QQRMB-CA) or the AlphaServer 1000A Maintenance Kit (QZ-OOUAB-GC). The maintenance kit includes hardcopy, diskette, and illustrated parts breakdown.

Alpha Firmware Updates

Under certain circumstances, such as a CPU upgrade or replacement of the system backplane, you need to update your system ﬁrmware. An Alpha Firmware CD–ROM is shipped on an ‘‘as released’’ basis with Digital UNIX, OpenVMS, and Windows NT operating systems. The Alpha ﬁrmware ﬁles can also be downloaded from the Internet as follows:

http://ftp.digital.com/pub/DEC/Alpha/firmware/

New versions of firmware released between shipments of the Alpha Firmware CD–ROM are available in an interim directory:

ftp://ftp.digital.com/pub/Digital/Alpha/firmware/interim/


ECU Revisions

The EISA Conﬁguration Utility (ECU) is used for conﬁguring EISA options on AlphaServer systems. Systems are shipped with an ECU kit, which includes the ECU license. Customers who already have the ECU and license, but need the latest revision of the ECU, can order a separate kit. Call 1-800-DIGITAL to order.

If the customer plans to migrate from Digital UNIX or OpenVMS to Windows NT, you must re-run the appropriate ECU. Failure to run the operating system-specific ECU will result in system failure.

OpenVMS Patches

Software patches for the OpenVMS operating system are available from the World Wide Web as follows:

http://www.service.digital.com/html/patch_service.html

Choose the ‘‘Contract Access’’ option if you have a valid software contract with Digital or you wish to become a software contract customer. Choose the ‘‘Public Access’’ option if you do not have a software service contract.

Late-Breaking Technical Information

You can download up-to-date ﬁles and late-breaking technical information from the Internet for managing AlphaServer 1000A systems.

• FTP address:

ftp.digital.com
cd /pub/DEC/Alpha/systems/as1000/docs

• World Wide Web address:

http://www.service.digital.com/alpha/server/1000.html

The information includes ﬁrmware updates, the latest conﬁguration utilities, software patches, lists of supported options, Wide SCSI information and more.

Supported Options

Refer to the AlphaServer 1000A Supported Options List for a list of options supported under Digital UNIX, OpenVMS, and Windows NT. The options list is available from the Internet as follows:

• FTP address:

ftp://ftp.digital.com/pub/Digital/Alpha/systems/

• World Wide Web address:

http://www.service.digital.com/alpha/server/


You can obtain information about hardware conﬁgurations for the AlphaServer 1000A from the Digital Systems and Options Catalog. The catalog is regularly published to assist in ordering and conﬁguring systems and hardware options. Each printing of the catalog presents all of the products that are announced, actively marketed, and available for ordering.

Access printable PostScript files of any section of the catalog from the Internet as follows (be sure to check the Readme file):

• ftp://ftp.digital.com/pub/Digital/info/SOC/

Training

The following Computer Based Training (CBT) and lecture lab courses are available from the Digital training center:

• Alpha Concepts

• DSSI Concepts: EY-9823E

• ISA and EISA Bus Concepts: EY-I113E-P0

• RAID Concepts: EY-N935E

• SCSI Concepts and Troubleshooting: EY-P841E, EY-N838E

Digital Assisted Services

Digital Assisted Services (DAS) offers products, services, and programs to customers who participate in the maintenance of Digital computer equipment. Components of Digital assisted services include:

• Spare parts and kits

• Diagnostics and service information/documentation

• Tools and test equipment

• Parts repair services, including Field Change Orders


Power-Up Diagnostics and Display

This chapter provides information on how to interpret error beep codes and the power-up display on the console screen. In addition, a description of the power-up and ﬁrmware power-up diagnostics is provided as a resource to aid in troubleshooting.

• Section 2.1 describes how to interpret error beep codes at power-up.

• Section 2.4 describes SROM memory tests that can be run at power-up to isolate failing SIMM memory.

• Section 2.3 describes how to interpret the power-up screen display.

• Section 2.5 describes how to troubleshoot mass-storage problems indicated at power-up or storage devices missing from the show config display.

• Section 2.6 shows the location of storage device LEDs.

• Section 2.7 describes how to troubleshoot EISA bus problems indicated at power-up or EISA devices missing from the show config display.

• Section 2.8 describes how to troubleshoot PCI bus problems indicated at power-up or PCI devices missing from the show config display.

• Section 2.9 describes the use of the Fail-Safe Loader.

• Section 2.10 describes the power-up sequence.

• Section 2.11 describes power-on self-tests.


2.1 Interpreting Error Beep Codes

If errors are detected at power-up, audible beep codes are emitted from the system. For example, if the SROM code could not ﬁnd any good memory, you would hear a 1-3-3 beep code (one beep, a pause, a burst of three beeps, a pause, and another burst of three beeps).

Be sure to check that the CPU daughter board is properly seated in its connector if errors are reported.

Note

A single beep is emitted for model 5/xxx systems when the SROM code has successfully completed. The console ﬁrmware then continues with its power-up tests.

The beep codes are the primary diagnostic tool for troubleshooting problems when console mode cannot be accessed. Refer to Table 2–1 for information on interpreting error beep codes.

Table 2–1 Interpreting Error Beep Codes

Beep Code: 1-1-2
Problem: ROM data path error detected while loading ARC/SRM console code.
Corrective Action:
1. Use the Fail-Safe Loader to load new ARC/SRM console code (Section 2.9).
2. If successfully loading new console firmware does not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 1-1-4
Problem: The SROM code is unable to load the console code: Flash ROM header area or checksum error detected.
Corrective Action:
1. Use the Fail-Safe Loader to load new ARC/SRM console code (Section 2.9).
2. If successfully loading new console firmware does not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 1-2-1
Problem: TOY NVRAM failure.
Corrective Action: Replace the TOY NVRAM chip (E78) on system motherboard (Chapter 6).

Beep Code: 1-2-4
Problem: Backup cache error.
Corrective Action: Replace the CPU daughter board (Chapter 6). Model 5/xxx systems can be operated with the Bcache disabled until a replacement CPU daughter board is available. Bank 4 of the J1 or J4 jumper on the CPU daughter board is used to disable the Bcache (Figures 2–2 and 2–3).

Beep Code: 1-3-3
Problem: No usable memory detected.
Corrective Action:
1. Verify that the memory modules are properly seated and try powering up again.
2. Swap bank 0 memory with known good memory and run SROM memory tests at power-up (Section 2.4).
3. If populating bank 0 with known good memory does not solve the problem, replace the CPU daughter board (Chapter 6).
4. If replacing the CPU daughter board does not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 3-1-2
Problem: J1 jumper on CPU daughter board set incorrectly or failure of native SCSI controller (NCR810).
Corrective Action:
1. Check that the J1 jumper on the CPU daughter board is set at bank 1 for AlphaServer 1000A systems, as opposed to bank 0, reserved for AlphaServer 1000 systems (Figure 2–5). Note that model 5/xxx systems can use either standard boot setting, bank 0 or 1, regardless of system, and that model 5/300 systems use jumper designator J4, rather than J1.
2. If the J1 jumper setting is not the problem, replace the motherboard (Chapter 6).

Beep Code: 3-3-1
Problem: Generic system failure. Possible problem sources include the TOY NVRAM chip (Dallas DS1287A) or PCI-to-EISA bridge chipset (Intel 82375EB).
Corrective Action:
1. Replace the TOY NVRAM chip (E78) on system motherboard (Chapter 6).
2. If replacing the TOY NVRAM chip did not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 3-3-2
Problem: J1 jumper on CPU daughter board set incorrectly or failure of the PCI-to-PCI bridge (DECchip 21050).
Corrective Action:
1. Check that the J1 jumper on the CPU daughter board is set at bank 1 for AlphaServer 1000A systems, as opposed to bank 0, reserved for AlphaServer 1000 systems (Figure 2–5). Note that model 5/xxx systems can use either standard boot setting, bank 0 or 1, regardless of system, and that model 5/300 systems use jumper designator J4, rather than J1.
2. If the J1 jumper setting is not the problem, replace the motherboard (Chapter 6).

Beep Code: 3-3-3
Problem: J1 jumper on the CPU daughter board set incorrectly or failure of the native SCSI controller (NCR810) on the system motherboard.
Corrective Action:
1. Check that the J1 jumper on the CPU daughter board is set at bank 1 for AlphaServer 1000A systems, as opposed to bank 0, reserved for AlphaServer 1000 systems (Figure 2–5). Note that model 5/xxx systems can use either standard boot setting, bank 0 or 1, regardless of system, and that model 5/300 systems use jumper designator J4, rather than J1.
2. If the J1 jumper setting is not the problem, replace the motherboard (Chapter 6).
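When logging a service call, it can help to turn the counted beep bursts into the code string used in Table 2–1. The following is a minimal Python sketch of that lookup, not a Digital-supplied tool. The names BEEP_CODES, beep_code_from_bursts, and describe_beeps are hypothetical, and the problem descriptions are abbreviated from the table.

```python
# Hypothetical lookup transcribed from Table 2-1 (descriptions abbreviated).
BEEP_CODES = {
    "1-1-2": "ROM data path error while loading ARC/SRM console code",
    "1-1-4": "SROM unable to load console code (flash ROM header or checksum error)",
    "1-2-1": "TOY NVRAM failure",
    "1-2-4": "Backup cache error",
    "1-3-3": "No usable memory detected",
    "3-1-2": "J1 jumper set incorrectly or native SCSI controller (NCR810) failure",
    "3-3-1": "Generic system failure (TOY NVRAM chip or PCI-to-EISA bridge)",
    "3-3-2": "J1 jumper set incorrectly or PCI-to-PCI bridge (DECchip 21050) failure",
    "3-3-3": "J1 jumper set incorrectly or native SCSI controller (NCR810) failure",
}


def beep_code_from_bursts(bursts):
    """Join counted beep bursts, for example [1, 3, 3], into '1-3-3'."""
    return "-".join(str(count) for count in bursts)


def describe_beeps(bursts):
    """Return (code, problem) for a sequence of beep bursts."""
    code = beep_code_from_bursts(bursts)
    return code, BEEP_CODES.get(code, "Unknown code; recount the beeps")


if __name__ == "__main__":
    # One beep, a pause, three beeps, a pause, three beeps.
    print(describe_beeps([1, 3, 3]))
```

A code that is not in the table is reported as unknown rather than guessed, since a miscounted burst is more likely than an undocumented code.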

Figure 2–1 Model 4/xxx Systems: Jumper J1 on the CPU Daughter Board

Bank  Jumper Setting
0     Standard boot setting (AlphaServer 1000 systems)
1     Standard boot setting (AlphaServer 1000A systems)
2     Mini-console setting: Internal use only
3     SROM CacheTest: backup cache test
4     SROM BCacheTest: backup cache and memory test
5     SROM memTest: memory test with backup and data cache disabled
6     SROM memTestCacheOn: memory test with backup and data cache enabled
7     Fail-Safe Loader setting: selects fail-safe loader firmware

Figure 2–2 Model 5/xxx Systems: Jumper J4 on the CPU Daughter Board

Bank  Jumper Setting
0     Standard boot setting (AlphaServer 1000/1000A systems)
1     Standard boot setting (AlphaServer 1000/1000A systems)
2     Mini-console setting: Internal use only
3     Mini-console setting: Internal use only
4     Power up with no Bcache: Power up with Bcache disabled allows the system to run despite bad Bcache until a replacement daughter board is available
5     Mini-console setting: Internal use only
6     Mini-console setting: Internal use only
7     Fail-Safe Loader setting: selects fail-safe loader firmware

Figure 2–3 Model 5/xxx Systems: Jumper J1 on the CPU Daughter Board

Bank  Jumper Setting
4     Power up with no Bcache: Power up with Bcache disabled allows the system to run despite bad Bcache until a replacement daughter board is available
5     Mini-console setting: Internal use only
6     Mini-console setting: Internal use only
7     Fail-Safe Loader setting: selects fail-safe loader firmware

2.2 Model 5/xxx SROM Error Codes

Model 5/xxx systems report errors and status to the OCP display during SROM power-up tests. Table 2–2 provides an explanation of the status and error codes that may be displayed:

• Fatal error codes identify errors that prevent the system from accessing the console and booting the operating system.

• Nonfatal error codes identify errors that may not prevent the system from accessing the console, but may prevent the system from successfully booting the operating system.

• Execution status codes identify the process that is currently underway.

Note

If errors are reported, be sure that the CPU daughter board is properly seated in its connectors.

Table 2–2 Model 5/xxx SROM Test/Status Codes

OCP Code  Description                               Likely FRU

Fatal Error Codes
FF        No s-cache bits set in sc_ctl register    CPU daughter board
FD        Floppy load error                         Bad or wrong diskette in drive
FA        No usable memory detected                 SIMM memory or backplane
F9        System init failure                       CPU daughter board
F8        PCI data path error                       CPU daughter board
F7        CIA/PCEB I/O register init failure        CPU daughter board
F6        Bad CIA memory csr was detected           CPU daughter board
F4        Bcache data path error                    CPU daughter board
F3        Bcache address line error                 CPU daughter board
F1        Flash ROM data path read error            CPU daughter board

Nonfatal Error Codes
EB        CPU speed error detected                  CPU daughter board
EA        PCI-to-PCI (PPB) data path error          Motherboard
E9        No real-time clock (TOY)                  TOY/NVRAM chip
E6        EISA configuration NVRAM
E5        Main memory data path error
E4        Q-logic SCSI data path error
E3        Main memory address lines error
E2        Super I/O error
E1        Main memory cell test error
E0        Flash ROM checksum error

Execution Status Codes
DF        SROM program beginning to initialize the EV5 CPU
DE        Initialize CPU and system interface
DD        Sizing CPU speed
DC        Sizing S-cache
DB        Initializing and testing the PCI bus
DA        Sizing B-cache
D9        Sizing memory
D8        Configuring memory
D7        Initializing Bcache
D6        Testing memory
D5        Testing Bcache bits
D4        Testing memory bits
D3        Testing Bcache address
D2        Testing memory address
D1        Testing Bcache cells
D0        Testing memory cells
CF        Initializing memory
CD        Loading Flash ROM code
CC        Re-initializing CPU and system interface
CB        SROM execution completed                  The system could hang here if EV4 console code is used with 5/xxx (EV5) systems.
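The three groups in Table 2–2 occupy separate hexadecimal ranges, so an OCP code can be sorted into its group by value alone. The sketch below illustrates this in Python. The function name classify_ocp_code is hypothetical, and the sketch assumes that every value inside a range belongs to that group, even though the table does not list every value (FE and E8, for example, do not appear).

```python
def classify_ocp_code(code):
    """Classify a two-digit OCP code using the ranges seen in Table 2-2.

    Assumption: unlisted values inside a range share the group of their
    listed neighbours. The table itself does not confirm this.
    """
    value = int(code, 16)  # accepts "FA" or "fa"
    if 0xF1 <= value <= 0xFF:
        return "fatal"      # console cannot be reached
    if 0xE0 <= value <= 0xEB:
        return "nonfatal"   # console may be reachable, boot may fail
    if 0xCB <= value <= 0xDF:
        return "status"     # progress indication only
    return "unknown"


if __name__ == "__main__":
    for code in ("FA", "E4", "D6", "10"):
        print(code, classify_ocp_code(code))
```

A fatal or nonfatal result should still be looked up in the table itself to find the likely FRU.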

2.3 Power-Up Screen

During power-up self-tests, the test status and result are displayed on the console terminal. Information similar to the following example should be displayed on the screen.

AlphaServer 1000 Model 4/xxx Systems

ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.
ef.df.ee.f4.ed.ec.initializing keyboard
eb.....ea.e9.e8.e7.e6.e5.e4.e3.e2.e1.e0.
X4.4-5365, built on Oct 27 1995 at 09:26:04
>>>

AlphaServer 1000 Model 5/xxx Systems

ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.ef.df.ee.f4.
probing hose 0, PCI
probing PCI-to-PCI bridge, bus 1
bus 1, slot 0 -- pka -- QLogic ISP1020
bus 0, slot 11 -- ewa -- DECchip 21040-AA
probing hose 1, EISA
ECU error, slot 0, found DEC5000, expected nothing

EISA Configuration Error
Run the EISA Configuration Utility

ed.ec.eb.....ea.e9.e8.e7.e6.e5.e4.e3.e2.e1.e0.

X4.6-8189, built on Jul 29 1996 at 03:21:03
Memory Testing and Configuration Status

32 Meg of System Memory
Bank 0 = 32 Mbytes(8 MB Per SIMM) Starting at 0x00000000
Bank 1 = No Memory Detected
Bank 2 = No Memory Detected
Bank 3 = No Memory Detected

Testing the System
Change mode to Internal loopback.
Change to Normal Operating Mode.
>>>

Table 2–3 provides a description of the power-up countdown for output to the serial console port. If the power-up display stops, use the beep codes (Table 2–1) and Table 2–3 to isolate the likely ﬁeld-replaceable unit (FRU).

Table 2–3 Console Power-Up Countdown Description and Field Replaceable Units (FRUs)

Countdown  Description                                 Likely FRU
Number

ff         Console initialization started              Non-specific/Status message
fe         Initialized idle PCB                        Non-specific/Status message
fd         Initializing semaphores                     Non-specific/Status message
fc,fb,fa   Initializing heap                           Non-specific/Status message
f9         Initializing driver structures              Non-specific/Status message
f8         Initializing idle process PID               Non-specific/Status message
f7         Initializing file system                    TOY chip (E78)
f6         Initializing timer data structures          Non-specific/Status message
f5         Lowering IPL                                Non-specific/Status message
f4         Entering idle loop                          TOY chip (E78)
ef         Start memory configuration (heap)           SIMM memory or backplane
df         Configure PCI and EISA bus                  PCI or EISA option
ee         Start phase 1 drivers: NVRAM and            NVRAM chip (E14) or PCI option
           PCICFG drivers
ed         Start phase 2 drivers: IIC bus and OCP      Non-specific/Status message
           drivers
ec         Start phase 3 drivers (console select):     Keyboard, VGA or TGA option, or
           tt serial line class, tga graphics, vga     backplane
           graphics, and keyboard drivers
eb         Run power-up memory test                    SIMM memory
ea         Start phase 4 drivers                       Non-specific/Status message
e9         Phase 4 drivers complete                    Non-specific/Status message
e8         Initialize environment variables            Non-specific/Status message
e7         Start SCSI class driver                     Backplane (on-board Qlogic 1020A)
e6         Start phase 5 drivers: I/O drivers          PCI or EISA option
e5         Restore timers                              TOY chip (E78)
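If the countdown has been captured from the serial console, the last countdown number printed can be extracted and matched against Table 2–3. The following Python sketch shows one way to do that. The names LIKELY_FRU, last_countdown, and likely_fru are hypothetical, and the sketch assumes each countdown number is printed in lowercase and followed by a period, as in the examples above. Steps that the table marks as status messages are left out of the dictionary and return None.

```python
import re

# Likely FRUs transcribed from Table 2-3. Steps listed in the table as
# "Non-specific/Status message" are omitted and return None.
LIKELY_FRU = {
    "f7": "TOY chip (E78)",
    "f4": "TOY chip (E78)",
    "ef": "SIMM memory or backplane",
    "df": "PCI or EISA option",
    "ee": "NVRAM chip (E14) or PCI option",
    "ec": "Keyboard, VGA or TGA option, or backplane",
    "eb": "SIMM memory",
    "e7": "Backplane (on-board Qlogic 1020A)",
    "e6": "PCI or EISA option",
    "e5": "TOY chip (E78)",
}

# Assumption: a countdown number is two lowercase hex digits starting
# with d, e, or f, followed by a period.
COUNTDOWN = re.compile(r"\b([d-f][0-9a-f])\.")


def last_countdown(display_text):
    """Return the last countdown number printed, or None if there is none."""
    tokens = COUNTDOWN.findall(display_text)
    return tokens[-1] if tokens else None


def likely_fru(display_text):
    """Return the likely FRU for the last countdown number, if one is listed."""
    token = last_countdown(display_text)
    return LIKELY_FRU.get(token) if token else None


if __name__ == "__main__":
    stalled = (
        "ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.ef.df.ee.f4.\n"
        "probing hose 0, PCI\n"
        "ed.ec.eb."
    )
    print(last_countdown(stalled), likely_fru(stalled))
```

The result is only a starting point; combine it with the beep codes in Table 2–1 before replacing any FRU.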

Digital UNIX or OpenVMS Systems

Digital UNIX and OpenVMS operating systems are supported by the SRM ﬁrmware (see Section 5.1.1). The SRM console prompt follows:

>>>


Windows NT for Model 4/xxx Systems

The Windows NT operating system is supported by the ARC firmware for model 4/xxx systems (see Section 5.1.1). Model 4/xxx systems using Windows NT power up to the ARC boot menu as follows:

Boot menu:

Boot Windows NT
Boot an alternate operating system...
Run a program...
Supplementary menu...

Use the arrow keys to select, then press Enter.

Windows NT for Model 5/xxx Systems

The Windows NT operating system is supported by the AlphaBIOS firmware for model 5/xxx systems (see Section 5.1.1). Model 5/xxx systems using Windows NT power up to the AlphaBIOS boot menu as follows:

Figure 2–4 AlphaBIOS Boot Menu

AlphaBIOS Version 5.11

Please select the operating system to start:

Windows NT Workstation 3.51

Use the arrow keys to move the highlight to your choice. Press Enter to choose.

Press <F2> to enter SETUP

Refer to the AlphaServer 1000/1000A Model 5/xxx Owner’s Guide Supplement for information on the AlphaBIOS ﬁrmware.


2.3.1 Console Event Log

AlphaServer 1000A systems maintain a console event log consisting of status messages received during power-on self-tests. If problems occur during power-up, standard error messages indicated by asterisks (***) may be embedded in the console event log. To display a console event log, use the more el or cat el command.

Note

To stop the screen display from scrolling, press Ctrl/S. To resume scrolling, press Ctrl/Q. You can also use the command, more el, to display the console event log one screen at a time.

The following example shows a console event log that contains two standard error messages. The ﬁrst indicates that the mouse is not plugged in or is not working, and the second indicates that SROM tests detected a bad SIMM (bank1, SIMM3).

>>> cat el
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.ef.df.ee.f4.
probing hose 0, PCI
probing PCI-to-EISA bridge, bus 1
probing PCI-to-PCI bridge, bus 2
bus 2, slot 0 -- pka -- QLogic ISP1020
bus 0, slot 11 -- ewa -- DECchip 21040-AA
ed.ec.
** mouse error **
*** Bad memory detected by serial rom ***
SROM failing Bank 1, SIMM 3
eb.....ea.e9.e8.e7.e6.e5.e4.e3.e2.e1.e0.
X4.6-10166, built on Aug 30 1996 at 16:18:06
.
.
.
>>>
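When an event log has been captured to a file, the embedded error messages can be picked out by searching for text enclosed in asterisks. The Python sketch below is one hedged way to do this. The function name find_event_log_errors is hypothetical, and the pattern assumes that each message is wrapped in two or more asterisks on a single line, as in the example above.

```python
import re

# Assumption: an error message is wrapped in runs of two or more
# asterisks, for example "** mouse error **".
ERROR_MESSAGE = re.compile(r"\*{2,}\s*(.+?)\s*\*{2,}")


def find_event_log_errors(log_text):
    """Return the error messages embedded in a console event log."""
    return ERROR_MESSAGE.findall(log_text)


if __name__ == "__main__":
    captured = (
        "ed.ec.\n"
        "** mouse error **\n"
        "*** Bad memory detected by serial rom ***\n"
        "SROM failing Bank 1, SIMM 3\n"
        "eb.....ea.e9.e8.e7.e6.e5.e4.e3.e2.e1.e0.\n"
    )
    for message in find_event_log_errors(captured):
        print(message)
```

Detail lines that follow a message, such as the failing bank and SIMM, are not wrapped in asterisks and must still be read from the log itself.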

2.4 Model 4/xxx SROM Memory Power-Up Tests

If the power-up tests or ROM-based diagnostics indicate a memory error without identifying the failing bank and SIMM position, you can match the failing address to a table using the procedure in Chapter 3, or for model 4/xxx systems, you can run speciﬁc SROM power-up tests using jumper J1 (Figure 2–5) on the CPU daughter board. The progress and results of these tests are reported on the LCD display on the operator control panel (OCP).


To thoroughly test memory and data paths, complete the SROM tests in the order presented in Table 2–4. If a SIMM is reported bad, replace the SIMM (Chapter 6) and resume testing at bank 5 (Memory Test).

Table 2–4 SROM Memory Tests, CPU Jumper J1

Bank #: 3
Test Description: Cache Test: Tests backup cache.
Test Results: Test status displays on OCP:

....done.

If the test takes longer than a few seconds to complete, there is a problem with the backup cache; replace the CPU daughter board (Chapter 6).

Bank #: 5
Test Description: Memory Test: Tests memory with backup and data cache disabled.
Test Results: Test status displays on OCP:

12345.done.

If an error is detected, the bank number and failing SIMM position are displayed. The following OCP message indicates a failing SIMM at bank 0, SIMM position 2.

FAIL B:0 S:2

Test duration: Approximately 10 seconds per 8 megabytes of memory.

Figure 2–6 shows the bank and SIMM layout for AlphaServer 1000A systems. After determining the bad SIMM, refer to Chapter 6 for instructions on replacing FRUs.

Note: The memory tests do not test the ECC SIMMs. If the operating system logs five or more single-bit correctable errors, replace the suspected ECC SIMMs with good SIMMs and repeat the memory test.

ECC SIMMs cannot be used in the standard memory banks (banks 0–3). ECC SIMMs are specialized for use only in ECC banks.

Bank #: 6
Test Description: Memory Test, Cache Enabled: Tests memory with backup and data cache enabled.
Test Results: Test status displays on OCP:

12345.done.

If an error is detected, the bank number and failing SIMM position are displayed. The following OCP message indicates a failing SIMM at bank 0, SIMM position 2.

FAIL B:0 S:2

Test duration: Approximately 2 seconds per 8 megabytes of memory.

Figure 2–6 shows the bank and SIMM layout for AlphaServer 1000A systems. After determining the bad SIMM, refer to Chapter 6 for instructions on replacing FRUs.

Note: The memory tests do not test the ECC SIMMs. If the operating system logs five or more single-bit correctable errors, replace the suspected ECC SIMMs with good SIMMs and repeat the memory test.

ECC SIMMs cannot be used in the standard memory banks (banks 0–3). ECC SIMMs are specialized for use only in ECC banks.

Bank #: 4
Test Description: Backup Cache Test: Tests backup cache alternately with data cache enabled then disabled.
Test Results: Test status displays on OCP:

d 12345.done.
D 12345.done.
D 12345.done.
d 12345.done.

If an error is detected, the bank number and failing SIMM position are displayed. The following OCP message indicates a failing SIMM at bank 0, SIMM position 2.

FAIL B:0 S:2

Test duration: Approximately 2 seconds per 8 megabytes of memory.

Figure 2–6 shows the bank and SIMM layout for AlphaServer 1000A systems. After determining the bad SIMM, refer to Chapter 6 for instructions on replacing FRUs.

Note: The memory tests do not test the ECC SIMMs. If the operating system logs five or more single-bit correctable errors, replace the suspected ECC SIMMs with good SIMMs and repeat the memory test.

ECC SIMMs cannot be used in the standard memory banks (banks 0–3). ECC SIMMs are specialized for use only in ECC banks.
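The OCP failure message and the test durations in Table 2–4 lend themselves to a small worked example. The Python sketch below decodes a message such as FAIL B:0 S:2 into a bank and SIMM position, and estimates how long the bank 5 and bank 6 memory tests should take for a given memory size. The names parse_ocp_fail and estimated_seconds are hypothetical, and the durations are the approximate per-8-megabyte figures from the table, not measured values.

```python
import re

# Matches the OCP failure message format shown in Table 2-4.
FAIL_PATTERN = re.compile(r"FAIL\s+B:(\d+)\s+S:(\d+)")


def parse_ocp_fail(message):
    """Return (bank, simm) from an OCP failure message, or None."""
    match = FAIL_PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def estimated_seconds(memory_mb, cache_enabled):
    """Estimate memory test duration from the Table 2-4 figures.

    Bank 5 (caches disabled): about 10 seconds per 8 MB.
    Bank 6 (caches enabled):  about 2 seconds per 8 MB.
    """
    seconds_per_8mb = 2 if cache_enabled else 10
    return memory_mb / 8 * seconds_per_8mb


if __name__ == "__main__":
    print(parse_ocp_fail("FAIL B:0 S:2"))
    print(estimated_seconds(32, cache_enabled=False))
    print(estimated_seconds(32, cache_enabled=True))
```

For a system with 32 MB of memory, this gives roughly 40 seconds with the caches disabled and roughly 8 seconds with them enabled. A test that runs far longer than the estimate is worth noting in the service log.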

Figure 2–5 Model 4/xxx: Jumper J1 on the CPU Daughter Board

(Jumper banks 0 through 7)

Figure 2–6 Model 4/xxx: AlphaServer 1000A Memory Layout

(Banks 0 through 3, each with SIMM positions 0 through 3; the ECC banks hold one ECC SIMM for each of banks 0 through 3.)

2.5 Mass Storage Problems Indicated at Power-Up

Mass storage failures at power-up are usually indicated by read fail messages. Other problems are indicated by storage devices missing from the show config display.

• Table 2–5 provides information for troubleshooting mass storage problems indicated at power-up or storage devices missing from the show config display.

• Table 2–6 provides troubleshooting tips for AlphaServer systems that use the RAID Array 200 Subsystem.

• Section 2.6 provides information on storage device LEDs.

Use Tables 2–5 and 2–6 to diagnose the likely cause of the problem.

Table 2–5 Mass Storage Problems

Problem: Drive failure
Symptom: Fault LED for drive is on (steady) (Section 2.6).
Corrective Action: Replace drive.

Problem: Duplicate SCSI IDs
Symptom: Drives with duplicate SCSI IDs are missing from the show config display.
Corrective Action: Correct SCSI IDs. May need to reconfigure internal StorageWorks backplane (Section 5.8).

Problem: SCSI ID set to 7 (reserved for host ID)
Symptom: Valid drives are missing from the show config display. One drive may appear seven times on the show config display.
Corrective Action: Correct SCSI IDs.

Problem: Duplicate host IDs on a shared bus
Symptom: Valid drives are missing from the show config display. One drive may appear seven times on the show config display.
Corrective Action: Change host ID through the pk*0_host_id environment variable (set pk*0_host_id) for systems running OpenVMS or Digital UNIX (SRM console). For systems running Windows NT (ARC console), choose ‘‘Set default configuration’’ in the Setup Menu.

Problem: Missing or loose cables. Drives not properly seated on StorageWorks shelf
Symptom: Activity LEDs do not come on. Drive missing from the show config display.
Corrective Action: Remove device and inspect cable connections. Reseat drive on StorageWorks shelf.

Problem: SCSI bus length exceeded
Symptom: Drives may disappear intermittently from the show config and show device displays.
Corrective Action: A SCSI bus extended to the internal StorageWorks shelf with the backplane configured as a single bus cannot be extended outside of the enclosure.

A SCSI bus extended to the internal StorageWorks shelf with the backplane configured as a dual bus can be extended 1 meter outside of the enclosure.

The entire SCSI bus length, from terminator to terminator, must not exceed 6 meters for single-ended SCSI-2 at 5 MB/sec, or 3 meters for single-ended SCSI-2 at 10 MB/sec.

Problem: Terminator missing or wrong terminator used
Symptom: Read/write errors in the console event log; storage adapter port may fail. If the bulkhead terminator for the removable-media bus is missing, removable media devices may not be recognized by the system and may be missing from the show config and show device displays.
Corrective Action: Attach appropriate terminators as needed (external SCSI terminator for use with the RAID Array 200 Subsystem, 12-41667-04 (68-pin), 17-04166-02 (50-pin); external SCSI terminator for removable-bus, 12-41667-05).

Note: The SCSI terminator jumper (J51) on the system motherboard should be set to ‘‘on’’ to enable the onboard SCSI termination.

Problem: Extra terminator
Symptom: Devices produce errors or device IDs are dropped.
Corrective Action: Check that bus is terminated only at beginning and end. Remove unnecessary terminators.

Note: The SCSI terminator jumper (J51) on the system motherboard should be set to ‘‘on’’ to enable the onboard SCSI termination.

Problem: SCSI storage controller failure
Symptom: Problems persist after eliminating the problem sources.
Corrective Action: Replace failing EISA or PCI storage adapter module (or motherboard for the native SCSI controller).
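Several of the configuration rules in Table 2–5 can be checked on paper before a bus is cabled: no duplicate SCSI IDs, no drive on ID 7, and a total bus length within the limit for the transfer rate. The Python sketch below expresses those three checks. The function name check_scsi_bus and its parameters are hypothetical. It covers only the two single-ended SCSI-2 rates that the table lists, and any other rate raises a KeyError.

```python
HOST_ID = 7  # Table 2-5: SCSI ID 7 is reserved for the host

# Maximum terminator-to-terminator length in meters for single-ended
# SCSI-2, keyed by transfer rate in MB/sec (from Table 2-5).
MAX_BUS_METERS = {5: 6.0, 10: 3.0}


def check_scsi_bus(drive_ids, bus_length_m, rate_mb_per_sec):
    """Return a list of rule violations for a planned SCSI bus."""
    problems = []
    seen = set()
    for scsi_id in drive_ids:
        if scsi_id == HOST_ID:
            problems.append(f"drive uses SCSI ID {HOST_ID}, reserved for the host")
        if scsi_id in seen:
            problems.append(f"duplicate SCSI ID {scsi_id}")
        seen.add(scsi_id)
    limit = MAX_BUS_METERS[rate_mb_per_sec]
    if bus_length_m > limit:
        problems.append(
            f"bus length {bus_length_m} m exceeds {limit} m limit "
            f"at {rate_mb_per_sec} MB/sec"
        )
    return problems


if __name__ == "__main__":
    for problem in check_scsi_bus([0, 0, 7], bus_length_m=4.0, rate_mb_per_sec=10):
        print(problem)
```

An empty list means only that these three rules are satisfied; termination and cabling still need to be inspected as described in the table.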

Table 2–6 provides troubleshooting hints for AlphaServer 1000A systems that have the StorageWorks RAID Array 200 Subsystem. The RAID subsystem includes either the KZESC-xx (SWXCR-Ex) or the KZPSC-xx (SWXCR-Px) PCI backplane RAID controller.

Table 2–6 Troubleshooting RAID Problems

Symptom: Some RAID drives do not appear on the show device d display.
Action: Valid configured RAID logical drives will appear as DRA0–DRAn, not as DKn. Configure the drives by running the RAID Configuration Utility (RCU), following the instructions in the StorageWorks RAID Array 200 Subsystems Controller Installation and Standalone Configuration Utility User’s Guide, EK-SWRA2-IG. Reminder: several physical disks can be grouped as a single logical DRAn device.

External SCSI terminators used with the SWXCR controller must be of the following type: 12-41667-04 (68-pin); 17-41667-02 (50-pin).

Symptom: Drives on the SWXCR controller power up with the amber Fault light on.
Action: Whenever you move drives onto or off of the controller, run the RAID Configuration Utility to set up the drives and logical units. Follow the instructions in the StorageWorks RAID Array 200 Subsystems Controller Installation and Standalone Configuration Utility User’s Guide.

External SCSI terminators used with the SWXCR controller must be of the following type: 12-41667-04 (68-pin); 17-41667-02 (50-pin).

Symptom: Cannot access disks connected to the RAID subsystem on Windows NT systems.
Action: On Windows NT systems, disks connected to the controller must be spun up before they can be accessed. While running the ECU, verify that the controller is set to spin up two disks every six seconds. This is the default setting if you are using the default configuration files for the controller. If the settings are different, adjust them as needed.

2.6 Storage Device LEDs

Storage device LEDs indicate the status of the device.

• Figure 2–7 shows the LEDs for disk drives contained in a StorageWorks shelf. A failure is indicated by the Fault light on each drive.

• Figure 2–8 shows the Activity LED for the ﬂoppy drive. This LED is on when the drive is in use.


• Figure 2–9 shows the Activity LED for the CD–ROM drive. This LED is on when the drive is in use.

For information on other storage devices, refer to the documentation provided by the manufacturer or vendor.

Figure 2–7 StorageWorks Disk Drive LEDs (SCSI)

(Callouts: Activity LED, Fault LED)

Figure 2–8 Floppy Drive Activity LED

Figure 2–9 CD–ROM Drive Activity LED

2.7 EISA Bus Problems Indicated at Power-Up

EISA bus failures at power-up are usually indicated by the following messages displayed during power-up:

EISA Configuration Error. Run the EISA Configuration Utility.

Run the EISA Configuration Utility (ECU) (Section 5.4) when this message is displayed. Other EISA bus problems are indicated by the absence of EISA devices from the show config display.

Table 2–7 provides steps for troubleshooting EISA bus problems that persist after you run the ECU.

Table 2–7 EISA Troubleshooting

Step  Action

1     Confirm that the EISA module and any cabling are properly seated.

2     Run the ECU to:
      • Confirm that the system has been configured with the most recently installed controller.
      • See what the hardware jumper and switch setting should be for each ISA controller.
      • See what the software setting should be for each ISA and EISA controller.
      • See if the ECU deactivated (<>) any controllers to prevent conflict.
      • See if any controllers are locked (!), which limits the ECU’s ability to change resource assignments.

3     Confirm that the hardware jumpers and switches on ISA controllers reflect the settings indicated by the ECU. Start with the last ISA module installed.

4     Run ROM-based diagnostics for the type of option:
      • Storage adapter: Run test to exercise the storage devices off the EISA controller option (Section 3.3.1).
      • Ethernet adapter: Run netew or network to exercise an Ethernet adapter (Section 3.3.6, Section 3.3.7).

5     Check for a bad slot by moving the last installed controller to a different slot.

6     Call the option manufacturer or support for help.

2.7.1 Additional EISA Troubleshooting Tips

The following tips can aid in isolating EISA bus problems.

• Peripheral device controllers need to be seated (inserted) carefully, but ﬁrmly, into their slots to make all necessary contacts. Improper seating is a common source of problems for EISA modules.

• Be sure you run the correct version of the ECU for the operating system. For Windows NT, use ECU diskette DECpc AXP (AK-PYCJ*-CA); for Digital UNIX and OpenVMS, use ECU diskette DECpc AXP (AK-Q2CR*-CA).

• The CFG ﬁles supplied with the option you want to install may not work on AlphaServer 1000A systems. Some CFG ﬁles call overlay ﬁles that are not required on this system or may reference inappropriate system resources, for example, BIOS addresses. Contact the option vendor to obtain the proper CFG ﬁle.

• Peripherals cannot share direct memory access (DMA) channels. Assignment of more than one peripheral to the same DMA channel can cause unpredictable results or even loss of function of the EISA module.

• Not all EISA products work together. EISA is an open standard, and not every EISA product or combination of products can be tested. Violations of speciﬁcations may matter in some conﬁgurations, but not in others.

Manufacturers of EISA options often test the most common combinations and may have a list of ISA and EISA options that do not function in combination with particular systems. Be sure to check the documentation or contact the option vendor for the most up-to-date information.

• EISA systems will not function unless they are ﬁrst conﬁgured using the ECU.

• The ECU will not notify you if the configuration program diskette is write-protected when it attempts to write the system configuration file (system.sci) to the diskette.
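The DMA rule in the tips above lends itself to a quick check before an option is installed. The following Python sketch is illustrative only: the option names and channel numbers are hypothetical and are not read from any CFG file or ECU output.

```python
from collections import defaultdict


def dma_conflicts(assignments):
    """Return {channel: [options]} for every DMA channel assigned more than once.

    assignments maps an option name to its DMA channel number
    (both are hypothetical values supplied by the person planning the install).
    """
    by_channel = defaultdict(list)
    for option, channel in assignments.items():
        by_channel[channel].append(option)
    return {ch: sorted(opts) for ch, opts in by_channel.items() if len(opts) > 1}


# Hypothetical planning data: two options were both given channel 5.
EXAMPLE = {"scsi_adapter": 5, "sound_card": 5, "floppy": 2}
print(dma_conflicts(EXAMPLE))
```

An empty result means that no two options in the planned configuration share a channel.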

2.8 PCI Bus Problems Indicated at Power-Up

PCI bus failures at power-up are usually indicated by the inability of the system to see the device. Table 2–8 provides steps for troubleshooting PCI bus problems. Use the table to diagnose the likely cause of the problem.

Note

Some PCI devices do not implement PCI parity, and some have a parity-generating scheme in which parity is sometimes incorrect or is not compliant with the PCI Specification. In such cases, the device functions properly as long as parity is not checked. The pci_parity environment variable for the SRM console, or the ENABLEPCIPARITYCHECKING environment variable for the ARC console, allows you to turn off parity checking so that false PCI parity errors do not result in machine check errors.

When you disable PCI parity, no parity checking is implemented for any PCI device, even those devices that produce correct, compliant parity.

Table 2–8 PCI Troubleshooting

Step Action

1 Confirm that the PCI module and any cabling are properly seated.

2 Run ROM-based diagnostics for the type of option:

• Storage adapter: Run test to exercise the storage devices off the PCI controller option (Section 3.3.1).

• Ethernet adapter: Run netew or network to exercise an Ethernet adapter (Section 3.3.6, Section 3.3.7).

3 Check for a bad slot by moving the last installed controller to a different slot.

4 Call the option manufacturer or support for help.

2.8.1 Additional PCI Troubleshooting Tips

Some PCI options are restricted to the primary PCI bus, slots 11, 12, and 13. Refer to the following documents for restrictions on speciﬁc PCI options:

• AlphaServer 1000A READ THIS FIRST—shipped with the system.

• AlphaServer 1000A Supported Options List—The options list is available from the Internet at the following locations:

ftp://ftp.digital.com/pub/DEC/Alpha/systems/
http://www.service.digital.com/alpha/server/

2.9 Fail-Safe Loader

The fail-safe loader (FSL) is a redundant or backup ROM that allows you to power up without running power-up diagnostics and load new SRM/ARC or SRM/AlphaBIOS and FSL console ﬁrmware from the ﬁrmware diskette.

Note

The fail-safe loader should be used only when a failure at power-up prohibits you from getting to the console program. You cannot boot an operating system from the fail-safe loader.

If a checksum error is detected when the SRM/ARC or SRM/AlphaBIOS console is loading at power-up (error beep code 1-1-4), you need to activate the fail-safe loader and reinstall the ﬁrmware.

The fail-safe loader (FSL) allows you to attempt to recover when one of the following is the cause of a problem getting to the console program under normal power-up:

• A hardware or power failure, or an accidental power-down, occurred during a firmware upgrade.

• A conﬁguration error, such as an incorrect environment variable setting or an inappropriate nvram script.

• A driver error at power-up.

• A checksum error is detected when the SRM console is loading at power-up (corrupted ﬁrmware).

The fail-safe loader program is also available on diskette.

2.9.1 Fail-Safe Loader Functions

From the FSL program, you can update or load new SRM/ARC or SRM/AlphaBIOS console firmware and FSL console firmware.

Note

When installing new console ﬁrmware, the ﬂash ROM VPP enable jumper (J50) on the motherboard must be enabled.


2.9.2 Activating the Fail-Safe Loader

To activate the FSL:

1. Install the jumper at bank 7 of the J1 or J4 jumper on the CPU daughter board. The jumper is normally installed in the standard boot setting (bank 1 for AlphaServer 1000A Model 4/xxx systems, bank 0 or 1 for Model 5/xxx systems). Refer to Figures 2–10 through 2–12.

2. Install the console firmware diskette and turn on the system. Two messages are displayed on the operator control panel (OCP) when the FSL program loads the diskette:

OCP Message     Meaning

Floppy Boot     FSL firmware is executing.

Starting CPU    FSL firmware found a valid boot block, loaded the program into memory, and is attempting to transfer control to the loaded program.

3. Reinstall the console ﬁrmware from a ﬁrmware diskette.

4. When you have ﬁnished, power down and return the J1 or J4 jumper to the standard boot setting (bank 1).


Figure 2–10 Model 4/xxx: Jumper J1 on the CPU Daughter Board

[Figure: jumper J1 on the CPU daughter board, showing banks 0 through 7]

Figure 2–11 Model 5/xxx Systems: Jumper J4 on the CPU Daughter Board

[Figure: jumper J4 on the CPU daughter board, showing banks 0 through 7]

Bank  Jumper Setting

5     Mini-console setting: Internal use only
6     Mini-console setting: Internal use only
7     Fail-Safe Loader setting: selects fail-safe loader firmware

despite bad Bcache until a replacement daughter board is available


Figure 2–12 Model 5/xxx Systems: Jumper J1 on the CPU Daughter Board

[Figure: jumper J1 on the CPU daughter board, showing banks 0 through 7]

Bank  Jumper Setting

5     Mini-console setting: Internal use only
6     Mini-console setting: Internal use only
7     Fail-Safe Loader setting: selects fail-safe loader firmware

despite bad Bcache until a replacement daughter board is available


2.10 Power-Up Sequence

During the AlphaServer 1000A power-up sequence, the power supplies are stabilized and the system is initialized and tested through the ﬁrmware power-on self-tests.

The power-up sequence includes the following:

• Power supply power-up:
  – AC power-up
  – DC power-up

• Two sets of power-on diagnostics:
  – Serial ROM diagnostics
  – Console firmware-based diagnostics

Caution

The AlphaServer 1000A enclosure will not power up if the top cover is not securely attached. Removing the top cover will cause the system to shut down.

2.10.1 AC Power-Up Sequence

The following power-up sequence occurs when AC power is applied to the system (system is plugged in) or when electricity is restored after a power outage:

1. The front end of the power supply begins operation and energizes.

2. The power supply then waits for the DC power to be enabled.

Note

The top cover and side panels must be securely installed. A safety interlock prevents the system from being powered on with the cover and panels removed.


2.10.2 DC Power-Up Sequence

DC power is applied to the system with the DC On/Off button on the operator control panel.

A summary of the DC power-up sequence follows:

1. When the DC On/Off button is pressed, the power supply checks for a POK_H condition.

2. 12V, 5V, 3.3V, and -12V outputs are energized and stabilized. If the outputs do not come into regulation, the power-up is aborted and the power supply enters the latching-shutdown mode.

2.11 Firmware Power-Up Diagnostics

After successful completion of AC and DC power-up sequences, the processor performs its power-up diagnostics. These tests verify system operation, load the system console, and test the core system (CPU, memory, and motherboard), including all boot path devices. These tests are performed as two distinct sets of diagnostics:

1. Serial ROM diagnostics—These tests are loaded from the serial ROM located on the CPU daughter board into the CPU’s instruction cache (I-cache). The tests check the basic functionality of the system and load the console code from the FEPROM on the motherboard into system memory.

Failures during these tests are indicated by audible error beep codes (Table 2–1), the console event log (Section 2.3.1), and for Model 5/xxx systems, OCP error codes (Section 2.2).

Failures of customized SROM tests for Model 4/xxx systems (Section 2.4), set using the J1 jumper on the CPU daughter board, are displayed on the operator control panel.

2. Console ﬁrmware-based diagnostics—These tests are executed by the console code. They test the core system, including all boot path devices.

Failures during these tests are reported to the console terminal through the power-up screen or console event log.


2.11.1 Serial ROM Diagnostics

The serial ROM diagnostics are loaded into the CPU’s instruction cache from the serial ROM on the CPU daughter board. The diagnostics test the system in the following order:

1. Test the CPU and backup cache located on the CPU daughter board.

2. Test the CPU module’s system bus interface.

3. Test the system bus to PCI bus bridge and system bus to EISA bus bridge. If the PCI bridge fails or EISA bridge fails, an audible error beep code (3-3-1) sounds (Table 2–1). The power-up tests continue despite these errors.

4. Test the PCI-to-PCI bus bridge. If the bridge fails, an error beep code (3-3-2) sounds.

5. Test the native SCSI controller. If the controller fails, an error beep code (3-1-2) sounds.

6. Configure the memory in the system and test only the first 16 MB of memory. If the memory test fails, the failing bank is mapped out and memory is reconfigured and re-tested. Testing continues until good memory is found. If good memory is not found, an error beep code (1-3-3) is generated and the power-up tests are terminated.

7. Check the data path to the FEPROM on the motherboard.

8. The console program is loaded into memory from the FEPROM on the motherboard. A checksum test is executed for the console image. If the checksum test fails, an error beep code (1-1-4) is generated, and the power-up tests are terminated.

If the checksum test passes, control is passed to the console code, and the console ﬁrmware-based diagnostics are run.
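The beep codes named in the steps above can be collected into a small lookup for use at the bench. This Python sketch covers only the codes mentioned in this section; Table 2–1 remains the full reference. The function and variable names are illustrative, and where this section does not say whether testing continues, the sketch says so instead of guessing.

```python
# Beep codes named in the serial ROM test sequence.
# Third field: True = power-up tests terminate, False = tests continue,
# None = this section does not state the outcome.
SROM_BEEP_CODES = {
    "3-3-1": ("System bus to PCI bus bridge or EISA bus bridge failed", False),
    "3-3-2": ("PCI-to-PCI bus bridge failed", None),
    "3-1-2": ("Native SCSI controller failed", None),
    "1-3-3": ("No good memory found", True),
    "1-1-4": ("Console image checksum test failed", True),
}


def describe_beep_code(code):
    """Return a one-line description for a beep code such as '1-1-4'."""
    entry = SROM_BEEP_CODES.get(code.strip())
    if entry is None:
        return f"{code}: not listed in this sketch; see Table 2-1"
    meaning, terminates = entry
    if terminates is True:
        outcome = "power-up tests terminate"
    elif terminates is False:
        outcome = "power-up tests continue"
    else:
        outcome = "outcome not stated in this section"
    return f"{code}: {meaning} ({outcome})"


print(describe_beep_code("1-1-4"))
```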

2.11.2 Console Firmware-Based Diagnostics

Console ﬁrmware-based tests are executed once control is passed to the console code in memory. They check the system in the following order:

1. Perform a complete check of system memory. Steps 2–5 may be completed in parallel.

2. Start the I/O drivers for mass storage devices and tapes. At this time a complete functional check of the machine is made. After the I/O drivers are started, the console program continuously polls the bus for devices (approximately every 20 or 30 seconds).


3. Check that EISA conﬁguration information is present in NVRAM for each EISA module detected and that no information is present for modules that have been removed.

4. Run exercisers on the drives currently seen by the system.

Note

This step does not ensure that all disks in the system will be tested or that any device drivers will be completely tested. Spin-up time varies for different drives, so not all disks may be on line at this point in the power-up sequence. To ensure complete testing of disk devices, use the test command (Section 3.3.1).

5. Enter console mode or boot the operating system. This action is determined by the auto_action environment variable.

If the os_type environment variable is set to NT, the ARC (Model 4/xxx systems) or AlphaBIOS (Model 5/xxx systems) console is loaded into memory, and control is passed to the ARC or AlphaBIOS console.
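The decision in step 5 can be summarized as a small function. This is a sketch under stated assumptions, not a description of the firmware: it handles only the two auto_action values that appear in this guide (halt and boot), it assumes the os_type check takes precedence, and the model labels "4" and "5" are hypothetical shorthand for Model 4/xxx and Model 5/xxx systems.

```python
def power_up_outcome(model, os_type, auto_action):
    """Summarize where control goes after the console diagnostics finish.

    model       -- "4" or "5" (hypothetical labels for Model 4/xxx and 5/xxx)
    os_type     -- value of the os_type environment variable
    auto_action -- value of the auto_action environment variable
    """
    if model not in ("4", "5"):
        raise ValueError("model must be '4' or '5'")
    # Assumption: an os_type of NT is checked first.
    if os_type.upper() == "NT":
        console = "ARC" if model == "4" else "AlphaBIOS"
        return f"load the {console} console and pass control to it"
    action = auto_action.lower()
    if action == "halt":
        return "enter console mode"
    if action == "boot":
        return "boot the operating system"
    # Other auto_action values are outside the scope of this sketch.
    return "not covered by this sketch"


print(power_up_outcome("5", "NT", "boot"))
```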


3 Running System Diagnostics

This chapter provides information on how to run system diagnostics.

• Section 3.1 describes how to run ROM-based diagnostics, including error reporting utilities and loopback tests.

• Section 3.4 describes acceptance testing and initialization procedures.

• Section 3.5 describes the DEC VET operating system exerciser.

3.1 Running ROM-Based Diagnostics

ROM-based diagnostics (RBDs), which are part of the console ﬁrmware that is loaded from the FEPROM on the system motherboard, offer many powerful diagnostic utilities, including the ability to examine error logs from the console environment and run system- or device-speciﬁc exercisers.

AlphaServer 1000A RBDs rely on exerciser modules, rather than functional tests, to isolate errors. The exercisers are designed to run concurrently, providing a maximum bus interaction between the console drivers and the target devices.

The multitasking ability of the console ﬁrmware allows you to run diagnostics in the background (using the background operator ‘‘&’’ at the end of the command). You run RBDs by using console commands.

Note

ROM-based diagnostics, including the test command, are run from the SRM console (firmware used by OpenVMS and Digital UNIX operating systems). If you are running a Windows NT system, refer to Section 5.1.2 for the steps used to switch between consoles.

RBDs report errors to the console terminal and/or the console event log.


3.2 Command Summary

Table 3–1 provides a summary of the diagnostic and related commands.

Table 3–1 Summary of Diagnostic and Related Commands

Command       Function       Reference

Acceptance Testing

test          Quickly tests the core system. The test command is the primary diagnostic for acceptance testing and console environment diagnosis. For Model 4/xxx systems, the tests are run concurrently and indefinitely. For Model 5/xxx systems, the test command runs one pass of the tests. To run tests concurrently and indefinitely on Model 5/xxx systems, use the sys_exer command.  Section 3.3.1

Error Reporting

cat el        Displays the console event log.  Section 3.3.3
more el       Displays the console event log one screen at a time.  Section 3.3.3

Extended Testing/Troubleshooting

memexer       Exercises memory by running a specified number of memory tests on Model 5/xxx systems. The tests are run in the background.  Section 3.3.4
memory        Runs memory exercises each time the command is entered. These exercises run concurrently in the background.  Section 3.3.5
net -ic       Initializes the MOP counters for the specified Ethernet port.  Section 3.3.9
net -s        Displays the MOP counters for the specified Ethernet port.  Section 3.3.8
netew         Runs external MOP loopback tests for specified EISA- or PCI-based ew* (DECchip 21040, TULIP) Ethernet ports.  Section 3.3.6
sys_exer      Exercises core system for Model 5/xxx systems. Runs tests concurrently.  Section 3.3.2

Loopback Testing

netew         Runs external MOP loopback tests for specified EISA- or PCI-based ew* (DECchip 21040, TULIP) Ethernet ports.  Section 3.3.6
sys_exer -lb  Conducts loopback tests for COM2 and the parallel port in addition to core system tests for Model 5/xxx systems.  Section 3.3.2
test -lb      Conducts loopback tests for COM2 and the parallel port in addition to quick core system tests.  Section 3.3.1

Diagnostic-Related Commands

kill          Terminates a specified process.  Section 3.3.10
kill_diags    Terminates all currently executing diagnostics.  Section 3.3.10
show_status   Reports the status of currently executing test/exercisers.  Section 3.3.11

3.3 Command Reference

This section provides detailed information on the diagnostic commands and related commands.


3.3.1 test

The test command runs firmware diagnostics for the entire core system. The tests are run concurrently in the background. Fatal errors are reported to the console terminal.

The cat el command should be used in conjunction with the test command to examine test/error information reported to the console event log.

For Model 4/xxx systems, the tests are run concurrently and indefinitely (until you stop them with the kill_diags command). These tests are useful in flushing out intermittent hardware problems.

For Model 5/xxx systems, the test command runs one pass of the tests. To run tests concurrently and indefinitely on Model 5/xxx systems, use the sys_exer command.

By default, no write tests are performed on disk and tape drives. Media must be installed to test the floppy drive and tape drives. A loopback connector is required for the COM2 (9-pin loopback connector, 12-27351-01) port.

The test command does not test the DNSES, TGA card, reflective memory option, nor third party options.

Note

When using the test command after shutting down an operating system, you must initialize the system to a quiescent state. Enter the following commands at the SRM console:

>>> set auto_action halt
>>> init
...
>>> test

After testing is completed, set the auto_action environment variable to its previous value (usually, boot) and use the Reset button to reset the system.

To terminate the tests, use the kill command to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when terminating an individual diagnostic test.

Note

A serial loopback connector (12-27351-01) must be installed on the COM2 serial port for the kill_diags command to successfully terminate system tests.

The test script tests devices in the following order:

1. Console loopback tests if lb argument is specified: COM2 serial port and parallel port.

2. Network external loopback tests for E*A0. This test requires that the Ethernet port be terminated or connected to a live network; otherwise, the test will fail.

3. Memory tests (one pass).

4. Read-only tests: DK* disks, DR* disks, DU* disks, MK* tapes, DV* floppy.

5. VGA console tests. These tests are run only if the console environment variable is set to "serial." The VGA console test displays rows of the word "digital".

Synopsis:

test [lb]

Argument:

[lb] The loopback option includes console loopback tests for the COM2 serial port and the parallel port during the test sequence.

Examples:

In the following example, a Model 4/xxx system is tested and the tests complete successfully.

Note

Examine the console event log after running tests.


>>> test
Requires diskette and loopback connectors on COM2 and parallel port
type kill_diags to halt testing
type show_status to display testing progress
type cat el to redisplay recent errors
Testing COM2 port
Setting up network test, this will take about 20 seconds
Testing the network

48 Meg of System Memory
Bank 0 = 16 Mbytes(4 MB Per Simm) Starting at 0x00000000
Bank 1 = 16 Mbytes(4 MB Per Simm) Starting at 0x01000000
Bank 2 = 16 Mbytes(4 MB Per Simm) Starting at 0x02000000
Bank 3 = No Memory Detected

Testing the memory
Testing parallel port
Testing the SCSI Disks
Non-destructive Test of the Floppy started
dka400.4.0.6.0 has no media present or is disabled via the RUN/STOP switch
file open failed for dka400.4.0.6.0
Testing the VGA(Alphanumeric Mode only)
Printer offline
file open failed for para

>>> show_status

ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ------------
00000001 idle         system       0      0    0    0             0
0000002d exer_kid     tta1         0      0    0    1             0
0000003d nettest      era0.0.0.2.1 43     0    0    1376          1376
00000045 memtest      memory       7      0    0    424673280     424673280
00000052 exer_kid     dka100.1.0.6 0      0    0    0             2688512
00000053 exer_kid     dka200.2.0.6 0      0    0    0             922624
>>> kill_diags
>>>

In the following example, the system is tested and the system reports a fatal error message. No network server responded to a loopback message. Ethernet connectivity on this system should be checked.

*** Error (era0), Mop loop message timed out from: 08-00-2b-3b-42-fd
*** List index: 7 received count: 0 expected count 2

>>>

In the following example, a Model 5/xxx system is tested and tests terminate after successfully completing one pass of the diagnostics.

Note

Examine the console event log after running tests.

>>> test
Testing the Memory
Testing the DK* Disks(read only)
No DU* Disks available for testing
No DR* Disks available for testing
No MK* Tapes available for testing
No MU* Tapes available for testing
Testing the DV* Floppy Disks(read only)
Testing the VGA (Alphanumeric Mode only)
Testing the EWA0 Network
Testing the EWB0 Network
>>>


3.3.2 sys_exer

The sys_exer command runs firmware diagnostics for the entire core system for Model 5/xxx systems. The same tests that are run using the test command are run with sys_exer, only these tests are run concurrently and in the background. Nothing is displayed unless an error occurs.

Note

The diagnostics started by the sys_exer command require extra memory resources. The init command must be used to reconfigure memory before booting an operating system.

Because the sys_exer tests are run concurrently and indefinitely (until you stop them with the init command), they are useful in flushing out intermittent hardware problems.

By default, no write tests are performed on disk and tape drives. Media must be installed to test the floppy drive and tape drives.

Note

Certain memory errors that are reported by the OCP may not be reported by the ROM-based diagnostics. Always check the power-up/diagnostic display before running diagnostic commands.

Synopsis:

sys_exer [lb]

Arguments:

[lb] The loopback option includes console loopback tests for the COM2 serial port and the parallel port during the test sequence.

Examples:

>>> sys_exer
Default zone extended at the expense of memzone.
Use INIT before booting
Exercising the Memory
Exercising the DK* Disks(read only)
Exercising the Floppy(read only)
Testing the VGA (Alphanumeric Mode only)
Exercising the EWA0 Network
Exercising the EWB0 Network

Type "init" in order to boot the operating system
Type "show_status" to display testing progress
Type "cat el" to redisplay recent errors

>>> show_status

ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ------------
00000001 idle         system       0      0    0    0             0
0000550b memtest      memory       193    0    0    7243563008    7243563008
00005514 memtest      memory       192    0    0    7222591488    7222591488
0000551d exer_kid     dka100.1.0.2 0      0    0    0             2461184
0000551e exer_kid     dka400.4.0.2 0      0    0    0             2460672
00005533 exer_kid     dva0.0.0.100 0      0    0    0             2311168
00005608 nettest      ewa0.0.0.200 1131   0    1    12160512      12159632
00005746 nettest      ewb0.0.0.13. 1127   0    2    12116624      12115280
>>> init
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.ef.df.ee.f4.
.
.
.
>>>
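When a show_status display is captured from a console log, the rows with nonzero hard or soft error counts are the ones worth a closer look. The following Python sketch assumes the eight-column layout shown in the examples in this chapter (ID, Program, Device, Pass, Hard, Soft, Bytes Written, Bytes Read) and whitespace-separated fields; the parser and its names are illustrative and are not part of the console firmware.

```python
from typing import NamedTuple


class StatusRow(NamedTuple):
    pid: str
    program: str
    device: str
    passes: int
    hard: int
    soft: int
    bytes_written: int
    bytes_read: int


def parse_show_status(text):
    """Parse captured show_status text into StatusRow records."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have eight fields and start with an 8-digit hex process ID.
        if len(fields) != 8 or len(fields[0]) != 8:
            continue
        try:
            int(fields[0], 16)
            numbers = [int(f) for f in fields[3:]]
        except ValueError:
            continue  # header, ruler, or prompt line
        rows.append(StatusRow(fields[0], fields[1], fields[2], *numbers))
    return rows


SAMPLE = """\
ID Program Device Pass Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ------------
00000001 idle system 0 0 0 0 0
0000550b memtest memory 193 0 0 7243563008 7243563008
00005608 nettest ewa0.0.0.200 1131 0 1 12160512 12159632
00005746 nettest ewb0.0.0.13. 1127 0 2 12116624 12115280
"""

rows = parse_show_status(SAMPLE)
with_errors = [r for r in rows if r.hard or r.soft]
for r in with_errors:
    print(f"{r.program} on {r.device}: hard={r.hard} soft={r.soft} (process ID {r.pid})")
```

In the sys_exer sample, this flags the two nettest processes, which report soft errors; the console event log (cat el) is the place to look for the corresponding messages.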


3.3.3 cat el and more el

The cat el and more el commands display the current contents of the console event log. Status and error messages (if problems occur) are logged to the console event log at power-up, during normal system operation, and while running system tests.

Standard error messages are indicated by asterisks (***).

When cat el is used, the contents of the console event log scroll by. You can use the Ctrl/S combination to stop the screen from scrolling and Ctrl/Q to resume scrolling.

The more el command allows you to view the console event log one screen at a time.

Synopsis:

cat el or more el

Examples:

The following example shows an abbreviated console event log that contains a standard error message:

The error message indicates the keyboard is not plugged in or is not working.

>>> cat el
*** keyboard not plugged in...
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.
ef.df.ee.f4.ed.ec.eb.ea.e9.e8.e7.e6.port pka0.7.0.6.0 initialized, scripts are at 4f7faa0
resetting the SCSI bus on pka0.7.0.6.0
port pkb0.7.0.12.0 initialized, scripts are at 4f82be0
resetting the SCSI bus on pkb0.7.0.12.0
e5.e4.e3.e2.e1.e0.
V1.1-1, built on Nov 4 1994 at 16:44:07
device dka400.4.0.6.0 (RRD43) found on pka0.4.0.6.0
>>>
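Because standard error messages are marked with asterisks, a captured event log can be filtered for them mechanically. The Python sketch below assumes the log has been saved as text from a terminal session; the function name is illustrative.

```python
def standard_errors(event_log_text):
    """Return the lines of a captured event log that begin with '***'."""
    return [line.strip()
            for line in event_log_text.splitlines()
            if line.lstrip().startswith("***")]


LOG = """\
*** keyboard not plugged in...
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.
port pka0.7.0.6.0 initialized, scripts are at 4f7faa0
device dka400.4.0.6.0 (RRD43) found on pka0.4.0.6.0
"""

print(standard_errors(LOG))
```

Note that the filter also matches trailer lines such as "*** End of Error ***", which appear in the exerciser error reports shown later in this chapter.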


3.3.4 memexer

The memexer command tests memory by running a specified number of memory exercisers for Model 5/xxx systems. The exercisers are run in the background and nothing is displayed unless an error occurs. On each pass, each exerciser tests all available memory in blocks twice the size of the backup cache.

To terminate the memory tests, use the kill command to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when terminating an individual diagnostic test.

Synopsis:

memexer [number]

Arguments:

[number] Number of memory exercisers to start. The default is 1.

The number of exercisers, as well as the length of time for testing, depends on the context of the testing. Generally, running three to five exercisers for 15 minutes to 1 hour is sufficient for troubleshooting most memory problems.

Examples:

Example with no errors.

>>> memexer 4
>>> show_status

ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ------------
00000001 idle         system       0      0    0    0             0
000000c7 memtest      memory       3      0    0    635651584     62565154
000000cc memtest      memory       2      0    0    635651584     62565154
000000d0 memtest      memory       2      0    0    635651584     62565154
000000d1 memtest      memory       3      0    0    635651584     62565154
>>> kill_diags
>>>

The following is an example with a memory compare error indicating bad SIMMs. In most cases, the failing bank and SIMM position (Figures 3–1 and 3–2) are speciﬁed in the error message. If the failing SIMM information is not provided, use the procedure in Section 3.3.5 to isolate a failing SIMM.

>>> memexer 3

*** Hard Error - Error #41 - Memory compare error

Diagnostic Name  ID        Device  Pass  Test  Hard/Soft  11-JUN-1996
memtest          00000193  brd0    114   1     1    0     12:00:01
Expected value: 25c07
Received value: 35c07
Failing addr: a11848

*** End of Error ***
>>> kill_diags
>>>
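The three values at the end of a memory compare error report (expected value, received value, failing address) are the inputs to SIMM isolation. The following Python sketch pulls them out of captured report text and lists the bit positions that differ. The field labels are taken from the example above; the function name and return structure are illustrative.

```python
import re


def parse_compare_error(report):
    """Extract expected, received, and failing address (all hex) from a report."""
    fields = {}
    for key, label in (("expected", "Expected value"),
                       ("received", "Received value"),
                       ("address", "Failing addr")):
        match = re.search(rf"{label}:\s*([0-9a-fA-F]+)", report)
        if match is None:
            raise ValueError(f"'{label}' not found in report")
        fields[key] = int(match.group(1), 16)
    # XOR leaves a 1 in every bit position where the data miscompared.
    diff = fields["expected"] ^ fields["received"]
    fields["bad_bits"] = [bit for bit in range(diff.bit_length())
                          if (diff >> bit) & 1]
    return fields


REPORT = """\
*** Hard Error - Error #41 - Memory compare error
Expected value: 25c07
Received value: 35c07
Failing addr: a11848
*** End of Error ***
"""

info = parse_compare_error(REPORT)
print(hex(info["address"]), info["bad_bits"])
```

For the sample report, the only differing bit is bit 16, which lies in the upper half (bits 31:16) of the data word.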


3.3.5 memory

The memory command tests memory by running a memory exerciser each time the command is entered. The exercisers are run in the background and nothing is displayed unless an error occurs.

To terminate the memory tests, use the kill command to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when terminating an individual diagnostic test.

Synopsis:

memory

Examples:

The following is an example with no errors.

>>> memory
>>> memory
>>> memory
Testing the memory
>>> show_status

ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ------------
00000001 idle         system       0      0    0    0             0
0000006b memtest      memory       1      0    0    53477376      53477376
00000071 memtest      memory       1      0    0    31457280      31457280
00000077 memtest      memory       1      0    0    24117248      24117248
>>> kill_diags
>>>


The following is an example with a memory compare error indicating bad SIMMs. In most cases, the failing bank and SIMM position (Figures 3–1 and 3–2) are specified in the error message. If the failing SIMM information is not provided, use the procedure following the example to isolate a failing SIMM.

>>> memory
>>> memory
>>> memory

*** Hard Error - Error #41 - Memory compare error

Diagnostic Name  ID        Device  Pass  Test  Hard/Soft  11-JUN-1996
memtest          00000193  brd0    114   1     1    0     12:00:01
Expected value: 25c07
Received value: 35c07
Failing addr: a11848

*** End of Error ***
>>> kill_diags
>>>

To find the failing bank, compare the failing address (a11848 in this example) with the show memory display or the memory portion of the show config command display:

1. Banks with no memory present are eliminated as possible failing banks.

2. If the failing address is greater than or equal to the bank starting address, but less than the starting address for the next bank, then the failing SIMM is within this bank. In the example, failing address a11848 falls within Bank 0, as the following memory display shows.

>>> show memory
Memory

32 Meg of System Memory
Bank 0 = 16 Mbytes (4MB per SIMM) Starting at 0x00000000
Bank 1 = 16 Mbytes (4MB per SIMM) Starting at 0x01000000
Bank 2 = No Memory Detected
Bank 3 = No Memory Detected

>>>
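The bank comparison described above is a simple range check. The Python sketch below takes the bank starting addresses and sizes from the show memory example; the function name and the dictionary layout are illustrative. It uses start plus size as the upper bound, which is equivalent to the starting address of the next bank when banks are contiguous, as they are in the example.

```python
MB = 1024 * 1024


def find_failing_bank(failing_addr, banks):
    """Return the bank number whose address range contains failing_addr.

    banks maps bank number -> (starting address, size in bytes).
    Banks with no memory detected are simply left out of the mapping.
    """
    for bank, (start, size) in sorted(banks.items()):
        if start <= failing_addr < start + size:
            return bank
    return None  # address is outside every populated bank


# Values from the show memory example: two populated 16-MB banks.
BANKS = {0: (0x00000000, 16 * MB), 1: (0x01000000, 16 * MB)}

print(find_failing_bank(0xA11848, BANKS))
```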


To determine the failing SIMM:

• Model 4/xxx Systems:

Match the least signiﬁcant nibble of the failing address to the failing SIMM using the table below.

Failing Address Least Significant Nibble    Failing SIMM

0    0
4    1
8    2
C    3

In the example, a11848, the 8 would indicate SIMM 2 as the failing SIMM.

• Model 5/xxx Systems:

Match the least signiﬁcant nibble of the failing address and the bit range in which the bad data is received to the failing SIMM using the table below.

Failing Address Least Significant Nibble    Data Miscompare in Bit Range    Failing SIMM

0 or 8    bits 15:0     0
0 or 8    bits 31:16    1
4 or C    bits 15:0     2
4 or C    bits 31:16    3

In the example, the least significant nibble of the failing address is 8 (a11848). The expected data value was 25c07; the received value was 35c07.

The data miscompare occurred in bits 16–19, which fall within bits 31:16; therefore, the failing SIMM is SIMM 1.

Model 4/xxx systems have SROM power-up tests for memory that can report a failing bank and SIMM. This series of tests is set using the J1 jumper on the CPU daughter board (Section 2.4).
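The two lookup tables for isolating a failing SIMM can be expressed as code. This Python sketch follows the tables as given in this section: Model 4/xxx uses only the least significant nibble of the failing address, while Model 5/xxx also uses the half of the 32-bit data word in which the miscompare occurred. The function names are illustrative, and the sketch returns None for cases the tables do not cover (for example, a miscompare in both halves).

```python
# Model 4/xxx: least significant nibble of the failing address -> SIMM.
MODEL4_SIMM = {0x0: 0, 0x4: 1, 0x8: 2, 0xC: 3}


def model4_failing_simm(failing_addr):
    """Return the failing SIMM for a Model 4/xxx system, or None if not in the table."""
    return MODEL4_SIMM.get(failing_addr & 0xF)


def model5_failing_simm(failing_addr, expected, received):
    """Return the failing SIMM for a Model 5/xxx system, or None if not in the table."""
    nibble = failing_addr & 0xF
    diff = (expected ^ received) & 0xFFFFFFFF
    in_low = bool(diff & 0x0000FFFF)   # miscompare in bits 15:0
    in_high = bool(diff & 0xFFFF0000)  # miscompare in bits 31:16
    if nibble not in (0x0, 0x4, 0x8, 0xC) or in_low == in_high:
        return None
    base = 0 if nibble in (0x0, 0x8) else 2
    return base + (1 if in_high else 0)


# Values from the memory compare error example.
print(model4_failing_simm(0xA11848))
print(model5_failing_simm(0xA11848, 0x25C07, 0x35C07))
```

For the example values, the Model 4/xxx lookup gives SIMM 2 and the Model 5/xxx lookup gives SIMM 1, matching the worked examples in the text.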


Figure 3–1 Model 4/xxx: AlphaServer 1000A Memory Layout

[Figure: memory layout showing Banks 0 through 3, each with SIMM 0 through SIMM 3, and an ECC SIMM for each of Banks 0 through 3]

Figure 3–2 Model 5/xxx: AlphaServer 1000A Memory Layout

[Figure: memory layout showing Banks 0 through 3, each with SIMM 0 through SIMM 3; one position is labeled Unused]

3.3.6 netew

The netew command is used to run MOP loopback tests for any EISA- or PCI-based ew* (DECchip 21040, TULIP) Ethernet ports. The command can also be used to test a port on a "live" network.

The loopback tests are set to run continuously (-p pass_count set to 0). Use the kill command (or Ctrl/C) to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when terminating an individual diagnostic test.

Note

While some results of network tests are reported directly to the console, you should examine the console event log (using the cat el or more el commands) for complete test results.

Synopsis:

netew

When the netew command is entered, the following script is executed:

net -sa ew*0>ndbr/lp_nodes_ew*0
set ew*0_loop_count 2 2>nl
set ew*0_loop_inc 1 2>nl
set ew*0_loop_patt ffffffff 2>nl
set ew*0_loop_size 10 2>nl
set ew*0_lp_msg_node 1 2>nl
net -cm ex ew*0
echo "Testing the network"
nettest ew*0 -sv 3 -mode nc -p 0 -w 1 &

The script builds a list of nodes to which to send MOP loopback packets, sets certain test environment variables, and tests the Ethernet port by using the following variation of the nettest exerciser:

nettest ew*0 -sv 3 -mode nc -p 0 -w 1 &

Testing an Ethernet Port:

>>> netew
>>> show_status

ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ------------
00000001 idle         system       0      0    0    0             0
000000d5 nettest      ewa0.0.0.0.0 13     0    0    308672        308672
>>> kill_diags
>>>


3.3.7 network

The network command is used to run MOP loopback tests for any EISA- or PCI-based er* (DEC 4220, LANCE) Ethernet ports. The command can also be used to test a port on a "live" network.

The loopback tests are set to run continuously (-p pass_count set to 0). Use the kill command (or Ctrl/C) to terminate an individual diagnostic or the kill_diags command to terminate all diagnostics. Use the show_status display to determine the process ID when terminating an individual diagnostic test.

Note

While some results of network tests are reported directly to the console, you should examine the console event log (using the cat el or more el commands) for complete test results.

Synopsis:

network

When the network command is entered, the following script is executed:

echo "setting up the network test, this will take about 20 seconds"
net -stop er*0
net -sa er*0>ndbr/lp_nodes_er*0
net ic er*0
set er*0_loop_count 2 2>nl
set er*0_loop_inc 1 2>nl
set er*0_loop_patt ffffffff 2>nl
set er*0_loop_size 10 2>nl
set er*0_lp_msg_node 1 2>nl
set er*0_mode 44 2>nl
net -start er*0
echo "Testing the network"
nettest er*0 -sv 3 -mode nc -p 0 -w 1 &

The script ends by running the following variation of the nettest exerciser:

nettest er*0 -sv 3 -mode nc -p 0 -w 1 &


Testing an Ethernet Port:

>>> network
>>> show_status
ID       Program      Device         Pass Hard/Soft Bytes Written   Bytes Read
-------- ------------ ------------ ------ --------- ------------- ------------
00000001 idle         system            0    0    0             0            0
000000d5 nettest      era0.0.0.0.0     13    0    0        308672       308672
>>> kill_diags
>>>


3.3.8 net -s

The net -s command displays the MOP counters for the specified Ethernet port.

Synopsis:

net -s ewa0

Example:

>>> net -s ewa0

Status counts:
ti: 72 tps: 0 tu: 47 tjt: 0 unf: 0 ri: 70 ru: 0
rps: 0 rwt: 0 at: 0 fd: 0 lnf: 0 se: 0 tbf: 0
tto: 1 lkf: 1 ato: 1 nc: 71 oc: 0

MOP BLOCK:
Network list size: 0

MOP COUNTERS:
Time since zeroed (Secs): 42

TX:
Bytes: 0 Frames: 0
Deferred: 1 One collision: 0 Multi collisions: 0

TX Failures:
Excessive collisions: 0 Carrier check: 0 Short circuit: 71
Open circuit: 0 Long frame: 0 Remote defer: 0
Collision detect: 71

RX:
Bytes: 49972 Frames: 70
Multicast bytes: 0 Multicast frames: 0

RX Failures:
Block check: 0 Framing error: 0 Long frame: 0
Unknown destination: 0 Data overrun: 0 No system buffer: 0
No user buffers: 0

>>>
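When a captured net -s display has to be compared across several ports or test runs, it can help to turn it into structured data. The following Python sketch is not part of the console firmware; it assumes the display has been saved as text on another machine, and the names TOKEN, parse_net_s, and SAMPLE are hypothetical. It treats a name followed only by a colon as a section heading and a name followed by a colon and an integer as a counter.

```python
import re

# One token is either "Name:" alone (a section heading such as "TX:")
# or "Name: <integer>" (a counter such as "Frames: 70").
TOKEN = re.compile(r"([A-Za-z][A-Za-z ()]*):(?:\s*(\d+))?")


def parse_net_s(text):
    """Group the counters of a saved net -s display by section heading."""
    sections = {}
    current = None
    for match in TOKEN.finditer(text):
        name, value = match.group(1).strip(), match.group(2)
        if value is None:
            # No number after the colon: start (or reopen) a section.
            current = sections.setdefault(name, {})
        elif current is not None:
            current[name] = int(value)
    return sections


# Abbreviated copy of the example display (values taken from this guide).
SAMPLE = """\
Status counts:
ti: 72 tps: 0 tu: 47
MOP COUNTERS:
Time since zeroed (Secs): 42
TX:
Bytes: 0 Frames: 0 Deferred: 1
TX Failures:
Short circuit: 71 Collision detect: 71
RX:
Bytes: 49972 Frames: 70
"""

counters = parse_net_s(SAMPLE)
print(counters["RX"]["Bytes"])  # prints 49972
```

Grouping by section matters because some counter names, such as Bytes and Long frame, appear under more than one heading.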


3.3.9 net -ic

The net -ic command initializes the MOP counters for the specified Ethernet port.

Synopsis:

net -ic ewa0

Example:

>>> net -ic ewa0
>>> net -s ewa0

Status counts:
ti: 72 tps: 0 tu: 47 tjt: 0 unf: 0 ri: 70 ru: 0
rps: 0 rwt: 0 at: 0 fd: 0 lnf: 0 se: 0 tbf: 0
tto: 1 lkf: 1 ato: 1 nc: 71 oc: 0

MOP BLOCK:
Network list size: 0

MOP COUNTERS:
Time since zeroed (Secs): 3

TX:
Bytes: 0 Frames: 0
Deferred: 0 One collision: 0 Multi collisions: 0

TX Failures:
Excessive collisions: 0 Carrier check: 0 Short circuit: 0
Open circuit: 0 Long frame: 0 Remote defer: 0
Collision detect: 0

RX:
Bytes: 0 Frames: 0
Multicast bytes: 0 Multicast frames: 0

RX Failures:
Block check: 0 Framing error: 0 Long frame: 0
Unknown destination: 0 Data overrun: 0 No system buffer: 0
No user buffers: 0

>>>


3.3.10 kill and kill_diags

The kill and kill_diags commands terminate diagnostics that are currently executing.

Note

A serial loopback connector (12-27351-01) must be installed on the COM2 serial port for the kill_diags command to successfully terminate system tests.

• The kill command terminates a specified process.

• The kill_diags command terminates all diagnostics.

Synopsis:

kill_diags
kill [PID . . . ]

Argument:

[PID . . . ] The process ID of the diagnostic to terminate. Use the show_status command to determine the process ID.


3.3.11 show_status

The show_status command reports one line of information per executing diagnostic. The information includes ID, diagnostic program, device under test, error counts, passes completed, bytes written, and bytes read.

Many of the diagnostics run in the background and provide information only if an error occurs. Use the show_status command to display the progress of diagnostics.

The following command string is useful for periodically displaying diagnostic status information for diagnostics running in the background:

>>> while true;show_status;sleep n;done

Where n is the number of seconds between show_status displays.

Synopsis:

show_status

Example:

>>> show_status
ID       Program      Device         Pass Hard/Soft Bytes Written   Bytes Read

The columns of the display are as follows:

ID             Process ID
Program        Program module name
Device         Device under test
Pass           Diagnostic pass count
Hard/Soft      Error count (hard and soft): Soft errors are not usually fatal; hard errors halt the system or prevent completion of the diagnostics.
Bytes Written  Bytes successfully written by diagnostic
Bytes Read     Bytes successfully read by diagnostic
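If show_status output is captured from a serial console log, each data line can be split into the fields described above. The Python sketch below is an illustration only: the DiagStatus type and parse_status_line function are hypothetical names, the ID is assumed to be hexadecimal (the examples in this chapter show values such as 000000d5), and the remaining counts are assumed to be decimal.

```python
from dataclasses import dataclass


@dataclass
class DiagStatus:
    """One data line of a show_status display."""
    pid: int            # ID column (assumed hexadecimal)
    program: str        # Program column
    device: str         # Device column
    passes: int         # Pass column
    hard_errors: int    # first number of the Hard/Soft column
    soft_errors: int    # second number of the Hard/Soft column
    bytes_written: int
    bytes_read: int


def parse_status_line(line):
    """Split one whitespace-separated show_status line into its eight fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, found {len(fields)}: {line!r}")
    passes, hard, soft, written, read = (int(f) for f in fields[3:])
    return DiagStatus(int(fields[0], 16), fields[1], fields[2],
                      passes, hard, soft, written, read)


# Data line taken from the netew example in this chapter.
row = parse_status_line("000000d5 nettest ewa0.0.0.0.0 13 0 0 308672 308672")
print(row.program, row.passes, row.hard_errors + row.soft_errors)
```

The process ID recovered this way is the value passed to the kill command when a single diagnostic has to be stopped.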


3.4 Acceptance Testing and Initialization

Perform the acceptance testing procedure listed below after installing a system or whenever adding or replacing the following:

• Memory modules

• Motherboard

• CPU daughter board

• Storage devices

• EISA or PCI options

1. Run the RBD acceptance tests using the test or sys_exer command.

2. If you have added or moved an EISA option or some ISA options, run the EISA Configuration Utility (ECU).

3. Bring up the operating system.

4. Run DEC VET to test that the operating system is correctly installed. Refer to Section 3.5 for information on DEC VET.

3.5 DEC VET

Digital’s DEC Veriﬁer and Exerciser Tool (DEC VET) software is a multipurpose system maintenance tool that performs exerciser-oriented maintenance testing. DEC VET runs on Digital UNIX, OpenVMS, and Windows NT operating systems. DEC VET consists of a manager and exercisers. The DEC VET manager controls the exercisers. The exercisers test system hardware and the operating system.

DEC VET supports various exerciser conﬁgurations, ranging from a single device exerciser to full system loading, that is, simultaneous exercising of multiple devices.

Refer to the DEC Veriﬁer and Exerciser Tool User’s Guide (AA–PTTMD–TE) for instructions on running DEC VET.


Error Log Analysis

This chapter provides information on how to interpret error logs reported by the operating system.

• Section 4.1 describes machine check/interrupts and how these errors are detected and reported.

• Section 4.2 describes the entry format used by the error formatters.

• Section 4.3 describes how to generate a formatted error log using the DECevent Translation and Reporting Utility available with OpenVMS and Digital UNIX.

4.1 Fault Detection and Reporting

Table 4–1 provides a summary of the fault detection and correction components of AlphaServer 1000A systems.

Generally, PALcode handles exceptions as follows:

• The PALcode determines the cause of the exception.

• If possible, it corrects the problem and passes control to the operating system for reporting before returning the system to normal operation.

• If error/event logging is required, control is passed through the system control block (SCB) to the appropriate exception handler.


Table 4–1 AlphaServer 1000 Fault Detection and Correction

Component: Fault Detection/Correction Capability

KN22A Processor Module

DECchip 21064, 21064A, and 21164 microprocessors: Contains error detection and correction (EDC) logic for data cycles. There are check bits associated with all data entering and exiting the 21064(A) microprocessor. A single-bit error on any of the four longwords being read can be corrected (per cycle). A double-bit error on any of the four longwords being read can be detected (per cycle).

Backup cache (B-cache): EDC check bits on the data store, and parity on the tag address store and tag control store.

Memory Subsystem

Memory SIMMs: EDC logic protects data by detecting and correcting data cycle errors. A single-bit error on any of the four longwords can be corrected (per cycle). A double-bit error on any of the four longwords being read can be detected (per cycle).

System Motherboard

SCSI Controller: SCSI data parity is generated.

EISA-to-PCI bridge chip: PCI data parity is generated.

PCI-to-PCI bridge chip: PCI data parity is generated.

4.1.1 Machine Check/Interrupts

The exceptions that result from hardware system errors are called machine check/interrupts. They occur when a system error is detected during the processing of a data request. There are four types of machine check/interrupts related to system events:

1. Processor machine check (SCB 670)

2. System machine check (SCB 660)

3. Processor-corrected machine check (SCB 630)

4. System-corrected machine check (SCB 620)

During the error handling process, errors are first handled by the appropriate PALcode error routine and then by the associated operating system error handler. The causes of each of the machine check/interrupts are as follows. The system control block (SCB) vector through which PALcode transfers control to the operating system is shown in parentheses.


Processor Machine Check (SCB: 670)

Processor machine check errors are fatal system errors that result in a system crash. The error handling code for these errors is common across all platforms using the DECchip 21064, 21064A, and 21164 microprocessors.

• The DECchip 21064, 21064A, or 21164 microprocessor detected one or more of the following uncorrectable data errors:

– Uncorrectable B-cache data error

– Uncorrectable memory data error

• A B-cache tag or tag control parity error occurred

• Hard error was asserted in response to:

– Double-bit Istream ECC error

– Double-bit Dstream ECC error

– System transaction terminated with CACK_HERR

– I-cache parity errors

– D-cache parity errors

System Machine Check (SCB: 660)

A system machine check is a system-detected error, external to the DECchip 21064, 21064A, or 21164 microprocessor and possibly not related to the activities of the CPU. These errors are speciﬁc to AlphaServer 1000A systems.

Fatal errors:

• System overtemperature failure

• System complete power supply failure

The power supply number is called out in the register: power supply 1 is the bottom supply; power supply 2 is the top supply.

• System fan failure

• I/O read/write retry timeout

• DMA data parity error

• I/O data parity error

• Slave abort PCI transaction

• DEVSEL not asserted

• Uncorrectable read error


• Invalid page table lookup (scatter gather)

• Memory cycle error

• B-cache tag address parity error

• B-cache tag control parity error

• Non-existent memory error

• ESC NMI: IOCHK

Processor-Corrected Machine Check (SCB: 630)

Processor-corrected machine checks are caused by B-cache errors that are detected and corrected by the DECchip 21064, 21064A, or 21164 microprocessor. These are nonfatal errors that result in an error log entry. The error handling code for these errors is common across all platforms using the DECchip 21064, 21064A, and 21164 microprocessors.

• Single-bit Istream ECC error

• Single-bit Dstream ECC error

• System transaction terminated with CACK_SERR

System-Corrected Machine Check (SCB: 620)

These nonfatal errors are correctable errors specific to AlphaServer 1000A systems. They result in the generation of the correctable machine check logout frame:

• Correctable read errors

• Single power supply failure when operating with redundant power supplies.

• System overtemperature warning
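The four machine check/interrupt classes described above can be summarized as a lookup table keyed by SCB vector. The Python sketch below only restates what this section says; the MachineCheck type, its field names, and the describe function are hypothetical, and the vectors are kept as the strings printed in this guide rather than interpreted as numbers.

```python
from typing import NamedTuple


class MachineCheck(NamedTuple):
    name: str
    detected_by: str  # "processor" or "system"
    fatal: bool       # True if this section describes the class as fatal


# SCB vector (as printed in this guide) -> machine check class.
MACHINE_CHECKS = {
    "670": MachineCheck("Processor machine check", "processor", True),
    "660": MachineCheck("System machine check", "system", True),
    "630": MachineCheck("Processor-corrected machine check", "processor", False),
    "620": MachineCheck("System-corrected machine check", "system", False),
}


def describe(scb_vector):
    """Return a one-line summary for an SCB vector such as '660'."""
    check = MACHINE_CHECKS.get(scb_vector)
    if check is None:
        return f"SCB {scb_vector}: not a machine check vector listed in this section"
    severity = "fatal" if check.fatal else "nonfatal, logged"
    return f"SCB {scb_vector}: {check.name} ({severity})"


print(describe("630"))
```

A table of this kind is mainly useful when scanning translated error logs, where the vector is often the quickest indication of whether an entry records a crash or a corrected event.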

4.2 Error Logging and Event Log Entry Format

The Digital UNIX and OpenVMS error handlers can generate several entry types. All error entries, with the exception of correctable memory errors, are logged immediately. Entries can be of variable length based on the number of registers within the entry.

Each entry consists of an operating system header, several device frames, and an end frame. Most entries have a PAL-generated logout frame, and may contain frames for CPU, memory, and I/O.


4.3 Event Record Translation

Systems running Digital UNIX and OpenVMS operating systems use the DECevent management utility to translate events into ASCII reports derived from system event entries (bit-to-text translations).

The DECevent utility has the following features relating to the translation of events:

• Translating event log entries into readable reports

• Selecting input and output sources

• Filtering input events

• Selecting alternate reports

• Translating events as they occur

• Maintaining and customizing the user environment with the interactive shell commands

Note

Microsoft Windows NT does not currently provide bit-to-text translation of system errors.

• Section 4.3.1 summarizes the command used to translate the error log information for the OpenVMS operating system using DECevent.

• Section 4.3.2 summarizes the command used to translate the error log information for the Digital UNIX operating system using DECevent.

4.3.1 OpenVMS Alpha Translation Using DECevent

The kernel error log entries are translated from binary to ASCII using the DIAGNOSE command. To invoke the DECevent utility, enter the DCL command DIAGNOSE.

Format:

DIAGNOSE/TRANSLATE [qualifier][,...][infile[, . . . ]]

Example:

$ DIAGNOSE/TRANSLATE/SINCE=14-JUN-1995

For more information on generating error log reports using DECevent, refer to DECevent Translation and Reporting Utility for OpenVMS Alpha, User and Reference Guide, AA-Q73KC-TE.


System faults can be isolated by examining translated system error logs or using the DECevent Analysis and Notiﬁcation Utility. Refer to the DECevent Analysis and Notiﬁcation Utility for OpenVMS Alpha, User and Reference Guide, AA-Q73LC-TE, for more information.

4.3.2 Digital UNIX Translation Using DECevent

The kernel error log entries are translated from binary to ASCII using the dia command. To invoke the DECevent utility, enter the dia command.

Format:

dia [-a -f infile[...]]

Example:

% dia -t s:14-jun-1995:10:00

For more information on generating error log reports using DECevent, refer to DECevent Translation and Reporting Utility for Digital UNIX, User and Reference Guide, AA-QAA3-TE.

System faults can be isolated by examining translated system error logs or using the DECevent Analysis and Notification Utility. Refer to the DECevent Analysis and Notification Utility for Digital UNIX, User and Reference Guide, AA-QAA4A-TE, for more information.
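The two translation examples in Sections 4.3.1 and 4.3.2 select events logged since a given date. As a convenience, the Python sketch below builds the same two command strings from a date and time. It copies only the patterns visible in those examples; whether other date forms are accepted is not covered here, and the function names are hypothetical. Consult the DECevent guides for the full qualifier syntax.

```python
from datetime import datetime

# Month abbreviations are spelled out so the result does not depend on locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def openvms_translate_since(when):
    """Build the DCL command in the form shown in Section 4.3.1."""
    month = MONTHS[when.month - 1]
    return f"DIAGNOSE/TRANSLATE/SINCE={when.day:02d}-{month}-{when.year}"


def digital_unix_translate_since(when):
    """Build the dia command in the form shown in Section 4.3.2."""
    month = MONTHS[when.month - 1].lower()
    return (f"dia -t s:{when.day:02d}-{month}-{when.year}"
            f":{when.hour:02d}:{when.minute:02d}")


start = datetime(1995, 6, 14, 10, 0)
print(openvms_translate_since(start))       # DIAGNOSE/TRANSLATE/SINCE=14-JUN-1995
print(digital_unix_translate_since(start))  # dia -t s:14-jun-1995:10:00
```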


System Conﬁguration and Setup

This chapter provides conﬁguration and setup information for AlphaServer 1000A systems and system options.

• Section 5.1 describes how to examine the system configuration using the console firmware.

– Section 5.1.1 describes the function of the two firmware interfaces used with AlphaServer 1000A systems.

– Section 5.1.2 describes how to switch between firmware interfaces.

– Sections 5.1.3 and 5.1.4 describe the commands used to examine system configuration for each firmware interface.

• Section 5.2 describes the system bus conﬁguration.

• Section 5.3 describes the motherboard.

• Section 5.4 describes the EISA bus.

• Section 5.5 describes how ISA options are compatible on the EISA bus.

• Section 5.6 describes the EISA conﬁguration utility (ECU).

• Section 5.7 describes the PCI bus.

• Section 5.8 describes SCSI buses and conﬁgurations.

• Section 5.9 describes power supply conﬁgurations.

• Section 5.10 describes the console port conﬁgurations.


5.1 Verifying System Conﬁguration

Figures 5–1 and 5–2 illustrate the system architecture for AlphaServer 1000A systems.

Figure 5–1 System Architecture: AlphaServer 1000A Model 4/xxx Systems

[Block diagram not reproduced. Labeled blocks: CPU card, 21064, B-cache (2 MB), SROM, Comanche, Decade, Epic, memory (16 MB to 1 GB), primary PCI bus, PCI-PCI bridge, secondary PCI bus, PCI slots, PCI-EISA bridge, EISA bus, EISA slots, QLogic ISP1020A, Fast-Wide SCSI bus, SVGA Cirrus 5428, X-bus, buffers, TOY, flash ROM (1 MB), OCP, EISA config RAM, 87332, 8242 keyboard and mouse controller, keyboard, mouse, serial ports, floppy port, parallel port.]


Figure 5–2 System Architecture: AlphaServer 1000A Model 5/xxx Systems

[Block diagram not reproduced. Labeled blocks: CPU card, 21164, B-cache (2 MB), SROM, DSW, CIA, memory (16 MB to 1 GB), primary PCI bus, PCI-PCI bridge, secondary PCI bus, PCI slots, PCI-EISA bridge, EISA bus, EISA slots, QLogic ISP1020A, Fast-Wide SCSI bus, SVGA Cirrus 5428, X-bus, buffers, TOY, flash ROM (1 MB), OCP, EISA config RAM, 87332, 8242 keyboard and mouse controller, keyboard, mouse, serial ports, floppy port, parallel port.]

5.1.1 System Firmware

The system ﬁrmware currently provides support for the following operating systems:

• Digital UNIX and OpenVMS Alpha are supported under the SRM command line interface, which can be serial or graphical. The SRM ﬁrmware is in compliance with the Alpha System Reference Manual (SRM).

• For Model 4/xxx systems, Windows NT is supported under the ARC menu interface, which is graphical. The ARC ﬁrmware is in compliance with the Advanced RISC Computing Standard Speciﬁcation (ARC).

• For Model 5/xxx systems, Windows NT is supported under the AlphaBIOS console. Refer to the AlphaServer 1000/1000A Model 5/xxx Owner’s Guide Supplement.

The console ﬁrmware provides the data structures and callbacks available to booted programs deﬁned in the SRM, ARC, and AlphaBIOS standards.
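The operating system and model determine which console interface applies. The Python sketch below restates the three cases listed above as a small lookup function; the function name and the string forms of its arguments are assumptions made for illustration.

```python
def console_firmware(operating_system, model):
    """Return the console interface this section associates with an OS and model.

    operating_system: "Digital UNIX", "OpenVMS", or "Windows NT"
    model: a model designation beginning with "4/" or "5/"
    """
    if operating_system in ("Digital UNIX", "OpenVMS"):
        return "SRM"
    if operating_system == "Windows NT":
        if model.startswith("4/"):
            return "ARC"
        if model.startswith("5/"):
            return "AlphaBIOS"
    raise ValueError(f"no console listed for {operating_system!r} on model {model!r}")


print(console_firmware("Windows NT", "5/xxx"))  # prints AlphaBIOS
```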


SRM Command Line Interface

Systems running Digital UNIX or OpenVMS access the SRM ﬁrmware through a command line interface (CLI). The CLI is a UNIX style shell that provides a set of commands and operators, as well as a scripting facility. The CLI allows you to conﬁgure and test the system, examine and alter system state, and boot the operating system.

The SRM console prompt is >>>.

Several system management tasks can be performed only from the SRM console command line interface:

• All console test and reporting commands are run from the SRM console.

• Certain environment variables are changed using the SRM set command. For example:

er*0_protocols
ew*0_mode
ew*0_protocols
ocp_text
pk*0_fast
pk*0_host_id

To run the ECU, you must enter the ecu command. This command will boot the ARC firmware and the ECU software, or in the case of AlphaBIOS, will boot the AlphaBIOS firmware.

ARC and AlphaBIOS Menu Interface

Systems running Windows NT access the ARC or AlphaBIOS console firmware through menus that are used to configure and boot the system, run the EISA Configuration Utility (ECU), run the RAID Configuration Utility (RCU) or an adapter configuration utility, or set environment variables.

• You must run the EISA Configuration Utility (ECU) whenever you add, remove, or move an EISA or ISA option in your AlphaServer system. The ECU is run from diskette. Two diskettes are supplied with your system shipment, one for Digital UNIX and OpenVMS and one for Windows NT. For more information about running the ECU, refer to Section 5.6.

• If you purchased a StorageWorks RAID Array 200 Subsystem for your server, you must run the RAID Configuration Utility (RCU) to set up the disk drives and logical units. Refer to StorageWorks RAID Array 200 Subsystems Controller Installation and Standalone Configuration Utility User's Guide, included in your RAID kit.
