Digital Equipment Corporation
Maynard, Massachusetts
Revised, July 1993
First Printing, December 1992
The information in this document is subject to change without notice and should not be construed
as a commitment by Digital Equipment Corporation.
Digital Equipment Corporation assumes no responsibility for any errors that may appear in this
document.
The software, if any, described in this document is furnished under a license and may be used or
copied only in accordance with the terms of such license. No responsibility is assumed for the use
or reliability of software or equipment that is not supplied by Digital Equipment Corporation or its
affiliated companies.
in preparing future documentation.
The following are trademarks of Digital Equipment Corporation: Alpha AXP, AXP, DEC, DECchip,
DECconnect, DECdirect, DECnet, DECserver, DEC VET, DESTA, MSCP, RRD40, ThinWire,
TMSCP, TU, UETP, ULTRIX, VAX, VAX DOCUMENT, VAXcluster, VMS, the AXP logo, and the
DIGITAL logo.
OSF/1 is a registered trademark of Open Software Foundation, Inc.
All other trademarks and registered trademarks are the property of their respective holders.
FCC NOTICE: The equipment described in this manual generates, uses, and may emit radio
frequency energy. The equipment has been type tested and found to comply with the limits for
a Class A computing device pursuant to Subpart J of Part 15 of FCC Rules, which are designed
to provide reasonable protection against such radio frequency interference when operated in a
commercial environment. Operation of this equipment in a residential area may cause interference,
in which case the user at his own expense may be required to take measures to correct the
interference.
This document was prepared using VAX DOCUMENT, Version 2.1.
This guide describes the procedures and tests used to service DEC 4000 AXP
systems.
Intended Audience
This guide is intended for use by Digital Equipment Corporation service personnel
and qualified self-maintenance customers.
Conventions
The following conventions are used in this guide.

Convention        Meaning
Return            A key name enclosed in a box indicates that you press that key.
Ctrl/x            Ctrl/x indicates that you hold down the Ctrl key while you
                  press another key, indicated here by x. In examples, this key
                  combination is enclosed in a box, for example, Ctrl/C.
bold type         In the online book (Bookreader), bold type in examples
                  indicates commands and other instructions that you enter
                  at the keyboard.
lowercase         Lowercase letters in commands indicate that commands can be
                  entered in uppercase or lowercase.
(drawing)         In some illustrations, small drawings of the DEC 4000 AXP
                  system appear in the left margin. Shaded areas help you locate
                  components on the front or back of the system.
Warning           Warnings contain information to prevent personal injury.
Caution           Cautions provide information to prevent damage to equipment
                  or software.
[]                In command format descriptions, brackets indicate optional
                  elements.
console command   Console command abbreviations must be entered exactly as
abbreviations     shown.
boot              Console and operating system commands are shown in this
                  special typeface.
italic type       Italic type in console command sections indicates a variable.
< >               In console mode online help, angle brackets enclose a
                  placeholder for which you must specify a value.
{ }               In command descriptions, braces containing items separated by
                  commas imply mutually exclusive items.
1
System Maintenance Strategy
Any successful maintenance strategy is based on the proper understanding
and use of information services, service tools, service support and escalation
procedures, field feedback, and troubleshooting procedures. This chapter
describes the maintenance strategy for the DEC 4000 AXP system.
•Section 1.1 provides a diagnostic strategy you should use to troubleshoot a
DEC 4000 AXP system.
•Section 1.2 explains the service delivery methodology.
•Section 1.3 lists the product tools and utilities.
•Section 1.4 lists available information services.
•Section 1.5 describes field feedback procedures.
1.1 Troubleshooting the System
Before troubleshooting any system problem, check the site maintenance log for
the system’s service history. Be sure to ask the system manager the following
questions:
•Has the system been used before and did it work correctly?
•Have changes to hardware or updates to firmware or software been made to
the system recently?
•What is the state of the system—is the operating system up?
If the operating system is down and you are not able to bring it up, use the
console environment diagnostic tools, such as RBDs and LEDs.
If the operating system is up, use the operating system environment
diagnostic tools, such as error logs, crash dumps, DEC VET and UETP
exercisers, and other log files.
System problems can be classified into the following five categories:
1. Power problems
2. Problems getting to the console
3. Failures reported by the console subsystem
4. Boot failures
5. Failures reported by the operating system
Using these categories, you can quickly determine a starting point for diagnosis
and eliminate the unlikely sources of the problem. Table 1–1 provides the
recommended tools or resources you should use to isolate problems in each
category.
Table 1–1 Recommended Troubleshooting Procedures

1. Power Problems (Table 1–2)

Description: No power at system enclosure or trouble with power supply subsystem, as
indicated by LEDs.
Diagnostic Tools/Resources and Reference:
•Power supply subsystem LEDs: Refer to Section 2.1.1 for information on interpreting
power supply LEDs.

2. Problems Getting to Console Mode (Table 1–3)

Description: System powers up, but does not display power-up screen.
Diagnostic Tools/Resources and Reference:
•OCP LEDs: Refer to Section 2.1.2 for information on interpreting OCP LEDs.
•Console terminal troubleshooting flow: Refer to Table 1–3 for information on
troubleshooting console terminal problems.
•Power-up sequence description: Refer to Section 2.3 and 2.3.3 for a description of the
power-up and self-test sequence.
•Robust mode power-up: Refer to Section 2.2.3 for a description of robust mode power-up
and its functions.

5. Failures Reported by the Operating System (Table 1–6)

Description: Operating system generates error logs; process hangs or operating system
crashes.
Diagnostic Tools/Resources and Reference:
•Error logs: Refer to Chapter 4 for information on interpreting error logs.
•Crash dump: Refer to OpenVMS AXP Alpha System Dump Analyzer Utility Manual for
information on how to interpret OpenVMS crash dump files. Refer to the Guide to Kernel
Debugging (AA–PS2TA–TE) for information on using the DEC OSF/1 Krash Utility.
•DEC VET or UETP: Refer to Section 3.3 for a description of DEC VET, and Section 3.4 for
information on running UETP software exercisers.
•Other log files: Refer to Chapter 4 for information on using log files such as
SETHOST.LOG and OPERATOR.LOG to aid in troubleshooting.

Use the following tables to identify the diagnostic flow for the five types of system
problems:
•Table 1–2 provides the diagnostic flow for power problems.
•Table 1–3 provides the diagnostic flow for problems getting to console mode.
•Table 1–4 provides the diagnostic flow for problems reported by the console
program.
•Table 1–5 provides the diagnostic flow for boot problems.
•Table 1–6 provides the diagnostic flow for errors reported by the operating
system.
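The choice of a starting table can be sketched in a few lines of Python. This is only an illustration of the five-category triage described above, not a DEC-supplied tool. The `Observation` fields, the `classify` function, and the order in which the checks are applied are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What the service representative has observed (hypothetical fields)."""
    has_power: bool                # system enclosure has power
    reaches_console: bool          # power-up screens or console event log appear
    console_reports_failure: bool  # console subsystem reports a failure
    boots: bool                    # operating system boots


def classify(obs: Observation) -> tuple:
    """Return (problem category, diagnostic flow table) for an observation.

    The checks run in the order the categories are listed in this
    chapter; that ordering is an assumption of this sketch.
    """
    if not obs.has_power:
        return ("Power problems", "Table 1-2")
    if not obs.reaches_console:
        return ("Problems getting to the console", "Table 1-3")
    if obs.console_reports_failure:
        return ("Failures reported by the console subsystem", "Table 1-4")
    if not obs.boots:
        return ("Boot failures", "Table 1-5")
    return ("Failures reported by the operating system", "Table 1-6")
```

For example, a system that powers up and reaches the console without reported failures but does not boot would be classified as a boot failure, which points to Table 1–5.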
Table 1–2 Diagnostic Flow for Power Problems

Symptom: No AC power at system as indicated by AC present LED.
Action: Check the power source and power cord.

Symptom: AC power is present, but system does not power on.
Action: Check the system AC circuit breaker setting.
Action: Check the DC on/off switch setting.
Action: Examine power supply subsystem LEDs to determine if a power supply unit or fan
has failed, or if the system has shut down due to an overtemperature condition.
Reference: Section 2.1.1

Table 1–3 Diagnostic Flow for Problems Getting to Console Mode

Symptom: Power-up screens (or console event log) are not displayed.

Action: Check OCP LEDs for a failure during self-tests. If two OCP LEDs remain lit,
either option could be at fault.
Reference: Section 2.1.2

Action: Check baud rate setting for console terminal and system. The system default
baud rate setting is 9600.
Reference: Section 6.5

Action: Try connecting the console terminal to the auxiliary console port.
Note: No console output is directed to the auxiliary console port until the power-up
self-tests have completed and you press the Enter key or Ctrl/x.

Action: For certain situations, power up under robust mode to bypass the power-up
script and get to a low-level console. From console mode, you can then edit the nvram
file, set and examine environment variables, or initialize individual phases of drivers.
Reference: Section 2.2.3

Table 1–4 Diagnostic Flow for Problems Reported by the Console Program

Symptom: Power-up screens are displayed, but tests do not complete.

Action: Use power-up display and/or OCP LEDs to determine error.
Reference: Section 2.2 and Section 2.1.2

Symptom: Console program reports error.

Action: Examine the console event log to check for embedded error messages recorded
during power-up.
Reference: Section 2.2.1

Action: If power-up screens indicate problems with mass storage devices, use the
troubleshooting flow charts to determine the problems.
Reference: Section 2.2.2

Action: Run RBD tests to verify problem.
Reference: Section 3.1

Action: Use the show error command to examine error information contained in serial
control bus EEPROMs.
Reference: Section 3.1.4

Table 1–5 Diagnostic Flow for Boot Problems

Symptom: System cannot find boot device.
Action: Check system configuration for correct device parameters (node ID, device name,
and so on) and environment variables (bootdef_dev, boot_file, boot_osflags).
Reference: Section 6.2.1, Section 6.3, and Section 6.4

Symptom: Device does not boot.
Action: Run device test to check that boot device is operating.
Reference: Section 3.2

Table 1–6 Diagnostic Flow for Errors Reported by the Operating System

Symptom: System is hung or has crashed.

Action: Examine the crash dump file.
Reference: Operating system documentation

Action: Use the show error command to examine error information contained in serial
control bus EEPROMs (console environment error log).
Reference: Section 3.1.4

Symptom: Operating system is up.

Action: Examine the operating system error log files to isolate the problem.
Reference: Chapter 4

Action: If the problem occurs intermittently, run DEC VET or UETP to stress the system.
Reference: Section 3.3 and Section 3.4

Action: Examine other log files, such as SETHOST.LOG, OPCOM.LOG, and OPERATOR.LOG.

1.2 Service Delivery Methodology
Before beginning any maintenance operation, you should be familiar with the
following:
•The site agreement
•Your local and area geography support and escalation procedures
•Your Digital Services product delivery plan
Service delivery methods are part of the service support and escalation
procedure. When appropriate, remote services should be part of the initial
system installation. Methods of service delivery include the following:
•Local support
•Remote call screening
•Remote diagnosis (using modem support)
Recommended System Installation
The recommended system installation includes:
1. Hardware installation and acceptance testing. Acceptance testing includes
running ROM-based diagnostics.
2. Software installation and acceptance testing. For example, using OpenVMS
Factory Installed Software (FIS), and then acceptance testing with DEC VET
or UETP.
3. Installation of the remote service tools and equipment to allow a Digital
Service Center to dial in to the system. Refer to your remote service delivery
strategy.
If you do not follow your service delivery methodology, you risk incurring
excessive service expenses for any product.
1.3 Product Service Tools and Utilities
This section lists the array of service tools and utilities available for acceptance
testing, diagnosis, and serviceability and provides recommendations for their use.
Error Handling/Logging
OpenVMS and DEC OSF/1 operating systems provide recovery from errors,
fault handling, and event logging. The OpenVMS Error Report Formatter
(ERF) provides bit-to-text translation of the event logs for interpretation.
DEC OSF/1 uses UERF to capture the same kinds of information.
RECOMMENDED USE: Analysis of error logs is the primary method of
diagnosis and fault isolation. If the system is up, or the customer allows the
service representative to bring the system up, look at this information first.
Refer to Chapter 4 for information on using error logs to isolate faults.
ROM-Based Diagnostics (RBDs)
ROM-based diagnostics have significant advantages:
•There is no load time.
•The boot path is more reliable.
•Diagnosis is done in console mode.
RECOMMENDED USE: The ROM-based diagnostic facility is the primary
means of console environment testing and diagnosis of the CPU, memory,
Ethernet, Futurebus+, and SCSI and DSSI subsystems. Use ROM-based
diagnostics in the acceptance test procedures when you install a system,
add a memory module, or replace the following: CPU module, memory
module, backplane, I/O module, Futurebus+ device, or storage device. Refer
to Section 3.1 for information on running ROM-based diagnostics.
Loopback Tests
Internal and external loopback tests are used to isolate a failure by testing
segments of a particular control or data path. The loopback tests are a subset
of the ROM-based diagnostics.
RECOMMENDED USE: Use loopback tests to isolate problems with the
auxiliary console port and Ethernet controllers. Refer to Section 3.1.12 for
instructions on performing loopback tests.
Firmware Console Commands
Console commands are used to set and examine environment variables and
device parameters. For example, the show memory, show configuration,
and show device commands are used to examine the configuration; the set
(bootdef_dev, auto_action, and boot_osflags) commands are used to set
environment variables; and the cdp command is used to configure DSSI
parameters.
RECOMMENDED USE: Use console commands to set and examine
environment variables and device parameters. Refer to Section 6.2 for
information on firmware commands and utilities.
Option LEDs During Power-Up
The power supply LEDs display pass/fail test results for the power supply
subsystem; the operator control panel (OCP) LEDs display pass/fail self-test
results for CPU, memory, I/O, and Futurebus+ modules. Storage devices and
Futurebus+ modules have their own LEDs as well.
RECOMMENDED USE: Monitor LEDs during power-up to see if the devices
pass their self-tests. Refer to Chapter 2 for information on LEDs and power-up tests.
Operating System Exercisers (DEC VET or UETP)
The Digital Verifier and Exerciser Tool (DEC VET) is supported by the
OpenVMS and DEC OSF/1 operating systems. DEC VET performs exerciser-oriented maintenance testing of both hardware and operating system. UETP
is included with OpenVMS and is designed to test whether the OpenVMS
operating system is installed correctly.
RECOMMENDED USE: Use DEC VET or UETP as part of acceptance testing
to ensure that the CPU, memory, disk, tape, file system, and network are
interacting properly. Also use DEC VET or UETP to stress test the user’s
environment and configuration by simulating system operation under heavy
loads to diagnose intermittent system failures.
Crash Dumps
For fatal errors, such as fatal bugchecks, OpenVMS and DEC OSF/1 operating
systems will save the contents of memory to a crash dump file.
RECOMMENDED USE: The support representative should analyze crash
dump files. To save a crash dump file for analysis, you need to know
proper system settings. Refer to the OpenVMS AXP Alpha System Dump Analyzer Utility Manual or the Guide to Kernel Debugging (AA–PS2TA–TE)
for instructions.
Other Log Files
Several types of log files, such as operator log, console event log, sethost log,
and accounting file (accounting.dat) are useful in troubleshooting.
RECOMMENDED USE: Use the sethost log and other log files to
capture/examine the console output and compare with event logs and crash
dumps in order to see what the system was doing at the time of the error.
1.4 Information Services
As a Digital service representative, you may access several information resources,
including advanced database applications, online training courses, and remote
diagnostic tools. A brief description of some of these resources follows.
Technical Information Management Architecture (TIMA)
TIMA is an online database that delivers technical and reference information
to service representatives. A key benefit of TIMA is the pooling of worldwide
knowledge and expertise.
DEC 4000 AXP Model 600 Series Information Set
The DEC 4000 AXP Model 600 Series Information Set consists of service
documentation that contains information on installing and using, servicing
and upgrading, and understanding the system. The guide you are reading
is part of the set. The hardcopy kit number is EK–KN430–DK. The set is
also available on TIMA. Refer to your DEC 4000 Model 600 Information Map
(EK–KN430–IN) for detailed information.
Training
Computer Based Training (CBT) and lecture lab courses are available from
the Digital training center:
•DEC 4000 System Installation and Troubleshooting (CBT course, EY–
Digital Services Product Delivery Plan (Hardware or Software)
The Product Delivery Plan documents Digital Services’ delivery commitments.
The plan is the communications vehicle used among the various groups
responsible for ensuring consistency between Digital Services’ delivery
strategies and engineering product strategies.
Blitzes
Technical updates are ‘‘blitzed’’ to the field using online mail and TIMA.
Storage and Retrieval System (STARS)
STARS is a worldwide database for storing and retrieving technical
information. The STARS databases, which contain more than 150,000 entries,
are updated daily.
Using STARS, you can quickly retrieve the most up-to-date technical
information via DSNlink or DSIN.
1.5 Field Feedback
Providing the proper feedback to the corporation is essential in closing the loop
on any service call. Consider the following when completing a service call:
•Fill out repair tags accurately and with as much symptom information as
possible so that repair centers can fix a problem.
•Provide accurate call closeout information for Labor Activity Reporting
System (LARS) or Call-Handling and Management Planning (CHAMP).
•Keep an up-to-date site maintenance log, whether hardcopy or electronic, to
provide a record of the performed maintenance.
2
Power-On Diagnostics and System LEDs
This chapter provides information on how to interpret system LEDs and the
power-up console screens. In addition, a description of the power-up and
bootstrap sequence is provided as a resource to aid in troubleshooting.
•Section 2.1 describes how to interpret system LEDs.
•Section 2.2 describes how to interpret the power-up screens.
•Section 2.3 describes the power-up sequence.
•Section 2.3.3 describes power-on self-tests.
•Section 2.4 describes the boot sequence.
2.1 Interpreting System LEDs
DEC 4000 AXP systems have several diagnostic LEDs that indicate whether
modules and subsystems have passed self-tests. The power system controller
constantly monitors the power supply subsystem and can indicate several types
of failures. The system LEDs are used primarily to troubleshoot power problems
and problems getting to the console program.
This section describes the function of each of the following types of system LEDs,
and what action to take when a failure is indicated.
•Power supply LEDs
•Operator control panel (OCP) LEDs
•I/O panel LEDs
•Futurebus+ option LEDs
•Storage device LEDs
2.1.1 Power Supply LEDs
The power supply LEDs (Figure 2–1) are used to indicate the status of the
components that make up the power supply subsystem. The following types of
failures will cause the power system controller to shut down the system:
•Power system controller (PSC) failure
•Fan failure
•Overtemperature condition
•Power regulator failures (indicated by the DC3 or DC5 failure LEDs)
•Front end unit (FEU) failure
Note
The AC circuit breaker will also shut down the system. If a power surge
occurs, the breaker will trip, causing the switch to return to the off
position (0). If the circuit breaker trips, wait 30 seconds before setting the
switch to the on position (1).
Refer to Table 2–1 for information on interpreting the LEDs and determining
what actions to take when a failure is indicated.
Figure 2–2 shows the local disk converter (LDC) and fan locations as they
correspond to the fault ID display.
Figure 2–1 Power Supply LEDs
[Figure: power supply subsystem showing the FEU, PSC, DC5, and DC3 units, the AC circuit
breaker, and the following indicators: AC Present, FEU OK, FEU Failure, PSC OK, PSC
Failure, Overtemperature Shutdown, Fan Failure, Disk Power Failure, Fault ID Display,
DC5 OK, DC5 Failure, DC3 OK, and DC3 Failure.]
Table 2–1 Interpreting Power Supply LEDs

Front End Unit (FEU)

Indicator: AC Present
Meaning: When lit, indicates AC power is present at the AC input connector (regardless
of circuit breaker position).
Action on Error: If AC power is not present, check the power source and power cord. If
the system will not power up and the AC LED is the only lit LED, check if the system AC
circuit breaker has tripped. Replace the front end unit (Chapter 5) if the system circuit
breaker is broken.

Indicator: FEU OK
Meaning: When lit, indicates DC output voltages for the FEU are above the specified
minimum.

Indicator: FEU Failure
Meaning: When lit, indicates DC output voltages for the FEU are less than the specified
minimum.
Action on Error: Replace front end unit (Chapter 5).

Power System Controller (PSC)

Indicator: PSC OK
Meaning: When blinking, indicates the PSC is performing power-up self-tests. When steady,
indicates the PSC is functioning normally.

Indicator: PSC Failure
Meaning: When lit, indicates the PSC has detected a fault in itself.
Action on Error: Replace power system controller (Chapter 5).

Indicator: Disk Power Failure
Meaning: When lit, indicates a disk power problem for the storage compartment specified
in the hexadecimal fault ID display. The most likely failing unit is the local disk
converter, but a shorting cable or drive could also be at fault.
Action on Error: To isolate the local disk converter, disconnect the drives on the
specified bus and then power up the system. If the Disk Power Failure LED lights with the
drives disconnected, replace the failing local disk converter (Chapter 5). Refer to
Figure 2–2 to locate the local disk converter specified by the fault ID display. A is the
top compartment, D is the bottom compartment.

Indicator: Fan Failure
Meaning: When lit, indicates a fan has failed or a cable guide is not properly secured.
The failure is identified by a number displayed in the hexadecimal fault ID display.
Action on Error: Refer to Figure 2–2 to locate the failure specified by the fault ID
display. Replace the failing fan (Chapter 5).

Indicator: Overtemperature Shutdown
Meaning: When lit, indicates the PSC has shut down the system due to excessive internal
temperature.
Action on Error: Set the AC circuit breaker to off (0) and wait one minute before turning
on the system. Make sure the air intake is unobstructed and that the room temperature
does not exceed maximum requirement as described in the DEC 4000 Site Preparation
Checklist.

DC–DC Converter (DC3)

Indicator: DC3 OK
Meaning: When lit, indicates that all the DC3 output voltages are within specified
tolerances.

Indicator: DC3 Failure
Meaning: When lit, indicates that one of the output voltages is outside specified
tolerances.
Action on Error: Replace the DC3 converter (Chapter 5).

DC–DC Converter (DC5)

Indicator: DC5 OK
Meaning: When lit, indicates the DC5 output voltage is within specified tolerances.

Indicator: DC5 Failure
Meaning: When lit, indicates the DC5 output voltage is outside specified tolerances.
Action on Error: Replace the DC5 converter (Chapter 5).

Figure 2–2 LDC and Fan Unit Locations and Error Codes
[Figure: locations of Local Disk Converters A, B, C, and D and of Fans 1 through 4. Fans
are located behind the cable guides.]

Fan Error Codes
1 - Rear left
2 - Rear right
3 - Front left
4 - Front right
9 - A cable guide is not properly secured or two or more fans have failed.
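
As an illustration only, the fault ID codes shown in Figure 2–2 can be expressed as a small lookup. The sketch below is written in Python, which is not part of the DEC 4000 AXP tool set. The function name `decode_fault_id` and the "fan"/"disk" selector values are hypothetical names introduced here; the code meanings are taken from Figure 2–2 and Table 2–1.

```python
# Illustrative sketch only: the names below are hypothetical, not DEC-supplied.

# Fan error codes as listed in Figure 2-2.
FAN_CODES = {
    "1": "Rear left fan",
    "2": "Rear right fan",
    "3": "Front left fan",
    "4": "Front right fan",
    "9": "A cable guide is not properly secured or two or more fans have failed",
}

# Local disk converters as labeled in Figure 2-2. Table 2-1 states that
# A is the top compartment and D is the bottom compartment.
LDC_CODES = {
    "A": "Local disk converter A (top compartment)",
    "B": "Local disk converter B",
    "C": "Local disk converter C",
    "D": "Local disk converter D (bottom compartment)",
}


def decode_fault_id(failure_led: str, display: str) -> str:
    """Translate a fault ID display character into a location.

    failure_led is "fan" when the Fan Failure LED is lit and "disk" when
    the Disk Power Failure LED is lit (selector values are an assumption
    made for this sketch).
    """
    tables = {"fan": FAN_CODES, "disk": LDC_CODES}
    if failure_led not in tables:
        raise ValueError(f"unknown failure LED: {failure_led!r}")
    code = display.strip().upper()
    return tables[failure_led].get(code, f"Unrecognized fault ID {code!r}")
```

The failure LED is passed separately because the same fault ID display is shared by the Fan Failure and Disk Power Failure indicators, so the display value alone does not say which table applies.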
2.1.2 Operator Control Panel LEDs
The OCP LEDs (Figure 2–3) are used to indicate the progress and result of
self-tests for Futurebus+, memory, CPU, and I/O modules. These LEDs are
the primary diagnostic tool for troubleshooting problems getting to the console
program.
Note
A failure in the CPU, memory module, or I/O module can cause both the
I/O and CPU LEDs or I/O and memory LEDs to indicate self-test failures
even if only one of the modules is failing. If two LEDs are lit, the I/O
module is the more likely source of the failure.
Figure 2–3 OCP LEDs
[Figure: operator control panel showing the DC on/off switch, the DC Power LED, the Reset
and Halt buttons, and the self-test status LEDs for Futurebus+ 6–1, MEM, CPU, and I/O.]
Refer to Table 2–2 for information on interpreting the OCP LEDs and
determining what actions to take when a failure is indicated.
Figure 2–4 shows the module locations as they correspond to the LEDs.
Table 2–2 Interpreting OCP LEDs

Indicator: Futurebus+ 6–1
Meaning: Remains lit if a Futurebus+ option has failed power-on diagnostics.
Action on Error: Examine LEDs on the Futurebus+ options to determine which option to
replace.

Indicator: MEM 3, 2, 1, 0
Meaning: Remains lit if a memory module has failed power-on diagnostics. If no good
memory is found, all four memory LEDs may remain lit even if there are fewer than four
memory modules present.
Action on Error: Replace the failed module (Chapter 5).

Indicator: CPU 0, 1
Meaning: Remains lit if a CPU module has failed power-on diagnostics.
Action on Error: Replace the failed module (Chapter 5).

Indicator: I/O
Meaning: Remains lit if the I/O module has failed power-on diagnostics.
Action on Error: Replace the I/O module (Chapter 5).

Indicator: DC Power
Meaning: When lit, indicates the proper DC power is present. When unlit, indicates no DC
power is present.
Action on Error: If no DC power is indicated, set the DC on/off switch to on (1) and
examine the power supply LEDs.
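
The OCP self-test LED rules can be sketched as a short function. This is a hedged illustration in Python, not a DEC-supplied utility: the function name `suggested_actions` and the LED group names used as keys are assumptions. The actions come from Table 2–2, and the ordering reflects the note in this section that the I/O module is the more likely source when two LEDs are lit.

```python
from typing import Iterable, List

# LED groups in the order they should be suspected. I/O comes first
# because the note in Section 2.1.2 says the I/O module is the more
# likely source of the failure when two LEDs are lit.
KNOWN_LEDS = ("I/O", "CPU", "MEM", "Futurebus+")

# Actions on error, taken from Table 2-2.
ACTIONS = {
    "I/O": "Replace the I/O module (Chapter 5)",
    "CPU": "Replace the failed CPU module (Chapter 5)",
    "MEM": "Replace the failed memory module (Chapter 5)",
    "Futurebus+": "Examine LEDs on the Futurebus+ options "
                  "to determine which option to replace",
}


def suggested_actions(lit_leds: Iterable[str]) -> List[str]:
    """Return actions for the LED groups that remain lit after self-test."""
    lit = set(lit_leds)
    unknown = lit - set(KNOWN_LEDS)
    if unknown:
        raise ValueError(f"unknown LED group(s): {sorted(unknown)}")
    # Preserve the suspicion order defined by KNOWN_LEDS.
    return [ACTIONS[name] for name in KNOWN_LEDS if name in lit]
```

With both the CPU and I/O groups lit, the function lists the I/O module action first and the CPU module action second.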
Figure 2–4 Module Locations Corresponding to OCP LEDs
[Figure: module locations for Futurebus+ options 6–1, MEM 3–0, CPU 0 and 1, and the I/O
module.]
2.1.3 I/O Panel LEDs
The I/O panel LEDs (Figure 2–5) are used to indicate the status of ThinWire and
thickwire (standard) Ethernet fuses.
Refer to Table 2–3 for information on interpreting the LEDs and determining
what actions to take when a failure is indicated.
Figure 2–5 I/O Panel LEDs
[Figure: I/O panel showing the ThinWire Ethernet Fuse OK and Thickwire Ethernet Fuse OK
LEDs for Ethernet ports 0 and 1, with labels F1 through F4.]
Table 2–3 Interpreting I/O Panel LEDs

Indicator: ThinWire Ethernet Fuse OK
Meaning: When lit, indicates ThinWire fuse is good; unlit indicates fuse has blown.
Action on Error: Replace fuse (refer to Chapter 5).

Indicator: Thickwire Ethernet Fuse OK
Meaning: When lit, indicates thickwire fuse is good; unlit indicates fuse has blown.
Action on Error: Replace fuse (refer to Chapter 5).
2.1.4 Futurebus+ Option LEDs
The Futurebus+ option LEDs (Figure 2–6) are used to indicate the progress and
result of self-tests for a specific Futurebus+ option.
Refer to Table 2–4 for information on interpreting the LEDs and determining
what actions to take when a failure is indicated.
Figure 2–6 Futurebus+ Option LEDs
[Figure: Futurebus+ option showing the Fault and Run LEDs.]
Table 2–4 Interpreting Futurebus+ Option LEDs

Indicator: Fault
Meaning: The Fault indicator lights during self-tests. If it remains lit, the module has
failed self-tests.
Action on Error: Replace module.

Indicator: Run
Meaning: The Run indicator blinks during self-tests and remains lit if the module passes
self-tests.

2.1.5 Storage Device LEDs
Storage device LEDs are used to indicate the status of the device. The LEDs for
fixed-media storage devices are shown in Figure 2–7 and Figure 2–8. Refer to
the DEC 4000 Model 600 Series Owner’s Guide for information on LEDs for the
removable-media devices.
Refer to Table 2–5 for information on interpreting the LEDs and determining
what actions to take when a failure is indicated.
Figure 2–7 Fixed-Media Mass Storage LEDs (SCSI)
[Figure: Fast SCSI, 3.5-inch SCSI, and 5.25-inch SCSI storage compartments showing the
Fault, Online, and Local Disk Converter OK LEDs and the SCSI terminator.]
Figure 2–8 Fixed-Media Mass Storage LEDs (DSSI)
[Figure: 3.5-inch DSSI and 5.25-inch DSSI storage compartments showing the Fault, Online,
Write Protect, Run/Ready, and Local Disk Converter OK LEDs and the DSSI terminator with
LED.]
Table 2–5 Interpreting Fixed-Media Mass Storage LEDs

Indicator: Fault
Meaning: When lit, indicates an error condition in the device. The Fault indicator may
light temporarily during self-tests.
Action on Error: Run device RBD tests and internal device tests to determine the nature
of the error, and replace device.

Indicator: Online
Meaning: DSSI: When lit, indicates the device is on line and available for use. Under
normal operation, flashes as seek operations are performed.
SCSI: Flashes as seek operations are performed; indicates drive activity.

Indicator: DSSI Terminator
Meaning: When lit, indicates DSSI termination power is present.
Action on Error: If the DSSI terminator LED does not light, check the DSSI bus
connections for that bus. If bus connections seem secure, the local disk converter module
or DC5 converter may need to be replaced (Section 5.2):
•Local disk converters (located in the fixed-media storage compartments) supply
termination power for fixed-media storage devices.
•The DC5 converter (part of the power supply subsystem) supplies termination power for
storageless fixed-media compartments.

Indicator: Local Disk Converter OK
Meaning: When lit, indicates local disk converter for the specified storage compartment
has power (this LED is located on the local disk power supply module behind the front
panel of the storage compartment).
Action on Error: Confirm that the system power supply is working properly (by checking
power supply LEDs). Replace the local disk converter module (Section 5.2).

2.2 Power-Up Screens
During power-up self-tests a screen similar to the one shown in Figure 2–9 is
displayed on the console terminal. The screen shows the status and result of the
self-tests.
Note
A power-on self-test failure indicated under Storage A–E may represent
a failure of an embedded storage adapter (A–E) or failure of a drive on
the specified bus. Check the console event log for additional information
(Section 2.2.1).
Power-on self-test failures indicated for all six Futurebus+ slots indicate
a failure of the Futurebus+ bridge on the I/O module. Replace the I/O
module in the event that all six Futurebus+ slots show failures.
When the power-up diagnostics are completed, a second screen similar to the
one shown in Figure 2–10 is displayed. This screen provides configuration
information for the system.
Figure 2–10 Sample Power-Up Configuration Screen
Console Vn.n-nnnn    VMS PALcode Xn.nnX, OSF PALcode Xn.nnX
CPU 0
2.2.1 Console Event Log
DEC 4000 AXP systems maintain a console event log consisting of status
messages received during power-on self-tests. If there are problems during
power-up, standard error messages may be embedded in the console event log. The
console event log can be displayed from console mode. Use the set screen_mode off
command to display the console event log during power-up, rather than the two
power-up screens.
The following example shows an abbreviated console event log that contains two
standard error messages: The first (a hard error) indicates a failure with storage
bus B. This failure could be caused by a bad LDC, improperly seated storage
drawer, or a disconnected power cable within the storage drawer. The second (a
soft error) indicates a SCSI continuity card is missing from the removable-media
storage compartment.
*** End of Error ***
device mud9.5.0.3.0 (TF85) found on pud0.5.0.3.0
>>>
2.2.2 Mass Storage Problems Indicated at Power-Up
Mass storage failures at power-up are usually indicated in one of two ways:
•The power-up screens report a storage adapter port failure (indicated by an
‘‘F’’).
•One or more drives are missing from the configuration screen display (or too
many drives are displayed).
Figures 2–11 and 2–12 provide a flowchart for troubleshooting fixed-media mass
storage problems indicated at power-up. Use the flowchart to diagnose the likely
cause of the problem. Table 2–6 lists the symptoms and corrective action for each
of the possible problems.
Figure 2–11 Flowchart for Troubleshooting Fixed-Media Problems
•Does the disk drive have power?
–Check the Disk Power Failure LED on the PSC. LED on: likely LDC failure. LED off: continue.
–Check the LDC OK LED on the storage compartment front panel. LED off: LDC failure.
LED on: continue.
•Has the disk drive failed?
–Check the drive’s fault LED. LED on (steady): drive failure. LED flashing: drive is
performing extended calibration; wait for tests to complete. LED off: continue.
•Are bus node ID plugs improperly set?
–Check that all drives on the bus have unique bus node ID numbers (no duplicates).
Duplicate bus node IDs: configuration rule violation.
–Check that no drive is set to bus node ID 7 (reserved for host ID). Drive set to
host ID 7: configuration rule violation. Otherwise, continue.
•Is the storage drawer properly seated?
–Power down, remove drawer and inspect connectors, reseat drawer and power up.
Problems solved: drawer not properly seated. Problems persist: continue.
Figure 2–12 Flowchart for Troubleshooting Fixed-Media Problems (Continued)
•Are cables loose or missing?
–Power down, remove drawer and check all cable connections, reseat drawer and power up.
Problems persist: continue.
•Is the storage bus terminated?
–Check that a terminator is in place.
–Check that terminator power is present. For DSSI buses, check that the terminator LED
is on. For SCSI buses, use a voltmeter on the port connector (termination power is
supplied by pin 38, ground on pin 1). Power present: continue.
•Is the I/O module the source of the problem?
–Swap the failing drive drawer to another compartment.
–Likely problem with drive, drawer, or cables. Check again before continuing.
•Is the backplane the source of the problem?
–Eliminate all of the preceding problem sources before suspecting the backplane.
The backplane is the least likely to fail.
–Disassemble the system as described in Section 5.4. Inspect the two backplane
interconnect cables.
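Table 2–6 Fixed-Media Mass Storage Problems

Problem: LDC failure
Symptom: Disk power failure LED on PSC is on. LDC OK LED on storage compartment front
panel is off. Power-up screen reports a failing storage adapter port.
Corrective action: Replace LDC.

Problem: Drive failure
Symptom: Fault LED for drive is on (steady).
Corrective action: Replace drive.

Problem: Duplicate bus node ID plugs (or a missing plug)
Symptom: Drives with duplicate bus node ID plugs are missing from the configuration
screen display. A drive with no bus node ID plug defaults to zero.
Corrective action: Correct bus node ID plugs.

Problem: Bus node ID set to 7 (reserved for host ID)
Symptom: Valid drives are missing from the configuration screen display. One drive may
appear seven times on the configuration screen display.
Corrective action: Correct bus node ID plugs.

Problem: Storage drawer not properly seated
Symptom: Power-up screen reports a failing storage adapter port.
Corrective action: Remove drawer and check its connectors. Reseat drawer.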
Table 2–6 (Cont.) Fixed-Media Mass Storage Problems

Problem: Missing or loose cables
Symptom:
•Cable, storage device to ID panel: Bus node ID defaults to zero; online LEDs do not
come on.
•Flex circuit, LDC to storage interface module: Disk power failure LED on PSC is on;
LDC OK LED on storage compartment front panel is off; and power-up screen reports a
failing storage adapter port.
•Cable, LDC to storage interface module: Power-up screen reports a failing storage
adapter port; drive LEDs do not come on at power-up.
•Cable, LDC to storage device: Drive does not show up in configuration screen display.
Corrective action: Remove storage drawer and inspect cable connections.

Problem: Terminator missing
Symptom: Read/write errors in console event log; storage adapter port may fail.
Corrective action: Attach terminator to connector port.

Problem: No termination power
Symptom: DSSI terminator LED is off, or no termination voltage measured at SCSI
connector (pin 38, ground pin 1); read/write errors; storage adapter port may fail.
Corrective action: Replace LDC (termination power source for fixed-media storage
compartments). Replace DC5 converter (termination power source for storageless
fixed-media storage compartments).

Problem: I/O module failure
Symptom: The storage drawer exhibits no problems when moved to another compartment.
Corrective action: Replace I/O module.

Problem: Backplane failure
Symptom: Replacing the I/O module does not solve problem. The port continues to fail
and the problem is not with the storage drawer.
Corrective action: Disassemble system and inspect backplane interconnect cables. If
the cables and cable connections do not appear to be the problem, replace the backplane.
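The bus node ID rules used in the flowchart and table (unique IDs on each bus, ID 7
reserved for the host, a missing plug defaulting to zero) can be checked mechanically.
The following Python fragment is only an illustrative sketch: the function name, the
drive labels, and the dictionary layout are hypothetical and are not part of the
console firmware.

```python
from collections import Counter

HOST_NODE_ID = 7     # reserved for the host ID
DEFAULT_NODE_ID = 0  # a drive with no bus node ID plug defaults to zero


def check_bus_node_ids(plugs):
    """Return a list of configuration rule violations for one storage bus.

    `plugs` maps a (hypothetical) drive label to its bus node ID plug
    value, or to None when no plug is installed.
    """
    # Apply the default for drives that have no plug installed.
    effective = {drive: DEFAULT_NODE_ID if plug is None else plug
                 for drive, plug in plugs.items()}
    counts = Counter(effective.values())
    problems = []
    for drive, node_id in sorted(effective.items()):
        if node_id == HOST_NODE_ID:
            problems.append(f"{drive}: set to host ID {HOST_NODE_ID}")
        elif counts[node_id] > 1:
            problems.append(f"{drive}: duplicate bus node ID {node_id}")
    return problems
```

A drive without a plug is reported as a duplicate only if another drive on the same
bus is also at ID 0, which matches the "missing plug" symptom described in the table.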
Figures 2–13 and 2–14 provide a flowchart for troubleshooting removable-media
storage problems indicated at power-up. Use the flowchart to diagnose the likely
cause of the problem. Table 2–7 lists the symptoms and corrective action for each
of the possible problems.
Figure 2–13 Flowchart for Troubleshooting Removable-Media Problems
•Has the drive failed?
–Check the drive’s fault LED. LED on (steady): drive failure. LED off: continue.
•Are bus node ID plugs improperly set?
–Check that all drives on the bus have unique bus node ID numbers (no duplicates).
Duplicate bus node IDs: configuration rule violation.
–Check that no drive is set to bus node ID 7 (reserved for host ID). Drive set to
host ID 7: configuration rule violation. Otherwise, continue.
•Is the SCSI continuity card missing?
–Check the console event log for an error message indicating a SCSI continuity card
is missing. If the top and/or bottom storage compartments do not have half-height
drives, a SCSI continuity card is needed to continue the bus. Refer to Section 6.1.5.2
for more information. SCSI continuity card missing: SCSI continuity card missing.
–Half-height drive or SCSI continuity card present: If console event log reports
erroneously that the SCSI continuity card is missing, replace the Vterm module. The
Vterm module contains the logic for reporting SCSI continuity card errors. Otherwise,
continue.
Figure 2–14 Flowchart for Troubleshooting Removable-Media Problems (Continued)
•Are cables loose or missing?
–Power down, remove drive and check all cable connections, replace drive and power up.
Problems persist: continue.
•Is the storage bus terminated?
–Check that a terminator is in place.
–Check that terminator power is present. Use a voltmeter on the port connector
(termination power is supplied by pin 38, ground on pin 1). Power present: continue.
•Is the I/O module the source of the problem?
–Replace the I/O module.
–Likely problem with drive or cables. Check again before continuing.
•Is the backplane the source of the problem?
–Eliminate all of the preceding problem sources before suspecting the backplane.
The backplane is the least likely to fail.
–Disassemble the system as described in Section 5.4. Inspect the two backplane
interconnect cables. Cable connections are loose or damaged: backplane interconnect
cable failure.
–Replace backplane assembly as described in Section 5.4.
Table 2–7 Removable-Media Mass Storage Problems

Problem: Drive failure
Symptom: Fault LED for drive is on (steady).
Corrective action: Replace drive.

Problem: Duplicate bus node ID plugs (or a missing plug)
Symptom: Drives with duplicate bus node ID plugs are missing from the configuration
screen display. A drive with no bus node ID plug defaults to zero.
Corrective action: Correct bus node ID plugs.

Problem: Bus node ID set to 7 (reserved for host ID)
Symptom: Valid drives are missing from the configuration screen display. One drive may
appear seven times on the configuration screen display.
Corrective action: Correct bus node ID plugs.

Problem: SCSI continuity card missing
Symptom: Power-up screen reports a failing storage adapter port; console event log
contains soft error message reporting a SCSI continuity card is missing; drives on
Bus E are not displayed on configuration screen; possible read/write errors.
Corrective action: Attach SCSI continuity card (Section 6.1.5.2). If console
erroneously reports SCSI continuity card as missing, replace the Vterm module. The
Vterm module contains the logic for reporting SCSI continuity card errors.

Problem: Missing or loose cables
Symptom:
•Cable, storage device to ID panel: Bus node ID defaults to zero; online LED does not
come on.
•Cable, power: Drive does not show up in configuration screen display.
Corrective action: Remove device and inspect cable connections.

Problem: Terminator missing
Symptom: Read/write errors in console event log; storage adapter port may fail.
Corrective action: Attach terminator to connector port.

Problem: Vterm module failure
Symptom: No termination voltage measured at Bus E SCSI connector (pin 38, ground pin
1); read/write errors; storage adapter port may fail; or console erroneously reports
SCSI continuity card as missing.
Corrective action: Replace Vterm module (termination power source for removable-media
storage compartment).
Table 2–7 (Cont.) Removable-Media Mass Storage Problems

Problem: I/O module failure
Symptom: Problems persist after eliminating the above problem sources.
Corrective action: Replace I/O module.

Problem: Backplane failure
Symptom: Replacing the I/O module does not solve problem; the port continues to fail
and the problem is not with the device or cables.
Corrective action: Disassemble system and inspect backplane interconnect cables. If
the cables and cable connections do not appear to be the problem, replace the backplane.
2.2.3 Robust Mode Power-Up
Robust mode allows you to power up without initiating drivers or running
power-up diagnostics.
Robust mode permits you to get to the console program when one of the following
is the cause of a problem getting to the console program under normal power-up:
•An error in the nonvolatile nvram file
•An incorrect environment variable setting
•A driver error
Note
The console program has limited functionality in robust mode.
Once in console mode, you can:
•Edit the nvram file (using the edit command)
•Assign a correct value to an environment variable (using the show and set
commands)
•Start individual classes or sets of drivers, called phases (using the
init -driver # command). The pound sign (#) is the phase number 2, 3, 4, or 5,
and each phase is started individually in increasing order.
Note
The nonvolatile file, nvram, is shipped from the factory with no contents.
The customer can use the edit command to create a customized script or
command file that is executed as the last step of every power-up.
To set the system to robust mode, set the baud rate select switch located behind
the OCP to 0, as shown in Section 6.5. The robust mode setting uses a 9600
console baud rate.
2.3 Power-Up Sequence
During the DEC 4000 AXP power-up sequence, the power supplies are stabilized
and tested and the system is initialized and tested via the firmware power-on
self-tests.
The power-up sequence includes the following:
•Power supply power-up:
–Includes AC power-up and power supply self-test.
–Includes DC power-up and power supply self-tests.
•Two sets of power-on diagnostics:
–Serial ROM diagnostics
–Console firmware-based diagnostics
2.3.1 AC Power-Up Sequence
With no AC power applied, no energy is supplied to any part of the enclosure. AC
power is applied to the system with the AC circuit breaker on the front end unit
(FEU) of the power supply (see Figure 2–1). With just AC power applied, the AC
present LED is the only LED illuminated on the power supply.
Figure 2–15 provides a description of the AC power-up sequence.
Failures during AC power-up are indicated by the power supply subsystem LEDs.
Additional error information is displayed on the PSC Fault ID display. Refer to
Appendix B for PSC fault display information.
Figure 2–15 AC Power-Up Sequence
1. AC plug is inserted into wall outlet.
2. AC circuit breaker is set to on (1).
3. AC power (country-specific voltage) enters FEU module.
4. FEU creates two +48V outputs:
–BUS_DIRECT +48 VDC output (always on) immediately goes to +48 DC inputs on DC5, DC3
and PSC modules.
–BUS_SWITCHED (+V-V) +48 VDC output (off) goes to +48 VDC input on LDCs and
Futurebus+ modules.
5. +48 VDC enters PSC, energizes microprocessor power system.
6. PSC module verifies microprocessor power.
–OK: Continue.
–FAILED: Micro power system output not valid; FEU failure LED is turned on; PSC
microprocessor latches into shutdown.
7. PSC microprocessor performs internal self-test and PSC interface test.
–OK: PSC microprocessor self-test passed, PSC OK LED is turned on.
–FAILED: PSC microprocessor failed self-test; PSC failure LED is turned on; PSC
microprocessor latches into shutdown.
8. PSC verifies +48 VDC BUS_DIRECT output is okay, turns on FEU OK LED.
9. PSC verifies input voltage conditions: AC_POWER, FEU_HVDC, DIRECT_48V.
–All three are okay (AC power, FEU high voltage (HVDC), +48V BUS_DIRECT): Continue.
–If BUS_DIRECT and AC power are not okay, the system is in AC low line condition; PSC
waits for either output to become okay; no FEU LEDs are turned on.
–If +48 VDC BUS_DIRECT is not asserted, but AC power is okay, FEU has failed; FEU
failure LED comes on; PSC latches in shutdown.
10. PSC waits for power-up command; PSC loops in routine checking status (WAIT).
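The input-voltage check at the end of the AC power-up sequence can be summarized as a
small decision function. This Python sketch is illustrative only: the function name and
the returned labels are hypothetical, FEU_HVDC is left out for brevity, and the case in
which AC power is not okay while BUS_DIRECT is okay is not described in the figure, so
it is reported as such rather than guessed.

```python
def classify_ac_status(ac_power_ok, bus_direct_ok):
    """Return (condition, feu_failure_led_on, psc_state) for the AC check.

    The two failure branches follow the figure; the state labels are
    hypothetical names chosen for this sketch.
    """
    if ac_power_ok and bus_direct_ok:
        # Normal case: the PSC waits for the power-up command.
        return ("ok", False, "waiting for power-up command")
    if not ac_power_ok and not bus_direct_ok:
        # AC low line: no FEU LEDs are turned on and the PSC keeps waiting.
        return ("AC low line", False, "waiting for an output to become okay")
    if ac_power_ok and not bus_direct_ok:
        # AC is present but the FEU produced no +48 VDC BUS_DIRECT output.
        return ("FEU failed", True, "latched in shutdown")
    return ("not described in the figure", False, "unknown")
```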
2.3.2 DC Power-Up Sequence
DC power is applied to the system with the DC on/off switch on the operator
control panel.
Figures 2–16 and 2–17 provide a description of the DC power-up sequence.
Failures during DC power-up are indicated by the power supply subsystem LEDs.
Additional error information is displayed on the PSC Fault ID display. Refer to
Appendix B for PSC fault display information.
Figure 2–16 DC Power-Up Sequence
1. DC on/off switch set to on (1).
2. PSC starts DC power-up sequence and status check.
3. PSC checks temperature sensor.
–OK: Continue.
–FAILED: PSC fault LED is turned on; fans operate at full speed.
4. PSC checks overtemperature status (onboard).
–OK: Continue.
–FAILED: Fans kept running while orderly shutdown is initiated; Overtemperature
shutdown LED is turned on; fans turned off after 30-sec. delay.
5. PSC commands FEU to start fans by asserting FAN_POWER_ENABLE H. All fans are
started at maximum speed, rotation speed is verified.
–OK: Continue.
–FAILED: One or more fans fail to start; fans kept running while orderly shutdown is
initiated; Fan Failure LED is turned on and fan number is displayed; fans turned off
after 30-sec. delay.
6. PSC negates ASYNC_RESET signal to system CPU.
7. PSC commands FEU to turn on +48 VDC BUS_SWITCHED output.
8. PSC waits 100 ms for FEU to assert BUS_SWTCHD_OK signal.
–OK: FEU +48 VDC switched output (+V-V) goes to local disk converters (LDCs) and
Futurebus+ slots.
–FAILED: BUS_SWTCHD_OK did not assert within 100 ms; fans are turned off; FEU OK LED
is turned off; FEU failure LED is turned on; PSC latches in shutdown mode.
9. PSC commands DC3 to turn on +3.3 VDC output.
10. PSC waits 50 ms for +3.3 VDC to reach regulation.
–OK: Continue.
–FAILED: Output did not reach regulation in time; fans and active DC outputs are
turned off; Failure LED on DC3 module is turned on; PSC latches in shutdown mode.
11. PSC commands DC5 to turn on +5.1 VDC output (continued in Figure 2–17).
Figure 2–17 DC Power-Up Sequence (Continued)
1. PSC waits 30 ms for +5.1 VDC to reach regulation.
–OK: DC5 OK LED is turned on.
–FAILED: Output did not reach regulation in time; fans and active DC outputs are
turned off; Failure LED on DC5 module is turned on; PSC latches in shutdown mode.
2. PSC commands DC3 to turn on +2.1 VDC output.
3. PSC waits 20 ms for +2.1 VDC to reach regulation.
–OK: Continue.
–FAILED: Output did not reach regulation in time; fans and active DC outputs are
turned off; Failure LED on DC3 module is turned on; PSC latches in shutdown mode.
4. PSC commands DC3 to turn on +12 VDC output.
5. PSC waits 100 ms for +12 VDC to reach regulation.
–OK: DC3 OK LED is turned on. All DC outputs except LDCs are energized.
–FAILED: Output did not reach regulation in time; fans and active DC outputs are
turned off; Failure LED on DC3 module is turned on; PSC latches in shutdown mode.
6. PSC checks status of entire power system and delays for 45 ms.
–OK: Continue.
–FAILED: One of the above outputs has failed; failure mode indicated as described
above for the appropriate output.
7. PSC negates ASYNC_REST_L and asserts POK_H; begins powering LDCs.
8. Each LDC has an enable bit that, when asserted, starts a timer. The LDC has 50 ms
to respond with its LDC_OK signal asserted.
–OK: LDC_OK is received within 50 ms, a 5-sec. timeout is initiated for disk spin-up
time.
–FAILED: LDC did not respond in time allowed; Disk power failure LED is turned on;
corresponding letter (A, B, C, or D) is displayed on fault ID display; the next LDC
is tested.
9. System power-up is complete.
10. PSC microprocessor begins ongoing status monitoring.
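The DC power-up sequence is essentially an ordered list of outputs, each with a time
window in which it must become valid. The following Python sketch models only that
ordering and the time limits quoted in Figures 2–16 and 2–17. The function name, the
dictionary of simulated settle times, and the use of "less than or equal to" for the
window are assumptions made for illustration; this is not PSC firmware.

```python
# (output, module that supplies it, time allowed to become valid, in ms)
RAIL_SEQUENCE = [
    ("+48 VDC BUS_SWITCHED", "FEU", 100),
    ("+3.3 VDC", "DC3", 50),
    ("+5.1 VDC", "DC5", 30),
    ("+2.1 VDC", "DC3", 20),
    ("+12 VDC", "DC3", 100),
]


def sequence_rails(settle_ms):
    """Walk the outputs in order and stop at the first one that is late.

    `settle_ms` maps an output name to the simulated time (ms) it took to
    become valid, or to None if it never did.
    Returns (energized_outputs, failed_module); failed_module is None
    when every output came up within its window.
    """
    energized = []
    for output, module, limit_ms in RAIL_SEQUENCE:
        took = settle_ms.get(output)
        if took is None or took > limit_ms:
            # The figures show the failure LED of this module being lit
            # and the PSC latching in shutdown mode.
            return energized, module
        energized.append(output)
    return energized, None
```

Because the sequence stops at the first late output, the failed module returned here
corresponds to the module whose failure LED the figures describe as being turned on.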
2.3.3 Firmware Power-Up Diagnostics
After successful completion of AC and DC power-up sequences, the processor
performs its power-up diagnostics. These tests verify system operation, load the
system console, and test the kernel system, including all boot path devices. These
tests are performed as two distinct sets of diagnostics:
1. Serial ROM diagnostics—These tests are loaded from the serial ROM located
on the CPU module into the CPU’s instruction cache (I-cache). They check the
basic functionality of the system and load the console code from the FEPROM
on the I/O module into system memory.
Failures during these tests are indicated by LEDs on the operator control
panel.
2. Console firmware-based diagnostics—These tests are executed by the console
code. They test the kernel system, including all boot path devices.
Failures during these tests are reported to the console terminal (via the
power-up screen or console event log).
2.3.3.1 Serial ROM Diagnostics
The serial ROM diagnostics are loaded into the CPU’s I-cache from the serial
ROM on the CPU module. They test the system in the following order:
1. Test the CPU and backup cache located on the CPU module.
2. Test the CPU module’s system bus interface.
3. Check the access to the I/O module.
4. Locate the largest memory module in the system and test the first 4 MB of
memory on the module; the rest of the module is not tested at this stage. If there
is more than one memory module of the same size, the one closest to the CPU is
tested first.
If the memory test fails, the next largest memory module in the system
is tested. Testing continues until a good memory module is found. If a
good memory module is not found, the corresponding LEDs on the OCP are
illuminated and the power-up diagnostics are terminated.
5. After finding the first memory module with a good first 4 MB of memory,
the console program is loaded into memory from the FEPROM on the I/O
module. At this time control is passed to the console code and the console
firmware-based diagnostics are run.
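The memory selection rule in steps 4 and 5 (largest module first, ties broken by
distance from the CPU, falling back to the next largest on failure) can be sketched
as follows. The tuple layout and the first_4mb_ok callback are hypothetical stand-ins
for the serial ROM memory test and are used here only to illustrate the ordering.

```python
def pick_console_memory(modules, first_4mb_ok):
    """Choose the memory module that will receive the console program.

    `modules` is a list of (distance_from_cpu, size_mb) tuples and
    `first_4mb_ok` is a callable that reports whether the first 4 MB of
    a module passed its test. Returns the chosen tuple, or None when no
    good module is found (the case in which the OCP LEDs are lit and
    power-up diagnostics terminate).
    """
    # Largest size first; among equal sizes, the module closest to the CPU.
    ordered = sorted(modules, key=lambda m: (-m[1], m[0]))
    for module in ordered:
        if first_4mb_ok(module):
            return module
    return None
```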
2.3.3.2 Console Firmware-Based Diagnostics
Console firmware-based tests are executed once control is passed to the console
code in memory. They check the system in the following order:
1. Perform a complete check of system memory. If a system has more than one
memory module, the modules are checked in parallel.
2. Set memory interleave to maximize interleave factor across as many memory
modules as possible (one, two, or four-way interleaving). During this time the
console firmware is moved into backup cache on the primary CPU module.
After memory interleave is set, the console firmware is moved back into
memory.
Steps 3–7 may be completed in parallel.
3. Start the I/O drivers for mass storage devices and tapes. At this time a
complete functional check of the machine is made. After the I/O drivers
are started, the console program continuously polls the bus for devices
(approximately every 20 or 30 seconds).
4. Size, configure, and test the Futurebus+ options.
5. Exercise memory.
6. Check that the SCSI continuity card or a storage device is installed in the
removable-media storage bus (Bus E, connectors J6 and J7).
7. Run exercisers on the disk drives currently seen by the system.
Note
This step does not currently ensure that all disks in the system will be
tested or that any device drivers will be completely tested. To ensure
complete testing of disk devices, use the test command.
8. Enter console mode or boot the operating system. This action is determined
by the auto_action environment variable.
2.4 Boot Sequence
Bootstrapping is the process of loading a program image into memory and
transferring control to the loaded program. The system firmware uses the
bootstrap procedure defined by the Alpha AXP architecture and described in the
Alpha System Reference Manual. On a DEC 4000 AXP system, bootstrap can be
attempted only by the primary processor or boot processor. The firmware uses
device and optional filename information specified either on the command line or
in appropriate environment variables.
There are only three conditions under which the boot processor attempts to
bootstrap the operating system:
1. The boot command is typed on the console terminal.
2. The system is reset or powered up and AUTO_ACTION is set to boot (and the
halt switch is not set to halt).
3. An operating system restart is attempted and fails.
The firmware’s function in a bootstrap is to load a program into memory and
begin its execution. This program may be a primary bootstrap program, such as
Alpha Primary Boot (APB), Ultrixboot, or any other applicable program specified
by the user or residing in the boot block, MOP server, or TCP/IP server.
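The three bootstrap conditions listed above can be expressed as a single predicate.
This Python sketch is a reading aid, not console code: the event names ("boot_command",
"reset", "power_up", "restart_failed") and the function name are hypothetical labels
introduced for the example.

```python
def should_attempt_bootstrap(event, auto_action="halt", halt_switch_set=False):
    """Return True when the boot processor would attempt a bootstrap.

    `event` is one of the hypothetical labels "boot_command", "reset",
    "power_up", or "restart_failed".
    """
    if event == "boot_command":
        # Condition 1: the boot command was typed at the console.
        return True
    if event in ("reset", "power_up"):
        # Condition 2: auto_action is boot and the halt switch is not set.
        return auto_action.lower() == "boot" and not halt_switch_set
    if event == "restart_failed":
        # Condition 3: an operating system restart was attempted and failed.
        return True
    return False
```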
2.4.1 Cold Bootstrapping in a Uniprocessor Environment
This section describes a cold bootstrap in a uniprocessor environment. A system
bootstrap will be a cold bootstrap when any of the following occurs:
•Power is first applied to the system
•A console initialize command is issued and the auto_action environment
variable is set to ‘‘Boot.’’
•The boot_reset environment variable is set to ‘‘On.’’
•A cold bootstrap is requested by system software.
The console must perform the following steps in the cold bootstrap sequence:
1. Perform a system initialization
2. Size memory
3. Test sufficient memory for bootstrapping
4. Load PALcode
5. Build a valid Hardware Restart Parameter Block (HWRPB)
6. Build a valid Memory Data Descriptor Table in the HWRPB
7. Initialize bootstrap page tables and map initial regions
8. Locate and load the system software primary bootstrap image
9. Initialize processor state on all processors
10. Transfer control to the system software primary bootstrap image
The steps leading to the transfer of control to system software may be performed
in any order. The final state seen by system software is defined, but the
implementation-specific sequence of these steps is not. Prior to beginning a
bootstrap, the console must clear any internally pended restarts to any processor.
2.4.2 Loading of System Software
The console uses the boot_dev environment variable to determine the bootstrap
device and the path to that device. These environment variables contain lists of
bootstrap devices and paths; each list element specifies the complete path to a
given bootstrap device. If multiple elements are specified, the console attempts to
load a bootstrap image from each in turn.
The console uses the bootdef_dev, boot_dev, and booted_dev environment variables
as follows:
1. At console initialization, the console sets the bootdef_dev and boot_dev
environment variables to be equivalent. The format of these environment
variables is determined by the console implementation and is independent of
the console presentation layer; the value may be interpreted and modified by
system software.
2. When a bootstrap results from a boot command that specifies a bootstrap
device list, the console uses the list specified with the command. The console
modifies boot_dev to contain the specified device list. Note that this may
require conversion from the presentation layer format to the registered
format.
3. When a bootstrap is the result of a boot command that does not specify a
bootstrap device list, the console uses the bootstrap device list contained
in the bootdef_dev environment variable. The console copies the value of
bootdef_dev to boot_dev.
4. When a bootstrap is not the result of a boot command, the console uses the
bootstrap device list contained in the boot_dev environment variable. The
console does not modify the contents of boot_dev.
5. The console attempts to load a bootstrap image from each element of the
bootstrap device list. If the list is exhausted prior to successfully transferring
control to system software, the bootstrap attempt fails and the subsequent
console action is determined by auto_action.
6. The console indicates the actual bootstrap path and device used in the
booted_dev environment variable. The console sets booted_dev after loading
the primary bootstrap image and prior to transferring control to system
software. The booted_dev format follows that of a boot_dev list element.
7. If the bootstrap device list is empty (bootdef_dev or boot_dev is null),
the action is implementation-specific. The console may remain in console I/O
mode or attempt to locate a bootstrap device in an implementation-specific
manner.
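The rules above for choosing the bootstrap device list can be sketched in a few lines
of Python. Everything here is illustrative: the environment is modeled as a plain
dictionary of lists, try_load is a hypothetical stand-in for loading a bootstrap image,
and the device names used in the tests are made up.

```python
def select_boot_device_list(env, boot_command=False, command_list=None):
    """Apply rules 2 through 4 and return the device list to try."""
    if boot_command and command_list:
        # Rule 2: a list given with the boot command replaces boot_dev.
        env["boot_dev"] = list(command_list)
    elif boot_command:
        # Rule 3: no list given, so bootdef_dev is copied to boot_dev.
        env["boot_dev"] = list(env.get("bootdef_dev", []))
    # Rule 4: otherwise boot_dev is used unmodified.
    return list(env.get("boot_dev", []))


def bootstrap(env, try_load, boot_command=False, command_list=None):
    """Try each device in turn (rule 5) and record the one used (rule 6)."""
    for device in select_boot_device_list(env, boot_command, command_list):
        if try_load(device):
            env["booted_dev"] = device
            return device
    # List exhausted or empty: the follow-up action depends on auto_action
    # or is implementation-specific, so this sketch just reports failure.
    return None
```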
The boot_file and boot_osflags environment variables are used as default values
for the bootstrap filename and option flags. The console indicates the actual
bootstrap image filename (if any) and option flags for the current bootstrap
attempt in the booted_file and booted_osflags environment variables. The
boot_file default bootstrap image filename is used whenever the bootstrap
requires a filename and either none was specified on the boot command or the
bootstrap was initiated by the console as the result of a major state transition.
The console never interprets the bootstrap option flags, but simply passes them
between the console presentation layer and system software.
2.4.3 Warm Bootstrapping in a Uniprocessor Environment
The actions of the console on a warm bootstrap are a subset of those for a cold
bootstrap. A system bootstrap will be a warm bootstrap whenever the boot_reset
environment variable is set to ‘‘Off’’ (46 464F hexadecimal) and console internal
state permits.
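A hexadecimal value quoted next to an environment variable string can be checked
against the ASCII codes of its characters. The short Python sketch below assumes the
value is the ASCII string read as a little-endian integer; that reading is an
assumption made for this example and is not stated in this guide. The helper name is
hypothetical.

```python
def ascii_as_hex(text):
    """Return the ASCII bytes of `text`, read little-endian, as hex digits."""
    value = int.from_bytes(text.encode("ascii"), "little")
    return f"{value:X}"


print(ascii_as_hex("OFF"))  # prints 46464F
print(ascii_as_hex("ON"))   # prints 4E4F
```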
The console program performs the following steps in the warm bootstrap
sequence.
1. Locates and validates the Hardware Restart Parameter Block (HWRPB)
2. Locates and loads the system software primary bootstrap image
3. Initializes processor state on all processors
4. Initializes bootstrap page tables and maps initial regions
5. Transfers control to the system software primary bootstrap image
At warm bootstrap, the console does not load PALcode, does not modify the
Memory Data Descriptor Table, and does not reinitialize any environment
variables. If the console cannot locate and validate the previously initialized
HWRPB, the console must initiate a cold bootstrap. Prior to beginning a
bootstrap, the console must clear any internally pended restarts to any processor.
2.4.4 Multiprocessor Bootstrapping
Multiprocessor bootstrapping differs from uniprocessor bootstrapping primarily
in synchronization between processors. In a shared memory system, processors
cannot independently load and start system software; bootstrapping is controlled
by the primary processor.
DEC 4000 AXP systems always select CPU0 as the primary processor. The
secondary processor polls a mailbox for a start address.
2.4.5 Boot Devices
The supported boot devices shown in Table 2–8 are determined by the console’s
device drivers.
This chapter provides information on how to run system diagnostics.
•Section 3.1 describes how to run ROM-based diagnostics, including error
reporting utilities, and loopback tests.
•Section 3.2 describes how to run DSSI internal device tests.
•Section 3.3 describes the DEC VET verifier and exerciser software.
•Section 3.4 describes how to run UETP environmental test package software.
•Section 3.5 describes acceptance testing and initialization procedures.
3.1 Running ROM-Based Diagnostics
DEC 4000 AXP ROM-based diagnostics (RBDs), which are part of the console
firmware that is loaded from the FEPROM on the I/O module, offer many
powerful diagnostic utilities, including the ability to examine error logs from the
console environment and run system- or device-specific exercisers.
Unlike previous systems, DEC 4000 AXP RBDs rely on exerciser modules,
rather than functional tests to isolate errors. The exercisers are designed to run
concurrently, providing a maximum bus interaction between the console drivers
and the target devices.
The multitasking ability of the console firmware allows you to run diagnostics in
the background (using the background operator ‘‘&’’ at the end of the command).
You run RBDs by using console commands.
RBDs can be separated into four types of utilities:
1. System or device diagnostic test/exercisers using the test command
(Section 3.1.1).
The test command is the primary diagnostic for acceptance testing and
console environment diagnosis.
2. Three related commands are used to list system bus FRUs, report the status
of RBDs in progress, and report errors:
•The show fru command (Section 3.1.2) reports system bus FRUs, module
part numbers, hardware and software revision numbers, and summary
error information.
•The show_status command (Section 3.1.3) reports the error count and
status of RBD test/exercisers currently in progress.
•The show error command (Section 3.1.4) reports errors captured by
test-directed diagnostics (TDD), via the RBDs, and by symptom-directed
diagnostics (SDD), via the operating system.
3. Several commands allow you to perform extended testing and exercising of
specific system components. These commands are used for troubleshooting
and are not needed for routine acceptance testing:
•The memexer command (Section 3.1.5) exercises memory by running a
specified number of memory tests. The tests are run in the background.
•The memexer_mp command (Section 3.1.6) tests memory in a
multiprocessor system by running a specified number of memory exerciser
sets. The tests are run in the background.
•The exer_read command (Section 3.1.7) tests a disk by performing
random reads on the device.
•The exer_write command (Section 3.1.8) tests a disk by performing
random writes to the specified device.
•The fbus_diag command (Section 3.1.9) tests the Futurebus+ modules.
•The show_mop_counters command (Section 3.1.10) is used to read the
MOP counters.
•The clear_mop_counters command (Section 3.1.11) is used to reset the
MOP counters.
4. Loopback tests for testing console and Ethernet ports (Section 3.1.12)
In addition to the four utilities listed above, there are two diagnostic-related
commands. The kill and kill_diags commands (Section 3.1.13) are used to
terminate diagnostics.
3.1.1 test
The test command runs firmware diagnostics for the entire system, specified
subsystems, or specific devices. These firmware diagnostics are run in the
background. When the tests are successfully completed, the message “tests done”
is displayed. If any of the tests fail, a failure message is displayed.
If you do not specify an argument with the test command, all tests except those
for tape drives are performed.
Note
By default, no write tests are performed on disk; and read and write tests
are performed for tape drives. You need a scratch tape to test tape drives.
Early systems may not support RBD testing for tape drives.
All tests run concurrently for a minimum of 30 seconds. Tests complete when all
component tests have completed at least one pass. Test passes are repeated for
any component that completes its test before other components.
The run time of a test is proportional to the amount of memory to be tested
and the number of disk and tape drives to be tested. Running test all on a
system with fully configured 512-MB memory takes approximately 10 minutes to
complete.
[all] Firmware diagnostics will test/exercise all the devices present in
the system configuration: CPU, disk, tape, DSSI subsystem, SCSI
subsystem, Futurebus+ subsystem, memory, Ethernet, and I/O devices.
[cpu] Firmware diagnostics will test backup cache and memory coherency.
[disk] Firmware diagnostics will perform read-only tests of all disk drives
present in the system. One pass consists of seeking to a random block
on the disk and reading a packet of 2048 bytes and repeating until 512
packets are read.
[tape] Firmware diagnostics will perform read and write tests of all the tape
devices present in the system. Testing the tape drives requires that a
scratch tape be loaded in the tape drive.
[dssi] Firmware diagnostics will test the DSSI subsystem, including read-only
tests of all DSSI disks, and read-write tests for tape drives.
[scsi] Firmware diagnostics will test the SCSI subsystem, including read-only
tests of all SCSI disks and read-write tests for SCSI tape drives.
[fbus] Firmware diagnostics will instruct all Futurebus+ modules to perform
extended category default self-tests.
[memory] Firmware diagnostics will test memory modules present in the system.
[ethernet] Firmware diagnostics will test the Ethernet logic.
[device_list] Use the device_list argument to specify disk, tape, or Futurebus+ devices
to be tested. As with all the RBDs, the test uses the exer script to perform
read-only tests on the specified disk devices, and read-write tests for tape
drives. Legal devices are disk, tape, and Futurebus+ device names.
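The [disk] argument description fixes the amount of data read in one disk pass: 512 packets of 2048 bytes each. The following Python sketch works through that arithmetic. The function and constant names are illustrative only and are not part of the console firmware.

```python
# Worked arithmetic for one read-only disk pass of the test command.
# Both constants come from the [disk] argument description; the names
# themselves are hypothetical.
PACKET_BYTES = 2048      # bytes read per packet
PACKETS_PER_PASS = 512   # packets read before a pass is counted


def bytes_per_disk_pass(packet_bytes: int = PACKET_BYTES,
                        packets: int = PACKETS_PER_PASS) -> int:
    """Return the number of bytes read from one disk in a single pass."""
    return packet_bytes * packets


if __name__ == "__main__":
    total = bytes_per_disk_pass()
    print(f"{total} bytes per pass ({total // (1024 * 1024)} MB)")
```

Because a pass covers only about 1 MB chosen at random, a single pass samples the disk rather than reading all of it.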
SDD: Number of symptom-directed diagnostic events logged by the
operating system, or in the case of memory, by the operating system and
firmware diagnostics.
TDD: Number of test-directed diagnostic events logged by the firmware
diagnostics.
Futurebus+ option name, fban, where:
fb indicates Futurebus+ option
a indicates corresponding Futurebus+ slot a–f (1–6)
n indicates the Futurebus+ node number, 0 or 1
Description of Futurebus+ module
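The fban naming rule above can be illustrated with a short Python sketch that splits an option name into its slot and node. This is only an illustration: the function name is hypothetical, and the mapping of slot letters a through f onto slot numbers 1 through 6 in order is an assumption based on the "a–f (1–6)" wording.

```python
import re
from typing import NamedTuple


class FbusOption(NamedTuple):
    slot_letter: str   # Futurebus+ slot letter, a-f
    slot_number: int   # same slot as a number, 1-6 (assumed a=1 ... f=6)
    node: int          # Futurebus+ node number, 0 or 1


# "fb" prefix, one slot letter a-f, one node digit 0 or 1.
_FBAN = re.compile(r"^fb([a-f])([01])$")


def parse_fbus_option(name: str) -> FbusOption:
    """Split a Futurebus+ option name of the form fban into slot and node."""
    match = _FBAN.match(name.strip().lower())
    if match is None:
        raise ValueError(f"not a Futurebus+ option name: {name!r}")
    letter = match.group(1)
    slot_number = ord(letter) - ord("a") + 1
    return FbusOption(letter, slot_number, int(match.group(2)))


if __name__ == "__main__":
    print(parse_fbus_option("fbc1"))
```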
3.1.3 show_status
The show_status command reports one line of information per executing
diagnostic. The information includes ID, diagnostic program, device under
test, error counts, passes completed, bytes written and read.
Many of the diagnostics run in the background and provide information only
if an error occurs. Use the show_status command to display the status of these
diagnostics.
The following command string is useful for periodically displaying diagnostic
status information for diagnostics running in the background:
>>> while true;show_status;sleep n;done
Error count (hard and soft): Soft errors are not usually fatal; hard errors halt
the system or prevent completion of the diagnostics.
Bytes successfully written by diagnostic
Bytes successfully read by diagnostic
3.1.4 show error
The show error command reports error information based on the serial control
bus EEPROM data. Both the operating system and the ROM-based diagnostics
log errors to the serial control bus EEPROMs. This functionality provides the
ability to generate an error log from the console environment.
3.1.5 memexer
The memexer command tests memory by running a specified number of memory
exercisers. The exercisers are run in the background and nothing is displayed
unless an error occurs. Each exerciser tests all available memory in 2-MB blocks
for each pass.
To terminate the memory tests, use the kill command to terminate an individual
diagnostic or the kill_diags command to terminate all diagnostics. Use the
show_status display to determine the process ID when killing an individual
diagnostic test.
Synopsis:
memexer [number]
Arguments:
[number] Number of memory exercisers to start. The default is 1.
The number of exercisers, as well as the length of time for testing,
depends on the context of the testing. Generally, running 3–5 exercisers
for 15 minutes to 1 hour is sufficient for troubleshooting most memory
problems.
3.1.6 memexer_mp
The memexer_mp command tests memory cache coherency in a multiprocessor
system by running a specified number of memory exerciser sets. A set is a
memory test that runs on each processor checking alternate longwords. The
exercisers are run in the background and nothing is displayed unless an error
occurs.
To terminate the memory tests, use the kill command to terminate an individual
diagnostic or the kill_diags command to terminate all diagnostics. Use the
show_status display to determine the process ID when killing an individual
diagnostic test.
Synopsis:
memexer_mp [number]
Arguments:
[number] Number of memory exerciser sets to start. The default is 1.
The number of exercisers, as well as the length of time for testing,
depends on the context of the testing. Generally, running 2 or 3
exercisers for 5 minutes is sufficient.
Examples:
>>> memexer_mp 2
>>> kill_diags
>>>
3.1.7 exer_read
The exer_read command tests a disk by performing random reads of 2048 bytes
on one or more devices. The exercisers are run in the background and nothing is
displayed unless an error occurs.
The tests continue until one of the following conditions occurs:
1. All blocks on the device have been read for a passcount of d_passes (default is
1).
2. The exer_read process has been terminated via the kill or kill_diags
commands, or Ctrl/C.
3. The specified time has elapsed.
To terminate the read tests, enter Ctrl/C, or use the kill command to terminate
an individual diagnostic or the kill_diags command to terminate all diagnostics.
Use the show_status display to determine the process ID when killing an
individual diagnostic test.
Arguments:
[device_name] One or more device names to be tested. The default is du*.* dk*.* to test
all DSSI and SCSI disks that are on line.
Options:
[-sec seconds] Number of seconds to run exercisers. If you do not enter the number
of seconds, the tests will run until d_passes have completed (d_passes
default is 1).
If you want to test the entire disk, run at least one pass across the
disk. If you do not need to test the entire disk, run the test for 5 or 10
minutes.
Examples:
>>> exer_read
failed to send command to pkc0.1.0.2.0
failed to send Read to dkc100.1.0.2.0
*** Hard Error - Error #5
Diagnostic Name  ID        Device         Pass  Test  Hard/Soft  31-JUL-1992
exer_kid         00000175  dkc100.1.0.2   0     0     1    0     14:54:18
Error in read of 0 bytes at location 014DD400 from device dkc100.1.0.2.0
*** End of Error ***
>>>
3.1.8 exer_write
The exer_write command tests a disk by performing random writes on one or
more devices. The exercisers are run in the background and nothing is displayed
unless an error occurs.
The exer_write tests cause the device to seek to a random block and read a
2048-byte packet of data, write that same data back to the same location on the
device, read the data again, and compare it to the data originally read.
The tests continue until one of the following conditions occurs:
1. All blocks on the device have been read for a passcount of d_passes (default is
1).
2. The exer_write process has been terminated via the kill or kill_diags
commands, or Ctrl/C.
3. The specified time has elapsed.
To terminate the write tests, enter Ctrl/C, or use the kill command to terminate
an individual diagnostic or the kill_diags command to terminate all diagnostics.
Use the show_status display to determine the process ID when killing an
individual diagnostic test.
Arguments:
[device_name] One or more device names to be tested. The default is du*.* dk*.* to test
all DSSI and SCSI disks that are on line.
Options:
[-sec seconds] Number of seconds to run exercisers. If you do not enter the number
of seconds, the tests will run until d_passes have completed (d_passes
default is 1).
If you want to test the entire disk, run at least one pass across the
disk. If you do not need to test the entire disk, run the test for 5 or 10
minutes.
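The read, write-back, re-read, and compare cycle that exer_write performs on each random block can be sketched in Python against an in-memory disk image. This is a simulation for illustration only, not the firmware's implementation: the function names are hypothetical, and the sketch assumes that blocks are aligned on 2048-byte packet boundaries.

```python
import io
import random

PACKET_BYTES = 2048  # packet size given in the exer_write description


def write_cycle(dev, block: int, packet_bytes: int = PACKET_BYTES) -> bool:
    """Read a packet, write it back in place, re-read it, and compare."""
    offset = block * packet_bytes
    dev.seek(offset)
    original = dev.read(packet_bytes)   # first read
    dev.seek(offset)
    dev.write(original)                 # write the same data back
    dev.seek(offset)
    reread = dev.read(packet_bytes)     # second read
    return reread == original           # compare with the data first read


def exercise(dev, total_blocks: int, cycles: int, seed=None) -> int:
    """Run the cycle on randomly chosen blocks; return the mismatch count."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(cycles):
        block = rng.randrange(total_blocks)  # seek target chosen at random
        if not write_cycle(dev, block):
            errors += 1
    return errors


if __name__ == "__main__":
    # A 16-packet scratch image stands in for the disk under test.
    image = io.BytesIO(bytes(range(256)) * 8 * 16)
    print("mismatches:", exercise(image, total_blocks=16, cycles=100, seed=1))
```

Because the data written is the data just read, a healthy device ends the cycle with its contents unchanged. A failure partway through the cycle can still leave a block damaged, which is why the command asks for confirmation before it runs.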
Examples:
>>> exer_write dka0
EXECUTING THIS COMMAND WILL DESTROY DISK DATA
OR DATA ON THE SPECIFIED DEVICES
Do you really want to continue? [Y/(N)]: y
failed to send command to pkc0.1.0.2.0
failed to send Read to dkc100.1.0.2.0
*** Hard Error - Error #5
Diagnostic Name  ID        Device         Pass  Test  Hard/Soft  31-JUL-1992
exer_kid         0000012e  dka0.0.0.0     0     0     1    0     15:21:22
Error in read of 0 bytes at location 017B3400 from device dka0.0.0.0.0
*** End of Error ***
failed to send command to pka0.0.0.0.0
failed to send Read to dka0.0.0.0.0
>>>
3.1.9 fbus_diag
The fbus_diag command is used to start execution of a diagnostic test script
onboard a specific Futurebus+ device.
The fbus_diag command uses the Futurebus+ standard test CSR interface to
initiate commands on specific Futurebus+ devices, waits for tests to complete, and
then reports the results to the console. If an error is reported by the Futurebus+
node, the diagnostic issues a dump buffer command to gain any available
extended information that will also be reported to the console.
Refer to documentation for the specific Futurebus+ option for the recommended
test procedures and form of the fbus_diag command to initiate module-resident
diagnostics. For more information, consult the Futurebus+ Handbook.
Test categories that require a buffer pointer in the argument CSR will have a
default buffer provided by this diagnostic if the user does not specify a buffer
address.
Process options and command line arguments are used to specify the specific
test or test script to be executed as well as the target Futurebus+ node for this
command.
node Specifies the device name of the Futurebus+ device to execute the test.
Use the command show device fb to display the Futurebus+ device
names.
[test_arg] Specifies an argument to be passed to the Futurebus+ node in the test
argument CSR. If this parameter is not specified and the category is
either extended or system, the routine allocates a buffer and passes the
buffer address through the test argument CSR.
Options:
[-rb] Randomly allocates from memzone on each pass with a block size of
4096.
[-p] (pass_count) Specifies the number of times to run the test. If 0, the
test runs continuously. This overrides the value of the pass_count
environment variable. In the absence of this option, pass_count is used.
The default for pass_count is 1.
[-st] (test_number) Specifies the test number to be run. The default is 0,
which runs the default tests in the category.
[-cat] (test_group) Specifies the test category to be executed. The possible
categories are as follows:
• Init: Initialization tests
• Extended: Extended tests (default category)
• System: System tests
• Manual: Manual tests
• x: Bit mask of the desired test categories
[-opt] (test_option) Specifies the Test Start CSR Option field bits to be set. The
possible option bits are as follows:
• Loop_error: Loop on test if an error is detected
• Loop_test: Loop on this test
• Cont_error: Continue if an error is detected
• x: Bit mask of the desired option bits
The default value for this qualifier is based on the current values in the
global environment variables as follows:
• Loop_test: 1 if D_PASSES == 0; 0 otherwise
• Loop_error: 1 if D_HARDERR == "Loop"; 0 otherwise
• Cont_error: 1 if D_HARDERR == "Continue"; 0 otherwise
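The default rules for the -opt bits can be written out as a small Python function. This is a sketch of the three rules listed above, not the firmware's code: the function name is hypothetical, and the exact, case-sensitive comparison against "Loop" and "Continue" is an assumption.

```python
def default_opt_bits(d_passes: int, d_harderr: str) -> dict:
    """Derive the default -opt bits from D_PASSES and D_HARDERR.

    Mirrors the three listed rules. String comparison is exact; whether
    the firmware ignores case is not stated in this guide.
    """
    return {
        "Loop_test": 1 if d_passes == 0 else 0,
        "Loop_error": 1 if d_harderr == "Loop" else 0,
        "Cont_error": 1 if d_harderr == "Continue" else 0,
    }


if __name__ == "__main__":
    # D_PASSES of 0 means run continuously, so Loop_test is set.
    print(default_opt_bits(0, "Continue"))
```

Loop_error and Cont_error both depend on the single D_HARDERR value, so at most one of them is set by default.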
3.1.10 show_mop_counter
The show_mop_counter command displays the MOP counters for the specified
Ethernet port.
Synopsis:
show_mop_counter [port_name]
Arguments:
[port_name] Specifies the Ethernet port for which to display MOP counters: eza0 for
Ethernet port 0; ezb0 for Ethernet port 1.
Block check: 0 Framing error: 0 Long frame: 0
Unknown destination: 36953 Data overrun: 0 No system buffer: 18
No user buffers: 0
>>>
3.1.11 clear_mop_counter
The clear_mop_counter command initializes the MOP counters for the specified
Ethernet port.
Synopsis:
clear_mop_counter [port_name]
Arguments:
[port_name] Specifies the Ethernet port for which to initialize MOP counters: eza0
for Ethernet port 0; ezb0 for Ethernet port 1.
Examples:
>>> clear_mop_counter eza0
>>>
3.1.12 Loopback Tests
Internal and external loopback tests can be used to isolate a failure by testing
segments of a particular control or data path. The loopback tests are a subset of
the RBDs.
3.1.12.1 Testing the Auxiliary Console Port (exer)
Using a loopback connector (29–24795–00) and a form of the exer command, you
can test the auxiliary serial port. Before running the loopback test, you must
set the tt_allow_login environment variable to 1; after the test is completed, you
must set tt_allow_login to 0.
Use the following commands to send a fixed data pattern through the auxiliary
serial port:
>>> set tt_allow_login 1
>>> exer -bs 1 -a "wRc" -p 0 tta1 &
>>> kill_diags
>>> set tt_allow_login 0
>>>
In the above command, the portion in quotes (the write, read, and compare
instruction) is case sensitive. The background operator &, at the end of the
command, causes the loopback tests to run in the background. Nothing is
displayed unless an error occurs.
To terminate the console loopback test, use the kill command to terminate the
individual diagnostic or the kill_diags command to terminate all diagnostics.
Use the show_status display to determine the process ID when killing an
individual diagnostic test.
3.1.12.2 Testing the Ethernet Ports (netexer)
The netexer command performs an Ethernet port-to-port MOP loopback test
between eza0 and ezb0. The network ports must be connected and terminated.
The loopback tests are run in the background. Nothing is displayed unless an
error occurs.
To terminate the Ethernet loopback test, use the kill command to terminate the
individual diagnostic or the kill_diags command to terminate all diagnostics.
Use the show_status display to determine the process ID when killing an
individual diagnostic test.
3.1.13 kill and kill_diags
The kill and kill_diags commands terminate diagnostics that are currently
executing.
• The kill command terminates a specified process.
• The kill_diags command terminates all diagnostics.
Synopsis:
kill_diags
kill [PID . . . ]
Arguments:
[PID . . . ] The process ID of the diagnostic to terminate. Use the show_status
command to determine the process ID.
3.1.14 Summary of Diagnostic and Related Commands
Table 3–1 provides a summary of the diagnostic and related commands.
Table 3–1 Summary of Diagnostic and Related Commands
Command             Function                                             Reference

Acceptance Testing
test                Test the entire system, subsystem, or specific       Section 3.1.1
                    device.

Error Reporting and Diagnostic Status
show fru            Reports system bus and Futurebus+ FRUs, module       Section 3.1.2
                    identification numbers, and summary error
                    information.
show_status         Reports the status of currently executing            Section 3.1.3
                    test/exercisers.
show error          Reports some errors captured by diagnostics and      Section 3.1.4
                    operating system.

memexer             Exercises memory by running a specified number of    Section 3.1.5
                    memory tests. The tests are run in the background.
memexer_mp          Tests memory in a multiprocessor system by running   Section 3.1.6
                    a specified number of memory exerciser sets. The
                    tests are run in the background.
exer_read           Tests a disk by performing random reads on the       Section 3.1.7
                    specified device.
exer_write          Tests a disk by performing random writes to the      Section 3.1.8
                    specified device.
fbus_diag           Initiates onboard tests for a specified Futurebus+   Section 3.1.9
                    device.
show_mop_counter    Displays the MOP counters for the specified          Section 3.1.10
                    Ethernet port.
clear_mop_counter   Initializes the MOP counters for the specified       Section 3.1.11
                    Ethernet port.

Loopback Testing
exer                Conducts loopback tests for the specified console    Section 3.1.12.1
                    port.
netexer             Conducts loopback tests for the Ethernet ports.      Section 3.1.12.2

Diagnostic-Related Commands
kill                Terminates a specified process.                      Section 3.1.13
kill_diags          Terminates all currently executing diagnostics.      Section 3.1.13
3.2 DSSI Device Internal Tests
A DSSI storage device may fail either during initial power-up or during normal
operation. In both cases, the failure is indicated by the lighting of the red Fault
LED on the drive’s front panel.
If the drive is unable to execute the Power-On Self-Test (POST) successfully, the
red Fault LED remains on and the Run/Ready LED does not come on, or both
LEDs remain on.
POST is also used to handle two types of error conditions in the drive:
• Controller errors are caused by the hardware associated with the controller
function of the drive module. A controller error is fatal to the operation of the
drive, since the controller cannot establish a logical connection to the host.
The red Fault LED comes on. If this occurs, replace the drive module.
• Drive errors are caused by the hardware associated with the drive control
function of the drive module. These errors are not fatal to the drive, since
the drive can establish a logical connection and report the error to the host.
Both LEDs go out for about 1 second, then the red Fault LED comes on. In
this case, run either DRVTST, DRVEXR, or PARAMS via the set host -dup
command, as described in the drive’s service documentation, to determine the
error code.
Three configuration errors are often the cause of drive errors:
• More than one node with the same bus node ID number
• Identical node names
• Identical MSCP unit numbers
The first error cannot be detected by software. Use the show device command
(Section 6.2) to display the second and third types of errors. This command
displays each device along with such information as bus node ID, unit number,
and node name.
If the device is connected to the front panel of the storage compartment, you
must install a bus node ID plug in the corresponding socket on the front panel. If
the device is not connected to the front panel, it reads the bus node ID from the
three-switch DIP switch on the side of the drive.
DSSI storage devices contain the following local programs:
DIRECT  A directory, in DUP-specified format, of available local programs
DRVTST  A comprehensive drive functionality verification test
DRVEXR  A utility that exercises the device
HISTRY  A utility that saves information retained by the drive, including the
        internal error log
ERASE   A utility that erases all user data from the disk
VERIFY  A utility that is used to determine the amount of “margin” remaining in
        on-disk structures
DKUTIL  A utility that displays disk structures and disk data
PARAMS  A utility that allows you to look at or change drive status, history,
        parameters, and the internal error log
Use the set host -dup command to access the local programs listed above.
Example 3–1 provides an abbreviated example of running DRVTST for a device
(Bus node 2 on Bus 0).
Caution
When running internal drive tests, always use the default (0 = No) in
responding to the ‘‘Write/read anywhere on medium?’’ prompt. Answering
Yes could destroy data.
.
GAMMA::MSCP$DUP 17-MAY-1992 12:55:42 DRVTST CPU=0 00:02:13.41 PI=2388
Test passed.
Stopping DUP server...
>>>
Example 3–2 provides an abbreviated example of running DRVEXR for an
RF-series disk (Bus node 2 on Bus 0).
Example 3–2 Running DRVEXR
>>> set host -dup -task drvexr dub0
Starting DUP server...
Copyright (C) 1992 Digital Equipment Corporation
Write/read anywhere on medium? [1=Yes/(0=No)]
Test time in minutes? [(10)-100]
Number of sectors to transfer at a time? [0 - 50]
Compare after each transfer? [1=Yes/(0=No)]:
Test the DBN area? [2=DBN only/(1=DBN and LBN)/0=LBN only]:
Refer to the RF-Series Integrated Storage Element Service Guide for instructions
on running these programs.
3.3 DEC VET
Digital’s DEC Verifier and Exerciser Tool (DEC VET) software is a multipurpose
system maintenance tool that performs exerciser-oriented maintenance testing.
DEC VET runs on both OpenVMS AXP and DEC OSF/1 operating systems.
DEC VET consists of a manager and exercisers that test devices. The DEC VET
manager controls these exercisers.
DEC VET exercisers test system hardware and the operating system.
DEC VET supports various exerciser configurations, ranging from a single device
exerciser to full system loading—that is, simultaneous exercising of multiple
devices.
Refer to the DEC Verifier and Exerciser Tool User’s Guide (AA–PTTMA–TE) for
instructions on running DEC VET.
3.4 Running UETP
The User Environment Test Package (UETP) tool is an OpenVMS AXP software
package designed to test whether the OpenVMS AXP operating system is
installed correctly. UETP software puts the system through a series of tests that
simulate a typical user environment, by making demands on the system that are
similar to demands that might occur in everyday use.
Run UETP after system installation, when OpenVMS AXP is running, or when
you need to run stress tests to pinpoint intermittent errors.
UETP is not a diagnostic program; it does not attempt to test every
feature exhaustively. When UETP runs to completion without encountering
unrecoverable errors, the system being tested is ready for use.
UETP exercises devices and functions that are common to all VMS and OpenVMS
AXP systems, with the exception of optional features, such as high-level language
compilers. The system components tested include the following:
• Most standard peripheral devices
• The system’s multiuser capability
• DECnet for OpenVMS AXP software
3.4.1 Summary of UETP Operating Instructions
This section summarizes the procedure for running all phases of UETP with
default values.
1. Log in to the SYSTEST account as follows:
Username: SYSTEST
Password:
Caution
Because the SYSTEST and SYSTEST_CLIG accounts have privileges,
unauthorized use of these accounts might compromise the security of your
system.
2. Make sure no user programs are running and no user volumes are mounted.
Caution
By design, UETP assumes and requests the exclusive use of system
resources. If you ignore this restriction, UETP may interfere with
applications that depend on these resources.
3. After you log in, check all devices to be sure that the following conditions
exist:
• All devices you want to test are powered up and are on line to the system.
• Scratch disks are mounted and initialized.
• Disks contain a directory named [SYSTEST] with OWNER_UIC=[1,7].
(You can create this directory with the DCL command
CREATE/DIRECTORY.)
• Scratch magnetic tape reels are physically mounted on each drive you
want tested and are initialized with the label UETP (using the DCL
command INITIALIZE). Make sure magnetic tape reels contain at least
600 feet of tape.
• Scratch tape cartridges have been inserted in each drive you want to test
and are initialized with the label UETP.
• Line printers and hardcopy terminals have plenty of paper.
• Terminal characteristics and baud rate are set correctly (see the user’s
guide for your terminal).
4. To start UETP, enter the following command and press Return:
$ @UETP
UETP responds with the following question:
Run "ALL" UETP phases or a "SUBSET" [ALL]?
Press Return to choose the default response enclosed in brackets. UETP
responds with three more questions in the following sequence:
How many passes of UETP do you wish to run [1]?
How many simulated user loads do you want [n]?
Do you want Long or Short report format [Long]?
Use the default values when acceptance testing with UETP. For stress testing,
enter your own values.
Press Return after each prompt. After you answer the last question, UETP
initiates its entire sequence of tests, which run to completion without further
input. The final message should look like the following:
END OF UETP PASS 1 AT 20-JUL-1992 16:30:09.38
*                                                   *
*****************************************************
5. After UETP runs, check the log files for errors. If testing completes
successfully, the OpenVMS AXP operating system is working properly.
Note
After a run of UETP, you should run the Error Log Utility to check for
hardware problems that can occur during a run of UETP. For information
on running the Error Log Utility, refer to the VMS Error Log Utility Manual.
If UETP does not complete successfully, refer to Section 3.4.11.
3.4.2 System Disk Requirements
Before running UETP, be sure that the system disk has at least 1200 blocks
available. Systems running more than 20 load test processes may require a
minimum of 2000 available blocks. If you run multiple passes of UETP, log files
will accumulate in the default directory and further reduce the amount of disk
space available for subsequent passes.
If disk quotas are enabled on the system disk, you should disable them before you
run UETP.
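The free-space guidance in this section can be expressed as a simple check. The Python sketch below encodes the two figures given above (1200 blocks, or 2000 blocks for more than 20 load test processes). The function names are hypothetical, and the sketch does not account for log files left by earlier passes.

```python
BASE_FREE_BLOCKS = 1200        # minimum free blocks on the system disk
HEAVY_LOAD_FREE_BLOCKS = 2000  # suggested minimum above the threshold
HEAVY_LOAD_THRESHOLD = 20      # load test processes


def required_free_blocks(load_processes: int) -> int:
    """Return the suggested minimum number of free system disk blocks."""
    if load_processes > HEAVY_LOAD_THRESHOLD:
        return HEAVY_LOAD_FREE_BLOCKS
    return BASE_FREE_BLOCKS


def enough_space(free_blocks: int, load_processes: int) -> bool:
    """Check a free-block count against the suggested minimum."""
    return free_blocks >= required_free_blocks(load_processes)


if __name__ == "__main__":
    print(enough_space(free_blocks=1500, load_processes=25))
```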
3.4.3 Preparing Additional Disks
To prepare each disk drive in the system for UETP testing, use the following
procedure:
1. Place a scratch disk in the drive and spin up the drive. If a scratch disk is
not available, use any disk with a substantial amount of free space. UETP
does not overwrite existing files on any volume. If your scratch disk contains
files that you want to keep, do not initialize the disk; go to step 3.
2. If the disk does not contain files you want to save, initialize it. For example:
$ INITIALIZE DUA1: TEST1
This command initializes DUA1, and assigns the volume label TEST1 to the
disk. All volumes must have unique labels.
3. Mount the disk. For example:
$ MOUNT/SYSTEM DUA1: TEST1
This command mounts the volume labeled TEST1 on DUA1. The /SYSTEM
qualifier indicates that you are making the volume available to all users on
the system.
4. UETP uses the [SYSTEST] directory when testing the disk. If the volume
does not contain the directory [SYSTEST], you must create it. For example:
$ CREATE/DIRECTORY/OWNER_UIC=[1,7] DUA1:[SYSTEST]
This command creates a [SYSTEST] directory on DUA1 and assigns a user
identification code (UIC) of [1,7]. The directory must have a UIC of [1,7] to
run UETP.
If the disk you have mounted contains a root directory structure, you can
create the [SYSTEST] directory in the [SYS0.] tree.
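When several scratch disks have to be prepared, the same three DCL commands are repeated with a different device and label each time. The Python sketch below only builds the command text shown in the steps above, for pasting or review. It does not run anything on the system, and the function and parameter names are hypothetical.

```python
def uetp_disk_prep_commands(device: str, label: str,
                            initialize: bool = True,
                            create_directory: bool = True) -> list:
    """Build the DCL command lines used to prepare one disk for UETP.

    Set initialize=False for a disk whose files must be kept, and
    create_directory=False if [SYSTEST] already exists on the volume.
    """
    dev = device.rstrip(":").upper()
    commands = []
    if initialize:
        # Step 2: initialize the scratch disk with a unique volume label.
        commands.append(f"$ INITIALIZE {dev}: {label}")
    # Step 3: mount the volume for all users on the system.
    commands.append(f"$ MOUNT/SYSTEM {dev}: {label}")
    if create_directory:
        # Step 4: UETP requires a [SYSTEST] directory owned by UIC [1,7].
        commands.append(f"$ CREATE/DIRECTORY/OWNER_UIC=[1,7] {dev}:[SYSTEST]")
    return commands


if __name__ == "__main__":
    for line in uetp_disk_prep_commands("DUA1", "TEST1"):
        print(line)
```

Passing initialize=False corresponds to skipping step 2 for a disk that holds files you want to keep.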
3.4.4 Preparing Magnetic Tape Drives
Set up magnetic tape drives that you want to test by doing the following:
1. Place a scratch magnetic tape with at least 600 feet of magnetic tape in the
tape drive. Make sure that the write-enable ring is in place.
2. Position the magnetic tape at the beginning-of-tape (BOT) and put the drive
on line.
3. Initialize each scratch magnetic tape with the label UETP. For example, if
you have physically mounted a scratch magnetic tape on MTA1, enter the
following command and press Return:
$ INITIALIZE MTA1: UETP
Magnetic tapes must be labeled UETP to be tested. As a safety feature, UETP
does not test tapes that have been mounted with the MOUNT command.
3.4.5 Preparing Tape Cartridge Drives
Set up tape cartridge drives that you want to test by doing the following:
1. Insert a scratch tape cartridge in the tape cartridge drive.
2. Initialize the tape cartridge. For example:
$ INITIALIZE MKE0: UETP
Tape cartridges must be labeled UETP to be tested. As a safety feature,
UETP does not test tape cartridges that have been mounted with the MOUNT
command.
3.4.5.1 TLZ06 Tape Drives
During the initialization phase, UETP sets a time limit of 6 minutes for a TLZ06
unit to complete the UETTAPE00 test. If the device does not complete the
UETTAPE00 test within the allotted time, UETP displays a message similar to
the following:
-UETP-E-TEXT, UETTAPE00.EXE testing controller MKA was stopped ($DELPRC) at 16:23:23.07
because the time out period (UETP$INIT_TIMEOUT) expired or
because it seemed hung or because UETINIT01 was aborted.
To increase the timeout value, type a command similar to the following before
running UETP:
This example defines the initialization timeout value as 8 minutes.
3.4.6 Preparing RRD42 Compact Disc Drives
To run UETP on an RRD42 compact disc drive, you must first load the test disc
that you received with your compact disc drive unit.
3.4.7 Preparing Terminals and Line Printers
Terminals and line printers must be turned on to be tested by UETP. They must
also be on line. Check that line printers and hardcopy terminals have enough
paper. The amount of paper required depends on the number of UETP passes
that you plan to execute. Each pass requires two pages for each line printer and
hardcopy terminal.
Check that all terminals are set to the correct baud rate and are assigned
appropriate characteristics (see the user’s guide for your terminal).
Spooled devices and devices allocated to queues fail the initialization phase of
UETP and are not tested.
3.4.8 Preparing Ethernet Adapters
Make sure that no other processes are sharing the Ethernet adapter device when
you run UETP.
Note
UETP will not test your Ethernet adapter if DECnet for OpenVMS AXP
or another application has the device allocated.
Because either DECnet for OpenVMS AXP or the LAT terminal server might also
try to use the Ethernet adapter (a shareable device), you must shut down DECnet
for OpenVMS AXP and the LAT terminal server before you run the device test
phase, if you want to test the Ethernet adapter.
3.4.9 DECnet for OpenVMS AXP Phase
The DECnet for OpenVMS AXP phase of UETP uses more system resources than
other tests. You can, however, minimize disruptions to other users by running the
test on the ‘‘least busy’’ node.
By default, the file UETDNET00.COM specifies the node from which the DECnet
for OpenVMS AXP test will be run. To run the DECnet for OpenVMS AXP test
on a different node, enter the following command before you invoke UETP:
$ DEFINE/GROUP UETP$NODE_ADDRESS node_address
This command equates the group logical name UETP$NODE_ADDRESS to the
node address of the node in your area on which you want to run the DECnet for
OpenVMS AXP phase of UETP.
For example:
$ DEFINE/GROUP UETP$NODE_ADDRESS 9.999
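A node address such as 9.999 has the form area.node. The Python sketch below checks an address before building the DEFINE command shown above. The function name is hypothetical, and the ranges it enforces (area 1 to 63, node 1 to 1023) are the DECnet Phase IV addressing limits, which this guide does not state; treat them as an assumption to be confirmed against your network documentation.

```python
def define_node_address_command(address: str) -> str:
    """Validate an area.node address and build the DEFINE/GROUP command.

    Range limits are assumed DECnet Phase IV values, not taken from
    this guide.
    """
    area_text, dot, node_text = address.strip().partition(".")
    for part in (area_text, node_text):
        if not (dot and part.isascii() and part.isdigit()):
            raise ValueError(f"expected area.node, got {address!r}")
    area, node = int(area_text), int(node_text)
    if not (1 <= area <= 63 and 1 <= node <= 1023):
        raise ValueError(f"address out of range: {address!r}")
    return f"$ DEFINE/GROUP UETP$NODE_ADDRESS {area}.{node}"


if __name__ == "__main__":
    print(define_node_address_command("9.999"))
```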
Note
When you use the logical name UETP$NODE_ADDRESS, UETP tests
only the first active circuit found by NCP. Otherwise, UETP tests all
active testable circuits.
When you run UETP, a router node attempts to establish a connection between
your node and the node defined by UETP$NODE_ADDRESS. Occasionally, the
connection between your node and the router node might be busy or nonexistent.
When this happens, the system displays the following error messages:
%NCP-F-CONNEC, Unable to connect to listener
-SYSTEM-F-REMRSRC, resources at the remote node were insufficient
%NCP-F-CONNEC, Unable to connect to listener
-SYSTEM-F-NOSUCHNODE, remote node is unknown
3.4.10 Termination of UETP
At the end of a UETP pass, the master command procedure UETP.COM displays
the time at which the pass ended. In addition, UETP.COM determines whether
UETP needs to be restarted.
At the end of an entire UETP run, UETP.COM deletes temporary files and does
other cleanup activities.
Pressing Ctrl/Y or Ctrl/C lets you terminate a UETP run before it completes
normally. Normal completion of a UETP run, however, includes the deletion of
miscellaneous files that have been created by UETP for the purpose of testing.
The use of Ctrl/Y or Ctrl/C might interrupt or prevent these cleanup procedures.
3.4.11 Interpreting UETP VMS Failures
When UETP encounters an error, it reacts like a user program. It either returns
an error message and continues, or it reports a fatal error and terminates
the image or phase. In either case, UETP assumes the hardware is operating
properly and it does not attempt to diagnose the error.
If the cause of an error is not readily apparent, use the following methods to
diagnose the error:
•VMS Error Log Utility—Run the Error Log Utility to obtain a detailed report
of hardware and system errors. Error log reports provide information about
the state of the hardware device and I/O request at the time of each error.
For information about running the Error Log Utility, refer to the VMS Error
Log Utility Manual and Chapter 4 of this manual.
•Diagnostic facilities—Use the diagnostic facilities to test exhaustively a device
or medium to isolate the source of the error.
3.4.12 Interpreting UETP Output
You can monitor the progress of UETP tests at the terminal from which they were
started. This terminal always displays status information, such as messages that
announce the beginning and end of each phase and messages that signal an error.
The tests send other types of output to various log files, depending on how you
started the tests. The log files contain output generated by the test procedures.
Even if UETP completes successfully, with no errors displayed at the terminal,
it is good practice to check these log files for errors. Furthermore, when errors
are displayed at the terminal, check the log files for more information about their
origin and nature.
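When a log file has been copied to a workstation for review, a short script can list the messages that need attention. The Python sketch below assumes the standard OpenVMS message layout (%FACILITY-L-IDENT, text, with linked messages starting with a hyphen) and treats severity codes E and F as errors. The function name find_errors is hypothetical, and the sample lines are the NCP messages shown in Section 3.4.9.

```python
import re

# Standard OpenVMS message layout: %FACILITY-L-IDENT, text
# Linked (secondary) messages begin with "-" instead of "%".
_MESSAGE = re.compile(
    r"^[%-](?P<facility>[A-Z0-9$_]+)-(?P<severity>[SIWEF])-"
    r"(?P<ident>[A-Z0-9$_]+), (?P<text>.*)$"
)


def find_errors(lines, severities="EF"):
    """Return (line_number, facility, severity, ident, text) per matching line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = _MESSAGE.match(line.strip())
        if match and match["severity"] in severities:
            hits.append((number, match["facility"], match["severity"],
                         match["ident"], match["text"]))
    return hits


if __name__ == "__main__":
    sample = [
        "%NCP-F-CONNEC, Unable to connect to listener",
        "-SYSTEM-F-REMRSRC, resources at the remote node were insufficient",
    ]
    for hit in find_errors(sample):
        print(hit)
```

Passing severities="WEF" also reports warnings. Lines that do not follow the message layout are ignored, so the script is a first filter and does not replace reading the log.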
3.4.12.1 UETP Log Files
UETP stores all information generated by all UETP tests and phases from its
current run in one or more UETP.LOG files, and it stores the information from
the previous run in one or more OLDUETP.LOG files. If a run of UETP involves
multiple passes, there will be one UETP.LOG or one OLDUETP.LOG file for each
pass.
At the beginning of a run, UETP deletes all OLDUETP.LOG files, and renames
existing UETP.LOG files to OLDUETP.LOG. Then UETP creates a new
UETP.LOG file and stores the information from the current pass in the new
file. Subsequent passes of UETP create higher versions of UETP.LOG. Thus, at
the end of a run of UETP that involves multiple passes, there is one UETP.LOG
file for each pass. In producing the files UETP.LOG and OLDUETP.LOG, UETP
provides the output from the two most recent runs.
If the run involves multiple passes, UETP.LOG contains information from all
the passes. However, only information from the latest run is stored in this file.
Information from the previous run is stored in a file named OLDUETP.LOG.
Using these two files, UETP provides the output from its tests and phases from
the two most recent runs.
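The log file handling described above can be modeled in a few lines. The Python sketch below is an in-memory illustration only: it represents each file name as a list of versions and does not touch SYS$TEST. The function name run_uetp and the run labels are hypothetical.

```python
def run_uetp(files, run_label, passes):
    """Simulate one UETP run on a {file name: [version contents]} mapping."""
    # Start of run: all OLDUETP.LOG versions are deleted, and the existing
    # UETP.LOG versions are renamed to OLDUETP.LOG.
    files["OLDUETP.LOG"] = files.get("UETP.LOG", [])
    # Each pass of the new run creates a new, higher version of UETP.LOG.
    files["UETP.LOG"] = [f"{run_label} pass {n}" for n in range(1, passes + 1)]
    return files


if __name__ == "__main__":
    files = {}
    run_uetp(files, "run A", passes=2)
    run_uetp(files, "run B", passes=1)
    for name in sorted(files):
        print(name, files[name])
```

In this model, after the second run OLDUETP.LOG holds the two pass logs from run A and UETP.LOG holds the single pass log from run B, so only the two most recent runs are available.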
The cluster test creates a NETSERVER.LOG file in SYS$TEST for each pass
on each system included in the run. If the test is unable to report errors (for
example, if the connection to another node is lost), the NETSERVER.LOG file on
that node contains the result of the test run on that node. UETP does not purge
or delete NETSERVER.LOG files; therefore, you must delete them occasionally to
recover disk space.
If a UETP run does not complete normally, SYS$TEST might contain other log
files. Ordinarily these log files are concatenated and placed within UETP.LOG.
You can use any log files that appear on the system disk for error checking, but
you must delete these log files before you run any new tests. You may delete
these log files yourself or rerun the entire UETP, which checks for old UETP.LOG
files and deletes them.
3.4.12.2 Possible UETP Errors
This section is intended to help you identify problems you might encounter
running UETP.
The following are the most common failures encountered while running UETP:
•Wrong quotas, privileges, or account
•UETINIT01 failure
•Ethernet device allocated or in use by another application
•Insufficient disk space
•Incorrect cluster setup
•Problems during the load test
•DECnet for OpenVMS AXP error
•Lack of default access for the FAL object
•Errors logged but not displayed
•No PCB or swap slots
•Hangs
•Bugchecks and machine checks
For more information, refer to the VAX 3520, 3540 VMS Installation and
Operations (ZKS166) manual.
3.5 Acceptance Testing and Initialization
Perform the acceptance testing procedure listed below, after installing a system,
or whenever adding or replacing the following:
3. Run DEC VET or UETP to test that the operating system is correctly
installed. Refer to Section 3.3 for information on DEC VET. Refer to
Section 3.4 for instructions on running UETP.
4 Error Log Analysis
This chapter provides information on how to interpret error logs reported by the
operating system.
•Section 4.1 describes machine check/interrupts and how these errors are
detected and reported.
•Section 4.2 describes the entry format used by the ERF/UERF error
formatters.
•Section 4.3 describes how to translate the error log information using the
OpenVMS AXP and DEC OSF/1 error formatters.
•Section 4.4 describes how to interpret the system error log to isolate the
failing FRU.
4.1 Fault Detection and Reporting
Table 4–1 provides a summary of the fault detection and correction components of
DEC 4000 AXP systems.
Generally, PALcode handles exceptions as follows:
•The PALcode determines the cause of the exception.
•If possible, it corrects the problem and passes control to the operating system
for reporting before returning the system to normal operation.
•If a problem is not correctable, or if error/event logging is required, control is
passed through the system control block (SCB) to the appropriate exception
handler.
Table 4–1 DEC 4000 AXP Fault Detection and Correction

Component                 Fault Detection and Correction

21064 microprocessor      Error Detection and Correction (EDC) logic. For all
                          data entering the 21064 microprocessor, single bits
                          are checked and corrected; for all data exiting the
                          21064 microprocessor, the appropriate check bits are
                          generated. A single-bit error on any of the four
                          longwords being read can be corrected (per cycle).
Backup cache (B-cache)    EDC check bits on the data store; and parity on the
                          tag store and control store.

MS430 Memory Modules
Memory module             EDC logic protects data by detecting and correcting
                          up to 2 bits per DRAM chip per gate array. The four
                          bits of data per DRAM are spread across two gate
                          arrays (one for even longwords, the other for odd
                          longwords).

KFA40 I/O Module
I/O module                DSSI/SCSI buses: Data parity is checked and
                          generated.
                          Lbus data transfers to Ethernet and SCSI/DSSI
                          controllers: Data parity is checked and generated.
                          Futurebus+ data transfers: Parity is checked and
                          passed on.

System Bus
System bus                Longword parity on command, address, and data.
4.1.1 Machine Check/Interrupts
The exceptions that result from hardware system errors are called machine
check/interrupts. They occur when a system error is detected during the
processing of a data request. There are three types of machine check/interrupts
related to system events:
1. Processor machine check
2. System machine check
3. Processor corrected machine check
The causes for each of the machine check/interrupts are as follows. The system
control block (SCB) vector through which PALcode transfers control to the
operating system is shown in parentheses.
Processor Machine Check (SCB: 670)
Processor machine check errors are fatal system errors and immediately crash
the system.
•The DECchip 21064 microprocessor detected one or more of the following
uncorrectable data errors:
–Uncorrectable B-cache data error
–Uncorrectable memory data error (CU_ERR asserted)
–Uncorrectable data from other CPU’s B-cache (CU_ERR asserted)
•A B-cache tag or tag control parity error occurred
•Hard error status was asserted in response to:
–A read data parity error
–System bus timeouts (NOACK error bit asserted)—The bus responder
detected a write data or command address error and did not acknowledge
the bus cycle.
System Machine Check (SCB: 660)
A system machine check is a system-detected error that is external to the
DECchip 21064 microprocessor and possibly unrelated to the activities of the
microprocessor. It occurs when C_ERROR is asserted on the system bus.
Fatal errors:
•The I/O module detected a system bus error while serving as system bus
commander:
–System bus timeouts (NOACK error bit asserted)—The bus responder
detected a write data or command address error and did not acknowledge
the bus cycle
–Uncorrectable data (CU_ERR asserted) from responder
•Any system bus device detected a command/address parity error
•A bus responder detected a write data parity error
•Memory or I/O system bus gate array detected an internal error (SYNC error)
Nonfatal errors:
•A memory module correctable error occurred
•Correctable B-cache errors were detected while the B-cache was providing
data to the system bus (errors from other CPU)
•Duplicate tag store parity errors occurred
Processor Corrected Machine Check (SCB: 630)
Processor corrected machine checks are caused by B-cache errors that are
detected and corrected by the DECchip 21064 microprocessor. These errors
are nonfatal and result in an error log entry.
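The three machine check/interrupt types and their SCB vectors can be kept in a small lookup table when triaging error log entries. The Python sketch below summarizes only what this section states. The vectors are stored as the strings printed above, and the names MACHINE_CHECKS and describe_scb_vector are hypothetical.

```python
# Summary of Section 4.1.1. Vectors are kept exactly as printed in the text.
MACHINE_CHECKS = {
    "670": ("Processor machine check", "fatal"),
    "660": ("System machine check", "fatal or nonfatal"),
    "630": ("Processor corrected machine check", "nonfatal"),
}


def describe_scb_vector(vector):
    """Return a one-line description for an SCB vector string such as '660'."""
    try:
        name, severity = MACHINE_CHECKS[vector]
    except KeyError:
        return f"SCB {vector}: not a machine check vector listed in this section"
    return f"SCB {vector}: {name} ({severity})"


if __name__ == "__main__":
    for vector in ("670", "660", "630"):
        print(describe_scb_vector(vector))
```

A system machine check (SCB 660) needs further inspection of the entry, because this section lists both fatal and nonfatal causes for it.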
4.1.2 System Bus Transaction Cycle
In order to interpret error logs for system bus errors, you need a basic
understanding of the system bus transaction cycle and the function of the
commander, responder, and bystanders.
For any particular bus transaction cycle there is one commander (either CPU or
I/O) that initiates bus transactions and one responder (memory, CPU, or I/O) that
accepts or supplies data in response to a command/address from the system bus
commander. A bystander is a system bus node (CPU, I/O, or memory) that is not
addressed by a current system bus commander.
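The three roles can be expressed as a simple classification. In the Python sketch below, the node names ("CPU0", "MEM0", "I/O") and the function name bus_role are hypothetical labels chosen for illustration; they are not identifiers used by the system or by the error log.

```python
def bus_role(node, commander, addressed):
    """Classify a system bus node for one transaction cycle."""
    if node == commander:
        return "commander"   # initiates the bus transaction
    if node == addressed:
        return "responder"   # accepts or supplies data for the command/address
    return "bystander"       # not addressed by the current commander


if __name__ == "__main__":
    # Hypothetical cycle: CPU0 reads from memory module MEM0.
    for node in ("CPU0", "MEM0", "I/O"):
        print(node, bus_role(node, commander="CPU0", addressed="MEM0"))
```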
There are four system bus transaction types: read, write, exchange, and nut.
•Read and write transactions consist of a command/address cycle followed by
two data cycles.
•Exchange transactions are used to replace the cache block when a cache block
resource conflict occurs. They consist of a command/address cycle followed by
four data cycles: two writes and two reads.
•Nut transactions consist of a command/address cycle and two dummy data
cycles for which no data is transferred.
For more information, refer to the DEC 4000 Model 600 Series Technical Manual.
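The cycle counts for the four transaction types can be tabulated as follows. This Python sketch restates the list above. The names TRANSACTIONS and total_cycles are hypothetical, and the totals count bus cycles only, not elapsed time.

```python
COMMAND_ADDRESS_CYCLES = 1  # every transaction starts with one such cycle

TRANSACTIONS = {
    "read": {"data_cycles": 2, "moves_data": True},
    "write": {"data_cycles": 2, "moves_data": True},
    # Exchange: two write data cycles followed by two read data cycles.
    "exchange": {"data_cycles": 4, "moves_data": True},
    # Nut: two dummy data cycles for which no data is transferred.
    "nut": {"data_cycles": 2, "moves_data": False},
}


def total_cycles(kind):
    """Return command/address plus data cycles for a transaction type."""
    return COMMAND_ADDRESS_CYCLES + TRANSACTIONS[kind]["data_cycles"]


if __name__ == "__main__":
    for kind in TRANSACTIONS:
        print(kind, total_cycles(kind))
```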
4.2 Error Logging and Event Log Entry Format
The OpenVMS AXP and DEC OSF/1 error handlers can generate several entry
types. All error entries, with the exception of correctable memory errors, are
logged immediately. Entries can be of variable length based on the number of
registers within the entry.