Digital Equipment Corporation
Maynard, Massachusetts
First Printing, March 1996
Second Printing, October 1996
Digital Equipment Corporation makes no representations that the use of its products in the
manner described in this publication will not infringe on existing or future patent rights, nor do
the descriptions contained in this publication imply the granting of licenses to make, use, or sell
equipment or software in accordance with the description.
Possession, use, or copying of the software described in this publication is authorized only pursuant
to a valid written license from Digital or an authorized sublicensor.
VET, Digital, OpenVMS, StorageWorks, VAX DOCUMENT, and the DIGITAL logo.
Digital UNIX Version 3.0 is an X/Open UNIX 93 branded product. Windows NT is a trademark of
Microsoft Corp.
All other trademarks and registered trademarks are the property of their respective holders.
FCC NOTICE: The equipment described in this manual generates, uses, and may emit radio
frequency energy. The equipment has been type tested and found to comply with the limits for
a Class B computing device pursuant to Subpart J of Part 15 of FCC Rules, which are designed
to provide reasonable protection against such radio frequency interference when operated in a
commercial environment. Operation of this equipment in a residential area may cause interference,
in which case the user at his own expense may be required to take measures to correct the
interference.
This document was prepared using VAX DOCUMENT Version 2.1.
Preface
This guide describes the procedures and tests used to service AlphaServer 1000A
systems. AlphaServer 1000A systems use a deskside ‘‘wide-tower’’ enclosure.
Intended Audience
This guide is intended for use by Digital Equipment Corporation service personnel
and qualified self-maintenance customers.
Conventions
The following conventions are used in this guide:
Convention      Meaning
Return          A key name enclosed in a box indicates that you press that key.
Ctrl/x          Ctrl/x indicates that you hold down the Ctrl key while you press another key, indicated here by x. In examples, this key combination is enclosed in a box, for example, Ctrl/C.
Warning         Warnings contain information to prevent personal injury.
Caution         Cautions provide information to prevent damage to equipment or software.
Note            A note calls the reader’s attention to any information that may be of special importance.
boot            Console and operating system commands are shown in this special typeface.
[]              In command format descriptions, brackets indicate optional elements.
show config     Console command abbreviations must be entered exactly as shown. Commands shown in lowercase can be entered in either uppercase or lowercase.
italic type     In console command sections, italic type indicates a variable.
< >             In console mode online help, angle brackets enclose a placeholder for which you must specify a value.
{ }             In command descriptions, braces containing items separated by commas imply mutually exclusive items.
Related Documentation
•AlphaServer 1000A Owner’s Guide, EK-ALPSV-OG
•AlphaServer 1000/1000A Model 5/xxx Owner’s Guide Supplement, EK-AL530-OG
•DEC Verifier and Exerciser Tool User’s Guide, AA-PTTMD-TE
•Guide to Kernel Debugging, AA-PS2TD-TE
•OpenVMS Alpha System Dump Analyzer Utility Manual, AA-PV6UB-TE
•DECevent Translation and Reporting Utility for OpenVMS Alpha, User and
Reference Guide, AA-Q73KC-TE
•DECevent Translation and Reporting Utility for Digital UNIX, User and
Reference Guide, AA-QAA3A-TE
•DECevent Analysis and Notification Utility for OpenVMS Alpha, User and
Reference Guide, AA-Q73LC-TE
•DECevent Analysis and Notification Utility for Digital UNIX, User and
Reference Guide, AA-QAA4A-TE
This chapter describes the troubleshooting strategy for AlphaServer 1000A
systems.
•Section 1.1 provides questions to consider before you begin troubleshooting an
AlphaServer 1000A system.
•Tables 1–1 through 1–5 provide a diagnostic flow for each category of system
problem.
•Section 1.2 lists the product tools and utilities.
•Section 1.3 lists available information services.
1.1 Troubleshooting the System
Before troubleshooting any system problem, check the site maintenance log for
the system’s service history. Be sure to ask the system manager the following
questions:
•Has the system been used before and did it work correctly?
•Have changes to hardware or updates to firmware or software been made
to the system recently? If so, are the revision numbers compatible for the
system? (Refer to the hardware and operating system release notes.)
•What is the state of the system—is the operating system running?
If the operating system is down and you are not able to bring it up, use
the console environment diagnostic tools, such as the power-up display and
ROM-based diagnostics (RBDs).
If the operating system is running, use the operating system environment
diagnostic tools, such as the DECevent event management utility (to translate
and interpret error logs), crash dumps, and exercisers (DEC VET).
Troubleshooting Strategy 1–1
1.1.1 Problem Categories
System problems can be classified into the following five categories. Using these
categories, you can quickly determine a starting point for diagnosis and eliminate
the unlikely sources of the problem.
1. Power problems (Table 1–1)
2. No access to console mode (Table 1–2)
3. Console-reported failures (Table 1–3)
4. Boot failures (Table 1–4)
5. Operating system-reported failures (Table 1–5)
Table 1–1 Diagnostic Flow for Power Problems
Symptom: System does not power on.
Action:
•Check the power source and power cord.
•Check that the system’s top cover is properly secured. A safety interlock switch shuts off power to the system if the top cover is removed.
•If there are two power supplies, make sure both power supplies are plugged in.
•Check the On/Off switch setting on the operator control panel.
•Check that the ambient room temperature is within environmental specifications (10–40°C, 50–104°F).
•Check that internal power supply cables are plugged in at both the power supply and system motherboard (Section 5.9).

Symptom: Power supply shuts down after a few seconds (fan failure).
Action:
Using a flashlight, look through the front (to the left of the internal StorageWorks shelf) to determine if the fans are spinning at power-up. A failure of either fan causes the system to shut down after a few seconds.
Table 1–2 Diagnostic Flow for Problems Getting to Console Mode
Symptom: Power-up screen is not displayed.
Action:
Interpret the error beep codes at power-up (Section 2.1) for a failure detected during self-tests. In addition to beep codes, model 5/xxx systems display error codes on the OCP (Section 2.2).

Check that the keyboard and monitor are properly connected and turned on.

If the power-up screen is not displayed, yet the system enters console mode when you press Return, check that the console environment variable is set correctly. If you are using a VGA monitor as the console terminal, the console variable should be set to ‘‘graphics.’’ If you are using a serial console terminal, the console variable should be set to ‘‘serial.’’

If a VGA controller other than the standard on-board VGA controller is being used, refer to Section 5.10 for more information.

If console is set to serial, the power-up screen is routed to the COM1 serial communication port (Section 5.10) and cannot be viewed from the VGA monitor.

Try connecting a console terminal to the COM1 serial communication port (Section 5.10). If necessary use an MMJ-to-9-pin adapter (H8571-J). Check the baud rate setting for the console terminal and the system. The system baud rate setting is 9600. When using the COM1 port, you must set the console environment variable to ‘‘serial.’’

For certain situations, power up using the fail-safe loader (Section 2.9) to load new console firmware from a diskette.
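The relationship between the console terminal type and the console environment variable in Table 1–2 can be summarized in a short Python sketch. The function names and the terminal labels ('vga', 'serial') are illustrative assumptions, not part of the console firmware; only the values ‘‘graphics’’ and ‘‘serial’’ and the COM1 routing come from the table.

```python
def expected_console_setting(terminal: str) -> str:
    """Return the value the console environment variable should have.

    terminal is a hypothetical label: 'vga' for a VGA monitor used as the
    console terminal, 'serial' for a serial console terminal.
    """
    settings = {"vga": "graphics", "serial": "serial"}
    key = terminal.lower()
    if key not in settings:
        raise ValueError(f"unknown terminal type: {terminal!r}")
    return settings[key]


def power_up_screen_destination(console_value: str) -> str:
    """Say where the power-up screen is routed for a given console value."""
    if console_value == "serial":
        # With console set to serial, output goes to COM1 and is not
        # visible on the VGA monitor.
        return "COM1 serial communication port"
    if console_value == "graphics":
        return "VGA monitor"
    raise ValueError(f"unexpected console value: {console_value!r}")
```

For example, a servicer who sees nothing on the VGA monitor can confirm that a console value of ‘‘serial’’ explains the symptom before replacing any hardware.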
Table 1–3 Diagnostic Flow for Problems Reported by the Console Program
Symptom: Power-up tests do not complete.
Action:
Interpret the error beep codes at power-up (Section 2.1) and check the power-up screen (Section 2.3) for a failure detected during self-tests. In addition, model 5/xxx systems display error codes on the OCP (Section 2.2).

Symptom: Console program reports error:
•Error beep codes report an error at power-up.
•Power-up screen includes error messages.
•Model 5/xxx systems display error codes on the OCP display.
Action:
Use the error beep codes (Section 2.1) and/or console terminal (Section 2.3) to determine the error. In addition, model 5/xxx systems display error codes on the OCP (Section 2.2).

Examine the console event log (enter the more el command) (Section 2.3.1) or the power-up screen (Section 2.3) to check for embedded error messages recorded during power-up.

If the power-up screen or console event log indicates problems with mass storage devices, or if storage devices are missing from the show config display, use the troubleshooting tables (Section 2.5) to determine the problem.

Note
The external SCSI terminator must be installed on the SCSI port at the rear of the enclosure. Without the termination, some SCSI drives will not be available; these drives will be missing from the show config display.

If the power-up screen or console event log indicates problems with EISA devices, or if EISA devices are missing from the show config display, use the troubleshooting table (Section 2.7) to determine the problem.

If the power-up screen or console event log indicates problems with PCI devices, or if PCI devices are missing from the show config display, use the troubleshooting table (Section 2.8) to determine the problem.

Run the ROM-based diagnostic (RBD) tests (Section 3.1) to verify the problem.
Table 1–4 Diagnostic Flow for Boot Problems
Symptom: System cannot find boot device.
Action:
Check the system configuration for the correct device parameters (node ID, device name, and so on).
•For Digital UNIX and OpenVMS, use the show config and show device commands (Section 5.1).
•For Windows NT, use the Display Hardware Configuration display and the Set Default Environment Variables display (Section 5.1).

Check the system configuration for the correct environment variable settings.
•For Digital UNIX and OpenVMS, examine the auto_action, bootdef_dev, boot_osflags, and os_type environment variables. Also, make sure that the bus_probe_algorithm environment variable is set to ‘‘new’’ (Section 5.1.4.4).
For problems booting over a network, check the ew*0_protocols or er*0_protocols environment variable settings: Systems booting from a Digital UNIX server should be set to bootp; systems booting from an OpenVMS server should be set to mop (Section 5.1.4.4).
•For Windows NT, examine the FWSEARCHPATH, AUTOLOAD, and COUNTDOWN environment variables (Section 5.1.4.4).

Symptom: Device does not boot.
Action:
For problems booting over a network, check the ew*0_protocols or er*0_protocols environment variable settings: Systems booting from a Digital UNIX server should be set to bootp; systems booting from an OpenVMS server should be set to mop (Section 5.1.4.4).

For systems running Digital UNIX and OpenVMS, make sure that the bus_probe_algorithm environment variable is set to ‘‘new’’ (Section 5.1.4.4).

Run the device tests (Section 3.1) to check that the boot device is operating.
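The environment variable checks in Table 1–4 for Digital UNIX and OpenVMS systems can be expressed as a small Python sketch. It assumes the settings have already been copied by hand from the console into a dictionary; the function name, the server labels ('digital_unix', 'openvms'), the sample values, and the assumption that the wildcard in ew*0_protocols and er*0_protocols stands for a single letter are illustrative and not taken from the firmware documentation.

```python
import re

# Assumption: the '*' in ew*0_protocols / er*0_protocols is one letter.
PROTOCOL_VAR = re.compile(r"e[wr][a-z]0_protocols")

# Protocol required by the type of server the system boots from (Table 1-4).
REQUIRED_PROTOCOL = {"digital_unix": "bootp", "openvms": "mop"}


def check_boot_environment(env, network_server=None):
    """Return a list of findings for the settings named in Table 1-4.

    env            -- dict mapping environment variable names to values
    network_server -- 'digital_unix', 'openvms', or None when the system
                      does not boot over a network
    """
    findings = []
    for name in ("auto_action", "bootdef_dev", "boot_osflags", "os_type"):
        if name not in env:
            findings.append(f"{name} not captured; examine it at the console")
    if env.get("bus_probe_algorithm", "").lower() != "new":
        findings.append("bus_probe_algorithm is not set to 'new'")
    if network_server is not None:
        wanted = REQUIRED_PROTOCOL[network_server]
        names = sorted(n for n in env if PROTOCOL_VAR.fullmatch(n))
        if not names:
            findings.append("no ew*0_protocols or er*0_protocols variable found")
        for n in names:
            if env[n].lower() != wanted:
                findings.append(f"{n} is '{env[n]}', expected '{wanted}'")
    return findings
```

An empty list means none of the conditions in the table were flagged; it does not prove that the boot device itself is working, which is what the device tests in Section 3.1 are for.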
Table 1–5 Diagnostic Flow for Errors Reported by the Operating System
Symptom: System is hung or has crashed.
Action:
Examine the crash dump file.

Refer to OpenVMS Alpha System Dump Analyzer Utility Manual (AA-PV6UB-TE) for information on how to interpret OpenVMS crash dump files.

Refer to the Guide to Kernel Debugging (AA–PS2TD–TE) for information on using the Digital UNIX Krash Utility.

Symptom: Errors have been logged and the operating system is up.
Action:
Examine the operating system error log files to isolate the problem (Chapter 4).

If the problem occurs intermittently, run an operating system exerciser, such as DEC VET, to stress the system.

Refer to the DEC Verifier and Exerciser Tool User’s Guide (AA–PTTMD–TE) for instructions on running DEC VET.
1.2 Service Tools and Utilities
This section lists the array of service tools and utilities available for acceptance
testing, diagnosis, and serviceability and provides recommendations for their use.
Error Handling/Logging Tools
Digital UNIX, OpenVMS, and Microsoft Windows NT operating systems
provide recovery from errors, fault handling, and event logging. The
DECevent Translation and Reporting Utility provides bit-to-text translation
of Digital UNIX and OpenVMS error logs for interpretation.
RECOMMENDED USE: Analysis of error logs is the primary method of
diagnosis and fault isolation. If the system is up, or you are able to bring it
up, look at this information first.
ROM-Based Diagnostics (RBDs)
Many ROM-based diagnostics and exercisers are embedded in AlphaServer
1000A systems. ROM-based diagnostics execute automatically at power-up
and can be invoked in console mode using console commands.
RECOMMENDED USE: ROM-based diagnostics are the primary means of
testing the console environment and diagnosing the CPU, memory, Ethernet,
I/O buses, and SCSI and DSSI subsystems. Use ROM-based diagnostics in
the acceptance test procedures when you install a system, add a memory
module, or replace the following components: CPU module, memory module,
motherboard, I/O bus device, or storage device. Refer to Chapter 3 for
information on running ROM-based diagnostics.
Loopback Tests
Internal and external loopback tests are used to isolate a failure by testing
segments of a particular control or data path. The loopback tests are a subset
of the ROM-based diagnostics.
RECOMMENDED USE: Use loopback tests to isolate problems with the
COM2 serial port, the parallel port, and Ethernet controllers. Refer to
Chapter 3 for instructions on performing loopback tests.
Firmware Console Commands
Console commands are used to set and examine environment variables
and device parameters, as well as to invoke ROM-based diagnostics and
exercisers. For example, the show memory, show configuration, and show
device commands are used to examine the configuration; the set (bootdef_dev,
auto_action, and boot_osflags) commands are used to set environment
variables; and the cdp command is used to configure DSSI parameters.
RECOMMENDED USE: Use console commands to set and examine
environment variables and device parameters and to run RBDs. Refer to
Section 5.1 for information on configuration-related firmware commands and
Chapter 3 for information on running RBDs.
Operating System Exercisers (DEC VET)
The Digital Verifier and Exerciser Tool (DEC VET) is supported by the Digital
UNIX, OpenVMS, and Windows NT operating systems. DEC VET performs
exerciser-oriented maintenance testing of both hardware and operating
system.
RECOMMENDED USE: Use DEC VET as part of acceptance testing to
ensure that the CPU, memory, disk, tape, file system, and network are
interacting properly. Also use DEC VET to stress test the user’s environment
and configuration by simulating system operation under heavy loads to
diagnose intermittent system failures.
Crash Dumps
For fatal errors, such as fatal bugchecks, Digital UNIX and OpenVMS
operating systems will save the contents of memory to a crash dump file.
RECOMMENDED USE: Crash dump files can be used to determine why the
system crashed. To save a crash dump file for analysis, you need to know
the proper system settings. Refer to the OpenVMS Alpha System Dump
Analyzer Utility Manual (AA-PV6UB-TE) or the Guide to Kernel Debugging
(AA–PS2TD–TE) for Digital UNIX.
1.3 Information Services
Several information resources are available, including online information
for servicers and customers, computer-based training, and maintenance
documentation database services. A brief description of some of these resources
follows.
Fast Track Service Help File
The information contained in this guide, including the field-replaceable unit
(FRU) procedures and illustrations, is available in online format. You can
download the hypertext file (A1000A-S.HLP) or a self-extracting .HLP file
from TIMA, or order the diskette (AK-QQRMB-CA) or the AlphaServer 1000A
Maintenance Kit (QZ-OOUAB-GC). The maintenance kit includes hardcopy,
diskette, and illustrated parts breakdown.
Alpha Firmware Updates
Under certain circumstances, such as a CPU upgrade or replacement of the
system backplane, you need to update your system firmware. An Alpha
Firmware CD–ROM is shipped on an ‘‘as released’’ basis with Digital UNIX,
OpenVMS, and Windows NT operating systems. The Alpha firmware files can
also be downloaded from the Internet as follows:
http://ftp.digital.com/pub/DEC/Alpha/firmware/
New versions of firmware released between shipments of the Alpha Firmware
CD–ROM are available in an interim directory:
ftp://ftp.digital.com/pub/Digital/Alpha/firmware/interim/
ECU Revisions
The EISA Configuration Utility (ECU) is used for configuring EISA options on
AlphaServer systems. Systems are shipped with an ECU kit, which includes
the ECU license. Customers who already have the ECU and license, but need
the latest revision of the ECU, can order a separate kit. Call 1-800-DIGITAL
to order.
If the customer plans to migrate from Digital UNIX or OpenVMS to Windows
NT, you must re-run the appropriate ECU. Failure to run the operating
system-specific ECU will result in system failure.
OpenVMS Patches
Software patches for the OpenVMS operating system are available from the
World Wide Web as follows:
http://www.service.digital.com/html/patch_service.html
Choose the ‘‘Contract Access’’ option if you have a valid software contract
with Digital or you wish to become a software contract customer. Choose the
‘‘Public Access’’ option if you do not have a software service contract.
Late-Breaking Technical Information
You can download up-to-date files and late-breaking technical information
from the Internet for managing AlphaServer 1000A systems.
•FTP address:
ftp.digital.com
cd /pub/DEC/Alpha/systems/as1000/docs
The information includes firmware updates, the latest configuration utilities,
software patches, lists of supported options, Wide SCSI information and more.
Supported Options
Refer to the AlphaServer 1000A Supported Options List for a list of options
supported under Digital UNIX, OpenVMS, and Windows NT. The options list
is available from the Internet as follows:
•FTP address:
ftp://ftp.digital.com/pub/Digital/Alpha/systems/
•World Wide Web address:
http://www.service.digital.com/alpha/server/
You can obtain information about hardware configurations for the
AlphaServer 1000A from the Digital Systems and Options Catalog. The
catalog is regularly published to assist in ordering and configuring systems
and hardware options. Each printing of the catalog presents all of the
products that are announced, actively marketed, and available for ordering.
Access printable PostScript files of any section of the catalog from the Internet
as follows (be sure to check the Readme file):
•ftp://ftp.digital.com/pub/Digital/info/SOC/
Training
The following Computer Based Training (CBT) and lecture lab courses are
available from the Digital training center:
•Alpha Concepts
•DSSI Concepts: EY-9823E
•ISA and EISA Bus Concepts: EY-I113E-P0
•RAID Concepts: EY-N935E
•SCSI Concepts and Troubleshooting: EY-P841E, EY-N838E
Digital Assisted Services
Digital Assisted Services (DAS) offers products, services, and programs to
customers who participate in the maintenance of Digital computer equipment.
Components of Digital assisted services include:
•Spare parts and kits
•Diagnostics and service information/documentation
•Tools and test equipment
•Parts repair services, including Field Change Orders
2 Power-Up Diagnostics and Display
This chapter provides information on how to interpret error beep codes and
the power-up display on the console screen. In addition, a description of the
power-up and firmware power-up diagnostics is provided as a resource to aid in
troubleshooting.
•Section 2.1 describes how to interpret error beep codes at power-up.
•Section 2.4 describes SROM memory tests that can be run at power-up to
isolate failing SIMM memory.
•Section 2.3 describes how to interpret the power-up screen display.
•Section 2.5 describes how to troubleshoot mass-storage problems indicated at
power-up or storage devices missing from the show config display.
•Section 2.6 shows the location of storage device LEDs.
•Section 2.7 describes how to troubleshoot EISA bus problems indicated at
power-up or EISA devices missing from the show config display.
•Section 2.8 describes how to troubleshoot PCI bus problems indicated at
power-up or PCI devices missing from the show config display.
•Section 2.9 describes the use of the Fail-Safe Loader.
•Section 2.10 describes the power-up sequence.
•Section 2.11 describes power-on self-tests.
2.1 Interpreting Error Beep Codes
If errors are detected at power-up, audible beep codes are emitted from the
system. For example, if the SROM code could not find any good memory, you
would hear a 1-3-3 beep code (one beep, a pause, a burst of three beeps, a pause,
and another burst of three beeps).
Be sure to check that the CPU daughter board is properly seated in its connector
if errors are reported.
Note
A single beep is emitted for model 5/xxx systems when the SROM code
has successfully completed. The console firmware then continues with its
power-up tests.
The beep codes are the primary diagnostic tool for troubleshooting problems
when console mode cannot be accessed. Refer to Table 2–1 for information on
interpreting error beep codes.
Table 2–1 Interpreting Error Beep Codes
Beep Code: 1-1-2
Problem: ROM data path error detected while loading ARC/SRM console code.
Corrective Action:
1. Use the Fail-Safe Loader to load new ARC/SRM console code (Section 2.9).
2. If successfully loading new console firmware does not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 1-1-4
Problem: The SROM code is unable to load the console code: Flash ROM header area or checksum error detected.
Corrective Action:
1. Use the Fail-Safe Loader to load new ARC/SRM console code (Section 2.9).
2. If successfully loading new console firmware does not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 1-2-1
Problem: TOY NVRAM failure.
Corrective Action:
Replace the TOY NVRAM chip (E78) on system motherboard (Chapter 6).

Beep Code: 1-2-4
Problem: Backup cache error.
Corrective Action:
Replace the CPU daughter board (Chapter 6).
Model 5/xxx systems can be operated with the Bcache disabled until a replacement CPU daughter board is available. Bank 4 of the J1 or J4 jumper on the CPU daughter board is used to disable the Bcache (Figures 2–2 and 2–3).

Beep Code: 1-3-3
Problem: No usable memory detected.
Corrective Action:
1. Verify that the memory modules are properly seated and try powering up again.
2. Swap bank 0 memory with known good memory and run SROM memory tests at power-up (Section 2.4).
3. If populating bank 0 with known good memory does not solve the problem, replace the CPU daughter board (Chapter 6).
4. If replacing the CPU daughter board does not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 3-1-2
Problem: J1 jumper on CPU daughter board set incorrectly or failure of native SCSI controller (NCR810).
Corrective Action:
1. Check that the J1 jumper on the CPU daughter board is set at bank 1 for AlphaServer 1000A systems, as opposed to bank 0, reserved for AlphaServer 1000 systems (Figure 2–5).
Note that model 5/xxx systems can use either standard boot setting, bank 0 or 1, regardless of system, and that model 5/300 systems use jumper designator J4, rather than J1.
2. If the J1 jumper setting is not the problem, replace the motherboard (Chapter 6).

Beep Code: 3-3-1
Problem: Generic system failure. Possible problem sources include the TOY NVRAM chip (Dallas DS1287A) or PCI-to-EISA bridge chipset (Intel 82375EB).
Corrective Action:
1. Replace the TOY NVRAM chip (E78) on system motherboard (Chapter 6).
2. If replacing the TOY NVRAM chip did not solve the problem, replace the motherboard (Chapter 6).

Beep Code: 3-3-2
Problem: J1 jumper on CPU daughter board set incorrectly or failure of the PCI-to-PCI bridge (DECchip 21050).
Corrective Action:
1. Check that the J1 jumper on the CPU daughter board is set at bank 1 for AlphaServer 1000A systems, as opposed to bank 0, reserved for AlphaServer 1000 systems (Figure 2–5).
Note that model 5/xxx systems can use either standard boot setting, bank 0 or 1, regardless of system, and that model 5/300 systems use jumper designator J4, rather than J1.
2. If the J1 jumper setting is not the problem, replace the motherboard (Chapter 6).

Beep Code: 3-3-3
Problem: J1 jumper on the CPU daughter board set incorrectly or failure of the native SCSI controller (NCR810) on the system motherboard.
Corrective Action:
1. Check that the J1 jumper on the CPU daughter board is set at bank 1 for AlphaServer 1000A systems, as opposed to bank 0, reserved for AlphaServer 1000 systems (Figure 2–5).
Note that model 5/xxx systems can use either standard boot setting, bank 0 or 1, regardless of system, and that model 5/300 systems use jumper designator J4, rather than J1.
2. If the J1 jumper setting is not the problem, replace the motherboard (Chapter 6).
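As a quick reference, the beep codes in Table 2–1 can be held in a lookup table. The Python sketch below is illustrative only: the names BEEP_CODES, describe_pattern, and lookup are assumptions, and the problem text is abbreviated from the table. It also spells out the audible pattern in the way Section 2.1 describes the 1-3-3 code.

```python
# Problem descriptions abbreviated from Table 2-1.
BEEP_CODES = {
    "1-1-2": "ROM data path error while loading ARC/SRM console code",
    "1-1-4": "SROM unable to load console code: Flash ROM header area or checksum error",
    "1-2-1": "TOY NVRAM failure",
    "1-2-4": "Backup cache error",
    "1-3-3": "No usable memory detected",
    "3-1-2": "J1 jumper set incorrectly or native SCSI controller (NCR810) failure",
    "3-3-1": "Generic system failure",
    "3-3-2": "J1 jumper set incorrectly or PCI-to-PCI bridge (DECchip 21050) failure",
    "3-3-3": "J1 jumper set incorrectly or native SCSI controller (NCR810) failure",
}


def describe_pattern(code: str) -> str:
    """Turn a code such as '1-3-3' into a description of what is heard."""
    bursts = [int(n) for n in code.split("-")]
    parts = ["1 beep" if n == 1 else f"a burst of {n} beeps" for n in bursts]
    return ", a pause, ".join(parts)


def lookup(code: str) -> str:
    """Return the abbreviated problem text, or a fallback for unknown codes."""
    return BEEP_CODES.get(code, "code not listed in Table 2-1")
```

The lookup covers only the codes listed in Table 2–1; the corrective actions are deliberately left in the table, because several of them depend on the system model and jumper designator.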
Figure 2–1 Model 4/xxx Systems: Jumper J1 on the CPU Daughter Board
Bank  Jumper Setting
0     Standard boot setting (AlphaServer 1000 systems)
1     Standard boot setting (AlphaServer 1000A systems)
2     Mini-console setting: Internal use only
3     SROM CacheTest: backup cache test
4     SROM BCacheTest: backup cache and memory test
5     SROM memTest: memory test with backup and data cache disabled
6     SROM memTestCacheOn: memory test with backup and data cache enabled
7     Fail-Safe Loader setting: selects fail-safe loader firmware
Figure 2–2 Model 5/xxx Systems: Jumper J4 on the CPU Daughter Board
Bank  Jumper Setting
0     Standard boot setting (AlphaServer 1000/1000A systems)
1     Standard boot setting (AlphaServer 1000/1000A systems)
2     Mini-console setting: Internal use only
3     Mini-console setting: Internal use only
4     Power up with no Bcache: Power up with Bcache disabled allows the system to run despite bad Bcache until a replacement daughter board is available
5     Mini-console setting: Internal use only
6     Mini-console setting: Internal use only
7     Fail-Safe Loader setting: selects fail-safe loader firmware
Figure 2–3 Model 5/xxx Systems: Jumper J1 on the CPU Daughter Board
Bank  Jumper Setting
0     Standard boot setting (AlphaServer 1000/1000A systems)
1     Standard boot setting (AlphaServer 1000/1000A systems)
2     Mini-console setting: Internal use only
3     Mini-console setting: Internal use only
4     Power up with no Bcache: Power up with Bcache disabled allows the system to run despite bad Bcache until a replacement daughter board is available
5     Mini-console setting: Internal use only
6     Mini-console setting: Internal use only
7     Fail-Safe Loader setting: selects fail-safe loader firmware
2.2 Model 5/xxx SROM Error Codes
Model 5/xxx systems report errors and status to the OCP display during SROM
power-up tests. Table 2–2 provides an explanation of the status and error codes
that may be displayed:
•Fatal error codes identify errors that prevent the system from accessing the
console and booting the operating system.
•Nonfatal error codes identify errors that may not prevent the system from
accessing the console, but may prevent the system from successfully booting
the operating system.
•Execution status codes identify the process that is currently underway.
Note
If errors are reported, be sure that the CPU daughter board is properly
seated in its connectors.
Table 2–2 Model 5/xxx SROM Test/Status Codes
OCP Code  Description                               Likely FRU
Fatal Error Codes
FF        No s-cache bits set in sc_ctl register    CPU daughter board
FD        Floppy load error                         Bad or wrong diskette in drive
FA        No usable memory detected                 SIMM memory or backplane
F9        System init failure                       CPU daughter board
F8        PCI data path error                       CPU daughter board
F7        CIA/PCEB I/O register init failure        CPU daughter board
F6        Bad CIA memory csr was detected           CPU daughter board
F4        Bcache data path error                    CPU daughter board
F3        Bcache address line error                 CPU daughter board
F1        Flash ROM data path read error            CPU daughter board
EB        CPU speed error detected                  CPU daughter board
EA        PCI-to-PCI (PPB) data path error          Motherboard
E9        No real-time clock (TOY)                  TOY/NVRAM chip
E6        EISA configuration NVRAM
E5        Main memory data path error
E4        Q-logic SCSI data path error
E3        Main memory address lines error
E2        Super I/O error
E1        Main memory cell test error
E0        Flash ROM checksum error
Execution Status Codes
DF        SROM program beginning to initialize
DE        Initialize CPU and system interface
DD        Sizing CPU speed
DC        Sizing S-cache
DB        Initializing and testing the PCI bus
DA        Sizing B-cache
D9        Sizing memory
D8        Configuring memory
D7        Initializing Bcache
D6        Testing memory
D5        Testing Bcache bits
D4        Testing memory bits
D3        Testing Bcache address
D2        Testing memory address
D1        Testing Bcache cells
D0        Testing memory cells
CF        Initializing memory
CD        Loading Flash ROM code
CC        Re-initializing CPU and system interface
CB        SROM execution completed                  The system could hang here if EV4 console code is used with 5/xxx (EV5) systems.
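When reading a code from the OCP display, it can help to know at once which group of Table 2–2 it belongs to. The Python sketch below is a hedged illustration: the function and set names are assumptions, and because the table as reproduced here gives no separate group heading for the EB through E0 codes, the sketch reports them only as error codes without stating whether they are fatal or nonfatal.

```python
# Codes listed under "Fatal Error Codes" in Table 2-2.
FATAL_CODES = {"FF", "FD", "FA", "F9", "F8", "F7", "F6", "F4", "F3", "F1"}

# Error codes EB-E0: their group heading is not shown in the table as
# reproduced here, so no severity is assumed for them.
OTHER_ERROR_CODES = {"EB", "EA", "E9", "E6", "E5", "E4", "E3", "E2", "E1", "E0"}

# Codes listed under "Execution Status Codes".
STATUS_CODES = {
    "DF", "DE", "DD", "DC", "DB", "DA", "D9", "D8", "D7", "D6",
    "D5", "D4", "D3", "D2", "D1", "D0", "CF", "CD", "CC", "CB",
}


def classify_ocp_code(code: str) -> str:
    """Return the Table 2-2 group for a two-character OCP code."""
    key = code.strip().upper()
    if key in FATAL_CODES:
        return "fatal error"
    if key in OTHER_ERROR_CODES:
        return "error (severity not stated in the table as shown)"
    if key in STATUS_CODES:
        return "execution status"
    return "not listed in Table 2-2"
```

A code that stays on the display and classifies as an execution status suggests the SROM stopped during that step; the table row for that code is still the authority on the likely FRU.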
2.3 Power-Up Screen
During power-up self-tests, the test status and result are displayed on the console
terminal. Information similar to the following example should be displayed on
the screen.
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.ef.df.ee.f4.
probing hose 0, PCI
probing PCI-to-PCI bridge, bus 1
bus 1, slot0 -- pka -- QLogic ISP1020
bus 0, slot 11 -- ewa -- DECchip 21040-AA
probing hose 1, EISA
ECU error, slot 0, found DEC5000, expected nothing
EISA Configuration Error
Run the EISA Configuration Utility
ed.ec.eb.....ea.e9.e8.e7.e6.e5.e4.e3.e2.e1.e0.
X4.6-8189, built on Jul 29 1996 at 03:21:03
Memory Testing and Configuration Status
32 Meg of System Memory
Bank 0 = 32 Mbytes(8 MB Per SIMM) Starting at 0x00000000
Bank 1 = No Memory Detected
Bank 2 = No Memory Detected
Bank 3 = No Memory Detected
Testing the System
Change mode to Internal loopback.
Change to Normal Operating Mode.
>>>
Table 2–3 provides a description of the power-up countdown for output to the
serial console port. If the power-up display stops, use the beep codes (Table 2–1)
and Table 2–3 to isolate the likely field-replaceable unit (FRU).
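If the power-up display has been captured from the serial console port, the last countdown code that was printed before the display stopped can be picked out mechanically. The Python sketch below assumes the capture is plain text in the dotted format shown in the example above; the function name is hypothetical, and mapping the resulting code to a FRU is left to Table 2–3.

```python
import re

# A countdown line is a run of two-digit hex codes separated by one or more
# dots, for example "ed.ec.eb.....ea.e9." (a trailing dot is optional).
COUNTDOWN_LINE = re.compile(r"[0-9a-f]{2}(?:\.+[0-9a-f]{2})*\.*")


def last_countdown_code(display_text: str):
    """Return the last countdown code in the captured display, or None."""
    last = None
    for line in display_text.splitlines():
        stripped = line.strip().lower()
        if stripped and COUNTDOWN_LINE.fullmatch(stripped):
            # Other lines (probe messages, memory status) are skipped.
            last = re.findall(r"[0-9a-f]{2}", stripped)[-1]
    return last
```

Matching whole lines rather than searching for any two hex digits avoids false hits on probe messages such as slot numbers or chip names.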
Table 2–3 Console Power-Up Countdown Description and Field Replaceable
Digital UNIX and OpenVMS operating systems are supported by the SRM
firmware (see Section 5.1.1). The SRM console prompt follows:
>>>
Windows NT for Model 4/xxx Systems
The Windows NT operating system is supported by the ARC firmware for model
4/xxx systems (see Section 5.1.1). Model 4/xxx systems using Windows NT power
up to the ARC boot menu as follows:
Alpha Firmware Version
Copyright (c) 1993-1995 Microsoft Corporation
Copyright (c) 1993-1995 Digital Equipment Corporation
Boot menu:
Boot Windows NT
Boot an alternate operating system...
Run a program...
Supplementary menu...
Use the arrow keys to select, then press Enter.
Windows NT for Model 5/xxx Systems
The Windows NT operating system is supported by the AlphaBIOS firmware for
Model 5/xxx systems (see Section 5.1.1). Model 5/xxx systems using Windows NT
power up to the AlphaBIOS boot menu as follows:
AlphaBIOS Version 5.11
Figure 2–4 AlphaBIOS Boot Menu
Please select the operating system to start:
Windows NT Workstation 3.51
Use the arrow keys to move the highlight to your choice.
Press Enter to choose.
Press <F2> to enter SETUP
Refer to the AlphaServer 1000/1000A Model 5/xxx Owner’s Guide Supplement for
information on the AlphaBIOS firmware.
2.3.1 Console Event Log
AlphaServer 1000A systems maintain a console event log consisting of status
messages received during power-on self-tests. If problems occur during power-up,
standard error messages indicated by asterisks (***) may be embedded in the
console event log. To display a console event log, use the more el or cat el
command.
Note
To stop the screen display from scrolling, press Ctrl/S. To resume scrolling,
press Ctrl/Q.
You can also use the command, more el, to display the console event log
one screen at a time.
The following example shows a console event log that contains two standard error
messages. The first indicates that the mouse is not plugged in or is not working,
and the second indicates that SROM tests detected a bad SIMM (bank 1, SIMM 3).
>>> cat el
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.ef.df.ee.f4.
probing hose 0, PCI
probing PCI-to-EISA bridge, bus 1
probing PCI-to-PCI bridge, bus 2
bus 2, slot0 -- pka -- QLogic ISP1020
bus 0, slot 11 -- ewa -- DECchip 21040-AA
ed.ec.
** mouse error **
*** Bad memory detected by serial rom
*** SROM failing Bank 1, SIMM 3
eb.....ea.e9.e8.e7.e6.e5.e4.e3.e2.e1.e0.
X4.6-10166, built on Aug 30 1996 at 16:18:06
.
.
.
>>>
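If the event log has been saved as text, the embedded error messages can be separated from the routine status lines. The following Python sketch shows one way to do this. The function name find_event_log_errors is hypothetical. Because the mouse error in the example above is flagged with two asterisks rather than three, the sketch treats any line that begins with two or more asterisks as an error message.

```python
def find_event_log_errors(log_text):
    """Return the text of every line flagged with leading asterisks."""
    errors = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("**"):
            # Drop the asterisks and surrounding spaces, keep the message.
            errors.append(stripped.strip("* "))
    return errors
```

Applied to the example log above, this returns the mouse error and the two SROM memory messages, in the order in which they were logged.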
2.4 Model 4/xxx SROM Memory Power-Up Tests
If the power-up tests or ROM-based diagnostics indicate a memory error without
identifying the failing bank and SIMM position, you can match the failing address
to a table using the procedure in Chapter 3, or for model 4/xxx systems, you can
run specific SROM power-up tests using jumper J1 (Figure 2–5) on the CPU
daughter board. The progress and results of these tests are reported on the LCD
display on the operator control panel (OCP).
To thoroughly test memory and data paths, complete the SROM tests in the order
presented in Table 2–4. If a SIMM is reported bad, replace the SIMM (Chapter 6)
and resume testing at bank 4 (Memory Test).
Table 2–4 SROM Memory Tests, CPU Jumper J1
Bank #  Test Description and Test Results

3  Cache Test: Tests backup cache.
Test status displays on OCP:
....done.
If the test takes longer than a few seconds to complete, there is a problem with
the backup cache. Replace the CPU daughter board (Chapter 6).

5  Memory Test: Tests memory with backup and data cache disabled.
Test status displays on OCP:
12345.done.
Test duration: Approximately 10 seconds per 8 megabytes of memory.

6  Memory Test, Cache Enabled: Tests memory with backup and data cache enabled.
Test status displays on OCP:
12345.done.
Test duration: Approximately 2 seconds per 8 megabytes of memory.

4  Backup Cache Test: Tests backup cache alternately with data cache enabled
then disabled.
Test status displays on OCP:
d 12345.done.
D 12345.done.
D 12345.done.
d 12345.done.
Test duration: Approximately 2 seconds per 8 megabytes of memory.

The following results apply to the tests at banks 5, 6, and 4:
If an error is detected, the bank number and failing SIMM position are displayed.
The following OCP message indicates a failing SIMM at bank 0, SIMM position 2.
FAIL B:0 S:2
Figure 2–6 shows the bank and SIMM layout for AlphaServer 1000A systems. After
determining the bad SIMM, refer to Chapter 6 for instructions on replacing FRUs.
Note: The memory tests do not test the ECC SIMMs. If the operating system logs
five or more single-bit correctable errors, replace the suspected ECC SIMMs with
good SIMMs and repeat the memory test.
ECC SIMMs cannot be used in the standard memory banks (banks 0–3). ECC SIMMs
are specialized for use only in ECC banks.
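The failure message format and the test durations in Table 2–4 lend themselves to a small helper for anyone logging results by hand. The following Python sketch is illustrative only. The names parse_ocp_fail and estimated_duration_seconds are hypothetical, and the sketch assumes the OCP text has been transcribed exactly as shown in the table.

```python
import re

# Matches the OCP failure message shown in Table 2-4, e.g. "FAIL B:0 S:2".
FAIL_PATTERN = re.compile(r"FAIL\s+B:(\d+)\s+S:(\d+)")

# Approximate seconds per 8 MB of memory, keyed by J1 jumper bank (Table 2-4).
SECONDS_PER_8MB = {5: 10, 6: 2, 4: 2}


def parse_ocp_fail(message):
    """Return (bank, simm_position) from an OCP failure message, or None."""
    match = FAIL_PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def estimated_duration_seconds(jumper_bank, memory_mb):
    """Estimate how long the test at a J1 jumper bank takes for memory_mb."""
    return SECONDS_PER_8MB[jumper_bank] * memory_mb / 8
```

For a system with 32 MB of memory, the estimate is about 40 seconds for the bank 5 test and about 8 seconds for the bank 6 test.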
Figure 2–5 Model 4/xxx: Jumper J1 on the CPU Daughter Board
Bank  Jumper Setting
0  Standard boot setting (AlphaServer 1000 systems)
1  Standard boot setting (AlphaServer 1000A systems)
2  Mini-console setting: Internal use only
3  SROM CacheTest: backup cache test
4  SROM BCacheTest: backup cache and memory test
5  SROM memTest: memory test with backup and data cache disabled
6  SROM memTestCacheOn: memory test with backup and data cache enabled
7  Fail-Safe Loader setting: selects fail-safe loader firmware
Figure 2–6 Model 4/xxx: AlphaServer 1000A Memory Layout
[Figure: memory banks 0 through 3, each with SIMM positions 0 through 3, and ECC banks holding one ECC SIMM for each of banks 0 through 3.]
2.5 Mass Storage Problems Indicated at Power-Up
Mass storage failures at power-up are usually indicated by read fail messages.
Other problems are indicated by storage devices missing from the show config
display.
•Table 2–5 provides information for troubleshooting mass storage problems
indicated at power-up or storage devices missing from the show config
display.
•Table 2–6 provides troubleshooting tips for AlphaServer systems that use the
RAID Array 200 Subsystem.
•Section 2.6 provides information on storage device LEDs.
Use Tables 2–5 and 2–6 to diagnose the likely cause of the problem.
Table 2–5 Mass Storage Problems

Problem: Drive failure
Symptom: Fault LED for drive is on (steady) (Section 2.6).
Corrective action: Replace drive.

Problem: Duplicate SCSI IDs
Symptom: Drives with duplicate SCSI IDs are missing from the show config display.
Corrective action: Correct SCSI IDs. May need to reconfigure internal
StorageWorks backplane (Section 5.8).

Problem: SCSI ID set to 7 (reserved for host ID)
Symptom: Valid drives are missing from the show config display. One drive may
appear seven times on the show config display.
Corrective action: Correct SCSI IDs.

Problem: Duplicate host IDs on a shared bus
Symptom: Valid drives are missing from the show config display. One drive may
appear seven times on the show config display.
Corrective action: Change host ID through the pk*0_host_id environment variable
(set pk*0_host_id) for systems running OpenVMS or Digital UNIX (SRM console).
For systems running Windows NT (ARC console), choose ‘‘Set default
configuration’’ in the Setup Menu.

Problem: Missing or loose cables. Drives not properly seated on StorageWorks shelf
Symptom: Activity LEDs do not come on. Drive missing from the show config display.
Corrective action: Remove device and inspect cable connections. Reseat drive on
StorageWorks shelf.

Problem: SCSI bus length exceeded
Symptom: Drives may disappear intermittently from the show config and show
device displays.
Corrective action: A SCSI bus extended to the internal StorageWorks shelf with
the backplane configured as a single bus cannot be extended outside of the
enclosure.
A SCSI bus extended to the internal StorageWorks shelf with the backplane
configured as a dual bus can be extended 1 meter outside of the enclosure.
The entire SCSI bus length, from terminator to terminator, must not exceed
6 meters for single-ended SCSI-2 at 5 MB/sec, or 3 meters for single-ended
SCSI-2 at 10 MB/sec.

Problem: Terminator missing or wrong terminator used
Symptom: Read/write errors in the console event log; storage adapter port may
fail. If the bulkhead terminator for the removable-media bus is missing,
removable media devices may not be recognized by the system and may be missing
from the show config and show device displays.
Corrective action: Attach appropriate terminators as needed (external SCSI
terminator for use with the RAID Array 200 Subsystem, 12-41667-04 (68-pin),
17-04166-02 (50-pin); external SCSI terminator for removable-bus, 12-41667-05).
Note: The SCSI terminator jumper (J51) on the system motherboard should be set
to ‘‘on’’ to enable the onboard SCSI termination.

Problem: Extra terminator
Symptom: Devices produce errors or device IDs are dropped.
Corrective action: Check that bus is terminated only at beginning and end.
Remove unnecessary terminators.
Note: The SCSI terminator jumper (J51) on the system motherboard should be set
to ‘‘on’’ to enable the onboard SCSI termination.

Problem: SCSI storage controller failure
Symptom: Problems persist after eliminating the problem sources.
Corrective action: Replace failing EISA or PCI storage adapter module (or
motherboard for the native SCSI controller).
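The bus length limits in Table 2–5 can be checked with simple arithmetic when planning cabling. The following Python sketch is illustrative only. The function name bus_length_ok and the idea of summing per-segment cable lengths are assumptions made for this example; only the 6-meter and 3-meter limits come from the table.

```python
# Maximum single-ended SCSI-2 bus length in meters, terminator to terminator,
# keyed by transfer rate in MB/sec (limits taken from Table 2-5).
MAX_BUS_LENGTH_M = {5: 6.0, 10: 3.0}


def bus_length_ok(segment_lengths_m, transfer_rate_mb_s):
    """Return True if the summed cable segments stay within the limit."""
    total = sum(segment_lengths_m)
    return total <= MAX_BUS_LENGTH_M[transfer_rate_mb_s]
```

For example, segments of 2.5 meters and 1 meter total 3.5 meters, which is within the limit at 5 MB/sec but exceeds it at 10 MB/sec.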
Table 2–6 provides troubleshooting hints for AlphaServer 1000A systems that
have the StorageWorks RAID Array 200 Subsystem. The RAID subsystem
includes either the KZESC-xx (SWXCR-Ex) or the KZPSC-xx (SWXCR-Px) PCI
backplane RAID controller.
Table 2–6 Troubleshooting RAID Problems

Symptom: Some RAID drives do not appear on the show device d display.
Action: Valid configured RAID logical drives will appear as DRA0–DRAn, not as
DKn. Configure the drives by running the RAID Configuration Utility (RCU),
following the instructions in the StorageWorks RAID
SWRA2-IG.
Reminder: several physical disks can be grouped as a single logical DRAn device.
External SCSI terminators used with the SWXCR controller must be of the
following type: 12-41667-04 (68-pin); 17-41667-02 (50-pin).

Symptom: Drives on the SWXCR controller power up with the amber Fault light on.
Action: Whenever you move drives onto or off of the controller, run the RAID
Configuration Utility to set up the drives and logical units. Follow the
instructions in the
External SCSI terminators used with the SWXCR controller must be of the
following type: 12-41667-04 (68-pin); 17-41667-02 (50-pin).

Symptom: Cannot access disks connected to the RAID subsystem on Windows NT systems.
Action: On Windows NT systems, disks connected to the controller must be spun
up before they can be accessed. While running the ECU, verify that the
controller is set to spin up two disks every six seconds. This is the default
setting if you are using the default configuration files for the controller. If
the settings are different, adjust them as needed.
2.6 Storage Device LEDs
Storage device LEDs indicate the status of the device.
•Figure 2–7 shows the LEDs for disk drives contained in a StorageWorks shelf.
A failure is indicated by the Fault light on each drive.
•Figure 2–8 shows the Activity LED for the floppy drive. This LED is on when
the drive is in use.
•Figure 2–9 shows the Activity LED for the CD–ROM drive. This LED is on
when the drive is in use.
For information on other storage devices, refer to the documentation provided by
the manufacturer or vendor.
Figure 2–7 StorageWorks Disk Drive LEDs (SCSI)
[Figure callouts: Activity, Fault]
Figure 2–8 Floppy Drive Activity LED
[Figure callout: Activity LED]
Figure 2–9 CD–ROM Drive Activity LED
[Figure callout: Activity LED]
2.7 EISA Bus Problems Indicated at Power-Up
EISA bus failures at power-up are usually indicated by the following messages
displayed during power-up:
EISA Configuration Error. Run the EISA Configuration Utility.
Run the EISA Configuration Utility (ECU) (Section 5.4) when this message is
displayed. Other EISA bus problems are indicated by the absence of EISA devices
from the show config display.
Table 2–7 provides steps for troubleshooting EISA bus problems that persist after
you run the ECU.
Table 2–7 EISA Troubleshooting
Step  Action
1  Confirm that the EISA module and any cabling are properly seated.
2  Run the ECU to:
•Confirm that the system has been configured with the most recently installed
controller.
•See what the hardware jumper and switch setting should be for each ISA
controller.
•See what the software setting should be for each ISA and EISA controller.
•See if the ECU deactivated (<>) any controllers to prevent conflict.
•See if any controllers are locked (!), which limits the ECU’s ability to change
resource assignments.
3  Confirm that the hardware jumpers and switches on ISA controllers reflect the
settings indicated by the ECU. Start with the last ISA module installed.
4  Run ROM-based diagnostics for the type of option:
5  Check for a bad slot by moving the last installed controller to a different slot.
6  Call the option manufacturer or support for help.
The following tips can aid in isolating EISA bus problems.
•Peripheral device controllers need to be seated (inserted) carefully, but firmly,
into their slots to make all necessary contacts. Improper seating is a common
source of problems for EISA modules.
•Be sure you run the correct version of the ECU for the operating system.
For Windows NT, use ECU diskette DECpc AXP (AK-PYCJ*-CA); for Digital
UNIX and OpenVMS, use ECU diskette DECpc AXP (AK-Q2CR*-CA).
•The CFG files supplied with the option you want to install may not work on
AlphaServer 1000A systems. Some CFG files call overlay files that are not
required on this system or may reference inappropriate system resources, for
example, BIOS addresses. Contact the option vendor to obtain the proper
CFG file.
•Peripherals cannot share direct memory access (DMA) channels. Assignment
of more than one peripheral to the same DMA channel can cause
unpredictable results or even loss of function of the EISA module.
•Not all EISA products work together. EISA is an open standard, and not
every EISA product or combination of products can be tested. Violations of
specifications may matter in some configurations, but not in others.
Manufacturers of EISA options often test the most common combinations and
may have a list of ISA and EISA options that do not function in combination
with particular systems. Be sure to check the documentation or contact the
option vendor for the most up-to-date information.
•EISA systems will not function unless they are first configured using the
ECU.
•The ECU will not notify you if the configuration program diskette is
write-protected when it attempts to write the system configuration file
(system.sci) to the diskette.
2.8 PCI Bus Problems Indicated at Power-Up
PCI bus failures at power-up are usually indicated by the inability of the system
to see the device. Table 2–8 provides steps for troubleshooting PCI bus problems.
Use the table to diagnose the likely cause of the problem.
Note
Some PCI devices do not implement PCI parity, and some have a
parity-generating scheme in which parity is sometimes incorrect or is not
compliant with the PCI Specification. In such cases, the device functions
properly as long as parity is not checked. The pci_parity environment
variable for the SRM console, or the ENABLEPCIPARITYCHECKING
environment variable for the ARC console, allows you to turn off parity
checking so that false PCI parity errors do not result in machine check
errors.
When you disable PCI parity, no parity checking is implemented for any
PCI device, even those devices that produce correct, compliant parity.
Table 2–8 PCI Troubleshooting
Step  Action
1  Confirm that the PCI module and any cabling are properly seated.
2  Run ROM-based diagnostics for the type of option:
•test to exercise the storage devices off the PCI
•netew or network to exercise an Ethernet adapter
3  Check for a bad slot by moving the last installed controller to a different slot.
4  Call the option manufacturer or support for help.
2.8.1 Additional PCI Troubleshooting Tips
Some PCI options are restricted to the primary PCI bus, slots 11, 12, and 13.
Refer to the following documents for restrictions on specific PCI options:
•AlphaServer 1000A READ THIS FIRST—shipped with the system.
•AlphaServer 1000A Supported Options List—The options list is available from
the Internet at the following locations:
2.9 Fail-Safe Loader
The fail-safe loader (FSL) is a redundant or backup ROM that allows you to
power up without running power-up diagnostics and load new SRM/ARC or
SRM/AlphaBIOS and FSL console firmware from the firmware diskette.
Note
The fail-safe loader should be used only when a failure at power-up
prohibits you from getting to the console program. You cannot boot an
operating system from the fail-safe loader.
If a checksum error is detected when the SRM/ARC or SRM/AlphaBIOS
console is loading at power-up (error beep code 1-1-4), you need to activate
the fail-safe loader and reinstall the firmware.
The fail-safe loader (FSL) allows you to attempt to recover when one of the
following prevents the system from reaching the console program under normal
power-up:
•A hardware or power failure, or an accidental power-down, that occurred
during a firmware upgrade.
•A configuration error, such as an incorrect environment variable setting or an
inappropriate nvram script.
•A driver error at power-up.
•A checksum error detected when the SRM console is loading at power-up
(corrupted firmware).
The fail-safe loader program is also available on diskette.
2.9.1 Fail-Safe Loader Functions
From the FSL program, you can update or load new SRM/ARC or SRM
/AlphaBIOS console firmware and FSL console firmware.
Note
When installing new console firmware, the flash ROM VPP enable jumper
(J50) on the motherboard must be enabled.
2.9.2 Activating the Fail-Safe Loader
To activate the FSL:
1. Install the jumper at bank 7 of the J1 or J4 jumper on the CPU daughter
board. The jumper is normally installed in the standard boot setting (bank
1 for AlphaServer 1000A Model 4/xxx systems, bank 0 or 1 for Model 5/xxx
systems). Refer to Figures 2–10 through 2–12.
2. Install the console firmware diskette and turn on the system.
Two messages are displayed on the operator control panel (OCP) when the
FSL program loads the diskette:
OCP Message  Meaning
Floppy Boot  FSL firmware is executing.
Starting CPU  FSL firmware found a valid boot block, loaded the program into
memory, and is attempting to transfer control to the loaded program.
3. Reinstall the console firmware from a firmware diskette.
4. When you have finished, power down and return the J1 or J4 jumper to the
standard boot setting (bank 1).
Figure 2–10 Model 4/xxx: Jumper J1 on the CPU Daughter Board
Bank  Jumper Setting
0  Standard boot setting (AlphaServer 1000 systems)
1  Standard boot setting (AlphaServer 1000A systems)
2  Mini-console setting: Internal use only
3  SROM CacheTest: backup cache test
4  SROM BCacheTest: backup cache and memory test
5  SROM memTest: memory test with backup and data cache disabled
6  SROM memTestCacheOn: memory test with backup and data cache enabled
7  Fail-Safe Loader setting: selects fail-safe loader firmware
Figure 2–11 Model 5/xxx Systems: Jumper J4 on the CPU Daughter Board
Bank  Jumper Setting
0  Standard boot setting (AlphaServer 1000/1000A systems)
1  Standard boot setting (AlphaServer 1000/1000A systems)
2  Mini-console setting: Internal use only
3  Mini-console setting: Internal use only
4  Power up with no Bcache: Power up with Bcache disabled allows the system to run
despite bad Bcache until a replacement daughter board is available
5  Mini-console setting: Internal use only
6  Mini-console setting: Internal use only
7  Fail-Safe Loader setting: selects fail-safe loader firmware
Figure 2–12 Model 5/xxx Systems: Jumper J1 on the CPU Daughter Board
Bank  Jumper Setting
0  Standard boot setting (AlphaServer 1000/1000A systems)
1  Standard boot setting (AlphaServer 1000/1000A systems)
2  Mini-console setting: Internal use only
3  Mini-console setting: Internal use only
4  Power up with no Bcache: Power up with Bcache disabled allows the system to run
despite bad Bcache until a replacement daughter board is available
5  Mini-console setting: Internal use only
6  Mini-console setting: Internal use only
7  Fail-Safe Loader setting: selects fail-safe loader firmware
2.10 Power-Up Sequence
During the AlphaServer 1000A power-up sequence, the power supplies are
stabilized and the system is initialized and tested through the firmware power-on
self-tests.
The power-up sequence includes the following:
•Power supply power-up:
–AC power-up
–DC power-up
•Two sets of power-on diagnostics:
–Serial ROM diagnostics
–Console firmware-based diagnostics
Caution
The AlphaServer 1000A enclosure will not power up if the top cover is not
securely attached. Removing the top cover will cause the system to shut
down.
2.10.1 AC Power-Up Sequence
The following power-up sequence occurs when AC power is applied to the system
(system is plugged in) or when electricity is restored after a power outage:
1. The front end of the power supply begins operation and energizes.
2. The power supply then waits for the DC power to be enabled.
Note
The top cover and side panels must be securely installed. A safety
interlock prevents the system from being powered on with the cover and
panels removed.
2.10.2 DC Power-Up Sequence
DC power is applied to the system with the DC On/Off button on the operator
control panel.
A summary of the DC power-up sequence follows:
1. When the DC On/Off button is pressed, the power supply checks for a POK_H
condition.
2. 12V, 5V, 3.3V, and -12V outputs are energized and stabilized. If the outputs
do not come into regulation, the power-up is aborted and the power supply
enters the latching-shutdown mode.
2.11 Firmware Power-Up Diagnostics
After successful completion of AC and DC power-up sequences, the processor
performs its power-up diagnostics. These tests verify system operation, load
the system console, and test the core system (CPU, memory, and motherboard),
including all boot path devices. These tests are performed as two distinct sets of
diagnostics:
1. Serial ROM diagnostics—These tests are loaded from the serial ROM located
on the CPU daughter board into the CPU’s instruction cache (I-cache). The
tests check the basic functionality of the system and load the console code
from the FEPROM on the motherboard into system memory.
Failures during these tests are indicated by audible error beep codes
(Table 2–1), the console event log (Section 2.3.1), and for Model 5/xxx systems,
OCP error codes (Section 2.2).
Failures of customized SROM tests for Model 4/xxx systems (Section 2.4),
set using the J1 jumper on the CPU daughter board, are displayed on the
operator control panel.
2. Console firmware-based diagnostics—These tests are executed by the console
code. They test the core system, including all boot path devices.
Failures during these tests are reported to the console terminal through the
power-up screen or console event log.
2.11.1 Serial ROM Diagnostics
The serial ROM diagnostics are loaded into the CPU’s instruction cache from the
serial ROM on the CPU daughter board. The diagnostics test the system in the
following order:
1. Test the CPU and backup cache located on the CPU daughter board.
2. Test the CPU module’s system bus interface.
3. Test the system bus to PCI bus bridge and system bus to EISA bus bridge. If
the PCI bridge fails or EISA bridge fails, an audible error beep code (3-3-1)
sounds (Table 2–1). The power-up tests continue despite these errors.
4. Test the PCI-to-PCI bus bridge. If the bridge fails, an error beep code (3-3-2)
sounds.
5. Test the native SCSI controller. If the controller fails, an error beep code
(3-1-2) sounds.
6. Configure the memory in the system and test only the first 16 MB of memory.
If the memory test fails, the failing bank is mapped out and memory is
reconfigured and re-tested. Testing continues until good memory is found. If
good memory is not found, an error beep code (1-3-3) is generated and the
power-up tests are terminated.
7. Check the data path to the FEPROM on the motherboard.
8. The console program is loaded into memory from the FEPROM on the
motherboard. A checksum test is executed for the console image. If the
checksum test fails, an error beep code (1-1-4) is generated, and the power-up
tests are terminated.
If the checksum test passes, control is passed to the console code, and the
console firmware-based diagnostics are run.
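The beep codes named in the steps above can be collected into a lookup table for quick reference. The following Python sketch is illustrative only and covers just the five codes mentioned in this section; Table 2–1 remains the full reference. The names SROM_BEEP_CODES and describe_beep_code are hypothetical. Where the steps above do not say whether testing continues after a failure, the sketch records None rather than guessing.

```python
# Beep codes named in the serial ROM test sequence.
# Each entry: (failing item, whether power-up tests continue).
# None means the sequence description does not state what happens next.
SROM_BEEP_CODES = {
    "3-3-1": ("system bus to PCI bridge or EISA bridge", True),
    "3-3-2": ("PCI-to-PCI bus bridge", None),
    "3-1-2": ("native SCSI controller", None),
    "1-3-3": ("memory (no good memory found)", False),
    "1-1-4": ("console image checksum", False),
}


def describe_beep_code(code):
    """Return a one-line description of an SROM beep code."""
    entry = SROM_BEEP_CODES.get(code)
    if entry is None:
        return code + ": not covered by this sketch, see Table 2-1"
    failure, continues = entry
    if continues is True:
        outcome = "power-up tests continue"
    elif continues is False:
        outcome = "power-up tests are terminated"
    else:
        outcome = "continuation not stated"
    return code + ": " + failure + " failure; " + outcome
```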
2.11.2 Console Firmware-Based Diagnostics
Console firmware-based tests are executed once control is passed to the console
code in memory. They check the system in the following order:
1. Perform a complete check of system memory.
Steps 2–5 may be completed in parallel.
2. Start the I/O drivers for mass storage devices and tapes. At this time a
complete functional check of the machine is made. After the I/O drivers
are started, the console program continuously polls the bus for devices
(approximately every 20 or 30 seconds).
3. Check that EISA configuration information is present in NVRAM for each
EISA module detected and that no information is present for modules that
have been removed.
4. Run exercisers on the drives currently seen by the system.
Note
This step does not ensure that all disks in the system will be tested or
that any device drivers will be completely tested. Spin-up time varies
for different drives, so not all disks may be on line at this point in the
power-up sequence. To ensure complete testing of disk devices, use the
test command (Section 3.3.1).
5. Enter console mode or boot the operating system. This action is determined
by the auto_action environment variable.
If the os_type environment variable is set to NT, the ARC (Model 4/xxx
systems) or AlphaBIOS (Model 5/xxx systems) console is loaded into memory,
and control is passed to the ARC or AlphaBIOS console.
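The console selection described in step 5 can be summarized as a small decision function. The following Python sketch is illustrative only. The function name console_after_power_up and the "4" and "5" model family strings are hypothetical. The sketch also assumes that any os_type value other than NT leaves the system at the SRM console, which is consistent with the statement in Section 2.3 that Digital UNIX and OpenVMS are supported by the SRM firmware.

```python
def console_after_power_up(os_type, model_family):
    """Return the console that receives control after power-up tests.

    os_type      -- value of the os_type environment variable
    model_family -- "4" for Model 4/xxx systems, "5" for Model 5/xxx systems
    """
    if model_family not in ("4", "5"):
        raise ValueError("model_family must be '4' or '5'")
    if os_type.upper() == "NT":
        # Windows NT: ARC on Model 4/xxx, AlphaBIOS on Model 5/xxx.
        return "ARC" if model_family == "4" else "AlphaBIOS"
    # Assumption: every other os_type value stays at the SRM console.
    return "SRM"
```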
3
Running System Diagnostics
This chapter provides information on how to run system diagnostics.
•Section 3.1 describes how to run ROM-based diagnostics, including error
reporting utilities and loopback tests.
•Section 3.4 describes acceptance testing and initialization procedures.
•Section 3.5 describes the DEC VET operating system exerciser.
3.1 Running ROM-Based Diagnostics
ROM-based diagnostics (RBDs), which are part of the console firmware that
is loaded from the FEPROM on the system motherboard, offer many powerful
diagnostic utilities, including the ability to examine error logs from the console
environment and run system- or device-specific exercisers.
AlphaServer 1000A RBDs rely on exerciser modules, rather than functional tests,
to isolate errors. The exercisers are designed to run concurrently, providing a
maximum bus interaction between the console drivers and the target devices.
The multitasking ability of the console firmware allows you to run diagnostics in
the background (using the background operator ‘‘&’’ at the end of the command).
You run RBDs by using console commands.
Note
ROM-based diagnostics, including the test command, are run from the
SRM console (firmware used by OpenVMS and Digital UNIX operating
systems). If you are running a Windows NT system, refer to Section 5.1.2
for the steps used to switch between consoles.
RBDs report errors to the console terminal and/or the console event log.
3.2 Command Summary
Table 3–1 provides a summary of the diagnostic and related commands.
Table 3–1 Summary of Diagnostic and Related Commands
Command  Function  Reference

Acceptance Testing
test  Quickly tests the core system. The test command is the primary diagnostic for acceptance testing and console environment diagnosis. For Model 4/xxx systems, the tests are run concurrently and indefinitely. For Model 5/xxx systems, the test command runs one pass of the tests. To run tests concurrently and indefinitely on Model 5/xxx systems, use the sys_exer command.  Section 3.3.1

Error Reporting
cat el  Displays the console event log.  Section 3.3.3
more el  Displays the console event log one screen at a time.  Section 3.3.3

Extended Testing/Troubleshooting
memexer  Exercises memory by running a specified number of memory tests on Model 5/xxx systems. The tests are run in the background.  Section 3.3.4
memory  Runs memory exercises each time the command is entered. These exercises run concurrently in the background.  Section 3.3.5
net -ic  Initializes the MOP counters for the specified Ethernet port.  Section 3.3.9
net -s  Displays the MOP counters for the specified Ethernet port.  Section 3.3.8
netew  Runs external MOP loopback tests for specified EISA- or PCI-based ew* (DECchip 21040, TULIP) Ethernet ports.  Section 3.3.6
sys_exer  Exercises core system for Model 5/xxx systems. Runs tests concurrently.  Section 3.3.2

Loopback Testing
netew  Runs external MOP loopback tests for specified EISA- or PCI-based ew* (DECchip 21040, TULIP) Ethernet ports.  Section 3.3.6
sys_exer -lb  Conducts loopback tests for COM2 and the parallel port in addition to core system tests for Model 5/xxx systems.  Section 3.3.2
test -lb  Conducts loopback tests for COM2 and the parallel port in addition to quick core system tests.  Section 3.3.1

Diagnostic-Related Commands
kill  Terminates a specified process.  Section 3.3.10
kill_diags  Terminates all currently executing diagnostics.  Section 3.3.10
show_status  Reports the status of currently executing test/exercisers.  Section 3.3.11
3.3 Command Reference
This section provides detailed information on the diagnostic commands and
related commands.
3.3.1 test
The test command runs firmware diagnostics for the entire core system. The
tests are run concurrently in the background. Fatal errors are reported to the
console terminal.
The cat el command should be used in conjunction with the test command to
examine test/error information reported to the console event log.
For Model 4/xxx systems, the tests are run concurrently and indefinitely (until
you stop them with the kill_diags command). These tests are useful in flushing
out intermittent hardware problems.
For Model 5/xxx systems, the test command runs one pass of the tests. To run
tests concurrently and indefinitely on Model 5/xxx systems, use the sys_exer
command.
Note
By default, no write tests are performed on disk and tape drives. Media
must be installed to test the floppy drive and tape drives. A loopback
connector is required for the COM2 (9-pin loopback connector, 12-27351-01)
port.
The test command does not test the DNSES, TGA card, reflective memory
option, or third-party options.
When using the test command after shutting down an operating system,
you must initialize the system to a quiescent state. Enter the following
commands at the SRM console:
>>> set auto_action halt
>>> init
...
>>> test
After testing is completed, set the auto_action environment variable to
its previous value (usually, boot) and use the Reset button to reset the
system.
To terminate the tests, use the kill command to terminate an individual
diagnostic or the kill_diags command to terminate all diagnostics. Use the
show_status display to determine the process ID when terminating an individual
diagnostic test.
Note
A serial loopback connector (12-27351-01) must be installed on the COM2
serial port for the
kill_diags
command to successfully terminate system
tests.
The test script tests devices in the following order:
1. Console loopback tests if the lb argument is specified: COM2 serial port and
parallel port.
2. Network external loopback tests for E*A0. This test requires that the
Ethernet port be terminated or connected to a live network; otherwise, the
test will fail.
5. VGA console tests. These tests are run only if the console environment
variable is set to ‘‘serial.’’ The VGA console test displays rows of the word
‘‘digital’’.
Synopsis:
test [lb]
Argument:
[lb]    The loopback option includes console loopback tests for the COM2 serial
        port and the parallel port during the test sequence.
Examples:
In the following example, a Model 4/xxx system is tested and the tests complete
successfully.
Note
Examine the console event log after running tests.
>>> test
Requires diskette and loopback connectors on COM2 and parallel port
type kill_diags to halt testing
type show_status to display testing progress
type cat el to redisplay recent errors
Testing COM2 port
Setting up network test, this will take about 20 seconds
Testing the network
48 Meg of System Memory
Bank 0 = 16 Mbytes(4 MB Per Simm) Starting at 0x00000000
Bank 1 = 16 Mbytes(4 MB Per Simm) Starting at 0x01000000
Bank 2 = 16 Mbytes(4 MB Per Simm) Starting at 0x02000000
Bank 3 = No Memory Detected
Testing the memory
Testing parallel port
Testing the SCSI Disks
Non-destructive Test of the Floppy started dka400.4.0.6.0 has no media
present or is disabled via the RUN/STOP switch
file open failed for dka400.4.0.6.0
Testing the VGA(Alphanumeric Mode only)
Printer offline
file open failed for para
In the following example, the system is tested and the system reports a fatal
error message. No network server responded to a loopback message. Ethernet
connectivity on this system should be checked.
>>> test
Requires diskette and loopback connectors on COM2 and parallel port
type kill_diags to halt testing
type show_status to display testing progress
type cat el to redisplay recent errors
Testing COM2 port
Setting up network test, this will take about 20 seconds
Testing the network
*** Error (era0), Mop loop message timed out from: 08-00-2b-3b-42-fd
*** List index: 7 received count: 0 expected count 2
>>>
In the following example, a Model 5/xxx system is tested and tests terminate
after successfully completing one pass of the diagnostics.
Note
Examine the console event log after running tests.
>>> test
Testing the Memory
Testing the DK* Disks(read only)
No DU* Disks available for testing
No DR* Disks available for testing
No MK* Tapes available for testing
No MU* Tapes available for testing
Testing the DV* Floppy Disks(read only)
Testing the VGA (Alphanumeric Mode only)
Testing the EWA0 Network
Testing the EWB0 Network
>>>
3.3.2 sys_exer
The sys_exer command runs firmware diagnostics for the entire core system for
Model 5/xxx systems. The same tests that are run using the test command are
run with sys_exer, only these tests are run concurrently and in the background.
Nothing is displayed unless an error occurs.
Note
The diagnostics started by the sys_exer command require extra memory
resources. The init command must be used to reconfigure memory before
booting an operating system.
Because the sys_exer tests are run concurrently and indefinitely (until you stop
them with the init command), they are useful in flushing out intermittent
hardware problems.
Note
By default, no write tests are performed on disk and tape drives. Media
must be installed to test the floppy drive and tape drives.
Certain memory errors that are reported by the OCP may not be reported
by the ROM-based diagnostics. Always check the power-up/diagnostic
display before running diagnostic commands.
Synopsis:
sys_exer [lb]
Arguments:
[lb]    The loopback option includes console loopback tests for the COM2 serial
        port and the parallel port during the test sequence.
Examples:
>>> sys_exer
Default zone extended at the expense of memzone.
Use INIT before booting
Exercising the Memory
Exercising the DK* Disks(read only)
Exercising the Floppy(read only)
Testing the VGA (Alphanumeric Mode only)
Exercising the EWA0 Network
Exercising the EWB0 Network
Type "init" in order to boot the operating system
Type "show_status" to display testing progress
Type "cat el" to redisplay recent errors
3.3.3 cat el and more el
The cat el and more el commands display the current contents of the console
event log. Status and error messages (if problems occur) are logged to the console
event log at power-up, during normal system operation, and while running system
tests.
Standard error messages are indicated by asterisks (***).
When cat el is used, the contents of the console event log scroll by. You can use
the Ctrl/S combination to stop the screen from scrolling, and Ctrl/Q to resume
scrolling.
The more el command allows you to view the console event log one screen at a
time.
Synopsis:
cat el
or
more el
Examples:
The following example shows an abbreviated console event log that contains a
standard error message. The error message indicates that the keyboard is not
plugged in or is not working.
>>> cat el
*** keyboard not plugged in...
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.
ef.df.ee.f4.ed.ec.eb.ea.e9.e8.e7.e6.port pka0.7.0.6.0 initialized,
scripts are at 4f7faa0
resetting the SCSI bus on pka0.7.0.6.0
port pkb0.7.0.12.0 initialized, scripts are at 4f82be0
resetting the SCSI bus on pkb0.7.0.12.0
e5.e4.e3.e2.e1.e0.
V1.1-1, built on Nov4 1994 at 16:44:07
device dka400.4.0.6.0 (RRD43) found on pka0.4.0.6.0
>>>
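When a console session has been captured to a text file on another machine, the standard error messages can be picked out by their asterisk prefix. The following is a minimal Python sketch under that assumption; the function name standard_errors and the captured sample are illustrative only and are not part of the console firmware.

```python
def standard_errors(event_log_text):
    """Return the lines of a captured console event log that begin with '***'."""
    return [line.strip() for line in event_log_text.splitlines()
            if line.lstrip().startswith("***")]


# Abbreviated sample taken from the cat el example above.
captured = """*** keyboard not plugged in...
ff.fe.fd.fc.fb.fa.f9.f8.f7.f6.f5.
port pkb0.7.0.12.0 initialized, scripts are at 4f82be0
device dka400.4.0.6.0 (RRD43) found on pka0.4.0.6.0
"""

for line in standard_errors(captured):
    print(line)
```

The filter only flags candidate lines; the full event log should still be read for context around each message.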
3.3.4 memexer
The memexer command tests memory by running a specified number of memory
exercisers for Model 5/xxx systems. The exercisers are run in the background
and nothing is displayed unless an error occurs. For each pass, each exerciser
tests all available memory in blocks twice the size of the backup cache.
To terminate the memory tests, use the kill command to terminate an individual
diagnostic or the kill_diags command to terminate all diagnostics. Use the
show_status display to determine the process ID when terminating an individual
diagnostic test.
Synopsis:
memexer [number]
Arguments:
[number]    Number of memory exercisers to start. The default is 1.
The number of exercisers, as well as the length of time for testing,
depends on the context of the testing. Generally, running three to five
exercisers for 15 minutes to 1 hour is sufficient for troubleshooting most
memory problems.
The following is an example with a memory compare error indicating bad SIMMs.
In most cases, the failing bank and SIMM position (Figures 3–1 and 3–2) are
specified in the error message. If the failing SIMM information is not provided,
use the procedure in Section 3.3.5 to isolate a failing SIMM.
>>> memexer 3
*** Hard Error - Error #41 - Memory compare error
Diagnostic Name  ID        Device  Pass  Test  Hard/Soft  11-JUN-1996
memtest          00000193  brd0    114   1     1 0        12:00:01
Expected value: 25c07
Received value: 35c07
Failing addr:   a11848
*** End of Error ***
>>> kill_diags
>>>
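If the error report has been captured as text, the three hexadecimal fields can be extracted for further analysis. The sketch below is illustrative Python, not a console feature; the function name parse_compare_error is hypothetical, and it assumes the field labels appear exactly as in the example above.

```python
import re

FIELDS = ("Expected value", "Received value", "Failing addr")


def parse_compare_error(report):
    """Extract the hexadecimal fields of a memory compare error report."""
    values = {}
    for name in FIELDS:
        match = re.search(re.escape(name) + r":\s*([0-9a-fA-F]+)", report)
        if match:
            values[name] = int(match.group(1), 16)
    return values


report = """*** Hard Error - Error #41 - Memory compare error
Expected value:25c07
Received value:35c07
Failing addr:a11848
*** End of Error ***
"""

fields = parse_compare_error(report)
# XOR of expected and received data shows which bits miscompared.
print(hex(fields["Expected value"] ^ fields["Received value"]))
```

The XOR of the expected and received values isolates the miscompared bits, which is the information needed when matching a failure to a SIMM.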
3.3.5 memory
The memory command tests memory by running a memory exerciser each time the
command is entered. The exercisers are run in the background and nothing is
displayed unless an error occurs.
The number of exercisers, as well as the length of time for testing, depends on the
context of the testing. Generally, running three to five exercisers for 15 minutes
to 1 hour is sufficient for troubleshooting most memory problems.
To terminate the memory tests, use the kill command to terminate an individual
diagnostic or the kill_diags command to terminate all diagnostics. Use the
show_status display to determine the process ID when terminating an individual
diagnostic test.
The following is an example with a memory compare error indicating bad SIMMs.
In most cases, the failing bank and SIMM position (Figures 3–1 and 3–2) are
specified in the error message. If the failing SIMM information is not provided,
use the procedure following the example to isolate a failing SIMM.
>>> memory
>>> memory
>>> memory
*** Hard Error - Error #41 - Memory compare error
Diagnostic Name  ID        Device  Pass  Test  Hard/Soft  11-JUN-1996
memtest          00000193  brd0    114   1     1 0        12:00:01
Expected value: 25c07
Received value: 35c07
Failing addr:   a11848
*** End of Error ***
>>> kill_diags
>>>
To find the failing bank, compare the failing address (a11848 in this example)
with the show memory display or the memory portion of the show config command
display:
1. Banks with no memory present are eliminated as possible failing banks.
2. If the failing address is equal to or greater than the bank starting address,
but less than the starting address for the next bank, then the failing SIMM is
within this bank. In the example, failing address a11848 falls within Bank 0
according to the following memory display.
>>> show memory
Memory
32 Meg of System Memory
Bank 0 = 16 Mbytes (4MB per SIMM) Starting at 0x00000000
Bank 1 = 16 Mbytes (4MB per SIMM) Starting at 0x01000000
Bank 2 = No Memory Detected
Bank 3 = No Memory Detected
>>>
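The bank comparison above can be sketched in Python for use on a captured show memory display. This is an illustration only: the function names parse_banks and failing_bank are hypothetical, and the sketch assumes that "Mbytes" means 2^20 bytes (consistent with Bank 1 starting at 0x01000000) and uses each bank's size, instead of the next bank's starting address, as the upper bound.

```python
import re

BANK_LINE = re.compile(
    r"Bank\s+(\d+)\s*=\s*(\d+)\s*Mbytes.*?Starting at 0x([0-9a-fA-F]+)")


def parse_banks(show_memory_text):
    """Return (bank, start_address, size_in_bytes) for populated banks only."""
    banks = []
    for line in show_memory_text.splitlines():
        m = BANK_LINE.search(line)
        if m:  # "No Memory Detected" lines do not match and are skipped
            bank = int(m.group(1))
            size = int(m.group(2)) * 1024 * 1024
            start = int(m.group(3), 16)
            banks.append((bank, start, size))
    return banks


def failing_bank(failing_addr, banks):
    """Return the bank whose address range contains failing_addr, or None."""
    for bank, start, size in banks:
        if start <= failing_addr < start + size:
            return bank
    return None


display = """32 Meg of System Memory
Bank 0 = 16 Mbytes (4MB per SIMM) Starting at 0x00000000
Bank 1 = 16 Mbytes (4MB per SIMM) Starting at 0x01000000
Bank 2 = No Memory Detected
Bank 3 = No Memory Detected
"""

print(failing_bank(0xA11848, parse_banks(display)))  # prints 0
```

For failing address a11848 the sketch reports Bank 0, matching the manual's example.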
To determine the failing SIMM:
•Model 4/xxx Systems:
Match the least significant nibble of the failing address to the failing SIMM
using the table below.

Failing Address Least
Significant Nibble       Failing SIMM
0                        0
4                        1
8                        2
C                        3

In the example, a11848, the 8 indicates SIMM 2 as the failing SIMM.
•Model 5/xxx Systems:
Match the least significant nibble of the failing address and the bit range in
which the bad data is received to the failing SIMM using the table below.

Failing Address Least    Data Miscompare
Significant Nibble       in Bit Range       Failing SIMM
0 or 8                   bits 15:0          0
0 or 8                   bits 31:16         1
4 or C                   bits 15:0          2
4 or C                   bits 31:16         3

In the example, the least significant nibble of the failing address is 8 (a11848).
The expected data value was 25c07; the received value was 35c07.
The data miscompare occurred in bits 16–19, that is, within bits 31:16; therefore,
the failing SIMM is SIMM 1.
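The two lookup tables can be expressed as a short Python sketch. It is an illustration, not a Digital tool: the function names are hypothetical, the data values are assumed to be 32-bit longwords, and the Model 5/xxx function returns a list because miscompared bits could in principle fall in both bit ranges.

```python
def failing_simm_model4(failing_addr):
    """Model 4/xxx: the least significant nibble (0, 4, 8, C) selects SIMM 0-3."""
    table = {0x0: 0, 0x4: 1, 0x8: 2, 0xC: 3}
    return table.get(failing_addr & 0xF)  # None for a nibble the table omits


def failing_simm_model5(failing_addr, expected, received):
    """Model 5/xxx: nibble plus the bit range of the miscompare select the SIMM."""
    nibble = failing_addr & 0xF
    if nibble in (0x0, 0x8):
        low_simm, high_simm = 0, 1
    elif nibble in (0x4, 0xC):
        low_simm, high_simm = 2, 3
    else:
        return []
    bad_bits = expected ^ received  # set bits mark the miscompared positions
    simms = []
    if bad_bits & 0x0000FFFF:       # miscompare within bits 15:0
        simms.append(low_simm)
    if bad_bits & 0xFFFF0000:       # miscompare within bits 31:16
        simms.append(high_simm)
    return simms


print(failing_simm_model4(0xA11848))                    # prints 2
print(failing_simm_model5(0xA11848, 0x25C07, 0x35C07))  # prints [1]
```

Both results agree with the worked examples above: SIMM 2 on a Model 4/xxx system and SIMM 1 on a Model 5/xxx system.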
Model 4/xxx systems have SROM power-up tests for memory that can report
a failing bank and SIMM. This series of tests is set using the J1 jumper on
the CPU daughter board (Section 2.4).
Figure 3–1 Model 4/xxx: AlphaServer 1000A Memory Layout
[Figure: Banks 0 through 3, each with SIMM 0 through SIMM 3, and one ECC SIMM for each bank.]
Figure 3–2 Model 5/xxx: AlphaServer 1000A Memory Layout
3.3.6 netew
The netew command is used to run MOP loopback tests for any EISA- or PCI-based
ew* (DECchip 21040, TULIP) Ethernet ports. The command can also be used to
test a port on a ‘‘live’’ network.
The loopback tests are set to run continuously (-p pass_count set to 0). Use the
kill command (or Ctrl/C) to terminate an individual diagnostic or the kill_diags
command to terminate all diagnostics. Use the show_status display to determine
the process ID when terminating an individual diagnostic test.
Note
While some results of network tests are reported directly to the console,
you should examine the console event log (using the cat el or more el
commands) for complete test results.
Synopsis:
netew
When the netew command is entered, the following script is executed:
net -sa ew*0>ndbr/lp_nodes_ew*0
set ew*0_loop_count 2 2>nl
set ew*0_loop_inc 1 2>nl
set ew*0_loop_patt ffffffff 2>nl
set ew*0_loop_size 10 2>nl
set ew*0_lp_msg_node 1 2>nl
net -cm ex ew*0
echo "Testing the network"
nettest ew*0 -sv 3 -mode nc -p 0 -w 1 &
The script builds a list of nodes to which to send MOP loopback packets, sets
certain test environment variables, and tests the Ethernet port by using the
following variation of the nettest exerciser:
nettest ew*0 -sv 3 -mode nc -p 0 -w 1 &
The network command is used to run MOP loopback tests for any EISA- or PCI-based
er* (DEC 4220, LANCE) Ethernet ports. The command can also be used to
test a port on a ‘‘live’’ network.
The loopback tests are set to run continuously (-p pass_count set to 0). Use the
kill command (or Ctrl/C) to terminate an individual diagnostic or the kill_diags
command to terminate all diagnostics. Use the show_status display to determine
the process ID when terminating an individual diagnostic test.
Note
While some results of network tests are reported directly to the console,
you should examine the console event log (using the cat el or more el
commands) for complete test results.
Synopsis:
network
When the network command is entered, the following script is executed:
echo "setting up the network test, this will take about 20 seconds"
net -stop er*0
net -sa er*0>ndbr/lp_nodes_er*0
net ic er*0
set er*0_loop_count 2 2>nl
set er*0_loop_inc 1 2>nl
set er*0_loop_patt ffffffff 2>nl
set er*0_loop_size 10 2>nl
set er*0_lp_msg_node 1 2>nl
set er*0_mode 44 2>nl
net -start er*0
echo "Testing the network"
nettest er*0 -sv 3 -mode nc -p 0 -w 1 &
The script builds a list of nodes to which to send MOP loopback packets, sets
certain test environment variables, and tests the Ethernet port by using the
following variation of the nettest exerciser:
nettest er*0 -sv 3 -mode nc -p 0 -w 1 &
Error count (hard and soft): Soft errors are not usually fatal; hard errors halt
the system or prevent completion of the diagnostics.
Bytes successfully written by diagnostic
Bytes successfully read by diagnostic
3.4 Acceptance Testing and Initialization
Perform the acceptance testing procedure listed below after installing a system or
whenever adding or replacing the following:
•Memory modules
•Motherboard
•CPU daughter board
•Storage devices
•EISA or PCI options
1. Run the RBD acceptance tests using the test or sys_exer command.
2. If you have added or moved an EISA option or some ISA options, run the
EISA Configuration Utility (ECU).
3. Bring up the operating system.
4. Run DEC VET to test that the operating system is correctly installed. Refer
to Section 3.5 for information on DEC VET.
3.5 DEC VET
Digital’s DEC Verifier and Exerciser Tool (DEC VET) software is a multipurpose
system maintenance tool that performs exerciser-oriented maintenance testing.
DEC VET runs on Digital UNIX, OpenVMS, and Windows NT operating systems.
DEC VET consists of a manager and exercisers. The DEC VET manager controls
the exercisers. The exercisers test system hardware and the operating system.
DEC VET supports various exerciser configurations, ranging from a single device
exerciser to full system loading, that is, simultaneous exercising of multiple
devices.
Refer to the DEC Verifier and Exerciser Tool User’s Guide (AA–PTTMD–TE) for
instructions on running DEC VET.
4
Error Log Analysis
This chapter provides information on how to interpret error logs reported by the
operating system.
•Section 4.1 describes machine check/interrupts and how these errors are
detected and reported.
•Section 4.2 describes the entry format used by the error formatters.
•Section 4.3 describes how to generate a formatted error log using the
DECevent Translation and Reporting Utility available with OpenVMS and
Digital UNIX.
4.1 Fault Detection and Reporting
Table 4–1 provides a summary of the fault detection and correction components of
AlphaServer 1000A systems.
Generally, PALcode handles exceptions as follows:
•The PALcode determines the cause of the exception.
•If possible, it corrects the problem and passes control to the operating system
for reporting before returning the system to normal operation.
•If error/event logging is required, control is passed through the system control
block (SCB) to the appropriate exception handler.
Table 4–1 AlphaServer 1000 Fault Detection and Correction

21064(A) microprocessor    Contains error detection and correction (EDC) logic for
                           data cycles. There are check bits associated with all data
                           entering and exiting the 21064(A) microprocessor. A
                           single-bit error on any of the four longwords being read
                           can be corrected (per cycle). A double-bit error on any of
                           the four longwords being read can be detected (per cycle).
Backup cache (B-cache)     EDC check bits on the data store, and parity on the tag
                           address store and tag control store.

Memory Subsystem
Memory SIMMs               EDC logic protects data by detecting and correcting data
                           cycle errors. A single-bit error on any of the four
                           longwords can be corrected (per cycle). A double-bit error
                           on any of the four longwords being read can be detected
                           (per cycle).

System Motherboard
SCSI Controller            SCSI data parity is generated.
EISA-to-PCI bridge chip    PCI data parity is generated.
PCI-to-PCI bridge chip     PCI data parity is generated.
4.1.1 Machine Check/Interrupts
The exceptions that result from hardware system errors are called machine
check/interrupts. They occur when a system error is detected during the
processing of a data request. There are four types of machine check/interrupts
related to system events:
1. Processor machine check (SCB 670)
2. System machine check (SCB 660)
3. Processor-corrected machine check (SCB 630)
4. System-corrected machine check (SCB 620)
During the error handling process, errors are first handled by the appropriate
PALcode error routine and then by the associated operating system error handler.
The causes of each of the machine check/interrupts are as follows. The system
control block (SCB) vector through which PALcode transfers control to the
operating system is shown in parentheses.
Processor Machine Check (SCB: 670)
Processor machine check errors are fatal system errors that result in a system
crash. The error handling code for these errors is common across all platforms
using the DECchip 21064, 21064A, and 21164 microprocessors.
•The DECchip 21064, 21064A, or 21164 microprocessor detected one or more of
the following uncorrectable data errors:
–Uncorrectable B-cache data error
–Uncorrectable memory data error
•A B-cache tag or tag control parity error occurred
System Machine Check (SCB: 660)
A system machine check is a system-detected error, external to the DECchip
21064, 21064A, or 21164 microprocessor and possibly not related to the activities
of the CPU. These errors are specific to AlphaServer 1000A systems.
Fatal errors:
•System overtemperature failure
•System complete power supply failure
The power supply number is called out in the register: power supply 1 is the
bottom supply; power supply 2 is the top supply.
•System fan failure
•I/O read/write retry timeout
•DMA data parity error
•I/O data parity error
•Slave abort PCI transaction
•DEVSEL not asserted
•Uncorrectable read error
•Invalid page table lookup (scatter gather)
•Memory cycle error
•B-cache tag address parity error
•B-cache tag control parity error
•Non-existent memory error
•ESC NMI: IOCHK
Processor-Corrected Machine Check (SCB: 630)
Processor-corrected machine checks are caused by B-cache errors that are
detected and corrected by the DECchip 21064, 21064A, or 21164 microprocessor.
These are nonfatal errors that result in an error log entry. The error handling
code for these errors is common across all platforms using the DECchip 21064,
21064A, and 21164 microprocessors.
•Single-bit Istream ECC error
•Single-bit Dstream ECC error
•System transaction terminated with CACK_SERR
System-Corrected Machine Check (SCB: 620)
These nonfatal errors are AlphaServer 1000A-specific correctable errors. They
result in the generation of the correctable machine check logout frame:
•Correctable read errors
•Single power supply failure when operating with redundant power supplies.
•System overtemperature warning
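The four machine check types can be summarized as a lookup table. The Python sketch below is illustrative only: the names MACHINE_CHECKS and describe_machine_check are hypothetical, the SCB vectors are assumed to be hexadecimal, and the system machine check (660) is marked fatal because this section lists only fatal causes for it.

```python
# SCB vector -> (machine check type, fatal?)
MACHINE_CHECKS = {
    0x670: ("Processor machine check", True),
    0x660: ("System machine check", True),
    0x630: ("Processor-corrected machine check", False),
    0x620: ("System-corrected machine check", False),
}


def describe_machine_check(scb_vector):
    """Return a one-line summary for a machine check SCB vector."""
    name, fatal = MACHINE_CHECKS[scb_vector]
    severity = "fatal" if fatal else "nonfatal, logged"
    return f"SCB {scb_vector:03X}: {name} ({severity})"


for vector in sorted(MACHINE_CHECKS, reverse=True):
    print(describe_machine_check(vector))
```

A table like this is convenient when sorting translated error log entries by severity before examining individual frames.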
4.2 Error Logging and Event Log Entry Format
The Digital UNIX and OpenVMS error handlers can generate several entry types.
All error entries, with the exception of correctable memory errors, are logged
immediately. Entries can be of variable length based on the number of registers
within the entry.
Each entry consists of an operating system header, several device frames, and an
end frame. Most entries have a PAL-generated logout frame, and may contain
frames for CPU, memory, and I/O.
4.3 Event Record Translation
Systems running Digital UNIX and OpenVMS operating systems use the
DECevent management utility to translate events into ASCII reports derived
from system event entries (bit-to-text translations).
The DECevent utility has the following features relating to the translation of
events:
•Translating event log entries into readable reports
•Selecting input and output sources
•Filtering input events
•Selecting alternate reports
•Translating events as they occur
•Maintaining and customizing the user environment with the interactive shell
commands
Note
Microsoft Windows NT does not currently provide bit-to-text translation
of system errors.
•Section 4.3.1 summarizes the command used to translate the error log
information for the OpenVMS operating system using DECevent.
•Section 4.3.2 summarizes the command used to translate the error log
information for the Digital UNIX operating system using DECevent.
4.3.1 OpenVMS Alpha Translation Using DECevent
The kernel error log entries are translated from binary to ASCII using the
DIAGNOSE command. To invoke the DECevent utility, enter the DCL command
DIAGNOSE.
For more information on generating error log reports using DECevent, refer
to DECevent Translation and Reporting Utility for OpenVMS Alpha, User and
Reference Guide, AA-Q73KC-TE.
System faults can be isolated by examining translated system error logs or
using the DECevent Analysis and Notification Utility. Refer to the DECevent
Analysis and Notification Utility for OpenVMS Alpha, User and Reference Guide,
AA-Q73LC-TE, for more information.
4.3.2 Digital UNIX Translation Using DECevent
The kernel error log entries are translated from binary to ASCII using the dia
command. To invoke the DECevent utility, enter the dia command.
Format:
dia [-a -f infile[...]]
Example:
% dia -t s:14-jun-1995:10:00
For more information on generating error log reports using DECevent, refer to
DECevent Translation and Reporting Utility for Digital UNIX, User and Reference
Guide, AA-QAA3-TE.
System faults can be isolated by examining translated system error logs or
using the DECevent Analysis and Notification Utility. Refer to the DECevent
Analysis and Notification Utility for Digital UNIX, User and Reference Guide,
AA-QAA4A-TE, for more information.
5
System Configuration and Setup
This chapter provides configuration and setup information for AlphaServer 1000A
systems and system options.
•Section 5.1 describes how to examine the system configuration using the
console firmware.
–Section 5.1.1 describes the function of the two firmware interfaces used
with AlphaServer 1000A systems.
–Section 5.1.2 describes how to switch between firmware interfaces.
–Sections 5.1.3 and 5.1.4 describe the commands used to examine system
configuration for each firmware interface.
•Section 5.2 describes the system bus configuration.
•Section 5.3 describes the motherboard.
•Section 5.4 describes the EISA bus.
•Section 5.5 describes how ISA options are compatible on the EISA bus.
•Section 5.6 describes the EISA configuration utility (ECU).
•Section 5.7 describes the PCI bus.
•Section 5.8 describes SCSI buses and configurations.
•Section 5.9 describes power supply configurations.
•Section 5.10 describes the console port configurations.
5.1 Verifying System Configuration
Figures 5–1 and 5–2 illustrate the system architecture for AlphaServer 1000A
systems.
Figure 5–1 System Architecture: AlphaServer 1000A Model 4/xxx Systems
[Block diagram. Labels: CPU Card (21064, Bcache 2MB, Memory 16MB-1GB, SROM, Comanche, Decade, Epic); Primary PCI Bus; PCI-PCI Bridge; Secondary PCI Bus; PCI Slots; PCI-EISA Bridge; EISA Bus; EISA Slots; QLOGIC ISP1020A, Fast-Wide SCSI Bus; SVGA Cirrus 5428; X-Bus; Buffers; TOY; Flash ROM (1MB); NS 87332; 8242 Keybd & Mouse; EISA Config RAM; OCP; Keyboard; Mouse; Serial Ports; Floppy Port; Parallel Port. MA00946]
Figure 5–2 System Architecture: AlphaServer 1000A Model 5/xxx Systems
[Block diagram. Labels: CPU Card (21164, Bcache 2MB, Memory 16MB-1GB, SROM, DSW, CIA); Primary PCI Bus; PCI-PCI Bridge; Secondary PCI Bus; PCI Slots; PCI-EISA Bridge; EISA Bus; EISA Slots; QLOGIC ISP1020A, Fast-Wide SCSI Bus; SVGA Cirrus 5428; X-Bus; Buffers; TOY; Flash ROM (1MB); NS 87332; 8242 Keybd & Mouse; EISA Config RAM; OCP; Keyboard; Mouse; Serial Ports; Floppy Port; Parallel Port. MLO-013494]
5.1.1 System Firmware
The system firmware currently provides support for the following operating
systems:
•Digital UNIX and OpenVMS Alpha are supported under the SRM command
line interface, which can be serial or graphical. The SRM firmware is in
compliance with the Alpha System Reference Manual (SRM).
•For Model 4/xxx systems, Windows NT is supported under the ARC menu
interface, which is graphical. The ARC firmware is in compliance with the
Advanced RISC Computing Standard Specification (ARC).
•For Model 5/xxx systems, Windows NT is supported under the AlphaBIOS
console. Refer to the AlphaServer 1000/1000A Model 5/xxx Owner’s Guide
Supplement.
The console firmware provides the data structures and callbacks available to
booted programs defined in the SRM, ARC, and AlphaBIOS standards.
SRM Command Line Interface
Systems running Digital UNIX or OpenVMS access the SRM firmware through a
command line interface (CLI). The CLI is a UNIX style shell that provides a set
of commands and operators, as well as a scripting facility. The CLI allows you
to configure and test the system, examine and alter system state, and boot the
operating system.
The SRM console prompt is >>>.
Several system management tasks can be performed only from the SRM console
command line interface:
•All console test and reporting commands are run from the SRM console.
•Certain environment variables are changed using the SRM set command.
For example:
To run the ECU, you must enter the ecu command. This command will boot the
ARC firmware and the ECU software, or in the case of AlphaBIOS, will boot the
AlphaBIOS firmware.
ARC and AlphaBIOS Menu Interface
Systems running Windows NT access the ARC or AlphaBIOS console firmware
through menus that are used to configure and boot the system, run the EISA
Configuration Utility (ECU), run the RAID Configuration Utility (RCU), adapter
configuration utility, or set environment variables.
•You must run the EISA Configuration Utility (ECU) whenever you add,
remove, or move an EISA or ISA option in your AlphaServer system. The
ECU is run from diskette. Two diskettes are supplied with your system
shipment, one for Digital UNIX and OpenVMS and one for Windows NT. For
more information about running the ECU, refer to Section 5.6.
•If you purchased a StorageWorks RAID Array 200 Subsystem for your server,
you must run the RAID Configuration Utility (RCU) to set up the disk
drives and logical units. Refer to StorageWorks RAID Array 200 Subsystems
Controller Installation and Standalone Configuration Utility User’s Guide,
included in your RAID kit.