DEC 4000 AXP Service Manual

DEC4000AXP ServiceGuide
Order Number: EK–KN430–SV. B01
Digital Equipment Corporation Maynard, Massachusetts
Revised, July 1993 First Printing, December 1992
Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.
The software, if any, described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license. No responsibility is assumed for the use or reliability of software or equipment that is not supplied by Digital Equipment Corporation or its affiliated companies.
Copyright © Digital Equipment Corporation, 1992. All Rights Reserved. The Reader’s Comments form at the end of this document requests your critical evaluation to assist
in preparing future documentation. The following are trademarks of Digital Equipment Corporation: Alpha AXP, AXP, DEC, DECchip,
DECconnect, DECdirect, DECnet, DECserver, DEC VET, DESTA, MSCP, RRD40, ThinWire, TMSCP, TU, UETP, ULTRIX, VAX, VAX DOCUMENT, VAXcluster, VMS, the AXP logo, and the DIGITAL logo.
OSF/1 is a registered trademark of Open Software Foundation, Inc. All other trademarks and registered trademarks are the property of their respective holders. FCC NOTICE: The equipment described in this manual generates, uses, and may emit radio
frequency energy. The equipment has been type tested and found to comply with the limits for a Class A computing device pursuant to Subpart J of Part 15 of FCC Rules, which are designed to provide reasonable protection against such radio frequency interference when operated in a commercial environment. Operation of this equipment in a residential area may cause interference, in which case the user at his own expense may be required to take measures to correct the interference.
This document was prepared using VAX DOCUMENT, Version 2.1.
S2384
Contents
Preface ................................................ xiii
1 System Maintenance Strategy
1.1 Troubleshooting the System . ....................... 1–1
1.2 Service Delivery Methodology ...................... 1–7
1.3 Product Service Tools and Utilities . . . ............... 1–8
1.4 Information Services ............................. 1–11
1.5 Field Feedback . . . ............................... 1–12
2 Power-On Diagnostics and System LEDs
2.1 Interpreting System LEDs . . ....................... 2–1
2.1.1 Power Supply LEDs ........................... 2–2
2.1.2 Operator Control Panel LEDs ................... 2–7
2.1.3 I/O Panel LEDs .............................. 2–9
2.1.4 Futurebus+ Option LEDs ....................... 2–11
2.1.5 Storage Device LEDs . . . ....................... 2–12
2.2 Power-Up Screens ............................... 2–15
2.2.1 Console Event Log ............................ 2–17
2.2.2 Mass Storage Problems Indicated at Power-Up ...... 2–18
2.2.3 Robust Mode Power-Up . ....................... 2–26
2.3 Power-Up Sequence .............................. 2–27
2.3.1 AC Power-Up Sequence . ....................... 2–27
2.3.2 DC Power-Up Sequence . ....................... 2–29
2.3.3 Firmware Power-Up Diagnostics . . ............... 2–32
2.3.3.1 Serial ROM Diagnostics ..................... 2–32
2.3.3.2 Console Firmware-Based Diagnostics........... 2–33
2.4 Boot Sequence . . . ............................... 2–33
2.4.1 Cold Bootstrapping in a Uniprocessor Environment . . 2–34
2.4.2 Loading of System Software ..................... 2–35
2.4.3 Warm Bootstrapping in a Uniprocessor
Environment . ............................... 2–36
v
2.4.4 Multiprocessor Bootstrapping ................... 2–37
2.4.5 Boot Devices . . ............................... 2–37
3 Running System Diagnostics
3.1 Running ROM-Based Diagnostics ................... 3–1
3.1.1 test . ....................................... 3–3
3.1.2 show fru .................................... 3–5
3.1.3 show_status . . ............................... 3–7
3.1.4 show error . . . ............................... 3–8
3.1.5 memexer ................................... 3–10
3.1.6 memexer_mp . ............................... 3–11
3.1.7 exer_read ................................... 3–12
3.1.8 exer_write . . . ............................... 3–14
3.1.9 fbus_diag ................................... 3–16
3.1.10 show_mop_counter ............................ 3–18
3.1.11 clear_mop_counter ............................ 3–19
3.1.12 Loopback Tests............................... 3–20
3.1.12.1 Testing the Auxiliary Console Port (exer) . . ...... 3–20
3.1.12.2 Testing the Ethernet Ports (netexer) ........... 3–20
3.1.13 kill and kill_diags ............................ 3–21
3.1.14 Summary of Diagnostic and Related Commands ..... 3–21
3.2 DSSI Device Internal Tests . ....................... 3–22
3.3 DECVET...................................... 3–25
3.4 Running UETP . . ............................... 3–26
3.4.1 Summary of UETP Operating Instructions . . . ...... 3–26
3.4.2 System Disk Requirements ..................... 3–28
3.4.3 Preparing Additional Disks ..................... 3–28
3.4.4 Preparing Magnetic Tape Drives . . ............... 3–29
3.4.5 Preparing Tape Cartridge Drives . . ............... 3–29
3.4.5.1 TLZ06 Tape Drives. . ....................... 3–30
3.4.6 Preparing RRD42 Compact Disc Drives ............ 3–30
3.4.7 Preparing Terminals and Line Printers ............ 3–30
3.4.8 Preparing Ethernet Adapters .................... 3–30
3.4.9 DECnet for OpenVMS AXP Phase . ............... 3–31
3.4.10 Termination of UETP . . . ....................... 3–32
3.4.11 Interpreting UETP VMS Failures . ............... 3–32
3.4.12 Interpreting UETP Output ..................... 3–32
3.4.12.1 UETP Log Files ........................... 3–33
3.4.12.2 Possible UETP Errors ...................... 3–33
3.5 Acceptance Testing and Initialization. . ............... 3–34
vi
4 Error Log Analysis
4.1 Fault Detection and Reporting ...................... 4–1
4.1.1 Machine Check/Interrupts ...................... 4–2
4.1.2 System Bus Transaction Cycle ................... 4–4
4.2 Error Logging and Event Log Entry Format ........... 4–4
4.3 Event Record Translation. . . ....................... 4–6
4.3.1 OpenVMS AXP Translation ..................... 4–6
4.3.2 DEC OSF/1 Translation . ....................... 4–7
4.4 Interpreting System Faults Using ERF and UERF ...... 4–7
4.4.1 Note 1: System Bus Address Cycle Failures . . ...... 4–12
4.4.2 Note 2: System Bus Write-Data Cycle Failures ...... 4–13
4.4.3 Note 3: System Bus Read Parity Error ............ 4–14
4.4.4 Note 4: Backup Cache Uncorrectable Error . . . ...... 4–14
4.4.5 Note 5: Data Delivered to I/O Is Known Bad. . ...... 4–15
4.4.6 Note 6: Futurebus+ DMA Parity Error ............ 4–15
4.4.7 Note 7: Futurebus+ Mailbox Access Parity Error .... 4–16
4.4.8 Note 8: Multi-Event Analysis of Command/Address
Parity, Write-Data Parity, or Read-Data Parity
Errors ..................................... 4–16
4.4.9 Sample System Error Report (ERF) ............... 4–16
4.4.10 Sample System Error Report (UERF) ............. 4–18
5 Repairing the System
5.1 General Guidelines for FRU Removal and Replacement . . 5–1
5.2 Front FRUs .................................... 5–4
5.2.1 Operator Control Panel . ....................... 5–4
5.2.2 Vterm Module ............................... 5–4
5.2.3 Fixed-Media Storage . . . ....................... 5–4
5.2.3.1 3.5-Inch Fast-SCSI Disk Drives (RZ26, RZ27,
RZ35) ................................... 5–4
5.2.3.2 3.5-Inch SCSI Disk Drives ................... 5–5
5.2.3.3 5.25-Inch SCSI Disk Drive ................... 5–6
5.2.3.4 SCSI Storageless Tray Assembly .............. 5–6
5.2.3.5 3.5-Inch DSSI Disk Drive .................... 5–7
5.2.3.6 5.25-Inch DSSI Disk Drive ................... 5–7
5.2.3.7 DSSI Storageless Tray Assembly .............. 5–8
5.2.4 Removable-Media Storage (Tape and Compact
Disc) ....................................... 5–8
5.2.4.1 SCSI Bulkhead Connector ................... 5–8
5.2.4.2 SCSI Continuity Card ...................... 5–8
5.2.5 Fans ....................................... 5–9
vii
5.3 Rear FRUs ..................................... 5–16
5.3.1 Modules (CPU, Memory, I/O, Futurebus+) .......... 5–16
5.3.2 Ethernet Fuses .............................. 5–17
5.3.3 Power Supply . ............................... 5–17
5.3.4 Fans ....................................... 5–17
5.4 Backplane ..................................... 5–20
5.5 Repair Data for Returning FRUs .................... 5–22
6 System Configuration and Setup
6.1 Functional Description ............................ 6–1
6.1.1 System Bus . . ............................... 6–7
6.1.1.1 KN430 CPU .............................. 6–7
6.1.1.2 Memory . . ............................... 6–10
6.1.1.3 I/O Module ............................... 6–13
6.1.2 Serial Control Bus ............................ 6–15
6.1.3 Futurebus+ . . ............................... 6–16
6.1.4 Power Subsystem ............................. 6–17
6.1.5 Mass Storage . ............................... 6–19
6.1.5.1 Fixed-Media Compartments . . . ............... 6–19
6.1.5.2 Removable-Media Storage Compartment . . ...... 6–21
6.1.6 System Expansion ............................ 6–23
6.1.6.1 Power Control Bus for Expanded Systems . ...... 6–23
6.2 Examining System Configuration.................... 6–25
6.2.1 show config . . . ............................... 6–25
6.2.2 show device . . ............................... 6–26
6.2.3 show memory . ............................... 6–29
6.3 Setting and Showing Environment Variables ........... 6–29
6.4 Setting and Examining Parameters for DSSI Devices .... 6–33
6.4.1 show device du pu ............................ 6–33
6.4.2 cdp........................................ 6–34
6.4.3 DSSI Device Parameters: Definitions and Function. . 6–36
6.4.3.1 How OpenVMS AXP Uses the DSSI Device
Parameters .............................. 6–38
6.4.3.2 Example: Modifying DSSI Device Parameters .... 6–39
6.5 Console Port Baud Rate ........................... 6–41
6.5.1 Console Serial Port ........................... 6–42
6.5.2 Auxiliary Serial Port . . . ....................... 6–44
viii
A Environment Variables
B Power System Controller Fault Displays
C Worksheet for Recording Customer Environment
Variable Settings
Glossary
Index
Examples
3–1 Running DRVTST ............................ 3–24
3–2 Running DRVEXR ............................ 3–25
4–1 ERF-Generated Error Log Entry Indicating CPU
Corrected Error .............................. 4–17
4–2 UERF-Generated Error Log Entry Indicating CPU
Error ...................................... 4–18
Figures
2–1 Power Supply LEDs ........................... 2–3
2–2 LDC and Fan Unit Locations and Error Codes ...... 2–6
2–3 OCP LEDs . . . ............................... 2–7
2–4 Module Locations Corresponding to OCP LEDs ...... 2–9
2–5 I/O Panel LEDs .............................. 2–10
2–6 Futurebus+ Option LEDs ....................... 2–11
2–7 Fixed-Media Mass Storage LEDs (SCSI) ........... 2–13
2–8 Fixed-Media Mass Storage LEDs (DSSI) ........... 2–14
2–9 Power-Up Self-Test Screen ...................... 2–16
2–10 Sample Power-Up Configuration Screen............ 2–17
2–11 Flowchart for Troubleshooting Fixed-Media
Problems ................................... 2–19
2–12 Flowchart for Troubleshooting Fixed-Media Problems
(Continued) . . ............................... 2–20
ix
2–13 Flowchart for Troubleshooting Removable-Media
Problems ................................... 2–23
2–14 Flowchart for Troubleshooting Removable-Media
Problems (Continued) . . ....................... 2–24
2–15 AC Power-Up Sequence . ....................... 2–28
2–16 DC Power-Up Sequence . ....................... 2–30
2–17 DC Power-Up Sequence (Continued) .............. 2–31
4–1 ERF/UERF Error Log Format ................... 4–5
5–1 SCSI Continuity Card Placement . . ............... 5–9
5–2 Front FRUs . . ............................... 5–10
5–3 Storage Compartment with Four 3.5-inch Fast-SCSI
Drives (RZ26, RZ27, RZ35)...................... 5–11
5–4 Storage Compartment with Four 3.5-inch SCSI/DSSI
Drives ..................................... 5–12
5–5 3.5-Inch SCSI Drive Resistor Packs and Power
Termination Jumpers . . . ....................... 5–13
5–6 Position of Drives in Relation to Bus Node ID
Numbers ................................... 5–14
5–7 Storage Compartment with One 5.25-inch SCSI/DSSI
Drive ...................................... 5–15
5–8 Rear FRUs . . . ............................... 5–18
5–9 Ethernet Fuses and Ethernet Address ROMs . ...... 5–19
5–10 Removing Shell .............................. 5–21
5–11 Removing Backplane . . . ....................... 5–22
6–1 System Block Diagram . . ....................... 6–3
6–2 System Backplane ............................ 6–4
6–3 BA640 Enclosure (Front) ....................... 6–5
6–4 BA640 Enclosure (Rear) . ....................... 6–6
6–5 CPU Block Diagram ........................... 6–8
6–6 MS430 Memory Block Diagram . . . ............... 6–12
6–7 I/O Module Block Diagram ...................... 6–14
6–8 Serial Control Bus EEPROM Interaction ........... 6–16
6–9 Power Subsystem Block Diagram . . ............... 6–18
6–10 Fixed-Media Storage . . . ....................... 6–20
6–11 Removable-Media Storage ...................... 6–22
6–12 Sample Power Bus Configuration . . ............... 6–24
6–13 Device Name Convention ....................... 6–27
x
6–14 How OpenVMS Sees Unit Numbers for DSSI
Devices ..................................... 6–39
6–15 Sample DSSI Buses for an Expanded DEC 4000 AXP
System ..................................... 6–41
6–16 Console Baud Rate Select Switch . . ............... 6–43
Tables
1–1 Recommended Troubleshooting Procedures . . . ...... 1–2
1–2 Diagnostic Flow for Power Problems .............. 1–5
1–3 Diagnostic Flow for Problems Getting to Console
Mode ...................................... 1–5
1–4 Diagnostic Flow for Problems Reported by the Console
Program .................................... 1–6
1–5 Diagnostic Flow for Boot Problems ............... 1–6
1–6 Diagnostic Flow for Errors Reported by the Operating
System ..................................... 1–7
2–1 Interpreting Power Supply LEDs . . ............... 2–4
2–2 Interpreting OCP LEDs . ....................... 2–8
2–3 Interpreting I/O Panel LEDs .................... 2–10
2–4 Interpreting Futurebus+ Option LEDs ............. 2–12
2–5 Interpreting Fixed-Media Mass Storage LEDs . ...... 2–14
2–6 Fixed-Media Mass Storage Problems .............. 2–21
2–7 Removable-Media Mass Storage Problems .......... 2–25
2–8 Supported Boot Devices . ....................... 2–37
3–1 Summary of Diagnostic and Related Commands ..... 3–21
4–1 DEC 4000 AXP Fault Detection and Correction ...... 4–2
4–2 Error Field Bit Definitions for Error Log
Interpretation ............................... 4–8
6–1 Memory Features ............................. 6–11
6–2 Power Control Bus ............................ 6–24
6–3 Environment Variables Set During System
Configuration . ............................... 6–30
6–4 Console Line Baud Rates ....................... 6–43
A–1 Environment Variables ....................... A–1
B–1 Power System Controller Fault ID Display . . . ...... B–1
C–1 Nonvolatile Environment Variables ............... C–1
xi
Preface
This guide describes the procedures and tests used to service DEC 4000 AXP systems.
Intended Audience
This guide is intended for use by Digital Equipment Corporation service personnel and qualified self-maintenance customers.
Conventions
The following coventions are used in this guide.
Convention Meaning
Return
Ctrl/x Ctrl/x indicates that you hold down the Ctrl key while you
bold type In the online book (Bookreader), bold type in examples
lowercase Lowercase letters in commands indicate that commands can be
A key name enclosed in a box indicates that you press that key.
press another key, indicated here by x. In examples, this key combination is enclosed in a box, for example,
indicates commands and other instructions that you enter at the keyboard.
entered in uppercase or lowercase.
Ctrl/C
.
xiii
In some illustrations, small drawings of the DEC 4000 AXP system appear in the left margin. Shaded areas help you locate
components on the front or back of the system. Warning Warnings contain information to prevent personal injury. Caution Cautions provide information to prevent damage to equipment
[]
console command abbreviations
boot
italic type Italic type in console command sections indicates a variable. < > In console mode online help, angle brackets enclose a
{ } In command descriptions, braces containing items separated by
or software.
In command format descriptions, brackets indicate optional
elements.
Console command abbreviations must be entered exactly as
shown.
Console and operating system commands are shown in this
special typeface.
placeholder for which you must specify a value.
commas imply mutually exclusive items.
xiv
1
System Maintenance Strategy
Any successful maintenance strategy is based on the proper understanding and use of information services, service tools, service support and escalation procedures, field feedback, and troubleshooting procedures. This chapter describes the maintenance strategy for the DEC 4000 AXP system.
Section 1.1 provides a diagnostic strategy you should use to troubleshoot a
DEC 4000 AXP system.
Section 1.2 explains the service delivery methodology.
Section 1.3 lists the product tools and utilities.
Section 1.4 lists available information services.
Section 1.5 describes field feedback procedures.
1.1 Troubleshooting the System
Before troubleshooting any system problem, check the site maintenance log for the system’s service history. Be sure to ask the system manager the following questions:
Has the system been used before and did it work correctly?
Have changes to hardware or updates to firmware or software been made to
the system recently?
What is the state of the system—is the operating system up?
If the operating system is down and you are not able to bring it up, use the console environment diagnostic tools, such as RBDs and LEDs.
If the operating system is up, use the operating system environment diagnostic tools, such as error logs, crash dumps, DEC VET and UETP exercisers, and other log files.
System Maintenance Strategy 1–1
System problems can be classified into the following five categories:
1. Power problems
2. Problems getting to the console
3. Failures reported by the console subsystem
4. Boot failures
5. Failures reported by the operating system Using these categories, you can quickly determine a starting point for diagnosis
and eliminate the unlikely sources of the problem. Table 1–1 provides the recommended tools or resources you should use to isolate problems in each category.
Table 1–1 Recommended Troubleshooting Procedures
Description
1. Power Problems (Table 1–2)
Diagnostic Tools/Resources Reference
No power at system enclosure or trouble with power supply subsystem, as indicated by LEDs.
2. Problems Getting to Console Mode (Table 1–3)
System powers up, but does not display power-up screen.
Power supply subsystem LEDs
OCP LEDs Refer to Section 2.1.2 for information on
Console terminal troubleshooting flow
Power-up sequence description
Robust mode power-up
Refer to Section 2.1.1 for information on interpreting power supply LEDs.
interpreting OCP LEDs.
Refer to Table 1–3 for information on troubleshooting console terminal problems.
Refer to Section 2.3 and 2.3.3 for a description of the power-up and self-test sequence.
Refer to Section 2.2.3 for a description of robust mode power-up and its functions.
(continued on next page)
1–2 System Maintenance Strategy
Table 1–1 (Cont.) Recommended Troubleshooting Procedures
Description
3. Failures Reported by the Console Program (Table 1–4)
Diagnostic Tools/Resources Reference
Power-up console screens indicate a failure.
4. Boot Failures (Table 1–5)
System fails to boot operating system.
Power-up screens
Console event log
RBD device tests
Console commands (to examine environment variables and device parameters)
Storage device troubleshooting flowcharts
RBD device tests
Boot sequence description
Refer to Section 2.2 for information on interpreting power-up self-tests.
Refer to Section 2.2 for information on the console event log.
Refer to Section 3.1 for information on running RBD device tests.
Refer to Chapter 6 for instructions on setting and examining environment variables and device parameters.
Refer to Section 2.2.2.
Refer to Section 3.1 for information on running RBD device tests.
Refer to Section 2.4 for a description of the boot sequence.
(continued on next page)
System Maintenance Strategy 1–3
Table 1–1 (Cont.) Recommended Troubleshooting Procedures
Description
5. Failures Reported by the Operating System (Table 1–6)
Diagnostic Tools/Resources Reference
Operating system generates error logs; process hangs or operating system crashes.
Error logs Refer to Chapter 4 for information on
Crash dump Refer to OpenVMS AXP Alpha System
DEC VET or UETP
Other log files Refer to Chapter 4 for information on
interpreting error logs.
Dump Analyzer Utility Manual for information on how to interpret OpenVMS crash dump files.
Refer to the Guide to Kernel Debugging (AA–PS2TA–TE) for information on using the DEC OSF/1 Krash Utility.
Refer to Section 3.3 for a description of DEC VET, and Section 3.4 for information on running UETP software exercisers.
using log files such as SETHOST.LOG and OPERATOR.LOG to aid in troubleshooting.
Use the following tables to identify the diagnostic flow for the five types of system problems:
Table 1–2 provides the diagnostic flow for power problems.
Table 1–3 provides the diagnostic flow for problems getting to console mode.
Table 1–4 provides the diagnostic flow for problems reported by the console
program.
Table 1–5 provides the diagnostic flow for boot problems.
Table 1–6 provides the diagnostic flow for errors reported by the operating
system.
1–4 System Maintenance Strategy
Table 1–2 Diagnostic Flow for Power Problems
Symptom Action Reference
No AC power at system as indicated by AC present LED.
AC power is present, but system does not power on.
Check the power source and power cord.
Check the system AC circuit breaker
setting.
Check the DC on/off switch setting.
Examine power supply subsystem LEDs
to determine if a power supply unit
or fan has failed, or if the system has
shut down due to an overtemperature
condition.
Section 2.1.1
Table 1–3 Diagnostic Flow for Problems Getting to Console Mode
Symptom Action Reference
Power-up screens (or console event log) are not displayed.
Check OCP LEDs for a failure during
self-tests. If two OCP LEDs remain lit,
either option could be at fault.
Check baud rate setting for console
terminal and system. The system default
baud rate setting is 9600.
Try connecting the console terminal to
the auxiliary console port.
Note: No console output is directed to
the auxiliary console port untill the
power-up self-tests have completed and
you press the Enter key or Ctrl/x.
For certain situations, power up under
robust mode to bypass the power-up
script and get to a low-level console.
From console mode, you can then edit the
nvram file, set and examine environment
variables, or initialize individual phases
of drivers.
Section 2.1.2
Section 6.5
Section 2.2.3
System Maintenance Strategy 1–5
Table 1–4 Diagnostic Flow for Problems Reported by the Console Program
Symptom Action Reference
Power-up screens are displayed, but tests do not complete.
Console program reports error.
Use power-up display and/or OCP LEDs
to determine error.
Examine the console event log to check
for embedded error messages recorded
during power-up.
If power-up screens indicate problems
with mass storage devices, use the
troubleshooting flow charts to determine
the problems.
Run RBD tests to verify problem. Section 3.1
Use the
examine error information contained
in serial control bus EEPROMs.
show error
command to
Section 2.2 and Section 2.1.2
Section 2.2.1
Section 2.2.2
Section 3.1.4
Table 1–5 Diagnostic Flow for Boot Problems
Symptom Action Reference
System cannot find boot device.
Device does not boot. Run device test to check that boot device
Check system configuration for correct
device parameters (node ID, device name,
and so on) and environment variables
(bootdef_dev, boot_file, boot_osflags).
is operating.
Section 6.2.1, Section 6.3, and Section 6.4
Section 3.2
1–6 System Maintenance Strategy
Table 1–6 Diagnostic Flow for Errors Reported by the Operating System
Symptom Action Reference
System is hung or has crashed.
Operating system is up. Examine the operating system error log
Examine the crash dump file. Operating system
Use the
examine error information contained
in serial control bus EEPROMs (console
environment error log).
files to isolate the problem.
If the problem occurs intermittently, run
DEC VET or UETP to stress the system.
Examine other log files, such as
SETHOST.LOG, OPCOM.LOG, and
OPERATOR.LOG.
show error
command to
documentation
Section 3.1.4
Chapter 4
Section 3.3 and Section 3.4
1.2 Service Delivery Methodology
Before beginning any maintenance operation, you should be familiar with the following:
The site agreement
Your local and area geography support and escalation procedures
Your Digital Services product delivery plan
System Maintenance Strategy 1–7
Service delivery methods are part of the service support and escalation procedure. When appropriate, remote services should be part of the initial system installation. Methods of service delivery include the following:
Local support
Remote call screening
Remote diagnosis (using modem support)
Recommended System Installation
The recommended system installation includes:
1. Hardware installation and acceptance testing. Acceptance testing includes
running ROM-based diagnostics.
2. Software installation and acceptance testing. For example, using OpenVMS
Factory Installed Software (FIS), and then acceptance testing with DEC VET or UETP.
3. Installation of the remote service tools and equipment to allow a Digital
Service Center to dial in to the system. Refer to your remote service delivery strategy.
If you do not follow your service delivery methodology, you risk incurring excessive service expenses for any product.
1.3 Product Service Tools and Utilities
This section lists the array of service tools and utilities available for acceptance testing, diagnosis, and serviceability and provides recommendations for their use.
Error Handling/Logging
OpenVMS and DEC OSF/1 operating systems provide recovery from errors, fault handling, and event logging. The OpenVMS Error Report Formatter (ERF) provides bit-to-text translation of the event logs for interpretation. DEC OSF/1 uses UERF to capture the same kinds of information.
RECOMMENDED USE: Analysis of error logs is the primary method of diagnosis and fault isolation. If the system is up, or the customer allows the service representative to bring the system up, look at this information first. Refer to Chapter 4 for information on using error logs to isolate faults.
1–8 System Maintenance Strategy
ROM-Based Diagnostics (RBDs)
ROM-based diagnostics have significant advantages:
There is no load time.
The boot path is more reliable.
Diagnosis is done in console mode.
RECOMMENDED USE: The ROM-based diagnostic facility is the primary means of console environment testing and diagnosis of the CPU, memory, Ethernet, Futurebus+, and SCSI and DSSI subsystems. Use ROM-based diagnostics in the acceptance test procedures when you install a system, add a memory module, or replace the following: CPU module, memory module, backplane, I/O module, Futurebus+ device, or storage device. Refer to Section 3.1 for information on running ROM-based diagnostics.
Loopback Tests
Internal and external loopback tests are used to isolate a failure by testing segments of a particular control or data path. The loopback tests are a subset of the ROM-based diagnostics.
RECOMMENDED USE: Use loopback tests to isolate problems with the auxiliary console port and Ethernet controllers. Refer to Section 3.1.12 for instructions on performing loopback tests.
Firmware Console Commands
Console commands are used to set and examine environment variables and device parameters. For example, the and
show device
set
(bootdef_dev, auto_action, and boot_osflags) commands are used to set environment variables; and the parameters.
RECOMMENDED USE: Use console commands to set and examine environment variables and device parameters. Refer to Section 6.2 for information on firmware commands and utilities.
commands are used to examine the configuration; the
show memory,show configuration
cdp
command is used to configure DSSI
System Maintenance Strategy 1–9
,
Option LEDs During Power-Up
The power supply LEDs display pass/fail test results for the power supply subsystem; the operator control panel (OCP) LEDs display pass/fail self-test results for CPU, memory, I/O, and Futurebus+ modules. Storage devices and Futurebus+ modules have their own LEDs as well.
RECOMMENDED USE: Monitor LEDs during power-up to see if the devices pass their self-tests. Refer to Chapter 2 for information on LEDs and power­up tests.
Operating System Exercisers (DEC VET or UETP)
The Digital Verifier and Exerciser Tool (DEC VET) is supported by the OpenVMS and DEC OSF/1 operating systems. DEC VET performs exerciser­oriented maintenance testing of both hardware and operating system. UETP is included with OpenVMS and is designed to test whether the OpenVMS operating system is installed correctly.
RECOMMENDED USE: Use DEC VET or UETP as part of acceptance testing to ensure that the CPU, memory, disk, tape, file system, and network are interacting properly. Also use DEC VET or UETP to stress test the user’s environment and configuration by simulating system operation under heavy loads to diagnose intermittent system failures.
Crash Dumps
For fatal errors, such as fatal bugchecks, OpenVMS and DEC OSF/1 operating systems will save the contents of memory to a crash dump file.
RECOMMENDED USE: The support representative should analyze crash dump files. To save a crash dump file for analysis, you need to know proper system settings. Refer to the OpenVMS AXP Alpha System Dump Analyzer Utility Manual or the Guide to Kernel Debugging (AA–PS2TA–TE) for instructions.
Other Log Files
Several types of log files, such as operator log, console event log, sethost log, and accounting file (accounting.dat) are useful in troubleshooting.
RECOMMENDED USE: Use the sethost log and other log files to capture/examine the console output and compare with event logs and crash dumps in order to see what the system was doing at the time of the error.
1–10 System Maintenance Strategy
1.4 Information Services
As a Digital service representative, you may access several information resources, including advanced database applications, online training courses, and remote diagnostic tools. A brief description of some of these resources follows.
Technical Information Management Architecture (TIMA)
TIMA is an online database that delivers technical and reference information to service representatives. A key benefit of TIMA is the pooling of worldwide knowledge and expertise.
DEC 4000 AXP Model 600 Series Information Set
The DEC 4000 AXP Model 600 Series Information Set consists of service documentation that contains information on installing and using, servicing and upgrading, and understanding the system. The guide you are reading is part of the set. The hardcopy kit number is EK–KN430–DK. The set is also available on TIMA. Refer to your DEC 4000 Model 600 Information Map (EK–KN430–IN) for detailed information.
Training
Computer Based Training (CBT) and lecture lab courses are available from the Digital training center:
DEC 4000 System Installation and Troubleshooting (CBT course, EY–
I090E–CO)
Alpha Architecture Concepts (CBT course, EY–K725E–MT—magnetic
tape; EY–K725E–TK—TK50 tape)
Futurebus+ Concepts (EY–F479E–CO)
Digital Services Product Delivery Plan (Hardware or Software)
The Product Delivery Plan documents Digital Services’ delivery commitments. The plan is the communications vehicle used among the various groups responsible for ensuring consistency between Digital Services’ delivery strategies and engineering product strategies.
Blitzes
Technical updates are ‘‘blitzed’’ to the field using online mail and TIMA.
System Maintenance Strategy 1–11
Storage and Retrieval System (STARS)
STARS is a worldwide database for storing and retrieving technical information. The STARS databases, which contain more than 150,000 entries, are updated daily.
Using STARS, you can quickly retrieve the most up-to-date technical information via DSNlink or DSIN.
1.5 Field Feedback
Providing the proper feedback to the corporation is essential in closing the loop on any service call. Consider the following when completing a service call:
Fill out repair tags accurately and with as much symptom information as possible so that repair centers can fix a problem.
Provide accurate call closeout information for Labor Activity Reporting System (LARS) or Call-Handling and Management Planning (CHAMP).
Keep an up-to-date site maintenance log, whether hardcopy or electronic, to provide a record of the performed maintenance.
1–12 System Maintenance Strategy
2
Power-On Diagnostics and System
LEDs
This chapter provides information on how to interpret system LEDs and the power-up console screens. In addition, a description of the power-up and bootstrap sequence is provided as a resource to aid in troubleshooting.
Section 2.1 describes how to interpret system LEDs.
Section 2.2 describes how to interpret the power-up screens.
Section 2.3 describes the power-up sequence.
Section 2.3.3 describes power-on self-tests.
Section 2.4 describes the boot sequence.
2.1 Interpreting System LEDs
DEC 4000 AXP systems have several diagnostic LEDs that indicate whether modules and subsystems have passed self-tests. The power system controller constantly monitors the power supply subsystem and can indicate several types of failures. The system LEDs are used primarily to troubleshoot power problems and problems getting to the console program.
This section describes the function of each of the following types of system LEDs, and what action to take when a failure is indicated.
Power supply LEDs
Operator control panel (OCP) LEDs
I/O panel LEDs
Futurebus+ option LEDs
Storage device LEDs
Power-On Diagnostics and System LEDs 2–1
2.1.1 Power Supply LEDs
The power supply LEDs (Figure 2–1) are used to indicate the status of the components that make up the power supply subsystem. The following types of failures will cause the power system controller to shut down the system:
Power system controller (PSC) failure
Fan failure
Overtemperature condition
Power regulator failures (indicated by the DC3 or DC5 failure LEDs)
Front end unit (FEU) failure
Note
The AC circuit breaker will also shut down the system. If a power surge occurs, the breaker will trip, causing the switch to return to the off position (0). If the circuit breaker trips, wait 30 seconds before setting the switch to the on position (1).
Refer to Table 2–1 for information on interpreting the LEDs and determining what actions to take when a failure is indicated.
Figure 2–2 shows the local disk converter (LDC) and fan locations as they correspond to the fault ID display.
2–2 Power-On Diagnostics and System LEDs
Figure 2–1 Power Supply LEDs
PSC DC3FEU DC5
MO
SI
SO
AC Circuit
Breaker FEU Failure
FEU OK DC3 Failure DC3 OK
DC5 Failure DC5 OK
PSC Failure PSC OK
Over
Overtemperature Shutdown
Fan Failure Disk Power Failure Fault ID Display
AC Present
LJ-02011-TI0
Power-On Diagnostics and System LEDs 2–3
Table 2–1 Interpreting Power Supply LEDs
Indicator Meaning Action on Error Front End Unit (FEU)
AC Present When lit, indicates AC power
is present at the AC input connector (regardless of circuit breaker position).
FEU OK When lit, indicates DC output
voltages for the FEU are above the specified minimum.
FEU Failure When lit, indicates DC output
voltages for the FEU are less than the specified minimum.
If AC power is not present, check the power source and power cord.
If the system will not power up and the AC LED is the only lit LED, check if the system AC circuit breaker has tripped. Replace the front end unit (Chapter 5) if the system circuit breaker is broken.
Replace front end unit (Chapter 5).
(continued on next page)
2–4 Power-On Diagnostics and System LEDs
Table 2–1 (Cont.) Interpreting Power Supply LEDs
Indicator Meaning Action on Error Power System Controller (PSC)
PSC OK When blinking, indicates the
PSC Failure When lit, indicates the PSC has
Disk Power Failure
Fan Failure When lit, indicates a fan has
Overtemperature Shutdown
PSC is performing power-up self-tests.
When steady, indicates the PSC is functioning normally.
detected a fault in itself. When lit, indicates a disk
power problem for the storage compartment specified in the hexadecimal fault ID display. The most likely failing unit is the local disk converter, but a shorting cable or drive could also be at fault.
failed or a cable guide is not properly secured. The failure is identified by a number displayed in the hexadecimal fault ID display.
When lit, indicates the PSC has shut down the system due to excessive internal temperature.
Replace power system controller (Chapter 5).
To isolate the local disk converter, disconnect the drives on the specified bus and then power up the system. If the Disk Power Failure LED lights with the drives disconnected, replace the failing local disk converter (Chapter 5). Refer to Figure 2–2 to locate the local disk converter specified by the fault ID display. A is the top compartment, D is the bottom compartment.
Refer to Figure 2–2 to locate the failure specified by the fault ID display.
Replace the failing fan (Chapter 5).
Set the AC circuit breaker to off (0) and wait one minute before turning on the system.
Make sure the air intake is unobstructed and that the room temperature does not exceed maximum requirement as described in the DEC 4000 Site Preparation Checklist.
(continued on next page)
Power-On Diagnostics and System LEDs 2–5
Table 2–1 (Cont.) Interpreting Power Supply LEDs
Indicator Meaning Action on Error DC–DC Converter (DC3)
DC3 OK When lit, indicates that all the
DC3 output voltages are within specified tolerances.
DC3 Failure When lit, indicates that one of
the output voltages is outside
Replace the DC3 converter (Chapter 5).
specified tolerances.
DC–DC Converter (DC5)
DC5 OK When lit, indicates the DC5
output voltage is within specified tolerances.
DC5 Failure When lit, indicates the DC5
output voltage is outside
Replace the DC5 converter (Chapter 5).
specified tolerances.
Figure 2–2 LDC and Fan Unit Locations and Error Codes
Fan Error Codes
Local Disk Converter A
Local Disk Converter B
Local Disk Converter C
Local Disk Converter D
Fan 3 Fan 4 Fan 1
3
1 - Rear left 2 - Rear right 3 - Front left 4 - Front right
9 - A cable guide is not
properly secured or two or more fans have failed.
4
Fans are located behind the cable guides
Fan 2
1
2
MLO-010872
2–6 Power-On Diagnostics and System LEDs
Loading...
+ 212 hidden pages