Sun Microsystems Fire T2000 Service Manual

Sun Fire™ T2000 Server
Service Manual
Sun Microsystems, Inc. www.sun.com
Part No. 819-2548-10 October 2005, Revision A
Submit comments about this document at: http://www.sun.com/hwdocs/feedback
Copyright 2005 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. Sun Microsystems, Inc. has intellectual property rights relating to technology that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the U.S. and in other countries.
This document and the product to which it pertains are distributed under licenses restricting their use, copying, distribution, and decompilation. No part of the product or of this document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.
Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd. Sun, Sun Microsystems, the Sun logo, AnswerBook2, docs.sun.com, Java, OpenBoot, SunSolve, SunVTS, Sun Fire, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and in other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and in other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements.
U.S. Government Rights – Commercial use. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements.
DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Contents
Preface ix
1. Sun Fire T2000 Server Overview 1
Sun Fire T2000 Server Features 2
Chip-Multithreaded (CMT) Multicore Processor and Memory Technology 2
Performance Enhancements 4
Remote Manageability With ALOM 5
System Reliability, Availability, and Serviceability 5
Hot-Pluggable and Hot-Swappable Components 6
Power Supply Redundancy 6
Fan Redundancy 6
Environmental Monitoring 7
Error Correction and Parity Checking 7
Predictive Self Healing 7
Chassis Identification 9
Additional Service Related Information 10
2. Sun Fire T2000 Server Diagnostics 11
Overview of Sun Fire T2000 Server Diagnostics 12
Using LEDs to Identify the State of Devices 16
Front and Rear Panel LEDs 16
Hard Drive LEDs 19
Power Supply LEDs 20
Fan LEDs 21
Blower Unit LED 21
Using ALOM For Diagnosis and Repair Verification 22
Running ALOM Service-Related Commands 24
Connecting to ALOM 24
Switching Between the System Console and ALOM 24
Service-Related ALOM Commands 25
To Run the showfaults Command 26
To Run the showenvironment Command 27
To Run the showfru Command 29
Running POST 31
Controlling How POST Runs 31
To Change POST Parameters 34
Reasons to Run POST 35
Routine Sanity Check of the Hardware 35
Diagnosing the System Hardware 35
To Run POST 35
Using the Solaris Predictive Self-Healing Feature 40
To Use the fmdump Command to Identify Faults 41
Collecting Information From Solaris OS Files and Commands 43
To Check the Message Buffer 43
To View System Message Log Files 43
Managing Components with Automatic System Recovery (ASR) Commands 44
To Run the showcomponent Command 46
To Run the disablecomponent Command 47
To Run the enablecomponent Command 47
Exercising the System With SunVTS 48
Checking Whether SunVTS Software Is Installed 48
To Check Whether SunVTS Software Is Installed 48
Exercising the System Using SunVTS Software 50
To Exercise the System Using SunVTS Software 50
For further information, refer to the manuals that accompany the SunVTS
software 53
3. Replacing Hot-Swappable and Hot-Pluggable FRUs 55
Devices That Are Hot-Swappable and Hot-Pluggable 56
Hot-Swapping a Fan 56
To Remove a Fan 57
To Replace a Fan 58
Hot-Swapping a Power Supply 58
To Remove a Power Supply 58
To Replace a Power Supply 60
Hot-Swapping the Rear Blower 61
To Remove the Rear Blower 61
To Replace the Rear Blower 61
Hot-Plugging a Hard Drive 63
To Remove a Hard Drive 63
To Replace a Hard Drive 64
4. Replacing Cold Swappable FRUs 65
Safety Information 66
Safety Symbols 66
Electrostatic Discharge Safety 67
Use an Antistatic Wrist Strap 67
Use an Antistatic Mat 67
Common Procedures for Parts Replacement 67
Required Tools 68
To Shut the System Down 68
To Extend the Server to the Maintenance Position 69
To Remove the Server From the Rack 70
To Disconnect Power From the Server 72
To Perform Electrostatic Discharge (ESD) Prevention Measures 72
To Remove the Top Cover 72
To Remove the Front Bezel and Top Front Cover 73
Removing and Replacing FRUs 74
To Remove PCI-E and PCI-X Cards 75
To Replace PCI Cards 77
To Remove DIMMs 77
To Replace DIMMs 79
To Remove the System Controller 82
To Replace the System Controller Board 83
To Remove the Motherboard Assembly 84
To Replace the Motherboard Assembly 88
To Remove the Power Distribution Board 90
To Replace the Power Distribution Board 92
To Remove the LED Board 93
To Replace the LED Board 94
To Remove the Fan Power Board 95
To Replace the Fan Power Board 96
To Remove the Front I/O Board 96
To Replace the Front I/O Board 97
To Remove the DVD Drive 98
To Replace the DVD Drive 99
To Remove the SAS Disk Backplane 99
To Replace the SAS Disk Backplane 100
To Remove the Battery on the System Controller 101
To Replace the Battery on the System Controller 101
Common Procedures for Finishing Up 103
To Replace the Top Front Cover and Front Bezel 103
To Replace the Top Cover 104
To Reinstall Server Chassis in the Rack 104
To Return the Server to the Normal Rack Position 105
To Apply Power to the Server 107
5. Adding New Components and Devices 109
Adding Hot-Pluggable and Hot-Swappable Devices 110
To Add a Hard Drive to the Server 110
To Add a USB Device 111
Adding Components Inside the Chassis 113
To Add DIMMs 113
To Add a PCI-E or PCI-X Card 116
A. Field-Replaceable Units 119
Preface
The Sun Fire T2000 Service Manual provides information to aid in diagnosing hardware problems and describes how to replace components within the Sun Fire™ T2000 server. This guide also describes how to add components such as hard drives and memory to the server.
This manual is written for technicians, service personnel, and system administrators who service and repair computer systems. The person qualified to use this manual:
Can open a system chassis and identify and replace internal components.
Understands the Solaris Operating System and the command-line interface.
Has superuser privileges for the system being serviced.
Understands typical hardware troubleshooting tasks.
How This Book Is Organized
This guide is organized into the following chapters:
Chapter 1 describes the main features of the Sun Fire T2000 server.
Chapter 2 describes the diagnostics that are available for monitoring and diagnosing the Sun Fire T2000 server.
Chapter 3 explains how to remove and replace hot-swappable and hot-pluggable field replaceable units (FRUs).
Chapter 4 describes how to remove and replace the FRUs that cannot be hot-swapped.
Chapter 5 explains how to add new components such as hard drives, memory, and PCI cards to the Sun Fire T2000 server.
Appendix A provides an illustrated breakdown of parts and lists the field replaceable units (FRUs).
Sun Fire T2000 Server Documentation
You can view and print the following manuals from the Sun documentation web site at: http://www.sun.com/documentation
Title | Description | Part Number
Sun Fire T2000 Server Site Planning Guide | Site planning information for the Sun Fire T2000 server | 819-2545
Sun Fire T2000 Server Product Notes | Late-breaking information about the server | 819-2544
Sun Fire T2000 Server Overview | Overview of the features of this server | 819-2543
Sun Fire T2000 Server Getting Started Guide | Information about where to find documentation to get your system installed and running quickly | 819-2542
Sun Fire T2000 Server Installation Guide | Detailed rackmounting, cabling, power-on, and configuration information | 819-2546
Sun Fire T2000 Server Administration Guide | How to perform administrative tasks that are specific to the Sun Fire T2000 server | 819-2549
Sun Fire T2000 Server Advanced Lights Out Manager (ALOM) Guide | How to use the Advanced Lights Out Manager (ALOM) software on the Sun Fire T2000 server | 819-2550
Typographic Conventions
Typeface¹ | Meaning | Examples
AaBbCc123 | The names of commands, files, and directories; on-screen computer output | Edit your .login file. Use ls -a to list all files. % You have mail.
AaBbCc123 | What you type, when contrasted with on-screen computer output | % su Password:
AaBbCc123 | Book titles, new words or terms, words to be emphasized. Replace command-line variables with real names or values. | Read Chapter 6 in the User’s Guide. These are called class options. You must be superuser to do this. To delete a file, type rm filename.
¹ The settings on your browser might differ from these settings.
Shell Prompts
Shell Prompt
C shell machine-name%
C shell superuser machine-name#
Bourne shell and Korn shell $
Bourne shell and Korn shell superuser #
Accessing Sun Documentation
You can view, print, or purchase a broad selection of Sun documentation, including localized versions, at:
http://www.sun.com/documentation
Third-Party Web Sites
Sun is not responsible for the availability of third-party web sites mentioned in this document. Sun does not endorse and is not responsible or liable for any content, advertising, products, or other materials that are available on or through such sites or resources. Sun will not be responsible or liable for any actual or alleged damage or loss caused by or in connection with the use of or reliance on any such content, goods, or services that are available on or through such sites or resources.
Contacting Sun Technical Support
If you have technical questions about this product that are not answered in this document, go to:
http://www.sun.com/service/contacting
Sun Welcomes Your Comments
Sun is interested in improving its documentation and welcomes your comments and suggestions. You can submit your comments by going to:
http://www.sun.com/hwdocs/feedback
Please include the title and part number of your document with your feedback:
Sun Fire T2000 Server Service Manual, part number 819-2548-10
CHAPTER 1
Sun Fire T2000 Server Overview
This chapter provides an overview of the features of the Sun Fire T2000 server.
The following topics are covered:
“Sun Fire T2000 Server Features” on page 2
“Chassis Identification” on page 9
Sun Fire T2000 Server Features
The Sun Fire T2000 server is a high-performance entry-level server that is highly scalable and extremely reliable.
FIGURE 1-1 Sun Fire T2000 Server
Chip-Multithreaded (CMT) Multicore Processor and Memory Technology
The UltraSPARC® T1 multicore processor is the basis of the Sun Fire T2000 server. The UltraSPARC T1 processor is based on chip multithreading (CMT) technology that is optimized for highly threaded transactional processing. The UltraSPARC T1 processor improves throughput while using less power and dissipating less heat than conventional processor designs.
Depending on the model purchased, the processor has four, six, or eight UltraSPARC cores. Each core equates to a 64-bit execution pipeline capable of running four threads. The result is that the 8-core processor handles up to 32 active threads concurrently.
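The thread counts implied by the paragraph above can be checked with a short calculation. The following Python sketch multiplies the number of cores by the four threads per core stated in this section; the function and constant names are illustrative only and are not part of any Sun software.

```python
THREADS_PER_CORE = 4  # each UltraSPARC T1 core runs four threads (per the text above)

def max_active_threads(cores: int) -> int:
    """Return the number of concurrently active threads for a given core count."""
    if cores not in (4, 6, 8):
        # The manual describes only 4-, 6-, and 8-core models.
        raise ValueError("expected 4, 6, or 8 cores")
    return cores * THREADS_PER_CORE

if __name__ == "__main__":
    for cores in (4, 6, 8):
        print(cores, "cores:", max_active_threads(cores), "threads")
```

For the 8-core model this gives the 32 threads quoted in the text.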
Additional processor components, such as L1 cache, L2 cache, memory access crossbar, DDR2 memory controllers, and a JBus I/O interface have been carefully tuned for optimal performance.
FIGURE 1-2 Motherboard and UltraSPARC T1 Multicore Processor
Performance Enhancements
The Sun Fire T2000 server introduces several new technologies with its sun4v architecture and multithreaded UltraSPARC T1 multicore processor.
Some of these enhancements are:
Large page optimization
Reduction in TLB misses
Optimized block copy
TABLE 1-1 lists feature specifications for the Sun Fire T2000 server.
TABLE 1-1 Sun Fire T2000 System Features at a Glance
Feature | Description
Processor | 1 UltraSPARC T1 multicore processor (4, 6, or 8 cores)
Memory | 16 slots that can be populated with one of the following types of DDR-2 DIMMs:
• 512 MB (8 GB maximum)
• 1 GB (16 GB maximum)
• 2 GB (32 GB maximum)
Ethernet ports | 4 ports, 10/100/1000 Mb autonegotiating
Internal hard disk drives | 1-4 SFF SAS drives, 2.5-inch form factor
Other internal peripherals | 1 slimline DVD drive
USB ports | 4 USB 1.1 ports (2 in front and 2 in rear)
Cooling | 3 hot-swappable and redundant system fans and 1 blower unit
PCI interfaces | 3 PCI-Express (PCI-E) slots for low-profile cards (supports 1x, 4x, and 8x width cards); 2 PCI-X slots for 64-bit 133 MHz low-profile cards. Note: One PCI-X slot is occupied by a SAS disk controller card.
Power | 2 hot-swappable and redundant power supplies
Remote management | ALOM management controller with a serial and 10/100 Mb Ethernet port
Firmware | OpenBoot PROM (OBP) for reset and POST support; ALOM for remote management administration
Operating system* | Solaris 10 3/05 HW2 Operating System preinstalled on disk 0
Other software | Java™ Enterprise System with a 90-day trial license
* Check the Sun Fire T2000 Product Notes for the latest information about supported releases of the Solaris OS.
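The memory maximums listed in TABLE 1-1 follow from populating all 16 slots with DIMMs of a single size. A minimal Python sketch of that arithmetic is shown here; the names are hypothetical and chosen only for illustration.

```python
DIMM_SLOTS = 16  # number of DIMM slots listed in TABLE 1-1

def max_memory_gb(dimm_size_mb: int, slots: int = DIMM_SLOTS) -> float:
    """Return total memory in GB when every slot holds a DIMM of the given size."""
    return dimm_size_mb * slots / 1024

if __name__ == "__main__":
    for size_mb in (512, 1024, 2048):
        print(size_mb, "MB DIMMs:", max_memory_gb(size_mb), "GB maximum")
```

The three supported DIMM sizes yield 8 GB, 16 GB, and 32 GB, matching the table.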
Remote Manageability With ALOM
The Sun Advanced Lights Out Manager (ALOM) feature is a system controller (SC) that enables you to remotely manage and administer the Sun Fire T2000 server.
The ALOM software is preinstalled as firmware, and it initializes as soon as you apply power to the system. You can customize ALOM to work with your particular installation.
ALOM enables you to monitor and control your server over a network, or by using a dedicated serial port for connection to a terminal or terminal server. ALOM provides a command-line interface that you can use to remotely administer geographically distributed or physically inaccessible machines. In addition, ALOM enables you to run diagnostics (such as POST) remotely that would otherwise require physical proximity to the server’s serial port.
You can configure ALOM to send email alerts of hardware failures, hardware warnings, and other events related to the server or to ALOM. The ALOM circuitry runs independently of the server, using the server’s standby power. Therefore, ALOM firmware and software continue to function when the server operating system goes offline or when the server is powered off. ALOM monitors the following Sun Fire T2000 server components:
CPU temperature conditions
Hard drive status
Enclosure thermal conditions
Fan speed and status
Power supply status
Voltage levels
Faults detected by POST (Power-On Self-Test)
Solaris Predictive Self Healing (PSH) diagnostic facilities
For information about configuring and using the ALOM system controller, refer to the Sun Fire T2000 Server Advanced Lights Out Manager (ALOM) Guide.
System Reliability, Availability, and Serviceability
Reliability, availability, and serviceability (RAS) are aspects of a system’s design that affect its ability to operate continuously and to minimize the time necessary to service the system. Reliability refers to a system’s ability to operate continuously without failures and to maintain data integrity. System availability refers to the ability of a system to recover to an operational state after a failure, with minimal impact. Serviceability relates to the time it takes to restore a system to service following a system failure. Together, reliability, availability, and serviceability features provide for near continuous system operation.
To deliver high levels of reliability, availability, and serviceability, the Sun Fire T2000 server offers the following features:
Hot-pluggable hard drives
Redundant, hot-swappable power supplies (two)
Redundant hot-swappable fan units (three)
Environmental monitoring
Error detection and correction for improved data integrity
Easy access for most component replacements
Extensive POST tests that automatically remove faulty components from the configuration.
PSH automated runtime diagnosis capability that takes faulty components offline.
For more information about using RAS features, refer to the Sun Fire T2000 Server Administration Guide.
Hot-Pluggable and Hot-Swappable Components
Sun Fire T2000 hardware supports hot-plugging or hot-swapping of the chassis-mounted hard drives, fans, power supplies, and the rear blower. Using the proper software commands, you can install or remove these components while the system is running. Hot-plug and hot-swap technology significantly increases the system’s serviceability and availability by providing the ability to replace hard drives, fan units, rear blower, and power supplies without service disruption.
Power Supply Redundancy
The Sun Fire T2000 server features two hot-swappable power supplies, which enable the system to continue operating if one of the power supplies or one power source fails.
The Sun Fire T2000 server also has a single hot-swappable blower unit that works in conjunction with the power supply fans to provide cooling for the internal disk drives. If the blower unit fails, the two power supply fan units provide enough cooling for the disk drive bay to keep the system running.
Fan Redundancy
The Sun Fire T2000 server features three hot-swappable system fans. Multiple fans enable the system to continue operating with adequate cooling in the event that one of the fans fails.
Environmental Monitoring
The Sun Fire T2000 server features an environmental monitoring subsystem designed to protect the server and its components against:
Extreme temperatures
Lack of adequate airflow through the system
Power supply failures
Hardware faults
Temperature sensors located throughout the system monitor the ambient temperature of the system and internal components. The software and hardware ensure that the temperatures within the enclosure do not exceed predetermined safe operating ranges. If the temperature observed by a sensor falls below a low-temperature threshold or rises above a high-temperature threshold, the monitoring subsystem software lights the amber Service Required LEDs on the front and back panel. If the temperature condition persists and reaches a critical threshold, the system initiates a graceful system shutdown.
All error and warning messages are sent to the system controller (SC) and the console, and are logged in the ALOM log file. Additionally, some FRUs such as power supplies provide LEDs that indicate a failure within the FRU.
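The threshold behavior described in this section can be illustrated with a small Python sketch. The threshold values, class name, and return strings below are hypothetical; the actual limits are set by the system firmware and are not given in this manual. The sketch also ignores the persistence condition mentioned above and classifies a single reading only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    """Hypothetical limits in degrees Celsius (not taken from the manual)."""
    low_critical: float = -5.0
    low_warning: float = 5.0
    high_warning: float = 45.0
    high_critical: float = 55.0

def classify(temp_c: float, limits: Thresholds = Thresholds()) -> str:
    """Map one sensor reading to the response described in the text."""
    # A reading at or beyond a critical threshold leads to a graceful shutdown.
    if temp_c <= limits.low_critical or temp_c >= limits.high_critical:
        return "graceful shutdown"
    # A reading outside the warning range lights the Service Required LEDs.
    if temp_c < limits.low_warning or temp_c > limits.high_warning:
        return "light Service Required LEDs"
    return "normal"

if __name__ == "__main__":
    for reading in (25.0, 50.0, 60.0):
        print(reading, "->", classify(reading))
```

Separating the limits from the classification function keeps the placeholder numbers in one place, so they can be replaced without touching the logic.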
Error Correction and Parity Checking
The UltraSPARC T1 multicore processor provides parity protection on its internal cache memories, including tag parity and data parity on the D-cache and I-cache. The internal 3 MB L2 cache has parity protection on the tags, and ECC protection of the data.
Advanced ECC, also called chipkill, corrects up to 4 bits in error on nibble boundaries, as long as they are all in the same DRAM. If a DRAM fails, the DIMM continues to function.
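As a general illustration of parity checking (not a description of the processor's actual circuitry), the following Python sketch computes an even-parity bit for a data word and shows that a single flipped bit is detected. Parity detects such an error but cannot locate or correct it, which is why ECC is used where correction is required. All names here are illustrative.

```python
def even_parity_bit(value: int) -> int:
    """Return the bit that makes the total count of 1 bits even."""
    return bin(value).count("1") % 2

def parity_ok(value: int, stored_parity: int) -> bool:
    """Check a data word against the parity bit stored alongside it."""
    return even_parity_bit(value) == stored_parity

if __name__ == "__main__":
    word = 0b10110010                 # example data word
    parity = even_parity_bit(word)    # parity computed when the word is written
    corrupted = word ^ 0b00000001     # simulate a single-bit error
    print("original passes:", parity_ok(word, parity))
    print("corrupted passes:", parity_ok(corrupted, parity))
```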
Predictive Self Healing
The Sun Fire T2000 server features the latest fault management technologies. With the Solaris 10 Operating System (OS), Sun is introducing a new architecture for building and deploying systems and services capable of predictive self-healing. Self-healing technology enables Sun systems to accurately predict component failures and mitigate many serious problems before they actually occur. This technology is incorporated into both the hardware and software of the Sun Fire T2000 server.
At the heart of the predictive self-healing capabilities is the Solaris Fault Manager, a service that receives data relating to hardware and software errors, and automatically and silently diagnoses the underlying problem. Once a problem is diagnosed, a set of agents automatically responds by logging the event and, if necessary, taking the faulty component offline. Because problems are diagnosed automatically, business-critical applications and essential system services can continue uninterrupted in the event of software failures or major hardware component failures.
Chassis Identification
FIGURE 1-3 and FIGURE 1-4 show the physical characteristics of the Sun Fire T2000 server.
FIGURE 1-3 Sun Fire T2000 Server Front Panel (callouts: indicators and buttons, USB ports 2 and 3, hard drives 0 through 3, DVD drive)
FIGURE 1-4 Sun Fire T2000 Server Rear Panel (callouts: power supplies 0 and 1, PCI-E slots, PCI-X slots, SC serial management port, SC network management port, GBE ports 0 through 3, TTYA serial port, USB ports 0 and 1, indicators)
Additional Service Related Information
In addition to this service manual, the following resources are available to help you keep your server running optimally:
Product Notes – The Sun Fire T2000 Server Product Notes (819-2544) contain late-breaking information about the system, including required software patches, updated hardware and compatibility information, and solutions to known issues. The product notes are available online at:
http://www.sun.com/documentation
Release Notes – The Solaris OS Release Notes contain important information about the Solaris OS. The release notes are available online at:
http://www.sun.com/documentation
SunSolve Online – Provides a collection of support resources. Depending on the level of your service contract, you have access to Sun patches, the Sun System Handbook, the SunSolve™ knowledge base, the Sun Support Forum, and additional documents, bulletins, and related links. Access this site at:
http://sunsolve.sun.com
Predictive Self-Healing Knowledge Database – You can access the knowledge article corresponding to a self-healing message by taking the Sun Message Identifier (SUNW-MSG-ID) and entering it into the field on this page:
http://www.sun.com/msg
CHAPTER 2
Sun Fire T2000 Server Diagnostics
This chapter describes the diagnostics that are available for monitoring and troubleshooting the Sun Fire T2000 server. This chapter does not provide troubleshooting methods; instead, it describes the Sun Fire T2000 server diagnostic facilities and how to use them.
This chapter is intended for technicians, service personnel, and system administrators who service and repair computer systems.
The following topics are covered:
“Overview of Sun Fire T2000 Server Diagnostics” on page 12
“Using LEDs to Identify the State of Devices” on page 16
“Using ALOM For Diagnosis and Repair Verification” on page 22
“Running POST” on page 31
“Using the Solaris Predictive Self-Healing Feature” on page 40
“Collecting Information From Solaris OS Files and Commands” on page 43
“Managing Components with Automatic System Recovery (ASR) Commands” on page 44
“Exercising the System With SunVTS” on page 48
Overview of Sun Fire T2000 Server Diagnostics
There are a variety of diagnostic tools, commands, and indicators you can use to monitor and troubleshoot a Sun Fire T2000 server:
LEDs – Provide a quick visual notification of the status of the server and of some of the FRUs.
ALOM firmware – This system firmware runs on the system controller. In addition to providing the interface between the hardware and OS, ALOM also tracks and reports the health of key server components. ALOM works closely with POST and Solaris predictive self-healing technology to keep the system up and running even when there is a faulty component.
Power-on self-test (POST) – POST performs diagnostics on system components upon system reset to ensure the integrity of those components. POST is configurable and works with ALOM to take faulty components offline if needed.
Solaris OS predictive self-healing (PSH) – This technology continuously monitors the health of the CPU and memory, and works with ALOM to take a faulty component offline if needed. The predictive self-healing technology enables Sun systems to accurately predict component failures and mitigate many serious problems before they occur.
Log files and console messages – Provide the standard Solaris OS log files and investigative commands that can be accessed and displayed on the device of your choice.
SunVTS – An application that exercises the system, provides hardware validation, and discloses possible faulty components with recommendations for repair.
The LEDs, ALOM, Solaris OS PSH, and many of the log files and console messages are integrated. For example, when the Solaris software detects a fault, it displays the fault, logs it, and passes information to ALOM, where the fault is also logged. Depending on the fault, one or more LEDs might light.
The diagnostic flowchart in FIGURE 2-1 and TABLE 2-1 describe an approach for using the server diagnostics to identify a faulty field replaceable unit (FRU). The diagnostics you use, and the order in which you use them, depend on the nature of the problem you are troubleshooting, so you might perform some actions and not others.
The flowchart assumes that you have already performed some rudimentary troubleshooting, such as verifying proper installation, visually inspecting cables and power, and possibly resetting the server (refer to the Sun Fire T2000 Server Installation Guide and Sun Fire T2000 Server Administration Guide for details).
FIGURE 2-1 Diagnostic Flow Chart (starting point: faulty hardware suspected; the numbered steps in the flow chart correspond to the Action numbers in TABLE 2-1, which points to more information about each diagnostic in this chapter)
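The decision sequence that TABLE 2-1 describes can be sketched as a small transition table. The Python below is an illustrative model only: the data structure and function names are hypothetical, each step is reduced to a yes/no answer, and the "no" branch of Action 4 is mapped to Action 11 on the assumption that Action 11 covers performing other corrective actions and contacting Sun.

```python
# (yes_next, no_next) for each Action number, following TABLE 2-1.
FLOW = {
    1: (2, 6),    # showfaults reports faults?
    2: (3, 5),    # fault message includes a Sun Message ID?
    3: (4, 4),    # look up the message ID, then always analyze the article
    4: (9, 11),   # article recommends a FRU replacement? ("no" -> 11 is an assumption)
    5: (9, 6),    # fault LEDs indicate a faulty FRU?
    6: (9, 7),    # Solaris logs indicate a faulty FRU?
    7: (9, 8),    # POST reports a faulty device?
    8: (9, 11),   # SunVTS reports a faulty device?
    9: (10, 10),  # replace the FRU, then always verify the repair
}
TERMINAL = {10, 11}  # verify the repair, or corrective action / contact Sun

def walk(answers: dict) -> list:
    """Return the sequence of Action numbers visited.

    `answers` maps an Action number to True (yes) or False (no);
    missing entries are treated as "no".
    """
    path = [1]
    while path[-1] not in TERMINAL:
        yes_next, no_next = FLOW[path[-1]]
        path.append(yes_next if answers.get(path[-1], False) else no_next)
    return path

if __name__ == "__main__":
    # showfaults reports a fault with a message ID, and the article recommends a FRU swap.
    print(walk({1: True, 2: True, 4: True}))
    # Nothing reports a fault at any step.
    print(walk({}))
```

Keeping the transitions in one dictionary makes it easy to compare the model against the table row by row.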
TABLE 2-1 Diagnostic Flow Chart Actions

Action No.: 1
Diagnostic Action: Run the ALOM showfaults command.
Resulting Action: The showfaults command displays faults detected by the system firmware.
• If faults are displayed, go to Action 2.
• If no faults are displayed, go to Action 6.

Action No.: 2
Diagnostic Action: Check the fault message for a Sun Message ID.
Resulting Action: Sun Message IDs (SUNW-MSG-ID) indicate that information is available from Sun’s knowledge article database.
• If you have a message ID number, go to Action 3.
• If you do not have a message ID number, go to Action 5.

Action No.: 3
Diagnostic Action: Enter the Sun Message ID into the Sun Knowledge Article web site.
Resulting Action: Enter the Sun Message ID number into the knowledge article web site at http://www.sun.com/msg and go to Action 4.

Action No.: 4
Diagnostic Action: Analyze the suggested actions.
Resulting Action: In some cases, fault-related messages are identified with suggested actions.
• If the suggested action recommends replacing a FRU, go to Action 9.
• If the suggested action does not recommend replacing a FRU, perform the suggested action. Contact Sun for additional support, if needed.

For more information on Actions 1 through 4, see these sections:
• “To Run the showfaults Command” on page 26
• “Using the Solaris Predictive Self-Healing Feature” on page 40
• Sun Support information: http://www.sun.com/service/contacting

Action No.: 5
Diagnostic Action: Do any of the fault LEDs indicate a faulty FRU?
Resulting Action: The first LED to check is the Service Required LED. Additional LEDs on specific FRUs (fans, blower, power supplies, and hard disk drives) can pinpoint the faulty FRU.
• If an LED indicates a faulty FRU, go to Action 9.
• If FRU LEDs do not indicate a fault, go to Action 6.

Action No.: 6
Diagnostic Action: Check the Solaris log files for fault information.
Resulting Action: The Solaris message buffer and log files record system events and can provide information about faults.
• If system messages indicate a faulty device, replace the FRU (Action 9).
• To obtain more diagnostic information, go to Action 7.

For more information on Actions 5 and 6, see these sections:
• “Using LEDs to Identify the State of Devices” on page 16
• “Collecting Information From Solaris OS Files and Commands” on page 43

Action No.: 7
Diagnostic Action: Run POST.
Resulting Action: POST performs basic tests of the server components and reports faulty FRUs.
• If POST indicates a faulty FRU, replace the FRU (Action 9).
• If POST does not indicate a faulty FRU, go to Action 8.

Action No.: 8
Diagnostic Action: Run SunVTS.
Resulting Action: SunVTS provides tests used to exercise and diagnose FRUs. To run SunVTS, the server must be running the Solaris OS.
• If SunVTS reports a faulty device, replace the FRU (Action 9).
• If SunVTS does not report a faulty device, go to Action 11.

Action No.: 9
Diagnostic Action: Replace the faulty FRU.
Resulting Action: The fans, blower, power supplies, and hard drives are hot-swappable. The other FRUs require that you shut down the server to perform a cold-swap. After replacing the faulty FRU, go to Action 10.

For more information on Actions 7 through 9, see these sections:
• “Running POST” on page 31
• “Exercising the System With SunVTS” on page 48
• “Replacing Hot-Swappable and Hot-Pluggable FRUs” on page 55
• “Replacing Cold Swappable FRUs” on page 65

Action No.: 10
Diagnostic Action: Verify the repair.
Resulting Action: Various commands and utilities can be used to verify the functionality of the system components. Two useful commands are:
• The ALOM showfaults command
• The ASR showcomponents command
If the FRU is blacklisted, you can manually remove it from the blacklist with the enablecomponent command.
If the fault is cleared and the component is not blacklisted, the repair is verified well enough to boot the server. For added assurance, you can run the SunVTS diagnostic software.

Action No.: 11
Diagnostic Action: Contact Sun for support.
Resulting Action: The majority of hardware faults are detected by the server’s diagnostics. In rare cases, a problem might require additional troubleshooting. If you are unable to determine the cause of the problem, contact Sun for support.

For more information on Actions 10 and 11, see these sections:
• “To Run the showfaults Command” on page 26
• “Managing Components with Automatic System Recovery (ASR) Commands” on page 44
• “Exercising the System With SunVTS” on page 48
• Sun Support information: http://www.sun.com/service/contacting
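The branching in TABLE 2-1 can be sketched as a small lookup table. The following Python fragment is an illustration only and is not a Sun-supplied tool. The names FLOW and walk are hypothetical, and the branch from Action 4 to Action 11 (when no FRU replacement is recommended) is an assumption read from FIGURE 2-1.

```python
# Transition table transcribed from TABLE 2-1 (hypothetical helper, not Sun software).
# Keys are action numbers; values map a yes/no outcome to the next action.
FLOW = {
    1: {True: 2, False: 6},    # Does showfaults report any faults?
    2: {True: 3, False: 5},    # Is a message ID displayed?
    3: {True: 4, False: 4},    # Look up the message ID, then analyze the article
    4: {True: 9, False: 11},   # Does the article recommend a FRU replacement? (No -> 11 is assumed)
    5: {True: 9, False: 6},    # Do fault LEDs indicate a faulty FRU?
    6: {True: 9, False: 7},    # Do the Solaris logs indicate a faulty device?
    7: {True: 9, False: 8},    # Does POST report a faulty FRU?
    8: {True: 9, False: 11},   # Does SunVTS report a faulty device?
    9: {True: 10, False: 10},  # Replace the FRU, then verify the repair
}


def walk(answers):
    """Return the list of actions visited, starting at Action 1.

    answers maps an action number to True (yes) or False (no).
    Actions without an entry are treated as "no".
    Actions 10 and 11 end the walk.
    """
    path = [1]
    while path[-1] in FLOW:
        current = path[-1]
        path.append(FLOW[current][answers.get(current, False)])
    return path


if __name__ == "__main__":
    # showfaults reports a fault with a message ID, and the article recommends a FRU swap.
    print(walk({1: True, 2: True, 4: True}))
```

Keeping the flow as data rather than nested conditionals makes it easy to compare each entry against the corresponding row of the table.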
Using LEDs to Identify the State of Devices
The Sun Fire T2000 server provides the following groups of LEDs:
• Front and Rear Panel LEDs (TABLE 2-2)
• Power Supply LEDs (TABLE 2-4)
• Fan LEDs (TABLE 2-5)
• Hard Drive LEDs (TABLE 2-3)
These LEDs provide a quick visual check of the state of the system.
Front and Rear Panel LEDs
The six front panel LEDs (FIGURE 2-2) are located in the upper left corner of the server chassis. Three of these LEDs are also provided on the rear panel (FIGURE 2-3).
FIGURE 2-2 Front Panel LEDs (callouts: Locator LED/button, Service Required LED, Power OK LED, Power On/Off button, Top Fan LED, Rear-FRU Fault LED, Over Temp LED)
FIGURE 2-3 Rear Panel LEDs (callouts: Locator LED/button, Service Required LED, Power OK LED)
TABLE 2-2 lists and describes the front and rear panel LEDs.
TABLE 2-2 Front and Rear Panel LEDs

LED: Locator LED and button*
Color: White
Description: Enables you to identify a particular server. The LED is activated using one of the following methods:
• Issuing the setlocator on or off command.
• Pressing the button to toggle the indicator on or off.
This LED provides the following indications:
• Off – Normal operating state.
• Fast blink – The server received a signal as a result of one of the preceding methods and is indicating “here I am.”

LED: Service Required LED*
Color: Amber
Description: If on, indicates that service is required. The ALOM showfaults command provides details about any faults that cause this indicator to be lit.

LED: Power OK LED*
Color: Green
Description: The LED provides the following indications:
• Off – The system is unavailable. Either it has no power or ALOM is not running.
• Steady on – Indicates that the system is powered on and is running in its normal operating state.
• Standby blink – Indicates that the service processor is running while the system is running at a minimum level in standby mode and ready to be returned to its normal operating state.
• Slow blink – Indicates that a normal transitory activity is taking place. It might mean that the system diagnostics are running, or that the system is booting.

LED: Power On/Off button
Description: Turns the host system on and off. This button is recessed to prevent accidental server power-off. Use the tip of a pen to operate this button.

LED: Top Fan LED
Color: Amber
Description: Provides the following operational fan indications:
• Off – Indicates a steady state; no service action is required.
• Steady on – Indicates a fan failure event has been acknowledged and a service action is required on at least one of the three fans. Use the fan LEDs to determine which fan requires service.

LED: Rear-FRU Fault LED
Color: Amber
Description: Provides the following indications:
• Off – Indicates a steady state; no service action is required.
• Steady on – Indicates a failure of a rear-access FRU (a power supply or the rear blower). Use the FRU LEDs to determine which FRU requires service.

LED: Over Temp LED
Color: Amber
Description: Provides the following operational temperature indications:
• Off – Indicates a steady state; no service action is required.
• Steady on – Indicates a temperature failure event has been acknowledged and a service action is required. View the ALOM reports for further information on this event.

* Provided on both the front and rear panels. All other LEDs are located only on the front panel.
Hard Drive LEDs
The hard drive LEDs (FIGURE 2-4 and TABLE 2-3) are located on the front of each hard drive that is installed in the Sun Fire T2000 server chassis.
FIGURE 2-4 Hard Drive LEDs (callouts: OK to Remove, Unused, Activity)
TABLE 2-3 Hard Drive LEDs
LED           Color  Description
OK to Remove  Blue   On – The drive is ready for hot-plug removal.
                     Off – Normal operation.
Unused        Amber
Activity      Green  On – Drive is receiving power. Solidly lit if drive is idle. Flashes while the drive processes a command.
                     Off – Power is off.
Power Supply LEDs
The power supply LEDs (FIGURE 2-5 and TABLE 2-4) are located on the back of each power supply.
FIGURE 2-5 Power Supply LEDs (callouts: Power OK, Failure, AC OK)
TABLE 2-4 Power Supply LEDs
LED Color Description
Power OK Green On – Normal operation. DC output voltage is within normal limits.
Off – Power is off.
Failure Amber On – Power supply has detected a failure.
Off – Normal operation.
AC OK Green On – Normal operation. Input power is within normal limits.
Off – No input voltage, or input voltage is below limits.
Fan LEDs
The fan LEDs are located on the top of each fan unit (TABLE 2-5). These LEDs are visible when you open the top fan door.
TABLE 2-5 Fan LEDs
LED Color Description
Fan LEDs Amber On – This fan is faulty.
Off – Normal operation.
Note: When a fan fault is detected, the front panel Top Fan LED is lit.
Blower Unit LED
The blower unit LED is located on the back of the blower unit and is visible from the rear of the server (TABLE 2-6).
TABLE 2-6 Blower Unit LED
LED Color Description
Blower Unit LED Amber On – The blower unit is faulty.
Off – Normal operation.
Note: When a blower fault is detected, the Rear-FRU Fault LED is lit.
Using ALOM For Diagnosis and Repair Verification
The Sun Advanced Lights Out Manager (ALOM) is a system controller in the Sun Fire T2000 server that enables you to remotely manage and administer your server.
ALOM enables you to remotely run diagnostics, such as power-on self test (POST), that would otherwise require physical proximity to the server’s serial port. You can also configure ALOM to send email alerts of hardware failures, hardware warnings, and other events related to the server or to ALOM.
The ALOM circuitry runs independently of the server, using the server’s standby power. Therefore, ALOM firmware and software continue to function when the server operating system goes offline or when the server is powered off.
Note – Refer to the Sun Fire T2000 Server Advanced Lights Out Manager (ALOM)
Guide for comprehensive ALOM information.
Faults detected by ALOM, POST, and the Solaris Predictive Self-Healing (PSH) technology are forwarded to ALOM for fault handling (FIGURE 2-6).
In the event of a system fault, ALOM ensures that the Service Required LED is lit, FRU ID PROMs are updated, the fault is logged, and alerts are displayed.
FIGURE 2-6 ALOM Fault Management (ALOM outputs shown: Service Required LED, FRU LEDs, FRUID PROMs, Logs, Alerts)
ALOM sends alerts to all ALOM users who are logged in, sends the alert through email to a configured email address, and writes the event to the ALOM event log.
ALOM can detect when a fault is no longer present and clears the fault in several ways:
• Fault recovery – The system automatically detects that the fault condition is no longer present. ALOM extinguishes the Service Required LED and updates the FRU’s PROM, indicating that the fault is no longer present.
• Fault repair – The fault has been repaired by human intervention. In most cases, ALOM detects the repair and extinguishes the Service Required LED. In the event that ALOM does not perform these actions, you must perform these tasks manually with the clearfault or enablecomponent commands.
ALOM can detect the removal of a FRU, in many cases even if the FRU is removed while ALOM is powered off. This enables ALOM to know that a fault, diagnosed to a specific FRU, has been repaired. The ALOM clearfault command enables you to manually clear certain types of faults without a FRU replacement or if ALOM was unable to automatically detect the FRU replacement. ALOM does not automatically detect hard drive replacement.
Many environmental faults can automatically recover. For example, a temperature that exceeds a threshold might return to normal limits, or an unplugged power supply can be plugged back in. Recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:
fru at location is OK.
sensor at location is within normal range.
Environmental faults can be repaired through hot removal of the faulty FRU. FRU removal is automatically detected by the environmental monitoring and all faults associated with the removed FRU are cleared. The message for that case, and the alert sent for all FRU removals is:
fru at location has been removed.
There is no ALOM command to manually repair an environmental fault.
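If you collect ALOM alerts in a log, the three message forms described above can be told apart with simple pattern matching. The following Python sketch is an assumption-laden illustration: it treats the forms exactly as printed in this section (item, the word "at", a location, then the fixed ending), and the sample messages in the usage lines are hypothetical. Real alerts might carry prefixes or other text that this sketch does not handle.

```python
import re

# Patterns follow the three message forms printed in this section.
# The kind labels ("recovered", "in_range", "removed") are hypothetical names.
PATTERNS = [
    ("recovered", re.compile(r"^(?P<item>.+?) at (?P<location>.+?) is OK\.$")),
    ("in_range", re.compile(r"^(?P<item>.+?) at (?P<location>.+?) is within normal range\.$")),
    ("removed", re.compile(r"^(?P<item>.+?) at (?P<location>.+?) has been removed\.$")),
]


def classify(message):
    """Return (kind, item, location) for a recognized message, else None."""
    text = message.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group("item"), match.group("location")
    return None


if __name__ == "__main__":
    # Hypothetical sample messages, shaped like the forms above.
    for sample in ("fan at FT0/FM1 has been removed.",
                   "sensor at MB/T_AMB is within normal range."):
        print(classify(sample))
```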
ALOM does not handle hard drive faults. Use the Solaris message files to view hard drive faults. See “Collecting Information From Solaris OS Files and Commands” on page 43.
Running ALOM Service-Related Commands
This section describes the ALOM commands that are commonly used for service-related activities.
Connecting to ALOM
Before you can run ALOM commands, you must connect to the ALOM. There are several ways to connect to the system controller:
• Connect an ASCII terminal directly to the serial management port.
• Use the telnet command to connect to ALOM through an Ethernet connection on the network management port.
• Connect an external modem to the network management port and dial in to the modem.
Note – Refer to the Sun Fire T2000 Server Advanced Lights Out Manager (ALOM)
Guide for instructions on configuring and connecting to ALOM.
Switching Between the System Console and ALOM
• To switch from the console output to the ALOM sc> prompt, type #. (Pound Period).
• To switch from the sc> prompt to the console, type console.
Service-Related ALOM Commands
TABLE 2-7 describes the typical ALOM commands for servicing a Sun Fire T2000 server. For descriptions of all ALOM commands, issue the help command or refer to the Sun Fire T2000 Server Advanced Lights Out Manager (ALOM) Guide.
TABLE 2-7 Service-Related ALOM Commands
ALOM Command
    Description
help [command]
    Displays a list of all ALOM commands with syntax and descriptions. Specifying a command name as an option displays help for that command.
clearfault UUID
    Manually clears system faults. UUID is the unique fault ID of the fault to be cleared.
powercycle [-f]
    Performs a poweroff followed by poweron. The -f option forces an immediate poweroff; otherwise the command attempts a graceful shutdown.
poweroff [-y] [-f]
    Removes the main power from the host server. The -y option enables you to skip the confirmation question. The -f option forces an immediate shutdown.
poweron
    Applies the main power to the host server.
removefru
    Indicates if it is OK to perform a hot-swap of a power supply.
reset [-y]
    Generates a hardware reset on the host server. The -y option enables you to skip the confirmation question.
resetsc [-y]
    Reboots the system controller. The -y option enables you to skip the confirmation question.
setkeyswitch [normal | stby | diag | locked]
    Sets the virtual keyswitch.
setlocator [on | off]
    Turns the Locator LED on the server on or off.
showenvironment
    Displays the environmental status of the host server. This information includes system temperatures, power supply, front panel LED, hard drive, fan, voltage, and current sensor status. See “To Run the showenvironment Command” on page 27.
showfaults [-v]
    Displays current system faults. See “To Run the showfaults Command” on page 26.
showfru [-g lines] [-s | -d] [FRU]
    Displays information about the FRUs in the server.
    • The -g lines option specifies the number of lines to display before pausing the output to the screen.
    • The -s option displays static information about system FRUs (defaults to all FRUs, unless one is specified).
    • The -d option displays dynamic information about system FRUs (defaults to all FRUs, unless one is specified).
    See “To Run the showfru Command” on page 29.
showkeyswitch
    Displays the status of the virtual keyswitch.
showlocator
    Displays the current state of the Locator LED as either on or off.
showlogs [-b lines | -e lines] [-g lines] [-v]
    Displays the history of all events logged in the ALOM event buffer.
showplatform [-v]
    Displays information about the host system’s hardware configuration, and whether the hardware is providing service.
Note – See TABLE 2-10 for the ALOM ASR commands.
To Run the showfaults Command
The showfaults command displays faults handled by ALOM. Use the showfaults command for the following reasons:
• To see if any faults have been passed to, or detected by, ALOM.
• To obtain the fault message ID (SUNW-MSG-ID).
• To verify that the replacement of a FRU has cleared the fault and not generated any additional faults.
1. At the sc> prompt, type the showfaults command.
In the following example, faults are displayed for the front I/O board (FIOBD) and the motherboard (MB).
sc> showfaults
   ID FRU    Fault
    0 FIOBD  Host detected fault, MSGID: SUNW-TEST07
    1 MB     Host detected fault, MSGID: SUNW-TEST07
2. Use the Sun message ID to obtain more information about the fault.
In a browser, go to the Predictive Self-Healing Knowledge Article web site: http://www.sun.com/msg and enter the Sun message ID in the lookup field.
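When showfaults output has been captured to a text file, the message IDs can be pulled out programmatically before you look them up. The following Python sketch is illustrative only. It assumes the three-column layout shown in the example above (ID, FRU, Fault) and a fault text containing "MSGID:"; the function name parse_showfaults is hypothetical, and other firmware revisions might format the output differently.

```python
import re

# Sample text copied from the showfaults example in this section.
SAMPLE = """\
ID FRU Fault
0 FIOBD Host detected fault, MSGID: SUNW-TEST07
1 MB Host detected fault, MSGID: SUNW-TEST07
"""

# A data row starts with a numeric fault ID, then the FRU name, then the fault text.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.*)$")
MSGID = re.compile(r"MSGID:\s*(\S+)")


def parse_showfaults(text):
    """Return one dict per fault row; header and blank lines are skipped."""
    faults = []
    for line in text.splitlines():
        row = ROW.match(line)
        if not row:
            continue
        fault_id, fru, description = row.groups()
        found = MSGID.search(description)
        faults.append({
            "id": int(fault_id),
            "fru": fru,
            "description": description,
            "msgid": found.group(1) if found else None,
        })
    return faults


if __name__ == "__main__":
    for fault in parse_showfaults(SAMPLE):
        print(fault["fru"], fault["msgid"])
```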
To Run the showenvironment Command
The showenvironment command displays a snapshot of the server’s environmental status. This command displays system temperatures, hard disk drive status, power supply and fan status, front panel LED status, and voltage and current sensor status. The output uses a format similar to the Solaris OS command prtdiag(1M).
At the sc> prompt, type the showenvironment command.
The output differs according to your system’s model and configuration.
Example:
sc> showenvironment

=============== Environmental Status ===============

--------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
--------------------------------------------------------------------------------
Sensor           Status  Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
--------------------------------------------------------------------------------
PDB/T_AMB        OK        23     -10      -5       0       45       50       55
MB/T_AMB         OK        26     -10      -5       0       50       55       60
MB/CMP0/T_TCORE  OK        44     -10      -5       0       85       95      100
MB/CMP0/T_BCORE  OK        45     -10      -5       0       85       95      100
IOBD/IOB/TCORE   OK        41     -10      -5       0       95      100      105
IOBD/T_AMB       OK        30     -10      -5       0       45       50       55

--------------------------------------------------------
System Indicator Status:
--------------------------------------------------------
SYS/LOCATE         SYS/SERVICE        SYS/ACT
OFF                ON                 ON
--------------------------------------------------------
SYS/REAR_FAULT     SYS/TEMP_FAULT     SYS/TOP_FAN_FAULT
OFF                OFF                OFF
--------------------------------------------------------

--------------------------------------------
System Disks:
--------------------------------------------
Disk   Status   Service   OK2RM
--------------------------------------------
HDD0   OK       OFF       OFF
HDD1   OK       OFF       OFF
HDD2   OK       OFF       OFF
HDD3   OK       OFF       OFF

---------------------------------------------------
Fans Status:
---------------------------------------------------
Fans (Speeds Revolution Per Minute):
Sensor     Status   Speed   Warn   Low
----------------------------------------------------------
FT0/FM0    OK       3618    --     1920
FT0/FM1    OK       3437    --     1920
FT0/FM2    OK       3556    --     1920
FT2        OK       2578    --     1900
----------------------------------------------------------

--------------------------------------------------------------------------------
Voltage sensors (in Volts):
--------------------------------------------------------------------------------
Sensor           Status  Voltage LowSoft LowWarn HighWarn HighSoft
--------------------------------------------------------------------------------
MB/V_+1V5        OK         1.48    1.36    1.39     1.60     1.63
MB/V_VMEML       OK         1.78    1.69    1.72     1.87     1.90
MB/V_VMEMR       OK         1.78    1.69    1.72     1.87     1.90
MB/V_VTTL        OK         0.87    0.84    0.86     0.93     0.95
MB/V_VTTR        OK         0.87    0.84    0.86     0.93     0.95
MB/V_+3V3STBY    OK         3.33    3.13    3.16     3.53     3.59
MB/V_VCORE       OK         1.30    1.20    1.24     1.36     1.39
IOBD/V_+1V5      OK         1.48    1.27    1.35     1.65     1.72
IOBD/V_+1V8      OK         1.78    1.53    1.62     1.98     2.07
IOBD/V_+3V3MAIN  OK         3.38    2.80    2.97     3.63     3.79
IOBD/V_+3V3STBY  OK         3.33    2.80    2.97     3.63     3.79
IOBD/V_+1V       OK         1.11    0.93    0.99     1.21     1.26
IOBD/V_+1V2      OK         1.17    1.02    1.08     1.32     1.38
IOBD/V_+5V       OK         5.09    4.25    4.50     5.50     5.75
IOBD/V_-12V      OK       -12.11  -13.80  -13.20   -10.80   -10.20
IOBD/V_+12V      OK        12.18   10.20   10.80    13.20    13.80
SC/BAT/V_BAT     OK         3.03      --    2.69       --       --

-----------------------------------------------------------
System Load (in amps):
-----------------------------------------------------------
Sensor       Status   Load     Warn     Shutdown
-----------------------------------------------------------
MB/I_VCORE   OK       25.280   80.000   88.000
MB/I_VMEML   OK       4.680    60.000   66.000
MB/I_VMEMR   OK       4.680    60.000   66.000
-----------------------------------------------------------

----------------------
Current sensors:
----------------------
Sensor        Status
----------------------
IOBD/I_USB0   OK
IOBD/I_USB1   OK
FIOBD/I_USB   OK

------------------------------------------------------------------------------
Power Supplies:
------------------------------------------------------------------------------
Supply  Status  Underspeed  Overtemp  Overvolt  Undervolt  Overcurrent
------------------------------------------------------------------------------
PS0     OK      OFF         OFF       OFF       OFF        OFF
PS1     OK      OFF         OFF       OFF       OFF        OFF

sc>
Note – Some environmental information might not be available when the server is
in standby mode.
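The temperature section of the showenvironment output lends itself to a quick margin check: how far each sensor reading is below its HighWarn threshold. The following Python sketch is illustrative only. It assumes the nine-column temperature layout shown in the example above, the rows are copied from that example, and the names parse_temperature_rows and headroom are hypothetical.

```python
# Temperature rows copied from the showenvironment example in this section.
SAMPLE = """\
Sensor Status Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
------------------------------------------------------------
PDB/T_AMB OK 23 -10 -5 0 45 50 55
MB/T_AMB OK 26 -10 -5 0 50 55 60
MB/CMP0/T_TCORE OK 44 -10 -5 0 85 95 100
MB/CMP0/T_BCORE OK 45 -10 -5 0 85 95 100
IOBD/IOB/TCORE OK 41 -10 -5 0 95 100 105
IOBD/T_AMB OK 30 -10 -5 0 45 50 55
"""

FIELDS = ("sensor", "status", "temp", "low_hard", "low_soft",
          "low_warn", "high_warn", "high_soft", "high_hard")


def parse_temperature_rows(lines):
    """Return one dict per sensor row; header and separator lines are skipped."""
    readings = []
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue
        try:
            numbers = [int(value) for value in parts[2:]]
        except ValueError:
            continue  # the column header has nine fields but no numbers
        reading = dict(zip(FIELDS[2:], numbers))
        reading["sensor"] = parts[0]
        reading["status"] = parts[1]
        readings.append(reading)
    return readings


def headroom(reading):
    """Degrees Celsius between the current reading and the HighWarn threshold."""
    return reading["high_warn"] - reading["temp"]


if __name__ == "__main__":
    for reading in sorted(parse_temperature_rows(SAMPLE.splitlines()), key=headroom):
        print(reading["sensor"], headroom(reading))
```

Sorting by headroom puts the sensor closest to its warning threshold first, which is usually the one worth checking when the Over Temp LED is a concern.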
To Run the showfru Command
The showfru command displays information about the FRUs in the server. Use this command to see information about an individual FRU, or for all the FRUs.
Note – By default, the output of the showfru command for all FRUs is very long.
At the sc> prompt, enter the showfru command.
In the following example, the showfru command is used to get information about the motherboard (MB).
sc> showfru MB.SEEPROM
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32: WED OCT 12 18:24:28 2005
/ManR/Description: ASSY,Sun-Fire-T2000,CPU Board
/ManR/Manufacture Location: Sriracha,Chonburi,Thailand
/ManR/Sun Part No: 5016843
/ManR/Sun Serial No: NC00OD
/ManR/Vendor: Celestica
/ManR/Initial HW Dash Level: 06
/ManR/Initial HW Rev Level: 02
/ManR/Shortname: T2000_MB
/SpecPartNo: 885-0483-04
SEGMENT: FL
/Configured_LevelR
/Configured_LevelR/UNIX_Timestamp32: WED OCT 12 18:24:28 2005
/Configured_LevelR/Sun_Part_No: 5410827
/Configured_LevelR/Configured_Serial_No: N4001A
/Configured_LevelR/HW_Dash_Level: 03
.
.
.
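Because the showfru output is long, it can help to load a captured copy into a structure keyed by segment and record path. The following Python sketch is illustrative only. It assumes the layout shown in the example above ("SEGMENT:" lines followed by "/path: value" records); the sample is an excerpt of that example, and the name parse_showfru is hypothetical.

```python
# Excerpt of the showfru example in this section.
SAMPLE = """\
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32: WED OCT 12 18:24:28 2005
/ManR/Sun Part No: 5016843
/ManR/Sun Serial No: NC00OD
/ManR/Vendor: Celestica
SEGMENT: FL
/Configured_LevelR
/Configured_LevelR/Sun_Part_No: 5410827
/Configured_LevelR/HW_Dash_Level: 03
"""


def parse_showfru(text):
    """Return {segment_name: {record_path: value}} for a captured showfru listing."""
    segments = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("SEGMENT:"):
            current = line.split(":", 1)[1].strip()
            segments[current] = {}
        elif current is not None and line.startswith("/") and ": " in line:
            # Split on the first ": " only, so timestamps keep their colons.
            path, value = line.split(": ", 1)
            segments[current][path] = value
    return segments


if __name__ == "__main__":
    fru = parse_showfru(SAMPLE)
    print(fru["SD"]["/ManR/Sun Part No"], fru["SD"]["/ManR/Sun Serial No"])
```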
Running POST
Power-on self test (POST) is a group of PROM-based tests that run when the server is powered on or reset. POST checks the basic integrity of the critical hardware components in the server (CPU, memory, and I/O buses).
If POST detects a faulty component, it is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system will boot when POST is complete. For example, if one of the processor cores is deemed faulty by POST, the core will be disabled, and the system will boot and run using the remaining cores.
POST faults are automatically repaired if the fault is not detected on subsequent POST runs. Any devices that pass POST are enabled, even if they were previously disabled. Devices can be manually enabled or disabled using ASR commands (see “Managing Components with Automatic System Recovery (ASR) Commands” on page 44).
Controlling How POST Runs
The server can be configured for normal, extensive, or no POST execution. You can also control the level of tests that run, the amount of POST output that is displayed, and which reset events trigger POST by using ALOM variables.
TABLE 2-8 lists the ALOM variables used to configure POST, and FIGURE 2-7 shows how the variables work together.
TABLE 2-8 ALOM Parameters Used For POST Configuration

Parameter: setkeyswitch*
Values and descriptions:
• normal – The system can power on and run POST (based on the other parameter settings). For details see FIGURE 2-7. This parameter overrides all other commands.
• diag – The system runs POST based on predetermined settings.
• stby – The system cannot power on.
• locked – The system can power on and run POST, but no flash updates can be made.

Parameter: diag_mode
Values and descriptions:
• off – POST does not run.
• normal – Runs POST according to the diag_level value.
• service – Runs POST with preset values for diag_level and diag_verbosity.

Parameter: diag_level
Values and descriptions:
• min – If diag_mode = normal, runs the minimum set of tests.
• max – If diag_mode = normal, runs all the minimum tests plus extensive CPU and memory tests.

Parameter: diag_trigger
Values and descriptions:
• none – Does not run POST on reset.
• user_reset – Runs POST upon user-initiated resets.
• power_on_reset – Runs POST only for the first power on. This is the default.
• error_reset – Runs POST if fatal errors are detected.
• all_reset – Runs POST after any reset.

Parameter: diag_verbosity
Values and descriptions:
• none – No POST output is displayed.
• min – POST output displays functional tests with a banner and pinwheel.
• normal – POST output displays all test and informational messages.
• max – POST displays all test, informational, and some debugging messages.

* All of these parameters are set using the ALOM setsc command, except for setkeyswitch, which is itself a command.
FIGURE 2-7 Flowchart of ALOM Variables for POST Configuration
TABLE 2-9 shows typical combinations of ALOM variables and associated POST modes.
TABLE 2-9 ALOM Parameters and POST Modes
Parameter       Normal Diagnostic Mode      No POST     Diagnostic     Keyswitch Diagnostic
                (default settings)          Execution   Service Mode   Preset Values
diag_mode       normal                      off         service        normal
setkeyswitch*   normal                      normal      normal         diag
diag_level      min                         n/a         max            max
diag_trigger    power-on-reset error-reset  none        all-resets     all-resets
diag_verbosity  normal                      n/a         max            max
Description of POST execution:
• Normal Diagnostic Mode (default settings) – This is the default POST configuration and provides a reasonable compromise between testing thoroughness and quick server initialization.
• No POST Execution – POST does not run, resulting in quick system initialization, but this is not a suggested configuration.
• Diagnostic Service Mode – POST runs the full spectrum of tests with the maximum output displayed.
• Keyswitch Diagnostic Preset Values – POST runs the full spectrum of tests with the maximum output displayed.
* The setkeyswitch parameter, when set to diag, overrides all the other ALOM POST variables.
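The way the keyswitch and diag_mode settings take precedence over diag_level and diag_verbosity can be sketched as a short function. The following Python fragment is an illustration under stated assumptions, not a description of the ALOM firmware: it assumes that both setkeyswitch diag and diag_mode service resolve to the max level and max verbosity shown in TABLE 2-9, it ignores diag_trigger (which reset events start POST), and the name effective_post_settings is hypothetical.

```python
def effective_post_settings(keyswitch, diag_mode, diag_level, diag_verbosity):
    """Return (level, verbosity) that POST would use, or None if POST does not run.

    Precedence follows TABLE 2-8 and TABLE 2-9 as read in this sketch:
    keyswitch first, then diag_mode, then the individual level and verbosity.
    """
    if keyswitch == "stby":
        return None  # the system cannot power on, so POST never starts
    if keyswitch == "diag":
        return ("max", "max")  # assumed preset: overrides all other POST variables
    if diag_mode == "off":
        return None  # POST does not run
    if diag_mode == "service":
        return ("max", "max")  # assumed preset values for level and verbosity
    # diag_mode normal: keyswitch normal or locked leaves the variables in control
    return (diag_level, diag_verbosity)


if __name__ == "__main__":
    # Default settings from TABLE 2-9.
    print(effective_post_settings("normal", "normal", "min", "normal"))
    # Keyswitch set to diag, regardless of the other variables.
    print(effective_post_settings("diag", "off", "min", "none"))
```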
To Change POST Parameters
1. Access the ALOM sc> prompt:
At the console, issue the #. key sequence:
#.
2. At the ALOM sc> prompt, use the setsc command to set the POST parameter:
Example:
sc> setsc diag_mode service
The setkeyswitch parameter is a command that sets the virtual keyswitch, so it does not use the setsc command. Example:
sc> setkeyswitch diag
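If you script ALOM sessions, it can be useful to check parameter values before sending them. The following Python sketch is illustrative only. It builds command strings of the same form as the examples above, using values transcribed from TABLE 2-8; the diag_trigger parameter is left out of the sketch, the name post_commands is hypothetical, and nothing here communicates with ALOM.

```python
# Allowed values transcribed from TABLE 2-8 (diag_trigger intentionally omitted).
SETSC_VALUES = {
    "diag_mode": {"off", "normal", "service"},
    "diag_level": {"min", "max"},
    "diag_verbosity": {"none", "min", "normal", "max"},
}
KEYSWITCH_VALUES = {"normal", "stby", "diag", "locked"}


def post_commands(keyswitch=None, **params):
    """Return the ALOM command strings for the requested POST settings.

    Parameters set with setsc are passed as keyword arguments; the virtual
    keyswitch is set with its own command, so it is handled separately.
    """
    commands = []
    for name, value in params.items():
        if name not in SETSC_VALUES:
            raise ValueError(f"unknown POST parameter: {name}")
        if value not in SETSC_VALUES[name]:
            raise ValueError(f"invalid value for {name}: {value}")
        commands.append(f"setsc {name} {value}")
    if keyswitch is not None:
        if keyswitch not in KEYSWITCH_VALUES:
            raise ValueError(f"invalid keyswitch position: {keyswitch}")
        commands.append(f"setkeyswitch {keyswitch}")
    return commands


if __name__ == "__main__":
    for command in post_commands(diag_mode="service"):
        print("sc>", command)
```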
Reasons to Run POST
You can use POST for basic sanity checking of the server hardware and for troubleshooting as described in the following sections.
Routine Sanity Check of the Hardware
POST tests critical hardware components to verify functionality before the system boots and accesses software. If POST detects an error, the faulty component is disabled automatically, preventing faulty hardware from potentially harming software.
Under normal operating conditions, the server is usually configured to run POST in minimum mode for all power-on or error-generated resets. This enables the system to initialize quickly, and still have hardware checkups to ensure a healthy system.
Diagnosing the System Hardware
You can use POST as an initial diagnostic tool for the system hardware. In this case, configure POST to run in diagnostic service mode for maximum test coverage and verbose output.
To Run POST
This procedure describes how to run POST when you want maximum testing, as in the case when you are troubleshooting a system.
1. Switch from the system console prompt to the SC console prompt by issuing the #. escape sequence.
ok #.
sc>
2. Set the virtual keyswitch to diag so that POST will run in service mode.
sc> setkeyswitch diag
3. Reset the system so that POST runs.
There are several ways to initiate a reset. The following example uses the
powercycle command. For other methods, refer to the Sun Fire T2000 Server Administration Guide.
sc> powercycle
Are you sure you want to powercycle the system [y/n]? y
Powering host off at MON JAN 10 02:52:02 2000
Waiting for host to Power Off; hit any key to abort.
SC Alert: SC Request to Power Off Host.
SC Alert: Host system has shut down.
Powering host on at MON JAN 10 02:52:13 2000
SC Alert: SC Request to Power On Host.
4. Switch to the system console to view the POST output:
sc> console
Example of POST output (some output omitted):
SC Alert: Host System has Reset
0:0>
0:0>Copyright © 2005 Sun Microsystems, Inc. All rights reserved
SUN PROPRIETARY/CONFIDENTIAL.
Use is subject to license terms.
0:0>VBSC selecting POST MAX Testing.
0:0>VBSC enabling L2 Cache.
0:0>VBSC enabling Full Memory Scrub.
0:0>VBSC enabling threads: fffff00f
0:0>Init CPU
0:0>Start Selftest.....
0:0>CPU =: 0
0:0>DMMU Registers Access
0:0>IMMU Registers Access
0:0>Init mmu regs
0:0>D-Cache RAM
0:0>Init MMU.....
0:0>DMMU TLB DATA RAM Access
0:0>DMMU TLB TAGS Access
0:0>DMMU CAM
0:0>IMMU TLB DATA RAM Access
0:0>IMMU TLB TAGS Access
0:0>IMMU CAM
0:0>Setup and Enable DMMU
0:0>Setup DMMU Miss Handler
0:0>Niagara, Version 2.0
0:0>Serial Number 00000098.00000820 = fffff231.17422755
0:0>Init JBUS Config Regs
0:0>IO-Bridge unit 1 init test
0:0>sys 150 MHz, CPU 600 MHz, mem 150 MHz.
0:0>Integrated POST Testing
0:0>Setup L2 Cache
0:0>L2 Cache Control = 00000000.00300000
0:0>Scrub and Setup L2 Cache
0:0>L2 Directory clear
0:0>L2 Scrub VD & UA
0:0>L2 Scrub Tags
0:0>Test Memory.....
0:0>Scrub 00000000.00600000->00000001.00000000 on Memory Channel [0 1 2 3 ] Rank 0 Stack 0
0:0>Scrub 00000001.00000000->00000002.00000000 on Memory Channel [0 1 2 3 ] Rank 1 Stack 0
3:0>IMMU Functional
7:0>IMMU Functional
7:0>DMMU Functional
0:0>IMMU Functional
0:0>DMMU Functional
0:0>Print Mem Config
0:0>Caches : Icache is ON, Dcache is ON.
0:0>Bank 0 4096MB : 00000000.00000000 -> 00000001.00000000.
0:0>Bank 2 4096MB : 00000001.00000000 -> 00000002.00000000.
0:0>Block Mem Test
0:0>Test 4288675840 bytes at 00000000.00600000 Memory Channel [ 0 1 2 3 ] Rank 0 Stack 0
0:0>........
0:0>Test 4294967296 bytes at 00000001.00000000 Memory Channel [ 0 1 2 3 ] Rank 1 Stack 0
0:0>........
0:0>IO-Bridge Tests.....
0:0>IO-Bridge Quick Read
0:0>
0:0>--------------------------------------------------------------
0:0>--------- IO-Bridge Quick Read Only of CSR and ID ---------------
0:0>--------------------------------------------------------------
0:0>fire 1 JBUSID 00000080.0f000000 =
0:0>IO-Bridge unit 1 Config MB bridges
0:0>Config port A, bus 2 dev 0 func 0, tag IOBD/PCI-SWITCH0
0:0>Config port A, bus 3 dev 1 func 0, tag IOBD/GBE0
0:0>INFO:Master Abort for probe, device IOBD/PCIE1 looks like it is not present!
0:0>INFO:Master Abort for probe, device IOBD/PCIE2 looks like it is not present!
0:0>INFO:
0:0>POST Passed all devices.
0:0>
0:0>DEMON: (Diagnostics Engineering MONitor)
0:0>Select one of the following functions
0:0>POST:Return to OBP.
0:0>INFO:
0:0>POST Passed all devices.
0:0>Master set ACK for vbsc runpost command and spin...
5. Perform further investigation if needed.
When POST is finished running, and if no faults were detected, the system will boot.
If POST detects a faulty device, the fault is displayed and the fault information is passed to ALOM for fault handling.
a. Interpret the POST messages:
POST error messages use the following syntax:
c:s > ERROR: TEST = failing_test
c:s > H/W under test = FRU
c:s > Repair Instructions: Replace items in order listed by H/W under test above
c:s > MSG = test_error_message
c:s > END_ERROR
In this syntax, c = the core number, s = the strand number.
Warning and informational messages use the following syntax:
INFO or WARNING: message
The following example shows a POST error message.
7:2>
7:2>ERROR: TEST = Data Bitwalk
7:2>H/W under test = MB/CMP0/CH2/R0/D0/S0 (MB/CMP0/CH2/R0/D0)
7:2>Repair Instructions: Replace items in order listed by 'H/W under test' above.
7:2>MSG = Pin 149 failed on MB/CMP0/CH2/R0/D0 (J1601)
7:2>END_ERROR
7:2>Decode of Dram Error Log Reg Channel 2 bits 60000000.0000108c
7:2> 1 MEC 62 R/W1C Multiple corrected errors, one or more CE not logged
7:2> 1 DAC 61 R/W1C Set to 1 if the error was a DRAM access CE
7:2> 108c SYND 15:0 RW ECC syndrome.
7:2>
7:2> Dram Error AFAR channel 2 = 00000000.00000000
7:2> L2 AFAR channel 2 = 00000000.00000000
In this example, POST is reporting a memory error at DIMM location MB/CMP0/CH2/R0/D0. It was detected by POST running on core 7, strand 2.
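When a POST run is captured to a log file, the error blocks can be extracted by following the syntax described above. The following Python sketch is illustrative only. It assumes each block opens with an "ERROR: TEST =" line and closes with "END_ERROR", with every line prefixed by core:strand and ">". The sample lines are copied from the example above, and the name parse_post_errors is hypothetical.

```python
import re

# Lines copied from the POST error example in this section.
SAMPLE = """\
7:2>
7:2>ERROR: TEST = Data Bitwalk
7:2>H/W under test = MB/CMP0/CH2/R0/D0/S0 (MB/CMP0/CH2/R0/D0)
7:2>Repair Instructions: Replace items in order listed by 'H/W under test' above.
7:2>MSG = Pin 149 failed on MB/CMP0/CH2/R0/D0 (J1601)
7:2>END_ERROR
"""

# core:strand prefix, with optional space before and after ">".
PREFIX = re.compile(r"^(\d+):(\d+)\s*>\s*(.*)$")


def parse_post_errors(lines):
    """Return one dict per ERROR ... END_ERROR block found in the POST output."""
    errors = []
    current = None
    for line in lines:
        match = PREFIX.match(line)
        if not match:
            continue
        core, strand, body = int(match.group(1)), int(match.group(2)), match.group(3)
        if body.startswith("ERROR: TEST ="):
            current = {"core": core, "strand": strand,
                       "test": body.split("=", 1)[1].strip()}
        elif current is not None and body.startswith("H/W under test ="):
            current["hw"] = body.split("=", 1)[1].strip()
        elif current is not None and body.startswith("MSG ="):
            current["msg"] = body.split("=", 1)[1].strip()
        elif current is not None and body == "END_ERROR":
            errors.append(current)
            current = None
    return errors


if __name__ == "__main__":
    for error in parse_post_errors(SAMPLE.splitlines()):
        print(error["core"], error["strand"], error["test"], error["hw"])
```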
b. Run the showfaults command to obtain additional fault information.
The fault is captured by ALOM, where the fault is logged, the Service Required LED is lit, and the faulty component is disabled.
Example:
ok #.
sc> showfaults -v
ID Time            FRU               Fault
1  APR 24 12:47:27 MB/CMP0/CH2/R0/D0 MB/CMP0/CH2/R0/D0 deemed faulty and disabled
In this example, MB/CMP0/CH2/R0/D0 (DIMM 13) is disabled. Until the faulty component is replaced, the system can boot using the memory that was not disabled.
Note – You can use ASR commands to display and control disabled components.
See “Managing Components with Automatic System Recovery (ASR) Commands”
on page 44.
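If showfaults -v output is captured from the ALOM console, the fault rows can be turned into records for reporting. This is a rough sketch that assumes the column layout of the single example above (ID, a three-part timestamp, FRU, then free-text fault description). The function name and field names are hypothetical.

```python
import re

# ID, timestamp such as "APR 24 12:47:27", FRU path, then the fault text.
ROW = re.compile(
    r"^\s*(\d+)\s+([A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.+)$"
)

def parse_showfaults(text):
    """Return a list of fault records from captured showfaults -v output."""
    faults = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            faults.append({
                "id": int(m.group(1)),
                "time": m.group(2),
                "fru": m.group(3),
                "fault": m.group(4).strip(),
            })
    return faults

sample = """\
sc> showfaults -v
ID Time            FRU               Fault
1  APR 24 12:47:27 MB/CMP0/CH2/R0/D0 MB/CMP0/CH2/R0/D0 deemed faulty and disabled
"""
faults = parse_showfaults(sample)
print(faults)
```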
Using the Solaris Predictive Self-Healing Feature
The Solaris predictive self-healing (PSH) technology enables the Sun Fire T2000 server to diagnose problems while the Solaris OS is running, and to mitigate many problems before they occur.
The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating it with data from previous errors and other related information to diagnose the problem. Once the problem is diagnosed, the fault manager daemon assigns it a universal unique identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database.
The predictive self-healing technology covers the following Sun Fire T2000 server components:
UltraSPARC T1 multicore processor
Memory
I/O bus
The PSH console message provides the following information:
Type
Severity
Description
Automated Response
Impact
Suggested Action for System Administrator
Details
If the Solaris PSH facility has detected a faulty component, use the fmdump command to identify the fault.
Note – Additional predictive self-healing information is available at:
http://www.sun.com/msg
To Use the fmdump Command to Identify Faults
The fmdump command displays the list of faults detected by the Solaris PSH facility. Use this command for the following reasons:
To see if any faults have been detected by the Solaris PSH facility.
To obtain the fault message ID (SUNW-MSG-ID) for detected faults.
To verify that the replacement of a FRU has cleared the fault and not generated
any additional faults.
If you already have a fault message ID, go to Step 2 to obtain more information about the fault from Sun’s Predictive Self-Healing Knowledge Article web site.
1. Check the event log using the fmdump command with -v for verbose output:
# fmdump -v
TIME                 UUID                                 SUNW-MSG-ID
Apr 24 06:54:08.2005 lce22523-lc80-6062-e61d-f3b39290ae2c SUN4U-8000-6H
100% fault.cpu.ultraSPARCT1l2cachedata
FRU: hc:///component=MB
rsrc: cpu:///cpuid=0/serial=22D1D6604A
In this example, a fault is displayed, indicating the following details:
Date and time of the fault (Apr 24 06:54:08.2005)
Universal Unique Identifier (UUID) that is unique for every fault (lce22523-lc80-6062-e61d-f3b39290ae2c)
Sun message identifier (SUN4U-8000-6H) that can be used to obtain additional fault information
Faulted FRU (FRU:hc:///component=MB), which in this example is identified as MB, indicating that the motherboard requires replacement.
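The fields listed above can also be pulled out of saved fmdump -v output programmatically. The sketch below is illustrative only: it assumes the line layout of the example in this procedure (an event line, a certainty line, then FRU and rsrc lines), and the function name and dictionary keys are hypothetical. The UUID is matched as any non-blank token rather than validated.

```python
import re

# Event line: timestamp, UUID, and message ID.
EVENT = re.compile(
    r"^([A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+)\s+(\S+)\s+(\S+)\s*$"
)
# Certainty line, for example "100% fault.cpu...".
SUSPECT = re.compile(r"^\s*(\d+)%\s+(\S+)")
# "FRU:" and "rsrc:" detail lines.
FIELD = re.compile(r"^\s*(FRU|rsrc):\s*(\S+)")

def parse_fmdump_verbose(text):
    """Return one dict per fault event in captured fmdump -v output."""
    events = []
    for line in text.splitlines():
        event = EVENT.match(line)
        suspect = SUSPECT.match(line)
        field = FIELD.match(line)
        if event:
            events.append({"time": event.group(1),
                           "uuid": event.group(2),
                           "msg_id": event.group(3)})
        elif suspect and events:
            events[-1]["certainty"] = int(suspect.group(1))
            events[-1]["fault_class"] = suspect.group(2)
        elif field and events:
            events[-1][field.group(1).lower()] = field.group(2)
    return events

sample = """\
TIME                 UUID                                 SUNW-MSG-ID
Apr 24 06:54:08.2005 lce22523-lc80-6062-e61d-f3b39290ae2c SUN4U-8000-6H
100% fault.cpu.ultraSPARCT1l2cachedata
FRU: hc:///component=MB
rsrc: cpu:///cpuid=0/serial=22D1D6604A
"""
events = parse_fmdump_verbose(sample)
print(events[0]["msg_id"], events[0]["fru"])
```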
2. Use the Sun message ID to obtain more information about this type of fault.
a. In a browser, go to the Predictive Self-Healing Knowledge Article web site:
http://www.sun.com/msg
b. Enter the message ID in the SUNW-MSG-ID field, and press Lookup.
In this example, the message ID SUN4U-8000-6H returns the following information for corrective action:
CPU errors exceeded acceptable levels
Type: Fault
Severity: Major
Description: The number of errors associated with this CPU has exceeded acceptable levels.
Automated Response: The fault manager will attempt to remove the affected CPU from service.
Impact: System performance may be affected.
Suggested Action for System Administrator: Schedule a repair procedure to replace the affected CPU, the identity of which can be determined using fmdump -v -u <EVENT_ID>.
Details: The Message ID: SUN4U-8000-6H indicates diagnosis has determined that a CPU is faulty. The Solaris fault manager arranged an automated attempt to disable this CPU. The recommended action for the system administrator is to contact Sun support so a Sun service technician can replace the affected component.
c. Follow the suggested actions to repair the fault.
Collecting Information From Solaris OS Files and Commands
With the Solaris OS running on the Sun Fire T2000 server, you have the full complement of Solaris OS files and commands available for collecting information and for troubleshooting.
If POST, ALOM, or the Solaris PSH features do not indicate the source of a fault, check the message buffer and log files for fault notifications. Hard drive faults are usually captured by the Solaris message files.
Use the dmesg command to view the most recent system messages. To view the system messages log file, view the contents of the /var/adm/messages file.
To Check the Message Buffer
1. Log in as superuser.
2. Issue the dmesg command:
# dmesg
The dmesg command displays the most recent messages generated by the system.
To View System Message Log Files
The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages can alert you to system problems such as a device that is about to fail.
The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.
1. Log in as superuser.
2. Issue the following command:
# more /var/adm/messages
3. If you want to view all logged messages, issue the following command:
# more /var/adm/messages*
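When reviewing all logged messages, it can help to read the rotated files in chronological order, oldest first. The sketch below orders file names according to the rotation scheme described above, where a higher numeric suffix means older content. The function name is hypothetical, and the example operates on a plain list of names rather than reading /var/adm.

```python
import re

def order_message_files(names):
    """Return messages file names oldest first; unrelated names are dropped."""
    def age(name):
        m = re.fullmatch(r"messages(?:\.(\d+))?", name)
        if m is None:
            return None
        # The unsuffixed file is the newest, so it sorts after every suffix.
        return int(m.group(1)) if m.group(1) is not None else -1

    dated = [(age(n), n) for n in names if age(n) is not None]
    return [n for _, n in sorted(dated, reverse=True)]

ordered = order_message_files(
    ["messages.1", "messages", "messages.3", "lastlog", "messages.2"]
)
print(ordered)
```

Reading the files in this order keeps events from a long-running problem in sequence across rotations.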
Managing Components with Automatic System Recovery (ASR) Commands
The Automatic System Recovery (ASR) feature enables the server to automatically configure failed components out of operation until they can be replaced. In the Sun Fire T2000 server, the following components are managed by the ASR feature:
UltraSPARC T1 processor strands
Memory DIMMs
I/O bus
The database that contains the list of disabled components is called the ASR blacklist (asr-db).
In most cases, POST and ALOM automatically manage the disabling of faulty components and automatically enable them when the faulty FRU is replaced. In some situations, it is necessary to manually manage the blacklist.
Example: A component appears faulty and is automatically disabled. The problem is due to a loose connector, and no FRU replacement is required to fix the problem. ALOM, which would normally detect a FRU replacement and enable the FRU, does not do so. In this case, after the loose cable is reseated, the disabled component must be manually enabled.
The Automatic System Recovery (ASR) commands (TABLE 2-10) enable you to view, and manually add or remove components from the ASR blacklist. These commands are run from the ALOM sc> prompt.
TABLE 2-10 ASR Commands
Command                   Description
showcomponent*            Displays system components and their current state.
enablecomponent asrkey    Removes a component from the asr-db blacklist, where asrkey is the component to enable.
disablecomponent asrkey   Adds a component to the asr-db blacklist, where asrkey is the component to disable.
clearasrdb                Removes all entries from the asr-db blacklist.
* The showcomponent command may not report all blacklisted DIMMs.
Note – The components (asrkeys) vary from system to system, depending on how many cores and how much memory are present. Use the showcomponent command to see the asrkeys on a given system.
Note – A reset or powercycle is required after disabling or enabling a component. If the status of a component is changed while the power is on, the change has no effect on the system until the next reset or powercycle.
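The interaction between the blacklist and the reset requirement can be pictured with a small model. The class below is a toy illustration only and does not reflect how ALOM stores asr-db internally. The class name, attributes, and the idea of tracking an "active" set separately are assumptions made for the sketch.

```python
class AsrModel:
    """Toy model: blacklist edits take effect only at the next reset."""

    def __init__(self, keys):
        self.keys = set(keys)          # asrkeys known on this system
        self.blacklist = set()         # what the blacklist records
        self.active_disabled = set()   # what the running system honors

    def _check(self, asrkey):
        if asrkey not in self.keys:
            raise KeyError(asrkey)

    def disablecomponent(self, asrkey):
        self._check(asrkey)
        self.blacklist.add(asrkey)

    def enablecomponent(self, asrkey):
        self._check(asrkey)
        self.blacklist.discard(asrkey)

    def clearasrdb(self):
        self.blacklist.clear()

    def reset(self):
        # A reset or powercycle is the point where the blacklist is applied.
        self.active_disabled = set(self.blacklist)

model = AsrModel(["MB/CMP0/CH3/R1/D0", "MB/CMP0/CH3/R1/D1"])
model.disablecomponent("MB/CMP0/CH3/R1/D1")
before_reset = set(model.active_disabled)
model.reset()
after_reset = set(model.active_disabled)
print(before_reset, after_reset)
```

In the model, the disabled DIMM appears in the active set only after reset() is called, mirroring the note above.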
To Run the showcomponent Command
The showcomponent command displays the system components (asrkeys) and reports their status.
1. At the sc> prompt, enter the showcomponent command.
Example with no disabled components:
sc> showcomponent
Keys:
MB/CMP0/P0 MB/CMP0/P1 MB/CMP0/P2 MB/CMP0/P3
MB/CMP0/P8 MB/CMP0/P9 MB/CMP0/P10 MB/CMP0/P11
MB/CMP0/P12 MB/CMP0/P13 MB/CMP0/P14 MB/CMP0/P15
MB/CMP0/P16 MB/CMP0/P17 MB/CMP0/P18 MB/CMP0/P19
MB/CMP0/P20 MB/CMP0/P21 MB/CMP0/P22 MB/CMP0/P23
MB/CMP0/P28 MB/CMP0/P29 MB/CMP0/P30 MB/CMP0/P31
MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D1 MB/CMP0/CH0/R1/D0 MB/CMP0/CH0/R1/D1
MB/CMP0/CH1/R0/D0 MB/CMP0/CH1/R0/D1 MB/CMP0/CH1/R1/D0 MB/CMP0/CH1/R1/D1
MB/CMP0/CH2/R0/D0 MB/CMP0/CH2/R0/D1 MB/CMP0/CH2/R1/D0 MB/CMP0/CH2/R1/D1
MB/CMP0/CH3/R0/D0 MB/CMP0/CH3/R0/D1 MB/CMP0/CH3/R1/D0 MB/CMP0/CH3/R1/D1
IOBD/PCIEa IOBD/PCIEb PCIX1 PCIX0
PCIE2 PCIE1 PCIE0 TTYA
ASR state: clean
Example showing a disabled component:
sc> showcomponent
.
.
.
ASR state: Disabled Devices
MB/CMP0/CH3/R1/D1 : dimm15 deemed faulty
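Captured showcomponent output can be split into the key list, the ASR state, and any disabled devices. The following sketch assumes the layout of the two examples above ("Keys:" followed by asrkeys, then an "ASR state:" line, then "key : reason" lines). The function name is hypothetical, and the sample key list is abbreviated for brevity.

```python
def parse_showcomponent(text):
    """Split captured showcomponent output into keys, state, and disabled devices."""
    keys, disabled, state, section = [], {}, None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Keys:"):
            section = "keys"
        elif line.startswith("ASR state:"):
            section = "state"
            state = line.split(":", 1)[1].strip()
        elif section == "keys":
            # Keys may appear one per line or several per line.
            keys.extend(line.split())
        elif section == "state" and " : " in line:
            key, reason = line.split(" : ", 1)
            disabled[key.strip()] = reason.strip()
    return {"keys": keys, "state": state, "disabled": disabled}

sample = """\
sc> showcomponent
Keys:
MB/CMP0/P0
MB/CMP0/CH3/R1/D0
MB/CMP0/CH3/R1/D1
ASR state: Disabled Devices
MB/CMP0/CH3/R1/D1 : dimm15 deemed faulty
"""
result = parse_showcomponent(sample)
print(result["state"], result["disabled"])
```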
To Run the disablecomponent Command
The disablecomponent command disables a component by adding it to the ASR blacklist.
1. At the sc> prompt, enter the disablecomponent command.
sc> disablecomponent MB/CMP0/CH3/R1/D1
sc>SC Alert:MB/CMP0/CH3/R1/D1 disabled
2. After receiving confirmation that the disablecomponent command is complete, reset the server so that the ASR command takes effect.
sc> reset
To Run the enablecomponent Command
The enablecomponent command enables a disabled component by removing it from the ASR blacklist.
1. At the sc> prompt, enter the enablecomponent command.
sc> enablecomponent MB/CMP0/CH3/R1/D1
sc>SC Alert:MB/CMP0/CH3/R1/D1 reenabled
2. After receiving confirmation that the enablecomponent command is complete, reset the server so that the ASR command takes effect.
sc> reset
Exercising the System With SunVTS
Sometimes a server exhibits a problem that cannot be isolated definitively to a particular hardware or software component. In such cases, it may be useful to run a diagnostic tool that stresses the system by continuously running a comprehensive battery of tests. Sun provides the SunVTS software for this purpose.
This section describes the tasks necessary to use SunVTS software to exercise your Sun Fire T2000 server:
“Checking Whether SunVTS Software Is Installed” on page 48
“Exercising the System Using SunVTS Software” on page 50
Checking Whether SunVTS Software Is Installed
This procedure assumes that the Solaris OS is running on the Sun Fire T2000 server, and that you have access to the Solaris command line.
To Check Whether SunVTS Software Is Installed
1. Check for the presence of SunVTS packages using the pkginfo command.
% pkginfo -l SUNWvts SUNWvtsr SUNWvtsts SUNWvtsmn
If SunVTS software is loaded, information about the packages is displayed.
If SunVTS software is not loaded, you see an error message for each missing
package.
ERROR: information for "SUNWvts" was not found
ERROR: information for "SUNWvtsr" was not found
...
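If the pkginfo output is saved to a file, the missing packages can be listed with a short script. The sketch below relies only on the error message wording shown above. The function name is hypothetical, and the script does not run pkginfo itself.

```python
import re

def missing_packages(pkginfo_output):
    """Return the package names that pkginfo reported as not found."""
    return re.findall(
        r'ERROR: information for "([^"]+)" was not found', pkginfo_output
    )

sample = (
    'ERROR: information for "SUNWvts" was not found\n'
    'ERROR: information for "SUNWvtsr" was not found\n'
)
missing = missing_packages(sample)
print(missing)
```

An empty list indicates that no "was not found" errors appeared in the captured output.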
The following table lists SunVTS packages:
Package Description
SUNWvts SunVTS framework
SUNWvtsr SunVTS Framework (root)
SUNWvtsts SunVTS for tests
SUNWvtsmn SunVTS man pages
If SunVTS is not installed, you can obtain the installation packages from the following:
Solaris Operating System DVDs
Sun Download Center: http://www.sun.com/oem/products/vts
The SunVTS 6.0 PS3 software, and future compatible versions, are supported on the Sun Fire T2000 server.
SunVTS installation instructions are described in the SunVTS User’s Guide.
Exercising the System Using SunVTS Software
Before you begin, the Solaris OS must be running. You also need to ensure that SunVTS validation test software is installed on your system. See “Checking Whether
SunVTS Software Is Installed” on page 48.
The SunVTS installation process requires that you specify one of two security schemes to use when running SunVTS. The security scheme you choose must be properly configured in the Solaris OS for you to run SunVTS. For details, refer to the SunVTS User’s Guide.
SunVTS software features both character-based and graphics-based interfaces. This procedure assumes that you are using the graphical user interface (GUI) on a system running the Common Desktop Environment (CDE). For more information about the character-based SunVTS TTY interface, and specifically for instructions on accessing it by TIP or telnet commands, refer to the SunVTS User’s Guide.
SunVTS software can be run in several modes. This procedure assumes that you are using the default mode.
This procedure also assumes that the Sun Fire T2000 server is headless, that is, it is not equipped with a monitor capable of displaying bit-mapped graphics. In this case, you access the SunVTS GUI by logging in remotely from a machine that has a graphics display.
Finally, this procedure describes how to run SunVTS tests in general. Individual tests may presume the presence of specific hardware, or may require specific drivers, cables, or loopback connectors. For information about test options and prerequisites, refer to the following documentation:
SunVTS Test Reference Manual
SunVTS 6.0 PS3 Doc Supplement (SPARC)
To Exercise the System Using SunVTS Software
1. Log in as superuser to a system with a graphics display.
The display system should be one with a frame buffer and monitor capable of displaying bit-mapped graphics such as those produced by the SunVTS GUI.
2. Enable the remote display.
On the display system, type:
# /usr/openwin/bin/xhost + test-system
where test-system is the name of the Sun Fire T2000 server you plan to test.
3. Remotely log in to the Sun Fire T2000 server as superuser.
Use a command such as rlogin or telnet.
4. Start SunVTS software.
If you have installed SunVTS software in a location other than the default /opt directory, alter the path in the following command accordingly.
# /opt/SUNWvts/bin/sunvts -display display-system:0
where display-system is the name of the machine through which you are remotely logged in to the Sun Fire T2000 server.
The SunVTS GUI is displayed (FIGURE 2-8).
FIGURE 2-8 SunVTS GUI
5. Expand the test lists to see the individual tests.
The test selection area lists tests in categories, such as Network, as shown in FIGURE 2-9. To expand a category, left-click the + icon (expand category icon) to the left of the category name.
FIGURE 2-9 SunVTS Test Selection Panel
6. (Optional) Select the tests you want to run.
Certain tests are enabled by default, and you can choose to accept these. Alternatively, you can enable and disable individual tests or blocks of tests by
clicking the checkbox next to the test name or test category name. Tests are enabled when checked, and disabled when not checked.
TABLE 2-11 lists tests that are especially useful to run on a Sun Fire T2000 server.
TABLE 2-11 Useful SunVTS Tests to Run on a Sun Fire T2000 Server
SunVTS Tests                                                   FRUs Exercised by Tests
cmttest, cputest, fputest, iutest, l1dcachetest, dtlbtest,     Memory DIMMs, CPU motherboard
and l2sramtest; indirectly: mptest and systest
disktest                                                       Disks, cables, disk backplane
cddvdtest                                                      CD/DVD device, cable, motherboard
nettest, netlbtest                                             Network interface, network cable, CPU motherboard
pmemtest, vmemtest, ramtest                                    Memory DIMMs, motherboard
serialtest                                                     I/O (serial port interface)
usbkbtest, disktest                                            USB devices, cable, CPU motherboard (USB controller)
hsclbtest                                                      Motherboard, system controller (Host to System Controller interface)
7. (Optional) Customize individual tests.
You can customize individual tests by right-clicking on the name of the test. For example, in FIGURE 2-9, right-clicking on the text string ce0(nettest) brings up a menu that enables you to configure this Ethernet test.
8. Start testing.
Click the Start button that is located at the top left of the SunVTS window. Status and error messages appear in the test messages area located across the bottom of the window. You can stop testing at any time by clicking the Stop button.
During testing, SunVTS software logs all status and error messages. To view these, click the Log button or select Log Files from the Reports menu. This opens a log window from which you can choose to view the following logs:
Information – Detailed versions of all the status and error messages that appear in the test messages area.
Test Error – Detailed error messages from individual tests.
VTS Kernel Error – Error messages pertaining to SunVTS software itself. You should look here if SunVTS software appears to be acting strangely, especially when it starts up.
Solaris OS Messages (/var/adm/messages) – A file containing messages generated by the operating system and various applications.
Log Files (/var/opt/SUNWvts/logs) – A directory containing the log files.
For further information, refer to the manuals that accompany the SunVTS software.
CHAPTER 3
Replacing Hot-Swappable and Hot-Pluggable FRUs
This chapter describes how to remove and replace the hot-swappable and hot-pluggable field-replaceable units (FRUs) in the Sun Fire T2000 server.
The following topics are covered:
“Devices That Are Hot-Swappable and Hot-Pluggable” on page 56
“Hot-Swapping a Fan” on page 56
“Hot-Swapping a Power Supply” on page 58
“Hot-Swapping the Rear Blower” on page 61
“Hot-Plugging a Hard Drive” on page 63
Devices That Are Hot-Swappable and Hot-Pluggable
Hot-swappable devices are those devices that you can remove and install while the system is running without affecting the rest of the system’s capabilities. In a Sun Fire T2000 server, the following devices are hot-swappable:
Fans
Power supplies
Rear blower
Hot-pluggable devices are those devices that can be removed and installed while the system is running, but you must perform administrative tasks beforehand. In a Sun Fire T2000 server, the chassis-mounted hard drives can be hot-pluggable (depending on how they are configured).
Hot-Swapping a Fan
Three hot-swappable fans are located under the fan door.
Two working fans are required to provide adequate cooling for the Sun Fire T2000 server. If a fan fails, replace it as soon as possible to ensure system availability.
The following LEDs are lit when a fan fault is detected:
Front and rear Service Required LEDs.
Top Fan LED on the front of the server
LED on the faulty fan
If an overtemperature condition occurs, the front panel OverTemp LED lights.
A message is displayed on the console and logged by ALOM. Use the showfaults command at the sc> prompt to view the current faults.
To Remove a Fan
1. Gain access to the top of the server where the fan door is located (FIGURE 3-1).
You might need to extend the server to a maintenance position. See “To Extend the Server to the Maintenance Position” on page 69.
FIGURE 3-1 Removing a Fan (callouts: FN2, FN1, FN0, LED, latch, fan door)
2. Unpackage the replacement fan and place it near the server.
3. Lift the latch on the top of the fan door (FIGURE 3-1), and lift the fan door open.
The fan door is spring loaded, and you must hold it in the open position.
4. Identify the faulty fan.
A lighted LED on the top of a fan (FIGURE 3-1) indicates that the fan is faulty.
5. Pull up on the fan strap handle until the fan is removed from the fan bay.
To Replace a Fan
1. With the fan door held open, slide the replacement fan into the fan bay.
2. Apply firm pressure to fully seat the fan.
3. Verify that the LED on the replaced fan and the Top Fan, Service Required, and Locator LEDs are not lit.
4. Close the fan door.
5. If necessary, return the server to its normal position in the rack.
Hot-Swapping a Power Supply
The Sun Fire T2000 server’s redundant hot-swappable power supplies enable you to remove and replace a power supply without shutting the server down, provided that the other power supply is online and working.
The following LEDs are lit when a power supply fault is detected:
Front and rear Service Required LEDs.
Rear-FRU Fault LED on the front of the server
Amber Failure LED on the faulty power supply
If a power supply fails and you do not have a replacement available, leave the failed power supply installed to ensure proper air flow in the server.
To Remove a Power Supply
1. Identify which power supply (0 or 1) requires replacement (FIGURE 3-2).
A lighted amber LED on a power supply indicates that a failure was detected. You can also use the showfaults command at the sc> prompt.
FIGURE 3-2 Locating Power Supplies and Release Latch (callouts: latches, PS1, PS0)
2. At the sc> prompt, issue the removefru command.
The removefru command prepares the server for the hot swap operation. For instructions on how to access the sc> prompt, refer to the Sun Fire T2000 Server
Advanced Lights Out Manager (ALOM) Guide. Example:
sc> removefru -y PSn
Are you sure you want to remove PS0 [y/n]? y
<PS0> is safe to remove.
where PSn is the power supply identifier for the power supply you plan to remove, either PS0 or PS1.
3. Gain access to the rear of the server where the faulty power supply is located.
4. At the rear of the server, release the cable management arm (CMA) tab (FIGURE 3-3) and swing the CMA out of the way so you can access the power supply.
FIGURE 3-3 Rotating the Cable Management Arm
5. Disconnect the power cord from the faulty power supply.
6. Grasp the power supply handle and push the power supply latch to the right.
7. Pull the power supply out of the chassis.
To Replace a Power Supply
1. Align the replacement power supply with the empty power supply bay.
2. Slide the power supply into the bay until it is fully seated.
3. Reconnect the power cord to the power supply.
4. Close the CMA, inserting the end of the CMA into the rear left rail bracket.
5. Verify that the amber LED on the replaced power supply, the Service Required LED, and the Rear-FRU Fault LED are not lit.
6. At the sc> prompt, issue the showenvironment command to verify the status of the power supplies.
Hot-Swapping the Rear Blower
The rear blower on the Sun Fire T2000 server is hot-swappable.
The following LEDs are lit when a blower unit fault is detected:
Front and rear Service Required LEDs
LED on the blower
To Remove the Rear Blower
1. Gain access to the rear of the server where the faulty blower unit is located.
2. Release the cable management arm tab (FIGURE 3-3) and swing the cable management arm out of the way so you can access the rear blower.
3. Unscrew the two thumbscrews (FIGURE 3-4) that secure the rear blower to the chassis.
FIGURE 3-4 Removing the Rear Blower (callout: LED)
4. Grasp the thumbscrews and slowly slide the blower out of the chassis, keeping the blower level as you remove it.
To Replace the Rear Blower
1. Unpackage the replacement blower.
2. Slide the blower into the chassis until it locks into the power connector at the front of the blower compartment (FIGURE 3-5).
FIGURE 3-5 Replacing the Blower Unit (callout: FN2)
3. Tighten the two thumbscrews to secure the blower (FN2) to the chassis.
4. Verify that the Rear Blower and Service Required LEDs are not lit.
5. Close the CMA, inserting the end of the CMA into the rear left rail bracket.
Hot-Plugging a Hard Drive
The hard disk drives in the Sun Fire T2000 server are hot-pluggable, but this capability depends on how the hard drives are configured. To hot-plug a drive, you must first take the drive offline (prevent any applications from accessing it, and remove the logical software links to it) so that you can safely remove it.
The following situations inhibit the ability to perform hot-plugging of a drive:
The hard drive provides the operating system, and the operating system is not
mirrored on another drive.
The hard drive cannot be logically isolated from the online operations of the server.
If either of these conditions applies to your drive, you must shut the system down before you replace the hard drive. See “To Shut the System Down” on page 68.
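The two blocking conditions amount to a simple decision rule, sketched below. The function and parameter names are hypothetical, and determining the actual values (whether the drive hosts the OS, whether it is mirrored, whether it can be isolated) still depends on your storage configuration.

```python
def can_hot_plug(hosts_os, os_mirrored, can_isolate):
    """Return True when neither blocking condition applies to the drive."""
    if hosts_os and not os_mirrored:
        return False   # the only copy of the operating system is on this drive
    if not can_isolate:
        return False   # the drive cannot be logically taken offline
    return True

# A mirrored boot drive that can be taken offline may be hot-plugged.
print(can_hot_plug(hosts_os=True, os_mirrored=True, can_isolate=True))
# An unmirrored boot drive requires a system shutdown first.
print(can_hot_plug(hosts_os=True, os_mirrored=False, can_isolate=True))
```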
To Remove a Hard Drive
1. Identify the location of the hard drive that you want to replace (FIGURE 3-6).
FIGURE 3-6 Locating the Hard Drive Release Button and Latch (callouts: HDD0, HDD1, HDD2, HDD3, latch, latch release button)
2. Issue the Solaris OS commands required to stop using the hard drive.
Exact commands required depend on the configuration of your hard drives. You might need to unmount file systems or perform RAID commands.
Example:
cfgadm -c unconfigure c0t0d0s0
3. On the drive you plan to remove, push the latch release button (FIGURE 3-6).
The latch opens.
Caution – The latch is not an ejector. Do not bend it too far to the left. Doing so can
damage the latch.
4. Grasp the latch and pull the drive out of the drive slot.
To Replace a Hard Drive
1. Align the replacement drive to the drive slot.
The hard drive is physically addressed according to the slot in which it is installed. See FIGURE 3-6. It is important to install a replacement drive in the same slot as the drive that was removed.
2. Slide the drive into the bay until it is fully seated.
3. Close the latch to lock the drive in place.
4. Perform administrative tasks to reconfigure the hard disk drive.
The procedures that you perform at this point depend on how your data is configured. You might need to partition the drive, create file systems, load data from backups, or have it updated from a RAID configuration.
Example:
cfgadm -c configure c0t0d0s0
CHAPTER 4
Replacing Cold-Swappable FRUs
This chapter describes how to remove and replace field replaceable units (FRUs) in the Sun Fire T2000 server that must be cold swapped.
The following topics are covered:
“Safety Information” on page 66
“Common Procedures for Parts Replacement” on page 67
“Removing and Replacing FRUs” on page 74
“Common Procedures for Finishing Up” on page 103
For a list of FRUs, see Appendix A, “Field-Replaceable Units” on page 119.
Note – Never attempt to run the system with the cover removed. The cover must be
in place for proper air flow. The cover interlock switch immediately shuts the system down when the cover is removed.
Safety Information
This section describes important safety information you need to know prior to removing or installing parts in the Sun Fire T2000 server.
For your protection, observe the following safety precautions when setting up your equipment:
Follow all Sun standard cautions, warnings, and instructions marked on the
equipment and described in Important Safety Information for Sun Hardware Systems, 816-7190.
Make sure that the voltage and frequency of your power source match the voltage
and frequency inscribed on the equipment’s electrical rating label.
Follow the electrostatic discharge safety practices as described in this section.
Safety Symbols
The following symbols might appear in this book. Note their meanings:
Caution – There is a risk of personal injury and equipment damage. To avoid
personal injury and equipment damage, follow the instructions.
Caution – Hot surface. Avoid contact. Surfaces are hot and might cause personal
injury if touched.
Caution – Hazardous voltages are present. To reduce the risk of electric shock and
danger to personal health, follow the instructions.
Electrostatic Discharge Safety
Electrostatic discharge (ESD) sensitive devices, such as the motherboard, PCI cards, hard drives, and memory cards, require special handling.
Caution – The boards and hard drives contain electronic components that are
extremely sensitive to static electricity. Ordinary amounts of static electricity from clothing or the work environment can destroy components. Do not touch the components along their connector edges.
Use an Antistatic Wrist Strap
Wear an antistatic wrist strap and use an antistatic mat when handling components such as drive assemblies, boards, or cards. When servicing or removing server components, attach an antistatic strap to your wrist and then to a metal area on the chassis. Do this after you disconnect the power cords from the server. Following this practice equalizes the electrical potentials between you and the server.
Use an Antistatic Mat
Place ESD-sensitive components such as the motherboard, memory, and other PCB cards on an antistatic mat.
Common Procedures for Parts Replacement
Before you can remove and replace parts that are inside the Sun Fire T2000 server, you must perform the following procedures:
“To Shut the System Down” on page 68
“To Extend the Server to the Maintenance Position” on page 69
“To Perform Electrostatic Discharge (ESD) Prevention Measures” on page 72
“To Disconnect Power From the Server” on page 72
“To Remove the Top Cover” on page 72
“To Remove the Front Bezel and Top Front Cover” on page 73
Note – These procedures do not apply to the hot-pluggable and hot-swappable
devices (fans, power supplies, hard drives and rear blower) described in the preceding chapter.
The corresponding procedures that you perform when maintenance is complete are described in “Common Procedures for Finishing Up” on page 103.
Required Tools
The Sun Fire T2000 server can be serviced with the following tools:
Antistatic wrist strap
Antistatic mat
No. 2 Phillips screwdriver
To Shut the System Down
Performing a graceful shutdown makes sure all of your data is saved and the system is ready for restart.
1. Log in as superuser or equivalent.
Depending on the nature of the problem, you might want to view the system status, the log files, or run diagnostics before you shut down the system. Refer to the Sun Fire T2000 Server Administration Guide for log file information.
2. Notify affected users.
Refer to your Solaris system administration documentation for additional information.
3. Save any open files and quit all running programs.
Refer to your application documentation for specific information on these processes.
4. Shut down the Solaris OS.
Refer to the Solaris system administration documentation for additional information.
5. Switch from the system console to the ALOM sc> prompt by typing the #. (Pound Period) key sequence.
6. At the ALOM sc> prompt, issue the poweroff command.
sc> poweroff -fy
SC Alert: SC Request to Power Off Host Immediately.
Note – You can also use the Power ON/OFF button on the front of the server to
initiate a graceful system shutdown. This button is recessed to prevent accidental server power-off. Use the tip of a pen to operate this button.
Refer to the Sun Fire T2000 Server Advanced Lights Out Manager (ALOM) Guide for more information about the ALOM poweroff command.
To Extend the Server to the Maintenance Position
If the server is installed in a rack with the extendable slide rails that were supplied with it, use this procedure to extend the server to the maintenance position.
Note – Removing the server from the rack is recommended for all cold-swappable FRU replacement procedures except the DIMMs, PCI cards, and the system controller.
1. (Optional) Issue the following command from the ALOM sc> prompt to locate the system that requires maintenance.
sc> setlocator on
Locator LED is on.
Once you have located the server, press the Locator LED button to turn it off.
2. Check to see that no cables will be damaged or interfere when the server is extended.
Although the cable management arm (CMA) that is supplied with the server is hinged to accommodate extending the server, you should make sure that all cables and cords are capable of extending.
3. From the front of the server, release the slide rail latches on each side.
Pinch the green latches as shown in FIGURE 4-1.
FIGURE 4-1 Slide Release Latches
4. While pinching the release latches, slowly pull the server forward until the slide rails latch.
To Remove the Server From the Rack
Removing the server from the rack is recommended for all cold-swappable FRU replacement procedures except the DIMMs, PCI cards, and the system controller.
Caution – The server weighs approximately 40 lb. (18 kg). Two people are required to dismount and carry the chassis.
1. Disconnect all the cables and power cords from the server.
2. Extend the server to the maintenance position as described in “To Extend the Server to the Maintenance Position” on page 69.
3. Press the metal lever (FIGURE 4-2) that is located on the inner side of the rail to disconnect the CMA from the rail assembly (on the right side from the back of the rack).
This leaves the CMA still attached to the cabinet, but the server chassis is now disconnected from the CMA.
FIGURE 4-2 Locating the Metal Lever
Caution – The server weighs approximately 40 lb. (18 kg). The next step requires two people to dismount and carry the chassis.
4. From the front of the server, pull the release tabs forward and pull the server forward until it is free of the rack rails.
The release tabs are located on each rail, about midway on the server.
5. Set the server on a sturdy work surface.
To Disconnect Power From the Server
Caution – The system supplies standby power to the circuit boards even when the system is powered off.
Disconnect both power cords from the power supplies.
Note – The following FRU replacements do not require that power be removed: DIMMs and PCI cards.
To Perform Electrostatic Discharge (ESD) Prevention Measures
1. Prepare an antistatic surface on which to set parts during removal and installation.
Place ESD-sensitive components such as the printed circuit boards on an antistatic mat. The following items can be used as an antistatic mat:
Antistatic bag used to wrap a Sun replacement part
Sun ESD mat, part number 250-1088
Disposable ESD mat (shipped with some replacement parts or optional system components)
2. Attach an antistatic wrist strap.
When servicing or removing server components, attach an antistatic strap to your wrist and then to a metal area on the chassis. Do this after you disconnect the power cords from the server.
To Remove the Top Cover
All field replaceable units (FRUs) that are not hot swappable require the removal of the top cover.
1. Press the top cover release button (FIGURE 4-3).
FIGURE 4-3 Top Cover and Release Button (callouts: top cover, fan cover, fan cover latch, top cover release button, top front cover)
2. While pressing the top cover release button, slide the cover toward the rear of the server about half an inch.
3. Lift the cover off the chassis.
To Remove the Front Bezel and Top Front Cover
The following field-replaceable units (FRUs) require the removal of the top front cover and front bezel:
Motherboard
SAS disk backplane
LED board
Front I/O board
Fan power board
DVD
1. Remove the top cover as described in the previous procedure.
2. Lift the fan cover latch (FIGURE 4-3) and open the fan cover.
3. Loosen the captive screw (near the right-most fan) that secures the bezel to the chassis (FIGURE 4-4).
FIGURE 4-4 Removing the Front Bezel from the Server Chassis
4. Remove the front bezel from the chassis (FIGURE 4-4).
The bezel is held in place by a mounting tab and four fasteners that clamp the bezel to the chassis.
5. While holding the fan cover open, slide the top front cover forward to disengage it from the chassis.
6. Lift the top front cover from the chassis.
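The common procedures above do not apply equally to every FRU. The following Python sketch shows one way to turn the notes and lists in this section into a per-FRU preparation checklist. It is illustrative only: the set names, the FRU spellings used as keys, and the prep_steps function are hypothetical, and the ordering simply follows the order in which the procedures appear in this section.

```python
from typing import List

# FRUs listed under "To Remove the Front Bezel and Top Front Cover".
NEEDS_FRONT_BEZEL_REMOVAL = {
    "motherboard", "sas disk backplane", "led board",
    "front i/o board", "fan power board", "dvd",
}
# FRUs for which removing the server from the rack is not recommended as necessary.
RACK_REMOVAL_NOT_NEEDED = {"dimm", "pci card", "system controller"}
# FRUs whose replacement does not require that power be removed.
POWER_REMOVAL_NOT_REQUIRED = {"dimm", "pci card"}


def prep_steps(fru: str) -> List[str]:
    """Return the common preparation steps for a cold-swappable FRU."""
    key = fru.strip().lower()
    steps = ["Shut the system down"]
    if key in RACK_REMOVAL_NOT_NEEDED:
        steps.append("Extend the server to the maintenance position")
    else:
        steps.append("Remove the server from the rack (recommended)")
    if key not in POWER_REMOVAL_NOT_REQUIRED:
        steps.append("Disconnect both power cords")
    steps.append("Perform ESD prevention measures")
    steps.append("Remove the top cover")
    if key in NEEDS_FRONT_BEZEL_REMOVAL:
        steps.append("Remove the front bezel and top front cover")
    return steps


if __name__ == "__main__":
    for step in prep_steps("Fan power board"):
        print(step)
```

Keeping the exceptions in three small sets mirrors how the manual states them (as notes attached to individual procedures), so a change to one note touches only one set.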
Removing and Replacing FRUs
This section provides procedures for replacing the following field-replaceable units (FRUs) inside the server chassis:
“To Remove PCI-E and PCI-X Cards” on page 75 and “To Replace PCI Cards” on page 77
“To Remove DIMMs” on page 77 and “To Replace DIMMs” on page 79
“To Remove the System Controller” on page 82 and “To Replace the System Controller Board” on page 83
“To Remove the Motherboard Assembly” on page 84 and “To Replace the Motherboard Assembly” on page 88
“To Remove the Power Distribution Board” on page 90 and “To Replace the Power Distribution Board” on page 92
“To Remove the LED Board” on page 93 and “To Replace the LED Board”
“To Remove the Fan Power Board” on page 95 and “To Replace the Fan Power Board” on page 96
“To Remove the DVD Drive” on page 98 and “To Replace the DVD Drive” on page 99
“To Remove the SAS Disk Backplane” on page 99 and “To Replace the SAS Disk Backplane” on page 100
“To Remove the Battery on the System Controller” on page 101 and “To Replace the Battery on the System Controller” on page 101
To locate these FRUs, refer to Appendix A, “Field-Replaceable Units” on page 119.
To Remove PCI-E and PCI-X Cards
Use this procedure to remove the optional PCI-E and PCI-X cards from the server.
1. Perform the procedures described in “Common Procedures for Parts Replacement” on page 67.
2. Locate the PCI card that you want to remove.
To locate the PCI card slots, refer to FIGURE 4-5 and FIGURE 4-6. The PCI card slots are located on the I/O portion of the motherboard assembly.
FIGURE 4-5 Location of PCI-E and PCI-X Card Slots (callouts: PCI-E slots 0, 1, and 2; PCI-X slots 0 and 1)
3. Make a note of where the PCI card is installed and note any cables so you know where to reinstall the card and cables.
FIGURE 4-6 Location of PCI-E and PCI-X Card Slots (callouts: PCI-E slots 0, 1, 2; PCI-X slots 0, 1)
4. Make note of and remove any cables that are attached to the card.
5. Rotate the PCI hold-down bracket 90 degrees so it no longer covers the PCI card (FIGURE 4-7).
FIGURE 4-7 PCI Card and Hold-down Bracket (callout: PCI hold-down bracket)
6. Carefully work the card out of the socket.
7. Place the card on an antistatic mat.
8. Rotate the hold-down bracket so that it does not protrude into the chassis.
To Replace PCI Cards
Use this procedure to replace PCI-E and PCI-X cards.
1. Unpackage the replacement PCI-E or PCI-X card and place it on an antistatic mat.
2. Locate the proper socket for the card you are replacing.
3. Rotate the PCI hold-down bracket 90 degrees so you can install the card.
4. Insert the card into the socket.
5. Rotate the PCI hold-down bracket 90 degrees to lock the card in place.
6. Perform the procedures described in “Common Procedures for Finishing Up” on page 103.
To Remove DIMMs
Caution – This procedure requires that you handle components that are sensitive to static discharges that can cause the component to fail. To avoid this problem, ensure that you follow antistatic practices as described in “To Perform Electrostatic Discharge (ESD) Prevention Measures” on page 72.
1. Perform the procedures described in “Common Procedures for Parts Replacement” on page 67.
2. Locate the DIMM (FIGURE 4-8) that you want to replace.
Use FIGURE 4-8 and TABLE 4-1 to identify the DIMM you want to remove.
Note – For memory configuration information see “To Add DIMMs” on page 113.
FIGURE 4-8 DIMM Locations (callout: front of board)
Use FIGURE 4-8 and TABLE 4-1 to map the DIMM names that are displayed in fault messages to the socket numbers that identify the location of each DIMM on the motherboard.
TABLE 4-1 DIMM Names and Socket Numbers
DIMM Name Used in Messages*    Socket No.    DIMM No.
CH0/R1/D1                      J0901         DIMM 1
CH0/R0/D1                      J0701         DIMM 2
CH0/R1/D0                      J0801         DIMM 3
CH0/R0/D0                      J0601         DIMM 4
CH1/R0/D1                      J1401         DIMM 5
CH1/R1/D1                      J1201         DIMM 6
CH1/R1/D0                      J1301         DIMM 7
CH1/R0/D0                      J1101         DIMM 8
CH2/R1/D1                      J1901         DIMM 16
CH2/R0/D1                      J1701         DIMM 15
CH2/R1/D0                      J1801         DIMM 14
CH2/R0/D0                      J1601         DIMM 13
CH3/R1/D1                      J2401         DIMM 12
CH3/R0/D1                      J2201         DIMM 11
CH3/R1/D0                      J2301         DIMM 10
CH3/R0/D0                      J2101         DIMM 9
* DIMM names in messages are displayed with the full name, such as MB/CMP0/CH1/R1/D1, but this table lists the DIMM name in an abbreviated way (the preceding MB/CMP0 is omitted) for clarity.
3. Make note of the DIMM location so you can install the replacement DIMM in the same socket.
4. Push down on the ejector levers on each side of the DIMM until the DIMM is released.
5. Grasp the top corners of the faulty DIMM and remove it from the system.
6. Place the DIMM on an antistatic mat.
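When fault messages are processed by a script, the mapping in TABLE 4-1 can be held in a small lookup structure. The following Python sketch is one possible form. The names DimmLocation, DIMM_TABLE, and locate_dimm are hypothetical. The entries are copied from TABLE 4-1 as printed, and the MB/CMP0/ prefix handling follows the table footnote.

```python
from typing import NamedTuple


class DimmLocation(NamedTuple):
    socket: str    # motherboard socket number, for example "J0601"
    dimm_no: int   # DIMM number as labeled in FIGURE 4-8


# Abbreviated DIMM name -> location, copied row by row from TABLE 4-1.
DIMM_TABLE = {
    "CH0/R1/D1": DimmLocation("J0901", 1),
    "CH0/R0/D1": DimmLocation("J0701", 2),
    "CH0/R1/D0": DimmLocation("J0801", 3),
    "CH0/R0/D0": DimmLocation("J0601", 4),
    "CH1/R0/D1": DimmLocation("J1401", 5),
    "CH1/R1/D1": DimmLocation("J1201", 6),
    "CH1/R1/D0": DimmLocation("J1301", 7),
    "CH1/R0/D0": DimmLocation("J1101", 8),
    "CH2/R1/D1": DimmLocation("J1901", 16),
    "CH2/R0/D1": DimmLocation("J1701", 15),
    "CH2/R1/D0": DimmLocation("J1801", 14),
    "CH2/R0/D0": DimmLocation("J1601", 13),
    "CH3/R1/D1": DimmLocation("J2401", 12),
    "CH3/R0/D1": DimmLocation("J2201", 11),
    "CH3/R1/D0": DimmLocation("J2301", 10),
    "CH3/R0/D0": DimmLocation("J2101", 9),
}

FULL_NAME_PREFIX = "MB/CMP0/"  # present in messages, omitted in the table


def locate_dimm(name: str) -> DimmLocation:
    """Map a DIMM name (full or abbreviated) to its socket and DIMM number."""
    short = name.strip()
    if short.startswith(FULL_NAME_PREFIX):
        short = short[len(FULL_NAME_PREFIX):]
    return DIMM_TABLE[short]  # raises KeyError for names not in TABLE 4-1


if __name__ == "__main__":
    loc = locate_dimm("MB/CMP0/CH0/R0/D0")
    print(f"Socket {loc.socket}, DIMM {loc.dimm_no}")
```

A name that is not in the table raises KeyError rather than returning a guess, which suits a service context where pulling the wrong DIMM is costly.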
To Replace DIMMs
1. Unpackage the replacement DIMMs and place them on an antistatic mat.
2. Ensure that the connector ejector tabs are in the open position.
3. Line up the replacement DIMM with the connector.
Align the DIMM notch with the key in the connector. This ensures that the DIMM is oriented correctly.
4. Push the DIMM into the connector until the ejector tabs lock the DIMM in place.
5. Perform the procedures described in “Common Procedures for Finishing Up” on page 103.
6. Perform the following steps to clear the memory fault.
a. Gain access to the ALOM sc> prompt.
Refer to the Sun Fire T2000 Server Advanced Lights Out Management (ALOM) Guide for instructions.
b. Run the showfaults -v command to determine how to clear the fault:
If the fault is a Host-detected fault (displays a UUID), such as the following:
sc> showfaults -v
ID Time            FRU               Fault
0  SEP 09 11:09:26 MB/CMP0/CH0/R0/D0 Host detected fault, MSGID: SUN4U-8000-2S UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
Run the clearfault command with the UUID provided in the showfaults output:
sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86
Clearing fault from all indicted FRUs...
Fault cleared.
If the fault resulted in the DIMM being disabled, such as the following:
sc> showfaults -v
ID Time            FRU               Fault
1  OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled
Run the enablecomponent command to enable the FRU:
sc> enablecomponent MB/CMP0/CH0/R0/D0
7. Perform the following steps to verify that there are no faults:
a. Set the virtual keyswitch to diag mode so that POST will run in service mode.
sc> setkeyswitch diag
b. Issue the poweron command.
sc> poweron
c. Switch to the system console to view POST output.
sc> console
Watch the POST output for possible fault messages. The following output is a sign that POST did not detect any faults:
. . .
0:0>POST Passed all devices.
0:0>
0:0>DEMON: (Diagnostics Engineering MONitor)
0:0>Select one of the following functions
0:0>POST:Return to OBP.
0:0>INFO:
0:0>POST Passed all devices.
0:0>Master set ACK for vbsc runpost command and spin...
Note – Depending on the configuration of ALOM POST variables (see “Flowchart of ALOM Variables for POST Configuration” on page 33) and whether POST detected faults or not, the system might boot, or the system might remain at the ok prompt. If the system is at the ok prompt, type boot.
d. Issue the Solaris OS fmadm faulty command.
# fmadm faulty
No memory or DIMM faults should be displayed. If faults are reported, return to the “Diagnostic Flow Chart” on page 13 for an approach to diagnosing the fault.
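The choice between clearfault and enablecomponent in Step 6, and the POST check in Step 7, can be expressed as simple text matching on captured console output. The following Python sketch is illustrative only. The function names and regular expressions are hypothetical and are based solely on the example output shown in this procedure, so other ALOM firmware versions might format these messages differently. The sketch handles only the first fault it finds.

```python
import re
from typing import Optional

# A host-detected fault line carries a UUID, as in the first showfaults example.
UUID_RE = re.compile(
    r"UUID:\s*([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})")
# A disabled-component line names the FRU, as in the second showfaults example.
DISABLED_RE = re.compile(r"(\S+) deemed faulty and disabled")


def clearing_command(showfaults_output: str) -> Optional[str]:
    """Suggest the sc> command that clears the first fault found, or None."""
    match = UUID_RE.search(showfaults_output)
    if match:
        return f"clearfault {match.group(1)}"
    match = DISABLED_RE.search(showfaults_output)
    if match:
        return f"enablecomponent {match.group(1)}"
    return None


def post_passed(console_output: str) -> bool:
    """Return True if the POST output contains the pass message shown above."""
    return "POST Passed all devices" in console_output


if __name__ == "__main__":
    sample = ("1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 "
              "MB/CMP0/CH0/R0/D0 deemed faulty and disabled")
    print(clearing_command(sample))
```

A result of None means that neither pattern matched. It does not prove that the system is free of faults, so the fmadm faulty check in Step 7 still applies.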
To Remove the System Controller
Caution – The system controller can be hot. To avoid injury, handle it carefully.
1. Perform the procedures described in “Common Procedures for Parts Replacement” on page 67.
2. Locate the system controller card.
See Appendix A for an illustration of the server’s FRUs that shows the system controller card.
3. Push down on the ejector levers on each side of the system controller until the card is released from the socket.
FIGURE 4-9 Ejecting and Removing the System Controller Card
4. Grasp the top corners of the card and pull it out of the socket.
5. Place the system controller card on an antistatic mat.
6. Remove the system configuration PROM (FIGURE 4-10) from the system controller and place it on an antistatic mat.
The system controller contains the persistent storage for the host ID and Ethernet MAC addresses of the system, as well as the ALOM configuration including the IP addresses and ALOM user accounts, if configured. This information will be lost unless the system configuration PROM is removed and installed in the replacement system controller. The PROM does not hold the fault data, and this data will no longer be accessible when the system controller is replaced.
FIGURE 4-10 Locating the System Configuration PROM
To Replace the System Controller Board
1. Unpackage the replacement system controller board and place it on an antistatic mat.
2. Install the system configuration PROM that you removed from the faulty system controller board.
The PROM is keyed to ensure proper orientation.
3. Locate the system controller slot on the motherboard assembly.
4. Ensure that the ejector levers are open.
5. Holding the bottom edge of the system controller parallel to its socket, carefully align the system controller so that each of its contacts is centered on a socket pin.
Ensure that the system controller is correctly oriented. A notch along the bottom of the system controller corresponds to a tab on the socket.
6. Push firmly and evenly on both ends of the system controller until it is firmly seated in the socket.
You hear a click when the ejector levers lock into place.
7. Perform the procedures described in “Common Procedures for Finishing Up” on page 103.
To Remove the Motherboard Assembly
Although the CPU board and the I/O board are two distinct boards, they must be removed and replaced as a single motherboard assembly (FIGURE 4-11).
Caution – The flexible cable that connects the motherboard to the I/O board is fragile. Handle these parts very carefully to prevent damage.
Caution – This procedure requires that you handle components that are sensitive to static discharges that can cause the component to fail. To avoid this problem, ensure that you follow antistatic practices as described in “To Perform Electrostatic Discharge (ESD) Prevention Measures” on page 72.
FIGURE 4-11 Motherboard Assembly (callouts: CPU board, I/O board)
1. Perform the procedures described in “Common Procedures for Parts Replacement” on page 67.
2. Remove all cables from the rear of the server.
Ensure that you remove all cables as well as the power cords.
3. Remove any PCI option cards that are installed and then rotate the hold-down brackets so they do not protrude into the chassis.
4. Remove all DIMMs from the motherboard assembly. See “To Remove DIMMs” on page 77.
Make note of the memory configuration so you can reinstall the memory in the replacement board.
5. Remove the system controller board from the motherboard assembly. See “To Remove the System Controller” on page 82.
6. Disconnect cables from the motherboard assembly:
The gray ribbon cable that runs along the left side of the chassis and motherboard.
The cable marked P8 (FIGURE 4-12).
The hard drive data cables. Carefully pull them through the interior wall of the chassis.
The SAS hard drive cables and the cable marked P8 pass through a cutout in the interior wall of the chassis. Before removing the motherboard assembly by lifting it over the interior wall, ensure that these cables are out of the way. The SAS hard drive cables can readily be folded back over the interior wall or passed through the cutout (FIGURE 4-12). However, the cable marked P8 is large and contains a number of small wires, so it will not easily pass through the cutout. While pushing and pulling the cables through the cutout, be careful not to damage the wires.
FIGURE 4-12 Cable Cutout
7. Remove the screws and nylon washers that secure the motherboard assembly to the chassis (FIGURE 4-13).
Caution – Do not remove the screws that hold the flexible cable in place. These screws are installed at the factory and must not be removed.
FIGURE 4-13 Location of the Screws in the Motherboard Assembly (callouts: screws 1 through 10, bus bar screws, flexible cable; do not remove flex cable screws)
8. Slide the motherboard assembly forward to disengage the connectors at the rear of the motherboard assembly from the cutouts in the rear of the chassis.